Collecting data in threaded programs

Threads are separate, independent lines of control executing within a single process. Threads share the same address space but maintain separate execution stacks. Quantify collects performance data as each thread runs and, by default, reports the composite performance data over calls to all functions from all threads.

Quantify works with most popular threads packages. For a list of supported threads packages, see the README file.

Threads and stacks

Quantify maintains separate accumulators for each stack and combines them to form the composite data. To request that Quantify save the per-stack data in separate qv data files, use the option:

% setenv QUANTIFYOPTIONS -save-thread-data=stack,composite

This saves both the composite and the per-stack data. If you specify stack, Quantify writes the dataset for each stack to a separate file. It names each file by appending the expansion of the %T conversion character to the value of the -filename-prefix option, followed by the .qv extension. For example, if the -filename-prefix option is %v.%p.%n, data for each stack is saved to a file named according to the expansion of %v.%p.%n.%T.qv. For more information, see the description of conversion characters.
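The naming rule above can be sketched as a small expansion routine. This is only an illustration, not Quantify's implementation: the function names are hypothetical, and the sample values given to %v, %p, %n, and %T are assumptions chosen to make the result readable.

```python
def expand_prefix(prefix, values):
    """Expand %-conversion characters in a filename prefix.

    `values` maps a conversion character (for example "v") to its
    replacement text. Characters without an entry are left untouched.
    """
    out = []
    i = 0
    while i < len(prefix):
        ch = prefix[i]
        if ch == "%" and i + 1 < len(prefix) and prefix[i + 1] in values:
            out.append(str(values[prefix[i + 1]]))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def per_stack_filename(prefix, values):
    # Per-stack files append %T to the prefix, then the .qv extension.
    return expand_prefix(prefix + ".%T", values) + ".qv"


# Assumed sample values, for illustration only.
example = {"v": "a.out", "p": 1234, "n": 0, "T": 2}
print(per_stack_filename("%v.%p.%n", example))
```

With these assumed values, the per-stack file is named a.out.1234.0.2.qv.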

The data collection options and API functions described in this chapter affect data collection and saving for all threads, not just the thread that is currently executing. Thread-specific versions of the API functions are available. See the <quantifyhome>/quantify_threads.h file.

Typically, many threads reuse a single stack. This can happen when a thread is destroyed and the threads package recycles its stack for a later thread. Because Quantify detects stack creation and stack switches, not thread creation and destruction, the statistics it gathers for a stack reflect all of the threads that used that stack.
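To make the relationship between per-stack and composite data concrete, here is a minimal sketch of accumulators keyed by stack rather than by thread. The class, the function names, and the cycle counts are hypothetical. The sketch only models the bookkeeping described above, not how Quantify stores its data.

```python
from collections import Counter, defaultdict


class StackAccumulators:
    """Per-stack counters that can be summed into composite data."""

    def __init__(self):
        # One Counter of {function name: cycles} per stack.
        self.per_stack = defaultdict(Counter)

    def record(self, stack_id, function, cycles):
        self.per_stack[stack_id][function] += cycles

    def composite(self):
        # Composite data is the sum over all stacks.
        total = Counter()
        for counts in self.per_stack.values():
            total.update(counts)
        return total


acc = StackAccumulators()
acc.record(0, "parse", 100)   # first thread running on stack 0
acc.record(1, "parse", 40)    # a thread running on stack 1
acc.record(0, "render", 25)   # a later thread that reused stack 0

print(dict(acc.per_stack[0]))
print(dict(acc.composite()))
```

The data for stack 0 mixes the work of both threads that ran on it, which is why per-stack data cannot always be read as per-thread data.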

Quantify detects the use of a new stack by monitoring the stack pointer and comparing it against its table of known stack areas. If the stack pointer is not close to any known stack, Quantify assumes that a new stack has been created.

Use the -thread-stack-change option to specify how large, in bytes, a change to the stack pointer must be before Quantify recognizes a new stack. You might need to increase this value for programs that allocate large data structures on the stack using alloca, and decrease this value for programs that create threads with stacks very close to one another.
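The effect of the threshold can be illustrated with a simplified model. The sketch below assumes that each known stack is tracked by its most recently observed stack pointer. That is an assumption for illustration, as are the class name, the addresses, and the threshold values; the source does not state the default threshold.

```python
class StackDetector:
    """Toy model of new-stack detection by stack-pointer distance."""

    def __init__(self, stack_change):
        self.stack_change = stack_change  # threshold in bytes
        self.last_sp = []                 # last seen pointer per known stack

    def observe(self, sp):
        """Return the index of the stack that `sp` is attributed to."""
        for index, known in enumerate(self.last_sp):
            if abs(sp - known) < self.stack_change:
                self.last_sp[index] = sp
                return index
        # Not close to any known stack: treat it as a new one.
        self.last_sp.append(sp)
        return len(self.last_sp) - 1


small = StackDetector(stack_change=0x10000)
print(small.observe(0x7FFF0000))  # first stack
print(small.observe(0x7FFEFF00))  # small move, same stack
print(small.observe(0x7FFCFF00))  # large alloca mistaken for a new stack

large = StackDetector(stack_change=0x40000)
print(large.observe(0x7FFF0000))
print(large.observe(0x7FFCFF00))  # same jump now stays on the first stack
```

In this model, a threshold that is too small splits one stack into two after a large alloca. A threshold that is too large would merge stacks that lie close together.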

Quantify uses mutual exclusion (mutex) locks to access its own internal data structures in a thread-safe way. The exact implementation of the mutex locks depends on the thread library being used. The mutex locks used by Quantify work properly with the scheduling code in your threads library.

By default, Quantify assumes that your application creates no more than 20 stacks during a run. To increase this number, use the -max-threads option.

Solaris lightweight processes and threads

On Solaris, threads can be assigned to different lightweight processes (LWPs), which, in turn, can be assigned by the operating system to different processors if more than one is present.

Quantify, however, cannot easily determine when an LWP has been assigned to a different processor (since it can happen at any time) and hence cannot determine whether two LWPs (and therefore threads) might be running at the same time. Quantify therefore cannot account for true concurrency in such an application, and reports the sum of all counted times as though there is only a single processor. In this regard, Quantify's times are pessimistic for applications running on symmetric multi-processor (SMP) machines.

Since Quantify cannot determine if several LWPs are running simultaneously, it does not time system calls that cause LWP scheduling to occur. These system calls look as though the calling thread blocked for a long time, but in fact the process was doing useful work in another thread in the same process, which Quantify is counting. If these system calls were timed, the time spent in other threads would be double-counted in the elapsed time recorded for the system call.
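A small worked example shows the double counting. The figures are hypothetical: thread A computes for 1 second and then blocks for 5 seconds in an LWP system call while thread B computes for those 5 seconds on the same processor. The function name is also illustrative.

```python
def reported_time(compute_times, blocked_times, time_blocking_calls):
    """Sum per-thread times as a single-processor model would."""
    total = sum(compute_times)
    if time_blocking_calls:
        # Adding the blocked time counts the other thread's work twice.
        total += sum(blocked_times)
    return total


compute = [1.0, 5.0]  # seconds of computation in threads A and B
blocked = [5.0, 0.0]  # seconds A spends blocked while B runs

print(reported_time(compute, blocked, time_blocking_calls=True))
print(reported_time(compute, blocked, time_blocking_calls=False))
```

On one processor the run takes 6 seconds of wall-clock time. Timing the blocking call would report 11 seconds, and leaving it untimed reports 6.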

Quantify assumes that there are several CPUs and, by default, does not time system calls associated with lightweight processes. The default setting is:

-never-record-system-calls=:SYS_sigtimedwait,SYS_lwp_sema_wait,SYS_lwp_create,SYS_lwp_kill,SYS_lwp_mutex_unlock,SYS_lwp_mutex_lock,SYS_lwp_cond_wait,SYS_signotifywait

If your system has only one processor, you can increase the accuracy of Quantify's times by having Quantify time the LWP system calls. You can remove specific system calls from the list specified to -never-record-system-calls, or specify a null string to have Quantify time all of the system calls listed above.

This list is subject to change for different release levels of the Solaris operating system. Consult the installation notes and README for the most recent list.
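As a convenience, the reduced list can be computed rather than edited by hand. The helper below is a hypothetical sketch: it starts from the default names listed above, which may differ on your Solaris release, and returns the comma-separated value to supply to -never-record-system-calls.

```python
# Default names as listed in this section; check the README for your release.
DEFAULT_UNTIMED = [
    "SYS_sigtimedwait", "SYS_lwp_sema_wait", "SYS_lwp_create",
    "SYS_lwp_kill", "SYS_lwp_mutex_unlock", "SYS_lwp_mutex_lock",
    "SYS_lwp_cond_wait", "SYS_signotifywait",
]


def untimed_list(time_these):
    """Return the default list minus the calls that should be timed."""
    wanted = set(time_these)
    remaining = [name for name in DEFAULT_UNTIMED if name not in wanted]
    return ",".join(remaining)


# Time the two mutex calls, keep the rest untimed.
print(untimed_list(["SYS_lwp_mutex_lock", "SYS_lwp_mutex_unlock"]))

# Time every call on the list: the result is the null string.
print(repr(untimed_list(DEFAULT_UNTIMED)))
```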

Analyzing data from threaded programs

The call graph looks different for multi-threaded applications.

With multi-threaded applications, additional functions emanate from .root., corresponding to the unique starting points of the different threads started while the dataset was collected. Some thread packages place their own routines on those stacks, which then call your functions. Other packages arrange for your functions to be called directly. You are also likely to see functions emanating from .root. that reflect the activity on any scheduler and signal-handling threads that the thread library sets up.

Since threaded programs must lock accesses to shared data, you see support functions calling various locking primitives supplied by the threads package. On Solaris, the linked libraries are MT-safe (multiple-thread safe) even in a single-threaded application. This means that they call locking primitives as well, even though those primitives are stubbed out in a single-threaded application. These locking calls increase the complexity of the call graph. You can use the call graph pop-up menu to collapse the subtrees under these locking primitives.
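The shape of such a call graph, and the effect of collapsing a locking primitive, can be pictured with a toy model. Everything in it is hypothetical: the graph, the function names, and the outline routine stand in for what the call graph window shows and are not part of Quantify.

```python
def outline(graph, node, collapsed, path=()):
    """Return an indented outline of the call graph below `node`.

    Nodes in `collapsed` are listed, but their subtrees are hidden.
    """
    lines = ["  " * len(path) + node]
    if node in collapsed or node in path:  # also stop on recursion
        return lines
    for callee in graph.get(node, []):
        lines.extend(outline(graph, callee, collapsed, path + (node,)))
    return lines


# Two thread starting points under .root., both reaching a lock.
graph = {
    ".root.": ["main", "thread_start"],
    "main": ["work"],
    "thread_start": ["work"],
    "work": ["mutex_lock"],
    "mutex_lock": ["lock_internal"],
}

print("\n".join(outline(graph, ".root.", collapsed={"mutex_lock"})))
```

With mutex_lock collapsed, lock_internal no longer appears under either thread's starting point, which leaves the application's own functions easier to follow.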