Hilfsnavigation

RZ-Gebäude
RZ-Logo

Zielgruppennavigation

Inhalt

Performance Tuning - A Short Introduction


Dieter an Mey
Aachen, February 2008

 

High Performance Computing (HPC)

As described in the short introduction into HPC the focus of HPC is to reduce the time to solution of computational problems. The target is an efficient implementation of a suitable algorithm in a suitable programming language leading to an executable program which exploits the suitable machinery efficiently. Today, this usually results in a cache friendly parallel program written in C, C++ or Fortran.

Currently there are  two dominating parallelization paradigms:

  • OpenMP for shared-memory-parallelization on multiprocessor computers (SMP), with all processors accessing a common memory
  • Message Passing with MPI for distributed-memory parallelization for clusters of computers connect via a fast network
As current supercomputers typically consist of a cluster of multiprocessor computers (SMP cluster), MPI and OpenMP can nicely be combined, a strategy which is called hybrid parallelization. Also MPI can be used to run multiple processes on each multiprocessor cluster node, avoiding the effort to combine both parallelization techniques.
 

Performance Tuning Targets

What can be accomplished by parallelization and performance tuning? In an ideal case parallelization would lead to a speed-up which scales linearly with the number of processors employed compared to the original serial program running on a single processor.

  • A serial program is expected to run close to the processor's theoretical peak performance.
  • An OpenMP program running on an SMP with M processors (or processor cores) is expected to run M times faster with M OpenMP threads compared to the serial program running on one processor (core).
  • An MPI program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N*M MPI processes
  • A hybrid (OpenMP + MPI) parallel program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N MPI processes and M OpenMP threads per MPI process (or with for example N*2 MPI processes and M/2 OpenMP threads per MPI process).

What if a program's performance does not meet these expectations? Indead, there are good reasons why these expectations most likely will not be met:

  • While today's processor typically have a very high peak performance, memory performance lags behind and in many cases the processors cannot be fed with data fast enough to maintain a high computational speed. As a rule of thumb, modern processors typically achieve some 5 percent of their peak performance. But it may get worse, if the way data is accessed in memory is very unfortunate. In a lucky case, if a program displays a nice (temporal and spatial) locality and thus is very cache friendly performance may get close to some 20 - 50 percent of theoretical peak performance.
  • Multiprocessor computers, multicore processors and even more multithreaded processor cores share a lot of resources. In an SMP the processors share paths to the common memory, cores in a multicore processor frequently share caches and pins, multithreaded processor cores share virtually everything except for a couple of state registers. All this resource sharing may lead to considerable conflicts limiting the speed-up of parallel programs.
  • In a compute cluster the network between these nodes may be a bottleneck, if the MPI program is very communication intensive. In this case the performance of the network, its latency, bandwidth and topology, will be an important factor.

Thus, performance tuning has 3 major aspects:

  • designing and tuning the sequential program to exploit the single thread or process performance
  • parallelization for shared memory to exploit the node performance
  • parallelization using message passing to exploit the cluster performance

Now, what if a program has been parallelized using OpenMP and/or MPI but it still runs too slow? What if adding resources, increasing the number of OpenMP threads and/or MPI processes even slows down the program?

Parallelization always introduces synchronization and/or communication overhead. Additionally, there may be leftover code regions, which have not been parallelized and thus can easily become a performance bottleneck. As an example, if 50 percent of the runtime has originally been spent in code regions which have been parallelized, the parallel code version can optimally only run twice as fast, no matter how many processors are employed.

Performance tuning tools can be a big help to learn more about the runtime behavior of the program and to detect performance problems.
 

Tuning Sequential Programs

In order to improve the performance of a single thread or process, the hardware architecture of a processor has to be taken into account. Over the last decades the processor speed has been improved considerably, whereas memory has primarily increased its capacity but not the access time (latency). As a consequence, processors frequently have to wait for data coming from memory and in many cases cannot run at full speed. In order to overcome this bottleneck, chip designers have added fast but small buffers (caches) early on. These caches work transparently to the user and can speed up programs considerably. Still many compute intense applications only run at some 5 percent of the theoretically peak performance of modern processors, leaving considerable headroom for tuning efforts. The advent of multicore processors aggravates this memory bottleneck furthermore.

In order to improve the cache performance it is beneficial to reuse data residing in a cache in a timely manner (temporal locality). As caches are organized in so-called “cache lines” of some 32 or 64 bytes and data is transported between memory and processors in chunks of cache lines, it is profitable to really use all of the data of such a chunk once it resides in a cache (spatial locality).

Hardware Counters can provide statistical information about the distribution and frequency of cache misses or the time it takes to fetch data from memory for certain parts of a program. The rate of executed processor instructions and particularly the rate of executed floating point instructions can be measured, which is a suitable measure for program performance.
 

Tuning OpenMP Programs

As all threads of an OpenMP program access a common memory, tuning the single thread’s memory performance is essential for improving the performance of an OpenMP program as well.
A nasty aspect of bad performance of OpenMP programs which is particularly hard to detect is called “false sharing”. Because of the way caches are organized, cache lines may be forced to travel between the caches if different threads update different parts of the same cache line.
Adding OpenMP constructs also introduces administrative overhead to the original program. If this overhead is too large compared to the amount of parallelized work which is distributed to multiple threads, then using OpenMP may be counterproductive and will slow down the application.
Furthermore, critical regions may have to be introduced into the program in order to update shared memory locations by multiple threads and avoid data races. These critical regions serialize the program flow and may seriously decrease performance.
Load imbalances may cause threads to idle or wait at synchronization points (barriers). OpenMP provides easy-to-use loop scheduling policies to overcome these problems in many cases, which may in turn insert additional overhead.
These handicaps are examples of performance bottlenecks which may need to be detected and resolved to obtain a reasonable performance gain.
Tuning tools provide statistics and visual representations of the program execution revealing such performance obstacles.
 

Tuning MPI Programs

Again, single processor (core) performance is the sound foundation for parallel performance. (Ironically, MPI speedup looks better, if the performance of the single MPI process is worse, because then the granularity of the work chunks is automatically larger.) The major source of inefficiency of MPI programs is communication overhead. How much time is spent in sending messages compared to useful calculation? Often there is an opportunity of hiding communication by overlapping communication and computation, which is well supported by MPI.
Global communication involving many or all MPI processes - for example calculating reductions, residuals, or error estimations - may include costly synchronizations. Sometimes the agglomeration of such reductions can reduce the overhead.
Load imbalances may easily become a serious performance problem. Tuning tools providing statistical and visual information about the program execution can easily detect these problems, but solving them may require a lot of program changes.
 

Performance Tuning Procedure

Typically the program is instrumented, which means it is populated with measuring instruments (which may in turn have a negative impact on performance) to collect information about the program execution. The program is then executed with a reasonable, carefully selected set of input data to make it behave representatively but still not take too much time. The instruments probably may have to be calibrated, adjusted, and parameterized to collect a usefull set of performance data which then is post-processed and presented to the user to provide a deeper insight in the program’s behavior and the cause of bad performance.
A major challenge for performance analysis is the amount of data which can easily be collected. Therefore it is important to decide which information is really of interest and how superfluous data can be filtered out.
 

The application of tuning tools consists of 3 major steps:

  • Instrumentation or modification of the program to generate performance data
  • Collecting runtime information
  • Evaluation, analysis and presentation of the collected data (This typically happens after the program run, but some tools also present performance data dynamically at runtime).
What kind of performance information can be collected?
  • timing information (CPU-time and/or real-time) on a statement, loop, routine, or program level,
  • hardware performance counter information (cache hits and misses, various kinds of instruction counters, memory access counters, network accesses etc.),
  • number of executions on a per instruction, per loop, or per routine basis,
  • specific wait times when threads or processes are synchronized, waiting for messages etc.,
  • program counter (PC) and sometimes callstack information to relate performance data to the program location at the analysis phase.

Program Instrumentation

Instrumentation of the program can be done on various levels:

  • On the source code level,

    • the programmer or a source-to-source pre-processor can add statements to inquire and record timing or hardware counter information and call library routines belonging to some measurement system in order to collect additional application specific information.

    • The compiler can instrument the code, if the programmer adds suitable compiler options.

  • At link time profiling versions of libraries can be used, or routines can be wrapped by adapter routines collecting performance information.

  • The executable can be enhanced through binary instrumentation.

  • The binary program can be executed under the control of some measuring tool, as simple as a timer and as complex as a tracing utility which has access to a comprehensive dynamic tracing framework provided by the operating system (like Solaris dtrace)

Collecting runtime information

How can performance information be collected? Two different methods can be distinguished:

  • Profiling updates summary statistics of execution when an event occurs. It uses the occurrence of an event to keep track of statistics of performance metrics. These statistics are maintained at runtime as the process executes and are stored in profile data files when the program terminates.
    In profiling based on sampling, a hardware interval timer periodically interrupts the execution. Such interrupts can as well be triggered by performance counters, that measure events in hardware.

  • Tracing , on the other hand, records a detailed log of timestamped events and their attributes. It reveals the temporal aspect of the execution. This allows the user to see when and where routine transitions, communication and user defined events take place.

 

The program’s image file and symbol table cam be used after the program run to calculate the time spent in each routine as the number of samples in a routine’s code range times the sampling period. Instead of the program counter, the callstack of routines can be sampled, too. In this case, a table that maintains the time spent exclusively and inclusively (of called child routines) for each routine, is updated, when an interrupt occurs. Then, the exclusive time of the currently executing routine and the inclusive time of the  other routines that are on the callstack are incremented by the interrupt time interval. Additionally,each time a parent calls a child function, a counter for that parentchild pair can be incremented to provide call-graph information and frequency of routine evokations.

 

Evaluation, analysis and presentation of the runtime information and performance properties

The output of performance tools can be as simple as a textual summary of a small amount of runtime statistics and as overwhelming as a huge amount of data to be presented by graphical user interfaces. These user interfaces provide different views on the data including timeline visualization revealing the behavior of the application over time.

The purpose of a performance tool is not to overwhelm the user with the sheer amount of data but to pinpoint the most important performance problems.

Sometimes employing a tool presenting only a few lines of text can be more economical, as it can be applied more frequently in scripts and even in batch mode. On the other hand, the costly presentation of the time flow of the application can be a very helpful for the programmer to give a feeling for what is going on at runtime and may even lead to the detection of program errors.

Outlook

Today, performance analysis of "regular" parallel programs is well understood. But analyzing "irregular" parallel programs applying nested, hybrid or even recursive parallelization, changing the number processes or threads on-the-fly is much harder and the subject of active research. On the other hand, the number processor cores employed in large applications is constantly increasing leading to enormous amounts of performance data, causing problems when storing, retrieving, analyzing and presenting them in an adequate and efficient way.

 

References

  1. Sameer Shende
    Profiling and Tracing in Linux
    http://www.cs.uoregon.edu/research/paraducks/papers/linux99.pdf
  2. Sun Studio Performance Analyzer
    http://developers.sun.com/sunstudio/overview/topics/analyzer_index.html
    http://developers.sun.com/solaris/articles/perftools.html
  3. Bernd Mohr
    Performance Analysis of MPI Programs with Vampir and Vampirtrace (pdf)

Abschlußinformationen