To evaluate the suitability of the UltraSPARC T2 processor for high-performance computing, we selected a set of application codes that reflect the variety of programs typically executed on our compute cluster, together with the SPEC OMP benchmark.
At the Laboratory for Machine Tools and Production Engineering of RWTH Aachen University, the contact of bevel gears is simulated and analyzed, for example to understand the deterioration of differential gears as used in car gearboxes. With the original serial code, these simulations usually run for a few days.
The program was parallelized using OpenMP, and it turned out to scale quite well on multicore architectures because it is very cache friendly. The parallel code version consists of some 90,000 lines of Fortran 90 code containing 5 parallel OpenMP regions and 70 OpenMP directives.
Although the parallel speedup on the UltraSPARC T2 based Sun T5120 is over 15, it cannot catch up with the other machines.
In a project sponsored by the German Research Council (DFG), scientists of the Laboratory of Mechanics of RWTH Aachen University simulated PHOENIX, a small scale prototype of the Space Hopper, a space launch vehicle designed to take off horizontally and glide back to earth after placing its cargo in orbit. The corresponding Navier-Stokes Equations are solved on a block structured grid with FLOWer, a flow solver developed at the German Aerospace Center (DLR).
FLOWer is parallelized with MPI. In addition, many loop nests can be parallelized automatically by the Fortran compiler. Thus, on each platform the question arises: what is the optimal combination of the number of MPI processes and the number of threads per process? The following table compares the runtime for 10 iterations on the SFE2900 and the ST5x20 ("Niagara 2") for various combinations of process and thread counts.
When the optimal combination is chosen on each machine, the ST5x20 outperforms the 24-core SFE2900 by a factor of 1.27.
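
The search space behind that question is small enough to enumerate. The following minimal sketch lists every way to split a fixed number of hardware threads into MPI processes and threads per process. The function name `candidate_splits` is hypothetical and not part of FLOWer; the value 64 is the number of hardware threads of one UltraSPARC T2 (8 cores with 8 threads each). Each pair would still have to be timed to find the best one.

```python
def candidate_splits(total_threads):
    """Return all (mpi_processes, threads_per_process) pairs
    whose product equals total_threads."""
    return [
        (processes, total_threads // processes)
        for processes in range(1, total_threads + 1)
        if total_threads % processes == 0
    ]


if __name__ == "__main__":
    # 64 hardware threads: one UltraSPARC T2 chip (8 cores x 8 threads).
    for processes, threads in candidate_splits(64):
        print(f"{processes:2d} MPI processes x {threads:2d} threads per process")
```

Restricting the sweep to exact divisors keeps every hardware thread busy; configurations that leave threads idle could be added by relaxing the divisibility test.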
NestedCP is written in C++ and computes critical points in multi-block CFD datasets using a highly adaptive algorithm, which profits from the flexibility of OpenMP to adjust the thread count on all three parallel levels and to specify loop schedules on each of them. In order to interactively analyze results of large-scale flow simulations in a virtual environment, different features are extracted from the raw output data and visualized. One feature that helps describe the topology is the set of critical points, where the velocity is zero.
The code scales very well on the SFE6900 and the SFV40z. On the SFT5x20 the speedup levels off at 16 or more threads.
The Navier-Stokes solver TFS, developed by the Institute of Aerodynamics of RWTH Aachen University, is currently used in a multidisciplinary project to simulate the air flow through the human nose. TFS uses a multi-block structured grid with general curvilinear coordinates. OpenMP is employed on the block level and also on the loop level. This application puts a high load on the memory system and is thus quite sensitive to ccNUMA effects.
The optimal number of threads for each of the parallelization levels and the optimal strategy for distributing the work to the threads differ between the platforms. The following table contains the best results obtained on several machines. As the code was originally developed for vector computers, it still performs quite well on the NEC SX-8 (thanks to the HLRS for granting access to the machine).
| Machine | Serial runtime [s] | #total threads | #threads on block level | #threads on loop level | Parallel runtime [s] | Speed-up | Efficiency [%] | Remark |
|---|---|---|---|---|---|---|---|---|
| SFE25K 72 US IV | 342 | 32 | 8 | 4 | 20 | 17 | 53 | Sorted blocks |
| SFE25K | 342 | 64 | 8 | 8 | 18 | 20 | 31 | Sorted blocks, random placement |
| SFE25K | 342 | 128 | 16 | Balanced | 14 | 25 | 39 | Thread affinity |
| SFE25K | 342 | 128 | 16 | 8 | 13 | 27 | 21 | Thread affinity Sorted blocks |
| SFE6900 | 312 | 48 | 8 | 6 | 19 | 16 | 33 | Sorted blocks |
| SFV40z | 148 | 8 | 8 | 1 | 26 | 5.6 | 70 | Block groups, binding, migration |
| NEC SX-8 | 15.7 | 8 | 8 | vector | 5.8 | 2.7 | 34 | Dynamic schedule |
| #total threads | #threads on block level | #threads on loop level | UltraSPARC T2 |
|---|---|---|---|
| 1 | | | 475.7 |
| 8 | 2 | 4 | 62.0 |
| 8 | 4 | 2 | 61.9 |
| 8 | 8 | 1 | 63.8 |
| 16 | 4 | 4 | 34.7 |
| 16 | 8 | 2 | 35.2 |
| 32 | 4 | 8 | 26.2 |
| 32 | 8 | 4 | 26.7 |
| 64 | 4 | 16 | 24.4 |
| 64 | 8 | 8 | 25.0 |
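
The scaling behind the UltraSPARC T2 table above can be made explicit with the usual definitions, speed-up = serial runtime / parallel runtime and efficiency = speed-up / number of threads. The sketch below is only an illustration: it assumes that the table entries are runtimes in seconds (the table does not state the unit), and it keeps only the faster block/loop split for each total thread count.

```python
# Runtimes copied from the UltraSPARC T2 table; the unit is assumed to be seconds.
# For each total thread count only the faster block/loop split is kept.
SERIAL_RUNTIME = 475.7
BEST_RUNTIME = {8: 61.9, 16: 34.7, 32: 26.2, 64: 24.4}


def speedup(serial, parallel):
    """Speed-up relative to the serial run."""
    return serial / parallel


def efficiency(serial, parallel, threads):
    """Parallel efficiency as a fraction between 0 and 1."""
    return speedup(serial, parallel) / threads


if __name__ == "__main__":
    for threads, runtime in BEST_RUNTIME.items():
        s = speedup(SERIAL_RUNTIME, runtime)
        e = efficiency(SERIAL_RUNTIME, runtime, threads)
        print(f"{threads:3d} threads: speed-up {s:5.1f}, efficiency {100 * e:5.1f} %")
```

Under these assumptions the speed-up grows only from about 18 to about 19.5 when going from 32 to 64 threads, which is why the efficiency drops sharply once more than four threads per core are in use.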
For benchmarking we also use a different version of the TFS code, which has been parallelized with MPI on the block level and with OpenMP on the loop level. When only the loop-level parallelization is activated, the code scales only to a modest number of threads, as can be seen in the next graph. Thus the UltraSPARC T2 cannot catch up with the other machines when a high number of threads is employed.
When MPI and OpenMP are both employed, there is enough scalability for the UltraSPARC T2 to catch up. Only the SFE6900 performs better with 16 or more threads.
With a larger dataset the UltraSPARC T2 turns out to perform better than the SFE2900 in both cases: if only loop-level parallelization is activated the factor is 1.04; if MPI and OpenMP are both employed the factor is 1.36.
The SPEC benchmarks are very popular for comparing machines. The manufacturers put an extremely high effort into presenting optimal results for their machines. We took the OMP2001 base benchmark suite and tried to behave just like a normal user would: turn on a reasonable set of compiler flags and then let it run, with no profile-feedback optimization, no experiments with all kinds of well-hidden compiler options, and no special settings of system tunables. On the SFE2900 the difference between our performance results and the manufacturer's is considerable: we are a factor of 1.6 away from the optimum!
Still, we think that it is a reasonable approach to compare machines with a "standard setting". Comparing an SFE2900 with an ST5x20 beta machine in this way, the latter wins by a factor of 1.1.
| Threads | Sun Fire E2900 | UltraSPARC T2 |
|---|---|---|
| 64 | | 9605 |
| 32 | | 8783 |
| 24 | 8675 | |
| 16 | 7874 | 6261 |
| 12 | 7748 | |
| 8 | 6005 | 4063 |
| 4 | 3501 | 2174 |
| 2 | 1924 | |
| 1 | 1048 | |
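
The factor of 1.1 can be recomputed from the scores in the table. The sketch below is a worked example rather than part of the benchmark setup; it assumes that the 32- and 64-thread scores belong to the UltraSPARC T2 column and the 24-thread score to the Sun Fire E2900 column, that higher scores are better, and it lists only the thread counts needed for the comparison.

```python
# Scores copied from the table; None marks a thread count without a measurement.
# Column assignment of the single-value rows is an assumption (see lead-in).
SCORES = {
    # threads: (Sun Fire E2900, UltraSPARC T2)
    4: (3501, 2174),
    8: (6005, 4063),
    16: (7874, 6261),
    24: (8675, None),
    64: (None, 9605),
}
E2900, T2 = 0, 1


def best(column):
    """Highest score a machine reaches over all measured thread counts."""
    return max(row[column] for row in SCORES.values() if row[column] is not None)


def ratio_at(threads):
    """UltraSPARC T2 score divided by E2900 score at the same thread count."""
    e2900_score, t2_score = SCORES[threads]
    return t2_score / e2900_score


if __name__ == "__main__":
    print(f"best T2 / best E2900: {best(T2) / best(E2900):.2f}")
    for threads in (4, 8, 16):
        print(f"{threads:2d} threads: T2 / E2900 = {ratio_at(threads):.2f}")
```

Read this way, the E2900 is ahead at every thread count both machines share, and the UltraSPARC T2 wins the overall comparison only because it can run up to 64 threads.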
XNS is a finite element flow solver used for simulating the blood damage and clotting caused by blood pumps. The code is developed by the Chair for Computational Analysis of Technical Systems (CATS). It has been very well parallelized with MPI and scales to thousands of processors on the IBM Blue Gene/L for large datasets. (Thanks to the Research Center in Jülich for granting access to their BG/L machine.)
Here we only look at a small test case. The UltraSPARC T2 scales well up to 16 processes, and then scalability levels off. The Blue Gene/L processor performs similarly to an UltraSPARC IV processor but scales a little better, presumably because the SFE6900 runs into a memory bandwidth shortage with many processes. 16 processes on one UltraSPARC T2 chip perform like 8 Blue Gene/L processes on 4 chips. It would be nice to try this code on several UltraSPARC T2 machines connected by a fast network.