The Intel-based Dell PowerEdge 1950 systems deliver the highest single-thread performance because of their high clock rate and their large on-chip caches. However, the memory system is a bottleneck that severely limits the scalability of memory-intensive programs, and unfortunately quite a few HPC applications belong to this category.
The single-thread performance of the Opteron-based Sun Fire V40z is slightly lower, mainly because of the lower clock rate. The smaller cache may also have an impact in some cases. The main characteristic of this machine is its ccNUMA architecture. On the one hand it delivers a higher memory bandwidth, but on the other hand ccNUMA effects can easily spoil the scalability if the programmer does not take care of proper memory placement and the operating system does not support thread and data affinity optimally (see "Affinity Matters" (pdf), a presentation given at the recent Parco conference).
The UltraSPARC IV based Sun Fire E2900 and E6900 midrange servers offer a flat memory architecture and thus good scalability. The memory bandwidth is limited by the snooping protocol, although the large caches help a lot in many cases. As the machines are getting old, their power consumption and their space and cooling requirements are quite high, which is a main reason why we aim to replace them soon at our site.
Sun's new UltraSPARC T2 processor is fully compatible with the preceding Sun Fire midrange servers, and thus the T5x20 servers are good candidates for replacing them. They exhibit the lowest power consumption but also the lowest single-thread performance. Parallel applications will scale well on the flat memory, so employing many threads efficiently is easy. However, all threads have to share a single 4 MB L2 cache, so L2 cache misses will be frequent. On the other hand, the memory bandwidth provided by the 4 on-chip memory controllers is excellent.
It turns out that the UltraSPARC T2 processor is ideally suited for scalable applications that consume a high memory bandwidth; former vector codes fall into this category. This has been shown by our sparse matrix-vector multiplication case study, which is a compute-intensive kernel of many PDE solvers, and by some of our application benchmarks. As a consequence, there is an important class of applications for which the performance-per-power ratio is clearly best on the UltraSPARC T2 processor. On the other hand, applications that are very cache friendly profit from the high clock rate and the large on-chip caches of the recent Intel processors. If throughput is the ultimate goal, the UltraSPARC T2 processor is again very competitive, as has already been demonstrated by Sun's impressive SPECint_rate2006 and SPECfp_rate2006 measurements.
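To illustrate why a sparse matrix-vector multiplication stresses the memory system, here is a minimal Python sketch of the kernel in compressed sparse row (CSR) storage, together with a rough estimate of the memory traffic per floating-point operation. It is not the code used in our case study. The function names `csr_spmv` and `bytes_per_flop`, the CSR layout, the 8-byte values and 4-byte indices, and the assumption of a square matrix whose vectors are each transferred exactly once are all illustrative choices.

```python
from typing import List, Sequence


def csr_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A * x for a matrix A stored in CSR format.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Each nonzero is loaded once and used for one multiply and one add,
        # so there is almost no reuse of matrix data from cache.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def bytes_per_flop(n_rows: int, nnz: int,
                   value_bytes: int = 8, index_bytes: int = 4) -> float:
    """Rough traffic estimate for one SpMV on a square matrix.

    Assumption (hypothetical, optimistic): x and y each move through the
    memory system exactly once, i.e. accesses to x hit in cache after the
    first touch.
    """
    traffic = (nnz * (value_bytes + index_bytes)   # values and column indices
               + (n_rows + 1) * index_bytes        # row pointers
               + 2 * n_rows * value_bytes)         # read x, write y
    flops = 2 * nnz                                # one multiply + one add
    return traffic / flops


# Small 3x3 example:
# [4 0 1]
# [0 3 0]
# [2 0 5]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Under these assumptions every floating-point operation needs at least 6 bytes of matrix data (12 bytes per nonzero for 2 operations), which is why such a kernel tends to be limited by memory bandwidth rather than by clock rate.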
The Opteron and Xeon based machines in this comparison all had multiple sockets and were thus superior in performance for a couple of applications. When looking at the per-chip performance, the UltraSPARC T2 processor does quite well in many cases. It will be very interesting to see how Sun's upcoming multi-chip servers will perform in the near future.