The first graph depicts the performance of the sMxV kernel for the small matrix on different machines in MFlop/s over the number of OpenMP threads.
The small, cache-friendly memory footprint gives rise to a rather high absolute performance of well over 2 GFlop/s on the DPE1905W with 4 threads, on the SFV40z with 8 threads, and on the SFE2900 with 16 threads. The ST5x20 scales well up to 32 threads but reaches only 1.345 GFlop/s, because all threads have to share its single 4 MB L2 cache.
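For orientation, the kind of kernel and metric discussed here can be sketched as follows. This is a minimal illustration, not the benchmarked code: the CSR storage layout, the function names `smxv_csr` and `smxv_mflops`, and the convention of counting two floating-point operations per nonzero are assumptions made for this example.

```c
#include <assert.h>

/* y = A * x for a sparse matrix A stored in CSR format (assumed layout):
 * row_ptr has n_rows + 1 entries; col_idx and val each have
 * row_ptr[n_rows] entries, one per nonzero. */
void smxv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    assert(n_rows >= 0);

    /* Rows are independent, so the outer loop can be shared among
     * the OpenMP threads; each thread writes a disjoint part of y. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* MFlop/s under the assumed convention of one multiply and one add
 * per nonzero and per kernel invocation. */
double smxv_mflops(long nnz, int repetitions, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)nnz * (double)repetitions / (seconds * 1.0e6);
}
```

When compiled without OpenMP support the pragma is ignored and the kernel runs serially, which is convenient for checking results. With OpenMP enabled, the thread count on the horizontal axis of such a graph would typically be set through the `OMP_NUM_THREADS` environment variable.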
The next graph depicts the performance of the sMxV kernel for the large matrix, which is much more relevant for production use.
Here the caches of all machines are too small to hold a reasonable portion of the matrix, and memory bandwidth becomes the limiting factor. The SFV40z now outperforms the DPE1905W because of its superior memory bandwidth and because the code takes care of proper data placement. Both machines clearly outperform the SFE2900, even when the latter employs more threads.
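The text does not say how the data placement is achieved. One common technique on systems with a first-touch page placement policy is to initialize each array in parallel, using the same static loop schedule as the compute kernel, so that each page ends up in memory close to the thread that later works on it. The sketch below illustrates that idea only; the helper name `alloc_first_touch` is hypothetical, and the assumption that the benchmarked code works this way is not confirmed by the source.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and zero them in parallel. Under a first-touch
 * policy (an assumption about the operating system), each page is
 * mapped near the thread that writes it first. Using schedule(static)
 * here and in the compute loop gives each thread the same index range
 * in both places. Returns NULL if the allocation fails. */
double *alloc_first_touch(size_t n)
{
    assert(n > 0);

    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        v[i] = 0.0;

    return v;
}
```

The design choice that matters is the matching schedule: if the initialization loop were run serially, all pages would be touched by one thread and, under the same assumption, would end up on a single memory node regardless of how the kernel is parallelized.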
Here the single UltraSPARC T2 chip clearly wins. The program scales extremely well up to 64 threads, reaching 2.56 GFlop/s. Surprisingly, when the machine is overloaded with up to 112 threads, the performance increases even further, to 3.216 GFlop/s.