23 Providence Rd
Westford, MA 01886
(978)828-0944
 Introduction
Polybus Systems develops custom FPGA designs and IP for FPGAs and ASICs. All of our work is on Linux systems using Verilog simulators, primarily NCSim from Cadence, and Xilinx's FPGA development tools. We own a range of systems and we were curious as to their relative performance in CAE applications. The benchmarks on most hardware sites are almost entirely oriented towards Windows games and as such aren't very useful to anyone using Linux or to anyone using their system to do real work. This page compares a number of systems doing various common CAE tasks based on actual designs.
 Systems
We tested four systems: two servers, a desktop and a laptop. The baseline system is Yorktown, a dual 1GHz PIII server based on an Intel motherboard. The second system is Enterprise, a dual 2.66GHz Xeon server based on a Supermicro motherboard. Enterprise was tested with hyperthreading both disabled and enabled. The third system is Ranger, a Compaq R3000z laptop with a 754-pin Athlon 64 3400+. The final system is Wasp, a Compaq GX5050 desktop with a 939-pin Athlon 64 3800+ on an MSI K8N Neo2 Platinum motherboard. The MSI motherboard provides considerable flexibility in its BIOS, which allowed us to run Wasp's memory system at 266MHz, 333MHz and 400MHz. Linux also allows us to explicitly set the clock rate of an Athlon 64 system. We ran Wasp at both 2.4GHz (the 3800+ speed) and at 2.2GHz (the clock speed of both the 3500+ and the 3400+).
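The text does not say which mechanism was used to set the clock rate. One common route on Linux is the cpufreq sysfs interface, and the following is a minimal sketch under that assumption. The function name `set_cpu_khz` and its parameters are ours, not part of any tool; the sysfs file names (`scaling_available_frequencies`, `scaling_governor`, `scaling_setspeed`) are the standard cpufreq ones, and frequencies are given in kHz. Writing to the real sysfs tree requires root and a kernel with the `userspace` governor available.

```python
from pathlib import Path


def set_cpu_khz(khz, cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Pin one CPU to a fixed frequency using the cpufreq 'userspace' governor.

    sysfs_root is a parameter only so the function can be pointed at a
    scratch directory for testing; that is our assumption, not a convention.
    """
    cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"

    # Only frequencies the driver advertises can be selected.
    available = (cpufreq / "scaling_available_frequencies").read_text().split()
    if str(khz) not in available:
        raise ValueError(f"{khz} kHz not offered; choices: {available}")

    # The 'userspace' governor is the one that honours scaling_setspeed.
    (cpufreq / "scaling_governor").write_text("userspace")
    (cpufreq / "scaling_setspeed").write_text(str(khz))

    # Read the value back so the caller can confirm what was accepted.
    return int((cpufreq / "scaling_setspeed").read_text())
```

Under these assumptions, running a 2.4GHz part at 2.2GHz would be `set_cpu_khz(2200000)`, executed as root.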
The two servers use 7200RPM ATA 100 drives, the laptop has a 4400RPM ATA drive and the GX5050 desktop a 7200RPM SATA drive.
All four systems were running Fedora Core 3 with a 2.6.10 kernel. The Athlon 64 systems were run on both 32 and 64 bit kernels. Not surprisingly there wasn't any significant difference in performance between the 32 and 64 bit modes because the applications tested here were all 32 bit.
| Machine    | Processor | Cache (KB) | # CPUs | Kernel bits | Memory   | Clock (MHz) | Hyperthreading |
|------------|-----------|------------|--------|-------------|----------|-------------|----------------|
| YORKTOWN   | PIII      | 256        | 2      | 32          | 2xSDR133 | 1000        |                |
| ENTERPRISE | Xeon      | 512        | 2      | 32          | 2xDDR266 | 2600        | On             |
| ENTERPRISE | Xeon      | 512        | 2      | 32          | 2xDDR266 | 2600        | Off            |
| RANGER     | A64-754   | 1024       | 1      | 64          | 1xDDR333 | 2200        |                |
| WASP       | A64-939   | 512        | 1      | 32          | 2xDDR400 | 2400        |                |
| WASP       | A64-939   | 512        | 1      | 64          | 2xDDR266 | 2400        |                |
| WASP       | A64-939   | 512        | 1      | 64          | 2xDDR333 | 2400        |                |
| WASP       | A64-939   | 512        | 1      | 64          | 2xDDR400 | 2400        |                |
| WASP       | A64-939   | 512        | 1      | 64          | 2xDDR400 | 2200        |                |
 Tests
The test suite was based on real-world tasks common to FPGA and ASIC design: C compilation, Verilog simulation, and Xilinx synthesis, map, place and route. We used Polybus's InfiniBand cores and testbenches for all of the simulation and build tests.
The simulation tests were based on both Polybus's InfiniBand system testbench for the Link Layer core and the unit testbench for a complete Target Channel Adapter based on Polybus's cores.
The simulator used was Cadence's NCSim version 5.3. Version 5.4, which is somewhat faster than 5.3, was also tested but the results are not presented here as the relative performance of the various systems is not dependent on the simulator revision.
The Xilinx tools used were the Linux native 6.3SP3 toolset.
 GCC make -j 1
The GCC make tests time a make of HDLmaker, a free structural Verilog generation tool available from us at http://www.polybus.com/hdlmaker/users_guide/. The makes were run single threaded (-j 1), dual threaded (-j 2) and quad threaded (-j 4). The 2.66GHz Xeons are a little more than twice as fast as the 1GHz PIIIs when running single threaded. The Athlon 64 systems, both the 3400+ and the 3800+, are approximately 3.5 times the speed of the PIIIs. When running dual threaded, the speedup on the dual processor Xeon system is less than 2x, so the dual Xeon's overall speed is approximately equal to that of a single Athlon 64 running dual threaded. When doing a make -j 4, the hyperthreading on the Xeons helps a little, but the gain is hardly significant.
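For readers who want to repeat this kind of measurement, the following is a minimal sketch of how the user, system and real times of a build could be collected. It is not the script used for the results here. The function names, the `src_dir` argument and the presence of a `clean` target in the makefile are all assumptions made for illustration.

```python
import os
import subprocess
import time


def time_command(argv, cwd=None):
    """Run argv to completion and return its real, user and system time in seconds."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    real = time.monotonic() - start
    after = os.times()
    # children_user/children_system accumulate CPU time of waited-for children.
    return {
        "real": real,
        "user": after.children_user - before.children_user,
        "system": after.children_system - before.children_system,
    }


def run_make_series(src_dir, jobs=(1, 2, 4)):
    """Time `make -j N` for each N, starting from a clean tree each time.

    Assumes the makefile in src_dir provides a `clean` target (hypothetical).
    """
    results = {}
    for n in jobs:
        subprocess.run(["make", "clean"], cwd=src_dir, check=True,
                       stdout=subprocess.DEVNULL)
        results[n] = time_command(["make", "-j", str(n)], cwd=src_dir)
    return results
```

Calling `run_make_series("hdlmaker")` on a checked-out source tree (the directory name is a placeholder) would give one timing record per job count.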
In all of the charts the performance is relative to the 1GHz PIII system. There are three bars for each machine, representing the time results for user, system and real time.
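As an illustration of how such relative figures can be derived, the sketch below divides the baseline time by each system's time, so a value above 1 means faster than the baseline. That this is exactly how the charts were computed is our assumption, and the timings in the example are placeholders, not measurements from these tests.

```python
def relative_performance(baseline, system):
    """Speed of `system` relative to `baseline` for each time category.

    Both arguments map "user", "system" and "real" to seconds.
    A result above 1.0 means `system` finished faster than `baseline`.
    """
    return {k: baseline[k] / system[k] for k in ("user", "system", "real")}


# Placeholder timings in seconds, NOT results from the benchmarks on this page.
baseline = {"user": 100.0, "system": 10.0, "real": 120.0}
candidate = {"user": 40.0, "system": 5.0, "real": 48.0}

print(relative_performance(baseline, candidate))
# {'user': 2.5, 'system': 2.0, 'real': 2.5}
```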
 GCC make -j 2
 GCC make -j 4
 XST
The Xilinx tool tests synthesize, map, place and route the Polybus InfiniBand Target Channel Adapter, which consists of the InfiniBand Link Layer and InfiniBand Transport Layer cores. The target device is an XC2VP20-6, which is 60% utilized. XST performs best on the 3800+.
 PAR
Xilinx Map, Place and Route performance offers no surprises: the fastest system was the Athlon 64 3800+ desktop running with DDR400 memory. The performance of the 3400+ and the 3500+ is nearly identical, indicating that for PAR a 1M cache with a single memory channel matches a 1/2M cache with a dual channel memory system.
 NCSim with no recordvars and no $display statements.
In this benchmark NCVerilog was run with no recordvars and only a Pass/Fail $display statement at the end, resulting in virtually no I/O. This purely CPU bound test yields a surprising result: the Athlon 64 3400+ is much faster than any of the other processors. In user time it is nearly twice as fast as the Athlon 64 3800+. Even though the 3800+ has a higher clock rate (2.4GHz vs 2.2GHz) and a much faster main memory system (dual DDR400 vs single DDR333), the larger cache of the 3400+ (1M vs 512K for the 3800+) is the dominating factor. This may be peculiar to NCSim; Cadence may have taken special care to make good use of caches. It's also possible that there is some structure that thrashes in a small cache. In any event it's clear that for simulation purposes having at least a megabyte of cache is very important.
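To put the surprise in numbers, the sketch below works out how much of an advantage the 3800+ has on paper, using only the clock rates and memory configurations quoted above. The helper name is ours, and the bandwidth figure assumes the nominal DDR transfer rate and a 64-bit (8 byte) wide channel; these are theoretical peaks, not measurements.

```python
def peak_bandwidth_mb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in MB/s (assumes 64-bit channels)."""
    return channels * mega_transfers_per_s * bytes_per_transfer


# 3800+ (Wasp): 2.4GHz, dual DDR400.  3400+ (Ranger): 2.2GHz, single DDR333.
clock_ratio = 2400 / 2200
bandwidth_ratio = peak_bandwidth_mb_s(2, 400) / peak_bandwidth_mb_s(1, 333)

print(f"clock ratio: {clock_ratio:.2f}")                 # clock ratio: 1.09
print(f"peak bandwidth ratio: {bandwidth_ratio:.1f}")    # peak bandwidth ratio: 2.4
```

On paper the 3800+ leads by about 9% in clock rate and about 2.4x in peak memory bandwidth, which is what makes the 3400+'s near 2x lead in user time stand out.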
 NCSim with Recordvars
This benchmark ran NCVerilog using recordvars to dump the state of all of the nodes in the design. The resulting .trn file is 1.7GBytes. In this test the A64 3400+ laptop and the A64 3800+ desktop perform nearly identically; clearly the much faster disk on the desktop system allows it to compensate for its poorer CPU performance.
 NCSim, regression
The regression test is the cumulative time to run 50 tests from the Polybus InfiniBand Link Layer core testbench. Recordvars was turned off; however, the tests make substantial use of $display statements, dumping log files that consume a total of 1.38GBytes of space. The highest performance system in this test is the 3400+ laptop, although due to the large amount of I/O the 3800+ desktop performs nearly as well.
 Conclusion
The bottom line is that cache matters. The 754-pin Athlon 64 3400+ with 1M of cache is the bargain of the century. The laptop 3400+ system matched or outperformed every other system in this group in the simulation tests, even though it has a pathetic 4400RPM laptop drive and only a single DDR333 memory channel. In purely CPU bound simulations the 3400+ was almost twice as fast as the 3800+, even though the 3800+ has a faster clock and a much faster main memory system. For CAE work the two best choices would be a 754-pin Athlon 64 with 1M of cache and a fast drive, or a dual Opteron server system, which combines the 1M cache with a fast memory system. The Athlon 64 FX also has a 1M cache, but the price of the FX processors is so high that a dual Opteron system can be had for not much more than an Athlon 64 FX system.