Big Mac Benchmark Drops to 7.4 TFlops

Posted by CowboyNeal
from the number-adjusting dept.
coolmacdude writes "Well, it seems that the early estimates were a bit overzealous. According to preliminary test results (in PostScript format) on the full range of CPUs at Virginia Tech, the Rmax score on Linpack comes in at around 7.4 TFlops. This puts it at number four on the Top 500 List. It also represents an efficiency of about 44 percent, down from the previous result of 80 percent achieved on a subset of the computers. Perhaps in light of this, VT is apparently now planning to devote an additional two months to improving the stability and efficiency of the system before any research can begin. While these numbers will no doubt come as a disappointment for Mac zealots who wanted to blow away all the Intel machines, it should still be noted that this is the best price/performance ratio ever achieved on a supercomputer. In addition, the project was successful at meeting VT's goal of developing an inexpensive top 5 machine. The results have also been posted at Ars Technica's openforum."
This discussion has been archived. No new comments can be posted.

  • by daveschroeder (516195) * on Wednesday October 22, 2003 @02:10PM (#7283274)
    It's worth noting a few important things:

    First, from an Oct. 22 New York Times [nytimes.com] story:

    Officials at the school said that they were still finalizing their results and that the final speed number might be significantly higher.

    This will likely be the case.

    Second, they're only 0.224 Tflops away from the only Intel-based cluster above it. So saying "all the Intel machines" in the story is kind of inaccurate, as if there are all kinds of Intel-based clusters that will still be faster; there is only one Intel-based cluster above it, and with only preliminary numbers for the Virginia Tech cluster at that.

    Third, this figure is with around 2112 processors, not the full 2200 processors. With all 1100 nodes, even with no efficiency gain, it will be number 3, as-is.

    Finally, this is a cluster of several firsts:

    First major cluster with PowerPC 970

    First major cluster with Apple hardware

    First major cluster with Infiniband

    First major cluster with Mac OS X (Yes, it is running Mac OS X 10.2.7, NOT Linux or Panther [yet])

    Linux on Intel has been at this for years. This cluster was assembled in 3 months. There is no reason for the Virginia Tech cluster to remain at ~40% efficiency. It is more than reasonable to expect higher than 50%.

    It's still destined for number 3, and its performance will likely even climb for the next Top 500 list as the cluster is optimized. The final results will not be officially announced until a session on November 18 at Supercomputing 2003.
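
    The efficiency figures in this thread can be reproduced with a rough sketch like the one below. It assumes a 2.0 GHz clock and 4 double-precision flops per cycle per CPU; neither value is stated in this comment, so treat both as assumptions. The function names are made up for illustration.

```python
# Linpack efficiency is Rmax (measured) divided by Rpeak (theoretical).
# ASSUMPTIONS, not taken from the comment above: a 2.0 GHz clock and
# 4 double-precision flops per cycle per CPU (two FPUs, each doing one
# multiply-add per cycle).

CLOCK_HZ = 2.0e9        # assumed G5 clock speed
FLOPS_PER_CYCLE = 4     # assumed: 2 FPUs x (multiply + add)


def peak_tflops(cpus: int) -> float:
    """Theoretical peak (Rpeak) in TFlops for a given CPU count."""
    return cpus * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12


def efficiency(rmax_tflops: float, cpus: int) -> float:
    """Fraction of the theoretical peak achieved on Linpack."""
    return rmax_tflops / peak_tflops(cpus)


def projected_rmax(rmax_tflops: float, cpus_now: int, cpus_full: int) -> float:
    """Rmax on more CPUs if efficiency stays exactly the same."""
    return efficiency(rmax_tflops, cpus_now) * peak_tflops(cpus_full)


if __name__ == "__main__":
    print(f"Rpeak, 2112 CPUs: {peak_tflops(2112):.3f} TFlops")
    print(f"Efficiency at 7.41 TFlops: {efficiency(7.41, 2112):.1%}")
    print(f"Projected Rmax, 2200 CPUs: {projected_rmax(7.41, 2112, 2200):.2f} TFlops")
```

    Under those assumptions the 2112-processor run works out to roughly 44 percent of peak, in line with the figure in the story.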

  • by daveschroeder (516195) * on Wednesday October 22, 2003 @02:19PM (#7283366)
    See http://www.netlib.org/benchmark/performance.pdf [netlib.org] page 53.

    Since yesterday's release at 7.41 Tflop, the G5 cluster has already increased almost a Tflop, and is now ahead of the current #3 MCR Linux cluster, and about 0.5 Tflop behind a new Itanium 2 cluster.
  • by daveschroeder (516195) * on Wednesday October 22, 2003 @02:21PM (#7283397)
    So, yes, these numbers are preliminary, and yes, they WILL increase - they already are. See http://www.netlib.org/benchmark/performance.pdf (the official source of preliminary numbers), page 53.
  • Not really (Score:5, Informative)

    by daveschroeder (516195) * on Wednesday October 22, 2003 @02:26PM (#7283442)
    The preliminary performance report at http://www.netlib.org/benchmark/performance.pdf contains the new entries for the upcoming list as well (see page 53).
  • by Carnildo (712617) on Wednesday October 22, 2003 @02:26PM (#7283449) Homepage Journal
    The number dropped because they used a better benchmark (testing all the nodes, rather than a subset). It'll probably go up because now they'll be able to tune the system to get around bottlenecks.
  • by humpTdance (666118) on Wednesday October 22, 2003 @02:30PM (#7283481)
    Until these applications are written in 64 bit code, it won't matter. Smeagol and Panther will still have to cross that bridge so old utilization rates will continue to apply.

    From: http://www.theregister.co.uk/content/39/31995.html [slashdot.org]

    The PowerPC architecture was always defined as a true 64-bit environment with 32-bit operation defined as a sub-set of that environment and a 32/64-bit 'bridge', as used by the 970, to "facilitate the migration of operating systems from 32-bit processor designs to 64-bit processors".

    The 'bridge' technology essentially allows the 970 to host 32-bit operating systems and apps that have been modified to support 64-bit addresses and larger file sizes, as both Smeagol and Panther have. Adding 64-bit address support to existing applications lies at the heart of the optimisations for the Power Mac G5 that Apple suggests developers make.

  • Also Important? (Score:3, Informative)

    by ThosLives (686517) on Wednesday October 22, 2003 @02:34PM (#7283514) Journal
    If you read the fine print, the Nmax for the G5 was 100,000 higher than for the Linux cluster. Now, that's kind of interesting, because the G5 cluster was then only slightly slower doing a much bigger (450,000 Nmax vs 350,000 Nmax on the Xeons) problem. I wonder why they don't somehow scale the FLOPs to reflect this fact.

    Anyone know how much merit there is to using Nmax (or N1/2) to compare different systems?
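
    For a sense of how much bigger the larger problem is, here is a small sketch. It keeps only the leading term of the usual dense LU operation count, (2/3)N^3, and drops the lower-order terms; the function names are hypothetical, and the 7.41 TFlops rate is the preliminary figure quoted elsewhere in this discussion.

```python
def lu_flops(n: int) -> float:
    """Leading-order flop count for solving a dense n x n system by LU.

    Only the dominant (2/3) * n^3 term is kept; lower-order terms
    are ignored in this sketch.
    """
    return (2.0 / 3.0) * n ** 3


def run_seconds(n: int, sustained_tflops: float) -> float:
    """Approximate wall time for one solve at a given sustained rate."""
    return lu_flops(n) / (sustained_tflops * 1e12)


if __name__ == "__main__":
    big, small = 450_000, 350_000  # Nmax values quoted in the comment
    print(f"work ratio: {lu_flops(big) / lu_flops(small):.2f}x")
    print(f"N={big} at 7.41 TFlops: {run_seconds(big, 7.41) / 3600:.1f} hours")
```

    Because the work grows with the cube of N, the 450,000 problem is a bit more than twice the work of the 350,000 one.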

  • by hackstraw (262471) * on Wednesday October 22, 2003 @02:36PM (#7283544)
    FWIW here are the efficiencies (Rmax/Rpeak, in percent) for the top 10 on www.top500.org:

    87.5 NEC Earth-Simulator
    67.8 Hewlett-Packard ASCI Q
    69.0 Linux Networx MCR Linux Cluster Xeon
    59.4 IBM ASCI White
    73.2 IBM SP Power3
    71.5 IBM xSeries Cluster
    45.1 Fujitsu PRIMEPOWER HPC2500
    79.2 Hewlett-Packard rx2600
    72.0 Hewlett-Packard AlphaServer SC
    77.7 Hewlett-Packard AlphaServer SC
  • Re:Big mac cluster.. (Score:4, Informative)

    by zulux (112259) on Wednesday October 22, 2003 @02:43PM (#7283598) Homepage Journal
    since the coke is only 300ish calories in the first place...

    For consumers, food calories are really kilocalories. So in this case, your Coke has 300,000 physics-style calories.

    If you look at European food labels, you can sometimes see them written as kcal.

  • by Anonymous Coward on Wednesday October 22, 2003 @03:04PM (#7283789)
    The Linpack benchmark, as compiled to the G5, is not utilizing the processor to its fullest. The school is still in the process of adding Altivec compiler optimizations, which should drastically improve the results.

    Right now, the processor is behaving essentially as a G4 with a bigger fan and more memory addresses. Rumor has it that tweaking the compiler to abuse the Altivec unit may push the system above the theoretical limit in some calculations.
  • Re:Big mac cluster.. (Score:5, Informative)

    by Graff (532189) on Wednesday October 22, 2003 @03:06PM (#7283809)
    The original poster was wrong when he said:
    1 Cal (uppercase C) is the amount of heat required to raise the temperature of 1g of water 1 degree celsius

    A Calorie (the one used on food labels) is actually a kilocalorie. A Calorie is therefore 1000 calories. 1 calorie is basically the amount of heat needed to raise 1g of water 1 degree Celsius. (A calorie is actually 1/100 of the amount of heat needed to get 1 gram of water from 0 degrees C to 100 degrees C, but that works out almost the same.)

    This is explained a bit on this web page. [reference.com]

    So warming a 4 degrees C, 350mL Coke to 37 degrees C would take (37 - 4) * 350 = 11550 calories. This is 11.55 kilocalories or 11.55 Calories. The Coke has around 300 Calories in nutritive value therefore you would gain 300 - 11.55 = 288.45 Calories of energy from a 4 degrees C, 350mL can of Coke.
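
    The same arithmetic as a short sketch. It assumes, as the comment implicitly does, that the drink behaves like water: 1 g per mL and 1 calorie per gram per degree C. The function name is made up for illustration.

```python
def warming_cost_kcal(volume_ml: float, start_c: float, end_c: float) -> float:
    """Kilocalories (food Calories) needed to warm a drink.

    Assumes the drink has the density and specific heat of water:
    1 g per mL and 1 small calorie per gram per degree C.
    """
    small_calories = (end_c - start_c) * volume_ml
    return small_calories / 1000.0


if __name__ == "__main__":
    cost = warming_cost_kcal(350, 4, 37)
    net = 300 - cost  # 300 Calories is the nutritive value used above
    print(f"warming cost: {cost:.2f} Calories, net gain: {net:.2f} Calories")
```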
  • Scalability (Score:5, Informative)

    by jd (1658) <{moc.oohay} {ta} {kapimi}> on Wednesday October 22, 2003 @03:16PM (#7283935) Homepage Journal
    First, scalability is highly non-linear. See Amdahl's Law. Thus, the loss of performance is nothing remarkable, in and of itself.
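
    For reference, a minimal sketch of Amdahl's Law itself. The parallel fraction used in the demo (0.999) is purely hypothetical, not a measured value for this cluster, and the model covers only the serial fraction; communication overhead between nodes is not modeled here.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's Law: speedup on n processors when only
    `parallel_fraction` of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


def amdahl_efficiency(parallel_fraction: float, n: int) -> float:
    """Speedup per processor, i.e. the fraction of ideal scaling achieved."""
    return amdahl_speedup(parallel_fraction, n) / n


if __name__ == "__main__":
    p = 0.999  # hypothetical parallel fraction, chosen only for illustration
    for n in (128, 1024, 2200):
        print(f"n={n:5d}  speedup={amdahl_speedup(p, n):8.1f}  "
              f"efficiency={amdahl_efficiency(p, n):.1%}")
```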


    The degree of loss is interesting, and suggests that their algorithm for distributing work needs tightening up on the high-end. Nonetheless, none of these are bad figures. When this story first broke, you'll recall the quote from the top500 list maintainer who pointed out that very few machines had high performance ratings, when they got into the large numbers of nodes.


    I'd say these are extremely credible results, well worth the project team congratulating themselves. If the team could open-source the distribution algorithms, it would be interesting to take a look. I'm sure plenty of Mosix and BProc fans would love to know how to ramp the scaling up.


    (The problem of scaling is why jokes about making a Beowulf cluster of these would be just dumb. At the rate at which performance is lost, two Big Macs linked in a cluster would run slower than a single Big Mac. A large cluster would run slower than any of the nodes within it. Such is the Curse that Amdahl inflicted upon the superscalar world.)


    The problem of producing superscalar architectures is non-trivial. It's also NP-complete, which means there isn't a single solution which will fit all situations, or even a way to trivially derive a solution for any given situation. You've got to make an educated guess, see what happens, and then make a better informed educated guess. Repeat until bored, funding is cut, the world ends, or you reach a result you like.


    This is why it's so valuable to know how this team managed such a good performance in their first test. Knowing how to build high-performing clusters is extremely valuable. I think it not unreasonable to say that 99% of the money in supercomputing goes into researching how to squeeze a bit more speed out of reconfiguring. It's cheaper to do a bit of rewiring than to build a complete machine, so it's a lot more attractive.


    On the flip-side, if superscaling ever becomes something mere mortals can actively make use of, understand, and refine, we can expect to see vastly superior - and cheaper - SMP technology, vastly more powerful PCs, and a continuation of the erosion of the differences between micros, minis, mainframes and supercomputers.


    It will also make packing the car easier. (* This is actually a related NP-complete problem. If you can "solve" one, you can solve the other.)

  • by mz001b (122709) on Wednesday October 22, 2003 @03:29PM (#7284100)
    On the other side of the issue is that it places 4th in the current Top 500 list, which was released in June. We won't really know where it places on this "moving target" until the next list is released in November.

    The deadline for submission to the Nov 2003 Top 500 list was Oct. 1st (see call for proposals) [top500.org], so it has already passed. Any further improvements that they make to the scalability of the cluster should not be included. This is true for all the machines.

  • by Troy Baer (1395) on Wednesday October 22, 2003 @03:40PM (#7284204) Homepage
    The Linpack benchmark, as compiled to the G5, is not utilizing the processor to its fullest. The school is still in the process of adding Altivec compiler optimizations, which should drastically improve the results.
    The AltiVec instructions support only single precision (32-bit) floating point operations, and the core routine in the Parallel Linpack Benchmark is DGEMM() which is double precision (64-bit). The G5 already has two double precision FPUs, each of which can do a multiply/add op every clock cycle.

    My feeling is that the ~40% efficiency seen on the larger scale run is an indication that either VA Tech spent very little time tuning the problem size or they didn't design their InfiniBand fabric to really handle 1100 nodes hammering away at Parallel Linpack. (Given that they've been extremely vague about how their IB network is structured, I fear it may be the latter.)

    Right now, the processor is behaving essentially as a G4 with a bigger fan and more memory addresses. Rumor has it that tweaking the compiler to abuse the Altivec unit may push the system above the theoretical limit in some calculations.
    I doubt that's true, especially if they're using the IBM PPC compilers. The G4 has both significantly less memory bandwidth and a single double-precision-capable FPU, whereas the G5 is basically a single-core Power4 with an AltiVec unit in place of some cache. IBM's compilers (despite being a little wonky as far as naming and argument syntax) generally produce pretty fast code.
    --Troy
  • by blackSphere (641407) on Wednesday October 22, 2003 @04:05PM (#7284482)
    The efficiency of a parallel computer is considered to be

    E=Ts/(n*Tp)

    where Ts is the time to perform the computations serially, Tp is the total time to perform the computations on the parallel machine and n is the number of parallel processing units.

    It wouldn't take much to get a drastic improvement in efficiency simply by improving the time slightly for each parallel processor, especially for 1100 nodes.

    I don't know how the benchmark program runs, but improving the communication time would improve the efficiency as well.

    It shouldn't take much to boost this by a few million flops.
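
    The formula E=Ts/(n*Tp) from this comment, as a small sketch. The timings in the demo are invented for illustration and are not measurements from the cluster.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_units: int) -> float:
    """E = Ts / (n * Tp), as defined in the comment above."""
    if t_parallel <= 0 or n_units <= 0:
        raise ValueError("t_parallel and n_units must be positive")
    return t_serial / (n_units * t_parallel)


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to show the sensitivity.
    t_serial, n = 1100.0, 1100
    for t_parallel in (2.5, 2.0):
        e = parallel_efficiency(t_serial, t_parallel, n)
        print(f"Tp={t_parallel:.1f}s -> E={e:.0%}")
```

    With these made-up numbers, shaving half a second off the parallel run time moves the efficiency from 40% to 50%.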
  • by Knobby (71829) on Wednesday October 22, 2003 @04:09PM (#7284535)

    Grumble... Go take a look at Apple's description of the G5 architecture [apple.com] before spouting.. Here's the relevant lines:

    • Each PowerPC G5 processor has its own dedicated 1GHz bidirectional interface to the system controller for a mind-boggling 16GB per second of total bandwidth -- more than twice the 6.4-GBps maximum bandwidth of Pentium 4-based systems using the latest PC architecture
    • 800MHz HyperTransport interconnects for a maximum throughput of 3.2GB per second.
    Apple uses the same basic memory set-up as the AMD Opteron.
  • Re:Big mac cluster.. (Score:3, Informative)

    by lostchicken (226656) on Wednesday October 22, 2003 @04:46PM (#7284858)
    A different unit, though. 1 kcal = 4.187 kilojoules. (1 calorie (not kcal) = energy to raise 1 gramme of water one degree c, 1 joule is the work done in countering one newton of force for one meter.)
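
    The conversion as a quick sketch, using the 4.187 factor from this comment; the function name is made up for illustration.

```python
KJ_PER_KCAL = 4.187  # conversion factor quoted in the comment above


def kcal_to_kj(kcal: float) -> float:
    """Convert kilocalories (food Calories) to kilojoules."""
    return kcal * KJ_PER_KCAL


if __name__ == "__main__":
    print(f"300 kcal = {kcal_to_kj(300):.1f} kJ")
```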
  • Re:facts, please? (Score:3, Informative)

    by penguin7of9 (697383) on Wednesday October 22, 2003 @05:41PM (#7285355)
    Thanks for the pointer. Now, about that "most cost effective" bit? Compared to what? At retail prices?
  • by BWJones (18351) on Wednesday October 22, 2003 @07:48PM (#7286363) Homepage Journal
    Besides, performance per CPU doesn't matter much in these benchmarks, what matters is total bang for total buck, at the prices at which regular folks can get these machines (no special "we need a showcase" kind of deals). I suspect the 2.4GHz-based clusters are still a better deal than either the G5 or a 3.2GHz cluster, more CPUs or not.

    Actually, if you read back a little bit, you will find that the contract was awarded to Apple because they gave the best bang for the buck and it turns out that Dell optioned clusters would have been more expensive.

  • by Anonymous Coward on Wednesday October 22, 2003 @09:02PM (#7286904)
    yea, and he was selling the athlon for 5 mil, not buying it asshat...

    you understand the definitions, you just cant read.
  • by Hoser McMoose (202552) on Wednesday October 22, 2003 @09:30PM (#7287114)
    Err, Apple's G5 and the AMD Opteron don't have an even remotely related memory setup. The G5 looks a lot more like the AthlonXP and AthlonMP setups. The Opteron has an integrated 128-bit wide DDR memory controller, connects multiple CPUs directly through cache-coherent Hypertransport links, and uses additional 32-bit, 1600MT/s HT links (3.2GB/s in each direction) to connect the CPU directly to the I/O chips.

    The Powermac G5 uses an up to 1GT/s, 64-bit wide version of IBM's Elastic I/O bus to connect each processor to a memory controller chip, which in turn has a pair of 64-bit wide DDR memory controllers. These buses are also shared for the processors' I/O needs, which are passed over an 800MT/s, 16-bit wide Hypertransport link to the PCI-X controller.

    As for the width and speed of the Hypertransport links, Apple is very confusing on this front. In the document you linked they say "two bidirectional 16-bit, 800MHz HyperTransport interconnects for a maximum throughput of 3.2GB per second." In their PowerMac G5 Tech Specs PDF they say "two bidirectional 800MHz HyperTransport interconnects for a maximum throughput of 1.6 GBps." So which is it? And just what bandwidth are they measuring?

    The PowerMac does indeed have two separate bi-directional Hypertransport links, the first connects the memory and processor controller chip to the PCI-X controller, and the second goes from the PCI-X controller to the extra I/O chips. It seems to me like the page you quoted is ADDING the bandwidth of the two daisy-chained hypertransport links, which would be TOTALLY incorrect.

    My numbers came from the fact that a 16-bit (8-bits per direction) 800MT/s hypertransport link gets you only 800MB/s in each direction. Of course, it could really indeed be a "800MHz" hypertransport link, ie a 1600MT/s link since Hypertransport is a DDR protocol, but I highly doubt that since every other specification they mention just doubles the "MHz" number anytime they encounter a DDR bus (not that Apple is the only one to do this, Intel's "800MHz" bus runs at either 200MHz or 400MHz, depending on which clock you look at).
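
    The bandwidth arithmetic in this comment can be sketched as follows. It follows the comment's own convention that a "16-bit" link means 8 bits in each direction; that convention, like the function name, is an assumption carried over for illustration rather than something checked against the HyperTransport specification.

```python
def ht_mb_per_s(total_width_bits: int, mega_transfers_per_s: int) -> float:
    """Per-direction link bandwidth in MB/s.

    Uses the convention from the comment above: the quoted width is the
    total for both directions, so a "16-bit" link is 8 bits each way.
    """
    bits_per_direction = total_width_bits / 2
    bytes_per_transfer = bits_per_direction / 8
    return bytes_per_transfer * mega_transfers_per_s


if __name__ == "__main__":
    # 16-bit link at 800 MT/s versus 1600 MT/s (a DDR "800MHz" link)
    for rate in (800, 1600):
        one_way = ht_mb_per_s(16, rate)
        print(f"16-bit @ {rate} MT/s: {one_way:.0f} MB/s each way, "
              f"{2 * one_way:.0f} MB/s both ways combined")
    # 32-bit Opteron link at 1600 MT/s
    print(f"32-bit @ 1600 MT/s: {ht_mb_per_s(32, 1600):.0f} MB/s each way")
```

    Which of these readings Apple's 1.6 GBps and 3.2GB per second figures correspond to depends on the actual transfer rate and on whether both directions are being summed, which is exactly the ambiguity raised above.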
  • by tmattox (197485) <tmattox@@@ieee...org> on Wednesday October 22, 2003 @10:55PM (#7287607) Homepage
    I have yet to find a satisfactory description of the network topology they are using. The specs on the Infiniband switches they are using are quite impressive for latency and bandwidth numbers, but without knowing how they are interconnected, it's hard to say if it's latency or maybe bisection-bandwidth issues limiting their efficiency. The early report of 80% efficiency on 128 CPUs (or was it 128 nodes?) would seem to indicate the problem is with the switch fabric in some way. With ~1100 nodes, communications have to cross through multiple switches in any traditional network topology, resulting in higher latency, and possibly bandwidth bottlenecks.

    I saw some indication that they were using a Fat-Tree topology, which would eliminate any bandwidth bottlenecks between switches, but the number of switches used didn't seem large enough for a fat-tree. But again, VT just hasn't, as of the last time I looked, released enough information about the cluster to tell.

    BTW - My thesis work on Flat Neighborhood Networks (FNNs) [aggregate.org] used in the KLAT2 [aggregate.org] and KASY0 [aggregate.org] supercomputers is finding better ways to interconnect the nodes, given a particular set of network components.

  • by Helter (593482) on Thursday October 23, 2003 @01:03AM (#7288164)
    You're forgetting the AC costs... If you've ever worked in a DC you know that the room itself can get mighty toasty, and toasty air leads to cooked systems.

    Each processor, drive, and switch generates heat which is dissipated into the air. Untouched, that heat accumulates and will kill the entire thing. With 1100 dual processor nodes running constantly (and you can bet they'll each be running at pretty close to full tilt), that's a hell of a lot of heat that needs to be removed from the air.
  • by StewedSquirrel (574170) on Thursday October 23, 2003 @02:29AM (#7288435)
    The G5's memory controller is built into the U3 IC, which is essentially the "north bridge"- it is NOT built into the CPU.

    It connects to the CPU via the "Apple Processor Interface" NOT via hypertransport. It connects to its memory controller at 1/2 the CPU speed, unlike Opteron and Athlon 64 which connect to the memory controller at FULL CPU SPEED.

    Documentation:
    developer.apple.com [apple.com]
    apple.com [apple.com] (thanks for the link)

    From the U3 Northbridge, G5 uses hypertransport to connect to the other peripherals at 3.2GB/s.
    Opteron supports a hypertransport rate of 6.4 GB/s [tomshardware.com] directly from the CPU.

    The Opteron 4xx and 8xx models also happen to have THREE of these hypertransport channels connected in a cross-bar configuration for SMP systems, giving EACH CPU a dedicated 6.4GB/s connection, rather than the G5 architecture which must share that connection (since there is only one U3 chip in a dually G5).

    Support for PCI-X in the G5 by standard is a great thing. I wish more AMD systems contained it... I appreciate their native support of firewire and gigabit ethernet. But seriously... do you really want to argue architecture against a workstation class CPU? I'm a bit disappointed by the Athlon 64, but the Athlon 64 FX (desktop version of Opteron) and Opteron live up to most of my expectations and I expect to see more speeds out in the near future.

    Stewey
