FASTRA II: the world’s most powerful desktop supercomputer (ua.ac.be)
79 points by jacquesm on Dec 10, 2011 | 36 comments


Calling this a "supercomputer" is a bit disingenuous. The most powerful supercomputer today can do 10.5 Petaflops on the Linpack benchmark [1]. The 500th most powerful supercomputer can do 51 Teraflops.

If the authors actually tried to measure Linpack scores on their cluster, they'd find that actual performance would be much worse than the 12 teraflops they're claiming as "peak" performance. This link [2] seems to indicate that one S2050 along with a fairly fast CPU gives you about 0.7 teraflops. Even if they get perfect scaling across GPUs, which they won't, and even if they could solve all the heating issues with their design, they won't get more than about 0.7*6.5 = 4.55 TFLOPS peak on Linpack.
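
To make the back-of-the-envelope estimate explicit, here is a tiny sketch; the per-unit figure and the scaling efficiency are assumptions carried over from the numbers above, not measurements:

    #include <cstdio>

    int main() {
        const double tflops_per_unit = 0.7; // rough HPL figure for one S2050 + fast CPU, per [2]
        const double units           = 6.5; // the multiplier used above
        const double efficiency      = 1.0; // assume perfect scaling, which real clusters never reach
        std::printf("estimated HPL peak: %.2f TFLOPS\n",
                    tflops_per_unit * units * efficiency);
        return 0;
    }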

In any case, many (39 out of 500 to be precise) of the most powerful supercomputers in the world also use GPUs to accelerate certain applications, so there's simply no way this thing would be faster than those computers.

[1] http://www.top500.org/lists/2011/11/press-release

[2] http://hpl-calculator.sourceforge.net/Howto-HPL-GPU.pdf


One S2050 has four M2050s, not two (I inferred that assumption from your 6.5 multiplier).

I'd like more information on the CPUs and the CPU/GPU bus of this FASTRA II. CPUs are really closing the performance gap with GPUs - an M2090 is only about 5-10x faster than the newer Intel Xeons in a properly multithreaded program that pays attention to the NUMA nodes. That is efficient enough that, for large simulations which do not fit on one GPU, running portions of the simulation on the CPU easily overcomes the CPU/GPU communication overhead.
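
For what it's worth, "pays attention to the NUMA nodes" mostly comes down to first-touch placement: initialize the data with the same threads (and the same schedule) that will later process it, so the OS puts each page on the local memory node. A minimal hedged sketch with OpenMP; the array size and the dot-product kernel are made up for illustration (compile with -fopenmp):

    #include <omp.h>
    #include <memory>
    #include <cstdio>

    int main() {
        const long n = 1L << 24;
        // Allocate without initializing, so no page is touched yet.
        std::unique_ptr<double[]> a(new double[n]);
        std::unique_ptr<double[]> b(new double[n]);

        // First touch: each thread writes the pages it will later read,
        // so the OS places those pages on that thread's local NUMA node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

        double sum = 0.0;
        // Same static schedule => same thread-to-index mapping => mostly local accesses.
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < n; ++i) sum += a[i] * b[i];

        std::printf("dot = %f\n", sum);
        return 0;
    }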


>I'd like more information on the CPUs and CPU/GPU bus of this FASTRA II.

"the graphics cards, which are connected to the motherboard by flexible riser cables." - so this sounds like typical PCIe based design. Usually the bundle of GPU cards on PCIe backplane is placed into separate case which connected back to the CPU host, and in this case they engineered it into one big case thus each card getting it's own x16/x8 pipe.

In general such a "supercomputer" is cheaper, hardware-wise, than an equivalent CPU-based machine, yet it is still more expensive in terms of human labor (i.e. programming).


cough 12 teraflops, not peta


Oh right. Thanks for catching that typo.


Agreed, thank you. "Supercomputer" is a continually moving goalpost that basically means "more computing power than you can achieve with personal computers." Supercomputers push the limits of the performance we can achieve. If you can fit it inside a PC case, then you're not pushing the limits of the kind of performance we can achieve.


Although the computer itself is reason enough to be impressed, I must admit I'm more impressed with the use of the "super" computer. Beautiful that we geeks can contribute so much to every other discipline.

Immediately, in the back of my head, I got this gnawing feeling that I should also use my skills in such life-improving projects instead of building the next social _______ life drain.


There are several relatively well-paid PhD positions available at the vision group ;)


How is it the world's most powerful if it's loaded with Nvidia cards?

200-series-era GeForce cards were slower per watt, per dollar, and per slot (all important here) than 58xx-series Radeons on both single- and double-precision math. (Integer as well, but integer is only really useful for crypto; Radeons have native single-cycle instructions common in crypto, which makes them about 4-5x faster than GeForces there, though that's not a typical use case.)

And if we're comparing the 400/500 series to the 69xx, the beating is much worse.


Nvidia and ATI cards have (approximately and with a few exceptions) the same performance per dollar when it comes to 3D graphics applications. But the tradeoffs made internally mean that they have radically different performance in other areas. For example, ATI chips are faster than Nvidia for bitcoin mining (edit: lots of sha256 hashes), but Nvidia is generally faster than ATI at physical modeling.

http://ewoah.com/technology/a-very-good-guide-to-building-a-...

http://stackoverflow.com/a/4642578/13652

All this is moot if the application they're running only supports CUDA - then they're stuck with Nvidia.


Author of DiabloMiner, the world's most popular GPU miner, here... I think you mean SHA256.

If they're stuck on CUDA, I feel sorry for them. They'd be better off rewriting the app to use OpenCL to take advantage of AMD's superior chip design and not be stuck on Nvidia/Windows (Nvidia Linux support, especially under CUDA, is horrible; rather depressing once you consider most CUDA clusters run Linux).
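
For anyone wondering what "rewriting the app to use OpenCL" involves on the host side, here is a minimal hedged sketch (a toy vector-add kernel in OpenCL 1.1 style, error checking mostly omitted; this is not DiabloMiner's or anyone else's actual code):

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    static const char* kSrc =
        "__kernel void vadd(__global const float* a, __global const float* b,\n"
        "                   __global float* c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main() {
        const size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        // Pick the first platform and any device on it (GPU or CPU); the same
        // binary runs on AMD, Nvidia, or a CPU-only OpenCL implementation.
        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, nullptr);

        cl_int err;
        cl_context ctx     = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
        clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "vadd", &err);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   n * sizeof(float), a.data(), &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   n * sizeof(float), b.data(), &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, &err);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);

        std::printf("c[0] = %f\n", c[0]); // expect 3.0
        return 0;
    }

Link with -lOpenCL against whichever vendor SDK is installed; the binary itself is vendor-neutral.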


Can you give us some thoughts on how (and why) Nvidia/Linux support is bad?

As a Linux developer doing scientific visualization, and having tested (and still testing, btw) how bad the AMD driver situation on Linux still is, I basically see Nvidia as my only choice if I want to get anything done with decent (and stable) performance.

I wouldn't run a cluster on anything but *nix, so by my reasoning here I really see no alternative to Nvidia, even if Nvidia's architecture is actually inferior.

But I'd really like to have some opinions on that front. I really want alternatives, and the closed nature of nvidia's drivers is a problem for me. So, can you elaborate?

I also programmed on Win2k (from around 2001-2004) and used to target both Nvidia and ATI, and I still remember the sheer number of GL bugs I hit with the Radeon drivers. But I can't really speak to that front anymore.


I have answered most of your question here already: http://news.ycombinator.com/item?id=3337682

Nvidia driver quality continues to be subpar on Linux, yet I have zero issues with Catalyst. And even if Catalyst had a problem, as I described in the above URL, AMD is shelling out a lot of money to make the FOSS driver their one and only driver, which is already shaping up to be a lot better than Catalyst.

I just don't see a point in doing business with a company that is clearly trying to lock me in with closed source vendor specific APIs if they have no interest in supporting me after the sale of the hardware.

In addition, their OpenCL implementation (which is shared across both Windows and Linux) seems to run code much slower than it should, which I suspect (but have no way of proving) they are doing to make CUDA look better. The same code run on AMD hardware has zero issues, and the equivalent code in CUDA runs about as fast as I think it should.

Also, re: Catalyst vs FOSS, in about 2 years the Mesa/Gallium OpenCL stack should be finished, so not only will I be able to quit using Catalyst altogether, Nvidia users will also be able to make better use of their hardware and ditch the broken drivers Nvidia keeps forcing on people.

Now, who knows, maybe Nvidia will change tactics and quit screwing over customers and developers, and release the programming manuals for their hardware so FOSS can develop a better driver faster... I just don't see it _ever_ happening.

Even if AMD hardware had a higher total cost of ownership, the fact that AMD actually cares about me as a Linux customer is why they have gotten my business for the past 5 video cards I've owned.

Re: your programming experience, 2004 predates AMD taking over ATI. Driver quality pre-AMD was worse than Nvidia's at the time, but that has changed greatly. AMD has vastly increased the quality of the Windows driver, and the original Linux driver was thrown out in exchange for one that shares almost all of its code with the Windows one (a night-and-day change in quality).

AMD didn't finalize their purchase of ATI until 2006, and it wasn't until 2007 or 2008 that driver quality started taking off.

As a statement of driver quality, people who use my software regularly report 30+ day uptimes using Catalyst on Linux. If it was so buggy and unstable, this would probably be impossible.


I was hoping for some meatier details about what exactly is "subpar" about Nvidia on Linux (besides the closed nature of the driver).

In terms of GL implementation, the Nvidia driver has everything you would expect. If you use GL directly, it is one of the most conforming implementations I've used (supporting "deprecated" stuff like bitmap operations, line and polygon antialiasing, imaging extensions, all GL revisions, pretty much all ARB extensions, etc). Everything works, from Cg to CUDA to OpenCL to VDPAU. There is really no feature gap between Windows and Linux.

I've never compared OpenCL to CUDA performance really, but switching to OpenCL for new projects is something I'm planning to do.

My major problem, really, is that CUDA is locked to specific versions of GCC, and that is an even bigger issue than the closed nature of the driver itself.


When I was working for AMD, I heard from the GPU guys that one of the big reasons they weren't seeing a lot of HPC-space adoption for the Radeons was because they didn't (yet) support ECC for the DRAM.


Because no one's bothered to build one using ATI cards, maybe?

I would guess the main attraction of Nvidia cards was probably CUDA, which is still a nicer environment to program for than OpenCL and seems to have a more lively ecosystem of libraries, tools, etc.

Also, floating point & integer math aren't the only relevant performance metrics. For their application domain, memory bandwidth & latency would probably be at least as important. Kernel switching time may be a factor too - there are many things you can't do with a single kernel. I don't know (but would be interested to hear) how the cards stack up against each other in those terms.


I'm not sure how you can claim CUDA is nicer. CUDA is not a public specification, it has poor documentation, and it only works on Nvidia.

I am a software engineer (or whatever the popular term this week is) who uses OpenCL frequently. I couldn't imagine being stuck on something like CUDA; between the C++ obsession and the fact that it follows Nvidia Cg's style of syntax instead of GLSL (OpenGL Shading Language), it is difficult to use.

In addition, there never will be a native Linux implementation that is open source as Nvidia has basically tried to screw over Mesa/Gallium repeatedly with vague IP rights threats (see also, Nouveau, the X/Gallium driver for Nvidia cards, and how they unofficially threatened to sue if Nouveau continued reverse engineering hardware interfaces; they backpedaled when it got media coverage).

So, I dunno about you, but I don't like depending on companies that continuously screw developers who made the mistake of supporting their hardware.

OTOH, AMD has put over $2 million into their open source efforts, including having their legal department clear the programming manuals for use by X/Mesa/Gallium and hiring multiple full-time developers to work on the open source drivers, and has clearly supported FOSS at every turn.


I don't think language flamewars are very interesting - especially ones based on political rather than technical grounds - so let's just say that I should have qualified my statement with "in my opinion" and leave it at that.

Re: the original point, I have the impression that Nvidia has been more active in engaging universities, and that's probably a better explanation for why their cards were used rather than AMD's.


Can you speak to Microsoft's AMP framework at all? I believe AMD has stated they will support it.


I can't find anything named Microsoft AMP. However, it sounds like you're talking about DirectCompute, the compute API for DirectX.

AMD will support DirectCompute because it's the native API on Windows, the same way they continue to support Direct3D. This doesn't mean that Direct3D is more viable as an API than OpenGL, but it means it's popular on one platform.


Relevant link to the C++ AMP announcement (http://channel9.msdn.com/posts/AFDS-Keynote-Herb-Sutter-Hete...). It's basically OpenMP for GPGPU: magic C++ constructs that convert your code to DirectCompute shaders (although in theory OpenCL or CUDA could be targets as well). As with OpenMP, I'm skeptical whether it will actually lead to any kind of speedup of random C++ code, even random C++ code that is data parallel, but it's an interesting idea and, like OpenMP, is very low cost for developers to try out on their existing code base.
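
For a sense of the API style, here is a minimal hedged sketch of a C++ AMP vector add (needs Visual C++ with AMP support; the data and sizes are made up for illustration):

    #include <amp.h>
    #include <vector>
    #include <iostream>

    int main() {
        const int n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        concurrency::array_view<const float, 1> av(n, a), bv(n, b);
        concurrency::array_view<float, 1> cv(n, c);
        cv.discard_data(); // output only, no need to copy c to the accelerator

        // The lambda body is compiled down to a DirectCompute shader;
        // restrict(amp) limits it to the subset of C++ the GPU can run.
        concurrency::parallel_for_each(cv.extent,
            [=](concurrency::index<1> i) restrict(amp) {
                cv[i] = av[i] + bv[i];
            });

        cv.synchronize(); // copy results back into the host vector
        std::cout << "c[0] = " << c[0] << std::endl; // expect 3
        return 0;
    }

The "annotate your existing loops and let the compiler offload them" spirit is very much like OpenMP.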


C++ AMP is a framework for acceleration of general purpose C++ code with heterogeneous processors. It's not the same as DirectCompute which in this case is more like the assembly language of GPU programming on Windows as Microsoft is building C++ AMP on top of it.


No, DirectCompute is more like the C++ of GPU programming. The syntax style of its compute shaders is the same C++-ish style that Cg and HLSL use.

In comparison, OpenCL's compute kernels use the same C-ish style that OpenGL uses for GLSL.

OpenCL and DirectCompute are pretty high level considering they're running on what equates to massively parallel DSPs.

Oh, and OpenCL already runs on heterogeneous setups. There are OpenCL compilers for three different GPU archs, x86, PPC, Cell SPEs, a few DSPs, and ARM. An OpenCL program can use all of them at the same time if available.
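
As a hedged illustration, here is a short enumeration sketch (OpenCL 1.1 C API, error checking omitted) that lists every device the installed runtimes expose; a context and queue could be created per device and work split across all of them:

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, nullptr, &nplat);
        std::vector<cl_platform_id> plats(nplat);
        clGetPlatformIDs(nplat, plats.data(), nullptr);

        for (cl_platform_id p : plats) {
            char pname[256];
            clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);

            cl_uint ndev = 0;
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
            std::vector<cl_device_id> devs(ndev);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, ndev, devs.data(), nullptr);

            for (cl_device_id d : devs) {
                char dname[256];
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
                std::printf("%s: %s\n", pname, dname); // GPU, CPU, or accelerator alike
            }
        }
        return 0;
    }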


I meant that it was the assembly language because DirectCompute is the target of the C++ AMP compiler, similar to how some people call JavaScript the assembly language of the web. It's the lowest level language that those developing in that environment can target.


If you can find a URL about AMP, I'd like to see it. Google seems to think it doesn't exist.

It almost sounds like AMD's project that turns Java bytecode into GPU-able code (although it is limited to what you can do normally, such as no memory allocation/no object construction, only certain primitive types, and it should be largely branchless/loopless).

See: http://developer.amd.com/zones/java/aparapi/Pages/default.as...



Probably because someone hasn't yet assembled a similar system with ATI cards? They claim they have the fastest desktop system, and it probably is at the time of writing, but as far as I can tell they didn't say it couldn't possibly be better/faster. There's always room for improvement, no? :)


A bit offtopic, but we in image processing do use integers a lot: they're fast, with a single SSE instruction processing 16 uint8s at once.
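
For example, one SSE2 saturating add really does process 16 uint8 pixels per instruction. A hedged sketch (not from any particular image library) that brightens an 8-bit grayscale buffer:

    #include <emmintrin.h> // SSE2 intrinsics
    #include <cstdint>
    #include <cstddef>

    // Add `delta` to every 8-bit pixel with saturation (clamped at 255).
    void brighten(uint8_t* pixels, size_t count, uint8_t delta) {
        const __m128i d = _mm_set1_epi8((char)delta);
        size_t i = 0;
        for (; i + 16 <= count; i += 16) {
            __m128i px = _mm_loadu_si128((const __m128i*)(pixels + i));
            px = _mm_adds_epu8(px, d); // 16 saturating uint8 adds in one instruction
            _mm_storeu_si128((__m128i*)(pixels + i), px);
        }
        for (; i < count; ++i) {       // scalar tail for any leftover pixels
            unsigned v = pixels[i] + delta;
            pixels[i] = (uint8_t)(v > 255 ? 255 : v);
        }
    }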


Except video cards do image processing every single day. Modern game engines use floating point pipelines the whole way through, the only integers in the GPU typically being texture data and the final output to the screen.

SSE2 integer usage may be fast, but GPUs are faster, typically by several orders of magnitude.


They've had enough time to build a FASTRA III since this was released...


(keep in mind, 2009)


It looks like they have 7 cards in there. If all the GPUs in that system were pegged at 100% for 30 minutes, the system would overheat. There isn't enough space between the cards for free air flow.

7 x 5970s would give you just under 35 teraflops instead of 12.


To be fair, 5970s are dual-GPU cards and would probably overheat packed in like that. A 5970 has two 5870s, but they are downclocked, so the performance is closer to two 5850s.

8 5870s packed into two cases would survive if you keep the cases closed and put 3 Delta AFB1212s in the front pushing all the air through the cards and out their rear vents; this is how most Bitcoin miners build their cases.*

If done right, 8 5870s should easily be kept under 85C even when marginally overclocked.

* Large-scale Bitcoin miners have a habit of not using cases and instead use flexible risers to keep the cards away from each other out in the open. I do not recommend this, as you lose out on high-pressure cooling, a staple of enterprise/high performance computing.


FWIW, I set up a bitcoin mining rig with four 5870s by using liquid cooling. Full story here: http://silasx.blogspot.com/2011/12/exclusive-silass-bitcoin-...


I used water cooling. With a 5970 it does not add too much to the price, but it helps keep the cards cool in a tight environment. Specifically, I had a microATX case with two 5970s at 850 MHz (they could handle more, but I just wanted to feel safer). The GPUs were staying at 55-65 C. I bet you can do better if you route separate water tubes to each card (which I did not, because of the tight environment and being lazy :))

Before adding water cooling, it would lock up after 30 minutes of mining at the standard 725 MHz.

This system worked very well for ~4 months (and then I sold it to a guy from Alaska, since it became too hot in Texas :))


They have a thermal imaging camera video:

(http://fastra2.ua.ac.be/?page_id=53)

They claim "slightly over 60 degrees".



