The Nvidia DGX-1 Deep Learning Supercomputer in a Box (nvidia.com)
215 points by dphidt on April 5, 2016 | 98 comments


Just for some perspective, a little over 10 years ago, this $130k turnkey installation would sit at #1 in TOP500, easily beating out hundred-million-dollar initiatives like NEC's Earth Simulator and IBM's BlueGene/L: http://www.top500.org/lists/2005/06/ (170 TFLOPS vs. 137 TFLOPS)

At the other end, even a single GTX 960 would make it onto the list, placing in the 200s.


The 170 TFLOPS number that NVIDIA gives out is for FP16, while the TOP500 list reports FP64. The P100s that make up this NVIDIA box deliver about 5.3 TFLOPS each at FP64, or 42.4 TFLOPS for the whole box.
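Rough back-of-the-envelope math in Python, assuming NVIDIA's quoted per-card peaks (~5.3 TFLOPS FP64, ~21.2 TFLOPS FP16):

  # Peak FLOPS for 8x Tesla P100 at different precisions (quoted peaks, not measured)
  cards = 8
  fp64_per_card = 5.3   # TFLOPS
  fp16_per_card = 21.2  # TFLOPS
  print(cards * fp64_per_card)  # ~42.4 TFLOPS FP64
  print(cards * fp16_per_card)  # ~170 TFLOPS FP16 -- the headline number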

Sure, you can say that deep learning doesn't need FP64, but it is REALLY unfair to compare this to anything on the TOP500 list, especially when you consider that it is not balanced in terms of memory size or bandwidth (relative to its FLOPS) the way any real supercomputer-class system is.


Was thinking the same, but look at the memory bandwidth of this thing: 720GB/sec.* That is the number you should look at, and that is a sweet number... Also the NVLink tech looks nice for multi-GPU/heterogeneous computing (I guess the latency is the most important factor, but no idea what it is). Does someone know if the Xeons are connected to the GPUs with NVLink, or are they on PCIe? (I know the new POWER chips have NVLink but haven't read that Intel supports it.)

*http://www.anandtech.com/show/10222/nvidia-announces-tesla-p...


PCIe is the huge bottleneck... The two Xeons in the box are 2698 v3s, which each have only 40 PCIe lanes, meaning they are restricted to using 8x PCIe 3.0 lanes per card, which would net you a whopping 8GB/s between each CPU and GPU. EDIT: Oh, and no, Intel does not and probably never will support NVLink. I will eat my words (type this post up on paper and eat it) if they do in the next 5 years.
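For reference, a rough PCIe 3.0 estimate in Python (assuming ~985 MB/s of usable bandwidth per lane per direction after 128b/130b encoding overhead):

  # Usable PCIe 3.0 bandwidth for an x8 link, per direction
  per_lane_gb_s = 0.985   # GB/s after encoding overhead (assumed)
  lanes = 8
  print(lanes * per_lane_gb_s)  # ~7.9 GB/s, i.e. the "whopping 8GB/s" above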

When I talk about balanced (which is a huge influence in my architectural and system level designs), I want to ideally be able to hit theoretical throughput. If we look at FP64 as an example: if I want sustained throughput of fused multiply-adds (which is how NVIDIA always advertises its theoretical FLOP numbers), I would need to move 192 data bits (three 64-bit floating point operands) into each of my FPUs every cycle, and 64 bits out. That's 256 bits per cycle in a fully pipelined situation to be able to do 2 FLOPs/cycle. So our ideal bandwidth is 16 bytes for every 1 FLOP; if you have almost 10x more floating point capability than memory bandwidth, you are going to have a bad time (and GPUs very well reflect this on memory intensive workloads... take a look at GPUs on HPCG, they only get ~1-3% of their theoretical peak).
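A quick sanity check in Python, plugging in the P100's quoted peaks (~5.3 TFLOPS FP64, ~720 GB/s HBM2) -- assumed figures, not measurements:

  # Bandwidth needed to feed every FP64 FMA straight from memory vs. what HBM2 provides
  fp64_tflops = 5.3
  bytes_per_flop = 16                         # 256 bits in/out per 2 FLOPs
  needed_tb_s = fp64_tflops * bytes_per_flop  # ~84.8 TB/s ideal
  hbm2_tb_s = 0.72
  print(hbm2_tb_s / needed_tb_s)              # ~0.008, i.e. under 1% of peak is sustainable from DRAM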

I'm working on my own HPC-targeted chip, so obviously I have some bias here, but 720GB/s memory bandwidth for a chip that is that large and uses that much power isn't that impressive to me. Obviously I should wait to boast until I have my silicon in hand, but we're getting more than 3/4ths of that bandwidth at less than 1/10th of the power. Add in some fancy tricks, and our goal is to have our advertised theoretical numbers be pretty damn close to real application performance for memory intensive workloads.


>PCIe is the huge bottleneck... The two Xeons in the box...

It's a waste to put Xeons on these things if they use PCIe; in a lot of cases you end up only using them to drive the GPUs.

>When I talk about balanced (which is a huge influence in my architectural and system level designs)...

The DP performance on Teslas is ridiculous; I think it is a marketing ploy. People talk of buying gaming cards instead, as you are almost always memory bound...

>I'm working on my own HPC targeted chip..

Looks nice. Are you throwing out all the HW bloat and doing everything in software? Are you planning to have some form of OS running on these chips?


Still, how many of these boxes are we talking about to match the performance of the #1 from the top 500 in 2005? 10? 20? That's still under $3m for 20, which is pretty impressive to me.


Out of curiosity, what are some problems/solutions that require FP64?


Any finite element/volume problem. Anything integrated.


That is 170 TFLOPS Rpeak (theoretical performance assuming you could find a workload that doesn't need to wait for data movement) at half precision vs. 137 TFLOPS Rmax (usable performance on a dummy linear algebra problem) at double precision. No, it would not top the list.


Theoretical peak flops rate is useless for indicating performance nowadays. There are new benchmarks that take memory and network performance into account such as HPCG and HPGMG. On these benchmarks, throughput-oriented machines such as the ones Nvidia sells do not look good at all.


As already mentioned, the DGX-1's 170 TFLOPS is FP16 (that is about 42.5 TFLOPS at FP64). There is also the issue of GPU vs. CPU: you basically can't directly compare these operations on the same scale. You can easily drop 100x of your GPU performance in a bad-case scenario; the basic idea of a GPU is that you can sometimes gain an extra 1000x in performance.


Check out the specs here: http://images.nvidia.com/content/technologies/deep-learning/...

though I'm most curious about what motherboard is in there to support NVLink and NVHS.

Good overview of Pascal here: https://devblogs.nvidia.com/parallelforall/inside-pascal/

1 question: will we see NVLink become an open standard for use in/with other coprocessors?

1 gripe: they give relative performance data as compared to a CPU -- of course it's faster than a CPU


You mean you're not surprised that a machine with 8 GPUs, apparently costing $129k USD (from comment below), can outperform a single CPU? :)

(Of course, a better metric is that it's getting ~56x the performance at probably ~10x the TDP, but that's not surprising for a GPU with the current state of deep learning code.)

To their credit, the thermal and power engineering needed to get that dense a compute deployment is challenging. (Been there, done that; have the corpses of power supplies to show for it.) But the price means that it's going to be limited to hyper-dense HPC deployments by companies that don't have the resources to engineer their own for substantially less money, such as Facebook's Big Sur design: https://code.facebook.com/posts/1687861518126048/facebook-to... . And, of course, the academics and hobbyists will continue to use consumer GPUs, which give much better performance/$ but aren't nearly as HPC-friendly.


To be fair, they are comparing it to a dual-socket CPU system, which is twice as fair as comparing to a single socket!!

What I was getting at more was: I want to know the relative performance compared to another 8-Tesla box. I know comparing apples to apples isn't good marketing, but c'mon.


They gave a pseudo-comparison to other GPUs in the keynote

http://images.anandtech.com/doci/10225/DGX-1Speed.jpg


$129K buys you a lot of dual 22-core servers.


What kind of server pricing are you getting? Base servers are cheap, but add high-end Xeons and memory, not to mention interconnect, and I get something like 7 decently configured 1U servers for $129K (two 20-core CPUs with lots of RAM, 10GbE NICs, and mirrored boot/swap). No interconnect switching. That's for 20-core Haswell because I don't yet have discount pricing for Broadwell Xeons. I'm sure one could do better at hyperscaler discount, but this is startup low-ish quantity.


Not really. Not a lot at least.


if you include space, power, HVAC, and networking?


It looks like it uses a separate daughterboard that houses the GPUs + NVLink, connected to the main motherboard using quad Infiniband EDR (400Gbps) + RDMA. http://images.anandtech.com/doci/10225/SSP_85.JPG


The diagram is confusing, but the GPUs are connected to the NVLink matrix which is connected to the motherboard via the PLX PCIe switches. The quad IB/dual 10GbE are separate IO attached to the motherboard.

https://devblogs.nvidia.com/parallelforall/inside-pascal/


That would make much more sense. Thanks! The PCIe bandwidth must be fairly limited. 4x 100G InfiniBand is 64 PCIe lanes, out of 80 lanes available.
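Rough lane math in Python (assuming each EDR adapter sits on a PCIe 3.0 x16 connection, which is the usual configuration):

  # PCIe 3.0 lane budget: InfiniBand adapters vs. the two Xeons
  lanes_per_hca = 16
  hcas = 4
  lanes_per_xeon = 40
  print(hcas * lanes_per_hca)  # 64 lanes consumed by InfiniBand
  print(2 * lanes_per_xeon)    # 80 lanes available in total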


I am looking forward to OpenCL catching up with CUDA in maturity and adoption, so that NVidia's monopoly in Silicon for deep learning will come to an end.


I'm at GPU Technology Conference, where this computer was announced this morning. The amount of "wood behind the arrow" NVidia has for AI is insane. Even though the current demographic of GPU development is full of HPC simulations, physics, graphics... it's obvious that their biggest thrust is in machine learning. I don't think OpenCL can compete with this amount of money and enthusiasm. NVidia is rich and their engineers are very good. Some big changes would need to happen before OpenCL catches up to CUDA.


I think it's just really bad management from AMD. Took them ages to wake up, and now they have what looks like a relatively small team on their Boltzmann initiative. Remains to be seen what happens to it.

How much do you think it would really cost to develop an OpenCL equivalent of cuDNN (even a stripped-down version, just fast)? I know AMD are struggling, but we are talking about allocating a handful of talented engineers.


For that to happen OpenCL has to be at the same level as CUDA in language support and tooling.

Having C only wasn't a good idea. NVidia was quite clever in giving first class treatment to C++, Fortran and any compiler vendor that wished to target PTX.

Also the visual debugging tools are quite good.

Khronos apparently needed to be hit hard to realise that not everyone wants to be stuck with C for HPC in the 21st century.

Also although Apple is the creator of OpenCL, they don't seem to give much love to it.

Then you have Google caring about its RenderScript dialect, which doesn't help the overall uptake of OpenCL.

There isn't a monopoly, rather vendors that lacked the perception to appeal to developers with the tooling and performance they wanted to have.

Anyone is free to go use OpenCL, use C or a language with a compiler with a C target, do printf debugging and feel free.

Are any vendors already doing SPIR support?


What monopoly? You totally have a choice, it's just that NVIDIA made a large bet on GPGPU and it is paying off for them. You don't see AMD heavily pushing their cards for compute purposes or developing computational developer relations.


You often don't have a choice because a large amount of GPGPU software is written using CUDA, which is Nvidia-specific.


NVIDIA does not have a monopoly in the traditional sense. But yes, they have a de facto one because there is no viable competition.

It's like saying MATLAB has a monopoly in academic research because so much of the code is written in it. That is slowly changing and moving over to Python now, which is great. Maybe OpenCL will get there someday, but I don't see it happening any time soon.


This is wrong. No mainstream deep learning library uses OpenCL, and the non-mainstream ones that do are much, much slower. I remember reading up to 10x slower, but I can't seem to find the reference right now.


You are correct. My initial response was a pedantic point about the semantic use of monopoly in this context, which isn't helpful.

I would love it if AMD would care more about GPGPU, but they don't, and NVIDIA has little incentive to make their OpenCL drivers equal to their CUDA ones.


clang now has a mostly-working CUDA frontend (disclaimer, I work on it). And it has an AMD GPU backend (whether this is in a good state I don't know). I don't expect that putting these pieces together would be a huge project.


Me too. I really want to see some benchmarks between CUDA code and OpenCL code generated from CUDA with AMD's compiler. Actually, if anyone has a GeForce/Tesla, get on this!


I haven't seen any recent benchmarks, but ones from 2011 all seemed to show CUDA and OpenCL on an even footing in terms of performance when optimised properly.[1][2] CUDA simply had better library support, and a more well-defined and uniform architecture to target. Whereas OpenCL is likely to require more programming to fill in the gaps for library support, and different optimisations depending on the architecture you wish to target. I'm guessing since then, the CUDA compiler may have improved somewhat in terms of optimisation, based on some micro-benchmarking research I was looking at the other day.

There's also Intel's MIC to consider now too, although that has a vastly different architecture to a GPU. Again, performance was similar between MIC and GPU in 2013[3], each performing better where its architecture was more suited; GPUs were capable of providing double the bandwidth for random-access data.

In terms of AMD vs NVIDIA, I've not looked into it, I doubt AMD has anything to really compete with NVIDIAs current GPU accelerated compute lines. However again there was always that distinction (re bitcoin?) that AMD cards have better integer arithmetic and NVIDIA better float arithmetic.

Disclaimer: I use CUDA in my research, never tried OpenCL.

[1] http://arxiv.org/abs/1005.2581

[2] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=604719...

[3] http://arxiv.org/abs/1311.0378


Would you mind trying the AMD compiler? http://gpuopen.com/compute-product/hip-convert-cuda-to-porta... I'd be interested in seeing a benchmark between some original CUDA code and the OpenCL output of this compiler on the same GPU.


It doesn't look like AMD's HIP effort compiles to OpenCL, but rather to C++ that works on specific AMD GPUs (Fiji R9 Fury X, R9 Fury, R9 Nano).

Also, much of the speed gains for ML on NVidia hardware come from cuDNN -- there is no equivalent for OpenCL or AMD hardware.


I don't think CUDA vs OpenCL is the real issue. It's the libraries that come with them. It matters little in which language they were written, since they are closed source for the most part.

AMD's Boltzmann initiative won't solve the lack of libraries.


What hardware could OpenCL even run on that would come remotely close to what this system has to offer?


OpenCL runs on Nvidia GPUs, so you could do an apples to apples comparison on this system.


Fair enough but how does that further the goal of having alternatives to Nvidia? The truth is the problem isn't cuda vs. opencl. There simply aren't good alternatives to the hardware Nvidia provides. If the hardware was there, you'd see people switch to OpenCL.


You can, but IIRC Nvidia actually compiles OpenCL to CUDA in the Driver.


The problem is that no one[1] uses OpenCL because the performance isn't there. There is little sign of that changing, either.


Other reasons are language support and tooling.

CUDA has had Fortran and C++ since day one, and thanks to PTX it was quite easy to add support for other languages.

Whereas OpenCL was stuck on the "C only" model from Khronos, which forced everyone to use C or generate C code and be constrained by the device drivers.

This has been seen as such a big issue that SPIR and C++ SPIR got introduced with OpenCL 2.0.

Another very important one is debugging support. Last time I checked, no one had visual tooling at the same level as NVidia's.


Costs $129,000 and needs 3.2 kilowatts to run.


3.2kW isn't that insane for a server; you can buy high-end desktop PSUs of 1.6kW (I'm running a 1200W one). If you are using multiple GPUs, a high-end CPU, 32-64GB of memory, and loads of storage, coupled with overclocking and the substantial cooling required, it's not that hard to get to around 1kW power consumption on a high-end gaming rig these days.


To me, $129k isn't surprising since it is only going to be bought by researchers with big budgets. Small-timers will still build 3x GTX980 systems for under $5k.

3.2 KILOwatts sounded insane to me, but I suppose you'll have your own server rack to put it in if you can afford to buy one of these.


3.2kW isn't that insane considering what you're getting out of it. A coffee pot is 1kW, a toaster is 1.2kW, an electric broiler is 3.6kW. Running costs would be a very tiny part of any budget. It ends up being $9.216/day assuming peak costs, peak usage, and 24h operation.
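For reference, that daily figure works out in Python assuming roughly $0.12/kWh (an assumed rate; it varies a lot by region):

  # Daily electricity cost at full 3.2kW draw, 24h/day
  kw = 3.2
  hours = 24
  dollars_per_kwh = 0.12               # assumed rate
  print(kw * hours * dollars_per_kwh)  # ~$9.22/day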


If that sounds insane, you're going to lose your mind when you realize how many KILOwatts your oven uses.

3.2KW is less than a dishwasher.


Are you sure your numbers are right? What kind of dishwasher do you have? And what kind of oven? For the US, at least, most dishwashers are well under 1600W, and few ovens exceed 3200W.

https://www.daftlogic.com/information-appliance-power-consum...


Yeah, 3.2 KW would mean the dishwasher's heating the dishes to high temperatures, so it'd be more of an oven than a dishwasher.


Dishwashers not only heat the water, they also heat up the entire compartment if you have the drying feature turned on. They are basically an oven.

You can cook in them as well: http://www.thekitchn.com/can-you-really-cook-salmon-in-a-dis...


Forget that, the HVAC unit for my home is rated at 24.6kW.


"To me, $129k isn't surprising since it is only going to be bought by researchers with big budgets" Yeah, this is essentially "Big chunk of a computational researcher's startup budget" or an infrastructure grant.


More detail on the GPUs in the system:

https://devblogs.nvidia.com/parallelforall/inside-pascal/


Have they published a copy of the video of the autonomous car trained with unsupervised learning from the keynote anywhere?

I'd love to show it to my father.


Note the P100 is 20 TFLOPS for half precision (16-bit). For general-purpose GPU computing (I use them for EM simulation) I assume one would want 32-bit, which is 10 TFLOPS. But it still looks much, much better for 64-bit computations than the previous generation.


Curious. Why do you post here when every other comment is random posturing?


They were touting 20 TFLOPS, but that's only for FP16, which isn't useful for many engineering computations that use the GPU. I can already hit 2 TFLOPS FP32 with two K20s. It's a nice improvement over what I have now, but nothing astronomical.


Wow, I didn't realize they were shipping HBM2 already. 720GB/s - with only 16GB of RAM, you can read it all in 22 milliseconds!
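A quick check of that in Python:

  # Time to stream the full 16GB of HBM2 at peak bandwidth
  gb = 16
  gb_per_s = 720
  print(gb / gb_per_s * 1000)  # ~22.2 ms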


They're not. All that was mentioned in the talk was that this chip is coming soon.


There's a big green "Order Now" button about 1/3 of the way down the page.


And no delivery date.


I have to wonder about Intel and their Xeon Phi range. Last I checked they were supposed to launch a follow-up late last year that never manifested. Now we're 4 months into 2016 and still no new Phis.

Couple that with the fact that they want you to use their compilers (extremely expensive), on a specialized system that can support the card, and you get a platform that nobody other than supercomputer companies can reasonably use. Meanwhile, any developer who wants to try something with CUDA can drop $200 on a GPU and go, then scale accordingly. I think Intel somewhat acknowledged this by having a fire sale on Phi cards and dev licenses last year, but it was only for a passively cooled model (which really only works well in servers, not workstations).

Intel should do this:

  - Offer a $200-400 XEON PHI CARD
  - Include whatever compiler needed to use it with the card
  - Make this easily buyable
  - Contribute ports of Cuda-based frameworks over to Xeon Phi
I feel like they could do this pretty easily. Even if it lost money, it's pennies compared to what they're going to lose if NVIDIA keeps trumping them on machine learning. They need to give devs the tooling and financial incentive to write something for Phi instead of CUDA; right now that completely doesn't exist, and frameworks basically use CUDA by default.

If you're AMD, do the same thing but replace the phrase Xeon Phi with Radeon/Firepro


$129k for this machine. In the keynote it's interesting that they mentioned the product line being: "Tesla M40 for hyperscale, K80 for multi-app HPC, P100 for scales very high, and DGX-1 for the early adopters".

The GP100/P100 with the 16nm process probably gives a considerable performance/power advantage over the Tesla... but this gives me the feeling that we may not see consumer or workstation-level Pascal boards for a while.


I was wondering about this too, the way they plugged old K80's at the end for non-deep-learning applications. Either they're clever about keeping multiple product lines alive (more profits!) or it's a big cop-out (they're hiding something about P100 that makes it a bad choice for GPGPU - maybe price?)


$129k seem extremely fair for what you get actually, in my experience.


Wait, how many chips did they cram in there that they're getting 170 TFLOPS? Even at a very generous 10 TFLOPS per chip, that would be 17 chips.


The NVIDIA Tesla P100 has 21 TeraFLOPS of FP16 performance by their numbers. So they've got 8 chips in there.


Yep, they showed a diagram of how it fits together: http://i.imgur.com/xk1daFG.jpg



I wish they made that information more accessible. I wasn't able to find it on the site, and it was all I really cared about.


Ah, half-floats. That explains it. Still pretty high but realistic at least.


It's also fun to contemplate that in about five years you'll likely be able to buy one of these on eBay for about $10K.


This announcement reminds me of the part of Outliers that spelled out how Bill Gates and others became who they are because they had access to very expensive equipment before anyone else did (and spent 10K hours on it).


How does this compare to some of the systems provided by cloud providers? Seems like requiring an on-site capability is a hurdle for integration if you already have your data on a cloud provider.

[1] https://aws.amazon.com/machine-learning/ [2] https://azure.microsoft.com/en-us/services/machine-learning/


I would argue that this box is probably targeted at cloud providers. The Nvidia GRID boards are similar--they're not for consumers, but for GPU/Gaming-as-a-service providers.


The unified memory architecture with the Pascal GP100 is pretty sweet. That will make it easier to work with large data sets.


It's good to see powerful machine learning hardware come out. Much of the progress in ML has come from hardware speedup. It will empower the next years of research.


I wonder how much faster the new Tesla P100 is compared to the Tesla K40 in training neural networks. The K40s were the best available GPUs for training deep neural networks.


Does anyone know if the Pascal architecture is built using stacked cores? Or is this one of those applications where thermal problems keep that technique from being used?


No, the Pascal GPU itself is not stacked. Die stacking makes almost no sense for processors.


Anyone have any idea of how the GPUs in this machine compare to the GPUs in their high end gaming products?


Teslas have more double-precision cores compared to gaming cards.


Looks like research in machine learning will only be done in huge corporations. You'll need an amount of funding comparable to the LHC.

Time to use better models like kernel ensembles; maybe they are not as accurate, but they are easier to train on a single CPU.


You can already do deep learning on cheap consumer hardware. And $100k is expensive, but it's nowhere near LHC levels.


This price point is extremely accessible to most major research universities as well.


Does that mean Pascal release is just around the corner?!?

-unreformed box builder


Any idea how much this costs?


The Nvidia slides had it at $129,000 a pop


Wow, cheap, great for startups!


As of last year, prices for general HPC resources were running around $3/GFLOP[1], or about $500,000 for 170 TFLOPS if my math is correct.
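The arithmetic, for reference (using that quoted $3/GFLOP figure):

  # Cost of 170 TFLOPS at ~$3 per GFLOPS of general HPC capacity
  gflops = 170000
  dollars_per_gflops = 3.0
  print(gflops * dollars_per_gflops)  # ~$510,000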

Sounds like this is a significant cost savings if it fits your use case.


Uh, using what hardware? The 980 Ti is about 11 TFLOP in half-precision (apples to apples). So 16x 980 Ti cards would take up twice as much rack space for $11k. Your estimate (and NVIDIA's pricing) is off by more than an order of magnitude...


A 980 Ti doesn't have FP16 hardware. The only Maxwell based component with such support is their Tegra part.


Isn't the ECC tax around 10x?


So, just do the computation twice and compare the results?

OK, so there's twice the power to pay for, but at a $129k acquisition cost per 3.2kW of consumption, it seems like you could run for tens of years before breaking even.


I'll take 5


I was hoping they would announce Pascal GTXs. Oh well. Computex, I guess.


What a peculiar pascaline.



