Under the Hood of Google’s TPU2 Machine Learning Clusters (nextplatform.com)
127 points by Katydid on May 22, 2017 | hide | past | favorite | 37 comments


Some serious tea-leaf reading going on here.

Regarding the power consumption of the TPU2 racks and whether or not Google will continue to use them: as long as they're more power efficient than GPUs, of course they will continue to be used. That is most likely the only reason the TPU exists in the first place. Running the same workload on GPUs would consume a few megawatts where the TPUs consume 500 kW. And Google has lots of workloads like that, so the savings must be enormous.

But GPUs are a moving target and even if for now the TPU seems to have the edge on power efficiency that could easily change.

Another thought: if you were to build systolic array processors of a certain fixed size, it would make very good sense to figure out a way to daisy-chain them so that multiple units can be combined into larger arrays.

That may very well be the function of that chip in the center, some kind of crossbar to allow for easy linkage of single TPUs into larger fabrics.


The other reason (maybe the dominant reason) to prefer a TPU is to avoid paying for all the GPU functionality that is irrelevant to these workloads (FP32, FP64, graphics-specific functionality, costs of writing graphics drivers that get bundled into hardware unit prices...)

If I look at a GTX 1080 Ti and optimistically assume that I can keep it completely busy 24/7, operating at the 250 watt TDP, it would take 4.5 years for the retail electricity cost to match the initial purchase cost of $699. (I pay a bit under 7 cents/kWh). Now I do live in one of the cheapest-electricity regions of the US, but I would also assume that big data center operators are building facilities where electricity is cheap. And big customers get lower rates than households. I wouldn't be surprised if a GPU cluster's hardware is considered obsolete before its electricity costs match its initial hardware costs.
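The arithmetic behind that payback estimate, for anyone who wants to plug in their own electricity rate (numbers taken from the comment above; your rate will differ):

```python
# Back-of-envelope payback check: $699 card, 250 W TDP at full
# utilization 24/7, retail electricity at $0.07/kWh.
card_price = 699.00            # GTX 1080 Ti retail, USD
tdp_kw = 0.250                 # 250 W TDP
rate_per_kwh = 0.07            # USD/kWh, cheap-region retail rate
hours_per_year = 24 * 365

annual_electricity = tdp_kw * hours_per_year * rate_per_kwh  # ~$153/year
payback_years = card_price / annual_electricity              # ~4.6 years
print(round(payback_years, 1))
```

At data-center bulk rates the payback period gets even longer, which supports the point that the hardware goes obsolete before electricity catches up to purchase price.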


>The other reason (maybe the dominant reason) to prefer a TPU is to avoid paying for all the GPU functionality that is irrelevant to these workloads (FP32, FP64, graphics-specific functionality, costs of writing graphics drivers that get bundled into hardware unit prices...)

Often bundling decreases cost. Bundling also has the additional advantage of producing more demand when selling excess capacity a la AWS.


Not in this case.

Google wants a die that is as efficient as possible. All of those features take up die space, creating a larger processor. That increases the cost of defects, as well as overall manufacturing costs.

For Amazon, you may be right, but for Google, they only want a specific function.


Hopefully, for customers, it also means not having to deal with proprietary Linux drivers and associated headaches.


You know that Nvidia produces special cards for compute work?


> costs of writing graphics drivers that get bundled into hardware unit prices

If Nvidia is smart, they consider only marginal, not fixed costs when pricing their products.


If Nvidia is smart, costs have nothing to do with prices except as a lower bound.


Let Q(p) be quantity demanded at price p, and c be marginal cost. Then profit is Q(p)*(p-c).

If e.g. Q(p) = 1/p^2 (price elasticity of demand of -2), then profit is 1/p - c/p^2, which is maximized at p = 2c.

Note how optimal price depends on cost.
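A quick numerical check of that closed-form result, as a plain-Python grid search:

```python
# With demand Q(p) = 1/p^2 (price elasticity of -2) and marginal cost c,
# profit is Q(p) * (p - c) = (p - c) / p^2, which should peak at p = 2c.
c = 10.0

def profit(p):
    return (p - c) / p ** 2

prices = [c + i * 0.01 for i in range(1, 5001)]  # grid from just above c
best = max(prices, key=profit)
print(best)  # ≈ 20.0, i.e. 2 * c
```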


I would add that another reason to develop your own deep learning hardware, if you are a big player like Google, is to prevent Nvidia from completely owning and dominating this segment.

Failing that, let's hope AMD's Vega GPUs don't completely suck.


When the author made the mental connection from the matching label colors on the TPU2 front-panel connections, the nerd in me did a little jig. I LOVE the level of detail and authority this article conveys.

Despite my general server/hardware comprehension being on the low end of intermediate, this article was extremely digestible and helped close some gaps for me about Google's strategy and how/where to think about deploying cloud ML in the future. Love seeing content like this on HN.


That article reads just like a SIGINT intelligence report on some foreign country's new radar system, inducing capabilities, limitations, and specifications from clues visible in photographs.


I think you mean deduce or infer


You make a good point. I meant inductive in the sense that the author of this article really is operating like an intelligence analyst, working from available but limited information and drawing conclusions like "the white cables are probably for management" (judging from the number of them and the type of connectors seen on the ends). It's a statement of probability, based on what we can see, what we know from experience ought to be present, and what else we don't see. More interesting is the author's prediction that maps Xeon cores 1:1 with TPU chips (again, by counting boards in the CPU racks on the ends); that one is a combination of inductive and deductive logic.

The article is a great piece of analysis.


Another potential source of friction I've wondered about in getting research workloads running on TPU2 is the heavy reliance of most popular DL frameworks on CUDA and cuDNN; I don't understand how one could port those libraries to a drastically different architecture such as this. Some DL frameworks have made overtures to OpenCL, but it has not typically been a top priority [0].

Could this instead imply Google is working on its own library of low-level math primitives just for Tensorflow, and if so, how long would that take before it's competitive performance-wise? At any rate, having to support a different computing platform / API would be another blocker to general adoption.

[0] https://github.com/tensorflow/tensorflow/issues/22


Would this not be solved by TensorFlow XLA?

https://www.tensorflow.org/performance/xla/


Per my understanding of XLA, it provides the right high-level abstraction for compiling the TensorFlow computation graph for a range of architectures (CPU, NVIDIA GPU, mobile, etc.), but it still delegates to a lower-level domain-specific API like CUDA.

This undertaking isn't something to be taken lightly; cuBLAS has some pretty cutting-edge architecture-specific optimizations for batching matrix multiplication operations that came from several years of research, and is arguably a massive competitive advantage of NVIDIA over AMD. Depending on the development state of such an API internally within Google, it could mean that the Cloud TPU isn't going to be ready for widespread commercial use for a good while, and is very much still in the research-and-development phase (which could explain why they're only opening it up to the research community right now).


From what I gather from AMD's recent AMA, the key is getting the hardware and software guys [working together](https://lists.freedesktop.org/archives/dri-devel/2016-Decemb...). For AMD the issue is that they can't afford to have hardware guys dedicated to helping optimize the software, since those people need to build the next and next-next generation of hardware. Google, on the other hand, surely has the pockets to get the man-hours and knowledge to (a) figure out what kind of software is needed and (b) design from the hardware level up to meet their software needs.

And if you caught any of their talks on [Tensorflow at Google IO](https://www.youtube.com/watch?v=5DknTFbcGVM), it seems like their goal is to provide ultra-high-level APIs like POST/GET, and high-level Python ones like TensorFlow's pre-built and trained models. At that point Google can control the engineering, and as long as you run on their cloud platform you just have to worry about the high level. I'm definitely not a fan of vendor lock-in, but it seems like an interesting product, and I'm curious to see what Google does in this space.


You could specify the computation in a high-level API like TensorFlow, and then have the framework pick the best implementation available (e.g., cuDNN for GPU, MKL for CPU, something custom for TPU).
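As a toy sketch of that dispatch idea (all names here are invented for illustration; this is not TensorFlow's actual mechanism):

```python
# Map (op, backend) pairs to kernel implementations; a real framework
# would register cuDNN-backed kernels under "gpu", MKL under "cpu", and
# something custom under "tpu".
KERNELS = {}

def register(op, backend):
    def wrap(fn):
        KERNELS[(op, backend)] = fn
        return fn
    return wrap

@register("matmul", "cpu")
def matmul_cpu(a, b):
    # naive reference implementation standing in for an MKL call
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def run(op, backend, *args):
    return KERNELS[(op, backend)](*args)

print(run("matmul", "cpu", [[1, 2]], [[3], [4]]))  # [[11]]
```

The point is that the graph-level op stays the same while the kernel table grows per backend, which is roughly the seam where a TPU-specific library would plug in.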


"something custom for TPU" is what seems interesting here. Researchers working on the TF research cloud probably wouldn't learn much about the sorts of workloads this architecture is suited to, if all they have access to is the high-level API. Would whatever iteration of the TPU API they have internally be something they'd release soon, at least to partners?


So Google still won't reveal what kind of floating point the TPU2 actually runs. That is really, really strange. I cannot help but suspect it's some sort of non-standard floating point, but that would be easy to clear up, no?


I'd use one of the debugged/tested open-source designs. It's so easy to make a mistake with the complicated specifications of IEEE standards.


Pretty sure it's a minifloat of some sort.


Back in the day it was fairly common not to be compliant with IEEE 754 in order to cut down gate count: giving wrong results on denormals, being fixed to one rounding mode, etc. I bet they're doing something like that.


Stochastic rounding is also important for doing these algorithms in very low precision. Without stochastic rounding, even very small rounding errors accumulate enough to destroy the accuracy.


Stochastic rounding shouldn't be entirely necessary for floating point though, no?


How does floating point change anything? You still only have so many bits of precision, and the lower bits need to be rounded. In fact floating point has a lot less precision, since many of the bits are needed to store the exponent.


The range of the exponent. Stochastic rounding is intended to prevent bias toward zero, which paralyzes learning, but with floating point this is less of an issue since you only round the mantissa.


The issue is when you add a very small number to a much larger number. This occurs in neural networks when accumulating gradients during learning. Many of the gradients are very small, and when added repeatedly they get rounded away. There was a paper studying the effect, and I think it was below about 14 bits of precision that accuracy diminished badly. But with stochastic rounding they could get down to very few bits.
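A tiny simulation of that effect, using a made-up quantization step rather than any real TPU format: round-to-nearest silently drops increments smaller than half a step, while stochastic rounding preserves them in expectation.

```python
import random

Q = 1 / 2 ** 8        # step of a hypothetical low-precision format
V = Q / 10            # a "gradient" much smaller than one step

def round_nearest(x):
    return Q * round(x / Q)

def round_stochastic(x):
    lo = Q * (x // Q)                    # round down...
    frac = (x - lo) / Q
    return lo + Q if random.random() < frac else lo  # ...or up, proportionally

random.seed(0)
acc_n = acc_s = 0.0
for _ in range(10_000):
    acc_n = round_nearest(acc_n + V)
    acc_s = round_stochastic(acc_s + V)

print(acc_n)  # 0.0 — every single update rounded back down
print(acc_s)  # ≈ 10_000 * V ≈ 3.9 — correct in expectation
```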


All of this is fantastic if you are a deep learning Jedi Master. But there's a very small supply of those sorts.

I really worry about all these reduced floating-point representations when they're used by people who mostly understand deep learning through tinkering with existing TensorFlow tutorials.

FP32 seems like a relative sweet spot with sufficient dynamic range to let most amateurs avoid getting trapped in the weeds. I could probably be persuaded to believe that FP24 is sufficient as well.

But I suspect that once you get down to 16 bits or so you have to do all sorts of stochastic rounding / dithering that is beyond the skill set of most data scientists. And that's because I suspect many of them don't really get dynamic range.

Which then leads me to believe that what we really need here is the equivalent of R for machine learning.
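On the dynamic-range point: Python's struct module can round-trip values through IEEE half precision (format code 'e'), which makes the FP16 ceiling easy to demonstrate. This is just a generic FP16 illustration, not a claim about any particular accelerator's format.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE binary16."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))     # 65504.0 — the largest finite fp16 value
try:
    to_fp16(300.0 * 300.0)  # 90000 is beyond fp16's range entirely
except OverflowError:
    print("300 * 300 overflows fp16")
```

Activations or gradient sums in the tens of thousands are not exotic, which is exactly the kind of surprise that trips up people who haven't internalized dynamic range.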


Range.


Maybe a paper is forthcoming? At least Jeff Dean has confirmed that it's some kind of FP, not e.g. decimal arithmetic. :-)


Some details of this analysis seem off to me. The pink QSFP cables are more likely to be PCIe than Omni-Path. The orange/yellow/green cables are likely to be some kind of network (torus? hypercube?) but I doubt Google would use IBM IP when they could design their own.


Certain Google employees won't shut up about Clos networking, but I donno if it applies here: https://en.wikipedia.org/wiki/Clos_network


Their first (more in-depth) post on the TPU2:

https://www.nextplatform.com/2017/05/17/first-depth-look-goo...


If Google builds 90 of these, then they sort of have the first exaflop supercomputer. The supercomputer rankings are based on how fast machines run some standard matrix math tests, though, and the TPU may not be general purpose enough to run them.
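For scale, taking the roughly 11.5 petaflops per TPU2 pod that Google announced at face value:

```python
# 90 pods at Google's stated ~11.5 petaflops each
pod_pflops = 11.5
pods = 90
total_pflops = pods * pod_pflops
print(total_pflops / 1000)  # ≈ 1.0 exaflop (1.035)
```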


If anyone from the web team at Next Platform is reading this: please, please consider fixing the weirdness of browsing your site on iPad (and maybe this affects other tablets too).

In portrait mode, the text scrolls off the side of the screen so I have to continually pan back and forth to read an article. In landscape mode, a very thin column of text is used. It's as if the landscape res media query is being applied to portrait view and vice versa.

Edit - oh hmm, it looks like manually zooming out a bit in portrait will fit the whole article. I was sure that didn't use to be the case in earlier days. Maybe it got fixed!



