> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
It's actually worse than that. What I see is that when companies gain the ability to store and "analyze" large amounts of data, their appetite for data tends to increase, so they take in as much of it as they can find. More often than not, the quality of that data is mixed at best. Frequently it's horrible, and because the focus is on data acquisition rather than data quality, nobody notices the bad data, the missing data, and the duplicate data.
The result: even if you manage to come up with decent hypotheses, you can't trust the data on which you test them.
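The frustrating part is how little it takes to notice. A minimal sketch of the kind of audit that would surface these problems, assuming the ingested records live in a pandas DataFrame (the file and column names here are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical ingested dataset; file and column names are assumptions.
df = pd.read_csv("ingested_events.csv")

# Missing data: fraction of nulls per column.
null_rates = df.isna().mean().sort_values(ascending=False)
print("Null rate per column:")
print(null_rates[null_rates > 0])

# Duplicate data: exact duplicate rows, which silently inflate
# counts and distort any statistics computed downstream.
dup_count = df.duplicated().sum()
print(f"Exact duplicate rows: {dup_count} of {len(df)}")

# Bad data: crude range checks on fields where valid bounds are known.
# 'latency_ms' is a hypothetical column; substitute real fields.
if "latency_ms" in df.columns:
    bad = df[(df["latency_ms"] < 0) | (df["latency_ms"] > 3_600_000)]
    print(f"Out-of-range latency values: {len(bad)}")
```

A few lines like these, run at ingestion time, catch most of the rot. The problem isn't technical difficulty; it's that nobody is asked to run them.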
> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
This is not necessarily a bad thing. Take the domain of application performance management. You're collecting hundreds of thousands of metrics from all over the place: OS, network, middleware, end user. Occasionally there's a performance problem that is non-obvious; you go through the obvious metrics and find nothing. At that point it's a great thing to throw all this data at some algorithm and have it come back with "metrics X, Y, and Z look related". That gives you hypotheses you would probably never have thought of on your own, and a direct way of verifying each one. For example: it turns out there are two disks in this cluster, one running at 100% and the other at 0%, so the overall utilization only shows 50% and never looked like a problem. Investigate. One disk has compression enabled and the other doesn't; turn it off, and the application runs fast again.
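A minimal sketch of the idea, assuming the metrics are aligned time series in a pandas DataFrame. Plain Pearson correlation is a crude stand-in for whatever a real APM product does internally, and the file and column names are hypothetical, but it shows how "throw the data at an algorithm" can generate candidate hypotheses:

```python
import pandas as pd

# Hypothetical frame: one row per timestamp, one column per metric,
# plus the symptom we're trying to explain (e.g. response time).
metrics = pd.read_csv("metrics.csv", index_col="timestamp")
symptom = metrics.pop("app_response_time_ms")

# Rank every other metric by absolute correlation with the symptom.
# Correlation is a blunt instrument, but it's enough to surface
# "metrics X, Y, and Z look related" for a human to go verify.
scores = metrics.corrwith(symptom).abs().sort_values(ascending=False)
print(scores.head(10))  # candidate hypotheses, not conclusions
```

The key point is the last comment: the algorithm's output is a list of things to go check by hand, not an answer. That verification step is what keeps this on the right side of the white-noise problem in the quote.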