> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
It's actually worse than that. What I see is that when companies gain the ability to store and "analyze" large amounts of data, their appetite for data tends to increase, so they take in as much of it as they can find. More often than not, the quality of that data is mixed at best. Frequently it's horrible, and because the focus is on data acquisition rather than data quality, nobody notices the bad data, the missing data, and the duplicate data.
The result: even if you manage to come up with decent hypotheses, you can't trust the data on which you test them.
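The frustrating part is how little it takes to notice. A minimal sketch of the kind of audit that would surface these problems, assuming the ingested records live in a pandas DataFrame (the file and column names here are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical ingested dataset; file and column names are assumptions.
df = pd.read_csv("ingested_events.csv")

# Missing data: fraction of nulls per column.
null_rates = df.isna().mean().sort_values(ascending=False)
print("Null rate per column:")
print(null_rates[null_rates > 0])

# Duplicate data: exact duplicate rows, which silently inflate
# counts and distort any statistics computed downstream.
dup_count = df.duplicated().sum()
print(f"Exact duplicate rows: {dup_count} of {len(df)}")

# Bad data: crude range checks on fields where valid bounds are known.
# 'latency_ms' is a hypothetical column; substitute real fields.
if "latency_ms" in df.columns:
    bad = df[(df["latency_ms"] < 0) | (df["latency_ms"] > 3_600_000)]
    print(f"Out-of-range latency values: {len(bad)}")
```

A few lines like these, run at ingestion time, catch most of the rot. The problem isn't technical difficulty; it's that nobody is asked to run them.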
> When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
This is not necessarily a bad thing. Take the domain of application performance management. You're collecting hundreds of thousands of metrics from all over the place: OS, network, middleware, end user. Occasionally there's a performance problem that is non-obvious; you go through the obvious metrics and find nothing. At that point it's a great thing to throw all this data at some algorithm and have it come back with "metrics X, Y, and Z look related". That gives you hypotheses you would probably never have thought of on your own, and a direct way of verifying each one. For example: it turns out there are two disks in this cluster, one running at 100% and the other at 0%, so the overall utilization only shows 50% and never looked like a problem. Investigate. One disk has compression enabled and the other doesn't; turn it off, and the application runs fast again.
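A minimal sketch of the idea, assuming the metrics are aligned time series in a pandas DataFrame. Plain Pearson correlation is a crude stand-in for whatever a real APM product does internally, and the file and column names are hypothetical, but it shows how "throw the data at an algorithm" can generate candidate hypotheses:

```python
import pandas as pd

# Hypothetical frame: one row per timestamp, one column per metric,
# plus the symptom we're trying to explain (e.g. response time).
metrics = pd.read_csv("metrics.csv", index_col="timestamp")
symptom = metrics.pop("app_response_time_ms")

# Rank every other metric by absolute correlation with the symptom.
# Correlation is a blunt instrument, but it's enough to surface
# "metrics X, Y, and Z look related" for a human to go verify.
scores = metrics.corrwith(symptom).abs().sort_values(ascending=False)
print(scores.head(10))  # candidate hypotheses, not conclusions
```

The key point is the last comment: the algorithm's output is a list of things to go check by hand, not an answer. That verification step is what keeps this on the right side of the white-noise problem in the quote.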