
If you do that in your paper, you better write next to the graph that you did that.


Perhaps I expressed myself poorly and left room for misunderstanding, because I cannot possibly imagine that we have any real disagreement on how to compute quartiles.

Any set of numbers I give you, you can compute quartiles for it. There is no algorithm for doing that that breaks down if the numbers don't follow a normal distribution.
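To make the point above concrete, here is a minimal sketch of computing quartiles by sorting and interpolating between neighbouring ranks. The `quantile` helper and the sample data are illustrative assumptions, not anything from the thread, and linear interpolation is only one of several conventions in use.

```python
import math

def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) by sorting and linear interpolation."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Fractional position of the requested quantile in the sorted list.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    # Interpolate between the two neighbouring ranks.
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Heavily skewed, clearly non-normal data (made up for illustration).
data = [1, 1, 2, 2, 3, 4, 8, 20, 100]
q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
print(q1, median, q3)  # 2.0 3.0 8.0
```

Nothing in the procedure refers to a mean, a standard deviation, or a distributional assumption; it only needs the values to be sortable.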


Look at this SVG from wikipedia: https://upload.wikimedia.org/wikipedia/commons/1/1a/Boxplot_...

When you calculate the box plot using normal distribution parameters, the outliers are outside the outer bracket.

If you split the dataset into 4 equal parts, the bracket will be larger because the outliers are still inside it.

The methodologies are not equal.

This thread is the first time I have heard of people doing the "split dataset into 4 quarters" approach and using that for box plots.


For what it's worth, you've convinced me that my beloved box plots need to be explained if I want to use them again.

The SVG you've provided clearly shows that the box plot splits the data in 4. The interquartile range (IQR) is clearly marked, and it even has a comparison for what the standard deviation (sigma) measure would be.

Secondly, if the data truly came from a normal distribution, there are no outliers. Outliers are data points which cannot be explained by the model and need to be removed. Unless you have a good reason to exclude the data points, they should be included. This is why I like the IQR and the median: they are not swayed by a few wide-valued data points. The 1.5*IQR rejection filter, I think, is lazy and unjustified. Happy to discuss this point further, as it is a bugbear of mine.


When I said "splitting", I meant it like my parent explained: basically sorting your dataset and then splitting it into quarters.

What you want to explain to me (IMHO to the wrong person) is the correct approach of calculating a mean and standard deviation and drawing the box from that. Let's stay with that (and that's what I said earlier in the thread).

After I wrote the post you replied to, I realized that the pure "splitting" method for box plots is nonsensical, since the outer brackets' interval is determined by the two most extreme values. They are too random to be meaningful. It does not make sense to draw a box plot from that.


The quartiles are defined by doing the sorting and splitting algorithm. So if you want quartiles (or any other quantile generally) you need to calculate it that way. The mean and standard deviation (sigma) are fundamentally different, which is why the image you linked shows them to contrast against the quantiles.

If you want to represent the standard deviation with your box plot, you can calculate it using standard formulas; many maths libraries have them built in. I don't know how to plot it using any graphing package though. ggplot, plotly and MATLAB (the ones I have experience with) all use the quantiles. Perhaps wherever you learned to read them as mean and standard deviation has a reference you could use?

> They are too random to be meaningful. It does not make sense to draw a box plot from that.

This can be a problem. In practice, the distributions I see don't go too crazy and are bounded (production rates can't be negative and can't be infinite). I prefer to use the 10th and 90th percentiles, which are well defined and better behaved for most distributions. I do make sure it's very clearly marked on each plot though, as it's not standard. Using the 1.5 x IQR cutoff is no better, though: once you have enough samples, you find that the whiskers just travel out to the cutoff.
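The three whisker conventions discussed here (min/max, the 1.5 x IQR fences, and the 10th/90th percentiles) can be compared side by side. This is a rough sketch: the `whiskers` function, its convention names, and the sample data are hypothetical, and quartiles are taken from the standard library's `statistics.quantiles` with `method="inclusive"`, which is only one of several interpolation rules.

```python
import statistics

def whiskers(values, convention="tukey"):
    """Return (low, high) whisker ends under one of three conventions."""
    ordered = sorted(values)
    if convention == "minmax":
        # Whiskers reach the two most extreme observations.
        return ordered[0], ordered[-1]
    if convention == "p10_p90":
        # Whiskers sit at the 10th and 90th percentiles.
        deciles = statistics.quantiles(ordered, n=10, method="inclusive")
        return deciles[0], deciles[-1]
    if convention == "tukey":
        # Whiskers reach the most extreme points within 1.5 * IQR of the box.
        q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
        iqr = q3 - q1
        low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        inside = [v for v in ordered if low_fence <= v <= high_fence]
        return inside[0], inside[-1]
    raise ValueError(f"unknown convention: {convention}")

# Made-up skewed data for illustration.
data = [1, 1, 2, 2, 3, 4, 8, 20, 100]
for name in ("minmax", "tukey", "p10_p90"):
    print(name, whiskers(data, name))
# minmax (1, 100)
# tukey (1, 8)
# p10_p90 (1.0, 36.0)
```

The same data gives three quite different whisker spans, which is why the convention needs to be stated on the plot.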


As I'm sure you know, there are a lot of variations on how quantiles are calculated in various software. The 25th percentile, e.g., doesn't always line up with a value in the dataset, so sometimes nearest rank methods are used, otherwise a linearly interpolated data point, where interpolation is done in various ways.

In any event, none of these methods assume normality, or rely on CDFs of a normal curve.

If they did, every box plot would be symmetric.

The fact that some people think boxplots are constructed in such a way is a pretty good reason to take the author's article seriously on the question of how confusing boxplots are.
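As a small illustration of the variation in quantile conventions mentioned above, Python's standard library exposes two interpolation rules through `statistics.quantiles`, and a nearest-rank rule is easy to write by hand. The `nearest_rank` helper and the data are assumptions made for this sketch; other packages offer further variants.

```python
import math
import statistics

def nearest_rank(values, p):
    """Nearest-rank percentile: always returns a value from the dataset."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)                 # [2.75, 5.5, 8.25]
print(inclusive)                 # [3.25, 5.5, 7.75]
print(nearest_rank(data, 0.25))  # 3
```

All three agree on the ordering of the data and differ only in how they place the cut between neighbouring ranks; none of them refers to a normal curve.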


As a first pass definition it does well to explain the concept. Even if you're interpolating you will need to rank the samples and find the two nearest neighbours to interpolate between.

It serves to distance it from the moment-based statistics like mean and variance at least.


Arguing that nobody who might be professionally expected to look at a box plot can be reasonably expected to understand how box plots are defined doesn't make a compelling case that using them is a good idea.


It is actually a fascinating argument, because it shows how little of what is being decided is based on actual data (or at least our understanding of it). Instead, data visualization is being used to push pre-approved decisions, with the data serving merely as a 'for' argument.

I agree that if there is an indication that most professionals don't really know what a boxplot is supposed to communicate, maybe it should not be used.


If the method by which the boxes are calculated is not clear (this thread references at least two different methods), you'll need to explicitly write down which method you used.


> this thread references at least two different methods

No, as the sidethread comment notes, there is only one way you can compute quartiles. You seem to be arguing that the correct thing to do is to impute them, and that calculating them is such a deviant practice that it would need to be specially remarked on.


Isn't this what I was saying from the beginning?

  Box plots are made for visualizing generalized normal distributions and nothing else.

And now people in this thread argue you can calculate them from something else. Not sure if you are replying to the right post.


That might be what you were saying from the beginning, but the only thing that that would establish is that you're completely out of touch with reality. Box plots are made for visualizing quartiles.

Your theory would imply, among other things, that the median line going through the box part of a box plot always divides it in half, which obviously is not the case.
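The difference between computing quartiles and imputing them from a fitted normal distribution can be sketched as follows. The data are made up, and fitting via the sample mean and standard deviation is an assumption about what "imputing" would look like in practice; `statistics.NormalDist` is from the Python standard library.

```python
import statistics

# Made-up skewed data for illustration.
data = [1, 1, 2, 2, 3, 4, 8, 20, 100]

# Quartiles computed directly from the sorted data.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Quartiles imputed from a normal distribution with the same mean and sigma.
fitted = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
imp_q1, imp_median, imp_q3 = fitted.quantiles(n=4)

# Distance from the median line to each edge of the box.
print("computed:", median - q1, q3 - median)  # computed: 1.0 5.0
print("imputed: ", imp_median - imp_q1, imp_q3 - imp_median)
```

The imputed box is symmetric about its centre line by construction, while the computed box is not, which matches the observation that real box plots often have an off-centre median.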


No? Exponential Gaussian?

Whatever you do, you should first explain what you did, so that your whiskers stay meaningful and are not just whatever randomness your outliers produced.



