Very cool effort. That said (and it's probably because of the kind of work I do), I have almost never found the four challenges to be a problem for me. I do think there is something of a contradiction here, though. Plotting (exploratory data analysis ("EDA"), really) is all about distilling key insights and finding features hidden in data. But you have to have some kind of intuition about where the needle in the haystack is. IME, throwing up a ton of plots and being able to scrub around in them never seems to provide much insight. It's also not a fast process: usually the feedback loop is "make a plot, go away and think about it for an hour, decide what plot I need to make next, repeat". If there is too much data on the screen, it defeats the point of EDA a little bit.
For me, matplotlib still reigns supreme. Rather than a fancy new visualization framework, I'd love for matplotlib to just be improved (admittedly, fastplotlib covers a different set of needs than what matplotlib does... but the author named it what they named it, so they have invited comparison. ;-) ).
Two things, for me at least, would go a long way:
1) Better 3D plotting. It sucks, it's slow, it's basically unusable, although I do like how it looks most of the time. I mainly use PyVista now, but it sure would be nice to have the power of PyVista in a matplotlib subplot, with a style consistent with the rest of matplotlib.
2) Some kind of WYSIWYG editor that will let you propagate changes back into your plot easily. It's faster and easier to adjust your plot layout visually rather than in code. I'd love to be able to make a plot, open up a WYSIWYG editor, lay things out a bit, and have those changes propagate back to code so that I can save it for all time.
(If these features already exist I'll be ecstatic ;-) )
I have to agree with your point about EDA. The library is neat, but even the covariance-matrix animation example is a bit contrived.
Every pixel has a covariance with every other pixel, so sliding through the rows of the covariance matrix generates as many faces on the right as there are pixels in a photograph of a face. However, the pixels that strongly co-vary will produce very similar right-side "face" pictures. To get a sense of how many different behaviours there are, one would look for eigenvectors of this covariance matrix. And then 10 or so static eigenvectors of the covariance matrix (eigenfaces [1]) would be much more informative than thousands of animated faces displayed in the example.
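As a rough sketch of the eigenface idea (synthetic random data standing in for real face images; with real photographs you'd load and flatten them first):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a face dataset: 200 "images" of 16x16 pixels,
# each flattened to a 256-vector. Real eigenfaces use actual photographs.
X = rng.normal(size=(200, 256))
X -= X.mean(axis=0)                  # center each pixel across images

cov = np.cov(X, rowvar=False)        # 256x256 pixel-by-pixel covariance
evals, evecs = np.linalg.eigh(cov)   # symmetric matrix; ascending order

# The ~10 leading eigenvectors, reshaped back to 16x16, are the
# "eigenfaces": a compact static summary of the dominant co-variation.
top10 = evecs[:, ::-1][:, :10]
eigenfaces = top10.T.reshape(10, 16, 16)
print(eigenfaces.shape)  # (10, 16, 16)
```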
Sometimes a big interactive visualisation can be a sign of not having a concrete goal or not knowing how to properly summarise. After all, that's the purpose of a figure: to highlight insights, not to find ways to display the entire dataset. Pictures that try to display the whole dataset end up shifting the job of exploratory analysis into a visual space and leaving it for somebody else.
Hi, one of the other devs here. As the poster below pointed out, what you're missing is that in this case we know that an eigendecomposition or PCA will be useful. However, if you're working on matrix-decomposition algorithms like us, or if you're trying to design new forms of summary matrices because a covariance matrix isn't informative for your type of data, then these types of visualizations are useful. We broadly work on designing new matrix-decomposition algorithms, so it's very useful to look at the matrices and then try to determine what types of decompositions we want to do.
OK, different libraries have different use cases; the type of data we work with absolutely necessitates dynamic visualization. You wouldn't view a video with imshow, would you?
Every time I've needed to scrub through something in time like that, dumping a ton of frames to disk using imshow has been good enough. Usually, the limiting factor is how quickly I can generate a single frame.
It's hard for me to imagine what you're doing that necessitates such fancy tools, but I'm definitely interested to learn! My failure of imagination is just that.
The example from the article with the subtitle "Large-scale calcium imaging dataset with corresponding behavior and down-stream analysis" is a good example. We have brain imaging video that is acquired simultaneously with behavioral video data. It is absolutely essential to view the raw video at 30-60Hz.
Aren't you missing the entire point of exploratory data analysis? Eigenfaces are an example of what you can come up with as the end product of your data exploration, after you've tried many ways of looking at the data and determined that eigenfaces are useful.
Your whole third paragraph seems to be criticizing the core purpose of exploratory data analysis as though one should always be able to skip directly to the next phase of having a standardized representation. When entering a new problem domain, somebody needs to actually look at the data in a somewhat raw form. Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
> Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.
Yup, this is a good summary of the intent. We also have to remember that the eigenfaces dataset is a very clean/toy example. Real datasets never look this good, and going straight to an eigendecomposition or PCA isn't informative without first taking a look at things. Often you may want to do something other than an eigendecomp or PCA; get an idea of your data first, and then think about what to do with it.
Edit: the point of that example was to show that visually we can judge what the covariance matrix is producing in the "image space". Sometimes a covariance matrix isn't even the right type of statistic to compute from your data and interactively looking at your data in different ways can help.
As a whole, of course you have a point: big visualisations, done properly, should help with data exploration. However, in my experience they rarely (but not never) do. I think it's specific to the type of data you work with and the visualisation you employ. Let me give an example.
Imagine we have some big data, like an omics dataset about chromatin-modification differences between smokers and non-smokers. Genomes are large, so one way to visualise might be a manhattan plot (mentioned here in another comment). Let's (hypothetically) say the pattern in the data is that chromatin in the vicinity of genes related to membrane functioning has more open chromatin marks in smokers compared to non-smokers. A manhattan plot will not tell us that. To detect it in our visualisation, we would have had to already know what we were looking for in the first place.
My point in this example is the following: to detect that pattern, we would have to know what to visualise first (i.e. visualise the genes related to membrane function separately from the rest). But once we are looking for these kinds of associations, the visualisation becomes unnecessary. We can capture the comparison of interest with a single number (e.g. the average difference between smokers and non-smokers within this group of genes), and then test all kinds of associations by running a script with a for-loop that checks every group of genes we care about and returns a number for each. It's much faster than visualisation. And after this type of EDA is done, the picture would be produced as a result, displaying the effect and highlighting the insights.
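A toy sketch of that loop (entirely hypothetical data and gene groups, just to show the shape of the approach: one number per group instead of one giant plot):

```python
import numpy as np

rng = np.random.default_rng(2)
genes = [f"g{i}" for i in range(100)]
membrane = genes[:20]  # hypothetical annotation: membrane-function genes
groups = {"membrane": membrane,
          "metabolism": genes[20:50],
          "other": genes[50:]}

# Fake per-gene chromatin-openness scores, with the planted pattern:
# membrane genes are more open in smokers.
smokers = {g: rng.normal(loc=2.0 if g in membrane else 0.0) for g in genes}
nonsmokers = {g: rng.normal(loc=0.0) for g in genes}

# The for-loop over groups: one summary number per gene group.
scores = {name: float(np.mean([smokers[g] - nonsmokers[g] for g in members]))
          for name, members in groups.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
```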
I understand your point about visualisation being an inseparable part of EDA. But the example I provided above is much closer to my lived experience.
Yeah, I agree with the general sentiment of what you're saying.
Re: wtallis, I think my original complaint about EDA per se is indeed off the mark.
Certainly creating a 20x20 grid of live-updating GPU plots and visualizations is a form of EDA, but it seems to suggest a complete lack of intuition about the problem you're solving. Like you're just going spelunking in a data set to see what you can find... and that's all you've got; no hypothesis, no nothing. I think if you're able to form even the meagerest of hypotheses, you should be able to eliminate most of these visualizations and focus on something much, much simpler.
I guess this tool purports to eliminate some of this, but there is also a degree of time-wasting involved in setting up all these visualizations. If you do more thinking up front, you can zero in on a smaller and more targeted subset of experiments. Simpler EDA tools may suffice. If you can prove your point with a single line or scatter plot (or number?), that's really the best case scenario.
Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset. The idea in the comment above seems to be that it's more useful to combine some basic knowledge of statistics with simpler visualisation techniques, rather than to quickly generate thousands of shallower plots. Being able to generate thousands of plots is useful, of course, but I would agree that promoting good data-analysis culture is more beneficial.
> Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset
For a sufficiently narrow definition of "dataset", perhaps. I don't think it's the obvious step one when you want to start understanding a time-series dataset, for example. (A Fourier transform would be a more likely step two, after step one of actually looking at some of your data.)
For me, one of the most annoying things in my workflow is waiting for the software to catch up. If I'm making a plot, there are a lot of little tweaks I want to make to visually extract the maximum amount of information from a dataset. For example, if I'm making a histogram, I may want to adjust the number of bins, change to log scale, set min/max to remove outliers, and change the plot size on the page. For the sake of argument, let's say I'm working with a set of 8 slices of the dataset, so I need to regenerate 8 plots every time I make a tweak. My workflow is: code the initial plots with default settings, run numpy to process the data, run matplotlib to display the data, look at the results, make tweaks to the code, and circle back to step 2. In that cycle, "wait for matplotlib to finish generating the plots" can often be one of the longest parts, and critically it's the vast majority of the cumulative time where I'm waiting rather than actively doing something. Drawing plots should be near instantaneous; there's an entire industry devoted to drawing complicated graphics in 16 ms or less. I shouldn't need to wait >100 ms for a single 2D grid with some dots and lines on it.
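For concreteness, the kind of tweak loop described above might look like this sketch (synthetic data; the constants at the top are the hypothetical knobs that change on every iteration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend; we only save to disk
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
slices = [rng.lognormal(size=5000) for _ in range(8)]  # stand-in data

# The knobs that get tweaked on every iteration of the loop:
BINS, LOG, CLIP = 50, True, (0.0, 10.0)

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for ax, data in zip(axes.flat, slices):
    data = data[(data >= CLIP[0]) & (data <= CLIP[1])]  # drop outliers
    ax.hist(data, bins=BINS, log=LOG)
fig.savefig("slices.png", dpi=100)
```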
Matplotlib is okay, but there's definitely room for improvement, so why not go for that improvement?
I think this varies a lot depending on what you're doing.
I agree 100% that matplotlib is really slow and should be made to run as fast as humanly possible. I would add a (3) to my list above: optimize matplotlib!
OTOH, at least for what I'm doing, the code that runs to generate the data that gets plotted dominates the runtime 99% of the time.
For me, adjusting plots is usually the time waster. Hence point (2) above. I'd love to be able to make the tweaks using a WYSIWYG editor and have my plotting script dynamically updated. The bins, the log scale, the font, the dpi, etc, etc.
I think with your 8-slices example above, my (2) and (3) would cover your bases. In your view, is the rest of matplotlib really so bad that it needs to be burnt to the ground for progress to be made?
Yeah, I'd love it if mpl could be optimized. I do think that it has a lot of weird design decisions that could justify burning it down and starting from scratch (e.g. weird mix of stateful and stateless api), but I've already learned most of its common quirks so I selfishly don't care anymore, and my only significant complaint is that I want it to be faster :)
edit: regarding runtime, I'm sure this varies a lot based on usecase, but for my usual usecase I store a mostly-processed dataset, so the additional processing before drawing the data is usually minimal.
I'd be curious to hear more about your EDA workflow.
What I want for EDA is a tool that lets me quickly toggle between common views of the dataset. I run through the same analyses over and over again; I don't want to type the same commands repeatedly. I have my own heuristics for which views I want, and I want a platform that lets me write functions that express those heuristics. I want to build the intelligence into the tool instead of having to remember a bunch of commands to type on each dataframe.
For manipulating the plot, I want a low-code UI that lets me point and click the operations I want to use to transform the dataframe. The low-code UI should also emit Python code to do the same operations (so you aren't tied to the low-code system; you just use it as a faster way to generate code than typing).
I have built the start of this for my open source datatable UX called Buckaroo. But it's for tables, not for plotting. The approach could be adapted to plotting. Happy to collaborate.
I, at least, usually prefer to do EDA plotting by writing and editing code. It's a lot more flexible, and it's relatively rare to need interactivity other than zooming and panning.
The differing approaches can probably be seen in some API choices, although the fastplotlib API is a lot more ergonomic than many others. Having to index the figure or prefix plots with add_ are minor things, and probably preferable for application development, but for fast-iteration EDA they will start to irritate fast. The "pylab"-style (pyplot) API of matplotlib violates all sorts of software-development principles, but it's very convenient for exploratory use.
Matplotlib's performance (especially with interaction and animation) and clunky interaction APIs are definite pain points, and a faster library with better interaction support would be very welcome for EDA. Something like a pylab-style wrapper would probably be easy to implement for fastplotlib.
And to bikeshed a bit: I don't love the default black background. It's against usual conventions, difficult for publication, and a bit harder to read when you're used to white.
Writing and editing code is a lot more flexible, but it gets repetitive, and I have written the same stuff so many times. It's all ad hoc; it fixes the problem at the time, then it gets thrown away with the notebook, only to be written again soon.
As an example, I frequently want to run analytics on a dataframe. More complex summary stats. So you write a couple of functions, and have two for loops, iterating over columns and functions. This works for a bit. It's easy to add functions to the list. Then a function throws an error, and you're trying to figure out where you are in two nested for loops.
Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. You could pass the existing dict of computed measures so you can reuse that expensive calculation... Now you have to worry about the ordering of functions.
So you could put all of your measures into one big function, but that isn't reusable. So you write your big function over and over.
I built a small dag library that handles this, and lets you specify that your analysis requires keys and provides keys, then the DAG of functions is ordered for you.
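Not the commenter's actual library, but the requires/provides idea can be sketched in a few lines with the stdlib's graphlib (Python 3.9+):

```python
from graphlib import TopologicalSorter  # stdlib topological ordering

# Hypothetical mini-DAG runner: each step declares which keys it needs
# and which it provides; steps are then run in dependency order.
def run_dag(steps, store):
    # steps: {name: (requires, provides, fn)}; fn(store) -> dict of new keys
    provider = {k: name for name, (_, prov, _) in steps.items() for k in prov}
    graph = {name: {provider[k] for k in req if k in provider}
             for name, (req, _, _) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        store.update(steps[name][2](store))
    return store

steps = {
    "precalc": ((), ("sums",), lambda s: {"sums": sum(s["data"])}),
    "mean":    (("sums",), ("mean",),
                lambda s: {"mean": s["sums"] / len(s["data"])}),
}
out = run_dag(steps, {"data": [1.0, 2.0, 3.0]})
print(out["mean"])  # 2.0
```

The expensive pre-calc runs exactly once, and adding a new measure is just adding an entry to `steps`; the ordering problem disappears.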
I work with R and not python, so some things might not apply, but this:
> [...] it fixes the problem at the time, then it gets thrown away with the notebook only to be written again soon.
Is one of the reasons I stopped using notebooks.
One solution to your problem might be to create a simple executable script that, when called on the file of your dataset in a shell, would produce the visualisation you need. If it's an interactive visualisation then I would create a library or otherwise a re-usable piece of code that can be sourced. It takes some time but ends up saving more time in the end.
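A minimal sketch of such a script (hypothetical file names; assumes pandas and matplotlib, with `df.plot` standing in for whatever view you actually need):

```python
#!/usr/bin/env python3
# Hypothetical reusable EDA script: `python plot_dataset.py data.csv out.png`
# regenerates the same visualisation from a shell, no notebook required.
import sys

import matplotlib
matplotlib.use("Agg")  # headless: render straight to a file
import matplotlib.pyplot as plt
import pandas as pd

def main(infile, outfile):
    df = pd.read_csv(infile)
    fig, ax = plt.subplots()
    df.plot(ax=ax)  # placeholder: swap in the view you always rebuild
    fig.savefig(outfile)

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```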
If you have custom-made things you have to check on your data tables, then likely no library will solve your problem without you doing some additional work on top.
And for these:
> Or, especially for pandas, you want to separate functions to depend on the same expensive pre-calc. [...] Now you have to worry about the ordering of functions.
I save expensive outputs to intermediate files, and manage dependencies with a very simple build-system called redo [1][2].
For larger datasets, real scripts are a better idea. I expect my stuff to work with datasets up to about 1 GB; caching is easy to layer on and would speed up work for larger datasets, but my code assumes the data fits in memory. It would be easier to add caching than to make sure I never load an entire dataset into memory. (I don't serialize the entire dataframe to the browser, though.)
Usually I write scripts that memoize expensive operations to an on-disk cache. Recently I've also sometimes used Marimo, which has great support for modules (no reloading hacks), can memoize to disk, and has deterministic state.
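A stdlib-only sketch of that pattern (joblib.Memory does the same thing more robustly); the cache keys on the function name plus its pickled arguments:

```python
import functools, hashlib, os, pickle, tempfile

CACHE_DIR = tempfile.mkdtemp()  # a real script would use a fixed path

def disk_memoize(cache_dir):
    """Cache a function's return value to disk, keyed on its arguments."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):           # cache hit: skip the work
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(*args, **kwargs)
            with open(path, "wb") as f:        # cache miss: compute + store
                pickle.dump(result, f)
            return result
        return wrapper
    return deco

calls = []

@disk_memoize(CACHE_DIR)
def expensive(n):
    calls.append(n)                # track real invocations
    return sum(i * i for i in range(n))

expensive(1000)
expensive(1000)                    # second call hits the disk cache
print(len(calls))                  # 1 -- the body ran only once
```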
I agree with you sfpotter, very interesting. Looks in some ways similar to PyQtGraph regarding real time plotting.
I agree with you regarding matplotlib, although I find a lot of faults/frustrations in using it. Both your points on 3D plotting and a WYSIWYG editor would be extremely nice, and as far as I know nothing exists in Python ticking these boxes. For 3D I typically default to Matlab, as I've found it to be the most responsive and easy to use. I've not found anything quite like a WYSIWYG editor. Stata is the closest, but I deplore it; R has it to some extent, but if I'm generating multiple plots it doesn't always work out.
I'm surprised by what you said about EDA. I find the opposite: a shotgun approach, exploring a vast number of plots with various stratifications, gives me better insight. I've explored plotting across multiple languages (R, Python, Julia, Stata) and not found one that meets all my needs.
The biggest issue I often face is that I have 1000 plots I want to generate that are all from separate data groups and could all be plotted in parallel, but most plotting libraries have holds/issues with distribution and parallelization. The closest I've found: I'll often build up a plot in Python using a Jupyter notebook. Once I'm done, I'll create a function taking all the needed data and saving a plot out, then either manually or with the help of LLMs convert it to Julia, which I've found to be much faster at loading and processing large amounts of data. Then I can loop it using Julia's "Distributed" package. It's less than ideal (threaded access would be great, rather than having to distribute the data), but I've yet to find something that works. I'd love a simple 2D EDA plotting library that has basic plots like lines, histograms (1D/2D), scatter plots, etc., has basic colorings and alpha values, and can handle large amounts of static data (thousands to millions of points), plotting it to disk in parallel. I've debated writing my own library, but I have other priorities currently; maybe once I finish my PhD.
For point (2), have you tried the perspective-viewer library? You can make edits in the UI and then use the "debug view" to copy and paste the new configuration back into your code.
I work on solving 3D problems: numerical methods for PDEs in R^3, computational geometry, computational mechanics, graphics, etc. Being able to make nice 3D plots is super important for this. I agree it's not always necessary, and when a 2D plot suffices, that's the way to go, but that doesn't obviate my need for 3D plots.
3D plots might be neat if there were some widespread way of displaying them. Unfortunately, we can only make 2D projections of 3D plots on our computer screens and pieces of paper.