Ask HN: Tell me about modern databases
25 points by thom on Sept 10, 2016 | 16 comments
Day after day, I sit here crafting data for some vague analytics with Postgres, close to despair. All I want is for the query planner to take a view, go away for five or ten minutes, and then run it optimally, or at least suggest some indexes, but this isn't the way it seems to want to allocate resources.

I suppose I've always been slightly suspicious of modern database technology beyond stodgy stuff like Oracle, SQL Server, MySQL and PostgreSQL, but I'm at the point where I would like to know what my options are.

I don't think my workload is particularly outlandish - 35m events with around 100 attributes, timestamped, some GIS data. Some typical workloads include things like sessionization, grouping events together, and asking questions like "how long after events of type X did an event of type Y happen?", "what was the last event in session Z?", "how long is the path from event A to B to C?" On top of that, all sorts of aggregates get computed over events. Training data gets extracted and predictions fed back in via bits of Clojure or R code.
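To make that concrete, the "how long after X did Y happen" questions usually end up as window queries along these lines (the events table and column names here are just illustrative):

    -- time from each type-X event to the next type-Y event in the same session;
    -- the MIN ... FILTER over a forward-looking frame picks out the next Y
    SELECT session_id, event_time, next_y_time - event_time AS time_to_y
    FROM (
        SELECT session_id, event_time, event_type,
               MIN(event_time) FILTER (WHERE event_type = 'Y') OVER (
                   PARTITION BY session_id
                   ORDER BY event_time
                   ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
               ) AS next_y_time
        FROM events
        WHERE event_type IN ('X', 'Y')
    ) t
    WHERE event_type = 'X';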

I am reasonably au fait with modern SQL, I write a lot of window functions, I avoid some common pitfalls, I can generally index my way out of most performance bottlenecks, but my word it feels like a slog some days. So, I suppose my question is this: what am I missing? What fancy new technology exists that can take complex queries, optimise them and give me answers? I don't care about scaling to millions of requests, I don't care about latency, I care about how much time I'm losing looking through EXPLAIN ANALYZE outputs instead of doing real work.



Dell Software's Toad Query Optimizer (for Oracle and SQL Server) has an engine that does exactly that. It takes your SQL, tries rewriting it a bunch of different ways, throws in different compiler hints, and measures the response time/reads/CPU/etc for each execution, then gives you a set of graphs and recommendations.

I wanna say it was around $1500/seat last time I looked at it. It's worth every penny if you have to write a lot of complex queries. I've bought it for just one senior member of a dev team, and everybody brings their worst queries to that person for tuning.

Here's the trick, though: after it optimizes a query for you, keep your original query around. You're not going to want to try to edit the query that Toad produces - it can be horrifically complex. Instead, edit your original query, then run it through Toad again.


I will look into it, sounds like the kind of magic I'm after, and I'm not against hopping between platforms to save future pain. Thanks!


Sounds like a mistake to only have one member of the team have it. Why does this member get special treatment?


It sounds to me that while one license could be totally worth it, getting it for every developer would not be.


Why does it have to be a special treatment?


By definition, this person gets special treatment. I don't know why it has to be.


It seems like the first question to ask is, where is the query planner falling down? What are you finding in those EXPLAIN ANALYZEs that is wrong? When you say "that isn't the way it seems to want to allocate resources," what does that mean - do you mean you want it to auto-create indexes? SQL Server used to have a tool to suggest indexes based on a query workload, is that what you mean? Have you used https://explain.depesz.com/ ?
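(And if you're pasting plans into that site, it's worth including buffer stats as well as timings — something along these lines, where the query itself is just a placeholder for whatever you're tuning:)

    -- per-node timings plus buffer hit/read counts
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM events WHERE event_type = 'X';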


Yes, I spend a large part of my day looking at EXPLAIN output, often on that site. The SQL Server Database Engine Tuning Advisor is reasonably good, and something like that for Postgres would be great, but ultimately it doesn't seem to look at a query and come up with a semantically identical but structurally better way of phrasing it - it will never fix unnecessary big-O performance issues (in the way that, for example, posting a single query on Stackoverflow might). But even if such a tool exists for Postgres, I'd be running it _all the time_. I'm just naively wondering if technology has moved on. I have basically no writes outside an ETL process, much of the time this is effectively single-user ad-hoc stuff. I realise I might be being hopelessly naive, but it feels like I can clearly specify a query, and I have complete control over my schema. Surely something must exist that lets me pay more money but care less about endless performance tweaks just to create a view that is fast?
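To be clear about what I mean by "structurally better": rewrites like the one below (schema made up for the example), where a human or a Stackoverflow answer swaps a per-row subquery for a single sorted pass, but the planner never will:

    -- naive: one correlated subquery probe per session
    SELECT s.session_id,
           (SELECT e.event_id
            FROM events e
            WHERE e.session_id = s.session_id
            ORDER BY e.event_time DESC
            LIMIT 1) AS last_event
    FROM sessions s;

    -- same answer (assuming every session has events), but one pass over events
    SELECT DISTINCT ON (session_id) session_id, event_id AS last_event
    FROM events
    ORDER BY session_id, event_time DESC;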


So your queries are very diverse in their nature in an unpredictable way? I mean I'd assume you'd more or less know the big-Os for the stuff you're doing and have a vague idea if there's really room for improvement? At just 35m records I'd assume you're purely CPU-bound with vanilla high-order polynomial queries. Maybe what you really want is to rethink what you want the db to do. Perhaps you can take a bit more of a soft computing approach and settle for imprecise answers.
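For instance, if approximate aggregates are acceptable, sampling a fraction of the table and scaling up may already be good enough (Postgres 9.5+; the numbers are illustrative):

    -- rough count of type-X events from a 1% block sample, scaled back up
    SELECT count(*) * 100 AS approx_x_events
    FROM events TABLESAMPLE SYSTEM (1)
    WHERE event_type = 'X';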


If you're doing sufficiently focused sorts of queries - the question mentioned "find the path from A to B to C" - then you might get value out of a DB specifically focused on that question (a graph DB, or a time-series one). The Postgres FDW stuff may help you make integrating one of those less painful.
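The wiring for that generally looks like the sketch below — shown with postgres_fdw as a stand-in, since a graph or time-series engine would need its own FDW extension, and all the names here are invented:

    -- make a remote data source queryable as local tables
    CREATE EXTENSION postgres_fdw;
    CREATE SERVER events_remote FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remote-host', dbname 'analytics');
    CREATE USER MAPPING FOR CURRENT_USER SERVER events_remote
        OPTIONS (user 'remote_user', password 'secret');
    CREATE SCHEMA remote;
    IMPORT FOREIGN SCHEMA public FROM SERVER events_remote INTO remote;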

Before you do that, though: have you thrown RAM at the thing? 35m events sounds like a lot to humans, but not when you can rent machines with 1.2TB of RAM by the hour...
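Even before renting the big box, it's worth checking the obvious memory knobs are sized for the machine you already have — the values below are purely illustrative, not recommendations:

    -- per-session bump for a big analytical query (memory per sort/hash node)
    SET work_mem = '1GB';
    -- instance-wide; shared_buffers needs a restart, effective_cache_size a reload
    ALTER SYSTEM SET shared_buffers = '64GB';
    ALTER SYSTEM SET effective_cache_size = '200GB';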


I was more thinking of spatial stuff when I talked about paths, and PostGIS is a pretty productive platform for me so far. Example calculations would be dividing the geodesic distance moved across a series of events by the total length of the path to calculate a score for 'directness'. PostGIS has aggregates to create lines from points, measure the total length, etc.
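Roughly this shape, in case it helps picture the workload (geom is a point column; the names are illustrative):

    -- 'directness' per session: geodesic distance from first to last point
    -- divided by the geodesic length of the full path (1.0 = perfectly direct)
    WITH paths AS (
        SELECT session_id,
               ST_MakeLine(geom ORDER BY event_time) AS path
        FROM events
        GROUP BY session_id
    )
    SELECT session_id,
           ST_Distance(ST_StartPoint(path)::geography,
                       ST_EndPoint(path)::geography)
             / NULLIF(ST_Length(path::geography), 0) AS directness
    FROM paths;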


I'd suggest the little red book (Stonebraker's, not Mao's):

http://www.redbook.io/


You're missing something like this: https://www.monetdb.org/


I seem to be the exact target market for this sort of thing, thank you! The SQL implementation seems reasonably modern, and while the GIS features are a little rudimentary there's at least something to work with.

My worry about column stores is that I can't guarantee my workload is optimised for that paradigm either - imagine a window function with multiple partition criteria etc etc, but I suppose it's worth testing.
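The kind of thing I'd want to benchmark first would be queries along these lines (columns made up), to see how a column store copes with several windows over one scan:

    -- two windows with different partition/order criteria over the same scan
    SELECT event_id,
           ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time) AS step_in_session,
           AVG(metric_value) OVER (PARTITION BY user_id, event_type
                                   ORDER BY event_time
                                   ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) AS rolling_avg
    FROM events;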


SQL is horrible for many use cases. Have a look at http://datomic.com


Is Datalog naturally fast for certain types of queries, or do performance gains for Datomic over RDBMSs come from being able to scale up a cluster? I don't have a feel for what Datomic's sweet-spot is, despite having much love for Clojure and its ecosystem.



