Hosting files in S3 is much, much cheaper than running a live instance of PostgreSQL. Incomparably so. We do it specifically to move large, mostly cold, read-only data out of the main RDBMS and into S3 in a way that it can still be queried on-the-fly* without having to be fully downloaded and reconstituted on a server first. Works great and we get to pay S3 prices for storage and nothing for compute. As a kicker, because it's S3 we can access it from any availability zone without paying for cross-AZ data transfer. For the size and coldness of this data, the juice is worth the squeeze. None of the downsides matter for our use case at all.
* Using indexes and only touching the pages you need. This is specifically much better than S3 Select, which we considered as an alternative.
On the S3 side, you need to factor in Intelligent Tiering. We're not paying S3 Standard prices for this--recall that this is a mix of mostly cold data. It's a 10x difference per GB on average between EBS and our mix of S3 Intelligent Tiering.
Add in that your RDS instance needs to be high availability like S3 is (and like our RDBMS is). That means a multi-AZ deployment. Multiply your RDS cost by two, including the cost of storage. That still isn't as good as S3 (double price gets you a passive failover partner in RDS PostgreSQL; S3 is three active-active AZs), but it's the best you can do with RDS. We're north of $500/month now for your example.
Add in the cost of backups, because your RDS database's EBS volume doesn't have S3's durability. For durability you need to store a copy of your data in S3 anyway.
Add in that you can access S3 without cross-AZ data transfer fees, but your RDS instance has to live in a single AZ. Cross-AZ traffic to reach it runs $0.01/GB in each direction, $0.02/GB both ways.
Add in the personnel cost when your RDS volume runs out of disk space because you weren't monitoring it. S3 never runs out of space and never requires maintenance.
$500/month for 1TB of cold data? We were never going to pay that. I won't disclose the actual size of the data, but it's a bunch bigger than 1TB. We host an on-prem database cluster for the majority of things that need an RDBMS, specifically because of how expensive RDS is. Things probably look different for a startup with no data yet, blowing free AWS credits to bootstrap quickly, but we are a mature, data-heavy company paying our own AWS bills.
As a final summary to this rant, AWS bills are death by a thousand papercuts, and cost optimization is often a matter of removing the papercuts one by one. I'm the guy that looks at Cost Explorer at our company. One $500/month charge doesn't necessarily break the bank but if you take that approach with everything, your AWS bill could crush you.
Indeed, our RDBMS is on-prem; we'd never actually use RDS. This data's pages were hogging the cache, degrading performance for other, unrelated tables, and SAN storage isn't particularly cheap or flexible. We wanted to get it out of our on-prem RDBMS. If you're over a hundred billion rows, it's time to think about whether your data belongs there. Maybe it does, maybe it doesn't. This data had an alternative and I took it.
> If your app runs on cloud, too, then you already are paying for the cloud compute so you can just fire up an VM and install Postgres on that.
This part doesn't make sense. If your app is on the cloud, you're paying for the cloud compute for the app servers. "Firing up a VM" for PostgreSQL isn't suddenly free.
You can absolutely do all of those things, but there is an intrinsic cost for someone to configure, manage, and monitor them. SQLite will (generally speaking) have far less management overhead because of its relatively limited surface area (e.g., there is no database service).
Absolutely. I’ll go further and say that you must specifically design the database schema knowing that it’s going to be used this way. Your pages need to be packed full with the data you need and nothing you don’t. Spatial locality matters big time, since “seeks” are so expensive (each one is an additional S3 request), whereas in a traditional DB it matters much less. Wide tables with a lot of columns that might not be used in a query are a bad idea here.
Here’s an in-the-weeds tip for anyone attempting the same: your tables should all be WITHOUT ROWID tables. Otherwise SQLite sprays rows all over the place based on its internal rowids, ruining locality when you attempt to read rows that you thought would be consecutive based on the primary key.
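To make that concrete, here's roughly the shape of such a table; the names are illustrative, not our actual schema:

    -- Hypothetical narrow table, clustered by its composite primary key.
    -- WITHOUT ROWID stores rows in primary-key order, so all the readings
    -- for one sensor sit on consecutive pages in the file.
    CREATE TABLE readings (
        sensor_id INTEGER NOT NULL,
        ts        INTEGER NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (sensor_id, ts)
    ) WITHOUT ROWID;

    -- A range lookup for one sensor now touches only a handful of pages,
    -- which means only a handful of S3 range requests.
    SELECT ts, value
    FROM readings
    WHERE sensor_id = 42 AND ts BETWEEN 1609459200 AND 1612137600;

Every page a query touches is another request to S3, so packing the rows you read together onto the same pages is the whole game.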
A few days ago, I tried to use the linked library (sql.js-httpvfs) for a graph network visualization, which went about as well as you'd expect given the poor spatial locality. Do you have any tips for optimizing spatial locality with more complex queries? Can you manually cluster data for some given properties in SQLite?
For my project I ended up just exporting the graph edges as JSON, but I'm curious if it would still be possible to make work.
In a WITHOUT ROWID table, you have control over the order of the rows. Make an id column your primary key, and assign the ids so that rows accessed together end up next to each other in the file. This is how I manually cluster the data.
Aside from that, I use an entity-attribute-value model. This ensures that all the tables are narrow. Set your primary key (again, with WITHOUT ROWID) to put all the values for the same attribute next to each other. That way, when you query for a particular attribute, you'll get pages packed full with nothing but that attribute's values in the order of the IDs for the corresponding entities (which you manually clustered).
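Sketched out, again with made-up names rather than our real schema:

    -- Hypothetical EAV table. The clustering key is (attribute, entity_id),
    -- so each page holds one attribute's values in entity-id order.
    CREATE TABLE eav (
        attribute TEXT    NOT NULL,
        entity_id INTEGER NOT NULL,
        value     BLOB,
        PRIMARY KEY (attribute, entity_id)
    ) WITHOUT ROWID;

    -- Reading one attribute for a contiguous range of entities hits rows
    -- that are physically adjacent, so very few pages (S3 requests) are read.
    SELECT entity_id, value
    FROM eav
    WHERE attribute = 'color' AND entity_id BETWEEN 1000 AND 2000;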
It's worth repeating one more time: you must use WITHOUT ROWID. SQLite tables otherwise don't work the way you'd expect from experience with other DBs; the "primary" key is really a secondary key if you don't use WITHOUT ROWID.
Thanks for the info! In my case there's not really one primary key that would guarantee good clustering for my query, so I guess there's not much that can be done to optimize here.
I'm trying to find all the ancestors of a node in a DAG, so the optimal clustering would vary depending on the node I'm querying for.
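For reference, the query is essentially a recursive walk up an edge table, something like this (the schema here is illustrative, not my actual one):

    -- Hypothetical edge table: one row per child -> parent edge.
    CREATE TABLE edges (
        child  INTEGER NOT NULL,
        parent INTEGER NOT NULL,
        PRIMARY KEY (child, parent)
    ) WITHOUT ROWID;

    -- All ancestors of node 42. UNION deduplicates, so shared ancestors
    -- are only expanded once.
    WITH RECURSIVE ancestors(id) AS (
        SELECT parent FROM edges WHERE child = 42
        UNION
        SELECT e.parent FROM edges e JOIN ancestors a ON e.child = a.id
    )
    SELECT id FROM ancestors;

Clustering by child at least keeps each node's direct parents on the same page, but every level of the recursion still jumps somewhere else in the file, which is why the locality never gets great.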
> I’ll go further and say that you must specifically design the database schema knowing that it’s going to be used this way.
If it ever turns out, at some point in the future, that you do need features from a standard RDBMS after all, you are going to regret not using Postgres in the first place, because re-engineering all of that is going to be vastly more expensive than what it would have cost to just "do it right" from the start.
So it seems that Cloud SQLite is basically a hyper-optimization that only makes sense if you are completely, totally, 100% certain beyond any reasonable doubt that you will never need anything more than that.
I can’t reveal too many internal details but this data lived in the RDBMS for years. Its access patterns are well understood. That’s exactly when you start cost optimizing. If this didn’t work out we’d just move back to the old DB schema that we were already using and pay for a bigger server. If we wanted, we could keep the schema as-is and just move it into SQL Server. That would work just fine, too. No re-engineering required.
Don’t know how else to say “we were not born yesterday; we thought of that” politely here. This definitely isn’t something to have your junior devs work on, nor is it appropriate for most DB usage, but that’s different than it not having any use. It’s a relatively straightforward solution to a niche problem.
If you had mentioned Athena, maybe I could see how this follows from the earlier comment, but as it stands your use case doesn't really overlap with why people are building cloud SQL products.
I built a SQLite VFS module just like the one linked here, and this is what I use it for in production. My use case obviously does not preclude other people’s use cases. It’s one of many.
GP asked whether this is a toy and what the upsides might be. I answered both questions with an example of my production usage and what we get out of it.
A SQLite VFS module isn't a cloud SQL product; my comment refers to companies like fly.io and MotherDuck that are actually selling "embedded database in the cloud".
It's entirely predicated on developer experience; otherwise there's no reason to specifically reach for an embedded database (in-process doesn't mean anything when the only thing the process is doing is running your DB).
Okay, sure. I’m satisfied that I answered GP’s questions about this VFS module. This HN post is about the VFS module, posted by a notable SQLite extension developer (nalgeon).
The ability to use an index to seek directly to a handful of consecutive rows without processing the whole file was very important for our use case. Athena doesn't support indexing like this; it only has partitioning on a single column. It has to scan whole partitions every time. Both S3 Select and Athena are more useful when you want to aggregate massive data sets, but that's not what we're doing. We want to jump in and pull out rows from the middle of big data sets with reasonably low latency, not aggregate the whole thing.
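To sketch the access pattern (names are illustrative, not our schema):

    -- Hypothetical table clustered by its primary key.
    CREATE TABLE events (
        stream_id INTEGER NOT NULL,
        seq       INTEGER NOT NULL,
        payload   BLOB,
        PRIMARY KEY (stream_id, seq)
    ) WITHOUT ROWID;

    -- Pull a few consecutive rows out of the middle of a huge table.
    -- The query plan is a SEARCH on the primary key, not a SCAN: SQLite
    -- descends the b-tree and reads only the leaf pages it needs, i.e. a
    -- handful of S3 range requests instead of touching the whole file.
    EXPLAIN QUERY PLAN
    SELECT seq, payload
    FROM events
    WHERE stream_id = 1234 AND seq BETWEEN 500 AND 520;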