Everyone seems to want to compare against MongoDB, but when I think NoSQL I think about Google Cloud Datastore and Amazon DynamoDB. Databases which are fully hosted, infinitely scalable, zero-maintenance, transactional, reliable, and - at least with Google's offering - scale down to a free tier. They aren't perfect or applicable in every situation, but they're cheap and easy enough to allow a one- or two-programmer team to achieve massive scale without hiring devops.
MemSQL is so expensive you have to call for a quote.
If it's just "the ability to scale" then sure, SQL is back on the menu. MemSQL, Spanner, CockroachDB are leading the charge and that's great! But you have to pay a pretty penny for it (in TCO). There's still a lot of value in a cheap fire-and-forget scalable database, and there are not currently any SQL options there.
Regarding schemaless: I think this is a divide between people who work in dynamic languages vs people who work in statically typed languages. Schemaless databases are just fine in languages like Java; your classes define the schema with enough rigidity to keep you out of trouble. They wouldn't be my choice for Javascript though, that's for sure.
I don’t understand the “call to get a quote” sales cycle. I bet MemSQL has a sales team the size of their engineering team trying to land those fat Enterprise deals, but in the grand scheme of things, why should I care?
It doesn’t have a free tier, I can’t use it easily, and it just seems like a lot of hype without any 3rd party to back up their claims. It’s the kind of thing MongoDB used to say to gain mindshare.
Most popular databases like MySQL and Postgres gained their mindshare because they were open source and anyone could verify they did what they said. Only a couple of closed-source databases have won big, and behind them were huge monoliths like Microsoft and Oracle with armies of sales teams.
I’m not saying MemSQL will die. They are probably quite profitable. I work for an analytics company with a custom database. It’s just that the average Joe doesn’t get much from it. The cloud-based DBs get the average Joe very far.
> I don’t understand the “call to get a quote” sales cycle.
It's about maximizing revenue from enterprise clients - when the customer LTV is huge but the volume is low, it's smart to focus your resources on those dollar amounts. When the volume is high but the dollar amounts are low, it's smart to focus on optimizing the volume funnel.
More specifically, it's about price discrimination.
If you publish a price, you need to publish a price that's going to be palatable to your smallest clients. But the moment you do so, your most deep-pocketed clients will demand the same price. This could be disastrous, because those same clients also tend to be more demanding, so a profitable price for one client could be a disastrous money-loser for another.
If you're the kind of person who needs to know the price up-front, it's probably safe to interpret this as a way of saying, "If you have to ask, this isn't for you." At the very least, it's probably the case that you have a limited appetite for products where success requires significant assistance from support engineers to help you deal with all the critical details that typically aren't in the documentation for these kinds of products.
If you're having to hide ALL pricing data, you're doing it wrong.
What's the difference between the small-business customer and the enterprise one? I'd expect it boils down to three big things:
* Quantities of resources, where you usually expect the per-unit price to go down for the big enterprise customer. The 5-million-request-per-month plan is usually cheaper per-request than the 100k-requests-per-month one.
* SLAs, which can be done as tiered add-ons.
* Support services tiers, which can be graduated sensibly.
I'll grant that there's a place for a "super-whale" tier where you're probably buying custom infrastructure solely to support that client, but there's room for that within a pricing page without scaring off a large number of customers. "For plans over n billion queries per month, please contact sales" means you can still make a reasonable direct sale for everyone else.
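To make the graduated-tiers idea concrete, here's a hedged sketch (all numbers and names invented) of how a published price can still give the big customer a lower per-unit rate: each tier's rate applies only to the usage that falls inside that tier.

```typescript
// Hypothetical graduated pricing: each tier's rate applies only to
// usage inside that tier, so big customers pay less per request.
interface Tier { upTo: number; perRequest: number }

const tiers: Tier[] = [
  { upTo: 100_000, perRequest: 0.0010 },   // first 100k requests
  { upTo: 5_000_000, perRequest: 0.0004 }, // next ~4.9M requests
  { upTo: Infinity, perRequest: 0.0001 },  // everything beyond
];

function monthlyPrice(requests: number): number {
  let total = 0;
  let prevCap = 0;
  for (const { upTo, perRequest } of tiers) {
    const inTier = Math.min(requests, upTo) - prevCap;
    if (inTier <= 0) break;
    total += inTier * perRequest;
    prevCap = upTo;
  }
  return total;
}
```

With this shape, the 100k-request customer pays the full top rate while the 5M-request customer's blended per-request price comes out far lower, without anyone having to call sales.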
Even within the context of "we're a galactic enterprise ready to write a blank check", many firms will still want to buy a small off-the-shelf package as part of the validation process.
This could be to give the developers a sandbox to ensure it meets their expectations, or just to migrate a small portion of their systems before diving in head-first.
> allow a one- or two-programmer team to achieve massive scale without hiring devops
Know what isn't massive scale? Subjecting yourself to the off switch of a single company. So in that light, AWS-specific or Google-specific options are the worst scale you could ever have.
A meteor could also hit your datacenter, cosmic rays can flip bits in your RAM, a person with malicious intent could detonate an EMP next to your servers, etc etc etc.
Probability is not binary. The probability that you will want to move your app, distribute it across clouds, or have it turned off on you is much higher than any of the things you mentioned. And this applies to the contents of your data, not just the interfaces, since using NoSQL databases often means relying on distribution in lieu of backups.
It's not that hard to run a few Cassandra machines, or if you can accept relational, RDS or other managed Postgres options are fine. It's just that this vendor lock-in stuff gets really annoying to refactor/migrate, IME; if your time to market is that important, use a cloud-provided RDBMS that will let you stay portable. I'd rather go from PG to Cassandra than from Google's data store to AWS's data store to whatever other provider's data store du jour.
While you're busy building your platform-agnostic wondertrap with Cassandra nodes and Docker images and Chef scripts and monitoring and all the other bells and whistles, I'm sitting down at my keyboard and implementing customer-facing features.
The probability that you will want to cross-cloud distribute your app is infinitesimally small compared to the probability that you will never get the right set of features to make your product take off. Platform portability is fake work that makes developers feel productive about doing something that their customers couldn't give a rat's ass about.
There's a reason Pivotal doesn't let you point chores - your velocity is determined only by how fast you make the customer happy. Focus on developing features, and use whatever tools get you to product-market fit fastest.
> While you're busy building your platform-agnostic wondertrap with Cassandra nodes and Docker images and Chef scripts and monitoring and all the other bells and whistles, I'm sitting down at my keyboard and implementing customer-facing features.
I'm not sure if you're responding to a different comment or missed where I suggested just using the cloud Postgres. I specifically mentioned time to market and acknowledged it. I am saying you don't have to sacrifice portability for time to market, and suggesting that you do is terrible advice.
I love Postgres, I really do. And Google's Cloud Postgres is my "default" axe.
It will not easily give you scale. It will not give you the ability to "not think about your database". You will not be able to go fishing (or wherever) with confidence that your app will just keep running without you.
I find there are two sweet spots for GCD (and probably DynamoDB, although I haven't used it in anger):
1) Persistence for really small and/or simple systems. Services that need a tiny bit of persistent state and really could do without the fuss of setting up databases and running migrations.
2) Persistence for big systems that need absolute rock solid availability, and can relax (or shunt off) analytical needs. Postgres is always one poorly-thought-out migration script away from downtime, and how many organizations test their migration scripts against full copies of the production database? Even Stripe has database-driven downtime, with their armies of developers and devops. Your two-person startup hasn't got a chance.
For everything else (and I'll be the first person to say that includes the majority of web applications!) there's Postgres.
I am an AWS true believer but I hate DynamoDB. It is far more limiting than MongoDB.
MongoDB works great with C#. Using a JSON-based database with a statically typed, object-oriented language goes hand in hand: you don’t have the object-relational mismatch, and the language’s type system helps enforce a “schema” on the database.
With C# you work with a MongoCollection<T>, and all of your inserts, LINQ queries, etc. are statically checked by the compiler.
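A rough sketch of that idea (rendered in TypeScript here rather than C#, with made-up names, but the shape is the same): the collection is parameterized over the document type, so inserts and query predicates are checked against it at compile time.

```typescript
// A hypothetical typed collection: the compiler rejects documents and
// query predicates that don't match the declared document shape.
interface Person { name: string; age: number }

class Collection<T> {
  private docs: T[] = [];
  insert(doc: T): void { this.docs.push(doc); }
  find(predicate: (doc: T) => boolean): T[] {
    return this.docs.filter(predicate);
  }
}

const people = new Collection<Person>();
people.insert({ name: 'Ada', age: 36 });
// people.insert({ name: 'Bob' }); // compile error: missing `age`
const adults = people.find(p => p.age >= 18);
```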
I could see that. But I can’t tell from briefly perusing the docs: does TypeScript give you transpile-time checking of the objects it will allow you to put in Mongo?
Using a dynamically typed language with a schemaless database isn’t something I would want to do for a large project.
For example, I tend to use Mongoose as the interface to Mongo, with which you're required to set a Schema for each collection of documents. You can easily set every single field type to `Mixed` (analogous to `any`, `auto`, etc), but if you're the least bit disciplined you can set any fields to any variation of:
* String
* Number
* Date
* Buffer
* Boolean
* Mixed
* ObjectId
* Array
* Decimal128
* Map
And there are TypeScript typings supplied for Mongoose, so I typically recreate the models as TS interfaces for wherever I use them in other parts of my code. It's extremely intuitive and quick to develop. It's a tad redundant defining the Schema types, and then a TS interface, but I usually don't mind. I include them in the same file for coherence so that I'm always importing from the same module for anything relating to the model.
† NOTE: there are a couple of quirks. Mongoose uses constructors to define types whereas TypeScript types are a different symbol understood only by the compiler— so in this case they use different aliases of the same object...
So when I move to make any queries via the ODM, I import and use the associated types. The types won't allow you to reference fields that don't exist on the schema that way. You can force it by cheating the compiler and casting any arbitrary non-existent field to `<any>`, but I try to reserve that for hungover POC cases. It does suggest immutability of the object received from the query, but again— there are many ways to ignore that if you wanted to, or just didn't care.
Example ODM query:
  /* MyController.ts */
  import * as mongoose from 'mongoose';
  import {
    MyCollection,
    IMySchemaFields
  } from '../models/MyCollection';

  /* ... pretend there's relevant stuff happening here */

  MyCollection.findOne({ _id: '436rgerr25gw4gfdhdsf' },
    (error: Error, collection: IMySchemaFields): any => {
      // Do something with your `collection` object.
      // Just don't try to access, modify, or add fields
      // that don't exist on the referenced interface.
    });

  /* More things probably happen here */
As for TypeScript for a large project... I think we get into different kinds of discussions here (eg, is Node or Deno the right tool, etc etc). The kind of applications I've used this design on have not had to face significant scaling or load problems. They're very applied circumstances that face small-to-medium/large traffic. I haven't benchmarked for anything that operated on a real significant scale. We will be testing for around the ~1k-2k concurrent user mark later this summer. But that doesn't involve consistent read/writes— it will involve regular reads, but seldom writes. I might try to write about it later, but it's kind of a fast-track thing so I'm a bit buried in it.
Other projects following a similar design tend to be internal tooling for editorial and other teams in media and so don't see significant traffic or structures/changes beyond initial versions.
I can understand many peoples' reservations and concerns. I definitely think it's a matter of what's right for the job, considering the circumstances and resources available.
Outside of very niche use cases, pushing consistency enforcement to the app layer adds enough complexity to the dev process that a one- or two-programmer team will not generally handle it well. That's why even Google ended up building Spanner.
Because for the average developer, reasoning about eventual consistency and handling all the edge cases is fairly hard.
If it's easy for you, I guess you are far above the average dev.
There really aren't that many edge cases. If you're changing data, you're in a transaction and that's always strongly consistent. If you're querying data, nobody really cares if it's 10s behind (and usually it was 1-2s).
On the other hand, counting a few thousand things is a pretty major PITA. And doing bulk updates requires map/reduce. So there's a cost... but it's not what you seem to think it is.
Some of your projects look like they're in the e-commerce space. So if I look at my cart and it's missing products I added, say, 5 seconds ago, that would be an issue.
It's trivial to define your entity groups in Google Cloud Datastore such that you have strongly consistent carts. It's how you would naturally build an app, and it doesn't require http sessions (which I've never used on GAE and never expect to in the future). I've built multiple ecommerce sites this way and they work great.
What's hard is to do strongly consistent queries across carts. But that isn't something anyone cares about.
That's what session affinity is for. I'm pretty ok with 5 seconds delay between adding something to the cart on my laptop and it appearing on my phone. Or, you may do user affinity and I'll have both the same second, but my latency will go bad the first day I travel overseas.
I think NoSQL, especially things like Mongo, got popular because it is super easy to program with JavaScript. While scaling is one of the advantages, I'd be super surprised if many people actually need scaling capabilities (other than because their design is super inefficient).
Recently I've been inspired to play around with kicking out as many layers between a relational db and a REST API as possible, largely because I've been watching https://www.twitch.tv/nybblesio (who I first came across here on HN in a thread about live coding via Twitch streaming). One of the projects he works on is a SaaS product which is largely done in PostgreSQL leveraging https://github.com/PostgREST/postgrest . Seems pretty good; it's going to be interesting to see how it turns out (though I haven't seen him work on it lately).
One of the problems of many stacks is that the frameworks wrap general-purpose languages over SQL, which is not really a good idea: SQL is a vastly more capable language for dealing with relational data, and layers built over the top often dumb down the database.
The REAL problem SQL wrappers need to solve is that most languages don't have a good interface with SQL. So most often SQL is handled in strings and manipulated with string manipulation, and there is no type-safe way of transitioning data from the DB to a general-purpose language.
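As a sketch of what that gap looks like in practice (mock data, hypothetical names): the driver hands you an untyped bag of values, and a runtime check is the only bridge into the type system.

```typescript
// Rows come back from a driver as untyped bags of values; a small
// runtime check is the best you can do to cross into the type system.
// The row data here is mocked; a real driver would supply it.
interface UserRow { id: number; email: string }

function asUserRow(raw: Record<string, unknown>): UserRow {
  if (typeof raw.id !== 'number' || typeof raw.email !== 'string') {
    throw new Error('row does not match expected UserRow shape');
  }
  return { id: raw.id as number, email: raw.email as string };
}

const fromDriver = { id: 7, email: 'a@example.com' }; // pretend driver output
const user = asUserRow(fromDriver); // typed from here on
```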
Microsoft at one stage had Linq to SQL which was quite good..... but they killed it :)
> I think NoSql, especially things like Mongo, got popular because it is super easy to program with javascript.
NoSQL caught on because of what it doesn't do: attempt to implement the relational model.
A lot of people were using MySQL in infamous LAMP stacks to set up websites with virtually no business logic.
So you had:
1. An application that didn't need the relational model
2. A DBMS that had a very poor implementation of SQL, which itself is a poor implementation of the relational model
3. A large community who didn't know anything about database theory.
When you're using MySQL as a glorified hashtable, you get none of the benefits of the relational model, and all the downsides.
From their understanding of the technology, throwing all that out made a lot of sense. It's pretty reasonable to use a distributed hashtable if what you want is a distributed hashtable.
I think the backlash now is that NoSQL vendors marketed their products as capable alternatives to SQL DBMSs. But they had no query optimizer, no transaction support, no reliable backups, and no schemas. The last was supposedly a selling point, but the reality is that your data always has a schema, and code will crash if the data doesn't match it.
Great marketing, but people were sold half a car and then told building the rest themselves was a feature.
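The "your data always has a schema" point is easy to demonstrate; here's a minimal sketch (invented data) of an implicit schema being violated in a "schemaless" store.

```typescript
// Two documents from the same "schemaless" collection; the code's
// implicit schema expects `price` to be a number on every document.
const docs: any[] = [
  { sku: 'a1', price: 10 },
  { sku: 'b2', price: '10' }, // stored by an older code path as a string
];

// The moment the implicit schema is violated, the code misbehaves:
// 10 + '10' silently concatenates instead of adding.
const total = docs.reduce((sum, d) => sum + d.price, 0);
// total is the string "1010", not the number 20
```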
> One of the problems of many stacks is that the frameworks wrap general purpose languages over SQL, which, is not really a good idea, SQL is a vastly more capable language for dealing with relational data and layers built over the top often dumb down the database.
Another problem is that object-orientation is fairly broken, and ORMs tried to figure out how to shoehorn relational databases into inheritance hierarchies. If current notions like value classes had been around, a more straightforward mapping would have been possible.
MongoDB was certainly missing a lot of things when it first launched and was very hyped, but pretty much everything you could want in a database is now in MongoDB.
Do they have a scale/redundancy option for write scaling beyond shard + mirror set? I think my bigger issue with Mongo is that you really need 6+ servers for a production deployment in many scenarios. It's great for getting a single developer instance up and running, but if I'm sticking to one instance or a couple of mirrors for production, I may just reach for MS SQL or PostgreSQL.
Aside: I'm sad that RethinkDB didn't get more of a foothold.
MongoDB supports a combination of shards and replica sets. Each replica set uses a protocol similar to Raft for strong consistency. The replicas are a lot better than just active-secondary mirroring.
Yeah, sorry for using mirrored instead of replica. I've worked with 3+ replica sets where I needed read scaling, and that worked well enough. But there are cases where I want read scaling and don't want to have a full copy of all data on all servers, but do want redundancy.
I really like, by comparison, how RethinkDB, Cassandra and others do it with clustering that has sharding + replication. It seems to be a better model in a lot of ways when you want to spread things out (read/write) but also want some replication of data. It's just a different way of working with things.
Thanks for posting that; I haven't had a good reason to look at it in quite a while. It makes sense that they're adding those features: they all exist in SQL DBMSs because there's a real need for them.
I am currently working on a project for a client where they expect to have about 3,000 users, each with some basic profile information and the ability to upload some documents etc - nothing too crazy. I don't think they'll ever have to worry about scaling to 50,000 users or performing data-intensive database queries. Why not use something like Mongo in this situation?
NoSQL caught on because our industry is like the fashion industry and we need to follow the fads year after year. Plus a lot of FUD about relational databases not scaling (it's true that NoSQL will scale certain problems way better than relational, but the vast majority of people were/are not writing applications at that scale).
> The REAL problem SQL wrappers need to solve is that most languages don't have a good interface with SQL. So most often SQL is handled in strings and manipulated with string manipulation and there is no type safe way transitioning data from the DB to a general purpose language.
For me this is (was) the main problem. When building Android apps, whenever there was a need to store some data locally, I always used Realm (https://realm.io/blog/realm-for-android/) because their API was way easier to use than the default SQLite database. Of course this made the APK significantly bigger, but at least I knew how to manage the data. SQLite, on the other hand, wasn't easy to work with, and worse when it came to versioning/updating the local database schema/definition without messing things up.
I say (was) because late last year I started using Room (https://developer.android.com/topic/libraries/architecture/r...), and working with SQLite on Android has never been easier. So yes, this has been a problem, but the solutions, if built well, are great.
It would be more honest to say that the weird JS design, its lack of features, and its ridiculously small stdlib made it harder to do anything complicated at the time (no ES6 or webpack, remember?).
Hence, no good tooling existed for relational databases: no decent abstraction layer, migration lib, or ORM. Lots of languages had a 2-liner to pop open an SQLite db without needing to install anything, but JS didn't even have that.
So here you are, with thousands of front-end devs who were just learning to declare their variables correctly, coming to the server. They had the choice between a technology that needed a lot of boilerplate to get anything done and provided a painful experience with many early errors.
Or something that just seemed to work out of the box. You just dump your thoughts in it. You don't need to think. No errors back. It just says "ok".
"Seemed", however, is the operative word here. In the first years of Mongo, it failed miserably, but you only saw it further down the road of the projects using it. And of course the inexperience of the CSS wizards using it as a "dump-my-data" store proved to be a terrible curse.
At the time, if you said any of those things, people would have called you a grumpy old loser who didn't want to live in the future.
But they learned that, just like "var" was important to avoid bugs instead of using your variables whenever you felt like it, deciding on a formal schema was, even for a document data store, an important thing to do.
Fast forward: the JS community has now matured and includes modern constructs, a rich ecosystem, and people with way more experience. Mongo has changed its design and now has sweet-spot use cases.
But we still don't have good tooling for relational DB in JS.
And it's still important to have an explicit document stating which schema you use for your document store. Schema less means you can do whatever you want. It doesn't mean, however, that you should.
True that Node.js didn't have, and still doesn't have, a 2-liner to pop open an SQLite db. But what it did have (a few years after its launch) was a pure-JS SQLite client which can leverage V8's event loop and libuv to make parallel execution of db queries a trivial operation.
There is a lot of mettle in Node.js for server side programming.
> Microsoft at one stage had Linq to SQL which was quite good..... but they killed it :)
I think that might be like when rock stars die at 26 instead of hanging around long enough to become the subject of a VH1 "Where are they Now?" episode.
I was on a project that used LINQ to SQL long after Microsoft stopped development on it, and consequently fell out of love with it. The caching mechanism was flaky, and could cause really weird problems that manifested in non-obvious ways. Also, changing the database schema in a LINQ to SQL project could be a chore.
I ended up migrating us away from it and toward Dapper. We ended up bypassing LINQ to SQL way too often to be able to straight-facedly claim it was saving us from having to understand SQL (if that was ever possible). Part of it was the warts, but even more than that, it was the many situations where, to get the query right, we needed a recursive CTE, or a temp table, or a table-valued parameter, or any of the oodles of other things that are awkward or impossible with an ORM.
Dapper is great IMHO. However, the Microsoft-recommended solution after Linq to SQL ended up being Entity Framework... which was an even heavier ORM.
My personal opinion is akin to the parent: the more you abstract away from thinking like your data store, the easier it is to write poorly performing code, because some of the fundamentals get lost in the process.
For instance, in EF and other ORMs, one classic pitfall is "N+1" queries. Even though it's a well-known issue to avoid, I still run into poorly performing loops quite often, and here I feel the abstraction is partially responsible. Most for loops run in code (in memory, at least) very quickly, even against a lot of records. However, those who develop against SQL know that you avoid cursors etc. as much as possible, and that for loops / cursors against a SQL database will bite you the instant you are dealing with thousands of records. I don't think it's always "obvious" to, say, a relatively new C# developer with less experience writing SQL that a for loop against a SQL database is going to literally translate into thousands of select statements that will perform poorly.
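The round-trip cost is easy to simulate; this is a hedged sketch (in-memory stand-ins, made-up names) of the N+1 pattern versus a single batched query.

```typescript
// Simulate the N+1 pattern with an in-memory "database" that counts
// round trips. Real ORMs hide the loop's cost in exactly this way.
let queries = 0;

const orders = [
  { id: 1, customerId: 10 },
  { id: 2, customerId: 11 },
  { id: 3, customerId: 10 },
];
const customers = new Map([[10, 'Ada'], [11, 'Bob']]);

function queryCustomer(id: number): string | undefined {
  queries++; // one network round trip per call
  return customers.get(id);
}

function queryCustomersByIds(ids: number[]): Map<number, string> {
  queries++; // a single round trip, like a JOIN or WHERE id IN (...)
  return new Map(ids.map(id => [id, customers.get(id)!]));
}

// N+1: one query per order
queries = 0;
for (const o of orders) queryCustomer(o.customerId);
const nPlusOne = queries; // 3 round trips for 3 orders

// Batched: one query covering every order
queries = 0;
queryCustomersByIds(Array.from(new Set(orders.map(o => o.customerId))));
const batched = queries; // 1 round trip
```

With thousands of orders, the loop version issues thousands of selects while the batched version still issues one.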
I think it's possible to use a heavy ORM like Entity Framework and write well performing code, but there has to be some awareness of when the task is too complex for the ORM to perform well, and you need to switch to a stored procedure. And there has to be at least a little bit of understanding on how to write well performing interactions with the DB.
From a performance perspective, N+1 queries are spectacularly bad, but the much more pervasive problem is simply that ORMs encourage poor discipline about hitting the database.
Most ORM code bases I've looked at do a pretty decent job with avoiding N+1 queries. But they've all been guilty of doing things like incurring orders of magnitude more network transfer than is necessary by pulling down a collection of objects just to calculate the sum over a single field.
I think that stuff ends up being so common because it's exactly what heavyweight ORMs are encouraging you to do. To an approximation, their entire point is to make it easy to forget that you're talking to a remote resource.
And, for that matter, by hiding the subtlety of the relational model, they also hide the power of the relational model.
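A small sketch of the sum-over-one-field case (illustrative byte counts and invented data, not measurements): pulling every object across the network versus letting the store compute the aggregate.

```typescript
// Contrast: pulling every row to sum one field vs asking the store
// for the aggregate itself.
interface Invoice { id: number; customer: string; note: string; amount: number }

const rows: Invoice[] = Array.from({ length: 1000 }, (_, i) => ({
  id: i, customer: 'c' + i, note: 'x'.repeat(200), amount: 1,
}));

// ORM-style: transfer every full object, then sum in app code
const transferredOrm = rows.reduce((b, r) => b + JSON.stringify(r).length, 0);
const sumInApp = rows.reduce((s, r) => s + r.amount, 0);

// SQL-style: `SELECT SUM(amount) FROM invoices` transfers one number
const sumInDb = rows.reduce((s, r) => s + r.amount, 0); // done server-side
const transferredSql = String(sumInDb).length;
```

Same answer either way; the difference is orders of magnitude of network transfer for no benefit.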
This just comes down to caring about the work you do and putting in the effort to ensure you are delivering quality/performant code. If your devs can't handle that then you have a lot more problems than a heavyweight ORM. No developer worth their salt would be "encouraged" by a library to write bad code.
> kicking out as many layers between a relational db and a REST Api
Cool, is there anything we can see? I pretty much stumbled upon something along those lines back in 2003, and developed it into what I now call "In-Process REST"[1][2] and "Polymorphic Identifiers"[3][4].
The idea being that you can stay within a "storage-oriented vocabulary" for as much as possible, and that can include seamlessly all the way from provided interface to various storage mechanisms.
What's neat is that it reduces friction a lot, greatly enhances composability and reusability and just makes things blatantly obvious.
Not only is it easy in JavaScript, it is easy in any language that has easy serialization to JSON. I always use it for my own hobby projects because I don't have that much SQL knowledge, which makes NoSQL an easy choice for me: just serialize your object into JSON and push it to your DB.
Pushing a serialized object into a distributed hash definitely solves some problems.
And you have finite time, so if the choice is between that solution that you can do now and searching for another solution that may or may not be better, it's often reasonable to go with the known solution. In economics, this is known as "rational ignorance."
The value proposition of learning will change, though. You will find it's worth it, especially if you move beyond hobby projects and have to deal with any kind of business logic, to spend some time learning how databases work, and maybe even the math behind them, the relational model.
There are a lot of mediocre or awful references out there. If you need a good text on the subject, I recommend CJ Date's "Introduction to Database Systems." Note, though, I said good, not easy.
You might be "choosing" not to bother understanding the alternatives, but you're not making a choice between the alternatives because you don't know what they are.
This is similar to the problem of macro programming by eval'ing a string (in dynamic languages). Basically you are trying to construct a tree (AST) by glueing together linear strings of text (instead of working natively with trees)
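A minimal sketch of that difference (hypothetical query helpers): gluing data into the string mixes structure and data, while keeping parameter slots separate preserves the tree.

```typescript
// Gluing strings: the structure of the query and the data get mixed,
// so data containing syntax breaks (or subverts) the query.
function byNameUnsafe(name: string): string {
  return "SELECT * FROM users WHERE name = '" + name + "'";
}

// Working with structure: keep the tree (query text + parameter
// slots) separate from the leaves (the data), as parameterized
// queries do.
function byNameSafe(name: string): { text: string; params: string[] } {
  return { text: 'SELECT * FROM users WHERE name = $1', params: [name] };
}

const hostile = "x' OR '1'='1";
const glued = byNameUnsafe(hostile);    // payload is now part of the query text
const structured = byNameSafe(hostile); // payload stays inert in params
```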
LINQ to SQL is still a thing; you may need to select its package when you install Visual Studio 2017+, but it’s still there, and it’s still the best SQL interface they have built, in my opinion.
We rarely use anything else, but I can see the value of Entity if you’re changing your DB a lot. We typically don’t do that, though, and to be honest, working with changes in LINQ to SQL typically goes smoother than Entity migrations.
Disclaimer: it’s entirely possible that we simply suck at entity.
LINQ to SQL is a dead end. Look into LINQ to DB https://github.com/linq2db/linq2db which is way faster than LINQ to SQL ever was (faster than even Dapper). If you must you could give EF a try, but stay away from LINQ to SQL.
The value of Dapper IMHO is not the speed; it is precisely the interface. With Dapper I can copy-paste code between C# and SQL contexts freely: explore the database directly in SQL, then use the result directly in the program without a translation step.
LINQ to DB has always seemed interesting, but I don't like its use of attributes for POCO properties. It's not clear from the introduction whether non-attributed properties will work if the names and types are compatible, or whether the attributes are custom or come from the standard System.ComponentModel.DataAnnotations. The attributes aren't discussed at all in the introductory examples, actually.
When you say JSON is more agile is that because you don't need to write migrations? Do you use a framework that supports migrations like active record or south?
Adding and removing happens at the application layer, using Option[_] in the data model. Adding a field means adding it as an Option[_]; removing a field means first making e.g. a String an Option[String], or filling in default values on read. If we drop a field, over time we ignore it during writes.
  case class Person(name: String)

adding a job field after some time becomes

  case class Person(name: String, job: Option[String])
and application code needs to deal with it.
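For what it's worth, the same optional-field pattern sketched in TypeScript terms (hypothetical shape): old documents without the new field still parse, and the application code has to handle the absence.

```typescript
// Old documents without `job` still parse; application code must
// handle the missing field explicitly.
interface Person { name: string; job?: string }

function readPerson(json: string): Person {
  const raw = JSON.parse(json);
  return { name: raw.name, job: raw.job ?? undefined };
}

const oldDoc = readPerson('{"name":"Ada"}'); // written before `job` existed
const newDoc = readPerson('{"name":"Bob","job":"dev"}');
const title = oldDoc.job ?? 'unknown'; // application code deals with it
```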
Works for us, compared to my previous jobs where we used SQL and migrations.
Dapper is a great tool for getting an app started and up to speed, and even later once the app gets bigger if sometimes it makes sense to bypass the ORM and hit the DB straight-on.
>I think NoSql, especially things like Mongo, got popular because it is super easy to program with javascript.
I chose Mongo for a project recently. I have many years of experience with SQL databases, and I think SQL will become more of a niche in the future.
Here are the reasons:
Everybody uses an ORM with a SQL database. You can pretend they don't and that everyone is writing raw SQL, but they aren't. An ORM is basically a big, complicated, and slow piece of software that tries to make a SQL database into something else. A NoSQL database is like a native version of that: it starts off with you being able to define collections and schemas in code and so on.
SQL is a shit way to query a database. I used to write massive queries that took an hour or so to write just so that the non-programming people could put them into analytics software that didn't support any other method of input. While I was writing these queries, I was just thinking to myself, "I could write this in code in a few minutes instead of an hour or more." The fact is that programming languages are much better at querying databases than SQL.
Joins are slow and complicated, and ORMs are slow and complicated, so you can never actually use any of these things in a high-traffic environment. As you say, not many people will reach these levels of traffic, but it is still a factor.
Most of the comments lately about "I moved from mongo to postgres" are from first-time programmers who didn't know any of the basic concepts of databases. They then discovered patterns that the SQL database forced on them and declared that mongo sucks, when in reality they just didn't know what they were doing.
>One of the problems of many stacks is that the frameworks wrap general purpose languages over SQL, which, is not really a good idea, SQL is a vastly more capable language for dealing with relational data and layers built over the top often dumb down the database.
This is not true at all. Programming languages are vastly superior at querying a database. Just go write a complex query and compare it to the one in linq or whatever you use.
>Microsoft at one stage had Linq to SQL which was quite good..... but they killed it :)
They have entity framework, which uses linq and is the same thing. It is very slow though.
I'm ripping ORMs out of any backend code and using things like jOOQ.
> I was just thinking to myself "I can write this in code in a few minutes instead of an hour or more"....
> Joins are slow and complicated
You know what's really slow? Creating joins in code.
> This is not true at all. Programming languages are vastly superior at querying a database. Just go write a complex query and compare it to the one in linq or whatever you use.
It makes zero sense to pull a bunch of records back from the db using multiple network calls to join and then filter. Let the db do its job.
> It makes zero sense to pull a bunch of records back from the db using multiple network calls to join and then filter. Let the db do its job.
But then it wouldn't be distributed processing! :)
Seriously, though, consider it for a moment.. this pattern has similar features to something like Hadoop. The data comes from storage nodes (database server and, hopefully, their read replicas) and goes to processing nodes (app server) to have the work done, and then the new data is written back out over the network to storage nodes and replicated across the network (to the replica/slave database servers).
If the data volume is particularly low or the compute load (CPU and/or RAM) is particularly high, the distributed method would make intuitive sense. I haven't seen it yet, however.
Doing distributed joins correctly requires an architectural/technical capability that most distributed database engines don't have: decentralized parallel orchestration. If you have this, you can do joins even with very high data volumes efficiently given good parallel scheduling algorithms. Most databases are designed such that there is a single point of control that declaratively schedules all data flows required to execute the query; this scales poorly for operations like joins, never mind recursive joins, which is why you don't see it.
Letting individual database nodes dynamically schedule and orchestrate their own data flows with each other, essentially allowing each node in the parallel system to construct its own execution plan in relation to other nodes as it goes along, does not fit within the "giant distributed file system" paradigm that most distributed systems are based on. People who design codes for supercomputers are often familiar with parallel orchestration idioms that work at extremely large scales but it hasn't crossed over into ordinary distributed database engines. (This is also a good litmus test for what makes a database "parallel" as distinct from "distributed".)
Most distributed database architectures are much more centralized than they need to be, particularly around control of execution planning, and this limits their expressiveness. It is quite difficult to hack together a distributed join that performs better than a centralized one without good support for parallel orchestration.
Thanks for that detail. I knew there were more interesting, technical reasons than my oversimplification for why I had never seen it.
> People who design codes for supercomputers are often familiar with parallel orchestration idioms that work at extremely large scales but it hasn't crossed over into ordinary distributed database engines. (This is also a good litmus test for what makes a database "parallel" as distinct from "distributed".)
My understanding was that this is primarily (or also) due to the lack of low-latency (and high-bandwidth) interconnects in the ordinary distributed environments, as those enable that orchestration.
I admit to wanting that to be true, as it supports a mantra I like to say to managers with programming-only (no Ops/sysadmin or even DBA) backgrounds: not all problems can be solved with software.
I'm obviously not the parent commenter and have no inside knowledge of the HPC world, but my guess is the main open source supercompting projects would be from NASA, the US national labs, and CERN.
You might search HN for the recent announcements about new clusters, especially top500, and look for the comments discussing using MPI (versus something custom, I think?), as my recollection is that those topics would yield further pointers to the actual examples you're looking for.
> You know what's really slow? Creating joins in code.
It really isn't. In a good programming language expressing a join should be just as easy as in SQL.
> It makes zero sense to pull a bunch of records back from the db using multiple network calls to join and then filter. Let the db do its job.
Agreed that you need a way to do aggregate queries on the server, but plenty of NoSQL systems give you ways to do that. I'd far rather write a map-reduce in javascript/erlang/... than express the operation in SQL and hope that the query planner does what I think it's going to do (which ends up a bit like trying to program in Prolog - it's magic until the day it doesn't work and you then have no idea what's gone wrong or how to fix it).
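For what it's worth, the shape of such a map-reduce is simple enough to sketch in plain JavaScript (an in-memory toy for illustration, not any particular database's API; all names here are made up):

```javascript
// Toy map-reduce: map emits [key, value] pairs, reduce folds each group.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  for (const doc of docs) {
    for (const [key, value] of map(doc)) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

// Total order value per customer: the kind of aggregate you'd
// otherwise express as GROUP BY + SUM in SQL.
const orders = [
  { customer: "a", total: 10 },
  { customer: "b", total: 5 },
  { customer: "a", total: 7 },
];
const totals = mapReduce(
  orders,
  (o) => [[o.customer, o.total]],
  (_key, values) => values.reduce((s, v) => s + v, 0)
);
// totals -> { a: 17, b: 5 }
```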
>It really isn't. In a good programming language expressing a join should be just as easy as in SQL.
I don't think they were saying it wasn't easy, just that it isn't as performant as letting the DB do the join. Which should be true in pretty much all cases.
>You know what's really slow? Creating joins in code.
Ahh, you have combined two separate points into a new point that I didn't make and took it out of context. As I mentioned, I was writing queries for analytics software. This software needed fresh data once per day. There was no requirement for speed. I was referring to speed of development, not speed of the query.
As for the other point, yes joins are slow, and yes, multiple queries can scale better, even if they are not the fastest when you time them in an isolated one-off scenario. Multiple queries are simpler in code and scale better. Nobody is using joins at scale and every large company that started their scaling journey in a SQL database started by performing multiple queries on a distributed database.
Then there is the argument that "well you won't be operating at that scale". Then in that case, you don't have to worry about the minute differences between a single network call and multiple calls. There are so many approaches and realities that make your argument just a theoretical argument rather than a practical one. There is caching, denormalizing, and then there is the fact that people use a vast array of languages from slow as hell like python and upwards.
As for joins in code in particular: sure, they are slower than in a database, which is often written in C++. It depends on the specific situation how fast things need to be. Did you know that many ORMs join in code? This is because of all the redundant data that the join creates. So you have to factor in whether the larger amount of data is slower to send than smaller amounts of data over multiple network calls.
From django:
However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python.
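That prefetch pattern (one query per relationship, then the 'join' done in application code via an id lookup) can be sketched in JavaScript; the collections and field names here are invented for illustration:

```javascript
// Two separate "queries" (plain arrays standing in for result sets).
const authors = [
  { id: 1, name: "Ann" },
  { id: 2, name: "Bob" },
];
const books = [
  { id: 10, authorId: 1, title: "First" },
  { id: 11, authorId: 1, title: "Second" },
  { id: 12, authorId: 2, title: "Third" },
];

// Build the lookup once: O(n) instead of a nested scan per author.
const booksByAuthor = new Map();
for (const b of books) {
  if (!booksByAuthor.has(b.authorId)) booksByAuthor.set(b.authorId, []);
  booksByAuthor.get(b.authorId).push(b);
}

// The 'join' in application code.
const joined = authors.map((a) => ({
  ...a,
  books: booksByAuthor.get(a.id) ?? [],
}));
// joined[0].books.length -> 2
```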
> Nobody is using joins at scale and every large company that started their scaling journey in a SQL database started by performing multiple queries on a distributed database.
I've used joins "at scale" in multiple jobs in (conventional relational) databases up to a few TB with tens of thousands of transactions per second and never experienced any performance problem that was the result of a join, unless you count cross joins with no filters. I have never seen an instance where doing a join was faster in code and I don't see how it could be except perhaps in some unusual edge cases.
I'm sure things change at Google or Facebook scale, but almost nobody is Google or Facebook scale.
I've never understood the "joins are slow" meme or where it came from.
I'm also having difficulty understanding how writing analytic SQL queries is slower than writing normal code. Can you go into more detail? In my experience, the "slow" part of writing any analytic query is deciding exactly what you want to know and making sure you understand that the data means what you think it means. Once you have that, the only thing slowing you down is your typing speed, and I don't think a more compact syntax would really make a difference.
> never experienced any performance problem that was the result of a join, unless you count cross joins with no filters.
An unindexed join will have major performance problems, and look exactly like an indexed join.
> I have never seen an instance where doing a join was faster in code and I don't see how it could be except perhaps in some unusual edge cases.
Faster to do the same thing? No. Easier to see which things are fast and which things are slow? Yes. Also easier to avoid the deadlocks that traditional databases' overzealous application of ACID can easily lead to.
> Can you go into more detail? In my experience, the "slow" part of writing any analytic query is deciding exactly what you want to know and making sure you understand that the data means what you think it means. Once you have that, the only thing slowing you down is your typing speed, and I don't think a more compact syntax would really make a difference.
I find the pseudo-English syntax of SQL is always very hard to follow - it slips into a kind of uncanny valley - the grammar of what goes where can be backwards from what I'd expect. Tooling is also rather limited compared to a "real" programming language. Just basic things like unit testing your queries are much harder than they should be.
>An unindexed join will have major performance problems, and look exactly like an indexed join.
Then add an index.
It's not exactly difficult in a database. Using a database without indexes is kind of stupid. Why would you do that?
And personally I find SQL to be one of the easiest languages to read. I agree that it's kind of back-to-front in many ways, but it's way easier than trying to work out what's happening in some nested loops that someone else has written.
Then you're adding work to your writes, and your database will block your live write transactions until the corresponding index updates are done.
> It's not exactly difficult in a database. Using a database without indexes is kind of stupid. Why would you do that?
You wouldn't intend to, but you might do it by accident. The failure modes can be pretty bad, since an SQL database will take whatever nonsense query you give it and try to run it, even if doing so impacts your live operations. Whereas in many NoSQL systems if you try to use an index that doesn't exist it'll fail fast.
SQL is also not very composable. Queries and their constituent parts are not first-class concepts in SQL. There is no way to, for instance, pass a query to a piece of code and have that code add part of a WHERE clause to it. That results in a lot of code dynamically generating SQL queries.
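To illustrate the composability point: if a query is a value rather than a string, any function can add a WHERE clause before it is rendered. A minimal sketch with hypothetical helper names, not a real library:

```javascript
// A query as data: callers can add predicates before it is ever rendered.
function baseQuery(table) {
  return { table, where: [] };
}
function addWhere(query, clause) {
  // Returns a new query value; nothing is stringified yet.
  return { ...query, where: [...query.where, clause] };
}
function toSql(query) {
  const where = query.where.length
    ? " WHERE " + query.where.join(" AND ")
    : "";
  return `SELECT * FROM ${query.table}${where}`;
}

// Any piece of code can narrow the query without string surgery:
let q = baseQuery("orders");
q = addWhere(q, "status = 'paid'");
q = addWhere(q, "total > 100");
// toSql(q) -> "SELECT * FROM orders WHERE status = 'paid' AND total > 100"
```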
>I've never understood the "joins are slow" meme or where it came from.
Well SQL databases generally don't support joining across a sharded database, which is usually necessary to scale unless you try to scale vertically with high powered machines and your data fits into memory and so on.
They are also obviously slow compared to denormalizing and querying without a join. Then there is the other fact I mentioned that they contain redundant data so if the query you need to pull into code is large enough, it is a lot of data that has to get sent over the network.
>I'm also having difficulty understanding how writing analytic SQL queries is slower than writing normal code. Can you go into more detail?
Yes. A programming language, combined with a database like mongo or an ORM, allows you to create complex queries much more quickly compared to SQL. You can maybe go into stored procedures and start doing loops and recombining multiple queries in there, but programming languages like javascript etc are typically much nicer than those used in stored procedures.
I am talking about queries that require joins, self-joins, sub-queries, multiple types of joins, group bys layered on top of each other and so on. They are horseshit and terrible compared to nice programming languages and maybe using a couple of queries instead of one.
>In my experience, the "slow" part of writing any analytic query is deciding exactly what you want to know and making sure you understand that the data means what you think it means.
This takes a while, but so does writing the query. My non-programming coworker, while good with SQL, spent entire days trying to write a query that he already knew in concept (as in, he knew what he wanted). So I don't agree with your point that understanding what you need is going to take so much of the time.
> Yes. A programming language, combined with a database like mongo or an ORM, allows you to create complex queries much more quickly compared to SQL. You can maybe go into stored procedures and start doing loops and recombining multiple queries in there, but programming languages like javascript etc are typically much nicer than those used in stored procedures.
I guess this is where we differ. I've written many SQL queries many hundreds of lines long taking advantage of all kinds of SQL features. I don't see how I could make them "nicer" by writing them in Javascript: SQL has plenty of warts, but well-formatted and organized SQL is hard to beat for expressing exactly what you want without all the cruft associated with how you're getting it. Once you know what you want, it comes out pretty quickly (IME); only your typing speed is the limit. I find you have to think much more carefully about what you're doing in other languages because you have to think more about how to do it, without the db abstracting all of that away.
IME loops in SQL are a huge code smell - everything should almost always be done using set logic to be clean and performant.
Yes well it would be ideal if I had code to show you and compare it to the SQL, but all of that is at my old workplace and I will get PTSD if I ever look at it again.
I agree that loops in SQL are not great. Loops in programming languages are fine obviously. There are just many more and nicer constructs in a programming language to manipulate data.
Maybe you have a root table that is anchoring your query, say a Staff table. The Staff table has a Manager column, which is another row in the Staff table. You then need to do a bunch of aggregate stuff. So in a programming language, you can maybe query the database 3 times, once for the staff you need, and then again for maybe shifts completed and so on.
You can then easily put the staff into a dictionary with virtually no code. Then you loop over the non-dictionary staff array, and you have something like:
for (const staff of shittyStaff) {
  processedStaff.push({
    manager: staffDic[staff.manager],
    shiftsCompleted: shifts[staff.id].length,
    shiftsWithManager: shifts[staff.id].filter(s => s.coworker == staff.manager).length
  });
}
Or if it is set up with Entity Framework/C#, it is just:
Probably a terrible and not particularly complex example because I made it up, but to do that in SQL requires a lot more stuff, a self-join, group by with count etc, etc.
I use both raw SQL and the entity framework on a regular basis. That Linq to entities query gets translated pretty directly to SQL. Each of those includes translates directly to a left join on whatever column is specified as the key. You would need a group by, but no self joins. Assuming that you are familiar with the database schema and are proficient in SQL, it shouldn't be any slower to write the SQL version than the C# version.
It depends a little on the exact columns you need, but it would look something like this, which is a super common form for a SQL query:
select
m.id Manager,
count(sh.id) as ShiftsCompleted,
sum(iif(sh.coworker = m.id, 1, 0)) as ShiftsWithManager
from staff s
left join manager m
on m.id = s.manager_id
left join shifts sh
on sh.staff_id = s.id
group by s.id, m.id
The SQL version has the advantage that it's more intuitive to specify the columns that you need so if your query is running slow because you're pulling too much data (something that's happened to me a bunch,) you can omit unneeded columns pretty easily.
I think a big point of the article was that these more recent relational dbs (like memsql) figured out how to make distributed joins across multiple shards scale really well - that's one of their core value adds. So you can shard your for example customer and order data stored across multiple nodes and partitions and do distributed joins, aggregations etc. Scaling to use hardware resources on multiple machines is a crucial aspect of these systems.
Disclaimer, I work at Memsql, speaking for myself.
> The fact is that programming languages are much better at querying databases than SQL.
They really are not, though. Sure, perhaps for simple single-table scans, but nothing beats the pure optimizing potential of SQL. Try joining three tables in JS neatly and fast.
It's a bit terse, but you're lying to yourself if you think a hodgepodge mess of ad-hoc JavaScript written to poorly replicate a single specific query is in any way better than what we have now.
> This is basically a big, complicated, and slow piece of software that tries to make a SQL database into something else.
Just to nitpick a little, but I think it's a worthwhile distinction: ORMs don't try to make a SQL database something else, either literally or philosophically. To use them in any more than a trivial way you still need to understand RDBMSs. They just make the queries less verbose and the output more convenient to work with.
> ORMs don't try to make a SQL database something else
This is simply not true.
Many ORMs are designed to abstract away SQL and RDBMS concepts entirely. They deal in objects and object graphs and output SQL which can be quite disjointed from the object model e.g. many-many relationships.
I'm becoming increasingly convinced that there is only one good ORM and the rest are fit only for CRUD operations. ActiveRecord. The abstraction it provides is amazingly flexible. You can be as close to the SQL as you want to be.
With other ORMs, any time I need something not CRUD, I wind up hand-generating SQL. With ActiveRecord, I never need to. Looking at generated SQL, I do all the time with AR, happily it makes that very easy for me. Including my own SQL snippets in scopes? Easy. Hand-generating joins? Almost never.
Providing a wrapper around a thing is not the same as wanting to turn a thing into something else. I would argue that ORMs offer cosmetic change. Convenience, not substantive change.
> Many ORMs are designed to abstract away SQL and RDBMS concepts entirely.
Which ones? Look at their docs and you'll see where conditions, joins, columns, ordering, aggregations, etc.
You are just defining the limits of "something else" and then saying it doesn't cross them. If your data looks completely different from how the database outputs it, I don't think you can say it is just a trivial difference. Both the structure of a query and its output use different concepts from SQL.
in sql, it is SELECT * from products, manufacturers, parts, JOIN ...... ON .........
The linq query syntax makes it seem like you are just plucking a Product out of a database that has a manufacturer and a list of parts as part of its object. It is an object, not a row-and-table-based structure.
Then the output itself is also an object, in a completely different structure to what the database gave to you.
People can use an ORM and not really know how SQL works if they never bother to learn. It is presenting them with an object-based database.
In a NoSQL database like Mongo, you either have to literally store the Product with its parts and manufacturer, or you take them separately from the database and put them back together in code. This requires no abstraction. It is exactly what is happening.
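A sketch of that "put them back together in code" step, with invented collections and field names (plain arrays standing in for the separate reads):

```javascript
// Three separate "fetches" from a document store, as plain data.
const product = { id: 1, name: "Widget", manufacturerId: 7, partIds: [3, 4] };
const manufacturers = [{ id: 7, name: "Acme" }];
const parts = [
  { id: 3, name: "Bolt" },
  { id: 4, name: "Nut" },
];

// Reassemble the object graph explicitly: no hidden mapping layer,
// exactly the lookups you wrote.
const assembled = {
  ...product,
  manufacturer: manufacturers.find((m) => m.id === product.manufacturerId),
  parts: product.partIds.map((pid) => parts.find((p) => p.id === pid)),
};
// assembled.manufacturer.name -> "Acme"
```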
Could you please stop creating new accounts for every few comments you post? We ban accounts that do this, and it's in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It's particularly abusive that you used multiple accounts in this same thread.
HN is a community. Obviously you don't have to use your real name, but if users don't have some consistent identity for others to relate to, we may as well have no usernames and no community at all. That would be quite a different kind of forum. Anonymity is fine, and throwaways for a specific purpose are ok—just not routinely.
>While I was writing these queries, I was just thinking to myself "I can write this in code in a few minutes instead of an hour or more"
Well, no, because SQL is a 4th generation language of higher level than most general-purpose languages. It is a domain-specific language focused on database handling.
A simple SELECT with a few joins and indexes involved encompasses a query execution plan that would take a few hundred lines of code in a regular programming language.
>I chose Mongo for a project recently.
Storing relational data on a document store is a bad idea.
And if you need a document store, Mongo is pretty much a bad option.
Linq works for trivial left-join/project style queries. But for anything actually using the features of modern SQL databases it comes up short quickly.
As an example, try to express something like “sum(x) over (partition by y order by z)”
I do agree that SQL the syntax leaves a lot to be desired, and a proper relational language with a syntax optimized for actual development, and even better, optimized for 6NF style databases, would be awesome, but linq is not by any stretch sufficient to be that
My understanding is everyone uses 3NF in practice though, so what would a 6NF database get you practically?
I see SQL the way I see things like Linux: it's got some weird things that you'd design differently if you could do it over again, but overall it's pretty good and probably not worth the effort to switch.
6NF gives flexibility since everything is decoupled until you bring it in. It makes evolving the schema easier, makes optimizations easier. I also have a hunch that with such a focus it would be easier to find an interesting design space for new relational programming models since it’s, in a sense, “purer”.
But I guess it depends on the use case. For normal oltp style save/fetch entity 3NF makes sense. But it kind of makes sense in the way an ORM would make sense. Which makes you question if you really need relational database at all.
> Joins are slow and complicated, and ORMs are slow and complicated, so you can never actually use any of these things in a high-traffic environment. As you say, not many people will reach these levels of traffic, but it is still a factor.
I have to clean up crap written by people like you. Go learn how to use a relational database properly instead of writing tosh like this and crap buggy code in the application layer.
Redis is not mentioned here at all, so maybe the author is thinking mostly of other NoSQL software, but in the case of Redis the whole point was not just the in-memory performance but the data model as well. My claim is that you can't really exploit the advantage of using memory if you persist in using it to represent the same data model that you were using with relational databases. For instance, think of Redis sorted sets in the use case of leaderboards in popular games (the same pattern is used in a number of applications that have nothing to do with games). Using SQL, even an in-memory one, in such a use case is not going to work. So in the case of Redis the point was to remove the interface between the user and the way the data is actually fetched from data structures, to make the user make those choices in a very direct way. Thus this article does not apply to Redis, in my opinion. I have the feeling that many other NoSQL products could make a similar argument, each in its own area. For instance, I have trouble seeing how modern SQL systems can replace CRDT-based stores.
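The leaderboard pattern can be sketched with a minimal in-memory stand-in for a sorted set (purely illustrative; Redis keeps the ordering incrementally in a skiplist rather than sorting on read as this toy does):

```javascript
// Minimal leaderboard: scores keyed by player, ranked on demand.
// The methods mirror the sorted-set operations conceptually
// (ZADD / ZINCRBY / ZREVRANGE equivalents).
class Leaderboard {
  constructor() {
    this.scores = new Map();
  }
  add(player, score) {   // like ZADD
    this.scores.set(player, score);
  }
  incr(player, by) {     // like ZINCRBY
    this.scores.set(player, (this.scores.get(player) ?? 0) + by);
  }
  top(n) {               // like ZREVRANGE 0 n-1 WITHSCORES
    return [...this.scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n);
  }
}

const board = new Leaderboard();
board.add("ann", 120);
board.add("bob", 90);
board.incr("bob", 50);
// board.top(1) -> [["bob", 140]]
```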
This is the problem with people that use the term NoSQL.
Do you mean Cassandra (BigTable), MongoDB (Document), Riak (Key/Value), Redis (Mix), Kafka (Log Structured) ? There are dozens of fundamentally different systems many of which are closer to an RDBMS than their NoSQL peers. And many of them have rigid schemas so it definitely isn't that either.
In general, understanding performance is way more critical for NoSQL/distributed systems, where one wrong assumption can have a huge performance penalty. SQL as a language, and the abstract relational model, just can't work well in this situation.
> For instance I have trouble seeing how modern SQL systems can replace CRDT-based stores.
They can't of course, but it's not only CRDTs. SQL is incompatible with a lot of things that came out of distributed systems research. Just like POSIX filesystem APIs or pretty much any legacy stateful tech. And no amount of PR articles can fix it.
Well, this is exactly why competition is good, in every domain. That's why PostgreSQL got its column mode and JSON type, because there was a clear use case for performance for the former and schema-less data for the latter. EAV pattern is a plague and I'm glad documents are replacing it.
With arrays and composite type the potential for semi-structured schema was there for decades. The concept just got more mainstream with NoSQL, more easy to maintain and more efficient with indexables type such as JSONB.
Yes hstore is limited but since you can represent numbers and even hstore itself as text, all you really have to do is cast to and from text and then you have a nested structure with almost anything in it...
jsonb is much better obviously, but hstore was really nice and kind of underappreciated.
What concerns me is that the article is strongly biased, in the sense that modern developments in non-relational databases just get ignored while it rants about their state from a few years ago. Schemaless databases do support full ACID transactions (MongoDB, ArangoDB), with some you can also enforce a schema, and not everybody loves SQL. So having competition even on the query-language side can only improve the status quo.
Yes, thinking about your data model will for sure increase your understanding of what you are actually doing... and every professional developer does this as well when choosing a non-relational database. Anything else would be stupid.
For me the article rather shows that the modern developments in non-relational databases do affect vendors on the relational side of the spectrum. Otherwise, it would not make sense to invest so much time in writing such a long article.
My name is Rick Negrin. I run the Product Management team at MemSQL, a scalable relational database. I recently wrote a blog on my thoughts regarding NoSQL vs. Relational Databases and I'd love to hear the community’s thoughts on this.
I notice not a single mention of the CAP theorem. I have a hard time taking any "scalable" RDBMS solution seriously without a discussion on how you scale well-known problems with deletes and updates, maintaining b+tree indexes, distributed joins, distributed data, node loss/partitions, and distributed transactions/updates.
If you are a new distributed system and don't have Jepsen tests or a similar level of discussion on how your database handles partition events, then that tends to be harbingers of snake oil in distributed systems.
Granted I only did a quick search on CAP on your article, but the initial paragraph of your article didn't exactly invite further investigation or time investment.
If your "NoSQL" is just a single node MongoDB, then you should state that rather than your blanket statement. As is your categorization of "NoSQL" is unspecified and unqualified, leading me to believe this is a management-level article with little regard for the real issues in large-scale distributed databases, and why would I think your software also considers such problems?
Hi Rick, I've been following MemSQL for a few years now; are there any plans to release a "community" edition? Last time I checked, about 1.5 years ago, JSON support was very basic and EE pricing (don't remember exact #s) was rather high. Thanks
There already was a community edition[0]. And it was replaced by a "developer" edition that can no longer be used in production. It seems it didn't pan out as a marketing strategy and I don't think it's coming back.
EE features without support. I hate to bring up Mongo as example, but something similar...where support, additional software/plugins and cloud hosting are where the $ is made.
I did thorough testing of Memsql two years ago but went with Aurora instead. Would love to see how the product evolved since (Spark and Streaming integration was just being rolled out at the time), but something tells me pricing will be a deal breaker.
Thing is, they are competing with PostgreSQL, which you can extensively try for free before opting for support.
"Free" being already hard to beat. The fact that you can't extensively test a solution is a real turn-off for me (unless you negotiate with sales reps, which is not a nerd's cup of tea).
I was asking because I am the creator of RediSQL[1] -- SQL steroids for Redis -- which is a less sophisticated product than MemSQL but still has its own use cases.
And maybe for the parent it was enough; if not, it would be very interesting to know what is missing.
Honestly, I believe that for small workloads you can definitely use RediSQL in production; it will happily hold your cache or be a great SQL database.
However, I need a way to distinguish between people just using the free product and people actually supporting the project, so offering as a paid feature something that big companies will require seemed to me the only way to go.
Unfortunately, I don't have the capital nor the bandwidth to go with a fully open source product and sell just support, which I don't believe is a good business model anyway.
If you were in my shoes, would you do something different?
I haven't followed any links posted in this thread, but some things I see often are: free for non-commercial use, timed commercial use usually in the region of 30 days, or rates based on reads and writes. The last one seems like a winner from what you've described as your situation.
To be honest, I fail to see what I could use your product for so I'm out of the target audience.
Assuming NoSQL is for something very efficient or very scalable, I need some room to use it before I have to shell out $$. There are many products where I have to pay before going into production.
If I were building a fast prototype I would not use a postgres box anymore but just a redis one.
If you need to cache data in a way more complex than just key->value, you don't have many alternatives at the moment.
If you want an easy and fast way to have an SQL engine in memory, again, it's not going to be simple.
If you need a separate database for every one of your users, there are not many alternatives that I am aware of.
It is definitely not a revolutionary product, but it has its niche; each of the problems I mentioned can be solved in a different way, but those different ways are quite complex.
NoSQL is an excellent technology for rapid R&D, but once a domain “settles”, data should be modeled and data stores should be switched to relational or graph backends.
If a system already has a well-defined domain, NoSQL adds little value.
Of course you could leverage AWS DynamoDB and reduce cost, but you still have downstream implications for things like reporting, which requires a known schema with relational paradigms.
I don’t see how NoSQL could be better for rapid dev when something like Postgres can migrate its schema and data at once while enforcing constraints. That’s clearly better precisely when your schema is changing.
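To make the "migrate schema and data at once" point concrete, here is a minimal sketch, using SQLite purely so it runs self-contained (Postgres does the same, and additionally lets you add constraints in the same transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada'), ('bob')")

# The schema change and the data backfill run in one transaction:
# either both succeed or neither is applied.
with conn:
    conn.execute("ALTER TABLE users ADD COLUMN name_upper TEXT")
    conn.execute("UPDATE users SET name_upper = UPPER(name)")

rows = conn.execute("SELECT name, name_upper FROM users ORDER BY id").fetchall()
print(rows)  # [('ada', 'ADA'), ('bob', 'BOB')]
```

If either statement fails, the transaction rolls back and the old schema and data are untouched, which is exactly the property a schemaless store makes you reimplement by hand.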
In R&D scenarios, NoSQL wins hands down because devs are jamming schema changes hourly. That level of frequency in a relational model is a massive drag on time and it’s wasted.
If it’s a redevelopment of a well-known domain, depending on reporting requirements, I’d slightly side with relational.
But AWS DynamoDB can be a huge cost savings, so as any good architect will say, “It depends.”
In the late 90s/early 00s Eric Brewer also wrote a bunch of papers about his experiences scaling stateful systems at Inktomi. The most famous one is probably the one that introduced/codified the CAP theorem: https://pdfs.semanticscholar.org/5015/8bc1a8a67295ab7bce0550...
Someone please help me understand. I use relational databases everyday for my projects (db, api, frontend). I'm well-versed in SQL and general best practices, and my projects scale well (admittedly I'm not in the petabytes level of scale). Does the desire to use NoSQL databases come from a disdain for having to think in terms of declarative programming instead of imperative? Is it desirable to have dynamically typed fields in a database? I've always thought that the biggest motivation to use NoSQL is simply due to high-level language programmers' unfamiliarity with relational database logic and SQL syntax. Once I grasped the concepts, the relational model has proven indispensable for me.
I will say, though, that it is important to get your schema correct at the beginning. I've also always worked full-stack, so I don't know what it's like not to have the ability to manipulate the database.
> what it's like not to have the ability to manipulate the database.
In general "not wanting to understand, depend on, or work with other systems/teams" (often cause of dysfunctional eng/pm organizations) has, over 20years experience, been the number one reason devs (esp front end devs cause they tend to move the fastest (faster than traditional backend teams are prepared for) and have least experience in back end systems) go for something new, something they can control, something they can spin up fast and quick. NoSQL, cloud, anything they can do themselves.
NoSQL is used by two camps. One wants a denormalized doc store for their technical needs, and the other because it has an easier quickstart on day 1 when you don’t know SQL. Here we are usually talking of the latter.
The beauty of something like Postgres is that you don't need to get your schema right on day 1. How could you anyway, unless you already know all future data access patterns?
For me, it's two main reasons: 1) Generating and storing massive amounts of pre-computed permutations of the same data based on how I'll read it with easy linear growth by adding servers and 2) Transparent multi-master reads/writes providing easy HA failover. Note, not all NoSQL options offer this. If existing free relational DBs had these options easy and out of the box, I would use them instead.
I just need a ton of data, often redundant, that I never delete, that supports distributed writes and reads (ACID not required). The things I use Cassandra for would require huge Postgres servers and custom sharding/partitioning setups and Citus HA and all sorts of other things.
My own experience chimes with this somewhat, having used MongoDB on my last project. Quite often, application data storage requirements are relatively trivial, and it is a boon to do away with the ORM layer and be able to vary any given object schema without touching the database (although in practice a .js data migration script may be involved, so this is somewhat moot). As a side note on a local cultural issue: the fact that we can operate the MongoDB servers ourselves, whereas RDBMS instances sit with a central team and are buried under a layer of bureaucracy, was probably also an operational consideration.
When it comes to joins, these have lately been added to MongoDB, as has SQL - although the functionality is still rudimentary compared to a mature RDBMS.
But where we experienced pain was when the business decided they wanted to do live reporting. We ended up piping the data into a SQL Server instance and using SSRS.
> As a sidenote local cultural issue, the fact that we can operate the MongoDB servers ourselves, whereas RDBMS instances are with a central team and buried under a layer of bureaucracy was also probably an operational consideration.
This is clearly an oversight in the organization and will get fixed. I don't see it as neutral in the argument for or against some system.
> But where we experienced pain was when the business decided they wanted to do live reporting. We ended up piping the data into a SQL Server instance and using SSRS.
This is common even in RDBMs. You have a database designed to handle the running of whatever application. That design may not be the most efficient for reporting off that application. Flattening the data, and pre-aggregating is usually a good first step. Many data warehouse solutions can do some of this automagically if set up with a common schema type (like star).
The fact that you could operate the Mongo instance, but not the DB, is definitely a cultural issue, or more likely a managerial issue. Definitely NOT a DB issue. I've worked at places where we could manage the DB ourselves, and other places where we couldn't even talk to the DBAs without bureaucracy. It's all about the management in those scenarios. And the company results were as you would expect.
Sure - I’m not claiming that as a database issue at all. It’s just a side addendum to note that sometimes technical decisions can have a political aspect.
How are people running migrations in these distributed SQL environments? Especially with something like Kubernetes that has canary / rollback functionality. Do people run their migrations as a manual step, is integrated into the CI/CD system?
I can't find anybody talking about how they handle it in a scalable way. Not having a good DB migration story for dev/prod environments with automatic rollout / rollback makes using a SQL DB a non-starter for me.
Most Docker / Kubernetes documentation and their ecosystem assumes the app is fully developed and development doesn't exist. Real world companies develop products and have to perform DB migrations on the production database on a regular basis with each release.
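The usual answer, as I understand it, is to run versioned migrations as a separate deploy step (a Kubernetes Job or init container that runs before pods roll), with the applied versions tracked in the database itself so the step is idempotent. Tools like Flyway and Liquibase work this way; below is a minimal sketch of the idea with SQLite and made-up migration strings:

```python
import sqlite3

# Hypothetical ordered migrations; real tools read these from files.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN created_at TEXT"),
]

def migrate(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in MIGRATIONS:
        if version in applied:
            continue  # already applied: safe to rerun on every deploy
        with conn:  # apply migration and record it atomically
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # second run is a no-op, so the CI job can always run it
```

Rollback then means shipping a new forward migration (or keeping paired down-scripts), and the canary strategy is to make each migration backward-compatible with the previous app version so old and new pods can coexist during rollout.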
I have the same feeling. If you ever figure this out, let me know. It looks very interesting, but I don't think my clients will let me remake their production with it and it is difficult to evaluate databases without data or a workload.
It's hard to understand the point of this article. First of all, it is a good example of completely biased ranting.
Secondly, after bashing NoSQL for many paragraphs, it turns around and uses the very features NoSQL offered in the first place as a selling point for MemSQL... Seriously? This is a really biased marketing opinion piece at best.
Not to mention that it talks about new types of analytics but completely ignores (or at least doesn't mention) the category of low-latency streaming analytics applications like Apache Storm, Apache Spark, Apache Kafka, et al.
> "To do this requires a new breed of analytics systems that can scale to hundreds of concurrent queries, deliver fast queries without pre-aggregation, and ingest data as it is created. On top of that, they want to expose data to customers and partners, requiring an operational SLA, security capabilities, performance, and scale not possible with current data stores"
Guess what! This is what streaming analytics was invented for.
> "To give an analogy, imagine libraries saying they are doing away with the Dewey Decimal System and just throwing the books into a big hole in the ground and declaring it a better system because it is way less work for the librarians"
No, to give an analogy, imagine a library where the librarian only accepts books that are below a given size and have a specific color and weight, and rejects any book that is even a cm bigger, a different color, or just a gram heavier. From time to time (let's say each month) the librarians sit together to decide what other books they shall accept (sometimes extending, and by that inherently limiting, the acceptable ranges to other properties like title or smell). And to make the analogy fair, the librarians also reject any book that is returned without its softcover. That's an RDBMS for you. It works, but not everyone is satisfied with such a service.
Can't you have ACID by writing to RAM first and to disk second, delaying the commit acknowledgment until after the disk write, with some tolerance for the window between the RAM write and the disk write? How probable is a crash in that window? The data loss would seem minimal and rare.
And is ACID always so important for most databases? I don't think most databases in the world are related to banks or require that much guarantees and safety. Better hardware often achieves that level of safety.
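What the parent describes is roughly what asynchronous commit (or "group commit") does in several engines: acknowledge once the record is buffered, and fsync in batches, trading a small loss window for throughput. A toy sketch, not how any particular engine implements it, with a plain file standing in for the WAL:

```python
import os

class DelayedDurabilityLog:
    """Toy write-ahead log: commits are acknowledged once buffered;
    fsync happens in batches. A crash loses at most the records
    buffered since the last flush."""

    def __init__(self, path, batch_size=8):
        # "a" (append) in a real log; "w" keeps this demo repeatable.
        self.f = open(path, "w", encoding="utf-8")
        self.batch_size = batch_size
        self.pending = 0

    def commit(self, record):
        self.f.write(record + "\n")  # in the file buffer, not yet durable
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())  # now durable on disk
        self.pending = 0

log = DelayedDurabilityLog("wal.log", batch_size=4)
for i in range(10):
    log.commit(f"txn-{i}")
log.flush()  # final flush; up to batch_size-1 records were at risk
```

Postgres exposes exactly this knob as `synchronous_commit = off`: the D in ACID is relaxed for a bounded window, while atomicity, consistency, and isolation are untouched.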
Other side note: I wish there were an alternative to SQL's "English sentence" flavor of language, so that one could give more precise parameters for querying data instead of building a sentence-like query, which doesn't always make much sense compared to a non-sentence programming language.
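Such alternatives do exist as query builders (SQLAlchemy Core, jOOQ, LINQ, and friends): explicit method calls with parameters that compile down to SQL. A hypothetical minimal one, just to show the shape:

```python
class Query:
    """Hypothetical minimal query builder: explicit parameters
    instead of an English-sentence query string."""

    def __init__(self, table):
        self.table = table
        self.columns = ["*"]
        self.conditions = []
        self.order = None

    def select(self, *cols):
        self.columns = list(cols)
        return self

    def where(self, cond):
        self.conditions.append(cond)
        return self

    def order_by(self, col):
        self.order = col
        return self

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        if self.order:
            sql += f" ORDER BY {self.order}"
        return sql

q = Query("users").select("id", "name").where("age > 21").order_by("name")
print(q.to_sql())  # SELECT id, name FROM users WHERE age > 21 ORDER BY name
```

Under the hood you still end up speaking SQL to the server, but the composition happens in your host language rather than by string-gluing sentence fragments.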
Last I checked, we stopped doing that back in the early 90's - learning new things is your problem and you better know all of them by next Tuesday, or we'll replace you with somebody who does.
I'm surprised how many people still conflate NoSQL with "document database" and only think of MongoDB or CouchDB. I'd love to hear how MemSQL addresses graph data, for example.
> NoSQL came into existence because the databases at the time couldn’t handle the scale required.
Arguably the first "NoSQL" database was a DIT, generally accessed via the LDAP protocol. OpenLDAP is one of the better known open source instances, but Novell eDirectory was there in the early part of this century, as was MS AD.
In the case of Directory Services, it's a completely different approach to both SQL and the common "Document Store" approach of modern "NoSQL" options. Unless you go cowboy-mode, you're still adhering to schemas (albeit with the option to combine schemas), you still use indexed attributes to conduct searches, but you have a literal tree of objects with a more flexible layout (e.g. multi-value attributes without needing to use a JSON column) than with a regular SQL RDBMS.
The rise in popularity of 'modern' NoSQL options (most commonly document stores) is IMO driven by a combination of lazy developers and too much Kool-aid consumption.
There were NoSQL databases well before LDAP was a thing. We called them "Object databases". The idea was that doing object relational mapping was a PITA and wouldn't it be better to store your domain objects directly in the database? (As an aside, in case you are curious, the answer is: No. Generally it's not a good idea) Of course if your data is not relational, you can't do SQL. :-)
> Of course if your data is not relational, you can't do SQL.
The relational model is, as I understand it, fully general, so this literally cannot be the case. There is a problem of “if you do not know the shape of the data in advance” and “if the RDBMSs available to you do not efficiently handle data of the particular shape and access pattern you are using”, but “relational” isn't an inherent property of data, it's a property of a data model, and any data that can be modelled can be modelled relationally. That doesn't mean it is efficiently handled by every, or any, existing RDBMS, though.
There are plenty of models that don't work well modelled relationally. Few examples here:
(1) Wide table with 100K analytical features keyed against a customer. Now if you need to combine a few dozen of these features for predictive purposes then how do you do that in SQL ? You can't. Since every SQL database has a tiny column limit. In Cassandra that is a trivial O(1) fetch.
(2) Time series data can be many orders of magnitude faster in JSON document stores where you can have embedded/nested records.
(3) Star schema with a fact table and thousands of dimensions. In a SQL database that is thousand joins. In a JSON document store those dimensions can be embedded as well as made available as relations and again that is an O(1) fetch.
> There are plenty of models that don't work well modelled relationally.
There are plenty of data shapes and access patterns for which existing RDBMS implementations are inefficient, as I acknowledged. That's not a feature of whether the data is or is not relational—“relational” is a feature of a data model, and any data can be modelled relationally—but rather of the available RDBMS implementations.
> Wide table with 100K analytical features keyed against a customer. Now if you need to combine a few dozen of these features for predictive purposes then how do you do that in SQL ?
In terms of both the relational model and the SQL language, this is trivial. In most concrete SQL implementations, it is both inefficient and requires a concrete schema that diverges from the abstract relational model of the data, because you need multiple tables with a common key rather than a single table, even if abstractly it is a single relation in a normalized model. There's nothing fundamental about either SQL or the relational model that requires this, its just not the usage pattern common RDBMS storage implementations are optimized for.
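The usual relational workaround for the 100K-feature case is a "tall" key-value (EAV-style) table rather than 100K columns; fetching a few dozen features for one customer is then a single indexed query. A self-contained sketch in SQLite (column counts scaled down for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE features (
    customer_id INTEGER,
    feature     TEXT,
    value       REAL,
    PRIMARY KEY (customer_id, feature))""")

# One row per (customer, feature) instead of one column per feature.
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [(1, f"f{i}", float(i)) for i in range(1000)],
)

wanted = [f"f{i}" for i in range(30)]  # "a few dozen" features
placeholders = ", ".join("?" for _ in wanted)
rows = conn.execute(
    f"SELECT feature, value FROM features WHERE customer_id = ? "
    f"AND feature IN ({placeholders})",
    [1, *wanted],
).fetchall()
print(len(rows))  # 30
```

Whether this is *efficient* at 100K features per customer depends entirely on the engine and storage layout, which is the parent's point: the relational model expresses it fine; specific implementations may not serve it well.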
> Time series data can be many orders of magnitude faster in JSON document stores where you can have embedded/nested records.
Yes, that's an efficiency of access patterns issue, not an “is the data relational” issue.
Of course, outside the strict relational model of data, SQL based object-relational databases like Oracle and Postgres have supported embedded and nested records (via the combination of composite and collection data types) since before “JSON object stores” existed.
I think you misunderstood the parents use of the term "model". All the things you describe here can be described in the relational model. The parent was arguing that the model is general enough to cover almost all collections of data. You appear to be arguing that for specific use cases specific implementations perform badly. These arguments pass each other unseen like ships in the night.
Actually wide tables for example can't be done at all.
And technically we can use Microsoft Excel as a photo editor. But that's just ridiculous. Just like it's ridiculous to say you can use relational models for any data structure even if the queries will never return.
You are still missing the point. The relational model is not about particular implementations. There is no limit on number of columns etc. It is not "ridiculous to say you can use relational models for any data structure", it is the whole point. That particular implementations may not suite your requirements is not a problem of the model, it is a problem of the particular implementations.
There may be other limitations, but MSSQL supports[1]:
- Columns per nonwide table: 1,024
- Columns per wide table: 30,000
I think a lot of this RDBMs can't do X, came from using a MySQL. It has improved some, but was garbage until recently.
I'm not sure how wide you're talking about, but PG supports up to 1600 columns depending on the datatypes (there is a page size maximum). Oracle supports 1000, IIRC.
As I recall, MonetDB claims "practically unlimited" number of columns per table. And in this particular use case you're probably looking for an RDBMS specialized in analytics, not a general-purpose one like Postgres. But relational model will handle it certainly better than any other.
You can do all of that with a relational database. Whether or not your specific relational database makes it easy, or efficient is a different question from "can it be modeled relationally".
I would say the the rise of nosql options is driven by the need for scalability by a few large companies that desperately need it, and cargo-culting by developers that don't actually need it, but want to be like google and don't want to take the time to properly learn and optimize sql.
There's a genuine need for good scalable options, and with stuff like google spanner, those don't necessarily need to be "eventually consistent" or nosql. The simpler datamodels were probably easier to create and solved the problem, now I think there'll be a rise of scalable and consistent databases, best of both worlds.
It wasn't just devs who didn't know how to properly optimize SQL. Programming with a NoSQL db was a better experience because there was no object-relational impedance mismatch.
ORMs fill the gap, of course, and SQL with JSON types is really nice. So I agree with your last paragraph, but the rise was due to speed of the db and speed for the developer.
I cranked out a lot of apps under short deadlines with MongoDB way back in the day. And I moved to Elasticsearch and CouchDB a while ago. I'll still use those where appropriate, but now use MariaDB for most things. The competition was good.
I've seen some of those applications. It might be easier & faster to start with a schema-less json database, but I saw some serious maintenance issues with applications like that:
* Using the stored data outside the use cases imagined by the original developer is harder than it should be. (Oh you want a customer dashboard with this and this data correlated, ..)
* "Migrations" can be rather error-prone if developers don't pay attention. (It was stored as a string first, and now it's an integer, now how do we query that). Just because there's no database schema, doesn't mean there's no schema.
* It's hard to see what's actually in a json table/collection. You'd need to browse the code, see how it's used, and then browse through the available data.
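The string-became-an-integer case from the second bullet typically ends up as a one-off script over every document. A hypothetical sketch, with plain dicts standing in for stored documents:

```python
# Hypothetical documents where "age" was stored as a string by
# older app versions and as an integer by newer ones.
documents = [
    {"_id": 1, "age": "42"},
    {"_id": 2, "age": 17},
    {"_id": 3},  # field missing entirely, also legal when schemaless
]

def migrate_age(doc):
    """Coerce age to int where present; the script has to handle
    every historical shape, since nothing ever enforced one."""
    if "age" in doc and isinstance(doc["age"], str):
        doc["age"] = int(doc["age"])
    return doc

documents = [migrate_age(d) for d in documents]
print(documents)
```

With a schema, `ALTER TABLE ... ALTER COLUMN ... TYPE integer` does this once, transactionally, and rejects unconvertible values up front instead of at some later read.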
Some apps aren't subject to new use cases, and some aren't meant to live that long or ever see another developer. And SQL isn't great for the use cases that solutions like Elasticsearch solve.
Migrations are less of an issue if you use an object document mapper which is just a leaner ORM. You can enforce your "schema" in your code while being schema-less. But yes if the type changes it could alter your queryability or you might have to write a script to update all records to reflect the new type. That can be a pain of course.
Viewing JSON collections in NoSQL is nice compared to SQL. Most of your data is already composed as a sensible document, and if you have subcollections, the clients usually allow you to fold or expand them and show everything. That's a win over SQL, IMO.
Regarding JSON columns in SQL: I tend to store config/settings there, and it's better than what we used to do for that stuff (serialized data or base64 encoding). In Navicat, JSON columns are viewable in single-record mode, but pretty-printed format would be ideal.
There have been scalable and strongly consistent databases since the invention of the concept of NoSQL i.e. HBase and Cassandra (CL=ALL).
And the idea that "learning and optimize SQL" would instantly change people's rationale for using these databases shows you don't understand them much at all. There are many factors that come into play. For example you can't use SQL databases for large scale feature engineering.
He's probably refering to a combination of features that "nosql" databases have.
* Schema-on-read: Makes it easier to ingest large amounts of data, and then do ad-hoc exploration. The schema is only determined when reading the data, which is a bit easier for one-off data exploration, you determine how to interpret the data when actually using it. Not appropriate for production systems though. For example, a customer gives you a few TBs of data, you dump it on hadoop, and query it with spark. It would slow you down if you first have to convert it to a relational schema. Again, only good for one-off stuff.
* Most SQL databases have column limits, so if you have a very large amount of features, I'd imagine you'd run into these limits.
* Scalability. Feature engineering is very parallelizable, most normal SQL databases (excluding stuff like cassandra) aren't trivial to scale.
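Schema-on-read in miniature: the raw records are stored untyped, and each reader decides how to interpret fields at query time (JSON lines here standing in for a dump on Hadoop):

```python
import json

# Raw ingested lines: no schema was enforced at write time.
raw = [
    '{"ts": "2018-01-02", "amount": "12.50", "country": "DE"}',
    '{"ts": "2018-01-03", "amount": 7, "country": "DE", "extra": true}',
]

# The "schema" lives in the reader: this query decides that
# amount is numeric, and simply ignores fields it doesn't know.
total = sum(float(json.loads(line)["amount"]) for line in raw)
print(total)  # 19.5
```

Note the first record stored the amount as a string and the second as a number; schema-on-read tolerates that at ingest and pushes the coercion cost onto every reader instead.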
Cassandra is AP (available and partition tolerant), with tunable consistency. HBase is CP (consistent and partition tolerant, but not always available in case of network partitions).
Spanner is consistent, available and partition tolerant due to the extremely precise clock synchronization that google achieved on their hardware and network, an innovation I hope will spread. So yes, that's new.
I don't think I claimed sql was the only solution that ever made sense. I'm sure for some use cases, schema-on-read or no schema is appropriate. But most of those usecases are rather niche. NoSQL became popular for (web)application software, which in my opinion isn't an appropriate use case for most of these applications.
> Spanner is consistent, available and partition tolerant due to the extremely precise clock synchronization...
This is nonsensical. Even in a universe without relativity and with instantaneous communication (an unrealistic world where talking about synchronized clocks can even make sense, since otherwise you fundamentally must deal with an inability to decide the order of two events due to latency), you still get indeterminate simultaneity by introducing a partition: your perfectly accurate and infinitely precise timestamps on one side of the partition would no longer be comparable with timestamps on the other side for purposes of establishing causality between events, as the recent changes made on one side "should have" taken place between any older changes on the other side and any transactions being performed right now, and yet they couldn't because they were unknown. Spanner either gives up availability (waiting to verify what happened) or consistency (accepting that it will have to fix the result later) when a partition occurs: it can't do both, fundamentally. The only extent to which there is an engineering challenge here is "maybe you can minimize the number of partitions you have", which I totally believe Google has done; but given a partition (which cannot be fully prevented), the tradeoff is clear and unavoidable.
TLDR: It's technically CP, but google claims partitions are so rare people can assume it's CA as well (99.999% available).
I assumed TrueTime bypassed the CAP theorem, but apparently that's marketing bs. It's to ensure something called external consistency, which is important if you want to take consistent snapshots over a distributed system.
Semi-related soapbox: Can we all please stop saying "CAP theorem" and instead say "CAP rule"? The formulation of this supposed theorem is so vague that it can neither be proven nor disproven.
> This sounds like an old wives' tale at this point.
Except it's not. I've seen it in person, over and over. I've gone into systems where people were complaining that "the database is slow", but there were no indexes. I've seen systems pull back hundreds of thousands of records, sort them on the app server, and take the first 50. I've seen what would have been a simple join with an exists clause turned into many round trips to the db, with loops and app-code complexity, where a query would have been 100x easier to reason about.
It's not even about optimization yet, but taking the time to learn even the surface capabilities of the tool.
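The "sort on the app server and take the first 50" case above, next to what the database would do in one line (SQLite here purely so the sketch is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO events (score) VALUES (?)",
                 [(i * 37 % 1000,) for i in range(10000)])

# Anti-pattern: ship every row to the app, sort there, keep 50.
all_rows = conn.execute("SELECT id, score FROM events").fetchall()
top_app = sorted(all_rows, key=lambda r: r[1], reverse=True)[:50]

# Letting the database do it: one query, 50 rows over the wire,
# and an index on score would make it cheaper still.
top_db = conn.execute(
    "SELECT id, score FROM events ORDER BY score DESC, id LIMIT 50"
).fetchall()
print(len(top_db))  # 50
```

Both produce the same 50 rows, but the first moves the entire table over the network and sorts it in application memory on every request.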
Although honestly, SQL is more than 40 years old and can be really nasty: stored procedures, a different dialect for each database, nullability comparisons, CTEs. It's very different from any other programming language. It's not hard to imagine people finding it difficult to learn, especially front-end web developers who'd rather not touch the backend. And if people don't know how indexes work, it will perform really badly as well, which makes databases like MongoDB all the more attractive.
Not all developers are formally schooled, a lot of people don't know what a relational database model is.
I've seen people:
* Iterate over entire tables to get to a small number of records
* Not using joins, having the ORM "magically" handle everything (lazily getting non-joined values in separate queries, resulting in thousands of queries)
* Denormalize everything, and then wonder why things are inconsistent and hard to query
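The second bullet, the ORM "magically" handling everything, is the classic N+1 query problem; spelled out with SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

# N+1: one query for the authors, then one lazy query per author.
lazy_queries = 0
result_n1 = []
for aid, name in conn.execute("SELECT id, name FROM authors").fetchall():
    lazy_queries += 1
    for (title,) in conn.execute(
            "SELECT title FROM posts WHERE author_id = ?", (aid,)):
        result_n1.append((name, title))

# One JOIN instead: a single round trip regardless of row count.
result_join = conn.execute("""
    SELECT a.name, p.title FROM authors a
    JOIN posts p ON p.author_id = a.id
    ORDER BY a.id, p.id
""").fetchall()
print(result_join)
```

With two authors this is 3 queries vs 1; with ten thousand authors it's 10,001 vs 1, which is exactly the "thousands of queries" the bullet describes.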
It actually IS quite hard to imagine solid engineers finding it hard to learn simple SQL. It's a lot harder to write "nasty" SQL than "nasty" JS, and if you're a front-end dev struggling with SQL, then you probably should not be doing backend work. Yes, not all databases are equal; that's why there are ORMs and ANSI standards.
From the list of bad practices you've "seen" people do in SQL, I would guess their comfort-zone code is probably a hot mess as well.
> It actually IS quite hard to imagine solid engineers finding it hard to learn simple SQL.
I used to think this because I learned SQL right along with all my other coding. I've realized though it is a mindset shift to go from imperative to declarative, and to think mostly in set operations. That shift can be hard for otherwise good developers.
> It's a lot harder to write "nasty" sql than "nasty" js
That may be true (though, when it comes to dealing with more complex reporting functions and per-database-implementation differences/quirks outside of the realm of ANSI SQL it may not be).
However, it misses the point. Most engineers that struggle with SQL struggle with it because it's hidden from them partially (they're composing queries from snippets/query builders that come from other code, and never get to see the schema directly since it's hidden behind migrators and management interfaces) or completely (ORMs). Because the actual queries being run on actual schemas are less obvious, people do the wrong thing a lot.
That's not to say that abstractions on top of SQL are always bad--perhaps they're over/mis-used, but having worked on massive codebases where every dev's interaction with the database was "write a query in text or with a select().where().from()-type thin wrapper" and codebases that were 100% Django ORM, I can say with confidence that neither approach scales well absent big investments in correctness and RDBMS education.
So the developers at Google (BigTable), Facebook (Cassandra), Yahoo (HBase), Apple (FoundationDB), Redis, etc. are "lazy" and have drunk "too much Kool-aid"?
These technologies power the most demanding applications on the planet every day.
I think the accusation was against the developers who caused the rise in popularity (i.e. the users), not the developers who created the technologies (i.e. the authors).
In that light, it might be best to exclude BigTable, since it's only available publicly as a service.
Facebook, Yahoo, Google, etc. built these databases to use, not as coding exercises. So are they lazy and drinking the Kool-aid for building their own rather than relying on SQL databases?
Or maybe SQL isn't perfect for every use case, and other developers should stop arrogantly belittling those who choose to architect their systems a particular way.
I'd say that 99.9% of organisations using any database technology, don't have the scaling problems of Google, Yahoo, Facebook or Apple.
They objectively can't be "drinking the Kool-aid" or cargo-culting something, because they've specifically created their own solutions to solve their own problems.
I specifically didn't say SQL is perfect. I said that these technologies are very popular for the wrong reasons, and often are used in situations where a relational database would be a better solution.
So it's okay for developers to build their own database for their specific use case. But it's not okay for developers to use a database for their specific use case.
This is not what he said. He basically said that in many cases some developers pushed something like Mongo, ES, or Cassandra into their project or even company when a small SQLite or a well-tuned PostgreSQL would have been much more appropriate.
Specific databases solve specific problems. In my current company, InfluxDB was our solution for log storage and analysis after testing both ES and Postgres, but we did compare all the products, and we still do as we expand. In my previous company, the main database in the product was Zope. I agree that MongoDB may be better in some use cases, but I strongly doubt that most companies are running it because it's the most appropriate product.
>lazy developers and too much Kool-aid consumption.
When your argument is based on personal attacks of the people using that thing, it sounds more like you have an emotional investment in the alternative rather than a rational argument.
It is a very rational argument: people want to copy what the big boys are doing, often for fairly rational reasons, since they want that stuff on their resume. I've seen countless projects using "big data" tools to deal with 100GB or less of data with a very modest growth rate.
Perfect! We at Shippable moved from NoSQL MongoDB to PostgreSQL for several reasons. Here is a small story,
It started with small problems...
Even though we had the ability to add features at a lightning pace, we started seeing occasional downtimes, which always seemed to come down to MongoDB. For instance:
> We were very happy to have 24x7 availability with primary and secondary instances of MongoDB. However, our perf suddenly deteriorated one day and retrieval started taking more than a second per document. We tried using many tools and profilers, but could not figure out what was happening. Finally, we rebuilt a new server, switched that over as primary, and rebuilt our secondary. Retrieval times dropped to 150ms again. This is still an unsolved mystery!
> Our Mongo instance reached 4TB and we were proud of our growing adoption. Due to the lack of tooling around managing large DBs, we relied on indexes to keep search times low. When NoSQL DBs first became popular, there was no way to enforce uniqueness, so these features were built as an afterthought. Some of the bloating of our MongoDB was actually due to indexes, but rebuilding them was primitive and would lock down the entire DB.
> At one point, we needed to reboot our DB server and it took MongoDB 4 hours to come back online. This led to an extended downtime for our service, and we had very little visibility into the MongoDB process or status.
And then came the knockout punch!
The biggest advantage, and disadvantage, of MongoDB is that it has a flexible schema. This means that documents in the same collection (aka table in the old world) do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data. In a nutshell, there are no strict schema rules and this opens it up to a lot of cowboy tinkering.
While many developers love the flexibility, it also puts a very high degree of responsibility on their shoulders to get things right.