Cassandra works really badly with fat nodes (lots of data on one node), and much better with a lot of small nodes, and 100PB with 300K nodes confirms this. Scylla scales better vertically, but I don't know by how much.
Some comments are already comparing this to pgsql/mysql/whatever. Please don't. You can't make the same queries even though the language seems to support it.
Cassandra is good at ingesting data, bad at deleting, really really bad at anything remotely relational. Errors are almost pointless.
I'm going to point at an older comment of mine on cassandra: https://news.ycombinator.com/item?id=20430925#20432564

The takeaway should be: Yes, cassandra/scylla can be really fast and scale a lot. But it is also very probably unusable for your use case. Don't trust what the CQL language says you can do. Don't get me started on how bad the CQL language is, either.
Your statement about data propagation in that linked comment is at least misleading. A write at quorum will always be visible instantly to a read at quorum.
I only wrote that quorum is not a transaction (there are no transactions in cassandra), and that it is not a consensus.
While quorum looks a lot like consensus, it is not since what is returned to the client is the latest timestamp. Different nodes could have and return different data. So Quorum write+local_quorum read might fail even if you are in the datacenter that accepted the write. Quorum is also on the total copies of the data, not a quorum of DC, so in certain (weird) multi-DC setups you could have a quorum in a single DC.
In general though I think that the data consistency options of cassandra (quorum/local/N) are a good idea but underdeveloped.
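The "latest timestamp wins" behavior described above can be sketched with a toy model (my own simplification, not real Cassandra internals, which also involve digest reads and repair):

```python
# Toy model of a coordinator resolving a QUORUM read: query a quorum
# of replicas and return the value carrying the highest write timestamp.

def quorum_read(responses):
    """responses: (timestamp, value) pairs from a quorum of replicas."""
    ts, value = max(responses, key=lambda r: r[0])
    return value

# One replica missed the latest write (timestamp 2), but any 2-of-3
# quorum must include at least one replica that holds it.
print(quorum_read([(2, "new"), (1, "old")]))  # new
```

The point being: the replicas really can hold different data; the client just sees whichever value carries the highest timestamp among the nodes that answered.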
Anyway my point was: cassandra has too many pitfalls, and eliminating those restricts the use cases a lot more than people realize. Plus the naming of all the features looks designed to trick you into thinking something else.
No, you wrote that you have to write and wait several minutes for data to replicate. This is straight-up false.
Yes, if you mix consistency levels incorrectly you will do it wrong and maybe get stale data, but that is a different criticism. I agree that it is easy for unsophisticated users to incorrectly use consistency levels in complex topologies, and I hope we will introduce mechanisms to prevent users making such mistakes in future. But that was not your claim, and in my experience users do understand consistency levels just fine.
There are lots of valid criticisms to point at various use cases with Cassandra, but this was just incorrect.
It's worse than waiting, actually.
Because client A might write value "1" and afterwards client B writes value "2".
Then client C can read value 2 for 5 minutes, and then cassandra's internal read-repair eventually puts value 1 on all replicas and value 2 is lost forever.
If value 2 is newer than value 1, value 1 will never overwrite value 2. If a client reads value 2 at QUORUM then it will always be seen by all future queries.
Client B’s write will only be successfully read from 1 of the nodes it wrote to, and read-repair only runs on QUORUM reads. So, no, it will not be possible to read value 2 for 5m - it will never be visible to operations at QUORUM.
This would have been a valid criticism of LWW (and there are other more contrived examples), but I think (or hope) this is an explicit trade off made by anyone using Cassandra in eventual consistency mode. There are strategies to prevent this being a problem for workloads where it matters, some discussed elsewhere in the thread.
Since "client B write" was successfully written on 2 of the 3 nodes.
Any read request that reads from 2 of the 3 nodes, instead of using a quorum read, will be able to see what client B wrote, until it magically disappears.
Quorum read-repair is only one reason why the value would randomly disappear.
Another one is periodic anti-entropy repair!
No, it won’t. Successfully written does not mean what you think it means. If there is “newer” (by timestamp) data on disk or in memtable then it will not be returned to a client, regardless of which order that data arrived. It is unlikely even to be written to disk (except the commit log).
Since at least one of those nodes has the "newer" value, only one node can serve this "older" value.
> Different nodes could have and return different data.
Only if they miss some writes, and eventually they will converge. But if you do quorum writes and quorum reads (or local quorum writes + local quorum reads), this guarantees you'll read from at least one node that received all the writes issued before the read, so you get the converged value immediately, regardless of which node you ask. All nodes will eventually agree on the value, because timestamps are assigned by the coordinator or by the application, not at the replica. A single write will get the same timestamp across all replicas.
Incorrect timestamps can cause a different problem - a write that happened at physical time T2 > T1 might be considered to be older than T1 by the cluster, if it was accepted by the coordinator whose clock was set in the past. Such write might simply not take any effect, as old updates would be considered newer. However, again, the resolution would be consistent on all replicas once they get all updates.
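A toy last-write-wins merge (again my own simplification, not real internals) shows how a coordinator clock set in the past makes a physically newer write take no effect:

```python
# Toy last-write-wins cell merge: the higher timestamp wins,
# regardless of the real-time order in which updates arrived.

def lww_merge(current, update):
    """Each cell is a (timestamp, value) pair."""
    return max(current, update, key=lambda c: c[0])

# Write "A" stamped at 100 by a coordinator with a correct clock.
cell = (100, "A")
# Write "B" happens later in real time, but its coordinator's clock
# lags behind, so it is stamped 95 -- and is silently discarded.
cell = lww_merge(cell, (95, "B"))
print(cell)  # (100, 'A')
```

Note that the resolution is at least deterministic: every replica applying this merge ends up with the same cell, which is the consistency point made above.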
The timestamp comes from the operating system where the client library is running, so unless all your write requests are issued by the same machine,
you are guaranteed to face the problem of client machine clocks not being in sync.
Just comparing timestamps is obviously a design from someone who didn't review the academic literature on distributed transactions and consensus.
> timestamp is coming from the operating system where the client library is running.
The client can set the timestamp to any value of its choice. It does not have to correspond to clock time.
> so unless all your write requests are issued by the same machine then you are guaranteed to face the problem where the client machine clocks are not in sync.
It's not about all writes. It's about ensuring that all writes to a single partition (specifically, a single column within a single partition) are done using a source of monotonic integers to ensure ordering.
You are correct that if you have a singleton server returning monotonic time to clients this would work.
But this monotonic time server is not part of Cassandra itself, and for this reason the majority of people using Cassandra will use the OS time without knowing that this is silently corrupting the database.
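For a single writer process (not the shared time-server case), a source of monotonic timestamps can be sketched like this; the helper class is my own, not something Cassandra or its drivers ship:

```python
import time

# Hypothetical per-writer timestamp source: never goes backwards even
# if the OS clock steps back, by taking max(clock, last issued + 1).

class MonotonicTimestamps:
    def __init__(self):
        self._last = 0

    def next(self):
        now_us = int(time.time() * 1_000_000)  # microseconds, as CQL uses
        self._last = max(now_us, self._last + 1)
        return self._last

ts = MonotonicTimestamps()
a, b, c = ts.next(), ts.next(), ts.next()
assert a < b < c  # strictly increasing on this writer
```

This only orders writes from one process; coordinating timestamps across many client machines is the hard part the thread is arguing about.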
Nothing bad? That’s the whole point of quorum reads and writes. You write to at least two, and read from at least two (one typically being a digest-only to corroborate the result).
"write will fail"...but the data will still be there (you can SELECT it normally) and will be replicated to the node that was down once it is up again.
Unless the node was down for too long and could not fully catch up before a set time (the DB TTL, if I remember correctly), in which case the data might or might not be propagated, and old deleted data might come back and be repropagated by a cluster repair.
I know this is HN and people love to troll, but it makes me sad when I see you using falsehoods to take a steaming dump on a group of engineers that are obsessed with building a database that can scale while maintaining the highest level of correctness possible. All in Open Source... for free! Spend some time on the Cassandra mailing list and you'll walk away feeling much differently. Instead of complaining, participate. Some examples of the project's obsession with quality and correctness:
That blog was posted 9 years ago. Needless to say, a lot of improvement and engineering has happened since then. The limited use cases of LWTs will soon be replaced with general, ACID transactions.
LWTs were added for a very simple reason: stronger isolation where the performance trade-off makes sense. Nothing to do with the GP comment.
Until recently they were indeed slow over the WAN, and they remain slow under heavy contention. They are now faster than peer features for WAN reads.
However, the claim that they have many bugs needs to be backed up. I just finished overhauling Paxos in Cassandra and it is now one of the most thoroughly tested distributed consensus implementations around.
> and old deleted data might come up again and be repropagated
That pretty much describes iCloud.
Argh! Zombies!
iCloud has terrible syncing. Here’s an example. Synced bookmarks in Safari:
I use three devices, regularly; my laptop (really a desktop, most of the time), my iPad (I’m on it, now), and my iPhone.
On any one of these devices, I may choose to “favorite” a page, and add it to a fairly extensive hierarchy of bookmarks, that I prefer to keep in a Bookmarks Bar.
Each folder can have a lot of bookmarks. Most are ones that I hardly ever need to use, and I generally access them via a search.
I like to keep the "active" ones at the top of the folder. This is especially important for my iPhone, which has an extremely limited screen (it's an iPhone 13 Mini. Alas, poor Mini, I knew him well).
The damn bookmarks keep changing order. If I drag one to the top of the menu, I do that, because it’s the most important one, and I don’t want to scroll the screen.
The issue is that the order of bookmarks changes, between devices. In fact, I just noticed that a bookmark that I dragged to the top of one of my folders, yesterday, is now back down, several notches.
Don’t get me started on deleting contacts, or syncing Messages.
I assume that this is a symptom of DB dysfunction.
You can only SELECT at QUORUM if you are guaranteed to be able to SELECT it again at QUORUM, i.e. it must be durable (and the QUORUM read will ensure it if the prior write did not).
If you are selecting at ONE then yes, you can expect stale replies if you contact a different node. That is what the consistency level means; that just one node has seen the write.
I don't think that's true. I can't speak for everybody, but for the stuff I worked on, quorum reads at RF=3 are only twice as slow as reading a single node, and it's pretty practical, if you care about it, to write and read using quorum writes/reads. It's true there are many applications where you're OK with sometimes not reading your latest data, and for those you can get somewhat better performance/latency for a given scale by going CL=ONE...
* LWT (Lightweight Transactions) are nothing like what you expect when you hear "transactions". LWTs are single row only.
* Error handling is asinine. E.g.: a quorum insert fails. Is the data there or not? Maybe!
* Materialized views. Added, removed, re-added, deprecated.
* Big limits on secondary indexes, which only work inside each single partition.
* Different formats of timestamp for insert and select.
* BATCH is atomic, but not isolated.
...In general CQL feels like a hack. In classic SQL you understand early that you can compose queries, subqueries and such. CQL leads you to believe that this is possible, only for you to find out that it isn't. The docs are full of stuff like "You can do this query, except in this case, unless this other case in which you can again, but not in this other". It's just exception on top of exception. Even the documentation does not list all the exceptions (I managed to find a couple at my last job).
Single partition only. Poorly named, sure. Feature-wise though pretty comparable to peer offerings. Fortunately general purpose transactions are coming next year.
> materialized views. Added, removed, re-added, deprecated.
No; added, then marked experimental. Never removed or deprecated, though they may be superseded before long.
> error handling is asinine
Fundamental distributed systems problem that is just more apparent with eventual consistency. Failure does not have a certain outcome.
> A quorum insert fails. is the data there or not? Maybe!
How is this point fundamentally different from SQL? In SQL, after you send COMMIT to a remote database and get back a network error, you don't know if your INSERTed data is committed or not. You'll have to reconnect and check for it.
In CQL queries can fail even if your connection is still up and running.
In this case it means that it could not write to 3 nodes, but it is not telling you whether it wrote nothing at all or wrote to at least one replica. Writing to even one replica means that the data will eventually be replicated.
So if only 1 of the 3 nodes is up, it will still write there and then return error, even if the coordinator knows that the other nodes are not up.
Queries have a maximum running time (only configurable at the database level, if I remember correctly), and if your write exceeds this, it returns an error. The replication still goes on and the data is still there.
Network errors are a different category, since you might not even know if your command was sent or not. Cassandra kind of sidesteps network errors, though, since the client connects to multiple coordinators in the cluster.
They are not a different category. This is a distributed database, network errors and node failures are a fundamental part of its function.
Fundamentally every distributed protocol has a moment where it may have to tell the client it has abandoned the operation, but already has in flight messages to remote replicas that would result in a decision to complete the operation.
Before that operation is answered by the replicas, the operation is in an unknown state. It is fundamental. Some slower approaches to reaching decisions may mask this problem more often, but it is there whether you realise it or not.
I think the concern there is that these different failure modes are all given the same error/exception messaging? The caller could make different decisions based on these different potential outcomes, but only if the caller receives specific errors. It's been a while since I've used Cassandra, so if the error handling has improved along those lines then I apologize.
A timeout is pretty much always an unknown outcome. Cassandra does have dedicated exceptions for failed writes (which should have a certain outcome of not applied) but this is much rarer in practice.
Cassandra does today inform you on timeout how many replicas have been successfully written to, if it was insufficient to reach the requested level of durability, which many years ago it did not. But this is no guarantee the write is durable, as this will represent a minority of nodes - and they may fail, or be partitioned from the remainder of the cluster. So the write remains in an unknown state until successfully read at QUORUM, much like a timeout during COMMIT for any database.
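The "unknown until successfully read at QUORUM" state suggests a client-side pattern like the following sketch (the function names and shape are mine, not a real driver API):

```python
# Hypothetical client pattern: after a write times out, only a QUORUM
# read can settle whether the value is durably visible.

def write_then_resolve(write, quorum_read, payload):
    try:
        write(payload)
        return "applied"
    except TimeoutError:
        # The write may sit on a minority of replicas. Reading it back
        # at QUORUM both observes it and helps make it durable
        # (via read repair); otherwise its fate is still unknown.
        return "applied" if quorum_read() == payload else "unknown"

def flaky_write(payload):          # simulated coordinator timeout
    raise TimeoutError

print(write_then_resolve(flaky_write, lambda: "v1", "v1"))  # applied
```

Note that "unknown" is honest here: the write could still surface later through hints or repair, which is exactly the ambiguity being discussed.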
What you're saying applies more or less to any NoSQL database. "Unusable for your use case" is a bit of a big statement. There are plenty of cases where you don't need the power of relational databases, and relational databases can't/don't typically scale horizontally the way NoSQL can. And really, NoSQL can because of its attributes. If all you need is key/value lookups or time series, and you're not looking for any joins or random reads that don't agree with the native ordering/partitioning, then it's hard to beat NoSQL scalability/ingest-wise.
True, which is why I said that cassandra has its uses.
The problem is that the documentation and the language tries really hard to sell you something else, and I have seen multiple people wasting way too much time on this.
I agree, but I think the part that's not stated is that when picking such a technology, you basically commit to not needing joins or random reads. If you do, as the application evolves, you're really out of luck. Generally, that commitment cannot be made upfront.
Typically, solutions like time-series or document databases are deployed alongside other components in a much larger system. If you are looking for a single database solution, then maybe NoSQL is not for you.
I don't think those have the same performance characteristics. At the end of the day you can't have your cake and eat it. You can't keep large scale indices that facilitate fast joins without writing them out to disk and you can't keep them in sync with your data without transactionality/atomicity. This boils down to tradeoffs in terms of locking/consensus/scale/indices/data locality. That said if you're going to be emulating this over noSQL anyways then there's an argument for using something that does it for you (assuming it meets your requirements).
This is definitely about tradeoffs.
But there are two tradeoff spectrums:
1. Consistent (strict serializability + global transactions) vs eventual consistency.
2. Range queries and joins vs key-value only.
FoundationDB provides a consistent database, but by default only a key-value API.
CitusDB, Vitess and the like give you consistency only within a partition, but also a nice SQL API.
CockroachDB and YugabyteDB provide a SQL API with a consistent database.
Cassandra gives you eventual consistency with a key-value API.
My experience has been that the performance penalty for using a consistent database, instead of one that will silently corrupt your data, is small (less than 10% slower).
As for the cost of joins and indexes, no one is forcing you to create an index or send a query that uses a join. It's 100% opt-in.
Still, Google Spanner and CockroachDB made it very cheap to join a parent entity table with a child entity table by using interleaved tables.
And with RocksDB and a good SSD, the cost of index maintenance is not as bad as it appears to be.
Cassandra has tunable consistency. It also has some, terribly slow, lightweight transactions. If you have replication, which you need for HA/DR, then the cost of writes is pretty much the same, data has to be written to all 3 replicas. The question is, for the most part, when do you ack the writes. So on a single site Cassandra you can e.g. write 3 replicas, read 2 back, and you have a consistent database (for a given key). That's not sufficient for ACID though.
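The "write 3, read 2" arithmetic works because write and read quorums must intersect whenever W + R > N; a quick exhaustive check:

```python
from itertools import combinations

# Any write set of W nodes and any read set of R nodes out of N
# replicas must share a node when W + R > N, so a quorum read always
# contacts at least one replica that accepted the quorum write.

N, W, R = 3, 2, 2
nodes = range(N)
assert all(set(w) & set(r)
           for w in combinations(nodes, W)
           for r in combinations(nodes, R))
print("every W=2 write quorum overlaps every R=2 read quorum at RF=3")
```

This is the whole guarantee: per-key read-your-writes from quorum overlap, which, as noted, is still not ACID.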
I'm honestly not familiar at all with CitusDB or Vitess in terms of where they sit in their ability to handle failures, scale out, transactions etc. or where they sit in the CAP story. I'm sure Spanner is paying the tax somewhere but it's just that Google can back it with infinite resources to give you the performance you need.
If there was a database that offered ACID, HA/DR, and scaled horizontally perfectly at a cost of 10% more resources then I doubt anyone would be using stuff like Cassandra, ScyllaDB or HBase... I don't think that exists though but I'll look up those other databases you mentioned.
It will exist next year. Cassandra is getting HA ACID transactions (depending how you define C; we won't be getting foreign key constraints by then) that will be fast. Typically as fast (in latency terms) as a normal eventually consistent operation. Depending on the read/write balance and complexity, they may even result in a reduction in cluster resource utilisation. For many workloads there will be additional costs, but it remains to be discovered exactly how much. I hope it will be on that sort of scale.
>It also has some, terribly slow, lightweight transactions
FWIW, LWTs are much (2x) faster in 4.1 (which has been very slowly making its way out the door for a while now... the project is mostly already more focused on 4.2, so pushing 4.1 out the door is tortuous for procedural reasons)
Cool, good to know. There's still the relational vs. non-relational question for many use cases, but having better transactions opens the door for more of them.
Right, with real transactions you could maintain your own indexes if you really need them.
And to get good performance you should never join entities that have distinct partition keys; otherwise you end up reading too much data over the network.
I'm honestly very happy Cassandra will get real transactions. But I still need to understand the proposed design. Will it be possible to have serializability when doing range scan queries?
Yes, but probably not initially. The design supports them, it’s just not very pressing to implement since most users today use hash partitioners. There will be a lot of avenues in which to improve transactions after they are delivered and it is hard to predict what will get the investment when.
Proper global indexes are likely to follow on the heels of transactions, as they solve most of the problem. Though transactions being one-shot and needing to declare partition-keys upfront means there’s some optimistic concurrency control required. Interactive transactions supporting pessimistic concurrency and without needing to declare the involved keys will also hopefully follow, at which point indexes are very trivial, but the same caveat as above applies.
For real, I'm not against Java, but I find time and again that services that depend on the JVM have lots of "weird stuff". Not because Java is bad, but Java fans (which you usually are when you start to develop something this big) think a lot of this crap is acceptable.
In my (extensive) experience in infrastructure. When people say the JVM is the problem, the JVM is never the problem. It's usually just a symptom of something else and lazy ops people just want to blame something and throw up their hands. I've never had to "tune a JVM" to make things work.
I'll give you an example in Spark. We had a huge job that was failing, and after checking the logs, it was when results were being spooled to disk. More log diving showed a lot of GC on every node. At that point we could have gone down the route of tuning something in the JVM, but more digging found the real culprit. iostat while the jobs ran showed that during reads, writes were completely blocked and write latency was in the 100s of ms. The Spark executor trying to dump data was blocked, and the first symptom that things were falling apart was... GC. We changed the I/O scheduler on the nodes and magically everything worked great. The VM in JVM is virtual machine. The same rules for resources apply, and if you run out of resources, don't expect the mythical ops faeries to save you.
Sure, I can answer both. The CFQ I/O scheduler is generally bad for high-volume disk applications; we used deadline in this case. Amy's Guide is still a treasure trove of info and a solid recommended read: https://tobert.github.io/pages/als-cassandra-21-tuning-guide...
If you are using Spark in a bare metal cluster with spark-submit, the above advice applies. In Kubernetes, never use the default pod scheduler with analytic workloads. Great choices are Volcano(https://volcano.sh/en/) and Yunikorn(https://yunikorn.apache.org/). Also great and evolving projects to support and contribute if you can.