Cassandra works really badly with fat nodes (lots of data on one node), and much better with a lot of small nodes, and 100PB with 300K nodes confirms this. Scylla scales better vertically, but I don't know by how much.
Some comments are already comparing this to pgsql/mysql/whatever. Please don't. You can't make the same queries even though the language seems to support it.
Cassandra is good at ingesting data, bad at deleting, really really bad at anything remotely relational. Errors are almost pointless.
I'm going to point at an older comment of mine on cassandra: https://news.ycombinator.com/item?id=20430925#20432564

The takeaway should be: Yes, cassandra/scylla can be really fast and scale a lot. But it is also very probably unusable for your use case. Don't trust what the CQL language says you can do. Don't get me started on how bad the CQL language is, either.
Your statement about data propagation in that linked comment is at least misleading. A write at quorum will always be visible instantly to a read at quorum.
I only wrote that quorum is not a transaction (there are no transactions in cassandra), and that it is not a consensus.
While quorum looks a lot like consensus, it is not since what is returned to the client is the latest timestamp. Different nodes could have and return different data. So Quorum write+local_quorum read might fail even if you are in the datacenter that accepted the write. Quorum is also on the total copies of the data, not a quorum of DC, so in certain (weird) multi-DC setups you could have a quorum in a single DC.
In general though I think that the data consistency options of cassandra (quorum/local/N) are a good idea but underdeveloped.
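The "latest timestamp wins" behavior described above can be sketched with a toy model (my own simplification, not real Cassandra internals, which also involve digest reads and repair):

```python
# Toy model of a coordinator resolving a QUORUM read: query a quorum
# of replicas and return the value carrying the highest write timestamp.

def quorum_read(responses):
    """responses: (timestamp, value) pairs from a quorum of replicas."""
    ts, value = max(responses, key=lambda r: r[0])
    return value

# One replica missed the latest write (timestamp 2), but any 2-of-3
# quorum must include at least one replica that holds it.
print(quorum_read([(2, "new"), (1, "old")]))  # new
```

The point being: the replicas really can hold different data; the client just sees whichever value carries the highest timestamp among the nodes that answered.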
Anyway my point was: cassandra has too many pitfalls, and eliminating those restricts the use cases a lot more than people realize. Plus the naming of all the features looks designed to trick you into thinking something else.
No, you wrote that you have to write and wait several minutes for data to replicate. This is straight-up false.
Yes, if you mix consistency levels incorrectly you will do it wrong and maybe get stale data, but that is a different criticism. I agree that it is easy for unsophisticated users to incorrectly use consistency levels in complex topologies, and I hope we will introduce mechanisms to prevent users making such mistakes in future. But that was not your claim, and in my experience users do understand consistency levels just fine.
There are lots of valid criticisms to point at various use cases with Cassandra, but this was just incorrect.
It's worse than waiting, actually.
Because client A might write value "1" and afterwards client B writes value "2".
Then client C can read value 2 for 5 minutes, and then cassandra's internal read-repair eventually puts value 1 on all replicas and value 2 is lost forever.
If value 2 is newer than value 1, value 1 will never overwrite value 2. If a client reads value 2 at QUORUM then it will always be seen by all future queries.
Client B’s write will only be successfully read from 1 of the nodes it wrote to, and read-repair only runs on QUORUM reads. So, no, it will not be possible to read value 2 for 5m - it will never be visible to operations at QUORUM.
This would have been a valid criticism of LWW (and there are other more contrived examples), but I think (or hope) this is an explicit trade off made by anyone using Cassandra in eventual consistency mode. There are strategies to prevent this being a problem for workloads where it matters, some discussed elsewhere in the thread.
Since "client B write" was successfully written on 2 of the 3 nodes.
Any read request that reads from 2 of the 3 nodes, instead of using a quorum read, will be able to see what client B wrote, until it magically disappears.
Quorum read-repair is only one reason why the value would randomly disappear.
Another one is periodic anti-entropy repair!
No, it won’t. Successfully written does not mean what you think it means. If there is “newer” (by timestamp) data on disk or in memtable then it will not be returned to a client, regardless of which order that data arrived. It is unlikely even to be written to disk (except the commit log).
Since at least one of those nodes has the "newer" value, only one node can serve this "older" value.
> Different nodes could have and return different data.
Only if they miss some writes, and eventually they will converge. But if you do quorum writes and quorum reads (or local quorum writes + local quorum reads), this guarantees you'll read from at least one node that received all the writes issued before the read, so you get the converged value immediately, regardless of which node you ask. All nodes will eventually agree on the value, because timestamps are assigned by the coordinator or by the application, not at the replica. A single write will get the same timestamp across all replicas.
Incorrect timestamps can cause a different problem - a write that happened at physical time T2 > T1 might be considered to be older than T1 by the cluster, if it was accepted by the coordinator whose clock was set in the past. Such write might simply not take any effect, as old updates would be considered newer. However, again, the resolution would be consistent on all replicas once they get all updates.
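A toy last-write-wins merge (again my own simplification, not real internals) shows how a coordinator clock set in the past makes a physically newer write take no effect:

```python
# Toy last-write-wins cell merge: the higher timestamp wins,
# regardless of the real-time order in which updates arrived.

def lww_merge(current, update):
    """Each cell is a (timestamp, value) pair."""
    return max(current, update, key=lambda c: c[0])

# Write "A" stamped at 100 by a coordinator with a correct clock.
cell = (100, "A")
# Write "B" happens later in real time, but its coordinator's clock
# lags behind, so it is stamped 95 -- and is silently discarded.
cell = lww_merge(cell, (95, "B"))
print(cell)  # (100, 'A')
```

Note that the resolution is at least deterministic: every replica applying this merge ends up with the same cell, which is the consistency point made above.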
The timestamp comes from the operating system where the client library is running, so unless all your write requests are issued by the same machine,
you are guaranteed to face the problem of client machine clocks not being in sync.
Just comparing timestamps is obviously a design from someone who didn't review the academic literature on distributed transactions and consensus.
> timestamp is coming from the operating system where the client library is running.
The client can set the timestamp to any value of its choice. It does not have to correspond to clock time.
> so unless all your write requests are issued by the same machine then you are guaranteed to face the problem where the client machine clocks are not in sync.
It's not about all writes. It's about ensuring that all writes to a single partition (specifically, a single column within a single partition) are done using a source of monotonic integers to ensure ordering.
You are correct that if you have a singleton server returning monotonic time to clients this would work.
But this monotonic time server is not part of Cassandra itself, and for this reason the majority of people using Cassandra will use the OS time without knowing that this is silently corrupting the database.
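For a single writer process (not the shared time-server case), a source of monotonic timestamps can be sketched like this; the helper class is my own, not something Cassandra or its drivers ship:

```python
import time

# Hypothetical per-writer timestamp source: never goes backwards even
# if the OS clock steps back, by taking max(clock, last issued + 1).

class MonotonicTimestamps:
    def __init__(self):
        self._last = 0

    def next(self):
        now_us = int(time.time() * 1_000_000)  # microseconds, as CQL uses
        self._last = max(now_us, self._last + 1)
        return self._last

ts = MonotonicTimestamps()
a, b, c = ts.next(), ts.next(), ts.next()
assert a < b < c  # strictly increasing on this writer
```

This only orders writes from one process; coordinating timestamps across many client machines is the hard part the thread is arguing about.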
Nothing bad? That’s the whole point of quorum reads and writes. You write to at least two, and read from at least two (one typically being a digest-only to corroborate the result).
"write will fail"...but the data will still be there (you can SELECT it normally) and will be replicated to the node that was down once it is up again.
Unless the node was down for too long and could not fully catch up before a set time (the DB TTL, if I remember correctly), in which case the data might or might not be propagated, and old deleted data might come back and be repropagated by a cluster repair.
I know this is HN and people love to troll, but it makes me sad when I see you using falsehoods to take a steaming dump on a group of engineers that are obsessed with building a database that can scale while maintaining the highest level of correctness possible. All in Open Source... for free! Spend some time on the Cassandra mailing list and you'll walk away feeling much differently. Instead of complaining, participate. Some examples of the project's obsession with quality and correctness:
That blog was posted 9 years ago. Needless to say, a lot of improvement and engineering has happened since then. The limited use cases of LWTs will soon be replaced with general, ACID transactions.
LWTs were added for a very simple reason: stronger isolation where the performance trade-off makes sense. Nothing to do with the GP comment.
Until recently they were indeed slow over the WAN, and they remain slow under heavy contention. They are now faster than peer features for WAN reads.
However, the claim that they have many bugs needs to be backed up. I just finished overhauling Paxos in Cassandra and it is now one of the most thoroughly tested distributed consensus implementations around.
> and old deleted data might come up again and be repropagated
That pretty much describes iCloud.
Argh! Zombies!
iCloud has terrible syncing. Here’s an example. Synced bookmarks in Safari:
I use three devices, regularly; my laptop (really a desktop, most of the time), my iPad (I’m on it, now), and my iPhone.
On any one of these devices, I may choose to “favorite” a page, and add it to a fairly extensive hierarchy of bookmarks, that I prefer to keep in a Bookmarks Bar.
Each folder can have a lot of bookmarks. Most are ones that I hardly ever need to use, and I generally access them via a search.
I like to keep the "active" ones at the top of the folder. This is especially important for my iPhone, which has an extremely limited screen (it's an iPhone 13 Mini. Alas, poor Mini, I knew him well).
The damn bookmarks keep changing order. If I drag one to the top of the menu, I do that, because it’s the most important one, and I don’t want to scroll the screen.
The issue is that the order of bookmarks changes, between devices. In fact, I just noticed that a bookmark that I dragged to the top of one of my folders, yesterday, is now back down, several notches.
Don’t get me started on deleting contacts, or syncing Messages.
I assume that this is a symptom of DB dysfunction.
You can only SELECT at QUORUM if you are guaranteed to be able to SELECT it again at QUORUM, i.e. it must be durable (and the QUORUM read will ensure it if the prior write did not).
If you are selecting at ONE then yes, you can expect stale replies if you contact a different node. That is what the consistency level means; that just one node has seen the write.
I don't think that's true. I can't speak for everybody, but for the stuff I worked on, quorum reads at RF=3 are only twice as slow as reading a single node, and it's pretty practical, if you care about it, to write and read using quorum writes/reads. It's true there are many applications where you're OK with sometimes not reading your latest data, and for those you can get somewhat better performance/latency for a given scale by going CL=ONE...
* LWT (Lightweight Transactions) are nothing like what you expect when you hear "transactions". LWTs are single row only.
* Error handling is asinine. E.g.: a quorum insert fails. Is the data there or not? Maybe!
* Materialized views. Added, removed, re-added, deprecated.
* Big limits on secondary indexes, which only work inside each single partition.
* Different formats of timestamp for insert and select.
* BATCH is atomic, but not isolated.
...In general CQL feels like a hack. In classic SQL you understand early that you can compose queries, subqueries and such. CQL leads you to believe that this is possible, only for you to find out that it isn't. The docs are full of stuff like "You can do this query, except in this case, unless this other case in which you can again, but not in this other". It's just exception on top of exception. Even the documentation does not list all the exceptions (I managed to find a couple at my last job).
Single partition only. Poorly named, sure. Feature-wise though pretty comparable to peer offerings. Fortunately general purpose transactions are coming next year.
> materialized views. Added, removed, re-added, deprecated.
No; added, then marked experimental. Never removed or deprecated, though they may be superseded before long.
> error handling is asinine
Fundamental distributed systems problem that is just more apparent with eventual consistency. Failure does not have a certain outcome.
> A quorum insert fails. is the data there or not? Maybe!
How is this point fundamentally different from SQL? In SQL, after you send COMMIT to a remote database and get back a network error, you don't know if your INSERTed data is committed or not. You'll have to reconnect and check for it.
In CQL queries can fail even if your connection is still up and running.
In this case it means that it could not write to 3 nodes, but it is not telling you whether it wrote nothing at all or wrote to at least one replica. Writing to even one replica means that the data will eventually be replicated.
So if only 1 of the 3 nodes is up, it will still write there and then return error, even if the coordinator knows that the other nodes are not up.
Queries have a maximum running time (only configurable at the database level, if I remember correctly), and if your write exceeds this, it returns an error. The replication still goes on and the data is still there.
Network errors are a different category, since you might not even know if your command was sent or not. Cassandra kind of sidesteps network errors, though, since the client connects to multiple coordinators in the cluster.
They are not a different category. This is a distributed database, network errors and node failures are a fundamental part of its function.
Fundamentally every distributed protocol has a moment where it may have to tell the client it has abandoned the operation, but already has in flight messages to remote replicas that would result in a decision to complete the operation.
Before that operation is answered by the replicas, the operation is in an unknown state. It is fundamental. Some slower approaches to reaching decisions may mask this problem more often, but it is there whether you realise it or not.
I think the concern there is that these different failure modes are all given the same error/exception messaging? The caller could make different decisions based on these different potential outcomes, but only if the caller receives specific errors. It's been a while since I've used Cassandra, so if the error handling has improved along those lines then I apologize.
A timeout is pretty much always an unknown outcome. Cassandra does have dedicated exceptions for failed writes (which should have a certain outcome of not applied) but this is much rarer in practice.
Cassandra does today inform you on timeout how many replicas have been successfully written to, if it was insufficient to reach the requested level of durability, which many years ago it did not. But this is no guarantee the write is durable, as this will represent a minority of nodes - and they may fail, or be partitioned from the remainder of the cluster. So the write remains in an unknown state until successfully read at QUORUM, much like a timeout during COMMIT for any database.
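The "unknown until successfully read at QUORUM" state suggests a client-side pattern like the following sketch (the function names and shape are mine, not a real driver API):

```python
# Hypothetical client pattern: after a write times out, only a QUORUM
# read can settle whether the value is durably visible.

def write_then_resolve(write, quorum_read, payload):
    try:
        write(payload)
        return "applied"
    except TimeoutError:
        # The write may sit on a minority of replicas. Reading it back
        # at QUORUM both observes it and helps make it durable
        # (via read repair); otherwise its fate is still unknown.
        return "applied" if quorum_read() == payload else "unknown"

def flaky_write(payload):          # simulated coordinator timeout
    raise TimeoutError

print(write_then_resolve(flaky_write, lambda: "v1", "v1"))  # applied
```

Note that "unknown" is honest here: the write could still surface later through hints or repair, which is exactly the ambiguity being discussed.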
What you're saying applies more or less to any NoSQL database. "Unusable for your use case" is a bit of a big statement. There are plenty of cases where you don't need the power of relational databases, and relational databases can't/don't typically scale horizontally the way NoSQL can. And really, NoSQL can because of its attributes. If all you need is key/value lookups or time series, and you're not looking for any joins or random reads that don't agree with the native ordering/partitioning, then it's hard to beat NoSQL scalability/ingest-wise.
True, which is why I said that cassandra has its uses.
The problem is that the documentation and the language tries really hard to sell you something else, and I have seen multiple people wasting way too much time on this.
I agree, but I think the part that's not stated is that when picking such a technology, you basically commit to not needing joins or random reads. If you do, as the application evolves, you're really out of luck. Generally, that commitment cannot be made upfront.
Typically, solutions like time-series or document databases are deployed alongside other components in a much larger system. If you are looking for a single database solution, then maybe NoSQL is not for you.
I don't think those have the same performance characteristics. At the end of the day you can't have your cake and eat it. You can't keep large scale indices that facilitate fast joins without writing them out to disk and you can't keep them in sync with your data without transactionality/atomicity. This boils down to tradeoffs in terms of locking/consensus/scale/indices/data locality. That said if you're going to be emulating this over noSQL anyways then there's an argument for using something that does it for you (assuming it meets your requirements).
This is definitely about tradeoffs.
But there are two tradeoff spectrums:
1. Consistent (strict serializability + global transactions) vs eventual consistency.
2. Range queries and joins vs key-value only.
FoundationDB provides a consistent database, but by default only a key-value API.
CitusDB, Vitess and the like give you consistency only within a partition, but also a nice SQL API.
CockroachDB and YugabyteDB provide a SQL API with a consistent database.
Cassandra gives you eventual consistency with a key-value API.
My experience has been that the performance penalty for using a consistent database, instead of one that will silently corrupt your data, is small (less than 10% slower).
As for the cost of joins and indexes, no one is forcing you to create an index or send a query that uses a join. It's 100% opt-in.
Still, Google Spanner and CockroachDB made it very cheap to join a parent entity table with a child entity table by using interleaved tables.
And with RocksDB and a good SSD, the cost of index maintenance is not as bad as it appears to be.
Cassandra has tunable consistency. It also has some, terribly slow, lightweight transactions. If you have replication, which you need for HA/DR, then the cost of writes is pretty much the same, data has to be written to all 3 replicas. The question is, for the most part, when do you ack the writes. So on a single site Cassandra you can e.g. write 3 replicas, read 2 back, and you have a consistent database (for a given key). That's not sufficient for ACID though.
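The "write 3, read 2" arithmetic works because write and read quorums must intersect whenever W + R > N; a quick exhaustive check:

```python
from itertools import combinations

# Any write set of W nodes and any read set of R nodes out of N
# replicas must share a node when W + R > N, so a quorum read always
# contacts at least one replica that accepted the quorum write.

N, W, R = 3, 2, 2
nodes = range(N)
assert all(set(w) & set(r)
           for w in combinations(nodes, W)
           for r in combinations(nodes, R))
print("every W=2 write quorum overlaps every R=2 read quorum at RF=3")
```

This is the whole guarantee: per-key read-your-writes from quorum overlap, which, as noted, is still not ACID.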
I'm honestly not familiar at all with CitusDB or Vitess in terms of where they sit in their ability to handle failures, scale out, transactions etc. or where they sit in the CAP story. I'm sure Spanner is paying the tax somewhere but it's just that Google can back it with infinite resources to give you the performance you need.
If there was a database that offered ACID, HA/DR, and scaled horizontally perfectly at a cost of 10% more resources then I doubt anyone would be using stuff like Cassandra, ScyllaDB or HBase... I don't think that exists though but I'll look up those other databases you mentioned.
It will exist next year. Cassandra is getting HA ACID transactions (depending how you define C; we won't be getting foreign key constraints by then) that will be fast. Typically as fast (in latency terms) as a normal eventually consistent operation. Depending on the read/write balance and complexity, they may even result in a reduction in cluster resource utilisation. For many workloads there will be additional costs, but it remains to be discovered exactly how much. I hope it will be on that sort of scale.
>It also has some, terribly slow, lightweight transactions
FWIW, LWTs are much (2x) faster in 4.1 (which has been very slowly making its way out the door for a while now... the project is mostly already more focused on 4.2, so pushing 4.1 out the door is tortuous for procedural reasons)
Cool, good to know. There's still the relational vs. non-relational question for many use cases, but having better transactions opens the door for more of them.
Right, with real transactions you could maintain your own indexes if you really need them.
And to get good performance you should never join entities that have distinct partition keys; otherwise you end up reading too much data over the network.
I'm honestly very happy Cassandra will get real transactions. But I still need to understand the proposed design. Will it be possible to have serializability when doing range scan queries?
Yes, but probably not initially. The design supports them, it’s just not very pressing to implement since most users today use hash partitioners. There will be a lot of avenues in which to improve transactions after they are delivered and it is hard to predict what will get the investment when.
Proper global indexes are likely to follow on the heels of transactions, as they solve most of the problem. Though transactions being one-shot and needing to declare partition-keys upfront means there’s some optimistic concurrency control required. Interactive transactions supporting pessimistic concurrency and without needing to declare the involved keys will also hopefully follow, at which point indexes are very trivial, but the same caveat as above applies.
For real, I'm not against Java, but I find time and again that services that depend on the JVM have lots of "weird stuff". Not because Java is bad, but Java fans (which you usually are when you start to develop something this big) think a lot of this crap is acceptable.
In my (extensive) experience in infrastructure. When people say the JVM is the problem, the JVM is never the problem. It's usually just a symptom of something else and lazy ops people just want to blame something and throw up their hands. I've never had to "tune a JVM" to make things work.
I'll give you an example in Spark. We had a huge job that was failing, and after checking the logs, it was when results were being spooled to disk. More log diving showed a lot of GC on every node. At that point we could have gone down the route of tuning something in the JVM, but more digging found the real culprit. iostat while the jobs ran showed that during reads, writes were completely blocked and write latency was in the 100s of ms. The Spark executor trying to dump data was blocked, and the first symptom that things were falling apart was... GC. We changed the I/O scheduler on the nodes and magically everything worked great. The VM in JVM is virtual machine. The same rules for resources apply, and if you run out of resources, don't expect the mythical ops faeries to save you.
Sure, I can answer both. The CFQ I/O scheduler is generally bad for high-volume disk applications; we used deadline in this case. Amy's Guide is still a treasure trove of info and a solid recommended read: https://tobert.github.io/pages/als-cassandra-21-tuning-guide...
If you are using Spark in a bare metal cluster with spark-submit, the above advice applies. In Kubernetes, never use the default pod scheduler with analytic workloads. Great choices are Volcano(https://volcano.sh/en/) and Yunikorn(https://yunikorn.apache.org/). Also great and evolving projects to support and contribute if you can.