
You mean a replication factor of 3? It doesn't work like that. If your replication factor is 3 and you're using quorum reads/writes, then as soon as two machines are down some of the reads and writes will fail. The more machines that are down, the higher the probability of failure. That's why you have to start shuffling data around to maintain your availability, which is a problem... (EDIT: assuming virtual nodes are used; otherwise it's a little different)
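
To make the quorum math concrete, here's a tiny Python sketch (just the arithmetic, not driver code):

    # Quorum for a single partition: floor(RF/2) + 1 replicas must respond.
    replication_factor = 3
    quorum = replication_factor // 2 + 1  # 2 when RF = 3

    def can_serve(live_replicas, required):
        return live_replicas >= required

    print(can_serve(2, quorum))  # True: one replica down, quorum still met
    print(can_serve(1, quorum))  # False: two replicas down, quorum fails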


Like I said, it depends on how you lay out your data. Let's say you have three data centers, and you lay out your data such that there is one copy in each datacenter (this is how Netflix does it, for example).

You could then lose an entire datacenter (1/3 of the machines) and the cluster will just keep on running with no issues.

You could lose two datacenters (2/3 of the machines) and still serve reads, as long as you're using READ ONE (which is what you should be doing most of the time).
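
For the record, that layout is just a keyspace with one replica per DC. A minimal sketch with the DataStax Python driver; the contact point and datacenter names (dc1/dc2/dc3) are placeholders for whatever your snitch actually reports:

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect()  # placeholder contact point

    # Total RF is 3, but laid out as exactly one copy per datacenter.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS myks
        WITH replication = {
          'class': 'NetworkTopologyStrategy',
          'dc1': 1, 'dc2': 1, 'dc3': 1
        }
    """)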


If you read and write at ONE (which I think Netflix does) then this kind of works. Still, with virtual nodes, losing a single node in each DC leaves some portion of the keyspace inaccessible.
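
Per-request consistency with the Python driver looks roughly like this (keyspace and table are made up for illustration):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect()  # placeholder contact point

    # Both read and write at ONE: a single replica acknowledging is enough,
    # so requests survive heavy node loss, but reads can be stale.
    write_one = SimpleStatement(
        "INSERT INTO myks.views (user_id, title) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(write_one, ('user-123', 'some-show'))

    read_one = SimpleStatement(
        "SELECT * FROM myks.views WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE)
    rows = session.execute(read_one, ('user-123',))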

You're also susceptible to losing data outright, since at any given time there will be writes that haven't been replicated to another DC yet, and you're accepting inconsistent reads.

That works for applications where the data isn't mission critical and (immediate) consistency doesn't matter, but it doesn't for many others. I'm not sure exactly what Netflix puts in Cassandra, but if, for example, it's used to record what people are watching, then losing a few records or seeing a not-fully-consistent view of the data isn't a big deal...


If a machine is down (as in the machine itself is dead) then you absolutely need to move the data onto another node to avoid losing it entirely. There is no way to avoid this, whatever your consistency model.

If there is a network partition, however, there is no need to move the data since it will eventually recover; moving it would likely make the situation worse. Cassandra never does this, and operators should never ask it to.

If your network partitions are severe enough to isolate every node from all of its peers, there is no database that can work safely, regardless of data shuffling or consistency model.


It can be hard to tell whether a machine is permanently dead, disk drives and all, or just temporarily down, unreachable, or busy. Obviously, if you want to maintain the same replication factor, you need to create new copies of any data whose replica count has temporarily dropped. In theory, though, you could still serve reads as long as a single replica survives, and you could serve writes regardless.

You can kind of get this with Cassandra if you write with CL_ANY and read with CL_ONE, but hinted handoffs don't work so great (better in 3.0?) and reading with ONE may get you stale data for a long while. It would be nice, and I don't think there's any theory that says it can't be done, if you could keep writing new data into the database at the intended replication factor in the presence of a partition, and could read that data back too. Obviously, accessing data that is present solely on the other side of the partition isn't possible, so not much can be done about that...
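
Roughly, with the Python driver (table invented): ANY lets a write succeed even if no natural replica is reachable, by leaving a hint on the coordinator, while ONE reads from whichever single replica answers first:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect()  # placeholder contact point

    # ANY: durable "somewhere", possibly only as a hinted handoff; the
    # write may be unreadable until the hint is replayed to a replica.
    write_any = SimpleStatement(
        "INSERT INTO myks.views (user_id, title) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ANY)
    session.execute(write_any, ('user-123', 'some-show'))

    # ONE: returns as soon as a single replica responds, possibly stale.
    read_one = SimpleStatement(
        "SELECT * FROM myks.views WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE)
    rows = session.execute(read_one, ('user-123',))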



