Cloudant/IBM back off from FoundationDB based CouchDB rewrite (apache.org)
137 points by jFriedensreich on March 12, 2022 | 45 comments


While I don’t have enough knowledge of the wider implications of this, it does impact something I was experimenting with last year.

The FoundationDB rewrite would introduce a size limit on document attachments; there currently isn't one. Arguably, attachments are a rarely used feature, but I found a useful use case for them.

I combined the CRDT Yjs toolkit with CouchDB (PouchDB on the client) to automatically handle sync conflicts. Each couch document was an export of the current state of the Yjs doc (for indexing and search), with all changes done via Yjs. The Yjs doc was then attached to the couch document as an attachment. When there was a sync conflict, the Yjs documents would be merged and re-exported to create a new version. The issue is that the FoundationDB rewrite would limit the attachment size, which makes this architecture more difficult; it's partly why I ultimately put the project on hold.
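
A minimal sketch of that merge step, assuming PouchDB in the browser with the Yjs doc stored under an attachment named "ydoc" (all names here are illustrative, not from the actual project):

    // Yjs merges are commutative, so applying every revision's binary
    // update to a fresh Y.Doc yields the merged state.
    import * as Y from 'yjs'
    import PouchDB from 'pouchdb'

    const db = new PouchDB('notes')

    async function resolveConflict(docId) {
      // Fetch the winning revision plus any conflicting revisions.
      const winner = await db.get(docId, { conflicts: true })
      const losers = await Promise.all(
        (winner._conflicts || []).map((rev) =>
          db.get(docId, { rev, attachments: true, binary: true })
        )
      )

      // Merge every revision's Yjs attachment into one fresh doc.
      const merged = new Y.Doc()
      const winnerBlob = await db.getAttachment(docId, 'ydoc')
      Y.applyUpdate(merged, new Uint8Array(await winnerBlob.arrayBuffer()))
      for (const loser of losers) {
        const buf = await loser._attachments.ydoc.data.arrayBuffer()
        Y.applyUpdate(merged, new Uint8Array(buf))
        await db.remove(docId, loser._rev) // drop the losing branch
      }

      // Re-export the merged state for indexing/search and re-attach it.
      const { _conflicts, ...doc } = winner
      await db.put({
        ...doc,
        content: merged.getMap('content').toJSON(),
        _attachments: {
          ydoc: {
            content_type: 'application/octet-stream',
            data: new Blob([Y.encodeStateAsUpdate(merged)]),
          },
        },
      })
    }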

(Slight aside: a CouchDB-like DB with native support for a CRDT toolkit such as Yjs or Automerge would be awesome. When syncing mirrors you would be able to exchange just the document state vectors and the missing changes, rather than the whole document.)
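
For reference, this is what that exchange looks like in plain Yjs today, outside any database: each side sends a tiny state vector and gets back only the updates it is missing.

    import * as Y from 'yjs'

    // Sync two Y.Doc instances by exchanging state vectors and diffs.
    function syncPair(a, b) {
      const diffForB = Y.encodeStateAsUpdate(a, Y.encodeStateVector(b))
      const diffForA = Y.encodeStateAsUpdate(b, Y.encodeStateVector(a))
      Y.applyUpdate(b, diffForB)
      Y.applyUpdate(a, diffForA)
    }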


The (low) attachment size limit at Cloudant is more about service quality and guiding folks toward good uses of the service than about a technical issue.

As others have noted, the solution to storing attachments in FDB, where keys and values have an enforced maximum length, is to split the attachments over multiple key/values, which is exactly what the CouchDB-FDB code currently does.
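
For illustration, the chunked layout looks roughly like this, using the community foundationdb Node binding (the key layout and API version here are assumptions for the sketch). FDB caps values at 100 KB, so the attachment is split across consecutive keys:

    import * as fdb from 'foundationdb'

    fdb.setAPIVersion(620)
    const db = fdb.open()

    const CHUNK = 100_000 // stay under FDB's 100 KB value limit

    async function writeAttachment(docId, name, bytes) {
      await db.doTransaction(async (tn) => {
        for (let i = 0, n = 0; i < bytes.length; i += CHUNK, n++) {
          // e.g. "att/mydoc/photo/00000003" holds chunk 3
          const key = `att/${docId}/${name}/${String(n).padStart(8, '0')}`
          tn.set(key, bytes.subarray(i, i + CHUNK))
        }
      })
    }

This single-transaction version still runs into FDB's roughly 10 MB transaction size cap, which is where the staged approach discussed below comes in.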

The other limit in FDB is the five-second transaction duration, which is a more fundamental constraint on how large attachments can be, as we are keen to complete a document update in a single transaction. The S3 approach of uploading multiple parts of a file and then joining them together in another request would also work for CouchDB-FDB. While it _could_ be done, there's no interest in the CouchDB project in supporting it.


Exactly. Almost all the time you would be better off saving the attachment to an object store. However, I think I found that small edge case where the attachment system was perfect. It was essential to store the binary Yjs doc with the couch document, since it needed to be synced to clients along with the main document. Saving it to an object store is not viable due to the overhead during syncing.


yup. the purpose of couchdb's original attachment support was "couchapps": the notion that you'd serve your entire application from couchdb. Attachments were therefore for HTML, JavaScript, image, and font assets, which are all relatively small. The attachment support in CouchDB <= 3.x is a bit more capable than that due to its implementation, but storing large binaries was never strictly a goal of the project.


> a CouchDB like DB with native support for a CRDT toolkit such as Yjs or Automerge would be awesome, when syncing mirrors you would be able to just exchange the document state vectors - the changes to it - rather than the whole document

syncedstore -> https://syncedstore.org/docs/sync-providers#y-indexeddb-for-...


SyncedStore is a brilliant Yjs reactive store for SPAs, but it's not a database. It's like an automatically distributed, real-time, reactive redux/vuex for collaborative apps.

The y-indexeddb you linked to is actually part of the Yjs toolkit; it is a way of persisting a Yjs document in the browser for offline editing in a PWA. It doesn't provide a way to sync a whole (or a subset of a) database like Couch/PouchDB does. It's a very important part of the Yjs toolkit but doesn't do what I'm describing (it's just a key-value store).
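
For context, wiring it up is a couple of lines; it persists the doc locally but does no server replication on its own (the 'my-doc' name is arbitrary):

    import * as Y from 'yjs'
    import { IndexeddbPersistence } from 'y-indexeddb'

    const ydoc = new Y.Doc()
    const persistence = new IndexeddbPersistence('my-doc', ydoc)

    // Resolves once the locally stored state has been loaded.
    persistence.whenSynced.then(() => {
      console.log('loaded offline state from IndexedDB')
    })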


If all of your data is in Yjs, then you can use SyncedStore to efficiently send updates between clients and y-indexeddb to store the data when the app is offline. It's not exactly what was asked for, no, but depending on the application that setup can substitute for a central database.


Why don't you open source your work? Otherwise, can you contact me? Maybe I can take over this work on CouchDB; we have to do it anyway, and we would open source it.


the work is all open source on the CouchDB main git repo.


I don't see a fundamental reason for an attachment size limit. I guess it would just need to be implemented by breaking the attachment into multiple keys? There may be some overhead, but this seems valuable because it allows large attachments to be split across servers as required.


When you chunk it, you have to deal with what happens if that process is interrupted. So it's not trivial (though solvable), but it's exactly the kind of atomicity you want the new engine to handle.


I think the person you're replying to is saying that the document should be split across keys inside the implementation, i.e. split across the FDB keyspace, not split by the user at the application level. That's the approach you almost always have to use for 'large' values: FoundationDB has size limits on the k/v pairs it can accept, and splitting documents and writing those chunks in small transactional batches is the recommended workaround (along with a final 'switch over' transactional write that makes the complete document visible all at once).
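
A sketch of that staged pattern, continuing the illustrative Node-binding example from above: chunks go under a unique staging prefix in small transactions (keeping each one under FDB's ~10 MB / 5 second limits), and one final pointer write makes the whole attachment visible atomically.

    import { randomUUID } from 'node:crypto'

    async function writeLargeAttachment(db, docId, name, bytes) {
      const uploadId = randomUUID()
      const CHUNK = 100_000
      const BATCH = 50 // ~5 MB per transaction, well under the limits

      for (let n = 0; n * CHUNK < bytes.length; n += BATCH) {
        await db.doTransaction(async (tn) => {
          for (let i = n; i < n + BATCH && i * CHUNK < bytes.length; i++) {
            const key = `staging/${uploadId}/${String(i).padStart(8, '0')}`
            tn.set(key, bytes.subarray(i * CHUNK, (i + 1) * CHUNK))
          }
        })
      }

      // The switch-over: one small transaction flips a pointer key, so
      // readers never observe a half-written attachment. Chunks from a
      // previous upload can be garbage-collected afterwards.
      await db.doTransaction(async (tn) => {
        tn.set(`att-ptr/${docId}/${name}`, uploadId)
      })
    }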


If I remember the FDB docs correctly, there's also a time limit on transactions that further limits the feasible max size.


Reminds me of when a team I worked on had to migrate from one database to another (we were the only team left using the old one, and no one was supporting it internally), but the new one had a 22 MB (or was it 44 MB?) limit on total transaction size, while the previous one did not (AFAIR). Someone worked on splitting writes into several transactions (the bulk was really due to long recorded conversations, forum-like messages related to specific data), but overall it changed how things worked and had some issues initially... Who would've thought you would need that, years from the day it was originally designed?


But is the size limit small enough to affect realistic usage? Don't you hit performance issues anyway if you use a CRDT implemented in JavaScript, running in the browser, with large files?


So yes, a particularly large document is not the norm, but it can happen.

JavaScript CRDTs can be quite performant, see the Yjs benchmarks: https://github.com/dmonad/crdt-benchmarks


This is too bad. I understand there is likely a ton of complexity in making this switch but I think it still leaves CouchDB with a frustrating problem which is document conflicts within a given cluster. Client <-> Server conflicts are very understandable but when you might unexpectedly get a document conflict from two server instances replicating with each other, you're just bound to run into a bunch of issues.

To have multi-master work properly you basically need Strong Eventual Consistency via CRDTs which most databases don't natively support (I think only Riak). Otherwise, you're better off switching to a single writer model.


just a side note, but crdts only help you with online/offline or master-master replication when you are fine with deterministically losing one of the edits. in all cases where losing an edit means data loss, you cannot avoid application-specific conflict resolution, which is exactly what you need to do in couchdb anyway.


I'd like to understand more about the difficulties with running an FDB cluster in production.

Running it locally seemed pretty simple, and AIUI there are kubernetes operators for deploying a cluster. Can anyone provide some insight into where the difficulties lie?


the difficulty starts when something goes wrong. FDB has a lot of moving parts (for good reasons; this is not to speak ill of FDB), and that conceptual complexity is a lot more than what CouchDB has today with its very basic Dynamo sharding model. Resolving issues with CouchDB is also not trivial, make no mistake, but there is generally a lot less going on.


Sidenote: I've heard FoundationDB was used for CloudKit, but is it also used for iMessage?

It seems like its transactional properties would be quite well suited to something like a messenger service (where the order of messages matters, especially with e2e encryption).


pretty sure Cassandra is used for iMessage, although that may have changed after Apple acquired FoundationDB.


On device iMessage is SQLite I think. Backend not sure.


So what's the deal with the unpopularity of CouchDB?

It seems like a compelling database, but I've yet to run into it in the wild.


Beyond the meta of it being old/mature and thus not continually piercing the tech news space with releases, etc.:

Querying in a more ad-hoc way (vs. building indexes ahead of time and querying by key, etc) is a bit janky / not 1st class (I think mango addresses this but not entirely sure).

The runtime being Erlang? It certainly seemed to be the cause of some issues when I tried to run it in WSL, or at least my lack of knowledge of Erlang made diagnosing them more trouble.

The JS query server engine is/was fairly old (I think it might have jumped to a more recent version of SpiderMonkey at some point), and it's hooked up in a way that, while more modular, limits performance (documents have to be serialized to/from the engine in another process, rather than being passed in natively).

The authorization model is... unique. You can limit, down to a doc/field level, who can submit changes via validate_doc_update(...) in a design doc. So it's possible to allow those with a reviewer role to edit only a notes field on a document, while the user named in the author field has full access to the other fields. But read access is at the database level, as in you can either read the db or not.
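
A sketch of that reviewer scenario (field names are illustrative). validate_doc_update functions live in a design doc, and CouchDB runs them on every write, rejecting the update when they throw:

    function (newDoc, oldDoc, userCtx, secObj) {
      var isReviewer = userCtx.roles.indexOf('reviewer') !== -1;
      if (oldDoc && isReviewer && userCtx.name !== oldDoc.author) {
        // Reviewers may change nothing except the notes field.
        for (var key in newDoc) {
          if (key !== 'notes' && key.charAt(0) !== '_' &&
              JSON.stringify(newDoc[key]) !== JSON.stringify(oldDoc[key])) {
            throw({ forbidden: 'reviewers may only edit the notes field' });
          }
        }
      }
    }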

The way around this for "private" storage is enabling a feature that automatically creates a db per user and assigns them rights, but this is more complicated to manage client-side (two dbs to talk to), and replication becomes even more of a nightmare if data needs to be shareable instead of just private.


Good observations, let me add the current state to that:

> Querying in a more ad-hoc way (vs. building indexes ahead of time and querying by key, etc) is a bit janky / not 1st class (I think mango addresses this but not entirely sure).

Mango does address this.
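
For anyone who hasn't seen it, an ad-hoc Mango query is just a JSON selector POSTed to /db/_find; no predefined view is needed, though an index on the sort field is required for sorted queries (the URL and field names here are illustrative):

    const res = await fetch('http://localhost:5984/mydb/_find', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        selector: { type: 'post', published: true },
        sort: [{ created_at: 'desc' }],
        limit: 10,
      }),
    })
    const { docs } = await res.json()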

> The JS query server engine is/was fairly old

On the one hand, running an old JS engine isn't big trouble for CouchDB, especially with transpilation tools available, but we now support modern SpiderMonkey up to version 91. The main benefit for CouchDB users is modern JS syntax being available.

> The authorization model is... unique.

per-doc-auth is in the works, no ETA, and the first iteration is going to be limited, but it’ll address the main issues for db-per-user users.


I would not call it "unpopular" in the sense that it does have a strong community base, and most people that once "got it" really loved it. I know people who tell me how they miss being able to work with it and keep using it privately even after switching jobs, etc.

Unpopular in the sense of "not popular" (too few people currently know about it or consider it) is true.

For one, couchdb does not have a major VC-backed owner with a huge marketing budget. This also has good aspects: as a true apache project, no single company can just take over or push major changes against the community motivated by its investment structure.

From a technical perspective, the admin interface "fauxton" never really felt finished, and debugging views is not welcoming for new users. Also, creating good and working indexes is critical but still too hard for novices, even with mango syntax, especially as devs are now used to things like query autocompletion, friendly error messages/warnings while typing, or simple GUIs. The managed hosting story is also not compelling, as the forced move to ibm cloud was a big step back for many non-corporate users after the cloudant acquisition, and other players seem too small/niche to consider.

Apart from traditional marketing, couchdb does not create a lot of news. As it is just REST, you don't need client APIs or deeply integrated frontend libraries, and the feature scope is quite settled. I have email alerts for hotfixes; apart from that, I check the progress once a year, and things just work.


npm does (or did) run on it, https://github.com/npm/npm-registry-couchapp , if that's what you mean by in the wild.


I've used it; it's pretty decent, given you understand the internals.


It is/was nice, just an early NoSQL DB with a lot of interesting features; better options simply came along and took its mindshare. We used it about 11 years ago for an internal marketing CMS, and the replication and attachment support were a good fit.


What's out there with a better client device sync option?

This is something I've been looking for for a few PWAs that need to operate on bad/no network, and most other solutions are either "build your own entire sync setup" or magic-in-a-box you can't tune.

With Couch/Pouch, I can sync with a filter (or several filters) to make sure the subset of data I need is on the device.
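
For the curious, a filtered (selector-based) sync in PouchDB looks roughly like this; the database names and selector are illustrative, and selector filtering needs CouchDB 2.x+ on the remote side:

    import PouchDB from 'pouchdb'

    const local = new PouchDB('projects')
    const remote = new PouchDB('https://couch.example.com/projects')

    // Replicate only the matching subset, continuously, in both directions.
    local.sync(remote, {
      live: true,
      retry: true,
      selector: { owner: 'me', archived: false }, // or filter: 'ddoc/filtername'
    })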


Yeah, agreed, that's really cool. The closest I've seen is Apollo Client, but to your point, you have a lot less fine-grained control.


I would strongly object to "better options came about". I am not debating that maybe a better fit for your specific problems came along, but for the general case of couchdb's sweet spot there are no obvious better alternatives. The sweet spot being "a schemaless json database with a rest api and first-class support for master-master and online/offline replication that values your data safety and reliability first and everything else second".


Yes, what I meant by "better options" was for devs just looking for a schemaless/document NoSQL database. They didn't need CouchDB's sweet spot, basically. The project I mentioned was a great fit, though.


What better options would you have in mind? Asking out of curiosity; I don't follow this space closely.


It's been a while, but it seems like many people wanted something simpler, like MongoDB, for a NoSQL document database. CouchDB's map/reduce queries were hard to get people's heads around, many people didn't need attachments, etc.


I would love to hear about how open source project maintainers balance corporate-sponsored contributors' goals with the project's existing goals. E.g. do they plan out areas of work prioritised by what helps everyone first, and what only helps the corp second? How does it work?


What’s the simplest client or way to use foundationDB? I was excited for this because FDB is somewhat unintuitive to use and deploy


The FoundationDB Document Layer is compatible with the MongoDB 3.x API: https://github.com/FoundationDB/fdb-document-layer. And you get transactional integrity.

I stopped using MongoDB and switched to this.


The document layer seems to be unmaintained since the end of 2019 and has quite a few bugs. The 1.8.x releases fixed a bunch of bugs but also dropped transaction support. It honestly seems like a great idea, but I'm guessing no one at Apple could justify the time spent working on it.


Amazing - much appreciated. How is it going compared to mongo?


Better than MongoDB. Easy to scale up. And no MongoDB gotchas for transactions.

I use my FoundationDB cluster as a MongoDB alternative and a Redis alternative. Only one cluster to maintain, and two types of functionality! I have tried setting up and maintaining clusters of MongoDB and Redis in the past, and it was horribly complicated. A FoundationDB cluster is so much easier to set up and maintain. And it gives me the functionality of both Redis KV and MongoDB.


I can see MongoDB. How did you find the performance of running FDB as a Redis alternative vs actual Redis?


Does it have something to do with Apple being FoundationDB's lead developer?


not really, no.



