Cloudant/IBM back off from FoundationDB based CouchDB rewrite (apache.org)
137 points by jFriedensreich on March 12, 2022 | 45 comments


While I don’t have enough knowledge of the wider implications of this, it does impact something I was experimenting with last year.

The FoundationDB rewrite would introduce a size limit on document attachments; there currently isn't one. Arguably, attachments are a rarely used feature, but I found a useful use case for them.

I combined the CRDT Yjs toolkit with CouchDB (PouchDB on the client) to automatically handle sync conflicts. Each couch document was an export of the current state of the Yjs doc (for indexing and search), with all changes done via Yjs. The Yjs doc was then attached to the couch document as an attachment. When there was a sync conflict, the Yjs documents would be merged and re-exported to create a new version. The issue is that the FoundationDB rewrite would limit the attachment size, which makes this architecture more difficult; it's partly why I ultimately put the project on hold.
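
A minimal sketch of that merge step, assuming PouchDB in the browser with the Yjs doc stored under an attachment named "ydoc" (all names here are illustrative, not from the actual project):

    // Yjs merges are commutative, so applying every revision's binary
    // update to a fresh Y.Doc yields the merged state.
    import * as Y from 'yjs'
    import PouchDB from 'pouchdb'

    const db = new PouchDB('notes')

    async function resolveConflict(docId) {
      // Fetch the winning revision plus any conflicting revisions.
      const winner = await db.get(docId, { conflicts: true })
      const losers = await Promise.all(
        (winner._conflicts || []).map((rev) =>
          db.get(docId, { rev, attachments: true, binary: true })
        )
      )

      // Merge every revision's Yjs attachment into one fresh doc.
      const merged = new Y.Doc()
      const winnerBlob = await db.getAttachment(docId, 'ydoc')
      Y.applyUpdate(merged, new Uint8Array(await winnerBlob.arrayBuffer()))
      for (const loser of losers) {
        const buf = await loser._attachments.ydoc.data.arrayBuffer()
        Y.applyUpdate(merged, new Uint8Array(buf))
        await db.remove(docId, loser._rev) // drop the losing branch
      }

      // Re-export the merged state for indexing/search and re-attach it.
      const { _conflicts, ...doc } = winner
      await db.put({
        ...doc,
        content: merged.getMap('content').toJSON(),
        _attachments: {
          ydoc: {
            content_type: 'application/octet-stream',
            data: new Blob([Y.encodeStateAsUpdate(merged)]),
          },
        },
      })
    }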

(Slight aside: a CouchDB-like DB with native support for a CRDT toolkit such as Yjs or Automerge would be awesome. When syncing mirrors you would be able to exchange just the document state vectors and the missing changes, rather than the whole document.)
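
For reference, this is what that exchange looks like in plain Yjs today, outside any database: each side sends a tiny state vector and gets back only the updates it is missing.

    import * as Y from 'yjs'

    // Sync two Y.Doc instances by exchanging state vectors and diffs.
    function syncPair(a, b) {
      const diffForB = Y.encodeStateAsUpdate(a, Y.encodeStateVector(b))
      const diffForA = Y.encodeStateAsUpdate(b, Y.encodeStateVector(a))
      Y.applyUpdate(b, diffForB)
      Y.applyUpdate(a, diffForA)
    }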


The (low) attachment size limit at Cloudant is more about service quality and guiding folks toward good uses of the service than about a technical issue.

As others have noted, the solution to storing attachments in FDB, where keys and values have an enforced maximum length, is to split the attachments over multiple key/values, which is exactly what the CouchDB-FDB code currently does.
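
For illustration, the chunked layout looks roughly like this, using the community foundationdb Node binding (the key layout and API version here are assumptions for the sketch). FDB caps values at 100 KB, so the attachment is split across consecutive keys:

    import * as fdb from 'foundationdb'

    fdb.setAPIVersion(620)
    const db = fdb.open()

    const CHUNK = 100_000 // stay under FDB's 100 KB value limit

    async function writeAttachment(docId, name, bytes) {
      await db.doTransaction(async (tn) => {
        for (let i = 0, n = 0; i < bytes.length; i += CHUNK, n++) {
          // e.g. "att/mydoc/photo/00000003" holds chunk 3
          const key = `att/${docId}/${name}/${String(n).padStart(8, '0')}`
          tn.set(key, bytes.subarray(i, i + CHUNK))
        }
      })
    }

This single-transaction version still runs into FDB's roughly 10 MB transaction size cap, which is where the staged approach discussed below comes in.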

The other limit in FDB is the five-second transaction duration, which is a more fundamental constraint on how large attachments can be, as we are keen to complete a document update in a single transaction. The S3 approach of uploading multiple parts of a file and then joining them together in another request would also work for CouchDB-FDB. While it _could_ be done, there's no interest in the CouchDB project in supporting it.


Exactly. Almost all the time you would be better off saving the attachment to an object store. However, I think I found that small edge case where the attachment system was perfect. It was essential to store the binary Yjs doc with the couch document, since it needed to be synced to clients along with the main document. Saving it to an object store is not viable due to the overhead during syncing.


yup. the purpose of couchdb's original attachment support was "couchapps": the notion that you'd serve your entire application from couchdb. Attachments were therefore for HTML, JavaScript, image, and font assets, which are all relatively small. The attachment support in CouchDB <= 3.x is a bit more capable than that due to its implementation, but storing large binaries was never strictly a goal of the project.


> a CouchDB like DB with native support for a CRDT toolkit such as Yjs or Automerge would be awesome, when syncing mirrors you would be able to just exchange the document state vectors - the changes to it - rather than the whole document

syncedstore -> https://syncedstore.org/docs/sync-providers#y-indexeddb-for-...


SyncedStore is a brilliant Yjs reactive store for SPAs, but it's not a database. It's like an automatically distributed, real-time, reactive redux/vuex for collaborative apps.

The y-indexeddb you linked to is actually part of the Yjs toolkit; it is a way of persisting a Yjs document in the browser for offline editing in a PWA. It doesn't provide a way to sync a whole (or a subset of a) database like Couch/PouchDB does. It's a very important part of the Yjs toolkit but doesn't do what I'm describing (it's just a key-value store).
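
For context, wiring it up is a couple of lines; it persists the doc locally but does no server replication on its own (the 'my-doc' name is arbitrary):

    import * as Y from 'yjs'
    import { IndexeddbPersistence } from 'y-indexeddb'

    const ydoc = new Y.Doc()
    const persistence = new IndexeddbPersistence('my-doc', ydoc)

    // Resolves once the locally stored state has been loaded.
    persistence.whenSynced.then(() => {
      console.log('loaded offline state from IndexedDB')
    })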


If all of your data is in Yjs, then you can use SyncedStore to efficiently send updates between clients and y-indexeddb to store the data when the app is offline. It's not exactly what was asked for, no, but depending on the application that setup can substitute for a central database.


Why don't you open source your work? Otherwise, can you contact me? Maybe I can take over this work on CouchDB; we have to do it anyway, and we would open source it.


the work is all open source on the CouchDB main git repo.


I don't see a fundamental reason for an attachment size limit. I guess it would just need to be implemented by breaking the attachment into multiple keys? There may be some overhead, but this seems valuable because it allows large attachments to be split across servers as required.


When you chunk it, you have to deal with what happens if that process is interrupted. So it's not trivial (though solvable), but it's exactly the kind of atomicity you want the new engine to handle.


I think the person you're replying to is saying that the document should be split across keys inside the implementation, i.e. split across the FDB keyspace, not split by the user at the application level. That's the approach you almost always have to use for 'large' values: FoundationDB has size limits on the k/v pairs it can accept, and splitting documents and writing those chunks in small transactional batches is the recommended workaround (along with a final 'switch over' transactional write that makes the complete document visible all at once).
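
A sketch of that staged pattern, continuing the illustrative Node-binding example from above: chunks go under a unique staging prefix in small transactions (keeping each one under FDB's ~10 MB / 5 second limits), and one final pointer write makes the whole attachment visible atomically.

    import { randomUUID } from 'node:crypto'

    async function writeLargeAttachment(db, docId, name, bytes) {
      const uploadId = randomUUID()
      const CHUNK = 100_000
      const BATCH = 50 // ~5 MB per transaction, well under the limits

      for (let n = 0; n * CHUNK < bytes.length; n += BATCH) {
        await db.doTransaction(async (tn) => {
          for (let i = n; i < n + BATCH && i * CHUNK < bytes.length; i++) {
            const key = `staging/${uploadId}/${String(i).padStart(8, '0')}`
            tn.set(key, bytes.subarray(i * CHUNK, (i + 1) * CHUNK))
          }
        })
      }

      // The switch-over: one small transaction flips a pointer key, so
      // readers never observe a half-written attachment. Chunks from a
      // previous upload can be garbage-collected afterwards.
      await db.doTransaction(async (tn) => {
        tn.set(`att-ptr/${docId}/${name}`, uploadId)
      })
    }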


If I remember the FDB docs correctly, there's also a time limit on transactions that further limits the feasible max size.


Reminds me of when a team I worked on had to migrate from one database to another (we were the only team left using the old one, and no one was supporting it internally), but the new one had a 22 MB (or was it 44 MB?) limit on total transaction size, while the previous one did not (AFAIR). Someone worked on splitting writes into several transactions (the bulk was really due to long recorded conversations, forum-like messages related to specific data), but overall it changed how things worked and had some issues initially... Who would've thought you would need that, years from the day it was originally designed?


But is the size limit small enough to affect realistic usage? Don't you hit performance issues anyway if you use a CRDT implemented in JavaScript, running in the browser, with large files?


So yes, a particularly large document is not the norm, but it can happen.

JavaScript CRDTs can be quite performant, see the Yjs benchmarks: https://github.com/dmonad/crdt-benchmarks


This is too bad. I understand there is likely a ton of complexity in making this switch but I think it still leaves CouchDB with a frustrating problem which is document conflicts within a given cluster. Client <-> Server conflicts are very understandable but when you might unexpectedly get a document conflict from two server instances replicating with each other, you're just bound to run into a bunch of issues.

To have multi-master work properly you basically need Strong Eventual Consistency via CRDTs which most databases don't natively support (I think only Riak). Otherwise, you're better off switching to a single writer model.


just a side note, but crdts only help you with online/offline or master-master replication when you are fine with deterministically losing one of the edits. in all cases where losing an edit means data loss, you cannot avoid application-specific conflict resolution, which is exactly what you need to do in couchdb anyway.


I'd like to understand more about the difficulties with running an FDB cluster in production.

Running it locally seemed pretty simple, and AIUI there are kubernetes operators for deploying a cluster. Can anyone provide some insight into where the difficulties lie?


the difficulty starts when something goes wrong. FDB has a lot of moving parts (for good reasons; this is not to speak ill of FDB), and that conceptual complexity is a lot more than what CouchDB has today with its very basic Dynamo sharding model. Resolving issues with CouchDB is also not trivial, make no mistake, but there is generally a lot less going on.


Sidenote: I've heard FoundationDB was used for CloudKit, but is it also used for iMessage?

It seems like its transactional properties would be quite well suited to something like a messenger service (where the order of messages matters, especially with e2e encryption).


pretty sure Cassandra is used for iMessage, although that may have changed after Apple acquired FoundationDB.


On device iMessage is SQLite I think. Backend not sure.


So what's the deal with the unpopularity of CouchDB?

It seems like a compelling database, but I've yet to run into it in the wild.


Beyond the meta of it being old/mature and thus not continually piercing the tech news space with releases, etc.:

Querying in a more ad-hoc way (vs. building indexes ahead of time and querying by key, etc) is a bit janky / not 1st class (I think mango addresses this but not entirely sure).

The runtime being Erlang? It certainly seemed to be the cause of some issues when I tried to run it in WSL, or at least my lack of knowledge of Erlang made diagnosing them more trouble.

The JS query server engine is/was fairly old (I think it might have jumped to a more recent version of SpiderMonkey at some point), and it's hooked up in a way that, while more modular, limits performance (documents have to be serialized to/from the engine in another process, rather than being passed in natively).

The authorization model is... unique. You can limit, down to a doc/field level, who can submit changes via validate_doc_update(...) in a design doc. So it's possible to allow those with a reviewer role to edit only a notes field on a document, while the user named in the author field has full access to the other fields. But read access is at the database level, as in you can either read the db or not.
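
A sketch of that reviewer scenario (field names are illustrative). validate_doc_update functions live in a design doc, and CouchDB runs them on every write, rejecting the update when they throw:

    function (newDoc, oldDoc, userCtx, secObj) {
      var isReviewer = userCtx.roles.indexOf('reviewer') !== -1;
      if (oldDoc && isReviewer && userCtx.name !== oldDoc.author) {
        // Reviewers may change nothing except the notes field.
        for (var key in newDoc) {
          if (key !== 'notes' && key.charAt(0) !== '_' &&
              JSON.stringify(newDoc[key]) !== JSON.stringify(oldDoc[key])) {
            throw({ forbidden: 'reviewers may only edit the notes field' });
          }
        }
      }
    }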

The way around this for "private" storage is enabling a feature that automatically creates a db per user and assigns them rights, but this is more complicated to manage client-side (two dbs to talk to), and replication becomes even more of a nightmare if data needs to be shareable instead of just private.


Good observations, let me add the current state to that:

> Querying in a more ad-hoc way (vs. building indexes ahead of time and querying by key, etc) is a bit janky / not 1st class (I think mango addresses this but not entirely sure).

Mango does address this.
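
For anyone who hasn't seen it, an ad-hoc Mango query is just a JSON selector POSTed to /db/_find; no predefined view is needed, though an index on the sort field is required for sorted queries (the URL and field names here are illustrative):

    const res = await fetch('http://localhost:5984/mydb/_find', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        selector: { type: 'post', published: true },
        sort: [{ created_at: 'desc' }],
        limit: 10,
      }),
    })
    const { docs } = await res.json()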

> The JS query server engine is/was fairly old

On the one hand, running an old JS engine isn't big trouble for CouchDB, especially with transpilation tools available, but we now support modern SpiderMonkey up to version 91. The main benefit for CouchDB users is modern JS syntax being available.

> The authorization model is... unique.

per-doc-auth is in the works, no ETA, and the first iteration is going to be limited, but it’ll address the main issues for db-per-user users.


I would not call it "unpopular" in the sense that it does have a strong community base, and most people that once "got it" really loved it. I know people who tell me how they miss being able to work with it and keep using it privately even after switching jobs, etc.

Unpopular in the sense of "not popular" (too few people currently know about it or consider it) is true.

For one, couchdb does not have a major VC-backed owner with a huge marketing budget. This also has good aspects: as a true apache project, no single company can just take over or push major changes against the community motivated by its investment structure.

From a technical perspective, the admin interface "fauxton" never really felt finished, and debugging views is not welcoming for new users. Also, creating good and working indexes is critical but still too hard for novices, even with mango syntax, especially as devs are now used to things like query autocompletion, friendly error messages/warnings while typing, or simple GUIs. The managed hosting story is also not compelling, as the forced move to ibm cloud was a big step back for many non-corporate users after the cloudant acquisition, and other players seem too small/niche to consider.

Apart from traditional marketing, couchdb does not create a lot of news. As it is just REST, you don't need client APIs or deeply integrated frontend libraries, and the feature scope is quite settled. I have email alerts for hotfixes; apart from that, I check the progress once a year, and things just work.


npm does (or did) run on it, https://github.com/npm/npm-registry-couchapp , if that's what you mean by in the wild.


I've used it; it's pretty decent, given you understand the internals.


It is/was nice, just an early NoSQL DB with a lot of interesting features; better options simply came along and took its mindshare. We used it about 11 years ago for an internal marketing CMS, and the replication and attachment support were a good fit.


What's out there with a better client device sync option?

This is something I've been looking for for a few PWAs that need to operate on bad/no network, and most other solutions are either "build your own entire sync setup" or magic-in-a-box you can't tune.

With Couch/Pouch, I can sync with a filter (or several filters) to make sure the subset of data I need is on the device.
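
For the curious, a filtered (selector-based) sync in PouchDB looks roughly like this; the database names and selector are illustrative, and selector filtering needs CouchDB 2.x+ on the remote side:

    import PouchDB from 'pouchdb'

    const local = new PouchDB('projects')
    const remote = new PouchDB('https://couch.example.com/projects')

    // Replicate only the matching subset, continuously, in both directions.
    local.sync(remote, {
      live: true,
      retry: true,
      selector: { owner: 'me', archived: false }, // or filter: 'ddoc/filtername'
    })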


Yeah, agreed, that's really cool. The closest I've seen is Apollo Client, but to your point, you have a lot less fine-grained control.


I would strongly object to "better options came about". I am not debating that maybe a better fit for your specific problems came along, but for the general case of couchdb's sweet spot there are no obvious better alternatives. The sweet spot being "a schemaless json database with a rest api and first-class support for master-master and online/offline replication that values your data safety and reliability first and everything else second".


Yes, what I meant by "better options" was for devs just looking for a schemaless/document NoSQL database. They didn't need CouchDB's sweet spot, basically. The project I mentioned was a great fit, though.


What better options would you have in mind? Asking out of curiosity; I don't follow this space closely.


It's been a while, but it seems like many people wanted something simpler, like MongoDB, for a NoSQL document database. CouchDB's map/reduce queries were hard to get people's heads around, many people didn't need attachments, etc.


I would love to hear about how open source project maintainers balance corporate-sponsored contributors' goals with the project's existing goals. E.g. do they plan out areas of work prioritised by what helps everyone first, and what only helps the corp second? How does it work?


What’s the simplest client or way to use foundationDB? I was excited for this because FDB is somewhat unintuitive to use and deploy


The FoundationDB Document Layer is compatible with the MongoDB 3.x API: https://github.com/FoundationDB/fdb-document-layer. And you get transactional integrity.

I stopped using MongoDB and switched to this.


The document layer seems to be unmaintained since the end of 2019 and has quite a few bugs. The 1.8.x releases fixed a bunch of bugs but also dropped transaction support. It honestly seems like a great idea, but I'm guessing no one at Apple could justify the time spent working on it.


Amazing - much appreciated. How is it going compared to mongo?


Better than MongoDB. Easy to scale up. And no MongoDB gotchas for transactions.

I use my FoundationDB cluster as a MongoDB alternative and a Redis alternative. Only one cluster to maintain, and two types of functionality! I have tried setting up and maintaining clusters of MongoDB and Redis in the past, and it was horribly complicated. A FoundationDB cluster is so much easier to set up and maintain. And it gives me the functionality of both Redis KV and MongoDB.


I can see MongoDB. How did you find the performance of running FDB as a Redis alternative vs actual Redis?


Does it have something to do with Apple being FoundationDB's lead developer?


not really, no.



