
The data lake problem is specifically due to microservices or SOA. People love to separate customers from orders until they realize that you want to do complex filtering on customers and join it to their orders. Then everyone says "oh crap" when they realize they've created a problem without a great solution.
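
A minimal sketch of that "oh crap" moment (all table and service names are hypothetical; Python with sqlite3 standing in for the database, and plain dicts standing in for the services):

    import sqlite3

    # Single database: "total spend per EU customer" is one join.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 10.0), (12, 2, 99.0);
    """)
    print(db.execute("""
        SELECT c.id, SUM(o.total) FROM customers c
        JOIN orders o ON o.customer_id = c.id
        WHERE c.region = 'EU' GROUP BY c.id
    """).fetchall())  # [(1, 35.0)]

    # Split across services (stubbed as dicts), the join moves into the client:
    # filter customers in one service, then fetch orders per customer (N+1).
    CUSTOMER_SVC = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
    ORDER_SVC = {1: [{"total": 25.0}, {"total": 10.0}], 2: [{"total": 99.0}]}
    eu = [c for c in CUSTOMER_SVC if c["region"] == "EU"]
    print({c["id"]: sum(o["total"] for o in ORDER_SVC[c["id"]]) for c in eu})  # {1: 35.0}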


Yes. Fundamentally, there is almost no data that is not relational in some aspect. In any given document you can almost instantly start picking out fields and thinking "OK, that could be a foreign key...", and if one does not exist yet, it certainly will before too many sprints go by. And most usages of NoSQL are essentially equivalent to a row within a table, especially given that many NoSQL solutions have bugs or performance issues with deeply nested documents (I ran into that with SOLR - deep pagination can return incorrect results with nested documents).

Personally, I view a true document (not a table row turned into JSON) as the deeply nested kind, ideally generated from the relational data itself, so that different "dimensionalities" can be represented without needing pivots/windows/analytical queries. That's very seldom what I see it being used for in practice; again, most people just have an RDBMS row stored as JSON.

For example: in the "Netflix" case, your movies, your actors, your users, your likes, etc. are all relational, and then you build one document collection that is good for searching movies, another that is good for displaying user data/history/settings, another for displaying actors' filmographies, and so on - but all are generated from the same actual, consistent relational data.
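
A hedged sketch of that generation step (schema and names invented for illustration; Python with sqlite3):

    import sqlite3, json

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE roles  (movie_id INTEGER, actor_id INTEGER);
        INSERT INTO movies VALUES (1, 'Heat');
        INSERT INTO actors VALUES (1, 'Al Pacino'), (2, 'Robert De Niro');
        INSERT INTO roles  VALUES (1, 1), (1, 2);
    """)

    # Documents shaped for movie search: one nested doc per movie.
    movie_docs = [
        {"id": mid, "title": title,
         "cast": [name for (name,) in db.execute(
             "SELECT a.name FROM actors a JOIN roles r ON r.actor_id = a.id "
             "WHERE r.movie_id = ?", (mid,))]}
        for (mid, title) in db.execute("SELECT id, title FROM movies").fetchall()
    ]

    # A different "dimensionality" of the same facts: filmography per actor.
    actor_docs = [
        {"id": aid, "name": name,
         "films": [t for (t,) in db.execute(
             "SELECT m.title FROM movies m JOIN roles r ON r.movie_id = m.id "
             "WHERE r.actor_id = ?", (aid,))]}
        for (aid, name) in db.execute("SELECT id, name FROM actors").fetchall()
    ]
    print(json.dumps(movie_docs[0]))  # {"id": 1, "title": "Heat", "cast": [...]}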


Storing JSON data in a highly structured RDBMS table can be problematic if any document contains arrays or nested documents.
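
A tiny illustration with a hypothetical order document - the array has no home in a single flat row, so flattening forces a child table:

    # A document with an array doesn't fit one flat row:
    doc = {"order_id": 7, "items": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}

    # Flattening it relationally means repeating the parent key per element,
    # i.e. a separate child table, not extra columns:
    item_rows = [(doc["order_id"], i["sku"], i["qty"]) for i in doc["items"]]
    print(item_rows)  # [(7, 'A1', 2), (7, 'B9', 1)]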

I built a new general-purpose data management system that uses key-value stores I invented to attach metadata tags to objects. These key-value stores can also be used to create relational tables.

Because each table is basically a columnar store, I can map multiple values to each row key to create a 3D table. It seems ideal for importing JSON data, where any item in a document can be an array of values. I am trying to figure out how useful this system might be to the average DBA or NoSQL user.
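
Conceptually it looks something like this (a greatly simplified toy model in Python, not the actual on-disk format):

    # Toy model: each column is a key-value store, and a row key
    # may map to several values - that's the "third dimension".
    table = {
        "title": {1: ["Heat"], 2: ["Ronin"]},
        "genre": {1: ["Crime", "Thriller"], 2: ["Action"]},  # multi-valued cell
    }

    def cell(col, row):
        return table[col].get(row, [])

    print(cell("genre", 1))  # ['Crime', 'Thriller'] -- a JSON array fits one "cell"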

See a quick demo at https://www.youtube.com/watch?v=1b5--ibFhWo


What I'm saying is: use the relational DB for OLTP, but export JSON documents to NoSQL in whatever document shapes are efficient for the various services. That can be multiple different shapes generated from the same set of relational "ground truth", if different services need different "views" to run efficiently.

The idea is that you always have a relational "source of truth" optimized for OLTP, but you also get the scalability benefits of documents/microservices/etc. by having the data pre-coalesced/pre-digested into the correct format(s), so you're not running complex analytical/window/aggregation queries on the RDBMS for every request. You run the analytical queries once, convert the result to JSON, and store that in the NoSQL store.
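
A rough sketch of that pipeline's shape (all names invented; a dict stands in for the document store, which would be Mongo/Elastic/etc. in practice):

    import sqlite3, json

    oltp = sqlite3.connect(":memory:")
    oltp.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO orders VALUES (1, 42, 25.0), (2, 42, 10.0);
    """)

    document_store = {}  # stand-in for whatever the services actually read

    # Run the expensive aggregation once, against the relational source of truth...
    for user_id, n, spend in oltp.execute(
            "SELECT user_id, COUNT(*), SUM(total) FROM orders GROUP BY user_id"):
        # ...and publish a pre-digested document shaped for the reading service.
        document_store[f"user:{user_id}:order_summary"] = json.dumps(
            {"user_id": user_id, "order_count": n, "lifetime_spend": spend})

    # Services hit the document store instead of re-running the aggregate.
    print(document_store["user:42:order_summary"])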

Of course, you still potentially have some "sync time" between the OLTP commit and the final commit to all the various NoSQL collections... unless you hold OLTP locks until everything is synced, which would be excessive. But this goes back to CAP, and there's no magic wand for it: you can put everything inside the RDBMS and take the performance hit, or have external NoSQL read replicas and accept the inconsistency from the sync window, or hold locks until both systems are consistent at the cost of "availability" (updatability).


How different is this from using pg's jsonb field type, which is also queryable?

What are the advantages/disadvantages? Or what am I misunderstanding?
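
For reference, this is the kind of thing I mean by queryable (a minimal sketch assuming psycopg2 and a reachable Postgres; connection details are placeholders):

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, body jsonb)")
    cur.execute("INSERT INTO events (body) VALUES (%s)",
                [Json({"user": {"id": 42}, "tags": ["a", "b"]})])

    # -> / ->> navigate into the document; @> tests containment,
    # and a GIN index on the column accelerates that containment test.
    cur.execute("CREATE INDEX IF NOT EXISTS events_body ON events USING GIN (body)")
    cur.execute("SELECT body -> 'user' ->> 'id' FROM events WHERE body @> %s",
                [Json({"tags": ["a"]})])
    print(cur.fetchall())  # [('42',)]
    conn.commit()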


To be honest, I haven't played around with the jsonb feature of pg enough to know which is better. I do know that queries in my system run about 10x faster than against regular pg tables for the same data set. My tables also do not need a separate indexing step to achieve maximum speed for any query. Do you have a data set in pg that you created using jsonb? If you want to try my system, the beta is available for free download at: https://didgets.com/download



