The data rules worth $40k a day (tinybird.co)
62 points by __all__ on July 21, 2022 | 65 comments


The example shown has 14GB of data, which is absolutely tiny, yet it can still rack up $40k in costs per day if used incorrectly (i.e. less than optimal, but not egregiously wrong)? Why would you use this?


Scanned 6.67GB and processed for 2.41s, cost $40,340.16 / 86,400 = $0.4669. That's more than the hourly cost of an AWS on-demand a1.4xlarge with 16 vCPU and 32GB of RAM.

Can’t think of a worse advertisement for their product.


Seriously, when I read this, I thought: there is no way I will ever use the product they mention in this example, TinyBird.

After reading it, I was curious what the writer's relationship was with TinyBird. Was he/she a recent user?

Then I put my palm over my face...


Plus, on what planet would you be refreshing that query every second of the day?


Would that be the same planet where you can refresh a query every second that takes 2.4s to complete? How does that work?


The one where some dev unintentionally gets the query into a loop and you discover you're bankrupt the next day.


This happens more often than you might think…


Never ever become 'the SQL guy' in your group. No one wants to learn it.

SQL is the intersection of set theory, computational theory, network usage, cpu usage, and laziness.

"your table is 120 columns wide and nothing but strings with no rules of the data types and no indexes, sooooo your perf is a bit slow huh?" "yeah how did you know?" "wild guess"


Among my favorites: "NULL" vs NULL, dates/timezones/UTC vs. local time, anything related to data warehousing, and transferring billions of records every day from SaaS applications.


We should start a business selling data intelligence to solve this problem.

Maybe call it... business intelligence?


This headline is egregiously misleading clickbait, since it is based on an absurd and self-contradictory hypothetical:

"If my product had 10,000 daily active users generating 100 events per day, I’d easily hit that size in less than a year ... If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybird’s Pro-plan"

There is no plausible business sense in refreshing the display of a year's worth of data every second, and even scaling back to a likely still-unreasonably-frequent refresh rate of once per hour, you're down to just $11/day.
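
Back-of-envelope behind that $11 figure, assuming the cost scales linearly with the number of refreshes:

    -- one refresh per second for a day = 86,400 refreshes at $40,340.16
    SELECT 40340.16 / 86400      AS cost_per_refresh,     -- ~$0.47
           40340.16 / 86400 * 24 AS cost_per_day_hourly;  -- ~$11.21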


If you want to stop out-of-control spend, have your analysts learn SQL. Avoid database systems that charge per query (like tinybird), and if you do use one, make damn certain your people know SQL. Ignore and preferably fire people who mention things like "data lakes".


1000% this. Almost every tool I've seen that is supposed to "avoid data analysts having to learn SQL" is far more costly, requires engineering staffing, performs much slower, and sometimes requires the analyst to learn some esoteric, less powerful language instead ...

If your job is to slice & dice data all day and you can't be bothered to learn SQL, I don't know what to say.


The data lake problem is specifically due to microservices or SOA. People love to separate customers from orders until they realize that you want to do complex filtering on customers and join it to their orders. Then everyone says "oh crap" when they realize they've created a problem without a great solution.


Yes. Fundamentally there is almost no data that is not relational in some respect. In any given document you can almost instantly start picking out "OK, that could be a foreign key...", and if one does not exist, it certainly will before too many sprints go by. And most usages of NoSQL are essentially equivalent to a row within a table, especially given that many NoSQL solutions have bugs or performance issues with deeply nested documents (I ran into that with SOLR: deep pagination can give incorrect results with nested documents).

Personally I view a true document (not a table row turned into JSON) as being the deeply nested kind, ideally generated from the relational data itself, to allow different "dimensionalities" to be represented without needing pivots/windows/analytical queries, and that's very seldom what I see it being used for in practice. Again, most people just have an RDBMS row, but stored as JSON.

Example: in the "Netflix" case, your movies, your actors, your users, your likes, etc. are all relational, and then you build a document collection that is good for searching movies, a collection that is good for displaying user data/history/settings, a collection for displaying actors' filmography, etc., but all are generated from the same actual, consistent relational data.
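
A minimal Postgres-style sketch of what "generated from the relational data" could look like (table and column names are hypothetical):

    -- Hypothetical: build a denormalized "filmography" document per actor
    -- from the relational source of truth, ready to load into a search store.
    SELECT a.actor_id,
           a.name,
           json_agg(json_build_object(
               'movie_id', m.movie_id,
               'title',    m.title,
               'year',     m.release_year
           ) ORDER BY m.release_year) AS filmography
    FROM actors a
    JOIN movie_actors ma ON ma.actor_id = a.actor_id
    JOIN movies m        ON m.movie_id  = ma.movie_id
    GROUP BY a.actor_id, a.name;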


Storing JSON data in a highly structured RDBMS table can be problematic if any document contains arrays or nested documents.

I built a new general-purpose data management system that uses key-value stores I invented to attach metadata tags to objects. These key-value stores can also be used to create relational tables.

Because each table is basically a columnar store, I can map multiple values to each row key to create a 3D table. It seems ideal for importing JSON data, where any item in a document can be an array of values. I am trying to figure out how useful this system might be to the average DBA or NoSQL user.

See a quick demo at https://www.youtube.com/watch?v=1b5--ibFhWo


What I'm saying is: use the relational DB for OLTP, but export to NoSQL in a JSON document format, in whatever document shapes are efficient for various services. And that can be multiple different shapes generated from the same set of relational "ground truth", if various services need different "views" to run efficiently.

The idea is you always have a relational "source of truth" and optimize that for OLTP, but also get the scalability benefits of documents/microservices/etc by having data already pre-coalesced/pre-digested into your correct format(s), so you're not doing complex analytical/window/aggregation queries on the RDBMS for every request. You run the analytical queries once, convert the result to json, and store that in the NoSQL.

Of course you still potentially have some "sync time" between the OLTP and the final commit to all the various nosql collections... unless you hold OLTP locks until everything is synced, which would be excessive. But this goes back to CAP and there's no magic wand for that - you can either put everything inside the RDBMS and take the performance hit, or you can have external nosql read replicas and accept the inconsistency due to the sync time, or you can hold locks until both systems are consistent at the cost of "availability" (updatability).


How different is this from using pg's jsonb field type, which is also queryable?

What are the advantages/disadvantages? Or what am I misunderstanding?
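
(For reference, a minimal sketch of querying a jsonb column in Postgres; the table and fields are hypothetical:)

    -- A GIN index on the jsonb column supports containment (@>) lookups
    CREATE INDEX idx_docs_payload ON docs USING gin (payload);

    SELECT id,
           payload->'customer'->>'name' AS customer_name
    FROM docs
    WHERE payload @> '{"status": "shipped"}';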


To be honest, I haven't played around with the jsonb feature of pg enough to know which is better. I do know that queries in my system run about 10x faster than regular pg tables for the same data set. Also, my tables do not need a separate indexing step to achieve maximum speed for any query. Do you have a data set in pg you created using jsonb? If you want to try my system, the beta is available for free download at: https://didgets.com/download


Wow, I've previously had exactly the same feelings about a couple of people from whom I heard "data lakes".


Only thing worse is "data lakehouse"


I couldn't help but laugh the first time I saw that.

Should we go ahead and file copyright on Data Ocean and Data Beachhouse?


Careful, if you build wrong you'll get wiped out by a data hurricane.


“We used to dream about having our data in a warehouse! We had to have our data in a lake!”


I always wonder, who looks at their pricing model and decides "Paying multiple dollars to query on a few GB of data sounds reasonable"?


Seriously? A $10k dedicated server would have a payback period measured in minutes.


I don't think it's always that easy. I work on some large databases (1TB+) where even queries using indexed columns still take prohibitively long to run. Sure, we could look into partitioning (and we likely will), but then progressively more work falls on engineering to keep the system running to the standards of the analysts, though not necessarily of the product.

BigQuery in these cases has been very useful. Everything I throw at it returns in seconds. I can also do more text-comparison queries that on a standard RDBMS would have required a full-text index (which would probably be huge). But with BigQuery even those queries take seconds, and I don't have to tie up production resources to run them.
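
For the partitioning route, a minimal Postgres declarative-partitioning sketch (hypothetical table; time-bounded queries then only scan the relevant partitions):

    -- Hypothetical: range-partition a large events table by month
    CREATE TABLE events (
        id         bigint      NOT NULL,
        user_id    bigint      NOT NULL,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2022_07 PARTITION OF events
        FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');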


Sounds like you need a better index :-)

If you need fulltext search, I've enjoyed using Sphinx and Lucene in the past. Is the column you want to do fulltext on 1 TB? 1 TB of RAM in 128 gig DIMMs is <$10k these days, so might as well get at least that much if you're running anything like the bills these guys are talking about.
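
If staying inside Postgres rather than reaching for Sphinx/Lucene, the built-in full-text search looks roughly like this (a sketch with hypothetical names):

    -- Expression GIN index so the text search doesn't scan the whole table
    CREATE INDEX idx_docs_body_fts
        ON docs USING gin (to_tsvector('english', body));

    SELECT id
    FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'refund & delay');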


Importantly it depends on how often these things are run and who has to maintain a server.


The example in the article is $40k/day for querying 14 GB of data every second. You can do that on a ~$100/month machine or a somewhat modern laptop.


Our entire hundreds-of-millions-of-dollars business, with hundreds of millions of orders of historical data, can be "explored" by a data scientist on an average MacBook.

If your data scientists can't do data science, then that's pretty bad.


It depends on the business unit. Some don't have the ability to run a technology group to run a $10K dedicated server.


Query bills of $40k a day will pay for a lot of staff, or a lot of outside expertise to administer it for you.


I feel like for 1.4 million USD a year I could solve that problem for them.


But they have the budget to run $10K of queries a day?


The person who doesn't have to pay the bill.


I think it’s worth mentioning that tinybird doesn’t necessarily charge per query, the billing model seems more geared towards API usage, though queries are the underpinning.

Also, it's not unreasonable to see people spending $10K+ a day in Snowflake because of bad practices just like this.


I heard that at a certain Professional Social Network company, some of their analysts would slam their DB cluster with inefficient queries until it rebooted, and the first one to rerun their query would finally get a result.

Learn SQL ffs.


>Avoid database systems that charge per query

oh my god

I'm currently at $13k/mo in DynamoDB costs because of this, whereas an SQL database meeting the same requirements costs $2k/mo.


Materialized views are damn near magic for solving slow-query issues for tools that don't need real-time results (e.g. daily reporting). They essentially act as a cache of a query at a given point in time that you're able to refresh whenever you want.

https://www.postgresql.org/docs/current/rules-materializedvi...
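
A minimal Postgres sketch (names are hypothetical):

    -- Hypothetical: precompute a daily report once, serve reads from the result
    CREATE MATERIALIZED VIEW daily_orders AS
    SELECT order_date, count(*) AS orders, sum(total) AS revenue
    FROM orders
    GROUP BY order_date;

    -- a unique index lets REFRESH ... CONCURRENTLY avoid blocking readers
    CREATE UNIQUE INDEX ON daily_orders (order_date);

    -- rerun on whatever schedule the report needs (e.g. nightly via cron)
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_orders;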


> Its only significant weakness now is in materialized views, with their lack of incremental refresh.

> That work towards incrementally updated views is happening and progressing. For now, it's a separate extension, though: https://github.com/sraoss/pg_ivm.

https://news.ycombinator.com/item?id=32098603


I just sent this to our devops guy with the headline "INVESTIGATE FEASIBILITY, I WANT THIS ASAP"


relational databases do best with facts. a proper materialized view is just another fact.

in my experience materialized views are critical for most large databases.


One of the many reasons I wish we hadn't chosen MySQL. Our internal dashboard has loads of "reports" that are basically tabular displays of queried data with optional filtering and sorting on every column, plus pagination. It's near impossible to make any of them performant, because we can't optimize for a specific use case given the myriad combinations of filters and sorts different users might apply. We've tried to explain this many times over, even bringing in a DBA consultant who said the same thing, but we're still met with "why is this so slow?" on a regular basis.

Materialized views wouldn't be a silver bullet, but it would certainly help by allowing us to "cache" all of the joins and pre-processing into indexable views.


You should schedule a "create or replace table" statement every N hours/days/etc. It will be effectively the same as a scheduled materialized view refresh.
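
Since the parent is on MySQL, which has no CREATE OR REPLACE TABLE, the equivalent scheduled swap might look like this (a sketch; table names are hypothetical):

    -- Rebuild the report into a shadow table, then swap it in atomically
    CREATE TABLE report_new AS
    SELECT c.id AS customer_id, c.name,
           count(*)     AS orders,
           sum(o.total) AS revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name;

    RENAME TABLE report TO report_old, report_new TO report;  -- atomic swap
    DROP TABLE report_old;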


seriously, does google "query" all the documents that you might be interested in? no, i argue google makes mat view(s) of your interests by querying the response database and specifically NOT the actual documents.

remember, google only needs to be "kinda" CORRECT, unlike a relational db. try 'plaining that to management.


I've been reading up on this as a potential solution; I may have to finally take it seriously and try it out.


the "t" at the end of "fact" implies past tense. if a materialized view is a statement of fact then you are safe to deploy in a relational schema, i claim. think about it.

https://www.protechtraining.com/blog/post/the-value-of-value...


is caching the correct analogy?

the stars you see died eons ago. special relativity is your friend. we witness "immutability" casually.

nothing in the relational model that i am aware of says that a materialized view cannot be a simple immutable table of "facts". same input same output. unambiguous.

simply use the dynamic queries to summarize the "facts" into, typically, small mat view tables, and your gui will never lie.

https://www.youtube.com/watch?v=-6BsiVyC1kM


I'm really curious how people use these types of platforms in practice. It seems really easy to screw yourself over when you're "experimenting". The advice seems really obvious, but I'm sure I'd absent-mindedly run a super naive query at some point and potentially cost my employer a lot of money.

I worked on a system to capture the production test data at a semiconductor company. We had trillions of rows and terabytes of data. While we were figuring stuff out, I'm sure I ran queries that scanned the entire dataset accidentally. I imagine one of those queries would have cost at least $1k to run. Our entire setup cost less than $10k a month to run on AWS, regardless of how many queries we ran. I can't imagine spending $40k on 14GB of data regardless of what you were doing.


Yeah I mean I look at this example and it’s obviously pretty “extreme” - nobody is running that query every second (at least I hope not) - but I think the principle is important. You run that query once a day and it’s still costing you 1Kx what it would if it were optimized.


> If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybird’s Pro-plan.

What? Even if I used SQLite on my laptop and queried this thing every second, I'd still use <$3 a day. Also, this platform has no concept of caching? Don't understand this post at all, total clickbait based on an inefficiency in your platform you really shouldn't be advertising.


I can think of another way to save $40k/day: don't use this product.


I remember when I studied databases in college, the execution plan took care of what to do first: joins, filters, etc. (I remember doing exercises with paper and pen where I was given a query and a few tables, and I had to build the optimal execution tree).

So this product is not only expensive, but I also have to think of the execution plan myself? Or am I wrong and modern DBs don't do that?
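
They generally still do; a quick way to check on any particular engine is EXPLAIN (a sketch with hypothetical tables):

    -- The planner, not the query author, decides the filter/join order.
    EXPLAIN
    SELECT *
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.country = 'DE';
    -- A typical plan applies the country filter while scanning customers
    -- and only then joins to orders, regardless of how the SQL is written.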


I've understood it to be a bit more nuanced than that. If you were to perform two queries:

Query 1 -> join Table A with Table B, both have 1M records

Query 2 -> Filter Table A to 10k records, then join Table A with Table B (1M records)

I would expect Query 2 to execute faster - I don't think the exec plan would've optimized Query 1 equivalently.


So I read the full article and it's interesting. Maybe the title is a little misleading, but when you actually read it I think they make it pretty clear that it's an extreme example designed to prove a simple point: bad data practice has a cost. For somebody who isn't super data savvy but wants to get into the space, I actually found it helpful.


Mythical creature vs 'Real-World Sighting'?

I try never to underestimate the potential for someone to do something really stupid, and I'm sure there are some egregious examples out there where a DB was set up and run such that outrageous charges resulted; but has anyone seen a situation in real life anywhere close to this kind of example?

Even if there was an instance where a poorly designed and implemented data set caused a $40K charge for a single day, I wonder how long it would take for the bean counters to notice and take action?


I'm sorry, but if your egress costs are $40k/day, I think it's time to consider leasing your own pipe. $1.2 million per month will get you one hell of an internet connection.


I just skimmed through the article tbh, but I saw 14 GB and what looked like a predicate pushdown optimization. I think DuckDB could handle that on my 16GB mac.
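
Roughly what that looks like locally (a sketch; assumes the data has been exported to Parquet files):

    -- Hypothetical DuckDB session: DuckDB can push the filter down
    -- into the Parquet scan instead of reading all 14 GB.
    SELECT user_id, count(*) AS events
    FROM read_parquet('events/*.parquet')
    WHERE event_time >= DATE '2022-07-01'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10;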


Lost me at refreshing every second. Probably only needs updating a couple of times per day. This is why you drill into requirements rather than blindly accepting them without question.


> What if I told you I could save you $40,000 a day?

Learn from history; specifically, learn SQL. Or: people who think they are too important to learn SQL aren't that important after all.


At $40k/day in savings, you could easily build your own DC.


$40k/day. no way.


Isn't this all extremely elementary and common knowledge for someone designing these workflows? Next blog post about how to not waste money by just not doing something extremely poorly optimized.


Don't use floating point for monetary calculations.
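
In SQL terms that means an exact NUMERIC/DECIMAL type rather than FLOAT/DOUBLE (a sketch):

    -- Hypothetical: exact decimal type for money, avoiding binary
    -- floating-point rounding drift (0.1 + 0.2 != 0.3 style errors)
    CREATE TABLE invoices (
        id    bigint         PRIMARY KEY,
        total numeric(12, 2) NOT NULL
    );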



