Hacker News | hendiatris's comments

I run

sudo shutdown +15 (or however many minutes you need)

when I need a compute instance and don’t want to forget to turn it off. It’s a simple trick that will save you in some cases.


For expensive GPU instances I have a crontab one-liner that shuts the node down after 2 hours if I don't touch an override file (/var/run/keepalive).

*/5 * * * * [ -f /var/run/keepalive ] && [ $(( $(date +\%s) - $(stat -c \%Y /var/run/keepalive) )) -gt 7200 ] && shutdown -h now


Sort of like cancelling Disney right after signing up for the free trial... nice!


And Brasil, Portugal, and other Portuguese-speaking places


They probably threw a fat CIDR block in their IP blacklist to fight off a spam campaign, and your IP was caught in the dragnet. This is how the big companies do it. They’ll evaluate for risk of false positives and as long as that stays below a threshold, they proceed.


In the lower level arrow/parquet libraries you can control the row groups, and even the data pages (although it’s a lot more work). I have used this heavily with the arrow-rs crate to drastically improve (like 10x) how quickly data could be queried from files. Some row groups will have just a few rows, others will have thousands, but being able to bypass searching in many row groups makes the skew irrelevant.

Just beware of one limit you can hit: the maximum number of row groups per file (2^15).


You may be able to get close with sufficiently small row groups, but you will have to do some tests. You can do this in a few hours of work, by taking some sensor data, sorting it by the identifier and then writing it to parquet with one row group per sensor. You can do this with the ParquetWriter class in PyArrow, or something else that allows you fine grained control of how the file is written. I just checked and saw that you can have around 7 million row groups per file, so you should be fine.

Then spin up duckdb and do some performance tests. I'm not sure this will work; there is some overhead with reading parquet, which is why small files and small row groups are discouraged.


This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn't find anything. It would go a long way toward showing where the bottleneck is for queries.


Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.


Also very interested in the parquet tuning. I have been building my data lake, and most of the optimization I do is just efficient partitioning.


I will write something up when the dust settles; I'm still testing things out. It's a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering, and one that I hope we see more of.

Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...


What query engine are you using?

The optimal file size for Parquet tends to be about 1GiB, so once again the "many small files" problem of Hadoop remains.

Then it's things like, can you organise your data in such a way to take advantage of RLE etc.?


Either Spark or Redshift (serverless)


Parquet tuning has always been like that, ever since it first came out in 2013.

I worry with Iceberg that people think it's just a case of "use an Iceberg table in Snowflake" and boom, amazingly fast querying of data in S3!


how nested is the data in the parquet files?


Look at the website for that polling company. It is bizarre. None of the people on the people page have the company on their LinkedIn pages. Seems to be astroturf.

Edit: look at the photos of the people… AI generated perhaps?


Sqitch is an incredibly underappreciated tool. It doesn't have a business pushing it like Flyway and Liquibase, so it isn't as widely known, but I vastly prefer it to comparable migration tools.


How quickly does the append-only chain grow? What are the storage needs for it?


I've put a lot of thought into managing the storage growth. The chain grows proportionally to system activity, but I've implemented several optimizations to keep it manageable:

1. Efficient proof encoding: Each proof is typically 128 bytes (64-byte operation hash + 64-byte signature). For context, a 1GB system performing ~1000 operations/second would generate roughly 10MB of proof data per minute before optimizations.

2. Smart pruning strategies:

- Automatic pruning of validated proof chains after state transitions

- Configurable retention windows (default: 1 hour) for non-critical proofs

- Merkle tree summarization of older proofs (keeping root hashes only)

- Proof batching for high-frequency operations

3. Storage management:

- In-memory proof cache (default 10,000 proofs)

- Efficient disk serialization format

- Automatic archive rotation

In practice, a typical desktop workload generates about 100-200MB of proof data per day after optimizations. High-security environments can keep full chains (roughly 1-2GB/day), while standard deployments can use pruned chains (~100MB/day).
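A back-of-envelope check of those figures, treating the 128-byte proof size and 1,000 ops/second from above as the only inputs: the raw, pre-optimization rate works out as follows, and the gap between it and the quoted 1-2GB/day full-chain figure is what the batching and summarization have to absorb.

```python
# Raw proof-chain growth before any pruning or batching.
PROOF_BYTES = 64 + 64        # operation hash + signature
OPS_PER_SEC = 1000

per_minute = PROOF_BYTES * OPS_PER_SEC * 60      # ~7.7 MB/min ("roughly 10MB")
per_day = per_minute * 60 * 24                   # ~11 GB/day before optimizations

print(f"{per_minute / 1e6:.1f} MB/min, {per_day / 1e9:.1f} GB/day")
```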

I'm also working on implementing selective proof generation where you can choose which operations require verification, allowing even finer control over storage growth.

The code in proof_storage.rs shows the implementation details if you're interested in the specifics.


There are some inaccuracies in this article. For example, the villa on Capri was built by Tiberius, not Augustus. https://en.wikipedia.org/wiki/Villa_Jovis


Augustus also had a villa on Capri.


Tiberius built Villa Jovis. I believe Augustus ruled from Capri, but I don't remember him having had a villa. (He did acquire Capri from Napoli in exchange for Ischia.)

Do you have a source? Would love to visit the ruins.


The article didn't mention Villa Jovis. Augustus also established various buildings there, for instance the Villa/Palazzo a Mare:

https://en.wikipedia.org/wiki/Palazzo_a_Mare

https://www.italytraveller.com/en/e/imperial-capri


Thanks!

