They probably threw a fat CIDR block into their IP blacklist to fight off a spam campaign, and your IP was caught in the dragnet. This is how the big companies do it: they evaluate the risk of false positives, and as long as it stays below a threshold, they proceed.
In the lower-level Arrow/Parquet libraries you can control the row groups, and even the data pages (although that's a lot more work). I have used this heavily with the arrow-rs crate to drastically improve (roughly 10x) how quickly data could be queried from files. Some row groups will have just a few rows, others will have thousands, but being able to skip searching many row groups entirely makes the skew irrelevant.
Just beware that one issue you can have is the limit of row groups per file (2^15).
You may be able to get close with sufficiently small row groups, but you will have to run some tests. You can do this in a few hours of work by taking some sensor data, sorting it by the identifier, and then writing it to Parquet with one row group per sensor. You can do this with the ParquetWriter class in PyArrow, or anything else that gives you fine-grained control over how the file is written. I just checked and saw that you can have around 7 million row groups per file, so you should be fine.
Then spin up DuckDB and do some performance tests. I'm not sure this will work; there is some overhead to reading Parquet, which is why small files and small row groups are generally discouraged.
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.
What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.
I will write something up when the dust settles, I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to make investments in efficiency at the lower level rather than through tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.
Look at the website for that polling company. It is bizarre. None of the people on the people page have the company on their LinkedIn pages. Seems to be astroturf.
Edit: look at the photos of the people… AI generated perhaps?
Sqitch is an incredibly underappreciated tool. It doesn't have a business pushing it like Flyway and Liquibase, so it isn't as widely known, but I vastly prefer it to comparable migration tools.
I've put a lot of thought into managing the storage growth. The chain grows in proportion to system activity, but I've implemented several optimizations to keep it manageable:
1. Efficient proof encoding: Each proof is typically 128 bytes (64-byte operation hash + 64-byte signature). For context, a 1GB system performing ~1000 operations/second would generate roughly 10MB of proof data per minute before optimizations.
2. Smart pruning strategies:
- Automatic pruning of validated proof chains after state transitions
- Configurable retention windows (default: 1 hour) for non-critical proofs
- Merkle tree summarization of older proofs (keeping root hashes only)
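The Merkle-tree summarization step could be sketched like this (a minimal illustration of the idea, not the actual proof_storage.rs implementation; hash choice and padding rule are assumptions):

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Collapse a list of proofs into a single 32-byte root hash."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Proofs older than the retention window get replaced by their root:
old_proofs = [b"proof-%d" % i for i in range(1000)]
root = merkle_root(old_proofs)
# storage drops from 1000 * 128 bytes of proofs to one 32-byte root
```

Keeping only the root still lets you later prove that any retained proof belonged to the summarized batch, at the cost of no longer being able to replay the pruned operations.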
In practice, a typical desktop workload generates about 100-200MB of proof data per day after optimizations. High-security environments can keep full chains (roughly 1-2GB/day), while standard deployments can use pruned chains (~100MB/day).
I'm also working on implementing selective proof generation where you can choose which operations require verification, allowing even finer control over storage growth.
The code in proof_storage.rs shows the implementation details if you're interested in the specifics.
Tiberius built Villa Jovis. I believe Augustus ruled from Capri, but I don’t remember him having had a villa. (He did acquire it from Napoli, in exchange for Ischia.)
Do you have a source? Would love to visit the ruins.
sudo shutdown +15 (or another number of minutes)
when I need a compute instance and don't want to forget to turn it off. You can cancel a pending shutdown with sudo shutdown -c. It's a simple trick that will save you money in some cases.