Humio looks interesting, but it appears to be a linear-search approach. That's fine, as I commented elsewhere, and their numbers match what I was able to achieve with my linear-scan prototypes.
The reason I rejected this approach is that it gets very expensive at large data volumes if you want your queries to remain responsive.
Say you want to search a 100TB dataset (not as large as it sounds when it comes to log data...). You can scan about 1GB/s/core assuming the data sits on local disks that can keep up, and if each of your machines has 16 cores, that is 16GB/s/machine. Let's say your query target time is 60s (pretty slow tbh, incredibly generous I would say).
The math then plays out like this. Each node in your cluster can scan 60 * 16GB = 960GB within the budget. You need that ~1TB of data on disks that can read at 16GB/s, which means spreading it evenly across 8-16 very high-end SSDs.
On top of that you need around 100 of these machines to complete this query in 60 seconds.
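To make the arithmetic concrete, here is a back-of-envelope sketch of that calculation (the throughput, core count and budget are the assumptions from above, not measurements):

    # Back-of-envelope: brute-force linear scan over local disks.
    DATASET_TB = 100
    SCAN_GB_PER_SEC_PER_CORE = 1      # assumed scan throughput per core
    CORES_PER_MACHINE = 16
    QUERY_BUDGET_SEC = 60

    per_machine_gb = SCAN_GB_PER_SEC_PER_CORE * CORES_PER_MACHINE * QUERY_BUDGET_SEC
    machines = (DATASET_TB * 1000) / per_machine_gb
    print(per_machine_gb, machines)   # -> 960 GB per machine, ~104 machines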
Now Humio goes on to say you can store all your persistent data in a cloud bucket, which is a good strategy and something I am employing too. But if you have no indices you still need to scan all of it, which means you are limited by how fast you can reasonably fetch the objects across your cluster, and hence by the speed of your network interfaces.
Say you are on GCP, which has a relatively fast network that seems capable of around 25Gbit/s to GCS most of the time, and say you actually get peak performance all the time (pretty unrealistic, but ok). Then to fully scan 100TB in 60s you would need over 500 machines. If you are able to use Humio's tags to narrow this down, say by only searching for errors, and that gets you to 10% of your total logs, that is a 90% reduction in the data you have to scan. Humio sort of has quasi-indexing in this way, similar to Prometheus, however the tags don't help you when what you are looking for isn't tagged to stand out.
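Same kind of sketch for the network-bound case, using the assumed 25Gbit/s figure above and ignoring any protocol overhead:

    # Network-bound: pulling objects from GCS at an assumed 25 Gbit/s per machine.
    NIC_GBIT_PER_SEC = 25
    gb_per_sec = NIC_GBIT_PER_SEC / 8          # ~3.1 GB/s per machine
    per_machine_gb = gb_per_sec * 60           # ~187 GB fetched per machine in 60s
    machines = (100 * 1000) / per_machine_gb   # ~533 machines for 100 TB
    print(per_machine_gb, machines)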
This is why indices matter. Yes - indices are hard, and yes, they can have bad worst cases if you aren't super careful. However, let's consider indices for this query.
Say your logs are syslog + some extra structured data. You have things like application_name, facility, level, etc.
You have 100TB of logs to search. You are looking for logs from your main application, which makes up say 60% of your total log volume, and specifically for DNS resolution errors that contain the string "error resolving".
With my prototype, my indices are approximately 5% the size of the ingested data. The raw ingested data is also highly compressed; let's assume the ratio is equal to Humio's (it's probably higher due to the file format, but that's not important).
So the thing that jumps out here is that we now only need 5TB of indices on our machines to reasonably find needles in our 100TB of data. Additionally, our indices are split by field, so if we know we are searching the message field we only need to load those. Indices for low-cardinality columns are extremely small; those for high-cardinality columns are much larger, but capped thanks to various log-specific optimisations I am able to use. So let's assume the message field makes up the majority of our index, say 80%. That brings us to 4TB of data we need to scan.
Now, 60s would usually be super slow for a query on a system with indices like this; usually you would try to make sure that 4TB is in RAM and simply blow through it in <1s across a few machines. However, for comparison's sake let's say 60s is still our query budget and we don't have completely stupid amounts of RAM.
So we need to be able to scan 4TB/60s ~= 66GB/s, which given our previous per-machine numbers with local storage puts us at ~5 machines.
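Plugging the same assumed numbers into the indexed case:

    # Indexed case: only the index needs to be scanned, same machine assumptions.
    INDEX_RATIO = 0.05          # index size as a fraction of ingested data (my prototype)
    MESSAGE_FIELD_SHARE = 0.80  # assumed share of the index taken by the message field
    index_tb = 100 * INDEX_RATIO * MESSAGE_FIELD_SHARE   # 4 TB to scan
    gb_per_sec_needed = index_tb * 1000 / 60             # ~67 GB/s
    machines = gb_per_sec_needed / 16                    # ~4.2 -> call it 5 machines
    print(index_tb, gb_per_sec_needed, machines)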
However, we could likely do this with even less CPU, assuming our storage is fast enough, because unlike an unindexed system such as Humio we aren't applying expensive search algorithms to every row; we are simply crunching a ton of bitset operations, in this case a metric ton of bitwise ANDs.
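For illustration only, here is a toy sketch of the kind of work that boils down to (not my actual index format; real systems use compressed bitmaps rather than plain Python ints):

    # Toy sketch: each term/tag maps to a bitset of row ids containing it.
    # Answering "application_name=myapp AND message contains 'error resolving'"
    # then becomes a bitwise AND over bitsets, with no per-row string scanning.
    def rows_to_bitset(rows):
        bits = 0
        for r in rows:
            bits |= 1 << r
        return bits

    app_rows = rows_to_bitset([0, 1, 3, 4, 7])  # rows where application_name=myapp
    err_rows = rows_to_bitset([1, 2, 4, 5])     # rows whose message matched "error resolving"

    matches = app_rows & err_rows               # the "metric ton of bitwise ANDs"
    hits = [i for i in range(matches.bit_length()) if (matches >> i) & 1]
    print(hits)                                 # -> [1, 4]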
Anyway, this is a long rant. The reason for it is that people always say "Can't you just do this fast enough with linear search?" and I always have to reply "it depends on how big the haystack is". This quantifies, in a reasonable way, what counts as too big a haystack.