
Linear search approaches fall down when you have a lot of data and you only want to select a very small portion of it.

A linear scan gets you to roughly 1GB/s per core in Rust.

A medium-ish sized startup probably generates around 200GB of logs per day if they aren't being very tight about log volume. If you only want to search the last 24 hours that's maybe OK: with 10-20 cores you can scan it in roughly 10-20 seconds on a single machine.
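
To make the arithmetic concrete, here is roughly the kind of scan I mean. It's a toy sketch, not my actual code; the ~1GB/s figure assumes large buffered reads and a SIMD substring search (e.g. the memchr crate) rather than this naive per-line loop, which will land well below that.

    use std::env;
    use std::fs::File;
    use std::io::{self, BufRead, BufReader};

    // Naive linear "grep": stream a log file and count lines containing a pattern.
    // This is the baseline the per-core throughput numbers above refer to.
    fn main() -> io::Result<()> {
        let mut args = env::args().skip(1);
        let pattern = args.next().expect("usage: scan <pattern> <file>");
        let path = args.next().expect("usage: scan <pattern> <file>");

        let reader = BufReader::with_capacity(1 << 20, File::open(path)?);
        let mut hits = 0u64;
        for line in reader.lines() {
            if line?.contains(&pattern) {
                hits += 1;
            }
        }
        println!("{hits} matching lines");
        Ok(())
    }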

However, this quickly breaks down when your log volume is a multiple of that and/or you want to search more than just the last day.

In which case you need some sort of index.

There are different approaches to indexing logs. The most common is full-text search indexing using an engine like Lucene: Elasticsearch (from the ELK stack) and Solr both explicitly use Lucene. Splunk uses its own indexing format, but I'm pretty sure it's in a similar vein. Papertrail uses Clickhouse, which probably means some sort of data skipping indices plus lots of linear searching.

Of these approaches Clickhouse is probably the best way to go. It combines fast linear search with distributed storage and data skipping indices that reduce the amount of data you need to scan (especially if you filter with PREWHERE clauses).

So why not go with Clickhouse? Clickhouse requires a schema. You can work around that by flattening your nested structured data into key/value pairs (not a problem if your logs are already flat) and using one array column for keys and another for values. This works, but the compression isn't great, filtering becomes largely ineffective, and you now have to operate a distributed database that needs ZooKeeper for coordination.
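
For what it's worth, the flattening step itself is trivial; the pain is everything downstream of it. A rough sketch (using serde_json; arrays just get stringified here for brevity):

    use serde_json::Value;

    // Flatten a nested JSON log event into parallel (key, value) pairs, the shape
    // you'd feed into a table with `keys Array(String)` / `values Array(String)`
    // columns. Dotted paths stand in for nesting.
    fn flatten(prefix: &str, v: &Value, out: &mut Vec<(String, String)>) {
        match v {
            Value::Object(map) => {
                for (k, child) in map {
                    let key = if prefix.is_empty() {
                        k.clone()
                    } else {
                        format!("{prefix}.{k}")
                    };
                    flatten(&key, child, out);
                }
            }
            // Scalars (and arrays, kept as raw JSON) become string values.
            other => out.push((prefix.to_string(), other.to_string())),
        }
    }

    fn main() {
        let event: Value = serde_json::from_str(
            r#"{"level":"error","http":{"status":500,"path":"/api/v1/items"}}"#,
        ).unwrap();
        let mut kv = Vec::new();
        flatten("", &event, &mut kv);
        for (k, v) in &kv {
            println!("{k} = {v}");
        }
    }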

The reason I am choosing to build my own is that logs have unique indexing requirements. First and foremost, the storage system needs to be fully schemaless. Secondly, you need to retain non-word characters: the standard Lucene tokenizers in Elastic strip important punctuation that you might want to match on when searching log data. Field dimensionality can also be very high, so you need a system that won't buckle under metadata overhead when there are crazy numbers of unique fields; the same goes for cardinality.
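
To make the punctuation point concrete, here's roughly the difference (a crude stand-in for a word-oriented analyzer vs. plain whitespace splitting; not Lucene's exact rules):

    // A word-character tokenizer loses exactly the tokens you grep for in logs;
    // a whitespace tokenizer keeps paths, query strings and arrows intact.
    fn word_tokens(line: &str) -> Vec<&str> {
        line.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .collect()
    }

    fn whitespace_tokens(line: &str) -> Vec<&str> {
        line.split_whitespace().collect()
    }

    fn main() {
        let line = "GET /api/v1/items?id=42 -> 500 (took 1.2ms)";
        println!("{:?}", word_tokens(line));
        // ["GET", "api", "v1", "items", "id", "42", "500", "took", "1", "2ms"]
        println!("{:?}", whitespace_tokens(line));
        // ["GET", "/api/v1/items?id=42", "->", "500", "(took", "1.2ms)"]
    }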

TLDR: at scale you need indices so that searching a month of logs doesn't mean scanning 20TB. Current indices suck for logs. I'm writing a custom index that is hella fast for regex.
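
For anyone wondering how an index helps with regex at all: the classic trick (the Google Code Search approach) is a trigram index. You extract the literals/trigrams a match must contain from the regex, intersect their posting lists to get candidate lines or segments, and only run the real regex over those. A toy sketch of the idea, not my actual implementation:

    use std::collections::{HashMap, HashSet};

    // Toy trigram index: map every 3-byte window of a line to the set of line ids
    // containing it. Queries intersect posting lists to shrink the candidate set,
    // then verify with a real scan over the survivors only.
    struct TrigramIndex {
        postings: HashMap<[u8; 3], HashSet<usize>>,
        lines: Vec<String>,
    }

    impl TrigramIndex {
        fn build(lines: Vec<String>) -> Self {
            let mut postings: HashMap<[u8; 3], HashSet<usize>> = HashMap::new();
            for (id, line) in lines.iter().enumerate() {
                for w in line.as_bytes().windows(3) {
                    postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
                }
            }
            TrigramIndex { postings, lines }
        }

        // Literal-substring query: the building block a regex planner sits on top of.
        fn search(&self, needle: &str) -> Vec<&str> {
            let mut candidates: Option<HashSet<usize>> = None;
            for w in needle.as_bytes().windows(3) {
                let ids = self.postings.get(&[w[0], w[1], w[2]]).cloned().unwrap_or_default();
                candidates = Some(match candidates {
                    None => ids,
                    Some(c) => c.intersection(&ids).copied().collect(),
                });
            }
            candidates
                .unwrap_or_default() // needles shorter than 3 bytes would need a fallback scan
                .into_iter()
                .filter(|&id| self.lines[id].contains(needle)) // verify; never trust the index alone
                .map(|id| self.lines[id].as_str())
                .collect()
        }
    }

    fn main() {
        let idx = TrigramIndex::build(vec![
            "level=error msg=\"upstream timeout\"".into(),
            "level=info msg=\"request ok\"".into(),
        ]);
        println!("{:?}", idx.search("timeout")); // only the first line survives
    }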



My first thought was also Clickhouse and the problems with schema for dynamic log content.

Also, if this is open source, I will definitely be interested in checking it out/contributing etc.


Here's another thought--why not try to fix this in ClickHouse? It sounds like you are rebuilding Elastic.


I considered that, but it's harder than it sounds. Clickhouse is very strongly coupled to the idea of a schema, and it is just as coupled to using indices only for data skipping.

If I were to make the changes I want to Clickhouse, i.e. schemaless storage and full per-segment indexes, then it wouldn't really be Clickhouse anymore.



