I've been playing with ElasticSearch + Tire[0] over the span of the last week. It's a joy to use. Sunspot + Solr isn't a bad alternative, though.
Tire's docs are a bit lacking but it maps more-or-less 1:1 with ElasticSearch, so it's not too bad.
Very pleased with the performance ElasticSearch provides. The installation was a bit foreign for me personally (a java service container? openjdk 6 or 7?) but it's lightning quick and very flexible.
I chose to go the ElasticSearch route mainly due to my need for Geospatial indexing. Solr does it too, but Foursquare[1] uses ElasticSearch and so that got me interested in learning more. Geo queries are really fast. My last experience with Geo involved GeoDjango and all kinds of obnoxious hacks to PostgreSQL to make it work. With ElasticSearch you tell it to index a point and boom you're off to the races.
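To give a sense of how little ceremony is involved (index and field names here are invented for illustration): you map a field as type `geo_point`, and then a search body with a `geo_distance` filter, in the pre-2.0 query syntax of that era, looks roughly like:

```json
{
  "query": {"match_all": {}},
  "filter": {
    "geo_distance": {
      "distance": "5km",
      "location": {"lat": 40.7486, "lon": -73.9864}
    }
  }
}
```

That's the whole thing: no PostGIS extensions, no custom SQL, just a mapping and a filter.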
I used Solr for the previous generation of FreshBSD[0], and migrated to ElasticSearch and Tire over a year ago and haven't looked back. The docs for both ElasticSearch and Tire leave something to be desired, but it's still so much nicer to use and a lot faster out of the box, at least for my modest needs.
Tire was what convinced me to choose ElasticSearch over Solr. A lot of the features described in this document don't apply to my needs, but the speed with which we were up and running with ElasticSearch, thanks to Tire, was hard to match.
I love both Solr and ElasticSearch but the big missing comparison for me is: are there any books available? Or even comprehensive tutorials beyond the basics? I love ElasticSearch but it was a huge pain getting up-to-speed on everything. Figuring out things like EdgeNGrams (something I already knew how to do in Solr and Lucene) meant digging into the source code. I'm not shy about doing that myself, but giving that advice to a consulting client would be a non-starter. With the explosive growth of ES just in the last year or two, it's really time for someone to start working on a book. Packt, Manning, O'Reilly, any news?
Add geospatial to your comparison chart please. The way in which these implement support varies widely in performance and accuracy. I've yet to find one that actually uses R-Trees. Geohashes seem to be all the rage these days.
"I've yet to find one that actually uses R-Trees."
That's because R-Trees don't scale well with random write loads.
R-tree insertion performance is extremely dependent on insertion order (search for "sort tile recursive"). They're best used for problems where the data can be bulk-loaded and left alone. If random writes are an important part of the problem (as they are for most web-based tools), R-Trees are a bad idea.
Geohashes seem to play nicely to the strengths of an indexing search engine, because you can encode quadrants into a string of characters, and use ngram analysis to compare multiple levels of precision. There's almost no extra work that goes into using geohashes with a term-based index.
That said, they carry a lot of annoying edge cases when determining the adjacency of quadrants, so they're hardly a panacea to geospatial search. Lucene 3 and 4 make a lot of progress in spatial search, but there's still a fair bit of room for improvement.
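To make the prefix trick concrete, here is a rough sketch of geohash encoding (the standard Niemeyer algorithm, not Lucene's actual implementation): each character encodes 5 more bits of alternating longitude/latitude subdivision, so a shorter hash is always a prefix of a longer hash of the same point, which is exactly what edge-ngram indexing exploits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=12):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    is_lon = True  # bits alternate, longitude first
    while len(bits) < precision * 5:
        rng = lon_range if is_lon else lat_range
        val = lon if is_lon else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        is_lon = not is_lon
    # pack each run of 5 bits into one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, precision * 5, 5)
    )

# Every prefix of a geohash is a coarser geohash of the same point, so
# indexing the edge ngrams of one string gives multi-resolution lookup.
point = (57.64911, 10.40744)
assert geohash(*point, precision=5) == geohash(*point, precision=11)[:5]
```

The adjacency problem mentioned above is real, though: two points a meter apart can land in cells with no common prefix if they straddle a cell boundary, so a correct radius search has to also check the neighboring cells.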
Furthermore, with respect to schema changes, ElasticSearch will refuse to make backwards-incompatible changes. So for either search engine, you'll need to get comfortable at some point with the procedure for creating a new index with the new schema or mapping, reindexing your data, and hot-swapping the Solr Core or ElasticSearch Alias.
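On the ElasticSearch side, the hot swap itself is a single atomic call to the aliases API (index names here are hypothetical): you reindex into `items_v2` while searches still hit the `items` alias pointing at `items_v1`, then POST to `/_aliases` with:

```json
{
  "actions": [
    {"remove": {"index": "items_v1", "alias": "items"}},
    {"add":    {"index": "items_v2", "alias": "items"}}
  ]
}
```

Both actions happen in one step, so searches against the alias never see a gap.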
Given that this appears to be a community resource, and not something sponsored by either the Solr or ElasticSearch people, I'm sure your specific critiques would be useful.
Looks like a sales/SEO play for a Solr/ElasticSearch consultant. Still seems pretty helpful as a community resource. I emailed the author to see if he's interested in setting up a public GitHub repo to take pull requests.
Personally, I'd like to see similar comparisons for other search engines, like Sphinx and Postgres full-text search. When I talk to people about search engines, the first thing they ask is how one compares to another.
Not sure about codewright's use cases. However, in my own brief experimentation with Solr, I ran into performance issues with garbage collection.
I set up a cluster of about 15 cc2.8xlarge machines (5 shards with 3 replicas each) containing 240 GB worth of documents (48 GB per shard). Each node was given on the order of 40 GB of heap space. While performing load tests with a relatively small load (~150 QPS), after a few minutes the garbage collector on nodes would kick in and run for 15 to 30 seconds. This had a cascading effect: ZooKeeper would think nodes were down, trigger leader re-election, and so on.
Admittedly, I am quite inexperienced when it comes to dealing with applications using such large heap sizes. Though I tried a few different JVM options related to GC, I was unsuccessful in resolving the problem.
If any folks here happen to have some good resources regarding GC and large Solr clusters I would definitely be interested.
That huge heap is extremely counterproductive, because large heaps have terrible GC performance, and you're actually stealing memory from the natively memory-mapped files that make up your index.
Edit: I was benchmarking a similarly sized (though very differently configured) Solr cluster for a well-known internet company, and was able to tune it to do 5000qps, with p50 ~2ms and p99 ~20ms.
Thanks for the tips. I was considering testing again with more partitions on smaller machines; perhaps N x m1.xlarge with 8 GB of heap space each.
I was starting to think that since the heap space was so big, perhaps I should be worrying about page sizes as well. While I tried various GC settings (UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc.), I didn't take a stab at playing with the size of the new generation. Could you explain the reasoning behind this? Is the idea that most objects die young, so you want more short minor GCs in order to avoid bigger major GCs? I am quite interested.
Edit: Regarding the cluster you were working on: would you be able to give its general dimensions (number of nodes and partitions, plus memory for each)? Just trying to get a general guideline to aim for.
In general, I fix the newgen size mostly to avoid the optimizer choosing something braindead in a pathological case. 50/50 is safe, but not optimal.
In general, you should have enough unallocated memory on the box to cover your working dataset (it'll get used by caches and memmaps). If you can, find a way to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions per box depending on workload. Using bigger boxes is usually better, because you can avoid communication latency and variance that arises from having tons of boxes.
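As a concrete starting point along those lines, the JVM flags would look something like the following; the sizes are illustrative, not a recommendation for any particular box:

```
-Xms8g -Xmx8g                       # small fixed heap; leave the rest of RAM to the OS page cache / mmapped index files
-XX:NewSize=4g -XX:MaxNewSize=4g    # pin newgen explicitly (a 50/50 split, per the comment above)
-XX:+UseConcMarkSweepGC             # concurrent collector to avoid long stop-the-world pauses
```

The point is that the heap only needs to cover live Java objects; the index itself is served from memory-mapped files outside the heap, which is why a 40 GB heap starves the part of the system doing the actual work.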
If you want to know more, you can email me at kyle@onemorecloud.com.
If you need a serious search engine, using Postgres and Sphinx won't last long. You'll end up moving to Solr or ElasticSearch. (I've used both, but use ElasticSearch now)
I am not codewright, but we had trouble with Sphinx on search queries containing a large number of terms (for us, hiccups started after 100 terms or so). Besides, setting up delta indexing is a PITA, and extensibility/configurability is limited. We ended up using Solr (which is a memory hog), but at least it works.
Yeah, Elasticsearch is dead easy to update documents with. It's also easier than Sphinx to set up (basically, throw some data at it and it'll suss out a mapping for it).
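For example (document and field names invented here): index `{"name": "widget", "price": 12, "created": "2012-09-01"}` into an index with no mapping defined, and fetching the mapping back returns something along these lines, with the types inferred dynamically:

```json
{
  "doc": {
    "properties": {
      "name":    {"type": "string"},
      "price":   {"type": "long"},
      "created": {"type": "date", "format": "dateOptionalTime"}
    }
  }
}
```

You can still define an explicit mapping up front when you need custom analyzers or geo types, but for getting started the dynamic mapping is usually good enough.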
Both are nice and will do the job without too much pain. I've been running an ES cluster for about a year now. I appreciate how easy it is to set up, but the documentation is terrible.
The ES docs should be a cross between RethinkDB's and Redis's; that would make life easier for everybody.
Couldn't agree more. I think the problem with the ElasticSearch docs is that they assume the user already understands the inner workings of the Lucene search engine (after all, ElasticSearch is just a nice RESTful wrapper on top of it).
If, as was my case, the most complex search you'd ever done before was a full-text search on a database field, then you'll be lost for a good couple of days until you understand what's going on.
It really doesn't need to since you aren't bound to a fixed schema. Just use whatever fields are necessary for your documents and map them to the appropriate types.
Having played with both, I personally find http://www.searchify.com/ much better than both Solr and ElasticSearch, at least based on the search results.
It looks like searchify is a hosted search solution, whereas Solr and ElasticSearch are distributions of search servers that can be deployed on your own hardware.
Technically Searchify is based on the open-sourced code from the previously proprietary IndexTank. So in theory you can run your own: https://github.com/linkedin/indextank-engine.
I'm skeptical about claims of different quality of relevancy results, since IndexTank/Searchify is also based on Lucene (looks like 3.0.1 in the canonical repo), and should share all the same fundamental relevancy and scoring functionality.
It's something I threw together in a couple hours, and figured I'd iterate and improve over the next couple days, so please bear with the mistakes.
I fixed the more glaring errors (copy fields, dynamic fields, Django, etc.), and will continue to do so as comments come in.