Features of Solr vs. ElasticSearch (solr-vs-elasticsearch.com)
118 points by friendlytuna on Nov 14, 2012 | hide | past | favorite | 48 comments


Hi! I'm the author of http://solr-vs-elasticsearch.com

It's something I threw together in a couple hours, and figured I'd iterate and improve over the next couple days, so please bear with the mistakes.

I fixed the more glaring errors (copy field, dynamic fields, Django, etc.), and will continue to do so as comments come in.


I've been playing with ElasticSearch + Tire[0] over the span of the last week. It's a joy to use. Sunspot + Solr isn't a bad alternative, though.

Tire's docs are a bit lacking but it maps more-or-less 1:1 with ElasticSearch, so it's not too bad.

Very pleased with the performance ElasticSearch provides. The installation was a bit foreign for me personally (a java service container? openjdk 6 or 7?) but it's lightning quick and very flexible.

I chose to go the ElasticSearch route mainly due to my need for Geospatial indexing. Solr does it too, but Foursquare[1] uses ElasticSearch and so that got me interested in learning more. Geo queries are really fast. My last experience with Geo involved GeoDjango and all kinds of obnoxious hacks to PostgreSQL to make it work. With ElasticSearch you tell it to index a point and boom you're off to the races.
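For anyone curious what that looks like in practice, here's a rough sketch of a geo_point mapping and a geo_distance query, built as plain dicts in ES 0.19-era JSON syntax. The index, type, and field names are made up for illustration:

```python
# Hypothetical "venue" mapping with a geo_point field.
venue_mapping = {
    "venue": {
        "properties": {
            "name": {"type": "string"},
            "location": {"type": "geo_point"},  # expects {"lat": ..., "lon": ...}
        }
    }
}

# A geo_distance filter: find venues within 1km of a point.
nearby_query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "geo_distance": {
                    "distance": "1km",
                    "location": {"lat": 40.73, "lon": -74.0},
                }
            },
        }
    }
}
# These bodies would be PUT to /venues/venue/_mapping and
# POSTed to /venues/_search, respectively.
```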

[0]: https://github.com/karmi/tire

[1]: https://foursquare.com/about/


I used Solr for the previous generation of FreshBSD[0], and migrated to ElasticSearch and Tire over a year ago and haven't looked back. The docs for both ElasticSearch and Tire leave something to be desired, but it's still so much nicer to use and a lot faster out of the box, at least for my modest needs.

[0]: http://freshbsd.org/


Solr's docs also leave a lot to be desired. The wikis are a mess and knowing what is still relevant (e.g. not deprecated) is frequently a crap shoot.

It'd be great if their docs had versioning (like the Apache HTTP Server Project), but I suspect that isn't on anyone's roadmap.


Tire was what convinced me to choose elasticsearch over Solr. A lot of the features described in this document do not apply to my needs, but the speed in which we were up-and-running with elasticsearch because of tire was hard to match.


I love both Solr and ElasticSearch but the big missing comparison for me is: are there any books available? Or even comprehensive tutorials beyond the basics? I love ElasticSearch but it was a huge pain getting up-to-speed on everything. Figuring out things like EdgeNGrams (something I already knew how to do in Solr and Lucene) meant digging into the source code. I'm not shy about doing that myself, but giving that advice to a consulting client would be a non-starter. With the explosive growth of ES just in the last year or two, it's really time for someone to start working on a book. Packt, Manning, O'Reilly, any news?


Add geospatial to your comparison chart please. The way in which these implement support varies widely in performance and accuracy. I've yet to find one that actually uses R-Trees. Geohashes seem to be all the rage these days.


"I've yet to find one that actually uses R-Trees."

That's because R-Trees don't scale well with random write loads.

R-tree insertion performance is extremely dependent on insertion order (search for "sort tile recursive"). They're best used for problems where the data can be bulk-loaded and left alone. If random writes are an important part of the problem (as they are for most web-based tools), R-Trees are a bad idea.


Geohashes seem to play nicely to the strengths of an indexing search engine, because you can encode quadrants into a string of characters, and use ngram analysis to compare multiple levels of precision. There's almost no extra work that goes into using geohashes with a term-based index.

That said, they carry a lot of annoying edge cases when determining the adjacency of quadrants, so they're hardly a panacea for geospatial search. Lucene 3 and 4 make a lot of progress in spatial search, but there's still a fair bit of room for improvement.
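To make the prefix trick concrete, here's a simplified geohash encoder. This is the textbook bit-interleaving algorithm, not Lucene's implementation; the point is that truncating the hash just widens the cell, which is why prefix/ngram matching on the terms gives you multiple levels of precision for free:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=8):
    """Encode a (lat, lon) pair into a geohash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # pack each group of 5 bits into one base32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)
```

Because each character refines the previous cell, geohash(lat, lon, 5) is always a prefix of geohash(lat, lon, 11) for the same point, so shorter terms index coarser quadrants.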


I'm not an ElasticSearch expert but it seems that the scenario for "Field copying" would be supported with the multi_field indexing ( http://www.elasticsearch.org/guide/reference/mapping/multi-f... )
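For reference, a multi_field mapping along those lines might look like this (sketched as a plain dict in ES 0.x syntax; the type and field names are hypothetical). The same source value gets indexed two ways, which covers the main copyField use case:

```python
# Hypothetical mapping: "title" is indexed both analyzed (for full-text
# search) and not_analyzed (for exact match / sorting), from one value.
mapping = {
    "product": {
        "properties": {
            "title": {
                "type": "multi_field",
                "fields": {
                    "title": {"type": "string", "analyzer": "standard"},
                    "raw":   {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }
}
# Queries then hit "title" for full-text search and "title.raw"
# for exact matching, without the document supplying two fields.
```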


That is correct. It appears that this comparison is a bit inconsistent. Also, there are client libs for JavaScript as well.


Good overview - shows you just how powerful these engines are.

I can't speak for ElasticSearch but there are a couple things in the Solr list that I'm not sure about.

"Multiple document types per schema" - You can use dynamic fields so that you don't even need to define your document schema

"Schema change requires restart" - I think with MultiCore you can apply schema changes by swapping cores, without a restart (which is a good way of running Solr) [0]

[0] http://stackoverflow.com/questions/10417422/solr-schema-chan...


Furthermore, with respect to schema changes, ElasticSearch will refuse to make backwards-incompatible changes. So for either search engine, you'll need to get comfortable at some point with the procedure for creating a new index with the new schema or mapping, reindexing your data, and hot-swapping the Solr Core or ElasticSearch Alias.
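The ElasticSearch half of that hot swap can be done atomically through the _aliases endpoint. A sketch of the request body, assuming a "products" alias currently pointing at "products_v1" and reindexed data in "products_v2" (all names hypothetical):

```python
# Remove the alias from the old index and add it to the new one in a
# single request, so searches against "products" never see a gap.
swap_actions = {
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add":    {"index": "products_v2", "alias": "products"}},
    ]
}
# This body would be POSTed to /_aliases; ES applies both actions
# atomically, after which products_v1 can be deleted at leisure.
```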


Hot-swaps of ElasticSearch aliases are how we do it at my company. It's how we produce a rolling archive.


3rd party integration of ElasticSearch with Django: http://haystacksearch.org/. So I'm not sure why the article says N/A.


Yeah, especially weird since it's the same project that provides the 3rd-party integration for Solr, and ElasticSearch is increasingly the author's favored engine.


After the 3rd thing that was wrong about Solr, I stopped caring to write anything more than this comment.


Given that this appears to be a community resource and not sponsored by either the Solr or ElasticSearch people, I'm sure your specific critiques would be useful.


Looks like a sales/SEO play for a Solr/ElasticSearch consultant. Still seems pretty helpful as a community resource. I emailed the author to see if he's interested in setting up a public GitHub repo to take pull requests.

Personally, I'd like to see similar comparisons for other search engines, like Sphinx and Postgres Full-Text. When I talk to people about search engines, the first questions they ask me are to compare one against some other.


Which is especially egregious, since both fall apart in more serious use-cases.


Can you expand on what you mean?


Not sure about codewright's use cases. However in my own brief experimentation with SOLR I ran into performance issues with garbage collection.

I set up a cluster of about 15 cc2.8xlarge machines (5 shards with 3 replicas each) containing 240GB worth of documents (48GB per shard). Each node was given on the order of 40GB heap space. While performing load tests with a relatively small load (~150 QPS), after a few minutes the garbage collector on nodes would kick in and run on the order of 15 to 30s. This had a cascading effect of causing ZooKeeper to think nodes were down, start leader re-election, etc.

Admittedly I am quite inexperienced when it comes to dealing with applications using such large heap sizes. Though I tried a few different JVM options with respect to GC I was unsuccessful in resolving the problem.

If any folks here happen to have some good resources regarding GC and large Solr clusters I would definitely be interested.


That huge heap is extremely counterproductive, because large heaps have terrible GC performance, and you're actually stealing memory from the natively memory-mapped files that make up your index.

Try it again with sane GC parameters, e.g.:

    -Xmx<N>g -Xms<N>g -XX:NewSize=<N/2>g -XX:MaxNewSize=<N/2>g -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+CMSIncrementalMode
Where <N> is a value between 2-8.

Edit: I was benchmarking a similarly sized (though very differently configured) Solr cluster for a well-known internet company, and was able to tune it to do 5000qps, with p50 ~2ms and p99 ~20ms.


Thanks for the tips. I was considering testing again with more partitions w/ smaller machines. Perhaps N x m1.xlarge w/ 8GB heap space.

I was starting to think that since the heap space was so big perhaps I should be worrying about page sizes as well. While I tried various GC settings (UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc.) I didn't take a stab at playing with the size of the New Generation. Could you explain the reasoning behind this? Is the idea that most objects die young, so you try to increase the number of short minor GCs and avoid bigger major GCs? I am quite interested.

Edit: Regarding the cluster you were working on. Would you be able to give general dimensions to the number of nodes & partitions in your cluster + memory for each? Just trying to get a general guideline to aim for.


In general, I fix the newgen size mostly to avoid the optimizer choosing something braindead in a pathological case. 50/50 is safe, but not optimal.

In general, you should have enough unallocated memory on the box to cover your working dataset (it'll get used by caches and memmaps). If you can, find a way to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions per box depending on workload. Using bigger boxes is usually better, because you can avoid communication latency and variance that arises from having tons of boxes.
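A back-of-the-envelope way to sanity-check a box against those rules of thumb (heap small, RAM left over for the OS page cache, cores * 1-4 shards per box). All numbers are illustrative, assuming the ~60GB/16-core class of machine discussed upthread:

```python
def size_box(ram_gb, cores, data_gb_per_box, heap_gb=8):
    """Rough capacity check for one search node; not a benchmark."""
    page_cache_gb = ram_gb - heap_gb  # memory the OS can use to mmap the index
    return {
        "heap_gb": heap_gb,
        "page_cache_gb": page_cache_gb,
        # the working dataset should fit in unallocated memory
        "data_fits_in_cache": data_gb_per_box <= page_cache_gb,
        # suggested shard-count range: cores * 1 to cores * 4
        "shards_per_box": (cores * 1, cores * 4),
    }

# e.g. a 60GB, 16-core box holding the 48GB shard from the thread:
plan = size_box(ram_gb=60, cores=16, data_gb_per_box=48)
# With an 8GB heap, ~52GB is left for the filesystem cache, so the
# 48GB of index data fits entirely in memory.
```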

If you want to know more, you can email me at kyle@onemorecloud.com.


Your Solr cluster kicked the bucket at 150QPS?

Jesus dude. I couldn't reproduce that with my single or multi-node ElasticSearch clusters if I wanted to.

How were the EBS backing stores set up for these EC2 nodes?

Edit: Also, when I was talking about "them" falling apart, I meant Postgres or Sphinx, not Solr/ElasticSearch.

Well-configured Solr and ElasticSearch clusters can work very well for most people.


EBS shouldn't really matter, because with a reasonable heap, he should have 40-50G of available filesystem cache, and 48G of data.


If you need a serious search engine, using Postgres and Sphinx won't last long. You'll end up moving to Solr or ElasticSearch. (I've used both, but use ElasticSearch now)


I am not codewright, but we had trouble with Sphinx on search queries containing a larger number of terms (for us, hiccups started after 100 terms or so). Besides, setting up delta-indexing is a PITA, and extensibility/configurability is limited. We ended up using Solr (which is a memory hog) but at least it works.


Yeah, Elasticsearch is dead easy to update documents with. It's also easier than Sphinx to set up (basically throw some data at it and it'll suss out a mapping for it).


The running meme among the engineers at my company is that ElasticSearch is our secret weapon we love to whip out for various problems.

I almost wish it was more of a standard data store. Here's to hoping RethinkDB can fill that void.


Both are nice and will do the job without too much pain. I've been running an ES cluster for about a year now. I appreciate how easy it is to set up, but the documentation is terrible. The ES docs should be a cross between RethinkDB's and Redis's. That would make life easier for everybody.


Couldn't agree more. I think the problem with ElasticSearch docs is they assume the user already understands the inner workings of the Lucene search engine (after all, ElasticSearch is just a nice restful wrapper on top of that.)

If, as was my case, the most complex search you've ever made before was a fulltext search on a database field then you'll be lost for a good couple days until you understand what's going on.


Surprised ElasticSearch doesn't support dynamic fields. That is one of the most useful features in SOLR.


This is incorrect. Dynamic templates are pretty much the same thing.

http://elasticsearch-users.115913.n3.nabble.com/Apply-dynami...


You can define a mapping that customizes fields based on wildcards. For instance, you can say that any field matching *_ts is treated as a date field.

That means that as new documents arrive with fields that were never seen before, if those fields end in _ts, they will be properly indexed.

That feels like dynamic field support to me.
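Sketched as a mapping body (ES dynamic_templates syntax; the type name, template name, and field suffix are made up), the *_ts example looks roughly like:

```python
# Any new field whose name matches *_ts gets mapped as a date,
# even if it has never been seen before.
mapping = {
    "event": {
        "dynamic_templates": [
            {
                "timestamps": {
                    "match": "*_ts",
                    "mapping": {"type": "date"},
                }
            }
        ]
    }
}
# A document like {"created_ts": "2012-11-14"} arriving later would
# have created_ts indexed as a date without any schema change.
```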


It really doesn't need to since you aren't bound to a fixed schema. Just use whatever fields are necessary for your documents and map them to the appropriate types.


That was a typo, which has been corrected. Dynamic fields are indeed supported!


Having played with both, I personally find http://www.searchify.com/ much better than both Solr and ElasticSearch, at least based on the search results.


It looks like searchify is a hosted search solution, whereas Solr and ElasticSearch are distributions of search servers that can be deployed on your own hardware.


Technically Searchify is based on the open-sourced code from the previously proprietary IndexTank. So in theory you can run your own: https://github.com/linkedin/indextank-engine.

I'm skeptical about claims of different quality of relevancy results, since IndexTank/Searchify is also based on Lucene (looks like 3.0.1 in the canonical repo), and should share all the same fundamental relevancy and scoring functionality.


Yep, but hosting it is a pain.

It's actually heavily tweaked (this is what the IndexTank team told me), and apparently contains components of Solr.


There are hosted services for Solr and ElasticSearch, too. Such as fizx's and my own http://websolr.com and http://bonsai.io


With elasticsearch, you are not able to change shard count after initial index creation. You are able to change replicas at any time.


Fixed, thanks.


This seems biased in favor of Solr... it tries very hard to keep Solr and ES balanced but in reality it's not that balanced.


Yokozuna is Riak + Solr, not ES.


Oops. Typo. Fixed.



