I've been playing with ElasticSearch + Tire[0] over the span of the last week. It's a joy to use. Sunspot + Solr isn't a bad alternative, though.
Tire's docs are a bit lacking but it maps more-or-less 1:1 with ElasticSearch, so it's not too bad.
Very pleased with the performance ElasticSearch provides. The installation was a bit foreign for me personally (a java service container? openjdk 6 or 7?) but it's lightning quick and very flexible.
I chose to go the ElasticSearch route mainly due to my need for Geospatial indexing. Solr does it too, but Foursquare[1] uses ElasticSearch and so that got me interested in learning more. Geo queries are really fast. My last experience with Geo involved GeoDjango and all kinds of obnoxious hacks to PostgreSQL to make it work. With ElasticSearch you tell it to index a point and boom you're off to the races.
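To give a sense of how little ceremony is involved (index and field names here are invented for illustration): you map a field as type `geo_point`, and then a search body with a `geo_distance` filter, in the pre-2.0 query syntax of that era, looks roughly like:

```json
{
  "query": {"match_all": {}},
  "filter": {
    "geo_distance": {
      "distance": "5km",
      "location": {"lat": 40.7486, "lon": -73.9864}
    }
  }
}
```

That's the whole thing: no PostGIS extensions, no custom SQL, just a mapping and a filter.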
I used Solr for the previous generation of FreshBSD[0], and migrated to ElasticSearch and Tire over a year ago and haven't looked back. The docs for both ElasticSearch and Tire leave something to be desired, but it's still so much nicer to use and a lot faster out of the box, at least for my modest needs.
Tire was what convinced me to choose ElasticSearch over Solr. A lot of the features described in this document don't apply to my needs, but the speed with which we were up and running with ElasticSearch, thanks to Tire, was hard to match.
I love both Solr and ElasticSearch but the big missing comparison for me is: are there any books available? Or even comprehensive tutorials beyond the basics? I love ElasticSearch but it was a huge pain getting up-to-speed on everything. Figuring out things like EdgeNGrams (something I already knew how to do in Solr and Lucene) meant digging into the source code. I'm not shy about doing that myself, but giving that advice to a consulting client would be a non-starter. With the explosive growth of ES just in the last year or two, it's really time for someone to start working on a book. Packt, Manning, O'Reilly, any news?
Add geospatial to your comparison chart please. The way in which these implement support varies widely in performance and accuracy. I've yet to find one that actually uses R-Trees. Geohashes seem to be all the rage these days.
"I've yet to find one that actually uses R-Trees."
That's because R-Trees don't scale well with random write loads.
R-tree insertion performance is extremely dependent on insertion order (search for "sort tile recursive"). They're best used for problems where the data can be bulk-loaded and left alone. If random writes are an important part of the problem (as they are for most web-based tools), R-Trees are a bad idea.
Geohashes seem to play nicely to the strengths of an indexing search engine, because you can encode quadrants into a string of characters, and use ngram analysis to compare multiple levels of precision. There's almost no extra work that goes into using geohashes with a term-based index.
That said, they carry a lot of annoying edge cases when determining the adjacency of quadrants, so they're hardly a panacea to geospatial search. Lucene 3 and 4 make a lot of progress in spatial search, but there's still a fair bit of room for improvement.
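To make the prefix trick concrete, here is a rough sketch of geohash encoding (the standard Niemeyer algorithm, not Lucene's actual implementation): each character encodes 5 more bits of alternating longitude/latitude subdivision, so a shorter hash is always a prefix of a longer hash of the same point, which is exactly what edge-ngram indexing exploits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=12):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    is_lon = True  # bits alternate, longitude first
    while len(bits) < precision * 5:
        rng = lon_range if is_lon else lat_range
        val = lon if is_lon else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        is_lon = not is_lon
    # pack each run of 5 bits into one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, precision * 5, 5)
    )

# Every prefix of a geohash is a coarser geohash of the same point, so
# indexing the edge ngrams of one string gives multi-resolution lookup.
point = (57.64911, 10.40744)
assert geohash(*point, precision=5) == geohash(*point, precision=11)[:5]
```

The adjacency problem mentioned above is real, though: two points a meter apart can land in cells with no common prefix if they straddle a cell boundary, so a correct radius search has to also check the neighboring cells.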
Furthermore, with respect to schema changes, ElasticSearch will refuse to make backwards-incompatible changes. So for either search engine, you'll need to get comfortable at some point with the procedure for creating a new index with the new schema or mapping, reindexing your data, and hot-swapping the Solr Core or ElasticSearch Alias.
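On the ElasticSearch side, the hot swap itself is a single atomic call to the aliases API (index names here are hypothetical): you reindex into `items_v2` while searches still hit the `items` alias pointing at `items_v1`, then POST to `/_aliases` with:

```json
{
  "actions": [
    {"remove": {"index": "items_v1", "alias": "items"}},
    {"add":    {"index": "items_v2", "alias": "items"}}
  ]
}
```

Both actions happen in one step, so searches against the alias never see a gap.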
Given that this appears to be a community resource, and not something sponsored by either the Solr or ElasticSearch people, I'm sure your specific critiques would be useful.
Looks like a sales/SEO play for a Solr/ElasticSearch consultant. Still seems pretty helpful as a community resource. I emailed the author to see if he's interested in setting up a public GitHub repo to take pull requests.
Personally, I'd like to see similar comparisons for other search engines, like Sphinx and Postgres full-text search. When I talk to people about search engines, the first thing they ask is how one compares to another.
Not sure about codewright's use cases. However, in my own brief experimentation with Solr, I ran into performance issues with garbage collection.
I set up a cluster of about 15 cc2.8xlarge machines (5 shards with 3 replicas each) containing 240 GB worth of documents (48 GB per shard). Each node was given on the order of 40 GB of heap space. While performing load tests with a relatively small load (~150 QPS), after a few minutes the garbage collector on nodes would kick in and run for 15 to 30 seconds. This had a cascading effect: ZooKeeper would think nodes were down, trigger leader re-election, and so on.
Admittedly, I am quite inexperienced when it comes to dealing with applications using such large heap sizes. Though I tried a few different JVM options related to GC, I was unsuccessful in resolving the problem.
If any folks here happen to have some good resources regarding GC and large Solr clusters I would definitely be interested.
That huge heap is extremely counterproductive, because large heaps have terrible GC performance, and you're actually stealing memory from the natively memory-mapped files that make up your index.
Edit: I was benchmarking a similarly sized (though very differently configured) Solr cluster for a well-known internet company, and was able to tune it to do 5000qps, with p50 ~2ms and p99 ~20ms.
Thanks for the tips. I was considering testing again with more partitions on smaller machines; perhaps N x m1.xlarge with 8 GB of heap space each.
I was starting to think that since the heap space was so big, perhaps I should be worrying about page sizes as well. While I tried various GC settings (UseConcMarkSweepGC, ConcGCThreads, UseG1GC, etc.), I didn't take a stab at playing with the size of the new generation. Could you explain the reasoning behind this? Is the idea that most objects die young, so you want more short minor GCs in order to avoid bigger major GCs? I am quite interested.
Edit: Regarding the cluster you were working on: would you be able to give its general dimensions (number of nodes and partitions, plus memory for each)? Just trying to get a general guideline to aim for.
In general, I fix the newgen size mostly to avoid the optimizer choosing something braindead in a pathological case. 50/50 is safe, but not optimal.
In general, you should have enough unallocated memory on the box to cover your working dataset (it'll get used by caches and memmaps). If you can, find a way to exploit data locality. I shoot for (number of cores * 1-4)-ish partitions per box depending on workload. Using bigger boxes is usually better, because you can avoid communication latency and variance that arises from having tons of boxes.
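As a concrete starting point along those lines, the JVM flags would look something like the following; the sizes are illustrative, not a recommendation for any particular box:

```
-Xms8g -Xmx8g                       # small fixed heap; leave the rest of RAM to the OS page cache / mmapped index files
-XX:NewSize=4g -XX:MaxNewSize=4g    # pin newgen explicitly (a 50/50 split, per the comment above)
-XX:+UseConcMarkSweepGC             # concurrent collector to avoid long stop-the-world pauses
```

The point is that the heap only needs to cover live Java objects; the index itself is served from memory-mapped files outside the heap, which is why a 40 GB heap starves the part of the system doing the actual work.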
If you want to know more, you can email me at kyle@onemorecloud.com.
If you need a serious search engine, using Postgres and Sphinx won't last long. You'll end up moving to Solr or ElasticSearch. (I've used both, but use ElasticSearch now)
I am not codewright, but we had trouble with Sphinx on search queries containing a large number of terms (for us, hiccups started after 100 terms or so). Besides, setting up delta indexing is a PITA, and extensibility/configurability is limited. We ended up using Solr (which is a memory hog), but at least it works.
Yeah, Elasticsearch is dead easy to update documents with. It's also easier than Sphinx to set up (basically, throw some data at it and it'll suss out a mapping for it).
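For example (document and field names invented here): index `{"name": "widget", "price": 12, "created": "2012-09-01"}` into an index with no mapping defined, and fetching the mapping back returns something along these lines, with the types inferred dynamically:

```json
{
  "doc": {
    "properties": {
      "name":    {"type": "string"},
      "price":   {"type": "long"},
      "created": {"type": "date", "format": "dateOptionalTime"}
    }
  }
}
```

You can still define an explicit mapping up front when you need custom analyzers or geo types, but for getting started the dynamic mapping is usually good enough.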
Both are nice and will do the job without too much pain. I've been running an ES cluster for about a year now. I appreciate how easy it is to set up, but the documentation is terrible.
The ES docs should be a cross between RethinkDB's and Redis's; that would make life easier for everybody.
Couldn't agree more. I think the problem with the ElasticSearch docs is that they assume the user already understands the inner workings of the Lucene search engine (after all, ElasticSearch is just a nice RESTful wrapper on top of it).
If, as was my case, the most complex search you'd ever done before was a full-text search on a database field, then you'll be lost for a good couple of days until you understand what's going on.
It really doesn't need to since you aren't bound to a fixed schema. Just use whatever fields are necessary for your documents and map them to the appropriate types.
Having played with both, I personally find http://www.searchify.com/ much better than both Solr and ElasticSearch, at least based on the search results.
It looks like searchify is a hosted search solution, whereas Solr and ElasticSearch are distributions of search servers that can be deployed on your own hardware.
Technically Searchify is based on the open-sourced code from the previously proprietary IndexTank. So in theory you can run your own: https://github.com/linkedin/indextank-engine.
I'm skeptical about claims of different quality of relevancy results, since IndexTank/Searchify is also based on Lucene (looks like 3.0.1 in the canonical repo), and should share all the same fundamental relevancy and scoring functionality.
It's something I threw together in a couple hours, and figured I'd iterate and improve over the next couple days, so please bear with the mistakes.
I fixed the more glaring errors (copy fields, dynamic fields, Django, etc.), and will continue to do so as comments come in.