Why Elasticsearch? - Refactoring story part 3
As part of a system refactoring process, we replaced a legacy search system with Elasticsearch. Because search is a core component of our system, it took us more than half a year to move to the new system and make all the features work again, and it required us to be absolutely sure of Elasticsearch's competency.
In the previous posts I mentioned we wanted to keep using Lucene to build on top of existing knowledge and experience, but do this at scale, reliably and without too much pain. Elasticsearch turned out to be a perfect fit for us, and over a year after the fact we are very happy with it.
I thought I'd do a write-up to summarize what we found in Elasticsearch that made it our search engine of choice, and point to some helpful resources. This is a very high-level post that isn't meant to be all too technical; I'll be writing about our experience with some of the features in more detail in the future.
We start with the obvious one. I explained the complexity of the problem previously, and Elasticsearch really tackles it very nicely.
One server can hold one or more parts of one or more indexes, and whenever new nodes are introduced to the cluster they simply join the party. Every such index, or part of it, is called a shard, and Elasticsearch shards can be moved around the cluster very easily.
The cluster itself is very easy to form. Just bring up multiple Elasticsearch nodes on the same network and tell them the name of the cluster, and you are pretty much done. Everything is done automatically - discovery and master election are all handled behind the scenes for you.
The ability to manage large Lucene indexes across multiple servers, and to have some reliable, tested piece of code do the heavy lifting for you, is definitely a winner.
There are multiple gotchas and possible pain points though, namely some potential issues with unicast/multicast discovery, shard allocation algorithms and so on, but nothing that was a deal breaker for us.
Managing Elasticsearch instances is done purely via a very handy REST API. Responses are always in JSON, which is both machine and human readable, and complex requests are sent as JSON as well. It can't get any easier.
A recent addition to Elasticsearch is the "cat API", which gives insights and stats for a cluster in an even more human readable format. You can read more about it here. To me this shows how important it is for Elasticsearch to be easy to maintain and understand as a core feature - and that's very important.
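To give a feel for the cat API, here is a minimal Python sketch that only builds the URLs you would request. The base URL (a local node on port 9200) and the helper name `cat_url` are assumptions made for illustration; `health`, `nodes` and `indices` are real cat endpoints, and the `v` parameter asks for column headers.

```python
from urllib.parse import urlencode

# Assumption: a local node listening on Elasticsearch's default HTTP port.
BASE_URL = "http://localhost:9200"


def cat_url(endpoint, verbose=True, base_url=BASE_URL):
    """Build the URL for a cat endpoint such as 'health', 'nodes' or 'indices'."""
    url = f"{base_url}/_cat/{endpoint}"
    if verbose:
        # 'v' turns on the header row in the plain-text table that comes back.
        url += "?" + urlencode({"v": "true"})
    return url


if __name__ == "__main__":
    for endpoint in ("health", "nodes", "indices"):
        print(cat_url(endpoint))
```

Fetching any of these URLs with a browser or curl returns a plain-text table rather than JSON, which is what makes it pleasant to read in a terminal.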
Everything can be controlled via REST - from creating indexes to changing the number of replicas per index, all can be done on the go using simple REST API calls. An entire cluster, no matter how big, can be easily managed, searched or written to all through the REST API.
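As a rough illustration of those two operations, the sketch below prepares the REST calls for creating an index and for changing its replica count on the fly, using only Python's standard library. The base URL, the index name `articles` and the helper names are assumptions for the example; nothing is sent until `send` is called against a running node.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a local node listening on Elasticsearch's default HTTP port.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Prepare (but do not send) a JSON request to the cluster."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def create_index(name, shards, replicas):
    """PUT /<name> with the initial shard and replica counts."""
    settings = {"number_of_shards": shards, "number_of_replicas": replicas}
    return build_request("PUT", f"/{name}", {"settings": settings})


def set_replicas(name, replicas):
    """PUT /<name>/_settings changes the replica count of a live index."""
    return build_request(
        "PUT", f"/{name}/_settings", {"index": {"number_of_replicas": replicas}}
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # 'articles' is a hypothetical index name.
    for req in (create_index("articles", 5, 1), set_replicas("articles", 2)):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Note that the number of shards is fixed when the index is created, while the number of replicas is one of the settings that can be changed later.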
The documentation is great and lives on GitHub side by side with the code itself, but there's more. The entire spec for the REST API is available on GitHub. This is great news, since you can build any client or tool on top of that and adapt to version changes quickly.
In fact, this is exactly what some of the core tools do. I recommend using the excellent Sense UI for experimenting and also for day-to-day work with Elasticsearch. It is available as static HTML and also as a Chrome plugin, and is backed by the aforementioned REST API spec.
The great REST API also helps with rapid development, as this tool shows. It really helps you focus on your business requirements and not the surroundings.
Lucene is an amazing search library. It offers state of the art tools and practices, and is rapidly moving forward. We chose Lucene mainly because we had a lot of experience with it, but if you're a newcomer you should really choose it because of what it can offer.
Since Lucene is a stable, proven technology that continuously gains more features and best practices, having Lucene as the underlying engine that powers Elasticsearch is yet another big win.
Elasticsearch wraps Lucene and provides server abilities to it. I already covered the scaling-out abilities it provides, and the REST API for managing Lucene indexes, but there's more to it.
The REST API exposes a very complex and capable query DSL that is very easy to use. Every query is just a JSON object that can contain practically any type of query, or even several of them combined.
Using filtered queries, with some queries expressed as Lucene filters, helps leverage caching and thus speed up common queries, or complex queries with parts that can be reused.
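Here is a small sketch of what such a request body can look like, built as a plain Python dict. It uses the `filtered` query form from the Elasticsearch versions this post is about (later releases replaced it with a `bool` query that has a `filter` clause); the field names `title` and `status` and the helper name are made up for illustration.

```python
import json


def filtered_query(text, field, filter_field, filter_value):
    """Combine a scored full-text query with an unscored, cacheable filter."""
    return {
        "query": {
            "filtered": {
                # Scored part: full-text match on one field.
                "query": {"match": {field: text}},
                # Unscored part: an exact-term filter that can be cached and reused.
                "filter": {"term": {filter_field: filter_value}},
            }
        }
    }


if __name__ == "__main__":
    # 'title' and 'status' are hypothetical field names.
    body = filtered_query("elasticsearch", "title", "status", "published")
    print(json.dumps(body, indent=2))
```

The resulting JSON is what you would send as the body of a search request; the filter part is the piece that benefits from caching when many searches share it.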
Faceting, another very common search feature, is simply something you request alongside a search; the facets then accompany the search results, ready for you to use.
The number of types of queries, filters and facets supported by Elasticsearch out of the box is huge, and there's practically nothing you cannot achieve with them. Looking to the near future, the upcoming aggregation framework looks very promising and is probably going to change the way we aggregate data with Elasticsearch today.
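As a sketch of how little is involved, the helpers below attach a terms facet to an existing search body, and show the equivalent request in the aggregation framework. The `facets` key applies to the Elasticsearch versions discussed here (it was later removed in favor of `aggs`); the facet name `tags`, the field `tag` and the helper names are hypothetical.

```python
import json


def with_terms_facet(search_body, name, field, size=10):
    """Return a copy of the search body that also asks for a terms facet."""
    body = dict(search_body)
    facets = dict(body.get("facets", {}))
    facets[name] = {"terms": {"field": field, "size": size}}
    body["facets"] = facets
    return body


def with_terms_aggregation(search_body, name, field, size=10):
    """Same idea, expressed with the aggregation framework."""
    body = dict(search_body)
    aggs = dict(body.get("aggs", {}))
    aggs[name] = {"terms": {"field": field, "size": size}}
    body["aggs"] = aggs
    return body


if __name__ == "__main__":
    base = {"query": {"match_all": {}}}
    # 'tags' and 'tag' are hypothetical facet and field names.
    print(json.dumps(with_terms_facet(base, "tags", "tag"), indent=2))
    print(json.dumps(with_terms_aggregation(base, "tags", "tag"), indent=2))
```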
Elasticsearch is a search server, and the Query DSL it provides is definitely one of the places where it really shines. It is much easier than any SQL statement or Lucene queries written in Java.
You can host multiple indexes on one Elasticsearch installation - node or cluster. Each index can have multiple "types", which are essentially completely different indexes.
The nice thing is you can query multiple types and multiple indexes with one simple query. This opens quite a lot of options.
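For example, searching several indexes and types at once is just a matter of listing them, comma-separated, in the request path. The sketch below builds such a path; the index and type names are invented, and the type segment only applies to the Elasticsearch versions that still have types (they were removed in later releases).

```python
def search_path(indexes, types=()):
    """Build a _search path covering several indexes and, optionally, types."""
    parts = [",".join(indexes)]
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


if __name__ == "__main__":
    # 'blog', 'news', 'post' and 'comment' are hypothetical names.
    print(search_path(["blog", "news"], ["post", "comment"]))
    print(search_path(["blog"]))
```

The same query body can then be sent to either path, and the hits come back merged and ranked together.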
Search functions like MoreLikeThis and Suggestions (and Elasticsearch's excellent custom suggesters) are all supported as well using a very handy REST API.
More advanced search tools, like script support in filters and scorers, BM25 relevance, the analyze API for testing analyzers, term stats info via REST and much more, expose Lucene's internals and advanced capabilities for many advanced use cases, very easily.
For those times where you really need to bend Elasticsearch to do things your way, you can easily configure it. It is also very easy to extend it, and we have done so multiple times on various occasions.
Many Elasticsearch settings can be changed while Elasticsearch is running, but some require a restart (and in some cases reindexing). Most settings can be changed using the REST API too.
If you need to create additional REST endpoints for your Elasticsearch cluster, that is easily done as well.
Out of the box, Elasticsearch is pretty much feature complete. The endless extensibility options make it quite impossible to ever get stuck without a solution. This has already saved our day once or twice.
Percolation is a codename for Elasticsearch's ability to run many queries against one document, do this efficiently, and tell you which queries match this document. I have no idea why it's called that, but it works, and works great.
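To make the idea concrete, here is a toy illustration in plain Python. It is not Elasticsearch's implementation and does not use its API: the "queries" are just sets of terms that must all appear in the document, and every name in it is made up. What it shows is the inversion: queries are stored up front, and each incoming document is checked against them.

```python
class ToyPercolator:
    """Stores named queries, then reports which ones match a given document."""

    def __init__(self):
        # query name -> set of lowercase terms that must all be present
        self._queries = {}

    def register(self, name, required_terms):
        """Store a query under a name, before any document arrives."""
        self._queries[name] = {term.lower() for term in required_terms}

    def percolate(self, text):
        """Return the sorted names of all stored queries matching this text."""
        tokens = set(text.lower().split())
        return sorted(
            name for name, terms in self._queries.items() if terms <= tokens
        )


if __name__ == "__main__":
    percolator = ToyPercolator()
    percolator.register("disk-alert", ["disk", "full"])
    percolator.register("es-news", ["elasticsearch"])
    print(percolator.percolate("Disk is almost full on node-3"))
    print(percolator.percolate("Elasticsearch reports disk full"))
```

In the real thing the stored queries are full Query DSL queries, and the matching is done by Lucene rather than by a loop over sets.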
This is a very useful feature for implementing an alerting system - like Google Alerts (is it still operational??) - or, if you are indexing logs, for alerting sysadmins when some metric goes out of line.
We use this extensively in an alerting system we have in place, and using Elasticsearch's extensibility we were able to rewrite the percolator to add some optimizations of our own based on our business logic, so we are now running it 10 times faster. Brilliant.
Elasticsearch allows you to create indexes using merely configurations in JSON or YAML. It looks like this. This makes the requirement of rolling your analyzer in code much less common - but when it's needed it's also very easy to do. Here is a code example for doing this for a Hebrew analyzer.
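As a minimal sketch of analyzer configuration done purely in JSON (not the example linked above), the snippet below builds index settings that define a custom analyzer from built-in parts: the `standard` tokenizer plus the `lowercase` and `stop` token filters. The analyzer name `my_text` and the helper name are assumptions for illustration.

```python
import json


def index_settings_with_analyzer(analyzer_name):
    """Index-creation body that defines one custom analyzer by configuration."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Built-in tokenizer and token filters, no Java code needed.
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # 'my_text' is a hypothetical analyzer name.
    print(json.dumps(index_settings_with_analyzer("my_text"), indent=2))
```

Sending this JSON as the body of the index-creation request makes the analyzer available to the mappings of that index.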
The nice part about analyzers with Elasticsearch is you don't have to define them globally if you know different documents will need to be analyzed differently. You can simply leave the analyzer selection to indexing time by using the _analyzer field. This is super useful for multi-lingual indexes, and I'll have a proper blog post about this topic soon.
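A rough sketch of how this can look, assuming the `_analyzer` mapping with a `path` setting as it exists in the Elasticsearch versions discussed here (it was removed in later releases): the mapping names a document field, and each document stores the name of the analyzer it wants in that field. The type name `article` and the field names `text` and `lang_analyzer` are hypothetical; `english` and `french` are built-in language analyzers.

```python
import json


def mapping_with_analyzer_path(type_name, path_field):
    """Mapping that reads the analyzer name from a field of each document."""
    return {"mappings": {type_name: {"_analyzer": {"path": path_field}}}}


def make_document(text, analyzer_name, path_field="lang_analyzer"):
    """A document that carries the name of the analyzer to index it with."""
    return {"text": text, path_field: analyzer_name}


if __name__ == "__main__":
    # 'article', 'text' and 'lang_analyzer' are hypothetical names.
    print(json.dumps(mapping_with_analyzer_path("article", "lang_analyzer"), indent=2))
    print(json.dumps(make_document("The quick brown fox", "english")))
    print(json.dumps(make_document("Le renard brun", "french")))
```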
Many people use Elasticsearch today, and the ecosystem is steadily growing. The Elasticsearch team maintains several plugins on GitHub (see https://github.com/elasticsearch/), but there's a whole lot more. You can find a partial list here. For the rest, Google is your friend...
Other than plugins there are a lot of monitoring and admin tools available, like this nice CLI tool for Unix. There are also Chef and Puppet recipes, dashboards, VMs and whatnot. If you need it, you can probably find it somewhere.
The community, other than creating nice tools and plugins, is very helpful and supportive. The overall vibe is really great, and this is an important metric of any OSS project.
There are also some books currently being written by community members, and many blog posts around the net sharing experiences and knowledge.
Elasticsearch is much more than an OSS project - today it's also a company which also serves as an umbrella for other projects like logstash, kibana and Hadoop integration. This, while still keeping all of its code available under the permissive Apache Software License.
Some of the engineers in Elasticsearch Inc are long-time Lucene committers, others are long time OSS contributors, and are overall great people and great engineers. Definitely the type of people you can trust a project with.
Elasticsearch is here to stay, and the team is pushing forward very hard - judging by the amount of work they have done lately, and the type of projects they take on.
I'm really looking forward to what will come next.