Building a distributed search engine - Refactoring story part 2
In my last post in the series I gave an introduction to search engines. I showed the concept of an Inverted Index and how it helps with performing searches on large amounts of texts quickly and efficiently. In this post I'll be discussing the issue of scaling out a search engine.
We chose Apache Lucene for implementing our products, which are very heavy on search. Lucene is a superb open-source search engine library, with a very active user and developers communities. This choice was an obvious one for us, as the system we set out to replace was already using Lucene, and we already had vast experience with it dating years back.
Once you grow big enough, like we did, you will get to a point where one index on one server cannot really accommodate all your data. If you are read-heavy (and most applications are heavier on reads than they are on writes) you will also notice there's a limit to the number of requests you are able to respond to in a timely manner, even when you don't have that much data. At some point you'd come to realise you need your search system to work with more than one server - and then you will wonder how to achieve that.
But how do you do that? Let's first solve the problem of replicating an index, which is somewhat easier. We will see this also helps us with defining a better strategy for spreading one large index across multiple servers.
Replicated Lucene index
Having a replica for a search-engine index effectively means cloning it to other servers, and propagating changes made to one copy to the other copies of it. The thing you clone is effectively the inverted index; that's what a search engine effectively _is_. But how do you propagate changes?
The immediate solution that comes to mind is replicating the actual files on disk. Lucene works in such a way where changes are being made incrementally on disk, using the notion of segments. Every set of changes committed to the index creates a new segment, which is saved on disk to files representing that segment alone. Previous segments are never touched once written, only new segments as new files on disk. The index as a whole is just the collection of all segments created over time, and the "generation" of the index is a number representing the times new segments were created.
So a possible thing to do is to replicate new segments once they have been written to disk. This will achieve what we needed quite nicely, but will have one serious drawback - changes will take time to propagate as segments may not be written frequently. It will also mean high network and IO usage with potentially large peaks (especially when a large number of replicas are involved), and failures which are bound to happen will be costly to recover from.
Replicating segment files will also mean messing with Lucene's merge policies to make sure segments aren't merged and deleted too soon, so we can be sure they have been replicated. This can become very tough manoeuvre, as having many segments can result in reduced speed (especially when there are deletes involved), and Lucene's defaults are usually quite good.
Real-time search also on replicas for consistency, and good performance overall are very important to us, and this is why we concluded the best way to propagate changes is by propagating the actual changes. Instead of having one node doing the indexing and replicating the index, we can have multiple independent nodes accepting the same change commands, and all performing indexing themselves.
Yes, this would entail some additional work to make sure all changes are propagated and committed correctly, but that is already a solved problem.
One important detail I haven't mentioned yet is the requirement of having one master server, replicating to all slaves. It is important to have only one node accepting writes from clients at every given moment, to avoid write conflicts scenarios.
Sharding a Lucene index
For many reasons, and mostly for simplicity, we wanted our data to be indexed continuously into one large index. Since we have millions of new documents added to our system daily, it was clear from day one we had to span our search engine on multiple servers. So we need to have one index span several servers - that's what we call a sharded index.
While a priori there can be multiple ways of sharding an inverted index across multiple servers, for example storing only several fields or ranges of terms on every shard, it become clear very quickly that you do want to have one shard containing the entire inverted data of one document. Without having this, searches will be very expensive as they would have to cross server boundaries for the majority of searches made.
Building on the same logic we already have for replicating an index, we can solve this problem as well. One index can be spread across many servers by simply considering each node as holding an independent index, and routing indexing commands to nodes by using a sharding function. The sharding function determines what node is responsible for each document that comes in, so only one node receives the incoming indexing request and processes it, thus spreading the data across multiple servers. This can be done by a hashing function on the document ID, or based on some field of the document.
A good sharding function will index the data evenly across all nodes, and will possibly try to ensure as many search operations as possible can be performed on as the smallest amount of shards as possible.
For example, if we index Facebook and twitter data, and we know most searches look for only Facebook or only twitter, then we can optimise our sharding function by asking it to shard based on the post ID and the origin, so all tweets are on one portion of our cluster, and all Facebook posts are on the rest of the nodes. This way, most search operations will have to access only half of the nodes.
It is important to note replication is orthogonal to sharding - once you achieved sharding you can replicate each shard individually.
As you may have figured out by now, indexing data on multiple servers means search queries will have to ask multiple servers for results. While Lucene is great at providing results ordered by relevance, this doesn't work well in a distributed scenario as scores given by one node can't really be compared to scores given by another. Merging two sets of results from 2 nodes and sorting by score will most of the time produce bad results. Or at least, we could do better. So this is a problem we had to take into account.
Fortunately for us, we didn't have to implement all that from scratch. Elasticsearch implements this and more in a highly optimised and flexible way, and we decided to use it as our search server technology, wrapping Lucene to provide us excellent scaling out capabilities.
With Elasticsearch it is very easy to create a large cluster of nodes all running one large index, or even multiple indexes. Replication is obviously supported, and a lot of the nitty gritty details of building a distributed system and search engine are already taken care of. And that includes proper scoring of results coming from multiple nodes.
In the next post I'll describe the main reasons which led us to choose Elasticsearch as our search-server technology, including some design decisions we had to make to make this coupling make sense.