Search engines 101 - Refactoring story part 1

Buzzilla, English Posts, Lucene, IR

A couple of months ago I was about to start blogging about the work we had been doing at Buzzilla over the preceding months. A few days after posting about the planned series, things got hectic and I had to let it go for a while. Now that things are a bit calmer, I can finally get around to writing it.

So, at Buzzilla we collect data from various sources - mostly mainstream websites, forums and social networks - and let people search it. In addition to showing search results, we also give our users insights into the distribution of those results - where they were published, by whom, and so on. We have quite a lot of smarts in place, but that's a topic for another time.

It all began when scale started to become an issue. The amount of data we deal with daily is enormous, and it is constantly growing. Twelve months ago the system we had in place simply couldn't handle that volume any more, so rebuilding it from scratch became our number one priority.

But how do you build a distributed search engine? First, let's quickly describe what a search engine does. After that we'll be able to talk about building a distributed one.

How does a search engine work?

Remember the old days, when we read real paper books? Some of those books, usually the more serious ones, had an index as an appendix at the end. That index was useful for people using the book as a reference, or just for jumping around: it listed all the important terms, alphabetically ordered, and for each one said which pages of the book mentioned it.

So if, for example, you were reading a book about magic and wanted to get to the discussions about wands, you'd open the back of the book, look up "wand" in the index, and then go to all the pages listed there. In the index the focus shifts from the entire discussion to individual words. This is the concept behind search engines.

We humans read text by looking at a document and reading it word by word. Sometimes a document contains multiple readable areas - for example the title, a short description, the author name and the actual content - let's call those fields. The indexing process, which is at the heart of every search engine, basically inverts this order: instead of starting from a document, it lets you start from a word, or multiple words, and then tells you in which documents you can find them. This is why the index at the core of all search engines is called an inverted index.

This makes a lot of sense when you come to think of it: when we are looking for a word, unless we optimise the lookup as described we would end up going through millions of documents, word by word. Once the logic is inverted we start by looking up a word, and we can then optimise further by using very efficient data structures that are well-suited for the job, such as a trie.
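To make the idea concrete, here is a minimal toy sketch of an in-memory inverted index in plain Java - an illustration of the concept, not how Lucene actually stores things internally:

```java
import java.util.*;

// A toy inverted index: maps each word to the sorted set of document IDs containing it.
public class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        // Naive tokenization: lowercase and split on anything that isn't a letter
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (word.isEmpty()) continue;
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "There is more than one way to wave a wand");
        index.addDocument(2, "Brooms, cauldrons and other equipment");
        System.out.println(index.search("wand")); // prints [1]
    }
}
```

Looking up a word is now a single map lookup followed by reading its postings list, instead of a scan over every document.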

The indexing process requires a lot of computational power and a significant amount of I/O, and it can get bothersome at times. But it also means most of the work is done before any searches are performed. Because the hard work has already happened at indexing time, searches are super-fast, especially when it is possible to keep the inverted index entirely in memory.

Lucene

All of that indexing and search work is, basically, what Lucene does. Lucene is a very mature open-source search-engine library written in Java (with ports to .NET, C++ and more) that takes responsibility for performing the indexing process for you and lets you search the index later.

While your content may come in various formats (Word documents, PDFs and so on), Lucene only knows how to work with plain text. The parsing part you'd have to do yourself; try Lucene's sister project Tika for help with extracting text from other types of content.
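For example, here is a minimal sketch using Tika's facade API to pull plain text out of a file before indexing it (the file path here is made up for illustration):

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the file type (PDF, Word, HTML...) and extracts its plain text
        String text = tika.parseToString(new File("/path/to/some-report.pdf"));
        System.out.println(text);
    }
}
```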

To index data with Lucene, you create a Lucene Document object, fill it with text in various fields ("content" gets the actual content, "title" holds the title, and so on), and then pass it to an IndexWriter. The IndexWriter will split the text stream into individual words, sort them alphabetically, and update the entries in the inverted index.
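A minimal sketch of that flow, written against a recent Lucene version (exact class names and constructors vary between versions, and the index path and field values here are made up):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexOneBook {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/books-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            // TextField is analyzed - split into individual words before indexing
            doc.add(new TextField("content", "All about wands, brooms and cauldrons...", Field.Store.NO));
            // StringField is indexed as a single term, like the title and author in the image below
            doc.add(new StringField("title", "Magical Equipment", Field.Store.YES));
            doc.add(new StringField("author", "Some Author", Field.Store.YES));

            writer.addDocument(doc);
        }
    }
}
```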

At the end of the process you'll have what can be thought of as a table with two columns: one for the list of words, and one for the list of occurrences of each word across the documents. Each such table holds the words and their occurrences for a specific field, which gives you the ability to limit searches to specific fields only.
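Searching then happens per field. Here is a minimal sketch, again against a recent Lucene version and reusing the hypothetical index and field names from the indexing snippet above; because "author" was indexed as a single un-split term, the query has to match the whole value:

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class SearchByAuthor {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/books-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {

            IndexSearcher searcher = new IndexSearcher(reader);

            // Look up the term "Some Author" in the "author" field only
            TopDocs hits = searcher.search(new TermQuery(new Term("author", "Some Author")), 10);

            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("title"));
            }
        }
    }
}
```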

The image below shows an inverted index built out of 4 documents describing books, each with multiple text fields. In this indexing process the words in the title and author name were not split up; instead, the entire textual value was indexed as a single term - just like the indexes in the old books. Had the text been broken down into individual words, there would be many more terms in that table:

An inverted index

But what happens when your data grows so big that you can't accommodate it on one server any more - for example, if we had billions of books to index? And how can you ensure high availability and implement load balancing?

I'll discuss this in the next post in the series.

