Why Elasticsearch? - Refactoring story part 3

ElasticSearch, Lucene, IR, Buzzilla Comments (1)

As part of a system refactoring process, we replaced a legacy search system with Elasticsearch. Since search is a core component of our system, it took us more than half a year to move to the new system and make all the features work again, and it required us to be absolutely sure of Elasticsearch's competency.

In the previous posts I mentioned we wanted to keep using Lucene to build on top of existing knowledge and experience, but to do so at scale, reliably, and without too much pain. Elasticsearch turned out to be a perfect fit for us, and over a year after the fact we are very happy with it.

I thought I'd do a write-up to summarize what we found in Elasticsearch that made it our search engine of choice, and point at some helpful resources. This is a very high level post that doesn't intend to be too technical; I'll be writing about our experience with some of the features in more detail in the future.

Easy to scale

We start with the obvious one. I explained the complexity of the problem previously, and Elasticsearch really tackles it very nicely.

One server can hold one or more parts of one or more indexes, and whenever new nodes are introduced to the cluster they simply join the party. Each such part of an index is called a shard, and Elasticsearch shards can be moved around the cluster very easily.

The cluster itself is very easy to form. Just bring up multiple Elasticsearch nodes on the same network and tell them the name of the cluster, and you are pretty much done. Everything happens automatically - discovery and master election are all done behind the scenes for you.

The ability to manage large Lucene indexes across multiple servers and have some reliable, tested piece of code do the heavy lifting for you is definitely a winner.

There are multiple gotchas and possible pain points though, namely some potential issues with unicast/multicast discovery, shard allocation algorithms and so on, but nothing that was a deal breaker for us.
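
For reference, this is roughly what the relevant bits of elasticsearch.yml look like when opting for unicast discovery (the cluster name and host names below are made up):

# elasticsearch.yml - minimal clustering configuration (illustrative values)
cluster.name: my-search-cluster

# Disable multicast and list a few known nodes to ping for discovery
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node1.local", "es-node2.local"]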

Everything is one JSON call away

Managing Elasticsearch instances is done purely via a very handy REST API. Responses are always in JSON, which is both machine and human readable, and complex requests are sent as JSON as well. It can't get any easier.

A recent addition to Elasticsearch is the "cat API", which gives insights and stats for a cluster in an even more human readable format. You can read more about it here. To me this shows that being easy to maintain and understand is treated as a core feature of Elasticsearch - and that's very important.

Everything can be controlled via REST - from creating indexes to changing the number of replicas per index, all can be done on the go using simple REST API calls. An entire cluster, no matter how big, can be easily managed, searched or written to, all through the REST API.
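
To illustrate (the index name and values here are made up), creating an index and later changing its replica count are both single REST calls:

PUT /articles
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

PUT /articles/_settings
{
  "index": { "number_of_replicas": 2 }
}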

The documentation is great and lives on github side by side with the code itself, but there's more. The entire spec for the REST API is available on github. This is great news, since you can build any client or tool on top of that and be able to conform to version changes quickly.

In fact, this is exactly what some of the core tools do. I recommend using the excellent Sense UI for experimenting and also for day-to-day work with Elasticsearch. It is available as static HTML and also as a Chrome plugin, and is backed by the aforementioned REST API spec.

The great REST API also helps with rapid development, as this tool shows. It really helps you focus on your business requirements and not the surroundings.

Unleashed power of Lucene under the hood

Lucene is an amazing search library. It offers state of the art tools and practices, and is rapidly moving forward. We chose Lucene mainly because we had a lot of experience with it, but if you're a newcomer you should really choose it because of what it can offer.

Since Lucene is a stable, proven technology that keeps gaining more features and best practices, having it as the underlying engine that powers Elasticsearch is, yet again, another big win.

Excellent Query DSL

Elasticsearch wraps Lucene and adds server capabilities on top of it. I already covered the scaling-out abilities it provides, and the REST API for managing Lucene indexes, but there's more to it.

The REST API exposes a very rich and capable query DSL that is still very easy to use. Every query is just a JSON object that can practically contain any type of query, or even several of them combined.

Using filtered queries, with some parts of the query expressed as Lucene filters, helps leverage caching and thus speeds up common queries, or complex queries with parts that can be reused.
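
As a sketch (field names are made up), here is a filtered query where the filter part is cacheable and reusable across queries:

GET /articles/_search
{
  "query": {
    "filtered": {
      "query": { "match": { "title": "elasticsearch" } },
      "filter": { "term": { "language": "english" } }
    }
  }
}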

Faceting, another very common search feature, is just something that, upon request, accompanies the search results and is then ready for you to use.
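
Requesting a facet is just another clause in the same JSON request - for instance, a terms facet over a (made-up) language field:

GET /articles/_search
{
  "query": { "match_all": {} },
  "facets": {
    "by_language": {
      "terms": { "field": "language" }
    }
  }
}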

The number of types of queries, filters and facets supported by Elasticsearch out of the box is huge, and there's practically nothing you cannot achieve with them. Looking to the near future, the upcoming aggregation framework looks very promising and is probably going to change the way we aggregate data with Elasticsearch today.

Elasticsearch is a search server, and the Query DSL it provides is definitely one of the places it really shines. It is much easier to work with than SQL statements or Lucene queries written in Java.

Multi-tenancy

You can host multiple indexes on one Elasticsearch installation - node or cluster. Each index can also hold multiple "types", which let you store and query different kinds of documents almost as if they were separate indexes.

The nice thing is you can query multiple types and multiple indexes with one simple query. This opens quite a lot of options.
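
Querying across indexes and types is just a matter of listing them in the request path (the names below are made up):

GET /blog,forum/post,comment/_search
{
  "query": { "match": { "body": "elasticsearch" } }
}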

Multi-tenancy: check.

Support for advanced search features

Search features like MoreLikeThis and Suggestions (including Elasticsearch's excellent custom suggesters) are all supported as well via the same handy REST API.

More advanced tools like script support in filters and scorers, BM25 relevance, the analyze API for testing analyzers, term stats info via REST and much more expose all of Lucene's internals and advanced capabilities very easily, for all kinds of advanced usages.

Configurable and Extensible

For those times where you really need to bend Elasticsearch to do things your way, you can easily configure it. It is also very easy to extend, and we have done so multiple times on various occasions.

Many of Elasticsearch's configuration settings can be changed while Elasticsearch is running, but some will require a restart (and in some cases reindexing). Most settings can be changed using the REST API too.

Elasticsearch has several extension points - namely site plugins (which let you serve static content from ES, like monitoring JavaScript apps), rivers (for feeding data into Elasticsearch), and plugins that let you add modules or components within Elasticsearch itself. This allows you to swap out almost every part of Elasticsearch if you so choose, fairly easily.

If you need to create additional REST endpoints to your Elasticsearch cluster, that is easily done as well.

Out of the box, Elasticsearch is pretty much feature complete. The endless extensibility options make it quite impossible to ever get stuck without a solution. This has already saved our day once or twice.

Percolation

Percolation is a codename for Elasticsearch's ability to run many queries against one document, do this efficiently, and tell you which queries match this document. I have no idea why it's called that, but it works, and works great.

This is a very useful feature for implementing an alerting system - like Google Alerts (is it still operational??) - or, if you are indexing logs, for alerting sysadmins when some metric doesn't align.
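
The basic flow is to register queries and then percolate documents against them. Roughly (index, type, field and query names are all made up):

PUT /_percolator/articles/mentions-of-my-brand
{
  "query": { "match": { "body": "my-brand" } }
}

GET /articles/article/_percolate
{
  "doc": { "body": "A new post mentioning my-brand in passing" }
}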

We use this extensively in an alerting system we have in place, and thanks to Elasticsearch's extensibility we were able to rewrite the percolator and add some optimizations of our own based on our business logic, so we are now running it 10 times faster. Brilliant.

Custom analyzers and on-the-fly analyzer selection

Elasticsearch allows you to create indexes using mere configuration in JSON or YAML. It looks like this. This makes the need to roll your own analyzer in code much less common - but when it's needed, it's also very easy to do. Here is a code example for doing this for a Hebrew analyzer.
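
For a sense of what such configuration looks like, here is a sketch of a custom analyzer defined purely in index settings (the analyzer name and filter chain are illustrative):

PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
}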

The nice part about analyzers with Elasticsearch is that you don't have to define them globally if you know different documents will need to be analyzed differently. You can simply leave the analyzer selection to indexing time by using the _analyzer field. This is super useful for multi-lingual indexes, and I'll have a proper blog post about this topic soon.
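
In the mapping this boils down to pointing the _analyzer field at another field in the document, roughly like this (type and field names are made up):

PUT /articles/post/_mapping
{
  "post": {
    "_analyzer": { "path": "lang_analyzer" },
    "properties": {
      "lang_analyzer": { "type": "string", "index": "not_analyzed" },
      "body": { "type": "string" }
    }
  }
}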

Rich ecosystem

Many people use Elasticsearch today, and the ecosystem is steadily growing. The Elasticsearch team maintains several plugins on github (see https://github.com/elasticsearch/), but there's a whole lot more. A partial list can be found here. For the rest, Google is your friend...

Other than plugins, there are a lot of monitoring and admin tools available, like this nice CLI tool for Unix. There are also Chef and Puppet recipes, dashboards, VMs and whatnot. If you need it, you can probably find it somewhere.

Active community

The community, other than creating nice tools and plugins, is very helpful and supportive. The overall vibe is really great, and this is an important metric for any OSS project.

There are also some books currently being written by community members, and many blog posts around the net sharing experiences and knowledge.

Proactive company

Elasticsearch is much more than an OSS project - today it's also a company, which serves as an umbrella for other projects like logstash, kibana and the Hadoop integration. All of this while still keeping all of its code available under the permissive Apache Software License.

Some of the engineers at Elasticsearch Inc are long-time Lucene committers, others are long-time OSS contributors, and overall they are great people and great engineers. Definitely the type of people you can trust a project with.

Elasticsearch is here to stay, and the team is pushing forward very hard - judging by the amount of work they have been doing lately, and the type of projects they take on.

I'm really looking forward to what will come next.


Many to many relationships and RavenDB models

RavenDB, RavenDB in Action, EventsZilla Comments (1)

I recently came across this question on StackOverflow titled "Many to many design for NoSql (RavenDB)", and I thought it was a really good example of a very frequently asked question, worth featuring here:

I have two simple objects, 'user' and 'events'. A user can enter many events and an event can be entered by many users - a standard many to many relationship. I'm trying to get out of the relational database mindset!

Here are the queries/actions that I'd like to run against the database:

  • Get me all the events that a user has not entered
  • Get me all the events that a user has entered
  • Update all the events (properties like remaining spaces) very frequently (data polled from various external data sources).
  • Removing events when they've expired

The OP then lists three options for solving this:

  1. A "relational approach": Create a new object, that links user and events. For example, a "booking" object, that stores the userId, eventId
  2. Denormalise the events data within the user object, so there is a list of eventIds on the user.
  3. Don't use RavenDB for this, instead use a relational database.

Relational vs RavenDB modeling questions are very common, and the question about many-to-many relations in particular is one I'm getting very frequently - that is, once people are aware that design decisions differ between relational databases and document databases (and between document databases and RavenDB, and basically between every type of non-relational database). It's also very common for people to find it hard to escape relational thinking, as is evident from the 3 options the OP listed.

I was just finishing writing chapter 5 of RavenDB in Action, which discusses document-oriented modeling, so I grabbed my pen (well, keyboard) and wrote down the following answer. It was only a coincidence that I had already discussed almost the exact same domain in the past.

The event-registration domain is a great one for discussing RavenDB modeling, especially because there may be various use cases in play, and use cases dictate the approach to take when deciding on the model. This is pretty much what I tried explaining in my answer, using some nice indexing technique to solve one less-trivial query that was required:

Actually, RavenDB is a perfect fit for that. To properly do that, ask yourself: what are the main entities in your model? Each one of those will be a document type in RavenDB.

So in your scenario, you'd have Event and User. Then, an Event can have a list of User IDs which you can then easily index and query on. There are more ways to do that, and I actually discussed this in my blog some time in the past with some further considerations that might come up.
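
To make the index below concrete, assume a model roughly like this (class and property names are my own illustration, not taken from the original question):

public class Event
{
    public string Id { get; set; }
    public string Name { get; set; }
    public int RemainingSpaces { get; set; }
    public List<Registration> Registrations { get; set; }
}

public class Registration
{
    public string UserId { get; set; }
    public DateTimeOffset RegisteredAt { get; set; }
}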

The only non-trivial bit is probably the index for answering queries like "all events user has not entered", but that's quite easily done as well:

public class Events_Registrations : AbstractIndexCreationTask<Event>
{
    public Events_Registrations()
    {
        // Index each event together with the IDs of all users registered to it
        Map = events => from e in events
                        select new
                        {
                            EventId = e.Id,
                            UserId = e.Registrations.Select(x => x.UserId)
                        };
    }
}

Once you have such an index in place, you can do a query like below to get all events a specified user has no registrations to:

var events = RavenSession.Advanced.LuceneQuery<Event, Events_Registrations>()
                                .Where("EventId:*")
                                .AndAlso()
                                .Not
                                .WhereEquals("UserId", userId).ToList();

Then handling things like expiring events etc is very easily done. Definitely don't denormalize event data into the User object - it will make your life a living hell.

There's some data missing though - for example, how many registrations are allowed per event. If there are too many you may want to break them out of the actual Event object, or revert to Booking objects as you mention. I discuss this at length in my book RavenDB in Action.


The most efficient way of joining the BitCoin ride for making profit

BitCoin, English posts Comments (0)

I've been sitting on this one for a while now because I wanted to verify my theory. Well, it seems like I am on to something here and it's time to share.

Here is how I joined the BitCoin ride with low costs, somewhat lower risks, and if I'm right - with much bigger potential for a great profit. You are welcome to do the same, but needless to say I'm taking no responsibility. All I ask is that you buy me a beer if the advice below made you a millionaire :)

Just in case you haven't been around the last couple of months, here is a quick recap: BitCoin is a financial product, striving to be a currency, where one BitCoin is basically one large number that satisfies some mathematical conditions. BitCoin is both the definition of the mathematical algorithms used to define one "coin" and the protocol for verifying coins and transactions. I'm being very inaccurate here, but the basic gist of BitCoin is having virtual money that is transparent and verifiable through network consensus. Note how I'm only saying it's a financial product and not a currency - this is because it doesn't yet satisfy the conditions of a proper currency; it is, however, traded and respected, and it does have real-world value.

There is quite a lot that can be said about BitCoin, and honestly it can drive very interesting discussions on economy, forming trust and consensus, modern society and the Internet. Is it a bubble? Can it still be influenced by big players like governments? Does it have the potential of putting our entire economy at risk in the future? These questions are all important but not what I'm going to discuss here. Instead, I want to ignore all of that and consider BitCoin to be just a financial product that you can invest in (or bet on, as some would say). But when one BitCoin is worth over $1,000, it is a bit hard to invest in it in large numbers, needless to say the risk is also quite big.

BitCoin's rate is going sky high

Turns out there are more cryptocurrencies in play. They work in a way almost identical to BitCoin yet differ in various ways, usually in the algorithms used under the hood. If you haven't heard of NameCoin, PeerCoin and LiteCoin you should definitely check them out. LiteCoin for example is going to have 4 times more coins generated when mining is complete, and transactions with it are processed in a quarter of the time due to different algorithms that make up its core. This means it has all the benefits of BitCoin, and a few extra very important unique benefits of its own. All of that without any real disadvantage.

LiteCoin is the second most traded cryptocurrency. Looking at charts of previous months, it can also be seen how LiteCoin is gaining momentum much faster than BitCoin. It is clearly riding on the waves of BitCoin's success, but it seems people are realizing LiteCoin has great potential on its own as another cryptocurrency, maybe even a better one than BitCoin itself. Think silver and gold, where people realize silver has some important benefits over gold. I'm far from being an expert or a prophet, but to me it seems there is room for more than one such currency, and even though BitCoin is the most noticeable one - there may be better options as well. I have been tracking this for some time now, and LiteCoin seems like a very smart investment - pretty much like buying BitCoins over a year ago when they were just $30 a piece.

So, if you want to take a smart bet, I'd go with buying a large amount of LiteCoins while they are still cheap instead of investing in pricey BitCoins. I think BitCoin's rate is still going to go up. But LiteCoin, in my opinion, is going to go up very rapidly and reach hundreds or even thousands of dollars a coin in the next few months, which gives it a much better ROI. What will happen next I don't know, but at least you were able to catch a good wave early on. And if it crashes - you may be able to get away with smaller cuts. Other cryptocurrencies could work as well, but I feel strongly about LiteCoins.

I bought quite a few of them when they were just $10 each, and now they are around $40 already. I'm planning to sell some of them at a certain high price to return the initial investment, and then go with the flow.

And don't forget to buy me a beer when you cash out.


The Hebrew calendar explained

Software design, English posts Comments (1)

This post was triggered by Jon Skeet's tweet about adding Hebrew calendar support to NodaTime. Jon complained that because the hour and day lengths change all the time in the Hebrew calendar, it requires some fundamental changes to NodaTime's inner workings.

Well Jon, I think you can save yourself the trouble.

Initially I intended to do a full write up on the Hebrew calendar, but the Wikipedia article on the topic is very good and thorough. You really should read it, and if you can read Hebrew make sure to check the Hebrew version of the page, it adds quite a lot.

What I will do instead is summarize the basics and inner workings of this calendar system, and use that to explain why most of those calculations can be safely ignored as they are only done ad-hoc in very specific scenarios.

We start with the definitions of time of day, and then move to discussing days, months and years.

A Temporal Hour

The most basic unit in the calendar system is probably an hour, and it is defined very differently from what we are used to. An hour is the product of the following calculation: the time between sunrise and sundown divided by 12. This is referred to as a "temporal hour", because its actual length changes all the time.

While the length of a temporal hour is around 60 minutes, its length in Israel can be any value between 48 minutes (in the shortest winter days) and 72 minutes (in long summer days). This is obviously affected by the geographic location since the length of the day changes - around the equator a temporal hour will always be close to 60 minutes, and as you go closer to the north pole the differences get much bigger.
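
The calculation itself is trivial once you know the local sunrise and sundown times; a minimal sketch (where those times are assumed to come from an astronomical library or table):

// Length of a temporal hour: the daylight period divided by 12
static TimeSpan TemporalHour(DateTime sunrise, DateTime sundown)
{
    return TimeSpan.FromTicks((sundown - sunrise).Ticks / 12);
}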

There is actually a disagreement on the definition of sunrise and sundown in this context. There are 2 methods: one considers the astronomical sunrise and sundown to be the deciders, that is - sunrise is when the tip of the sun is seen at the east horizon, and sundown is when the tip of the sun disappears in the west. This approach is commonly referred to as The Gra's Method (שיטת הגר"א), the Gra being one of the recent rabbis advocating for it.

The other method is named after the Magen-Avraham (מג"א), and considers sunrise to be the time at which the transition from night to day begins (when the skies begin to have some light in them - the dawn), and sundown to be when stars can be seen in the sky - when the transition to night completes.

Unlike the first definition, which is purely astronomical and where times can be calculated using scientific methods and navy tables, the second definition is a lot harder to "put the finger on". After many arguments in traditional Jewish texts there are 3 common definitions of the time that is referred to as "dawn" (Alot HaShacher - עלות השחר):

  1. 72 minutes before the astronomical sunrise (the time it takes to calmly walk 4 times 1 Talmudic mile - about 1km, so 18 minutes * 4). This is the most widely accepted method for the time of the dawn.
  2. 90 minutes before sunrise (5 * 18 minutes) is the method many in Jerusalem use.
  3. 120 minutes before sunrise is a lesser used method, but still many use it (Chabad for example).

Similarly, there are multiple methods to define the start of the night - how much time after the astronomical sundown it is definitely dark outside and stars can be seen clearly. Most differences originate from arguments over how much time it takes to walk 1 Talmudic mile (most think it's 18 minutes, but there are other opinions as well), and how many miles are to be considered for this calculation. There are 3 commonly used methods:

  1. Most widely accepted method is 18 minutes after the astronomical sundown. This is again the Gra's method, and there are some variants of it measuring 13.5 or 24 minutes instead.
  2. 72 minutes after sundown, also known as Rabenu Tam's method. Mostly used as a precaution time measurement, for example to tell when Shabbat or holidays are over.
  3. 40 minutes after sundown, Hazon Ish's method.

As usual with Jewish tradition, there are a lot of arguments over almost anything, and there are 2 or 3 commonly used methods which really set the tone. This argument over the length of daylight time affects the length of a temporal hour, but also the length of the day, as we will see next. Temporal hours aren't really used as a unit of time today, as I'll explain in the last section.

One day's length

In the Hebrew calendar a day is not a product of the number of hours. Rather, a day starts and ends at sundown. But what is sundown?

It is widely agreed that the actual transition between days is sometime after sundown, but since we can't really put the finger on it, there are multiple opinions on how much time that actually is.

The core of the question is about the length of what is called "Bein Hashmashot" - between the suns, the sun that can be seen in the sky and the sun that gives the light to the sky. It's the same entity but two different concepts.

This is where the second definition of sundown from above is mostly used - when it's clearly dark outside, some minutes after the astronomical sundown, it's clear that the sun is completely absent. There are again a lot of details which I'll skip this time, but you get the idea.

Once stars are out, you can flip the day on the calendar.

One month's length

The Hebrew calendar is a lunisolar one - meaning it uses both solar and lunar calendar systems to calculate dates. Months follow the moon, years and seasons follow the sun.

Unlike other calendars (like the commonly used Gregorian), a month in the Hebrew calendar is measured by the time it takes the moon to complete a full cycle. This time is on average 29 Hebrew days, 12 hours and 793 hour-parts (a part is 1/1080 of an hour).

A month is the time it takes the moon to appear and disappear again

Because it doesn't make any sense to switch dates and months in the middle of the day, a rounding mechanism is applied, so Hebrew months are either 29 or 30 days long, in a way that self-compensates for the lost time over the years.

To ensure the holiday of Passover always happens in the spring, the Hebrew calendar has the concept of leap years to synchronize between the lunar and solar calendar systems. While there is still a very small drift, this is achieved, and the Hebrew and Gregorian calendars meet every 19 years.

Modern use and computerized systems

In Jewish tradition, the Hebrew calendar is used to tell when holidays and memorial days should be honored. Passover, for example, has the fixed Hebrew date of 15th of Nissan. The Gregorian date for this Hebrew date changes every year, and therefore the Hebrew calendar is used to tell when a holiday should be celebrated.

Since the Hebrew day starts at sundown, the Gregorian and Hebrew dates only partially overlap. As a result, Shabbat and holidays for example begin when the sun sets the evening of the day before. So if 15th of Nissan is, for example, on March 20th, Passover will actually start on March 19th at around 7pm, at sundown.

The Hebrew calendar is still used by many people around the world in day-to-day life, mostly by Jews. Of those aware of the Hebrew calendar, there are 3 types of people:

  1. People who are aware of the Hebrew calendar, and maybe can sometimes tell the current Hebrew date, but are usually mostly unfamiliar with it. Most (or at least many) secular Jews fall into this category.
  2. Those who live their life close to the Hebrew calendar, but still use the Gregorian calendar for daily uses. Under this category are many secular Jews, but mostly religious Jews who are involved with the world through businesses etc.
  3. There are groups of ultra-orthodox Jews who only keep track of the Hebrew calendar; while they are usually familiar with the Gregorian calendar, they don't keep track of dates using it.

That covers tracking dates. As for telling time, nobody today uses temporal hours. Everybody, including those who never use the Gregorian calendar, has a standard clock. Temporal hours are used mainly to tell prayer times and other Jewish times of day - like when Shabbat starts or ends - and even then they are only used for the calculation, and the result is given in terms of "modern" hours.

Because of all this, computerized systems never really use the Hebrew calendar. Even when they do, it's for very specific needs like telling today's Hebrew date, mostly for display, and then they can calculate the values ad-hoc using simple formulas and tables which are widely available.
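
.NET, for example, already ships with a Hebrew calendar implementation, so the "display today's Hebrew date" use case is a handful of lines - a minimal sketch:

using System;
using System.Globalization;

class HebrewDateDemo
{
    static void Main()
    {
        // HebrewCalendar converts Gregorian DateTime values to Hebrew calendar parts
        var hebrew = new HebrewCalendar();
        DateTime now = DateTime.Now;

        int year = hebrew.GetYear(now);       // e.g. 5774
        int month = hebrew.GetMonth(now);     // 1-13 (13 months in a leap year)
        int day = hebrew.GetDayOfMonth(now);

        Console.WriteLine("Hebrew date: {0}.{1}.{2}", day, month, year);
    }
}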

Even if you wanted to serve customers who only observe the Hebrew calendar, you can get away with it very easily with a simple Hebrew-calendar-aware layer on top of Gregorian dates, which computerized systems already know how to work with (at the very least, let's not add extra complexity to what is already quite complex - and Jon can testify to that).

TL;DR

The Hebrew calendar is fascinating with lots of historic reasons for every bit of it, but since most of its constructs aren't being used today, most of your code and libraries should just ignore it. Really. Stay away from complexity where it isn't needed, especially when it can be added where necessary fairly easily with a dedicated software layer.

I'll be happy to dive in further on any of the items above if there's interest.


RavenDB’s Index Store, Indexing Process and Eventual Consistency

RavenDB, RavenDB in Action, Lucene, Lucene.NET Comments (0)

In RavenDB, indexes play a crucial part in answering queries. Without them, it is impossible to find data on anything other than the document ID, and therefore RavenDB becomes just a bloated key/value store. Indexes are the piece of the puzzle that allows rich queries for data to be satisfied efficiently. In this article, based on chapter 3 of RavenDB in Action, we will explain how indexes work in RavenDB.

RavenDB has a dedicated storage for documents called the Document Store. The Document Store has one important feature - it is very efficient at pulling documents out by their ID. However, this is also its only feature, and the only way it can find documents. It can only have one key for a document, and that key is always the document ID; documents cannot be retrieved based on any other criteria.

When you need to pull documents out of the Document Store based on some search criteria other than their ID, the Document Store itself becomes useless. To be able to retrieve documents using some other properties they have, you need to have indexes. Those indexes are stored separately from the documents in what we call the Index Store.

In this article, you will learn about indexes and the indexing process in RavenDB.

The RavenDB indexing process

Let’s assume for one moment all we have in our database is the Document Store with a couple of million documents, and now we have a user query we need to answer. The document store by itself can’t really help us, as the query doesn’t have the document IDs in it. What do we do now?

One option is to go through all the documents in the system and check them one by one to see if they match the query. This is going to work, sure, if the user who issued the query is kind enough to wait for a few hours in a large system. But no user is. In order to efficiently satisfy user queries, we need to have our data indexed. By using indexes, the software can perform searches much more efficiently and complete queries much faster.

Before we look at those indexes, let's consider for a moment when they are going to be built or updated with the new documents that come in. If we calculate them when the user issues the query, we again delay returning the results. This is going to be much less substantial than going over all the documents, but it is still a performance hit we impose on the user for every query they make.

Another, perhaps more sensible option is to update the indexes when the user writes new documents. This indeed makes more sense at first, but when you start to consider what it would take to update several complex indexes on every put, it becomes much less attractive. In real systems, this means writes would take quite a lot of time, as not only the document is being written, but all indexes have to be updated as well. There is also the question of transactions: what happens when a failure occurs while the indexes are being updated?

With RavenDB, a conscious design decision was made not to cause any wait due to indexing. There should be no wait at all - never when you ask for data, and never during other operations, like adding new documents to the store.

So when are indexes updated?

RavenDB has a background process that is handed new documents and document updates as they come in, right after they were stored in the document store, and it passes them in batches through all the indexes in the system. For write operations, the user gets an immediate confirmation of their transaction even before the indexing process started processing these updates—without waiting for indexing but certain the changes were recorded in the database. Queries do not wait for indexing; they just use the indexes that exist at the time the query is issued. This ensures both smooth operation on all fronts and that no documents are left behind. You can see this in figure 1.

Figure 1 RavenDB’s background indexing process does not affect response time for updates or queries.

It all sounds suspiciously good, doesn't it? Obviously, there is a catch. Since indexing is done in the background when enough data comes in, that process can take a while to complete. This means it may take a while for new documents to appear in query results. While RavenDB is highly optimized to minimize such cases, it can still happen. When this happens, we say the index results are stale. This is by design, and we discuss the implications of that at the end of this article.

What is an Index?

Consider the following list of books:

If I asked you about the price of the book written by J.K. Rowling, or to name all the books with more than 600 pages in them, how would you find the answer to that? Obviously going through the entire list is not too cumbersome when there are only 10 books in it, but it becomes a problem rather quickly as the list grows.

An index is just a way to help us answer such questions more quickly. It is all about making a list of all possible values grouped by their context and ordering it alphabetically. As a result, the list of books becomes the following lists of values, each value accompanied by the book number it was taken from.

Figure 2 A list of books (left) and lists of all possible searchable values, grouped by context

Since the values are grouped by context (the title, author name, and so on) and are sorted alphabetically, it is now easy to find a book by any of those values even if we had millions of them. You simply go to the appropriate list (say, author names) and look up the value. Once the value has been found in the list, the book number that is associated with it is returned and can be used to get the actual book if you need more information on it.

RavenDB uses Lucene.NET as its indexing mechanism. Lucene.NET is the .NET port of the popular open-source search engine library Lucene. Originally written in Java and first released in 2000, Lucene is the leading open-source search engine library. It is being used by big names like Twitter, LinkedIn, and others to make their content searchable and is constantly being improved.

A Lucene index

Since RavenDB indexes are in fact Lucene indexes, before we go any deeper into RavenDB indexes, we need to familiarize ourselves with some Lucene concepts. This will help us understand how things work under the hood and allow us to work better with RavenDB indexes.

In Lucene, the base entity that is being indexed is called a document. Every search yields a list of matching documents. In our example, we search for books, so each book would be a Lucene document. Just like books have a title, author name, page count, and so on, oftentimes we need to be able to search for documents using more than one piece of data taken from each. To this end, every document in Lucene has the notion of fields, which are just a logical separation between different parts of the document we indexed. Each field in a Lucene document can contain a different piece of information about the document that we can later use to search on.

Applying these concepts to our example, in Lucene, each book would be a document, and each book document would have the title, author name, price, and pages count fields. Lucene creates an index with several lists of values, one for each field, just as shown in figure 3. To complete the picture, each value that is put in the index in one of those lists (for example, “Dan Brown” from 2 books into the author names field) is called a term.

Figure 3 The structure of a typical Lucene index

Searches are made with terms and field names to find matching documents, where a document is considered a match if it has the specified terms in the searched fields, as in the following pseudo-query: all book documents with the author field having a value of “Dan Brown.” Lucene allows for querying with multiple clauses on the same field or even on different fields, so queries like “all books with author Dan Brown or J.K. Rowling, and with price lower than 50 bucks” are fully supported.

An index in RavenDB is just a standard Lucene index. Every RavenDB document from the document store can be indexed by creating a Lucene document from it. A field in that Lucene document is then going to be a searchable part of the RavenDB document we are indexing. For example, the title of a blog post, the actual content, and the posting date, each will be a field.

Queries are made against one index, on one field or more, using one term or more per field. Next up, you’ll see how exactly that falls into place.

The process in which Lucene documents are created from documents stored in RavenDB - from raw structured JSON to a flat structure of documents and fields - is referred to as map/reduce. This is a two-stage process where first the actual data is projected out of the document (aka being Mapped), and then optionally gets processed or transformed into something else (aka being Reduced). Starting in the next section we will go through RavenDB’s map/reduce process and work our way to properly grokking it.
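
To make the Map stage concrete, here is a minimal sketch of an index over our book example, assuming a Book class with Author, Price and PageCount properties (the names are illustrative):

public class Books_ByAuthorAndPrice : AbstractIndexCreationTask<Book>
{
    public Books_ByAuthorAndPrice()
    {
        // The Map stage projects the searchable fields out of every Book document
        Map = books => from book in books
                       select new
                       {
                           book.Author,
                           book.Price,
                           book.PageCount
                       };
    }
}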

Eventual consistency

Before we go and have a look at actual indexes, allow me to pause for a minute or two to discuss the implications of indexing asynchronously. As we explained at the beginning of this article, indexing in RavenDB happens in the background, so on a busy system new data may take a while to appear in query results. Databases behaving this way are said to be eventually consistent, meaning that at some point in time new data is guaranteed to appear in queries, but it isn't guaranteed to happen immediately.

At first glance, getting stale results for queries doesn’t seem all too attractive. Why would I want to work with a database that doesn’t always answer queries correctly?

This is mostly because we are used to databases that are implemented with the ACID properties (atomicity, consistency, isolation, and durability) in mind. In particular, relational databases are ACID and always guarantee consistent results. In our context, what that means is every query you send will always return the most up-to-date results, or, in other words, they are immediately consistent. If a document exists in the database when the query is issued, you are guaranteed it will be returned in all matching queries.

But is that really required? With RavenDB, eventual consistency is only the case when querying: the document store is immediately consistent, while queries are eventually consistent. Loading a document by ID is always immediately consistent, and fully ACID compliant.

Even when results are known to be 100 percent accurate and never stale, like they are in any SQL database, during the time it takes the data to get from the server to the user's screen, plus the time it takes the user to read and process the data and then act on it, the data could have changed on the server without the user's knowledge. When there is high network latency or caching involved, that's even more likely. And what if the user went to get coffee after asking for that page of data?

In the real world, when it comes down to business, most query results are stale or should be thought of as such.

Although the first instinct is to resist the idea, when it actually happens to us we don’t fight it and usually even ignore it. Take Amazon for example: having an item in your cart doesn’t ensure you can buy it. It can still run out of stock by the time you check out. It can even run out of stock after you check out and pay, in which case Amazon’s customer relations department will be happy to refund your purchase and even give you a free gift card for the inconvenience. Does that mean Amazon is irresponsible? No. Does that mean you were cheated? Definitely not. It is just about the only way they could efficiently track stock in such a huge system, and we as users almost never feel this happen.

Now, think about your own system and how up to date your data should really be on display. Could you accept query results showing data that is 100 percent accurate as of a few milliseconds ago? Probably so. What about half a second? One second? Five seconds? One minute?

If consistency is really important, you wouldn't accept even the smallest gap, and if you can accept some staleness, you could probably live with some more too. The more you think of it, the more you come to realize it makes sense to embrace it rather than fight it.

At the end of the day, no business will refuse a transaction even if it was made by email or fax and the stock data have changed. Every customer counts, and the worst that could happen is an apology. And that is what stale query results are all about.

As a matter of fact, a lot of work has been put into making sure the periods in which indexes are stale are as short as possible. Thanks to many optimizations, most stale indexes you will see are new indexes created on a live system with a lot of data in it, or indexes on a system where there are many indexes and a lot of new data keeps coming in consistently. In most installations, indexes will be non-stale most of the time, and you can safely ignore the fact they are still catching up when they are indeed stale. For when it is absolutely necessary to account for stale results, RavenDB provides a way to know when the results are stale and also to wait until a query returns non-stale results.
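
From the client side that looks roughly like the following sketch (assuming an open session and a Book class; the exact customization methods may vary between client versions):

RavenQueryStatistics stats;
var cheapBooks = session.Query<Book>()
    .Statistics(out stats)   // capture query statistics, including staleness
    .Customize(x => x.WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(5)))
    .Where(book => book.Price < 50)
    .ToList();

if (stats.IsStale)
{
    // The index was still catching up when this query was answered
}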

Summary

Having a scalable key/value store database is nice, but indexes are what really make RavenDB so special. Indexes make querying possible and efficient, and the more flexible indexes are, the more querying possibilities you have.

In this article, we laid the basics for understanding indexes in RavenDB and became familiar with RavenDB’s novel approach to indexing. We talked about the asynchronous indexing process and possibility of getting stale results, although in most real-world cases you will hardly even notice that.


Videos of my Oredev talks online

RavenDB, ElasticSearch, Talks, Lucene, NoSQL Comments (0)

Last week I spoke at Oredev, an awesome conference in cold Sweden. It was fun meeting all those great people, and the atmosphere was really great.

My first talk was an introductory level one on Elasticsearch - what it is, how it can be used and what it can do. I really like this shiny piece of technology, and it probably shows.

I usually do a lot of live coding on stage, but this time I decided to take it easy with that and used only slides in both talks. I'm still not sure how I feel about live coding; it is fun and challenging for sure, and for some people it adds a lot of value, but I think people can lose focus rather quickly and then find themselves bored. This is why I tried giving actual hands-on guidance without doing a live demo, using video within slides and no code at all. Hopefully I succeeded with that.

The second talk was a lightning talk on RavenDB, about 15 minutes long in total including a short Q&A. I was co-presenting with Peter Neubauer and Joel Jacobson of Neo4j and Riak respectively, and together we tried giving a short intro to NoSQL by talking about 3 completely different database products.

10 minutes can hardly do justice to any piece of technology, and the three of us felt we barely scratched the surface of what you can do with these technologies, but hopefully this overview talk helps in wrapping your head around what NoSQL is and around possible use cases for RavenDB, Neo4j and Riak.


Continuous deployment in .NET using Project Kudu and git

Project Kudu, .NET, Continuous Deployment Comments (1)

Project Kudu is a great open-source initiative for enabling git deployments of websites. That is, you can push your source code or binaries to git as a means of deployment. If it's sources you pushed, Kudu will compile them and run tests, and will only deploy if that succeeds. This is how Heroku, AppHarbor and Azure Websites handle deployments, only you can have it on your own web server. In fact, Project Kudu is what powers Azure Websites deployments.

As for why you'd want that, check out this article as an example. Or Google "Continuous Deployment".

I gave Project Kudu a spin a couple of months ago, but wasn't able to get it working properly then. I gave it another go tonight with great success, using the following steps:

  1. Clone the git repo, and make sure you have IIS 7, node.js 0.8.x, git and Mercurial installed (the latter is not mandatory I think, but it is being used in some places). You will also need them installed on the server - make sure node and git are available from the PATH on both your machine and the server.
  2. Open the solution in Visual Studio as Admin. I have VS 2012 and I'm running on Windows 7. I had to run nuget restore manually from the project root folder to get it to compile all the way (stupid Nuget 2.7).
  3. Here I got some post-compilation errors relating to node components. I solved them by executing Setup\KuduDevSetup.cmd and Kudu.Services.Web\updateNodeModules.cmd manually from PowerShell as admin. You need to have npm in your PATH.
  4. Rebuild solution - now it should complete with no errors.
  5. The Kudu project has 2 parts, and you will need both of them on your webserver. My favorite way of deploying websites is using the Publish tool of Visual Studio. The Kudu.Web project you can just Publish to a local folder and upload to the server, but the artifacts of Kudu.Services.Web you have to take from the bin folder as it needs the exe and some other things there which are not being copied on Publish. I actually published it and then copied the bin folder over (compiled as Release).
  6. Follow the steps here and here to complete the installation and create a new application (most of which you've already done if you got this far).
  7. Creating a new application from Kudu's web interface will create 2 IIS websites, one for the actual website and the other for a service that also functions as a git repository you can push to. The two websites are on random ports. You will need to open TCP access to them in your Windows Firewall.
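
Once that's in place, day-to-day deployment boils down to a git push to the service website's repository. A sketch with a made-up server name and port (use whatever Kudu generated for your site):

git remote add kudu http://my-server:52711/mysite.git
git add -A
git commit -m "Deploy latest build"
git push kudu master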

Overall the experience is great. This is what a deployment process looks like, pushing binaries to a remote git repository on my web server:

Successful deployment with Project Kudu

Project Kudu is really great, you should check it out!

Pain points

Probably because Project Kudu is an Azure Websites thing, it is being built with only that use case in mind. There are some places where it shows, and I really hope the dev team will take some time making it a good fit for general use.

Security

The web interface for managing applications deployed with Kudu, its background service that has some sort of a monitoring UI, and the backing service for every website deployed with Kudu - all are entirely exposed to the web. It would greatly help if they could allow for authentication.

There may be a way to use IIS to control access to those websites (HTTP auth, or using Windows credentials, which I'd rather avoid), but I haven't looked into it yet. It's just too much setup to take care of, and it would be nice to have that supported OOTB.

Ports management

The 2 main websites (Kudu.Web and Kudu.Services.Web) are a one-time setup, and are easy to bind under kudu.mydomain.com or something similar using the standard 80/81 ports, so no extra configuration is required.

However, every new website you create involves creating 2 new websites in IIS - one is kudu_mysite and the other kudu_mysite_service. Both are created on random ports, which means you have to add rules to your firewall. Annoying.

The way I work now is this. After creating a new website using Kudu I just change the bindings of kudu_mysite to use the standard 80 port and the domain name or subdomain I have for it. The service website I push to using git and then stop, at least until I have enabled authentication for it. But that still means I have to open that port for the service on my firewall.

It would be great if Kudu would have one service for managing all websites, so there is only one port to be opened. Also, when creating a new website, allow me to set the bindings myself, in case I do want to use port 80 with some custom domain configurations.

There may be a better solution for both issues, using a proxy service or something - I'd love to hear about it if you got it working.

Automated installation process

Did someone say WebPI?


Building a distributed search engine - Refactoring story part 2

ElasticSearch, Lucene, IR, Buzzilla Comments (0)

In my last post in the series I gave an introduction to search engines. I showed the concept of an Inverted Index and how it helps with performing searches on large amounts of texts quickly and efficiently. In this post I'll be discussing the issue of scaling out a search engine.

We chose Apache Lucene for implementing our products, which are very heavy on search. Lucene is a superb open-source search engine library, with very active user and developer communities. This choice was an obvious one for us, as the system we set out to replace was already using Lucene, and we already had vast experience with it dating years back.

Once you grow big enough, like we did, you will get to a point where one index on one server cannot really accommodate all your data. If you are read-heavy (and most applications are heavier on reads than they are on writes) you will also notice there's a limit to the number of requests you are able to respond to in a timely manner, even when you don't have that much data. At some point you'd come to realise you need your search system to work with more than one server - and then you will wonder how to achieve that.

But how do you do that? Let's first solve the problem of replicating an index, which is somewhat easier. We will see this also helps us with defining a better strategy for spreading one large index across multiple servers.

Replicated Lucene index

Having a replica of a search-engine index effectively means cloning it to other servers, and propagating changes made to one copy to the other copies of it. The thing you clone is the inverted index; that's what a search engine essentially is. But how do you propagate changes?

The immediate solution that comes to mind is replicating the actual files on disk. Lucene works in such a way that changes are made incrementally on disk, using the notion of segments. Every set of changes committed to the index creates a new segment, which is saved on disk in files representing that segment alone. Previous segments are never touched once written; new segments are only added as new files on disk. The index as a whole is just the collection of all segments created over time, and the "generation" of the index is a number that increases every time new segments are created.

So a possible thing to do is to replicate new segments once they have been written to disk. This will achieve what we needed quite nicely, but will have one serious drawback - changes will take time to propagate as segments may not be written frequently. It will also mean high network and IO usage with potentially large peaks (especially when a large number of replicas are involved), and failures which are bound to happen will be costly to recover from.

Replicating segment files would also mean messing with Lucene's merge policies to make sure segments aren't merged and deleted too soon, so we can be sure they have been replicated. This can become a very tough manoeuvre, as having many segments can result in reduced speed (especially when there are deletes involved), and Lucene's defaults are usually quite good.

Real-time search on replicas as well (for consistency) and good overall performance are very important to us, and this is why we concluded the best way to propagate changes is to propagate the actual change operations. Instead of having one node doing the indexing and replicating the index files, we can have multiple independent nodes accepting the same change commands, each performing the indexing itself.

Yes, this would entail some additional work to make sure all changes are propagated and committed correctly, but that is already a solved problem.

One important detail I haven't mentioned yet is the requirement of having one master server replicating to all slaves. It is important to have only one node accepting writes from clients at any given moment, to avoid write-conflict scenarios.

Sharding a Lucene index

For many reasons, and mostly for simplicity, we wanted our data to be indexed continuously into one large index. Since we have millions of new documents added to our system daily, it was clear from day one we had to span our search engine across multiple servers. So we need one index to span several servers - that's what we call a sharded index.

While a priori there can be multiple ways of sharding an inverted index across multiple servers - for example storing only several fields or ranges of terms on every shard - it became clear very quickly that you do want one shard to contain the entire inverted data of a single document. Without this, searches will be very expensive, as they would have to cross server boundaries for the majority of searches made.

Building on the same logic we already have for replicating an index, we can solve this problem as well. One index can be spread across many servers by simply considering each node as holding an independent index, and routing indexing commands to nodes by using a sharding function. The sharding function determines what node is responsible for each document that comes in, so only one node receives the incoming indexing request and processes it, thus spreading the data across multiple servers. This can be done by a hashing function on the document ID, or based on some field of the document.
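
A minimal sketch of what such a hash-based sharding function could look like (an illustration only, not the routing function Elasticsearch actually uses):

// Route a document to a shard based on a stable hash of its ID.
// FNV-1a is used because String.GetHashCode is not guaranteed to be
// stable across processes or runtime versions.
static int ShardFor(string documentId, int numberOfShards)
{
    unchecked
    {
        uint hash = 2166136261;
        foreach (char c in documentId)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return (int)(hash % (uint)numberOfShards);
    }
}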

A good sharding function will spread the data evenly across all nodes, and will possibly try to ensure that as many search operations as possible can be answered by the smallest number of shards possible.

For example, if we index Facebook and Twitter data, and we know most searches look for only Facebook or only Twitter content, then we can optimise our sharding function by asking it to shard based on the post ID and the origin, so all tweets are on one portion of our cluster and all Facebook posts are on the rest of the nodes. This way, most search operations will have to access only half of the nodes.

It is important to note replication is orthogonal to sharding - once you achieved sharding you can replicate each shard individually.

As you may have figured out by now, indexing data on multiple servers means search queries will have to ask multiple servers for results. While Lucene is great at providing results ordered by relevance, this doesn't work well in a distributed scenario as scores given by one node can't really be compared to scores given by another. Merging two sets of results from 2 nodes and sorting by score will most of the time produce bad results. Or at least, we could do better. So this is a problem we had to take into account.

Elasticsearch

Fortunately for us, we didn't have to implement all that from scratch. Elasticsearch implements this and more in a highly optimised and flexible way, and we decided to use it as our search server technology, wrapping Lucene to provide us with excellent scale-out capabilities.

With Elasticsearch it is very easy to create a large cluster of nodes all running one large index, or even multiple indexes. Replication is obviously supported, and a lot of the nitty gritty details of building a distributed system and search engine are already taken care of. And that includes proper scoring of results coming from multiple nodes.

In the next post I'll describe the main reasons which led us to choose Elasticsearch as our search-server technology, including some design decisions we had to make for this coupling to make sense.


Video and code from my DevDay talk on Lucene & Elasticsearch

Lucene, Lucene.NET, ElasticSearch, IR, Talks Comments (0)

A few weeks ago I gave a talk in DevDay 2013 in Krakow, Poland, in what could be described as a totally awesome conference. I wish it lasted more than one day... Way to go Raf, Michael and team!

My talk was a quick introduction to full-text search with Lucene and ElasticSearch. Rather than focusing on technicalities, the talk teaches some basics and then moves on to real-world usages, which I find are quite neat and useful in many scenarios. It ended with a demo of a Wikipedia search app implementing some of those features, which I wasn't able to show in full due to lack of time, but it is available on github at https://github.com/synhershko/WikipediaSearchDemo.

The video of the talk is now live on YouTube:

The "official" talk description:

Apache Lucene is a high-performance, full-featured text search engine library. But it can do an awful lot more than just enable searching on documents or DB records.

In this hands-on session we will get to know Lucene and how to use it, and all the great features it provides for applications requiring full-text search capabilities. After understanding how it works under the hood we will learn how to make use of it to do other cool stuff that would amaze our users, get the most out of your applications, and maximize profit.

Delivered by Apache Lucene.NET committer Itamar Syn-Hershko, this talk is aimed at providing you tools for starting to work with Lucene, and also to show you how it can be used to provide better UI and UX and to leverage data in use by your application to provide better experience to the user.


Google, cluster management and the red pill

Software design Comments (0)

I was referred to the following video by a friend, and it completely blew my mind. It got my attention right at the beginning, when the speaker, an ex-HP researcher, said joining Google was like taking the red pill:

To me, there are two types of infrastructures - those where servers are managed manually, and those where it is done automatically using provisioning tools like Foreman and Puppet. This talk is about a third type: about growing computers in a farm as if they were plants, about thinking on so many levels of abstraction above what's being discussed today. And this talk is over 2 years old already.

And it gets even better. Make sure to read the follow-up post on Wired from March this year, where they talk more on the subject and mention that Twitter is building its system to mimic Google's Borg and Omega. It is called Mesos - cluster management software that will do its best to make the most out of your servers, and it's completely open-source.

The clusters I'm working with are so tiny compared to the ones discussed, that I find it highly unlikely I'll be even experimenting with such a tool any time soon. Yet, I find it highly impressive and inspiring learning about those efforts, and it actually provides a lot of food for thought about any distributed system design.

