Multi-lingual search with Lucene and Elasticsearch

Last night I gave a talk at SkillsMatter London on multi-lingual search with Lucene and Elasticsearch. The talk covered the challenges of indexing texts in different languages: tokenization, term normalization and stemming. I started by demonstrating the challenges on individual languages, and ended by discussing mixing texts in several languages in one index - whether it is at all possible, and how to approach that.

We had some issues with the recording so I had to repeat the first few slides (this is why I go very quickly in the first minutes...) and the audio quality could be better. Nevertheless, the talk presents real-world issues and offers what I believe to be good paths for solving them. Since this is quite a lot to write blog posts about, I think I will just leave it in its video form for now.

The video is available here: https://skillsmatter.com/skillscasts/4968-approaches-to-multi-lingual-text-search-with-elasticsearch-and-lucen.


What is real-world search anyway?

Let’s face it - not all the data we handle is easy to query. In fact, most of it is actually pretty tough to work with. This is often because a lot of the data we process and handle is unstructured: logs, archived documents, user data, or text fields in our database that we know contain useful information, but we just don’t know how to get to it.

As developers, we tend to fight that. Our first reaction will always be to try and structure the unstructured. This is the challenge we like to rise to as professionals, and that is truly great. But sometimes it just makes sense to stop fighting reality and use a set of tools that is more suited for this task. In some cases this will save many resources and a lot of hair-pulling. In other cases (I don’t know whether they are better or worse) we don't even realize we have a gold mine of information at our fingertips, so we never even try doing something with it.

During the past 10 years or so the field of information retrieval - text retrieval and search engines in particular - has evolved greatly. Search engines have been built and scaled, and within a few years did the impossible. Nobody thought we could handle that scale of data, or make sense of it all. Would you have invested in Google before 2000?

Search engines do not exist only on 3rd party websites like Google or Bing. Quite a few search engine libraries that are meant to be used in both open- and closed-source projects were released under various licenses. The most notable of all is probably Apache Lucene, a search engine library released as open-source for the first time in 1999. Since then, Lucene has made giant steps and is developed actively to this day, reaching new milestones every few months by releasing new features or major improvements.

But Lucene is just a search library. To scale it out so it can handle large amounts of data you need to have inter-server communications, and some logic to split your data between them. For that Lucene offers Solr, a search server that acts as a wrapper around Lucene indexes. Another option, created by other Lucene project members, is Elasticsearch. Both Solr and Elasticsearch are released under the same open-source license as Lucene’s, with my personal favorite being Elasticsearch, due to its novel approach for scaling out indexes and super-easy to use API (everything is doable using REST calls over HTTP).

Using these technologies (Lucene, Solr or Elasticsearch) it is very easy to add full-text search capabilities to any type of application - running on the desktop, web, cloud or mobile. There are a few things to figure out - like how to feed the data from your data sources, how to make sure the search engine has the latest version of your data at all times, and how to process it correctly so common searches are effective and perform well. Every project has different best practices for those challenges; they are hardly ever the same. But once you have figured those out, browsing your data is suddenly a breeze.

As it turns out, full-text search capabilities are only the tip of the iceberg. As people started using search technologies to perform full-text searches, new capabilities came about. Leveraging the data and insights search engines can provide on our data, we can do a lot of interesting stuff. For example, we can detect typos and offer corrections; we can find similar documents so we can remove or merge them (also known as record linkage); or we can use this to offer customers at our shop similar products they can add to their cart.

Other, more advanced, modern usages of search technologies that are worth noting include geo-spatial search (using shapes like points, circles or polygons representing locations on Earth to find data tagged with more shapes; for example finding the nearest restaurant to the user’s location), image search by color scheme, and entity extraction and other Natural-Language-Processing methods to further analyze texts and improve insights on them.

There is a great set of tools at our disposal when using search technologies, far more than we can even list in this blog post. Nowadays this is not only about full-text search anymore (although that is obviously still supported and is better than ever before!). Being familiar with those tools and with best practices for using them, we can start thinking about how to use them in our project - whether in an automated process or exposed via some UI to our users to give them (and us!) added value.

Modern search engines are built to be scalable and performant. With correct planning you can handle large amounts of data easily (even BigData, if you don’t mind the buzzword), as well as many concurrent users issuing many requests, by spreading your data across multiple servers. Because they are so performant, they can offer real-time search capabilities even on large sets of data. The most impressive use of this is most likely Elasticsearch’s Kibana dashboard to plot graphs in real-time out of an intensive stream of raw data, for example Apache HTTP server logs.

The field of search engines and information retrieval is moving ahead very fast. There are still many challenges to tackle, but there’s already a lot to gain from this quickly evolving set of technologies. Just a quick look at recent history will show you companies that were sold for billions not because they had a great product, but because they were able to collect a lot of data and extract insights out of it.

I’m happy to have been given the opportunity to explore and teach this topic in depth in my 2-day course - available in both London and New-York. In the course we will learn mostly about Elasticsearch, and by learning some of the theory behind it and then digging into its core we will understand how to use it correctly, how to bend it to our needs, and what set of tools is at our disposal. The course is designed to give you the tools to get started with search technologies right away, and to make sure you are up to date with the latest and greatest.


Debugging a failing unit-test which interacts with RavenDB

This week I'm working with Particular Software of NServiceBus fame. There's a shiny new platform for designing, managing and debugging distributed systems coming up and while doing some work on it I hit the following failing test. The documentStore is an EmbeddableDocumentStore which was properly initialized, and the index used was registered properly. Nevertheless, the first Assert is failing. Can you figure out what's wrong?

    [Test]
    public void Order_by_critical_time()
    {
        session.Store(new ProcessedMessage
        {
            Id = "1",
            MessageMetadata = new Dictionary<string, object> { { "CriticalTime", TimeSpan.FromSeconds(10).Milliseconds} }
        });

        session.Store(new ProcessedMessage
        {
            Id = "2",
            MessageMetadata = new Dictionary<string, object> { { "CriticalTime", TimeSpan.FromSeconds(20).Milliseconds} }
        });
        session.SaveChanges();

        var firstByCriticalTime = session.Query<MessagesViewIndex.SortAndFilterOptions, MessagesViewIndex>()
              .OrderBy(x => x.CriticalTime)
              .Customize(x => x.WaitForNonStaleResults())
              .AsProjection<ProcessedMessage>()
              .First();
        Assert.AreEqual("1", firstByCriticalTime.Id);

        var firstByCriticalTimeDesc = session.Query<MessagesViewIndex.SortAndFilterOptions, MessagesViewIndex>()
              .OrderByDescending(x => x.CriticalTime)
              .AsProjection<ProcessedMessage>()
              .First();
        Assert.AreEqual("2", firstByCriticalTimeDesc.Id);
    }

I wasn't able to figure out what was going wrong. Results were just coming back out of order, and I couldn't tell if something was wrong with the index, or if the documents inserted by the test were faulty.

Luckily, there is an easy way out - and it's exactly what I used. You can suspend a unit-test while running it in Debug by calling WaitForUserToContinueTheTest(documentStore); - this method as well as some other helper methods are available in this nuget package, or in the source form here. Just add it to your unit testing classes and call it when needed.

This method is super useful - it will suspend the test and open up the management studio of the embedded database instance used by the test, so you can go through it and look at the documents it has and the index definitions, execute queries against it and so on.

So I added those 2 lines to my test, just before the first query:

    WaitForIndexing(documentStore);
    WaitForUserToContinueTheTest(documentStore);

And executed the test again. Now the Management Studio opened up, and the culprit was immediately visible by having a quick look at the documents in the document store - the CriticalTime property was 0 on both documents, not what I was expecting. I've spent too much time in Java land apparently, and forgot that Milliseconds returns only the milliseconds component of a TimeSpan (which is 0 for a whole number of seconds); I had to use the Ticks property of the TimeSpan class to get the actual time value as a long.
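To make the difference concrete, here is a small standalone sketch (not part of the test above; the class name is made up for illustration) comparing the three TimeSpan properties for a 10-second value:

```csharp
using System;

public static class TimeSpanComponents
{
    public static void Main()
    {
        TimeSpan tenSeconds = TimeSpan.FromSeconds(10);

        // Milliseconds is only the millisecond component (0-999) of the value.
        Console.WriteLine(tenSeconds.Milliseconds);       // 0

        // TotalMilliseconds is the whole duration expressed in milliseconds.
        Console.WriteLine(tenSeconds.TotalMilliseconds);  // 10000

        // Ticks is the whole duration in 100-nanosecond units, as a long.
        Console.WriteLine(tenSeconds.Ticks);              // 100000000
    }
}
```

Both TotalMilliseconds and Ticks would have given the two documents different, correctly ordered values; Milliseconds gives 0 for both.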

Making this change made the test green again, and off I went to working on the next feature. I sure have saved myself quite a lot of time and hair.


RavenDB's hidden features

In this article, which got published some time ago in a developers' magazine in Norway, we are going to explore two of the lesser-known features of RavenDB, and show how they can greatly improve your application and your business.

Suggesting alternate spellings for search terms

Imagine the following scenario: You go on a website to look for some info on an acquaintance. You search once, twice, but can’t find it, so you give up. A few days later you discover Danny is actually spelled Danni, but it is already too late. The website you gave up on lost you – a potential user, or in some cases a paying customer.

Sounds familiar? This happens time and again for too many websites and applications. Trying to guess what the user was actually looking for, and providing them with meaningful alternatives, is considered overkill by many developers. As a result, they don’t realize the full potential of their application and lose both customers and business.

A very popular approach by search engines is to try and guess term suggestions when they detect search results may not be satisfactory. You are probably familiar with Google’s “did you mean?” when you make an accidental typo. Well, it’s not that Google is trying to mock you or anything; it’s just that it was able to find a higher scoring term with a certain edit distance from the term you actually typed.
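For readers who haven't met the term, edit distance can be sketched in a few lines. The code below is a plain textbook Levenshtein distance, shown only to illustrate the idea; it is not how Google, Lucene or RavenDB implement suggestions, and real suggesters may well use other distance measures. The class and method names are made up for this example.

```csharp
using System;

public static class EditDistance
{
    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i characters of a into an empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }

            // The row just computed becomes the "previous" row of the next iteration.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Levenshtein("Danny", "Danni")); // 1
        Console.WriteLine(Levenshtein("brwon", "brown")); // 2
    }
}
```

A suggester can then look for indexed terms within a small distance of the typed term and offer the best-scoring ones.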

RavenDB provides a very easy and intuitive way of providing alternative terms for queries that returned few or no results, just like Google’s “Did You Mean?”. When a full-text query returns zero results, or when you have other indications of bad results being returned for a query, you can ask RavenDB to provide suggestions for the term or terms used in that query:

var query = session.Query<Book>("BooksIndex").Search(x => x.Author, "brwon");
var results = query.ToList();

if (results.Count == 0)
{
    var suggestions = query.Suggest();
    foreach (string suggestion in suggestions.Suggestions)
    {
        Console.WriteLine(suggestion);
    }
}

In the code above, we created a query and issued it to get results, and we keep it aside so we can reuse it for suggestions if necessary. If no results are found for “brwon” in our data set, we ask RavenDB for suggestions. Suggestions will return a list of terms. Each of them can be used to notify the user, or even to re-issue the query; it all depends on what you see fit for your application.

Many big online stores - Amazon for instance - maximize their profits by showing users “related products” in product pages and during check-out. Similarly, websites like CNN can get users to spend more time on their website by showing links to related content at the bottom of an article. Unlike what you might think, this doesn’t require a full editorial staff to go through all your content. It can be done fairly easily by comparing data in one content entity to the rest of the content, and showing the highest ranking ones to the user. So the question remains – how can you do this easily and efficiently?

RavenDB exposes Lucene’s MoreLikeThis functionality, which creates a full-text search query from a document and uses that to find similar documents. The result is documents that are similar to the original document, based on the terms in the query document and their frequency. To do this you need to approach the RavenDB server with a document ID, and tell it which index to use for the comparison:

var list = session.Advanced.MoreLikeThis<Book>("BooksIndex",
    new MoreLikeThisQuery
    {
        DocumentId = "books/2",
        Fields = new[] {"Title", "Author", "Description"},
        MinimumWordLength = 2,
    }
);

The result of calling this method is an array of book objects RavenDB deems similar to the book used as a query. To get the most out of this feature, you want the lookup to be performed on text properties like title and description, and make sure they were indexed as Analyzed. Doing so will utilize RavenDB’s full-text search capability behind the scenes, and will maximize the relevance of the documents returned.

It is also possible to perform fine-tuning and adjustments, for example to hand-pick which properties to use for the actual comparison, and to set the minimum or maximum word length. All of those are set as properties on the MoreLikeThisQuery object that is passed to the method, as shown above.


Modelling hierarchical data with RavenDB

A very common requirement with any database engine is to be able to store and query hierarchical data. This usually comes up in the context of categories for products in an e-commerce website, in a site-structure context (this page needs to be a parent of that page) or when trying to model a company hierarchy tree (for deciding on permissions, for example).

While RavenDB applications need to solve the same kind of problems, a good solution often looks very different from what you'd expect.

The immediate solution seems quite obvious, especially if you've solved a similar problem before with other types of databases (relational, for example). A Category document (shown here as a POCO) will contain its parent ID, and using some advanced indexing techniques one can get standard hierarchical queries working:

public class Category
{
    public string Id { get; set; }
    public string ParentId { get; set; }
    public string Name { get; set; }
}

However, there's a significant indexing latency and performance hit involved in using this technique. All in all, this is usually a method unfit for document databases, as it employs relational thinking and takes no advantage whatsoever of the features provided by the document database.

Instead, try thinking of a hierarchy tree as an entity of its own.

With a document database, hierarchies are usually best modeled in one document that defines the hierarchy. In our scenario of categories that would mean defining the categories tree, where the categories themselves can be represented by standalone documents (and thus hold Name, Description etc, and allow other collections to reference them), or not if you don't need them to exist separately (less common, and will somewhat bloat the tree document).

Modeled from code, a Category document would look something like this:

public class Category
{
    public string Id { get; set; }
    public string Name { get; set; }
    // other meta-data that you want to store per category, like image etc
}

And the hierarchy tree document can be serialized from a class like the following, where this class can have methods for making nodes in it easily accessible:

public class CategoriesHierarchyTree
{
    public class Node
    {
        public string CategoryId { get; set; }
        public List<Node> Children { get; set; }
    }

    public List<Node> RootCategories { get; set; }

    // various methods for looking up and updating tree structure
}

This approach of hierarchy-tree has several important advantages:

  1. One transactional scope - when the tree changes, it changes in one transaction, always. You cannot be affected by multiple concurrent changes to the tree, since you can leverage optimistic concurrency when editing this one document. With the parent-ID approach shown earlier it is much harder to guarantee the completeness and correctness of the hierarchy tree over time. If you think of a hierarchy as a tree, it actually makes a lot of sense to have each change lock the entire tree until it completes. The hierarchy tree is one entity.
  2. Caching - the entire hierarchy can be quickly and efficiently cached, even using aggressive caching which will minimize the times the server is accessed with queries on hierarchy.
  3. All operations are done entirely in-memory - since it's one document, aka object, all queries on the hierarchy (who's the parent of a category, the list of its children etc) are made entirely in-memory and effectively cost close to nothing to perform. Using an index with Recurse() to answer such queries is an order of magnitude costlier (in both network and computation costs). If performance is your biggest concern - this is definitely a winner.
  4. Multiple parents per category, no denormalization - if a category document is saved outside the hierarchy tree, like demonstrated above, you can effectively put a category under multiple parents without the need to denormalize. All category data is in one place, in a document outside of the tree, and the tree only holds a reference to the category.
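The in-memory lookups mentioned in point 3 could look something like the sketch below. It repeats the tree class from above and adds three helpers (Find, ChildrenOf, ParentOf); those helper names and the sample category IDs are made up for illustration and are not part of RavenDB. Loading and saving the tree document through a session is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class CategoriesHierarchyTree
{
    public class Node
    {
        public string CategoryId { get; set; }
        public List<Node> Children { get; set; } = new List<Node>();
    }

    public List<Node> RootCategories { get; set; } = new List<Node>();

    // Hypothetical helper: depth-first search for the first node holding categoryId.
    public Node Find(string categoryId)
    {
        return Find(RootCategories, categoryId);
    }

    private static Node Find(IEnumerable<Node> nodes, string categoryId)
    {
        foreach (var node in nodes)
        {
            if (node.CategoryId == categoryId) return node;
            var match = Find(node.Children, categoryId);
            if (match != null) return match;
        }
        return null;
    }

    // Hypothetical helper: IDs of the direct children of a category (empty if unknown).
    public List<string> ChildrenOf(string categoryId)
    {
        var node = Find(categoryId);
        return node == null
            ? new List<string>()
            : node.Children.Select(c => c.CategoryId).ToList();
    }

    // Hypothetical helper: ID of the first parent found, or null for root
    // categories and unknown IDs. A category placed under several parents
    // would need a variant that collects all of them.
    public string ParentOf(string categoryId)
    {
        return ParentOf(RootCategories, null, categoryId);
    }

    private static string ParentOf(IEnumerable<Node> nodes, string parentId, string categoryId)
    {
        foreach (var node in nodes)
        {
            if (node.CategoryId == categoryId) return parentId;
            var found = ParentOf(node.Children, node.CategoryId, categoryId);
            if (found != null) return found;
        }
        return null;
    }
}

public static class HierarchyDemo
{
    public static void Main()
    {
        var tree = new CategoriesHierarchyTree();
        var electronics = new CategoriesHierarchyTree.Node { CategoryId = "categories/1" };
        electronics.Children.Add(new CategoriesHierarchyTree.Node { CategoryId = "categories/2" });
        tree.RootCategories.Add(electronics);

        Console.WriteLine(tree.ParentOf("categories/2"));                      // categories/1
        Console.WriteLine(string.Join(", ", tree.ChildrenOf("categories/1"))); // categories/2
    }
}
```

None of these calls touch the server once the tree document is loaded, which is where the performance gain comes from.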

I highly recommend going with this approach. It is a bit of a shift from the relational mindset, but it's so worth it, even when the tree grows big.


Best of my 2013 in pictures

I had a blast in 2013 - been to cool new places, met many amazing new people and learnt a ton of new stuff. Always with me was my Canon 550D, and I was able to catch quite a few nice pictures with it.

Some of the best pictures I took, of the best moments in my year 2013 are shown below. No filter, no photoshop. May 2014 be an even better year!


Upcoming speaking engagements - 2014

2014 hasn't even started yet, and I already have a full schedule for the next couple of months. Here is a list of talks and courses I'm doing publicly, on RavenDB and search technologies (Lucene and Elasticsearch). I will update this list as new events come in.

  1. December 30th, Tel-Aviv: the Inaugural Elasticsearch Tel Aviv Meetup, 7:30PM in Azrieli Towers where I'll present the experience we had with Elasticsearch in a recent project. How we used various features, gotchas and so on.
  2. January 16th, Jerusalem: presenting Lucene and Elasticsearch in SIGTRS - an Israeli user-group whose focus is information retrieval.
  3. February 19th, Ra'anana, Israel: A 3-hour Elasticsearch crash course for Israel Dot NET Developers User Group, in Microsoft Ra'anana. Details and registration here.
  4. March 5/6th, London: A talk in SkillsMatter London, first week of March, on multi-lingual search with Lucene and Elasticsearch. It hasn't been published yet.
  5. May 21-22, Stockholm, Sweden - I will give a 2-hour RavenDB session in DevSum, building a StackOverflow clone live on stage.

Additionally, I will deliver my real-world search course in London and New-York several times this year. The next dates are for London 23-24/4, and in New-York 5-6/5.


Hebrew search done right

It's been a few years since I started experimenting with Hebrew search, and during that time I was able to create HebMorph and see it get adopted in many places for various uses. I was involved in several projects - open-source and commercial - that needed Hebrew search capabilities, and learned a lot about what it means to perform Hebrew searches properly in various contexts.

Perhaps the most interesting usage of HebMorph is done by Buzzilla, the company I was recently employed by.

The queries in Buzzilla are written very carefully because to them every discussion counts. After issuing the query, statistical analysis is made on the number of discussions found, as well as some further analysis on a random set of discussions from the result set. Having irrelevant discussions in the results, or missing highly relevant discussions, can have severe effects. Therefore, queries are usually lengthy and need to be very precise. While stemming (rather, lemmatization) can and should be used, it needs to be applied selectively so it doesn't bloat search results with irrelevant discussions (that is, precision stays high). Doing this is challenging when you also need to maintain recall (the share of relevant discussions that the query actually finds).

This short guidance video (in Hebrew) demonstrates a very nice method in which we achieve exactly that. On one hand lemmatization is used and thus high recall is achieved, because we are able to overcome most of Hebrew's challenges when it comes to full-text search. On the other hand precision is kept high, because we can fine-tune the query and use it like a laser beam to find exactly the data that we are interested in, and only it.

With good UX and via user interaction we were able to come up with a solution to ambiguity problems as well as the Hebrew prefixes problem in search. This solution reflects very clearly what is being done by the search engine, and as a result users can refine their query very easily. There are also custom dictionaries involved and various other optimizations in place that aren't shown in this video.

The search engine shown in the video is using Elasticsearch and a custom Hebrew analyzer plugin that is based on HebMorph. It took us a while to get this right, but once we did, the response from our users was highly positive. This is really Hebrew search done right. The same techniques can be used for 2-3 word Google-like searches by employing a slightly different UX approach; I will have something to show for this soon.

I'll be blogging about the features shown above from the technical perspective in the near future, especially about selective-stemming and multi-lingual content.


I'm looking for my next challenge

A couple of days ago I gave notice, and as soon as I’ve finished wrapping up everything in my current position I’ll be free again and looking for new challenges and experiences.

For the time being I’m NOT looking for any permanent position. First I want to dedicate my time to some projects - my own or OSS - and do some intensive learning. It’s been a while since I did both. When I feel I made the most out of this period I’ll start weighing my options.

Because I’m basically unemployed, and because I really enjoy dropping in to help other projects and companies, I’ll be allowing some time to do consulting and software development contracts.

Who am I?

Author of RavenDB in Action published by Manning, I was a core developer of RavenDB for a while and a long time Lucene expert. I'm an Apache Lucene.NET committer, CLucene veteran, and Elasticsearch savant.

I've been speaking in local user groups and international conferences for a while now, and doing a lot of consultancy and code-for-hire work world-wide. I've also co-authored and delivered the official RavenDB workshop several times, and I now deliver a real-world search course in London a few times a year.

I do search engines, databases and software architecture, and I enjoy that very much. I'm a fast learner, and I'm never afraid of going into dark code corners or getting out of my comfort zone.

I’m available for short-term contracts

Most of my time in the next couple of months I’m going to dedicate to learning and working on personal projects, but I have availability for consultancy and short-term contracts world-wide (ranging from a few days to several weeks).

I’m happy to hop on any interesting project, whether it is something well within my comfort zone (Lucene, web applications, RavenDB, Elasticsearch, architecture) or, even better, something that isn’t. I’m a curious developer and a fast learner, and will definitely enjoy learning new stuff. Remote work or on-site are both possible.

I already secured a few gigs, but still have availability starting February. If you’re interested let’s talk. Ping me by email (itamar at this domain) or Skype (itamarsyn). If I didn’t respond to your request, I haven’t got it!

My plans for the near future

I have a few projects of my own I want to dedicate some time to, and it's been a while since I did OSS work so I plan to do this as well. Lucene.NET can definitely use some love towards a v4 release, as well as other projects I maintain like HebMorph and NAppUpdate. I’ve always enjoyed working on OSS projects, and this is my chance to have some fun again.

I also plan to do a whole lot of learning. Mastering functional programming and Machine Learning have long been on my list, and it's time to do this. I expect to be blogging quite a bit as I progress in my learning.

Permanent position

I will only start looking for a permanent position in a couple of months. Nevertheless, I’m interested in hearing about opportunities.

I don’t know yet what type of job I will be looking for, but I do know it needs to be technically challenging and solve an interesting problem. I do want to be doing things that are far from my comfort zone.


Why Elasticsearch? - Refactoring story part 3

As part of a system refactoring process, we replaced a legacy search system with Elasticsearch. Since search is a core component of our system, it took us more than half a year to move to the new system and make all the features work again, and it required us to be absolutely sure of Elasticsearch's competency.

In the previous posts I mentioned we wanted to keep using Lucene to build on top of existing knowledge and experience, but do this in scale reliably and without too much pain. Elasticsearch turned out to be a perfect fit for us, and over a year after the fact we are very happy with it.

I thought I'd do a write-up to summarize what we found in Elasticsearch that made it our search engine of choice, and point at some helpful resources. This is a very high level post that isn't intended to be all too technical; I'll be writing on our experience with some of the features in more detail in the future.

Easy to scale

We start with the obvious one. I explained the complexity of the problem previously, and Elasticsearch really tackles it very nicely.

One server can hold one or more parts of one or more indexes, and whenever new nodes are introduced to the cluster they are just being added to the party. Every such index, or part of it, is called a shard, and Elasticsearch shards can be moved around the cluster very easily.

The cluster itself is very easy to form. Just bring up multiple Elasticsearch nodes on the same network and tell them the name of the cluster, and you are pretty much done. Everything is done automatically - discovery and master election are all done behind the scenes for you.

The ability to manage large Lucene indexes across multiple servers and have some reliable, tested piece of code do the heavy lifting for you is definitely a winner.

There are multiple gotchas and possible pain-points though, namely some potential issues with unicast/multicast discovery, shard allocation algorithms and so on, but nothing that was a deal breaker for us.

Everything is one JSON call away

Managing Elasticsearch instances is done purely via a very handy REST API. Responses are always in JSON, which is both machine and human readable, and complex requests are sent as JSON as well. It can't get any easier.

A recent addition to Elasticsearch is the "cat API", which gives insights and stats for a cluster in an even more human readable format. You can read more about it here. To me this shows how important it is for Elasticsearch to be easy to maintain and understand as a core feature - and that's very important.

Everything can be controlled via REST - from creating indexes to changing the number of replicas per index, all can be done on the go using simple REST API calls. An entire cluster, no matter how big, can be easily managed, searched or written to all through the REST API.
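As an illustration of how small such a call is, here is a sketch that builds (but does not send) a request to change the number of replicas of an index through the index settings endpoint. The host, port and index name are placeholders, and the helper class is made up for this example; actually sending it would be a matter of handing the request to an HttpClient.

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class ReplicaSettings
{
    // Builds a PUT {baseUrl}/{index}/_settings request whose JSON body sets
    // index.number_of_replicas. Nothing is sent over the network here.
    public static HttpRequestMessage BuildUpdateReplicasRequest(string baseUrl, string index, int replicas)
    {
        string body = "{ \"index\": { \"number_of_replicas\": " + replicas + " } }";
        return new HttpRequestMessage(HttpMethod.Put, baseUrl + "/" + index + "/_settings")
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
    }

    public static void Main()
    {
        // "http://localhost:9200" and "my-index" are placeholder values.
        var request = BuildUpdateReplicasRequest("http://localhost:9200", "my-index", 2);
        Console.WriteLine(request.Method + " " + request.RequestUri);
        Console.WriteLine(request.Content.ReadAsStringAsync().Result);
    }
}
```

The same request can of course be issued with curl or any other HTTP client, which is the point: no special tooling is needed.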

The documentation is great and lives on github side by side with the code itself, but there's more. The entire spec for the REST API is available on github. This is great news, since you can build any client or tool on top of that and be able to conform to version changes quickly.

In fact, this is exactly what some of the core tools do. I recommend using the excellent Sense UI for experimenting and also for day-to-day work with Elasticsearch. It is available as static HTML and also as a Chrome plugin, and is backed by the aforementioned REST API spec.

The great REST API also helps with rapid development, as this tool shows. It really helps you focus on your business requirements and not the surroundings.

Unleashed power of Lucene under the hood

Lucene is an amazing search library. It offers state of the art tools and practices, and is rapidly moving forward. We chose Lucene mainly because we had a lot of experience with it, but if you're a new-comer you should really choose it because of what it can offer.

Since Lucene is a stable, proven technology that continuously gains more features and best practices, having Lucene as the underlying engine that powers Elasticsearch is, yet again, another big win.

Excellent Query DSL

Elasticsearch wraps Lucene and provides server abilities to it. I already covered the scaling-out abilities it provides, and the REST API for managing Lucene indexes, but there's more to it.

The REST API exposes a rich and capable query DSL that is nevertheless very easy to use. Every query is just a JSON object that can contain practically any type of query, or even several of them combined.

Using filtered queries, with some queries expressed as Lucene filters, helps leverage caching and thus speed up common queries, or complex queries with parts that can be reused.
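To give a feel for what such a request body looks like, the sketch below serializes a filtered query: a full-text match query wrapped together with a term filter. The field names (body, language) are invented for the example, and the JSON is built with System.Text.Json purely for convenience. Note that the filtered query belongs to the Elasticsearch versions current when this post was written; later versions deprecated and then removed it in favour of a bool query with a filter clause.

```csharp
using System;
using System.Text.Json;

public static class FilteredQueryExample
{
    // Builds the JSON body of a "filtered" query: the match query is scored,
    // while the term filter only narrows the result set and can be cached.
    // The "body" and "language" field names are hypothetical.
    public static string Build(string text, string language)
    {
        var request = new
        {
            query = new
            {
                filtered = new
                {
                    query = new { match = new { body = text } },
                    filter = new { term = new { language = language } }
                }
            }
        };
        return JsonSerializer.Serialize(request);
    }

    public static void Main()
    {
        // The resulting string would be sent as the body of a search request.
        Console.WriteLine(Build("search engine", "he"));
    }
}
```

Because the filter part carries no scoring, the same filter can be reused across many different full-text queries.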

Faceting, another very common search feature, is just something that is returned alongside the search results upon request, and is then ready for you to use.

The number of types of queries, filters and facets supported by Elasticsearch out of the box is huge, and there's practically nothing you cannot achieve with them. Looking to the near future, the upcoming aggregation framework looks very promising and is probably going to change the way we aggregate data with Elasticsearch today.

Elasticsearch is a search server, and the Query DSL it provides is definitely one of the places it really shines. Much easier than any SQL statement or Lucene queries written in Java.

Multi-tenancy

You can host multiple indexes on one Elasticsearch installation - node or cluster. Each index can have multiple "types", which are essentially completely different indexes.

The nice thing is you can query multiple types and multiple indexes with one simple query. This opens quite a lot of options.
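Here is a small sketch of how such a search URL is put together: index names and type names are simply comma-separated in the request path. The helper name and the index and type names are made up for illustration, and the type segment only applies to Elasticsearch versions that have types.

```python
def search_path(indexes=(), types=()):
    """Build the path of a _search request spanning several indexes and types."""
    parts = []
    if indexes:
        parts.append(",".join(indexes))
    elif types:
        # Types without explicit indexes: search that type in all indexes.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


# Hypothetical index and type names, for illustration only.
print(search_path(["blog", "forum"], ["post", "comment"]))
print(search_path(["blog", "forum"]))
print(search_path())
```

The same query body can then be sent to any of these paths, which is what makes searching across tenants or data sets so cheap to set up.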

Multi-tenancy: check.

Support for advanced search features

Search functions like MoreLikeThis and Suggestions (and Elasticsearch's excellent custom suggesters) are all supported as well using a very handy REST API.

More advanced search tools like script support in filters and scorers, BM25 relevance, the analyze API for testing analyzers, term stats info via REST and much more expose all of Lucene's internals and advanced capabilities for many advanced usages, very easily.

Configurable and Extensible

For those times where you really need to bend Elasticsearch to do things your way, you can easily configure it. It is also very easy to extend it, and we have done so multiple times on various occasions.

Many of Elasticsearch's configuration settings can be changed while Elasticsearch is running, but some will require a restart (and in some cases reindexing). Most settings can be changed using the REST API too.

Elasticsearch has several extension points - namely site plugins (let you serve static content from ES - like monitoring JavaScript apps), rivers (for feeding data into Elasticsearch), and plugins that let you add modules or components within Elasticsearch itself. This allows you to switch almost every part of Elasticsearch if you so choose, fairly easily.

If you need to create additional REST endpoints for your Elasticsearch cluster, that is easily done as well.

Out of the box, Elasticsearch is pretty much feature complete. The endless extensibility options make it quite impossible to ever get stuck without a solution. This has already saved our day once or twice.

Percolation

Percolation is a codename for Elasticsearch's ability to run many queries against one document, do this efficiently, and tell you which queries match this document. I have no idea why it's called that, but it works, and works great.

This is a very useful feature for implementing an alerting system - like Google Alerts (is it still operational??) or if you are indexing logs you can create alerts to sysadmins when some metric doesn't align.
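To illustrate the idea only (this is a toy model, not how Elasticsearch implements percolation, and it does not use its API), here is a minimal in-memory sketch: queries are registered up front, and each incoming document is checked against all of them. The class name, the query names and the "all terms must appear" matching rule are assumptions made for the example.

```python
class ToyPercolator:
    """Toy reverse search: store queries, then match documents against them."""

    def __init__(self):
        self._queries = {}

    def register(self, name, required_terms):
        # A "query" here is just a set of terms that must all be present.
        self._queries[name] = {term.lower() for term in required_terms}

    def percolate(self, text):
        # Naive whitespace tokenization, for illustration only.
        tokens = set(text.lower().split())
        return sorted(
            name for name, terms in self._queries.items() if terms <= tokens
        )


percolator = ToyPercolator()
percolator.register("disk-alerts", ["disk", "full"])
percolator.register("login-failures", ["login", "failed"])

matches = percolator.percolate("ERROR disk /dev/sda1 is full")
print(matches)  # ['disk-alerts']
```

In an alerting system, each registered query would belong to a subscriber, and every match would trigger a notification.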

We use this extensively in an alerting system we have in place, and thanks to Elasticsearch's extensibility we were able to rewrite the percolator to add some optimizations of our own based on our business logic, so we are now running it 10 times faster. Brilliant.

Custom analyzers and on-the-fly analyzer selection

Elasticsearch allows you to create indexes using nothing more than configuration in JSON or YAML. It looks like this. This makes the need to roll your own analyzer in code much less common - but when it's needed, it's also very easy to do. Here is a code example for doing this for a Hebrew analyzer.
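For a rough idea of the configuration-only approach, the sketch below builds index settings that define a custom analyzer out of built-in parts (the `standard` tokenizer and the `lowercase`, `asciifolding` and `stop` token filters). The analyzer name `folded_english` is made up, and the exact filter chain is only an example, not a recommendation.

```python
import json

# Index creation settings defining one custom analyzer.
# "folded_english" is a hypothetical name chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in order: lowercase tokens, strip accents,
                    # then drop stop words.
                    "filter": ["lowercase", "asciifolding", "stop"],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Sending this JSON as the body of an index creation request is all it takes; no Java code is involved.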

The nice part about analyzers with Elasticsearch is you don't have to define them globally if you know different documents will need to be analyzed differently. You can simply leave the analyzer selection to indexing time by using the _analyzer field. This is super useful for multi-lingual indexes, and I'll have a proper blog post about this topic soon.
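A minimal sketch of how this can look, assuming the `_analyzer` mapping with a `path` setting as it exists in the Elasticsearch versions current at the time of writing: the mapping points at a document field, and each document carries the name of the analyzer it should be indexed with. The type name `post`, the field name `analyzer_name` and the language lookup table are all hypothetical.

```python
import json

# Hypothetical lookup from a language code to a built-in analyzer name.
ANALYZER_BY_LANGUAGE = {"en": "english", "fr": "french", "de": "german"}

# Mapping for a hypothetical "post" type: _analyzer points at the document
# field whose value names the analyzer to use at indexing time.
post_mapping = {
    "post": {
        "_analyzer": {"path": "analyzer_name"},
        "properties": {"body": {"type": "string"}},
    }
}


def make_document(body, language):
    """Attach the analyzer name to the document based on its language."""
    analyzer = ANALYZER_BY_LANGUAGE.get(language, "standard")
    return {"body": body, "analyzer_name": analyzer}


print(json.dumps(post_mapping, indent=2))
print(json.dumps(make_document("Bonjour le monde", "fr")))
```

Detecting the language of each document is left out here; in practice that decision has to be made before indexing.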

Rich ecosystem

Many people use Elasticsearch today, and the ecosystem is steadily growing. The Elasticsearch team maintains several plugins on GitHub (see https://github.com/elasticsearch/), but there's a whole lot more. You can find a partial list here. For the rest, Google is your friend...

Other than plugins, there are a lot of monitoring and admin tools available, like this nice CLI tool for Unix. There are also Chef and Puppet recipes, dashboards, VMs and whatnot. If you need it, you can probably find it somewhere.

Active community

The community, other than creating nice tools and plugins, is very helpful and supportive. The overall vibe is really great, and this is an important metric for any OSS project.

There are also some books currently being written by community members, and many blog posts around the net sharing experiences and knowledge.

Proactive company

Elasticsearch is much more than an OSS project - today it's also a company, which serves as an umbrella for other projects like logstash, kibana and Hadoop integration. This, while still keeping all of its code available under the permissive Apache Software License.

Some of the engineers at Elasticsearch Inc are long-time Lucene committers, others are long-time OSS contributors, and overall they are great people and great engineers. Definitely the type of people you can trust a project with.

Elasticsearch is here to stay, and the team is pushing forward very hard - judging by the amount of work they have been doing lately, and the type of projects they take on.

I'm really looking forward to what will come next.

