Elasticsearch One Tip a Day: Filtered use of filters

Elasticsearch queries are great - the query DSL is very rich and keeps evolving. Its hierarchical structure allows very sophisticated queries to be written and executed, and makes it easy for UIs to wrap it with handy features such as auto-complete.

However, like every DSL, Elasticsearch's has sweet spots and pitfalls. Perhaps the most important thing to know about Elasticsearch's DSL is that every type of query boils down to a set of Term Queries and Boolean Queries. For example, a Range Query is essentially an aggregated query for all the terms between the two specified values (the gt and lt values - the from and to of the range): many TermQuerys chained in a BoolQuery's should clause.
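To make the expansion concrete, here is a minimal Python sketch of the idea. It is a conceptual illustration only, not how Lucene actually enumerates terms: it assumes a hypothetical `timestamp` field indexed with one term per day and inclusive bounds, and the helper name `expand_range_to_terms` is made up for this example. With real millisecond-resolution timestamps the number of distinct terms in a range can be far larger.

```python
from datetime import date, timedelta


def expand_range_to_terms(field, start, end):
    """Conceptually expand a day-granularity range into term clauses.

    Assumption: the field holds one term per calendar day and both
    bounds are inclusive. Real indexes are not this simple.
    """
    clauses = []
    day = start
    while day <= end:
        # One term clause per distinct value inside the range.
        clauses.append({"term": {field: day.isoformat()}})
        day += timedelta(days=1)
    # All the term clauses end up chained in a bool query's should clause.
    return {"bool": {"should": clauses}}


expanded = expand_range_to_terms("timestamp", date(2015, 1, 1), date(2015, 1, 31))
print(len(expanded["bool"]["should"]), "term clauses for a one-month range")
```

Even at day granularity a single month already produces 31 clauses, which is why an innocent-looking range can dominate the cost of a query.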

In fact, this should be taught as the first fact in every Elasticsearch training. Every slow Elasticsearch query has at least one clause that looks innocent but in fact expands to hundreds, if not thousands or more, of clauses. Date range queries in event-based systems are the prime example.

Avoiding such queries is not always possible - what if I want to get all documents between two dates? There's no way around a range query on a timestamp field. (Well, if you just want the latest documents you can simply sort by that field in descending order, but if you want documents from an arbitrary time slot, the range query is unavoidable.)

To support this kind of functionality better, Elasticsearch provides you with Filters. A filter is essentially a precomputed query: given a query, you can produce a filter that remembers which documents answer it. Then, no matter how expensive the query is, you can get instant answers when you need them because the query was already executed beforehand.

This essentially means the filter remembers which documents match its query, and which don't. This implies a few important things:

  1. There is a memory cost associated with caching those results. Generally speaking it is about 1 bit per document in the system for every filter used. The more filters you use and the more documents you maintain, the more memory you will need (sometimes measured in many GBs).

  2. A filter can only be defined to answer one specific question, so a RangeFilter for one date range (say, Jan 1st to 2nd 2015) is completely different from a RangeFilter for another date range (even Jan 1st to 3rd 2015), even if one range is a subset of the other.

  3. Filters use the results of a pre-executed query and are applied as an additional step after the query itself has executed. This means they don't participate in the scoring of results.

  4. The first time you run a filter, it has to execute the actual query, so filters are slow the first time you use them. Subsequent uses of the filter are very fast (basically, a BitSet AND operation).
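The four points above can be sketched with a toy filter cache in Python. This is an illustration under stated assumptions, not Elasticsearch's implementation: the `FilterCache` class, the `apply_filter` and `bitset_megabytes` helpers and the `day` field are all hypothetical, and a plain Python integer stands in for the BitSet.

```python
class FilterCache:
    """Toy filter cache: one bit per document, keyed by the exact filter definition."""

    def __init__(self, docs):
        self.docs = docs       # list of dicts; list position acts as the doc id
        self.bitsets = {}      # filter key -> int used as a bitset
        self.executions = 0    # how many times a filter actually had to run

    def range_filter(self, field, gte, lt):
        # Point 2: the key is the exact range, so a different range is a different filter.
        key = ("range", field, gte, lt)
        if key not in self.bitsets:
            # Point 4: the first use pays for executing the real query.
            self.executions += 1
            bits = 0
            for doc_id, doc in enumerate(self.docs):
                if gte <= doc[field] < lt:
                    bits |= 1 << doc_id
            self.bitsets[key] = bits
        return self.bitsets[key]


def apply_filter(query_hits, filter_bits):
    # Points 3 and 4: applying a cached filter is a bitwise AND and carries no score.
    return query_hits & filter_bits


def bitset_megabytes(num_docs, num_filters):
    # Point 1: roughly one bit per document for every cached filter.
    return num_docs * num_filters / 8 / 1024 / 1024


docs = [{"day": d} for d in (1, 1, 2, 3, 5)]
cache = FilterCache(docs)

first = cache.range_filter("day", 1, 2)   # executes the range
wider = cache.range_filter("day", 1, 3)   # overlapping range, still executed separately
again = cache.range_filter("day", 1, 2)   # served from the cache

print(cache.executions)                    # 2
print(bin(apply_filter(0b10110, wider)))   # 0b110
print(round(bitset_megabytes(100_000_000, 50)), "MB for 50 filters over 100M docs")
```

The overlapping range gains nothing from the narrower one already being cached, which is the practical meaning of point 2.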

Therefore, only use filters when you can re-use them, and when you don't mind that they have no influence on the score. If you are making a one-time query, don't create a filter for it. If you have a query you re-use, make sure the filters in it are re-usable as well.
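As a rough sketch of what such a re-usable request could look like, the Python snippet below builds a search body in which the text part is scored and the date range is applied as a filter. It assumes the Elasticsearch 1.x `filtered` query syntax (deprecated in 2.0 and removed in 5.0, where a `bool` query with a `filter` clause takes its place). The `message` and `timestamp` field names and the `filtered_search` helper are hypothetical.

```python
import json


def filtered_search(text, start, end):
    """Build a 1.x-style request body: scored match query plus unscored range filter."""
    return {
        "query": {
            "filtered": {
                # Scored part: relevance is computed only from this query.
                "query": {"match": {"message": text}},
                # Unscored part: a range filter that can be cached and re-used
                # as long as the exact same bounds are sent again.
                "filter": {
                    "range": {"timestamp": {"gte": start, "lt": end}}
                },
            }
        }
    }


body = filtered_search("timeout", "2015-01-01", "2015-01-02")
print(json.dumps(body, indent=2))
```

Different search text with the same bounds can re-use the same filter; changing either bound defines a new filter that has to be computed and cached from scratch.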

The secret to effective use of Elasticsearch filters is proper memory management. In the next tip I'll discuss how to plan for smart re-use of filters, and how to make sure they are cached properly.

