All you need to know about Elasticsearch 5.0: Scripting

Elasticsearch has had its ups and downs with scripting language support. Two major versions ago (that is, in the 1.x series), it added and removed support for various scripting languages in almost every minor release. I have also recommended avoiding scripts myself, for several specific reasons.

Due to the security implications, Elasticsearch 2.x disabled all dynamic scripting by default and only allowed a special scripting language called Lucene Expression. And yup - I've seen quite a few Elasticsearch clusters on AWS that were hacked using malicious MVEL and Groovy scripts. Lucene Expression was allowed simply because it isn't run through a scripting virtual machine; instead, it is compiled to bytecode and executed efficiently.

Lucene Expression is quite a simple scripting language, though, and it has quite a few limitations. For one, it can only operate on numeric fields. It can't do date math. It can't have loops or methods (which may be a plus, actually). And it can only be written as a one-liner (variables are not supported), so some expressions are either too complex to read or just can't be expressed at all.

In version 5.0, Elasticsearch introduces a new scripting language called Painless, with a syntax very similar to Groovy's.

Painless uses the same concept as Expression - a scripting language that's compiled to bytecode before execution - but this time it comes with a whole lot more features: variables, type safety, loops and conditionals, and of course high performance.

So how can you use Painless?

Custom scoring function

Using the Function Score Query, one can modify the score of documents returned by the query it wraps. While there are several built-in functions (e.g. decay functions), you can write your own script (e.g. _score * 2) and also reference document values in your computation (e.g. _score * doc['my_numeric_field'].value).
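To make this concrete, here is a minimal sketch of such a request body, built as a Python dict so it can be dumped to JSON and sent with whatever client you like. The `title` field and the search text are hypothetical, and I'm assuming the 5.0 request format, where the script body lives under `inline` (later versions renamed that key to `source`).

```python
import json


def function_score_body(text):
    """Build a search body that rescales each hit's score by a numeric field."""
    return {
        "query": {
            "function_score": {
                # The wrapped query produces the original _score.
                # "title" is a hypothetical field name.
                "query": {"match": {"title": text}},
                "script_score": {
                    "script": {
                        "lang": "painless",
                        # Assumption: 5.0-style "inline" key for the script body.
                        "inline": "_score * doc['my_numeric_field'].value",
                    }
                },
                # Use the script's result as the final score; the default
                # mode would multiply it with the query score once more.
                "boost_mode": "replace",
            }
        }
    }


print(json.dumps(function_score_body("elasticsearch"), indent=2))
```

The resulting JSON is what you would send to the `_search` endpoint of your index.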

Computed fields

You can add a field to every query result ("hit") dynamically, at query time, and define its value using a script. The value can be computed from other fields in that document. Or whatever you feel like doing :)

These are called Script Fields, and they are specified in the search request alongside the query itself.
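As a rough illustration, the sketch below builds a search body with one script field next to the query. The `price` field, the `price_with_tax` name and the `tax_rate` parameter are all made up for the example, and the `inline` key again assumes the 5.0 request format.

```python
import json


def script_fields_body(tax_rate):
    """Build a search body that adds a computed field to every hit."""
    return {
        "query": {"match_all": {}},
        "script_fields": {
            # Hypothetical computed field, derived from a hypothetical "price" field.
            "price_with_tax": {
                "script": {
                    "lang": "painless",
                    "inline": "doc['price'].value * params.tax_rate",
                    # Values passed via params are read as params.<name> in the script.
                    "params": {"tax_rate": tax_rate},
                }
            }
        },
    }


print(json.dumps(script_fields_body(1.2), indent=2))
```

Passing the rate through `params` rather than baking it into the script text means the script itself stays identical between requests.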

Aggregations

Scripts can also be used in aggregations in several ways.

The Scripted Metric Aggregation is similar to the Script Fields feature, only it works on aggregations. It lets you compute a custom metric in a bucket from four scripts: init and combine run once per shard (before and after its documents are collected), map runs once per matching document, and reduce runs once on the coordinating node to merge the per-shard results. Not as easy, but definitely effective.
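Here is a sketch of what such an aggregation might look like, together with a plain-Python mock of the same four phases so the data flow is easier to follow. The `type` and `amount` fields are hypothetical, and the `params._agg` / `params._aggs` variable names are how I recall the 5.x documentation exposing the per-shard state; treat them as an assumption and check the reference for your version.

```python
import json


def scripted_metric_body():
    """Build an aggregation that sums sales and subtracts everything else."""
    return {
        "size": 0,
        "aggs": {
            "profit": {
                "scripted_metric": {
                    "init_script": "params._agg.transactions = []",
                    "map_script": (
                        "params._agg.transactions.add("
                        "doc['type'].value == 'sale' "
                        "? doc['amount'].value : -1 * doc['amount'].value)"
                    ),
                    "combine_script": (
                        "double profit = 0; "
                        "for (t in params._agg.transactions) { profit += t } "
                        "return profit"
                    ),
                    "reduce_script": (
                        "double profit = 0; "
                        "for (a in params._aggs) { profit += a } "
                        "return profit"
                    ),
                }
            }
        },
    }


def simulate(shards):
    """Plain-Python mock of the four phases; shards is a list of lists of docs."""
    per_shard = []
    for docs in shards:
        agg = {"transactions": []}  # init: once per shard
        for doc in docs:  # map: once per document
            sign = 1 if doc["type"] == "sale" else -1
            agg["transactions"].append(sign * doc["amount"])
        per_shard.append(sum(agg["transactions"]))  # combine: once per shard
    return sum(per_shard)  # reduce: once, over all shard results


print(json.dumps(scripted_metric_body(), indent=2))
```

The mock is only there to show which script sees what; the real scripts run inside Elasticsearch, not in your client.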

Similarly, the Bucket Script Aggregation is a Pipeline Aggregation type which can perform per-bucket computation and add the result as a metric to each bucket returned to the user.
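A minimal sketch, assuming a hypothetical sales index with `date`, `price` and `type` fields: for each month, it computes what percentage of total sales came from t-shirts. The aggregation names are mine; `buckets_path` maps script variable names to sibling metrics, and the script reads them through `params`.

```python
import json


def bucket_script_body():
    """Build a monthly histogram with a per-bucket computed percentage."""
    return {
        "size": 0,
        "aggs": {
            "sales_per_month": {
                "date_histogram": {"field": "date", "interval": "month"},
                "aggs": {
                    "total_sales": {"sum": {"field": "price"}},
                    "tshirt_sales": {
                        "filter": {"term": {"type": "t-shirt"}},
                        "aggs": {"sales": {"sum": {"field": "price"}}},
                    },
                    # The pipeline aggregation: runs on each month bucket
                    # after the metrics above have been computed.
                    "tshirt_percentage": {
                        "bucket_script": {
                            "buckets_path": {
                                "tShirtSales": "tshirt_sales>sales",
                                "totalSales": "total_sales",
                            },
                            "script": "params.tShirtSales / params.totalSales * 100",
                        }
                    },
                },
            }
        },
    }


print(json.dumps(bucket_script_body(), indent=2))
```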

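An even cooler feature is the Bucket Selector Aggregation. It is a pipeline aggregation that decides, via a script, which buckets of its parent aggregation are kept in the response. You could think of it as a paging mechanism around bucket aggregations, but it's much more effective than that, since it lets you filter buckets based on their metric values, which are passed to the script as parameters.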
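For example, the following sketch keeps only the months whose total sales exceed a threshold. The index fields (`date`, `price`), the aggregation names and the threshold are hypothetical; the script is expected to return a boolean, and buckets for which it returns false are dropped from the response.

```python
import json


def bucket_selector_body(min_total):
    """Build a monthly histogram that only returns months above min_total."""
    return {
        "size": 0,
        "aggs": {
            "sales_per_month": {
                "date_histogram": {"field": "date", "interval": "month"},
                "aggs": {
                    "total_sales": {"sum": {"field": "price"}},
                    "big_months_only": {
                        "bucket_selector": {
                            # Expose the sibling metric to the script as params.totalSales.
                            "buckets_path": {"totalSales": "total_sales"},
                            # The threshold is written into the script text as a literal.
                            "script": "params.totalSales > %d" % min_total,
                        }
                    },
                },
            }
        },
    }


print(json.dumps(bucket_selector_body(200), indent=2))
```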

Updating documents

Painless scripts can also be used to update documents, either a single document by ID using the Update API, or many documents at once using the Update By Query API. Instead of providing an actual value for the field to update, you provide a script that modifies the document's source and computes the new value.
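Both flavors could look roughly like the sketch below. The `counter` and `user` fields and the endpoint paths in the comments are hypothetical, and the `inline` key assumes the 5.0 request format. Inside an update script, the document is reachable through `ctx._source`.

```python
import json


def increment_script(count):
    """Script object that adds `count` to a hypothetical `counter` field."""
    return {
        "lang": "painless",
        "inline": "ctx._source.counter += params.count",
        "params": {"count": count},
    }


def update_by_id_body(count):
    # Would be sent to something like: POST /my_index/my_type/1/_update
    return {"script": increment_script(count)}


def update_by_query_body(count, user):
    # Would be sent to something like: POST /my_index/_update_by_query
    return {
        "script": increment_script(count),
        # Only documents matching this query are touched.
        "query": {"term": {"user": user}},
    }


print(json.dumps(update_by_id_body(4), indent=2))
print(json.dumps(update_by_query_body(1, "some_user"), indent=2))
```

The script is the same in both cases; the only difference is whether one ID or a query selects the documents it runs against.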
