Elasticsearch new features: 2020 year in review
(Cross-posted from BigData Boutique Blog)
What a year 2020 has been! Social distancing and a lot of very weird situations. For some it was a year full of difficulties, and hopefully a lot of growth and some good things too.
It has definitely been an interesting year for Elasticsearch. Many things happened, new features were added, and the product evolved significantly. We wanted to recap and share highlights of the new features along with usage recommendations. This post covers what we consider the big changes and important steps forward, based on our experience actively working with hundreds of customers on Elasticsearch clusters of all shapes and sizes, from full-text search to log analytics and anomaly detection.
As always, upgrading to a more recent Elasticsearch version can bring significant performance boosts, depending on your usage. This year as well, Elasticsearch has seen many performance and memory optimizations, most notably in the 7.8 and 7.9 releases. Those are too many and too low-level to list in detail here, so we won't. In this post we will focus on high-level technical details that we believe could make a big difference to you - a developer working with Elasticsearch. For the sake of brevity, we will only highlight each feature and give you links for further investigation.
One important note about licensing, before we start: recently, many new features have been added under the X-Pack license, and not as part of the open-source product. Most are indeed free (X-Pack Basic) but some require a paid subscription license. In this post, we focus on the features that we believe are important to know about and will be useful to you - but that are still in the free tier and usable without purchasing a license.
Here it goes.
Better results scrolling using PIT
There are three ways to page through search results (aka "do paging") in Elasticsearch: from/size, scroll, and search_after. Each has its downsides (slowness, instability of result sets, the requirement to sort on a field, and so on). In recent Elasticsearch versions a new API was added, namely the PIT (Point in Time) API. Essentially, it is a lightweight version of the scroll API and is meant to replace it: a PIT is opened once and then combined with search_after, making deep paging possible and more efficient, although some limitations remain.
Our recommendation is to still not over-use it - and in general avoid deep paging with Elasticsearch. But when you do need to scroll, e.g. for copying data to external jobs and so on - prefer the new PIT API over the (now discouraged) scroll API. A good summary and discussion are in the official documentation for the PIT API.
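To make the paging flow concrete, here is a minimal sketch of a PIT plus search_after loop, assuming the `pit` and `search_after` request fields and the `pit_id` response field described in the official docs. The `search` callable is a hypothetical stand-in for whatever client call sends the body to the search endpoint, and the sort fields and page size are illustrative assumptions.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical stand-in for a client call that POSTs the body to /_search.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_pit_search(pit_id: str, sort: List[Dict[str, str]], size: int,
                     search_after: Optional[List[Any]] = None,
                     keep_alive: str = "1m") -> Dict[str, Any]:
    """Build one page request; no index is named because the PIT implies it."""
    body: Dict[str, Any] = {
        "size": size,
        "query": {"match_all": {}},
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # The sort should end with a unique tiebreaker so pages do not overlap.
        "sort": sort,
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iterate_hits(search: SearchFn, pit_id: str, sort: List[Dict[str, str]],
                 size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every hit, feeding the last hit's sort values into the next page."""
    search_after: Optional[List[Any]] = None
    while True:
        response = search(build_pit_search(pit_id, sort, size, search_after))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
        # Keep using the most recent PIT id returned by the cluster.
        pit_id = response.get("pit_id", pit_id)
```

Opening the PIT before the loop and closing it afterwards are left out here; both are single requests whose exact form is covered in the PIT documentation.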
Ingestion using a single Agent, and Fleet
In 2020 Elastic moved away from the premise of having many Beats (Filebeat, Packetbeat, Metricbeat and so on) towards a single data collection agent that can deliver all that functionality. Under the hood the Elastic Agent runs all the Beats, but the unified Elastic Agent is going to be easier to deploy and manage. Also new is the option of managing agents through Fleet, and a centralized Ingest Manager in Kibana. Read more here.
Data Streams
In essence, Data Streams is just an improved API and a better user experience for using the Rollover API to partition data into multiple indexes. In the last couple of years we have moved away from relying on shards and towards index-level sharding - essentially partitioning the data into different indexes - and the new Data Streams API plays an important role in making your clusters more efficient and more stable.
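As a rough sketch of what using a data stream involves, the snippet below builds the two payloads typically needed: an index template that carries a `data_stream` object, and a bulk payload that uses `create` actions, since data streams are append-only. The stream pattern, priority value, and helper names are assumptions chosen for illustration; sending the payloads to a cluster is left out.

```python
import json
from typing import Any, Dict, Iterable


def data_stream_template(pattern: str) -> Dict[str, Any]:
    """Body for PUT _index_template/<name>; the empty data_stream object
    marks matching names as data streams rather than plain indexes."""
    return {
        "index_patterns": [pattern],
        "data_stream": {},
        "priority": 200,  # illustrative value
    }


def bulk_create_lines(docs: Iterable[Dict[str, Any]]) -> str:
    """NDJSON body for POST <data-stream>/_bulk using create actions."""
    lines = []
    for doc in docs:
        # Every document written to a data stream needs an @timestamp field.
        if "@timestamp" not in doc:
            raise ValueError("data stream documents need an @timestamp field")
        lines.append(json.dumps({"create": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects a trailing newline.
    return "\n".join(lines) + "\n"
```

Rollover then happens behind the stream name, so writers never need to know which backing index is current.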
Composable index templates
Composable index templates were introduced in Elasticsearch 7.8, and are in fact the result of a four-year-long discussion that started in 2016. With composable index templates you finally get fine-grained control over how templates are built and applied. See the docs for a comprehensive discussion and example.
Transforms
The Transform APIs became GA in 7.7, and at their core they are essentially a managed ETL process run by Elasticsearch within the Elasticsearch cluster, to copy and transform data from one index to another.
This allows for compacting ("roll up"), executing self-joins on write (e.g. folding events into a session), and enriching and cleaning up data from the ingest layer into indexes that feed the serving layer. See the docs here.
Our recommendation is to keep using dedicated ETL tools for any real-world or heavy use-cases, and to only use the Transforms API for minor use-cases, or during development. We usually prefer running ETLs with tools that were built to do it - like Flink, Airflow, NiFi and such - as they provide a significantly better development experience, the ability to debug and write testable code, and much more predictable and manageable ETL processes.
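A small sketch of how the pieces fit together, assuming the `_component_template` and `_index_template` endpoints and the `composed_of` field from the docs. The template names, patterns, and settings are made up for illustration, and the helper only builds and sanity-checks the request bodies.

```python
from typing import Any, Dict, List


def composable_template_requests() -> Dict[str, Dict[str, Any]]:
    """Map of 'METHOD path' to request body for two reusable components
    and one index template that composes them."""
    return {
        "PUT _component_template/logs-settings": {
            "template": {"settings": {"number_of_shards": 1}},
        },
        "PUT _component_template/logs-mappings": {
            "template": {
                "mappings": {
                    "properties": {
                        "@timestamp": {"type": "date"},
                        "message": {"type": "text"},
                    }
                }
            },
        },
        "PUT _index_template/logs": {
            "index_patterns": ["logs-*"],
            # Components are merged in this order; later ones win on conflict,
            # and the template's own "template" section overrides them all.
            "composed_of": ["logs-settings", "logs-mappings"],
            "priority": 100,
            "template": {"settings": {"number_of_replicas": 1}},
        },
    }


def missing_components(requests: Dict[str, Dict[str, Any]]) -> List[str]:
    """Names listed in composed_of that have no component template request."""
    prefix = "PUT _component_template/"
    defined = {key[len(prefix):] for key in requests if key.startswith(prefix)}
    missing = []
    for body in requests.values():
        for name in body.get("composed_of", []):
            if name not in defined:
                missing.append(name)
    return missing
```

The same component templates can be reused by any number of index templates, which is the main gain over the older, monolithic template format.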
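For the "folding events into a session" case, a pivot transform might look roughly like the body below. This is a sketch under stated assumptions: the index names, the `session_id` field, and the chosen aggregations are hypothetical, while `source`, `dest`, `pivot`, `sync`, and `frequency` are the transform fields described in the docs.

```python
from typing import Any, Dict


def session_transform(source_index: str, dest_index: str) -> Dict[str, Any]:
    """Body for PUT _transform/<id>: one output document per session."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            # One bucket (and so one destination document) per session id.
            "group_by": {
                "session_id": {"terms": {"field": "session_id"}},
            },
            "aggregations": {
                "started": {"min": {"field": "@timestamp"}},
                "ended": {"max": {"field": "@timestamp"}},
                "event_count": {"value_count": {"field": "@timestamp"}},
            },
        },
        # Continuous mode: pick up new events, allowing for ingest delay.
        "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
        "frequency": "5m",
    }
```

Dropping the `sync` section would turn this into a one-off batch transform instead of a continuously running one.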
Asynchronous search API
The async search API lets you asynchronously execute a search request, monitor its progress, and retrieve partial results as they become available. It was introduced in Elasticsearch 7.7, and is designed to support a more responsive search experience. See the docs here.
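The submit, poll, and retrieve cycle can be sketched as a small loop. The `submit` and `fetch` callables are hypothetical stand-ins for client calls to the async search submit and get endpoints; the `id`, `is_running`, and `response` fields are the ones the docs describe in the async search response.

```python
import time
from typing import Any, Callable, Dict, Optional


def run_async_search(
    submit: Callable[[Dict[str, Any]], Dict[str, Any]],
    fetch: Callable[[str], Dict[str, Any]],
    body: Dict[str, Any],
    on_partial: Optional[Callable[[Dict[str, Any]], None]] = None,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit a search, surface partial results while it runs,
    and return the final search response."""
    result = submit(body)
    while result.get("is_running"):
        if on_partial is not None:
            # Partial hits and aggregations are already usable at this point.
            on_partial(result.get("response", {}))
        sleep(poll_interval)
        result = fetch(result["id"])
    return result["response"]
```

In a UI, `on_partial` would typically redraw a chart or result list so the user sees data before the full search completes.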
Improved support and performance for geo-search
Geo-spatial search got a lot of love in 2020, most notably by making geo-shapes BKD-backed in Elasticsearch, which translates to significant precision, efficiency and speed gains. This also made it possible to add more geo-types, queries and aggregations.
New data types
* Histogram - A field to store pre-aggregated numerical data representing a histogram. This data is defined using two paired arrays - values and counts.
* Constant keyword data type - Constant keyword is a specialization of the keyword field for the case that all documents in the index have the same value.
* Arbitrary Shape and Point data types - The shape data type facilitates the indexing of and searching with arbitrary x, y Cartesian shapes such as rectangles and polygons. It can be used to index and query geometries whose coordinates fall in a 2-dimensional planar coordinate system.
* Wildcard field - A wildcard field stores values optimized for wildcard grep-like queries. Wildcard queries are possible on other field types but are usually slower and have analysis limitations (case sensitivity and others). The Wildcard field is an important addition - it finally allows case-insensitive wildcard search, and when used in conjunction with the search_as_you_type field type it makes it possible to provide great search experiences.
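The data types above could be declared in a mapping along these lines. This is a sketch with hypothetical field names; the `type` values are the ones named in the list, and the validator mirrors the documented constraints on histogram values (paired arrays, values in increasing order, non-negative integer counts). The `case_insensitive` flag on the wildcard query is assumed to be available in the version you run.

```python
from typing import Any, Dict


def example_mapping() -> Dict[str, Any]:
    """Mapping body using the new field types; field names are illustrative."""
    return {
        "properties": {
            "latency": {"type": "histogram"},
            "environment": {"type": "constant_keyword", "value": "production"},
            "floorplan": {"type": "shape"},
            "url_path": {"type": "wildcard"},
        }
    }


def validate_histogram(value: Dict[str, Any]) -> None:
    """Raise ValueError if a pre-aggregated histogram value is malformed."""
    values, counts = value["values"], value["counts"]
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    if any(later < earlier for earlier, later in zip(values, values[1:])):
        raise ValueError("values must be in increasing order")
    if any(not isinstance(c, int) or c < 0 for c in counts):
        raise ValueError("counts must be non-negative integers")


def wildcard_query(field: str, pattern: str) -> Dict[str, Any]:
    """Case-insensitive wildcard query against a wildcard field."""
    return {"wildcard": {field: {"value": pattern, "case_insensitive": True}}}
```

A document for the `latency` field would then carry something like `{"values": [0.1, 0.25, 0.5], "counts": [3, 7, 2]}`.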
New Aggregation types
* Histogram Aggregation - A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted from the documents
* Normalize Aggregation - A parent pipeline aggregation which calculates the specific normalized/rescaled value for a specific bucket value.
* Moving Percentiles Aggregation - Given an ordered series of percentiles, the Moving Percentile aggregation will slide a window across those percentiles and allow the user to compute the cumulative percentile.
* Moving Function Aggregation - Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom script that is executed on each window of data.
* Pipeline inference aggregation - A parent pipeline aggregation which loads a pre-trained model and performs inference on the collated result fields from the parent bucket aggregation.
* T-test aggregation - A new aggregation that will tell you if the difference between two population means is statistically significant and did not occur by chance alone.
* Box plot aggregation - The boxplot aggregation returns essential information for making a box plot: minimum, maximum, median, first quartile (25th percentile) and third quartile (75th percentile) values.
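To show how a few of these aggregations are expressed in a request, here is a sketch of an `aggs` body. All field names are hypothetical; the aggregation types and their parameters (`buckets_path`, `window`, `script`, `method`, and the `a`/`b` populations of the t-test) follow the official docs.

```python
from typing import Any, Dict


def example_aggs() -> Dict[str, Any]:
    """An 'aggs' section combining several of the aggregations listed above."""
    return {
        "per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "day"},
            "aggs": {
                "sales": {"sum": {"field": "price"}},
                # Slide a 7-bucket window over the daily sums.
                "sales_moving_avg": {
                    "moving_fn": {
                        "buckets_path": "sales",
                        "window": 7,
                        "script": "MovingFunctions.unweightedAvg(values)",
                    }
                },
                # Rescale each day's sum to its share of the total.
                "sales_share": {
                    "normalize": {
                        "buckets_path": "sales",
                        "method": "percent_of_sum",
                    }
                },
            },
        },
        "load_time_summary": {"boxplot": {"field": "load_time"}},
        "startup_change": {
            "t_test": {
                "a": {"field": "startup_time_before"},
                "b": {"field": "startup_time_after"},
                "type": "paired",
            }
        },
    }
```

The two pipeline aggregations sit inside the `date_histogram` because they operate on its ordered buckets, while `boxplot` and `t_test` are ordinary metric aggregations at the top level.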
New Query Language - EQL
The Event Query Language (EQL) is a query language for event-based time series data, such as logs, metrics, and traces. It was recently added to Elasticsearch and is aimed mostly, though not exclusively, at SIEM users. It joins the traditional Lucene query syntax and Elasticsearch JSON queries, as well as the SQL support and Kuery - the Kibana query language. Full docs here.
Kibana Lens
For years Kibana was about creating simple visualizations that can be added as components to a dashboard, and only more recently did it gain additional capabilities like Timelion, Time Series Visual Builder (TSVB) and Vega. Lens is the most recent addition, with the purpose of simplifying dashboard building even further.
Lens is the easiest and most intuitive way to visualize data in Elasticsearch. Its simple drag-and-drop interface lets anyone begin exploring data for insights right away, regardless of their previous Kibana experience. We think it is a great addition that lowers the barrier to entry for new Kibana users.
Alerting
For a very long while, the Elastic Stack lacked proper support for alerting. There was Watcher, which still exists, but it was always very limited and not trivial to use. It did start receiving a lot of improvements in the last two years or so, and especially in 2020, when Alerting became more streamlined throughout Kibana. In 2020, many features were added that help with creating alerts, and good alerts, from pretty much everywhere in Kibana. However, most of the new Alerting features are in the premium (paid) tiers, and a lot is still missing.
This is good progress, but we still think there is a long way to go. A couple of months ago we reviewed all the possibilities for enabling alerting on top of Elasticsearch in a webinar, and this content is still relevant today.
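As an illustration of what EQL looks like, the helper below assembles a sequence query and wraps it in a search body for the EQL search endpoint. The helper names, the `host.id` join key, and the event conditions are assumptions for the sake of the example; the `sequence by ... with maxspan=...` form and the `query` body field come from the EQL docs.

```python
from typing import Any, Dict, List, Tuple


def eql_sequence(by: str, maxspan: str, steps: List[Tuple[str, str]]) -> str:
    """Build an EQL sequence query from (event category, condition) pairs."""
    lines = [f"sequence by {by} with maxspan={maxspan}"]
    for category, condition in steps:
        lines.append(f"  [ {category} where {condition} ]")
    return "\n".join(lines)


def eql_search_body(query: str, size: int = 10) -> Dict[str, Any]:
    """Body for GET <index>/_eql/search."""
    return {"query": query, "size": size}


# A failed login followed by a successful one on the same host within 5 minutes.
suspicious_login = eql_sequence(
    by="host.id",
    maxspan="5m",
    steps=[
        ("authentication", 'event.outcome == "failure"'),
        ("authentication", 'event.outcome == "success"'),
    ],
)
```

Expressing "event A followed by event B" is awkward in the JSON query DSL, which is where EQL sequences are most useful.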
Our team of experts has the most experienced Elasticsearch consultants out there, with over 8 years of experience with Elasticsearch and the Elastic Stack. Find out if your cluster is fully optimized for spend, stability, and performance. Contact us for a quick, complimentary review at firstname.lastname@example.org.