All you need to know about Elasticsearch 5.0: Index management
We see two main usage patterns with Elasticsearch indexes - monolith indexes and rolling indexes. Over the years Elasticsearch added features that greatly improve the experience of working in these patterns. Elasticsearch 5 introduces several new features that build on that further, and result in a very good index management story.
In this blog post we will discuss those two patterns, and show how new Elasticsearch 5 features - namely the Shrink and Rollover APIs - can greatly help effectively maintaining your production indexes.
Be sure to check the previous article in the series, All you need to know about Elasticsearch 5.0: Search, if you haven't already. In future blog posts in the series we will discuss more topics like data ingestion strategies and more.
A monolith index
One pattern that is mostly common when Elasticsearch is used for search is indexing into a monolith index. Usually this is a copy of data that resides elsewhere and is indexed to Elasticsearch for search and performing aggregation operations. This index is then scaled out to multiple nodes in the cluster via sharding and replication to accommodate search requests at scale.
The intent would usually be to optimize for search speeds on this index, and indexing into it will usually be occasional. There is one exception though - the monolith index will usually be recreated or bulk-updated periodically to keep it up to date with the source of truth, or when mapping changes are necessary. Once such indexing operation is required, you might want to scale out indexing to get it done faster, and have more nodes participate in the indexing process - sometimes more nodes than you usually have available for search.
Sharding helps in mitigating indexing and search speed issues. Sharding is what we call breaking one index into many smaller pieces under the hood, which are managed by Elasticsearch as one big index. Since each node can process up to a certain number of write requests per second, sharding the index can allow multiple nodes to participate in mass-indexing, assuming a balanced sharding function (which is the default). All this, while maintaining decent-size shards that are not too big or too small is important to optimize for search performance (I usually recommend a 1-million documents shard and a maximum 5-10GB size on disk).
There is one gotcha though - the number of shards can't be changed after index creation. Until now.
The new Index Shrink feature allows to "shrink" an index with X shards, to an index with less shards. The requested number of primary shards must be a factor of the number of shards in the original index. For example an index with 8 primary shards can be shrunk into 4, 2 or 1 primary shards or an index with 15 primary shards can be shrunk into 5, 3 or 1.
While this is not "resharding" per se, it is a great feature that addresses a real need. Now you can create an index with many shards to support intensive data ingestion, and then shrink it down to lesser shards to save on resources and optimize for search speeds. Shrinking an index doesn't do reindexing, it only relinks underlying index segments, and as such it is an efficient operation. It does, however, require the index to be read-only before shrinking - and mostly monolith indexes could indeed allow for that.
On that note, it's worth mentioning the relatively new Reindex API is highly useful in this usage pattern - whenever the reindexing operation is not due to data change, but an index mapping change, you can leverage Elasticsearch to issue a reindexing from the old index into a new one with the new mapping defined. There is no need for sophisticated parallel ETL processes in this case, since the old index already contains all the data needed.
An even more common pattern nowadays is the "rolling indexes" case. Usually time-series data that is indexed in time-based indexes, e.g. daily indexes with names like
logstash-2016.11.16 - and you will mostly see this pattern with logs, which is the main usage for the ELK stack today. In this pattern new indexes are being created continuously and after certain period of time they are not being written to anymore. Usually those indexes are removed from the cluster after a while, copied to a backup location and then deleted or just deleted if the data is not important enough to keep forever.
The time-series data case often times involves high ingestion rates 24/7 - think logs of an active system, or "IoT" cases. This means you want to optimize for writes to the active index at any given time, which in turn means as many shards as your nodes can support. Over-sharding will help you ingest more data in real-time and avoid Elasticsearch from pushing back or lagging behind on indexing due to overwhelming amount of indexing requests.
But this approach has several issues:
Past indexes that are not written to anymore but are searched on will be over-sharded, and that means degraded search performance for search as the less number of shards the better, and since the shard sizes will most likely be smaller than the optimal size for efficient search.
Not all indexes are created equal. While Elasticsearch treats them all the same, the world might not. During normal operations, some days could busier than others and produce twice the amount of events, while there could be weeks of downtime that would result in practically empty indexes.
Granted, overtime the number of documents you ingest in any given day will grow, which will effectively result in bloated indexes and shards - again damaging search performance. Changing the indexes naming from daily to hourly is currently a rigorous process that requires too many changes in too many places.
As you've guessed, #1 is easily fixed with the Shrink API. As we have just seen, once an index is stopped being written to you can shrink it to have lesser number of shards and thus optimize it for search and aggregations. Furthermore, because in the rolling indexes use-case this index will never be written to again, you can force-merge it (but make sure you don't end up with shards too big!), compress and then mark it as read-only. This will ensure highly effective searches on those indexes.
The Index Rollover API addresses the rest. It's a neat new feature that leverages aliases to pose quotas to indexes based on the number of documents in them, or based on time from the first indexed document. An alias to an index can be set such as once a quota has been reached on the index the alias will switch to indexing to a new index while still enabling search on this index and all previous ones. This goes a long way in allowing to balance index sizes also in the rolling indexes use case.
A nice and useful tool for managing indexes, especially in the rolling indexes scenario, has long been Curator. Using Curator in conjunction with index templates the Rollup API can now give you a very good index management experience also for rolling indexes.
You can learn more about this new API in an official blog post here. The Shrink and rollover APIs now allow using multiple shards to make the most of your hardware resources during indexing, then shrinking indices down to a single shard (or a few shards) for efficient storage and search.