The definitive guide for Elasticsearch 2.x on Microsoft Azure

This is an updated version of my blog post from two years ago. It features both the new Azure Portal, and the new Elasticsearch 2.x series. It also incorporates some helpful additions based on my experience since. Thanks to Yves Goeleven for providing some sound Azure advice.

So...

Before we start, it is worth noting that there are ready-made Elasticsearch installations available in the Azure Marketplace. While those are useful, you might still want to deploy your own cluster with your own configuration, to accommodate different sizing or other needs.

Elasticsearch in Azure Marketplace

To highlight the main differences from other guides out there and the approach in this guide:

  1. I prefer running Elasticsearch on Linux VMs rather than on Windows machines. This is simply because once you have an Elasticsearch data node, running Elasticsearch is going to be its sole purpose (at least, that's what I'd recommend), and Linux is much easier to set up and maintain as a remote box. Even if you're new to this, learning the ropes is not that hard, and most of what you'll need is detailed below.
  2. OpenJDK is the default JDK on many Linux distributions. Never run Elasticsearch on top of OpenJDK - it has too many known problems.
  3. Since this is Azure, you'll most likely be doing this from a Windows machine, so I'll be using tools that are available to Windows users.

With that in mind, let's start rolling. Log in to your Azure account, and head to the Management Portal.

Creating the infrastructure

Let's start by creating the infrastructure. You may need to wait after some steps for the resource you just created to be provisioned and started.

  • You start by creating a new resource group. We will create our cluster within that resource group, in the Azure Region you will specify. In the portal, click Resource groups -> Add, and give the new resource group a descriptive name.

  • Create a new Virtual Network. This will create a subnet where our Elasticsearch nodes can properly communicate with each other. Make sure to specify the resource group you just created.

Creating a new Virtual Network for an Elasticsearch cluster

  • Create a new Linux Virtual Machine in the same resource group; my personal preference is the latest Ubuntu Server, 15.10 or the 14.04 LTS. As for sizing, a good starting default is the D2 / DS2 instances - enough RAM, cores and disk to get you started. When creating the VM, make sure to select the virtual network you previously created. Set a password (or use SSH keys if you have previously created any) and confirm the creation of the new VM.

Creating new Azure VM for Elasticsearch

  • Click next and finish, then repeat the previous step to create as many VMs as you need. A recommended setup for a basic installation is 3 nodes (meaning, 3 servers running Elasticsearch) of the same size, to allow for a replication factor of two.

  • For each machine, after it has been created, you will need to open port 9200 for HTTP traffic if you want to access it from the outside. You can do so by going to the Endpoints settings of that machine and adding an HTTP/TCP rule for port 9200. We will discuss this at more length later.
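
If you prefer scripting over clicking through the portal, the cross-platform Azure CLI can open the port as well. A minimal sketch, assuming the classic deployment model and a hypothetical VM named es-node-1:

# map public port 9200 to port 9200 on the VM
azure vm endpoint create es-node-1 9200 9200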

Preparing the Linux VMs

After the VMs have been created and initialized, you can click on each one in the portal to get to its dashboard and grab its SSH details. This will be a domain and a port.

Download PuTTY.exe, and for each VM you've created follow the steps below. This is where we prepare the machines, install the Oracle JVM and Elasticsearch on them, and then launch the cluster.

Log in to the machine using PuTTY, typing the username and password you set up for the machine when creating it.
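
If you'd rather stay on the command line, plink (PuTTY's command-line sibling) works just as well; a sketch with a hypothetical username and host - take the real ones from your VM's SSH details:

plink -ssh -P 22 azureuser@my-es-vm.westeurope.cloudapp.azure.com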

Once logged in, execute the following commands, in the specified order.

First, this one-liner will install the Oracle JDK (mostly) silently and set JAVA_HOME accordingly:

echo debconf shared/accepted-oracle-license-v1-1 select true | \
sudo debconf-set-selections && echo debconf shared/accepted-oracle-license-v1-1 seen true | \
sudo debconf-set-selections && sudo add-apt-repository ppa:webupd8team/java && sudo apt-get update -y && sudo apt-get install oracle-java8-installer -y && sudo apt-get install oracle-java8-set-default -y

If you wish to verify the JVM version, you can now run java -version.

Next, download and install the latest Elasticsearch (2.3.1 at the time of writing) using the official Debian package, and set it up to run on boot:

es_version=2.3.1 && curl -s "https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/$es_version/elasticsearch-$es_version.deb" -o elasticsearch-$es_version.deb && sudo dpkg -i elasticsearch-$es_version.deb && sudo update-rc.d elasticsearch defaults 95 10

The following command will set up some data folders on the mounted drive and make them accessible to Elasticsearch:

sudo mkdir -p /mnt/elasticsearch/{data,log,work} && sudo chown -R elasticsearch:elasticsearch /mnt/elasticsearch
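
You can verify the folders were created and handed over to the elasticsearch user:

ls -ld /mnt/elasticsearch/{data,log,work}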

Configuring Elasticsearch

We now have multiple VMs with Elasticsearch properly installed and ready to run. Before actually running the cluster, there are a few configurations we need to make.

First, we need to tell Elasticsearch how much memory it can consume. The highly recommended default is exactly 50% of the available memory; the rest is left for the OS and the file-system cache Lucene relies on. Meaning, if we have a 28GB instance, we will give Elasticsearch 14GB. We do this by running:

sudo nano /etc/init.d/elasticsearch

Then uncomment ES_HEAP_SIZE=2g (by removing the #) and change it to 50% of the available memory (e.g. ES_HEAP_SIZE=14g); the same variable can also be set in /etc/default/elasticsearch. To save and quit, hit Ctrl-X, then y, then Enter.
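
If you are scripting the setup, a small sketch like this can compute the value instead of hard-coding it (purely illustrative):

# suggest a heap size of half the machine's RAM, in megabytes
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "ES_HEAP_SIZE=$((total_kb / 2 / 1024))m"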

The last remaining bit is the Elasticsearch configuration file. It contains the basic configuration options with good explanations of them (which you should read!), but on our VMs we don't really need all that, just some of the settings. So let's remove the default file and edit a clean one:

sudo rm /etc/elasticsearch/elasticsearch.yml && sudo nano /etc/elasticsearch/elasticsearch.yml

Paste the following into the file, changing the defaults as per comments below:

# It is recommended to choose descriptive cluster and node names.
# Node names are useful for monitoring and debugging, so you'd want to set them
# via your provisioning script / tools instead of keeping the defaults.
cluster.name: my-cluster
node.name: one

# Use the mounted drive for storage
path.data: /mnt/elasticsearch/data
path.work: /mnt/elasticsearch/work
# you can also change the logs path, via path.logs

# These settings are based on a single-node cluster.
# A real-world cluster will consist of more nodes, see the discussion below.
# Since multicast is now disabled by default, you will need to explicitly add
# nodes to the unicast.hosts array below.
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 1
gateway.recover_after_nodes: 1
gateway.expected_nodes: 1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["localhost"]

index.number_of_shards: 4
index.number_of_replicas: 0

# Performance related configurations
bootstrap.mlockall: true
node.max_local_storage_nodes: 1

To elaborate on what we have done here:

  • cluster.name is the name of the cluster. It has to be unique per cluster, and all nodes within the same cluster have to share the exact same name.
  • node.name is just a convenience option; I usually set it to the VM name.
  • number_of_shards and number_of_replicas are just default settings you can override anytime with the index settings API. The values shown above fit the single-node example; for the 3-node setup mentioned earlier you would raise number_of_replicas to 2.
  • bootstrap.mlockall: true locks the process memory so the Elasticsearch heap can never be swapped out. This must be there.
  • Set gateway.expected_nodes to the number of nodes in your cluster, set discovery.zen.minimum_master_nodes to (number of master-eligible nodes / 2) + 1 to avoid split-brain scenarios, and disable multicast (see the sketch after this list).
  • discovery.zen.ping.unicast.hosts should contain the IPs of the VMs that you want to participate in the cluster. These need to be the internal IPs on the subnet, and you can get them from the VM dashboard, right next to the SSH details.
  • node.max_local_storage_nodes prevents multiple instances of a data node from being launched by mistake on the same machine.
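
To make that concrete, here is a sketch of how those discovery settings could look for a hypothetical 3-node cluster, assuming internal IPs 10.0.0.4 through 10.0.0.6 (replace the single-node values shown above with these, on each node):

discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 2
gateway.expected_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]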

Once done, hit Ctrl-X, y, Enter again to save and quit. You can now run Elasticsearch on this VM by typing:

sudo /etc/init.d/elasticsearch start

You can check the status of the service by running:

sudo /etc/init.d/elasticsearch status

After the few seconds it takes to initialize, you can verify it is running by pinging it over HTTP:

curl http://localhost:9200

You should get Elasticsearch's Hello World, spoken in JSON.
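
For reference, on the 2.x series it looks roughly like this (trimmed; your values will differ):

{
  "name" : "one",
  "cluster_name" : "my-cluster",
  "version" : {
    "number" : "2.3.1",
    ...
  },
  "tagline" : "You Know, for Search"
}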

More explanation and guidance on the initial configuration can be found in the official docs.

Managing the node

To stop a node you can simply issue:

sudo /etc/init.d/elasticsearch stop

To verify our cluster is all up and running, log in to one of the machines and execute curl http://localhost:9200/_cluster/state?pretty. You should see a response similar to this:

{
  "cluster_name" : "my-cluster",
  "version" : 5,
  "master_node" : "Ufyhfc5-RyiK16EJie9lUg",
  "blocks" : { },
  "nodes" : {
    "Ufyhfc5-RyiK16EJie9lUg" : {
      "name" : "es-vm-ub2",
      "transport_address" : "inet[/10.0.0.6:9300]",
      "attributes" : { }
    },
    "uPXj-Gz1RpeEQkCNy7qiYg" : {
      "name" : "es-vbm-ub1",
      "transport_address" : "inet[/10.0.0.4:9300]",
      "attributes" : { }
    },
    "oEpz-mFNSiSr9K_T8PdCFg" : {
      "name" : "es-vm-ub3",
      "transport_address" : "inet[/10.0.0.5:9300]",
      "attributes" : { }
    }
  },
  "metadata" : {
    "templates" : { },
    "indices" : { }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "Ufyhfc5-RyiK16EJie9lUg" : [ ],
      "oEpz-mFNSiSr9K_T8PdCFg" : [ ],
      "uPXj-Gz1RpeEQkCNy7qiYg" : [ ]
    }
  },
  "allocations" : [ ]
}

This will show you the cluster state, and you should be able to tell if you are missing nodes. More on this here.
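
For a quicker summary, the cluster health API is also handy - a green status means all primary and replica shards are allocated:

curl http://localhost:9200/_cluster/health?pretty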

Install the Azure plugin

The official Elasticsearch plugin for Azure provides out-of-the-box support for various Azure facilities, like backing up to Azure Storage and enabling node discovery via the Azure APIs.

Installation of the plugin is well-documented here, as is testing it and its usage for discovery and backups.
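
For reference, on a 2.x deb install the installation should boil down to a one-liner on each node (check the linked docs for the exact command for your version):

sudo /usr/share/elasticsearch/bin/plugin install cloud-azure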

Accessing the node

By default our Elasticsearch cluster is inaccessible from the outside, and that's generally a good approach. We don't want people to have access to our data, even if it's read-only. But how do you access it from your applications? You have two options:

  1. Deploy your application to Azure. Assuming you are running it on the same subscription and Virtual Network, you can open the 9200 HTTP port on the nodes and tell Elasticsearch to bind to the private IP of each node. Since accessing the nodes via the private IP can only be done from within Azure, this makes the nodes accessible to your Azure-deployed apps only. That is probably the best and easiest approach. To do that, set network.host in the elasticsearch.yml configuration file to your node's private IP (see the sketch after this list). If you have the Azure plugin installed, using the value _site_ would provide this functionality dynamically.

  2. Alternatively, if you need to access the cluster from elsewhere, you can open the 9200 port, but you should protect it with authentication. The most reliable way of protecting a publicly available Elasticsearch from the outside world is via Shield, or by putting it behind a proxy.
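
A minimal sketch of the binding change from option 1, assuming a private IP of 10.0.0.4 (take yours from the VM dashboard) and an Elasticsearch restart afterwards:

network.host: 10.0.0.4   # or _site_ with the Azure plugin installed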

Some advice

Congratulations, you now have your Elasticsearch cluster running on Azure. Here are a few more points for your consideration.

Prefer mounted disks, SSD works best, and set up backups

We used the mounted drives for our data. Those drives are volatile, so it's good advice to have more than one node in the cluster, so that data isn't lost when a node restarts. Also, create backups using the Azure plugin and the excellent snapshot / restore API.
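
A sketch of what that looks like with the snapshot / restore API, assuming the Azure plugin is installed and configured with your storage account credentials (repository and snapshot names here are hypothetical):

# register an Azure Storage backed snapshot repository
curl -XPUT 'http://localhost:9200/_snapshot/azure_backup' -d '{ "type": "azure" }'

# take a snapshot of all indices into it
curl -XPUT 'http://localhost:9200/_snapshot/azure_backup/snapshot_1?wait_for_completion=true'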

Changing the data and log paths is fairly simple and can be done by changing path.logs and path.data in elasticsearch.yml, like we did in this guide; see more details here.

Due to the way Elasticsearch and Lucene operate under the hood, prefer SSD disks. They will give you a considerable performance boost.

Log management

You should keep track of the Elasticsearch log files. If anything goes south, you should know where to look, and that place should be nearby.

Elasticsearch writes logs to /var/log/elasticsearch/. You should consult them to check for errors and such on a regular basis (or when something goes wrong). Use tail or less to do so, on the file bearing the cluster name under the logs folder.
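
For example, with the cluster name used in this guide:

sudo tail -f /var/log/elasticsearch/my-cluster.log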

Use a good monitoring tool

The best Elasticsearch real-time monitoring tool out there is Kopf. Install it; you will thank me later.
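
Kopf installs as a site plugin; a sketch for a 2.x deb install (check the Kopf README for the branch matching your Elasticsearch version - 2.0 is assumed here):

sudo /usr/share/elasticsearch/bin/plugin install lmenezes/elasticsearch-kopf/2.0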

Other decent monitoring tools are BigDesk and Marvel (available as a Kibana plugin).

For monitoring the machine itself, I usually use collectd and htop. The former sends metrics to an aggregation service that I usually have running (a topic for another post), and the latter visualizes real-time machine stats (install it via sudo apt-get install htop).

Scaling out

A typical Elasticsearch cluster has 3 types of nodes - master-eligible, data, and client nodes. A more elaborate explanation of the topology can be found here, but the gist is that you will need all three types, with several nodes of each.

Scaling out for redundancy, and having multiple nodes in three different configurations, calls for automation; however, I'm not an Azure automation expert. Automating deployments using ARM resources can be done with Azure Automation or via Chef / Puppet / Docker, and cloning existing machines is possible by copying the vhd of the original machine using PowerShell. I would also look at using HashiCorp's Terraform to script the launch described in this guide. While there seems to be some work already done on that, I haven't gotten around to actually testing it myself. Probably a good topic for a future post.

Don't use a load balancer

In this guide we did not set up a load balancer in front of our Elasticsearch cluster, because it's mostly redundant if you are using an official Elasticsearch client, which will do sniffing and round-robin on its own - no need to introduce unnecessary friction.

It's tempting to put a load balancer in front of your cluster, but it really doesn't make a lot of sense in the case of Elasticsearch. The official clients in all languages will do round-robin and fail-over on their own. They also implement sniffing functionality so they can detect new nodes when they join the cluster. Your load balancer (or Azure's) can only be so smart. Elasticsearch clients will be smarter, as long as you make sure to use an official and recent version of a client (or one that was implemented with this in mind).

If you adhere to the recommended setup then you should be able to scale out client nodes as necessary, and your applications will sniff and find new nodes as they are added (and ignore old unreachable ones after a while).

Some words about sizing

Sizing has always been a tough problem in distributed systems. As I mentioned before, launching small instances is enough to get a feel for using Elasticsearch. For real production use, the data nodes are probably going to be the biggest cost of the cluster.

I would look at machines with at least 16GB of RAM for data nodes (obviously with enough disk space as well), and at least 4 cores. Master-eligible servers can be relatively slow, but it's recommended they have 2 cores to absolutely minimize the possibility of them crashing (and we've run master nodes on 1-core machines for a long while without problems). Client nodes are optional but really helpful if you have many requests coming into the cluster; their requirements are also relatively low.

As for the number of nodes: 3 master-eligible nodes are definitely enough for any cluster that is not huge. The number of data nodes depends on the amount of data and the replication factor you are after (for redundancy and performance). The number of client nodes is a question of the volume of requests you expect to be handling.

