The definitive guide for Elasticsearch on Windows Azure

July 7th, 2014 Elasticsearch, .NET, Windows Azure, Cloud, English posts


Running Elasticsearch in the cloud with Azure is fairly easy. There are several guides out there on how to do this, but my way of doing it is a bit different, so I thought it'd be worth blogging about. I've also included discussions of some important related topics at the end.

To highlight the main differences from other guides out there, with my approach:

I prefer running Elasticsearch on Linux VMs as opposed to Windows machines. This is simply because once you have an Elasticsearch data node, that is going to be its sole purpose (at least, that's what I'd recommend), and Linux is so much easier to set up and maintain as a remote box. Even if you're new to this, learning the ropes is not that hard, and most of what you'll need is detailed below.
Once running on a Linux VM, I will never run Elasticsearch on top of OpenJDK - that has too many known problems.
Since this is Azure, you'll most likely be doing this from a Windows machine, so I'll be using tools that are available for Windows users.
I will not set up a load balancer in front of the Elasticsearch cluster, because it's mostly redundant if you are using an official Elasticsearch client, which will do sniffing and round robin, so there is no need to introduce unnecessary friction.

With that in mind, let's start rolling. Log in to your Windows Azure account, and head to the Management Portal.

Creating the infrastructure

Let's start by creating the infrastructure. You may need to wait after some steps for the resource you just created to be provisioned and started.

You start by creating a new Virtual Network. This will create a subnet where our Elasticsearch nodes can properly communicate with each other. As with all things in Azure, you click on the big "+ NEW" button at the bottom, and make sure to specify a proper Location:

Now create a new Cloud Service for the Elasticsearch cluster to run on. This will be the "host" for all our VMs, which we will create shortly - again, make sure the Region is appropriate and is in the same location as the Virtual Network you created:

We can now start creating Virtual Machines to work with. Go to New -> Compute -> Virtual Machine, and opt for creating new VMs from Gallery. In the window that opens, select the latest Ubuntu version (14.04 LTS at the time of this writing), and click next.
There are now 2 screens which will need your attention. In the second screen, "Virtual machine configuration", make sure to give the VM a good name, and set a proper size. If this is for a production environment, I'd strongly suggest at least 28GB of memory, unless you are certain you can work with smaller sizes. Then, on Authentication, uncheck "Upload ssh key" and check "Provide a password instead". This will allow you to SSH into the machine with a username/password combination. More on uploading certificates later:

In the next screen you will be asked to attach the VM to a Cloud Service and a Virtual Network. Select the ones we have created ("my-elasticsearch" Cloud Service and "my-elasticsearch-app" Virtual Network in this case), and accept the rest of the defaults:

Click next and finish, and then repeat the process (starting at the Virtual Machine creation step) to create as many VMs as you need. A recommended setup for a basic installation would be 3 nodes (meaning, 3 servers running Elasticsearch) of the same size, to allow for a replication factor of two.

Preparing the Linux VMs

After the VMs are created and initialized, you can click each one in the portal to get to its Dashboard and find its "SSH details". This will be a domain and a port.

Download PuTTY.exe and, for each VM you've created, follow the steps below. This is where we prepare the machine to have the Oracle JVM and Elasticsearch, before we bring up the cluster.

With PuTTY, log in to the machine and type the username and password you set up for the machine when creating it.
Once logged in, execute the following commands:

This one-liner will install Oracle JDK (mostly) silently, and set JAVA_HOME accordingly. For more details on what it's doing, see http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html:

echo debconf shared/accepted-oracle-license-v1-1 select true | \
sudo debconf-set-selections && echo debconf shared/accepted-oracle-license-v1-1 seen true | \
sudo debconf-set-selections && sudo add-apt-repository ppa:webupd8team/java && sudo apt-get update && sudo apt-get install oracle-java7-installer && sudo apt-get install oracle-java7-set-default

If you wish to verify the JVM version, you can now run java -version.

Next, download and install the latest Elasticsearch (1.2.1 at the time of writing this) using the official Debian package, and set it up to run on boot:

curl -s https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.deb -o elasticsearch-1.2.1.deb \
&& sudo dpkg -i elasticsearch-1.2.1.deb && sudo update-rc.d elasticsearch defaults 95 10

Configuring Elasticsearch

We now have multiple VMs properly installed with Elasticsearch ready to run on them. Before actually running the cluster, there are a few settings we need to configure.

First, we need to tell Elasticsearch how much memory it can consume. The highly-recommended default is 50% of the available memory. Meaning if we have a 28GB instance, we will give Elasticsearch 14GB. We do this by running:

sudo nano /etc/init.d/elasticsearch

Then uncomment ES_HEAP_SIZE=2g (by removing the #) and change it to be 50% of available memory (e.g. ES_HEAP_SIZE=14g). To save and quit, press Ctrl-X, y, and then Enter.

The last remaining bit is the Elasticsearch configuration file. It has a lot of details in it, with a lot of explanations (that you should read!), but on our VM we don't really care about all that; we just need to have the settings relevant to us there. So let's remove the file and start editing a clean one:
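If you'd rather not work out the number by hand, the 50% rule can be sketched in shell. This is only an illustration under my own assumptions: suggest_heap is a made-up helper name, and the script reads total memory from /proc/meminfo, which is where Ubuntu exposes it.

```shell
# Hypothetical helper (not part of Elasticsearch): suggest ES_HEAP_SIZE as
# half of the machine's memory.
# $1: total memory in kB, as reported by the MemTotal line in /proc/meminfo
suggest_heap() {
  total_kb=$1
  half_mb=$(( total_kb / 1024 / 2 ))   # kB -> MB, then take 50%
  echo "${half_mb}m"
}

# Print a ready-to-paste setting when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "ES_HEAP_SIZE=$(suggest_heap "$mem_kb")"
fi
```

For a machine reporting exactly 28GB (29360128 kB), suggest_heap returns 14336m, which is the same 14GB as above expressed in megabytes.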

sudo rm /etc/elasticsearch/elasticsearch.yml && sudo nano /etc/elasticsearch/elasticsearch.yml

Paste the following into the file, changing the defaults as per comments below:

cluster.name: my-cluster
node.name: "my-node"
index.number_of_shards: 1
index.number_of_replicas: 2
bootstrap.mlockall: true
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]

To elaborate on what we have done here:

cluster.name is the name of the cluster. It has to be unique per cluster, and all nodes within the same cluster have to share the exact same name.
node.name is just a convenience option, I usually set it to the VM name.
number_of_shards and number_of_replicas are just default settings you can override anytime with the index settings API. These defaults are good for a 3-node cluster.
bootstrap.mlockall: true ensures Elasticsearch gets all the memory it needs properly. This must be there.
Set gateway.expected_nodes to the number of nodes in your cluster, set discovery.zen.minimum_master_nodes to a quorum of the master-eligible nodes, i.e. (number of nodes / 2) + 1 rounded down (2 for a 3-node cluster), and disable multicast.
discovery.zen.ping.unicast.hosts should contain the IPs of the VMs that you want to participate in the cluster. This needs to be the internal IP on the subnet, and you can get it from the VM dashboard, right next to the SSH Details.
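Since the file is almost identical on every VM, you could generate it instead of pasting it. The following is a rough sketch, not something I ran as part of this setup: render_config is a hypothetical helper, it treats every listed IP as a master-eligible node, and it computes minimum_master_nodes with the usual quorum rule of (nodes / 2) + 1 using integer division, which gives 2 for three nodes.

```shell
# Hypothetical helper: print an elasticsearch.yml for one node.
# Usage: render_config <cluster-name> <node-name> <ip> [<ip> ...]
render_config() {
  cluster=$1
  node=$2
  shift 2
  count=$#                      # number of nodes = number of IPs given
  quorum=$(( count / 2 + 1 ))   # majority of master-eligible nodes
  hosts=""
  for ip in "$@"; do
    # Build a comma-separated list of quoted IPs.
    hosts="${hosts:+$hosts, }\"$ip\""
  done
  cat <<EOF
cluster.name: $cluster
node.name: "$node"
index.number_of_shards: 1
index.number_of_replicas: 2
bootstrap.mlockall: true
gateway.expected_nodes: $count
discovery.zen.minimum_master_nodes: $quorum
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [$hosts]
EOF
}

# Example (writes the file on the current VM, using its hostname as node name):
#   render_config my-cluster "$(hostname)" 10.0.0.4 10.0.0.5 10.0.0.6 \
#     | sudo tee /etc/elasticsearch/elasticsearch.yml
```

The shard and replica counts are kept fixed at the values used in this guide; adjust them if your cluster is not 3 nodes.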

Once done, hit Ctrl-X, y, Enter again to save and quit. You can now run Elasticsearch on this VM by typing:

sudo /etc/init.d/elasticsearch start

After the few seconds it takes to initialize, you can verify it is running by pinging it over HTTP:

curl http://localhost:9200

You should get Elasticsearch's Hello World, spoken in JSON.
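If you script the startup, you may prefer to poll instead of guessing how long "a few seconds" is. Here is a minimal sketch; wait_for_es is a name I made up, and the default of 30 attempts one second apart is an arbitrary assumption.

```shell
# Hypothetical helper: poll the HTTP port until the node answers.
# Usage: wait_for_es [url] [max-attempts]
wait_for_es() {
  url=${1:-http://localhost:9200}
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors, -s keeps it quiet.
    if curl -s -f -o /dev/null "$url"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep 1
  done
  return 1   # the node did not answer in time
}

# Example:
#   sudo /etc/init.d/elasticsearch start && wait_for_es && echo "node is up"
```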

Managing the node

Elasticsearch now runs as a service, and writes logs to /var/log/elasticsearch. You should consult them to check for errors and such on a regular basis (or if something goes wrong).

To check the status of a node, simply type:

sudo /etc/init.d/elasticsearch status

And to stop a node:

sudo /etc/init.d/elasticsearch stop

To verify our cluster is all up and running, log in to one of the machines and execute curl http://localhost:9200/_cluster/state?pretty. You should see a response similar to this:

{
  "cluster_name" : "my-cluster",
  "version" : 5,
  "master_node" : "Ufyhfc5-RyiK16EJie9lUg",
  "blocks" : { },
  "nodes" : {
    "Ufyhfc5-RyiK16EJie9lUg" : {
      "name" : "es-vm-ub2",
      "transport_address" : "inet[/10.0.0.6:9300]",
      "attributes" : { }
    },
    "uPXj-Gz1RpeEQkCNy7qiYg" : {
      "name" : "es-vbm-ub1",
      "transport_address" : "inet[/10.0.0.4:9300]",
      "attributes" : { }
    },
    "oEpz-mFNSiSr9K_T8PdCFg" : {
      "name" : "es-vm-ub3",
      "transport_address" : "inet[/10.0.0.5:9300]",
      "attributes" : { }
    }
  },
  "metadata" : {
    "templates" : { },
    "indices" : { }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "Ufyhfc5-RyiK16EJie9lUg" : [ ],
      "oEpz-mFNSiSr9K_T8PdCFg" : [ ],
      "uPXj-Gz1RpeEQkCNy7qiYg" : [ ]
    }
  },
  "allocations" : [ ]
}

This will show you the cluster state, and you should be able to tell if you are missing nodes. More on this here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-state.html.
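For a quicker check than reading the whole state, the _cluster/health endpoint reports a number_of_nodes field. The snippet below is a small sketch that pulls that number out with sed; nodes_in_health is a hypothetical helper name, and a proper JSON parser would be more robust than this pattern.

```shell
# Hypothetical helper: read a _cluster/health JSON response on stdin
# and print the value of its "number_of_nodes" field.
nodes_in_health() {
  sed -n 's/.*"number_of_nodes" *: *\([0-9][0-9]*\).*/\1/p'
}

# Example, run on any node:
#   curl -s http://localhost:9200/_cluster/health | nodes_in_health
```

If the printed number is lower than the number of VMs you started, a node failed to join, and its log under /var/log/elasticsearch is the place to look.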

Some advice

Congratulations, you now have your Elasticsearch cluster running on Azure. Here are a few more points for your consideration.

Use separate disks

In this guide we opted to use the VM storage space for logs and data.

You should mount dedicated drives for logs and data files. Meaning, attach one new disk to every VM and have Elasticsearch use it for data storage. This way, you can easily upgrade your Elasticsearch VMs on the go by just destroying machines without copying files. This is also a good practice for data redundancy. I would also consider having one shared disk (with separate folders, one for each VM) for logs a good practice, since you can then have one Logstash instance running on top of that one disk.

Changing the data and log paths is fairly simple, and can be done by setting path.logs and path.data in elasticsearch.yml; see the Elasticsearch documentation for more details.
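As an illustration, assuming the attached disk is already formatted and mounted, the two settings could be produced like this. The mount point /mnt/esdata and the helper name es_path_settings are my own placeholders, not something Azure or Elasticsearch dictates.

```shell
# Hypothetical helper: print the path settings for a given mount point.
# Usage: es_path_settings <mount-point>
es_path_settings() {
  printf 'path.data: %s/data\npath.logs: %s/logs\n' "$1" "$1"
}

# Example (the Debian package runs the service as the elasticsearch user,
# so that user needs to own the directories):
#   sudo mkdir -p /mnt/esdata/data /mnt/esdata/logs
#   sudo chown -R elasticsearch:elasticsearch /mnt/esdata
#   es_path_settings /mnt/esdata | sudo tee -a /etc/elasticsearch/elasticsearch.yml
```

Restart the node afterwards so the new paths are picked up.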

Log management

Speaking of Elasticsearch logs, you should track them. If anything goes south, you should know where to look and that place should be nearby.

Network access

Our Elasticsearch cluster is inaccessible from the outside, and that's good. We don't want people to have access to our data, even if it's read-only. But how do you access it from your applications? You have two options:

Deploy your application to Azure. You can create a new Cloud Service and have it run on the same Virtual Network, so the 9200 HTTP port (as well as the 9300 port if your app is in Java) are both open and accessible to you. That is the best approach. Unfortunately, Azure Websites do not currently support running on a Virtual Network, meaning you can only deploy websites as a Cloud Service if you want them to access the cluster this way.
Alternatively, if you need to access the cluster from elsewhere (or you have to use Azure Websites and cannot convert them to a Cloud Service), you can open the 9200 port but you should protect it with authentication (See https://github.com/Asquera/elasticsearch-http-basic and https://github.com/sonian/elasticsearch-jetty).

Use a good monitoring tool

See BigDesk, ElasticHQ and Marvel (first 2 are free, latter needs to be purchased). Either way, unless you use HTTP authentication you will not be able to install them as a site-plugin, but will have to deploy them via a secure host deployed to Azure as a Cloud Service.

Use certificates for logging in to the VMs

Instead of typing a username / password combination to log into your VMs, you can upload a certificate and authenticate using a private key. This tends to be more trouble to do on Windows, but it is indeed possible to do. You can also generate a certificate from your OpenSSH keypair (which is what you probably use for git anyway). See here for more details: http://azure.microsoft.com/en-us/documentation/articles/linux-use-ssh-key/.

Using the Azure plugin

Instead of specifying the IPs of the nodes yourself in the elasticsearch.yml config file (aka using Unicast), you can use the official Elasticsearch Azure plugin to figure out what nodes are available for you. I preferred not using it because for small, stable clusters this adds friction I'd rather not have, but for large, vibrant clusters this will probably help a lot.

For more information on that plugin, see https://github.com/elasticsearch/elasticsearch-cloud-azure and http://www.elasticsearch.org/blog/azure-cloud-plugin-for-elasticsearch/.

Another benefit of the plugin (which you can leverage separately, even if you don't enable Azure discovery) is enabling snapshot/restore for Azure Storage.

Don't use a load balancer

It's tempting to put a load balancer in front of your cluster, but it really doesn't make a lot of sense in the case of Elasticsearch. Most clients (especially the official ones) will do round-robin and fail-over on their own and in addition implement sniffing functionality, to detect new nodes when they join the cluster. Your load balancer (or Azure's) can only be so smart. Elasticsearch clients will be smarter, as long as you make sure to use an official and recent version of a client (or one that was implemented with this in mind).
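To make the fail-over idea concrete, here is a deliberately naive sketch of what a client does when a node is down: try the next one. try_nodes is a hypothetical helper, and it only covers fail-over; the official clients also round-robin between nodes and sniff the cluster for new ones, which this does not attempt.

```shell
# Hypothetical helper: send a GET to the first node that answers.
# Usage: try_nodes <path> <host> [<host> ...]
try_nodes() {
  path=$1
  shift
  for host in "$@"; do
    # Give each node 2 seconds before moving on to the next one.
    if curl -s -f --max-time 2 "http://$host:9200$path"; then
      return 0
    fi
  done
  return 1   # no node answered
}

# Example, from a machine on the same Virtual Network:
#   try_nodes /_cluster/health 10.0.0.4 10.0.0.5 10.0.0.6
```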

Automatic provisioning

The manual provisioning steps outlined above can be automated pretty easily; I just haven't had the opportunity to try that out. I believe using a Puppet Master or similar, or automatically running a custom provisioning script downloaded from storage, would do the job. Either way, treat this guide as boilerplate, and just use whatever works for you.

Good post Itamar, thank you. In regards to web site, there is another option - Azure Hybrid Connection. That way your ES is not exposed, but Azure Website can talk to it.

Sean Feldman July 8th, 2014

Itamar, This is very interesting - thanks for this great step by step. I look forward to trying it out soon on Extra Small or Small VM.

SeanFeldman, is there a step by step on how to set up ES on Azure Hybrid Connection? Azure Hybrid Connection says that it's for accessing on-premise resources whereas Azure websites lives in Azure... could you clarify?

Julie Setbon July 9th, 2014
Julie,

Yes you can. You'd run ES on a vm that doesn't expose ports 9200 and 9300 as endpoints. HC works on premises, so think of your VN on Azure VNet as on premises :) Ping me at feldman.sean (att) gmail

Sean Feldman July 10th, 2014

Now there is another option for the Azure Websites and connecting to the ES cluster. Now you can put the Azure Website on the same Virtual Network. For now it is only available in the Preview Portal. It is a much cleaner solution than the other options in my opinion.

Jeff September 25th, 2014
