The definitive guide for Elasticsearch on Windows Azure
Running Elasticsearch in the cloud with Azure is fairly easy. There are several guides out there covering how to do this, but my way of doing it is a bit different, so I thought it'd be worth blogging about. I've also included discussions of some important related topics at the end.
To highlight the main differences from other guides out there, with my approach:
- I prefer running Elasticsearch on Linux VMs as opposed to Windows machines. This is simply because once you have an Elasticsearch data node, that's going to be its sole purpose (at least, that's what I'd recommend), and Linux is so much easier to set up and maintain as a remote box. Even if you're new to this, learning the ropes is not that hard, and most of what you'll need is detailed below.
- Even on a Linux VM, I will never run Elasticsearch on top of OpenJDK - it has too many known problems.
- Since this is Azure, you'll most likely be doing this from a Windows machine, so I'll be using tools that are available for Windows users.
- I will not set up a load balancer in front of the Elasticsearch cluster, because it's mostly redundant if you are using an official Elasticsearch client, which will do sniffing and round-robin on its own - so no need to introduce unnecessary friction.
With that in mind, let's start rolling. Log in to your Windows Azure account, and head to the Management Portal.
Creating the infrastructure
Let's start by creating the infrastructure. You may need to wait after some steps for the resource you just created to be provisioned and started.
- You start by creating a new Virtual Network. This will create a subnet where our Elasticsearch nodes can properly communicate with each other. As with all things in Azure, you click on the big "+ NEW" button at the bottom, and make sure to specify a proper Location.
- Now create a new Cloud Service for the Elasticsearch cluster to run on. This will be the "host" for all our VMs, which we will create shortly - again, make sure the Region is appropriate and matches the location of the Virtual Network you just created.
- We can now start creating Virtual Machines to work with. Go to New -> Compute -> Virtual Machine, and opt for creating new VMs from Gallery. In the window that opens, select the latest Ubuntu version (14.04 LTS at the time of this writing), and click next.
- There are now two screens which will need your attention. In the second screen, "Virtual machine configuration", make sure to give the VM a good name and set a proper size. If this is for a production environment, I'd strongly suggest at least 28GB of memory, unless you are certain you can work with smaller sizes. Then, on Authentication, uncheck "Upload ssh key" and check "Provide a password instead". This will allow you to SSH into the machine with a username/password combination. More on uploading certificates later.
- In the next screen you will be asked to attach the VM to a Cloud Service and a Virtual Network. Select the ones we have created ("my-elasticsearch" Cloud Service and "my-elasticsearch-app" Virtual Network in this case), and accept the rest of the defaults.
- Click next and finish, and then repeat the VM-creation process to create as many VMs as you need. A recommended setup for a basic installation is 3 nodes (meaning, 3 servers running Elasticsearch) of the same size, to allow for a replication factor of two.
Preparing the Linux VMs
After the VMs have been created and initialized, you can click each one in the portal to get to its Dashboard and find its "SSH details". This will be a domain and a port.
Download PuTTY.exe and, for each VM you've created, follow the steps below. This is where we prepare the machine with the Oracle JVM and Elasticsearch, before we bring up the cluster.
- With PuTTY, log in to the machine using the username and password you set up for it when creating it.
- Once logged in, execute the following commands:
This one-liner will install Oracle JDK (mostly) silently, and set JAVA_HOME accordingly. For more details on what it's doing, see http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html:
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections && \
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections && \
sudo add-apt-repository -y ppa:webupd8team/java && \
sudo apt-get update && \
sudo apt-get install -y oracle-java7-installer oracle-java7-set-default
If you wish to verify the JVM version, you can now run java -version.
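The output should name Oracle's runtime rather than OpenJDK - something along these lines (the exact build numbers here are illustrative and will differ):

java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

If you see "OpenJDK" in there instead, the Oracle JDK didn't take over as the default.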
Next, download and install the latest Elasticsearch (1.2.1 at the time of writing) using the official Debian package, and set it up to run on boot:
curl -s https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.deb -o elasticsearch-1.2.1.deb && \
sudo dpkg -i elasticsearch-1.2.1.deb && \
sudo update-rc.d elasticsearch defaults 95 10
Configuring Elasticsearch
We now have multiple VMs with Elasticsearch properly installed and ready to run. Before actually running the cluster, there are a few configuration changes we need to make.
First, we need to tell Elasticsearch how much memory it can consume. The standard recommendation is 50% of the available memory: with a 28GB instance, we give Elasticsearch 14GB. We do this by running:
sudo nano /etc/init.d/elasticsearch
Then uncomment the ES_HEAP_SIZE=2g line (by removing the #) and change it to 50% of the available memory (e.g. ES_HEAP_SIZE=14g). To save and quit, hit Ctrl-X, y, and then Enter.
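If you'd rather script this than edit the file by hand, a sed one-liner along these lines should do it - assuming the commented-out line in the init script reads exactly #ES_HEAP_SIZE=2g, as it did at the time of writing:

sudo sed -i 's/^#ES_HEAP_SIZE=2g/ES_HEAP_SIZE=14g/' /etc/init.d/elasticsearch  # set to 50% of this VM's memory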
The last remaining bit is the Elasticsearch configuration file. It has a lot of settings in it, with a lot of explanations (which you should read!), but on our VM we don't really need all that - we just want the settings relevant to us. So let's delete the file and edit a clean one:
sudo rm /etc/elasticsearch/elasticsearch.yml && sudo nano /etc/elasticsearch/elasticsearch.yml
Paste the following into the file, changing the defaults as per comments below:
cluster.name: my-cluster
node.name: "my-node"
index.number_of_shards: 1
index.number_of_replicas: 2
bootstrap.mlockall: true
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
To elaborate on what we have done here:
- cluster.name is the name of the cluster. It has to be unique per cluster, and all nodes within the same cluster have to share the exact same name.
- node.name is just a convenience option; I usually set it to the VM name.
- index.number_of_shards and index.number_of_replicas are just default configurations you can override anytime via the index settings API. These defaults are good for a 3-node cluster.
- bootstrap.mlockall: true ensures the memory Elasticsearch gets is locked and never swapped out. This must be there.
- Set gateway.expected_nodes to the number of nodes in your cluster, and set discovery.zen.minimum_master_nodes to a majority of your nodes (floor(number of nodes / 2) + 1, so 2 for our 3-node cluster) to avoid split-brain scenarios. Multicast doesn't work on Azure's network, so disable it.
- discovery.zen.ping.unicast.hosts should contain the IPs of the VMs that you want to participate in the cluster. These need to be the internal IPs on the subnet; you can get each one from the VM's dashboard, right by the SSH Details (or read it straight from the VM, as shown below).
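If you're already SSH'd into the VMs, you can read the internal IP off each box instead of collecting them from the portal:

ip addr show eth0 | grep "inet "  # the 10.x.x.x address is the subnet IP you want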
Once done, hit Ctrl-X, y, Enter again to save and quit. You can now run Elasticsearch on this VM by typing:
sudo /etc/init.d/elasticsearch start
After the few seconds it takes to initialize, you can verify it is running by pinging it over HTTP:
curl http://localhost:9200
You should get Elasticsearch's Hello World, spoken in JSON.
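It looks roughly like this (the name and build details will vary per node and version):

{
  "status" : 200,
  "name" : "es-vm-ub1",
  "version" : {
    "number" : "1.2.1",
    "build_snapshot" : false,
    "lucene_version" : "4.8"
  },
  "tagline" : "You Know, for Search"
}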
Managing the node
Elasticsearch now runs as a service, and writes logs to /var/log/elasticsearch. You should consult them regularly to check for errors and such (or whenever something goes wrong).
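The main log file is named after your cluster, so with the configuration above you can follow it live with:

sudo tail -f /var/log/elasticsearch/my-cluster.log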
To check the status of a node, simply type:
sudo /etc/init.d/elasticsearch status
And to stop a node:
sudo /etc/init.d/elasticsearch stop
To verify our cluster is all up and running, log in to one of the machines and execute curl "http://localhost:9200/_cluster/state?pretty". You should see a response similar to this:
{
"cluster_name" : "my-cluster",
"version" : 5,
"master_node" : "Ufyhfc5-RyiK16EJie9lUg",
"blocks" : { },
"nodes" : {
"Ufyhfc5-RyiK16EJie9lUg" : {
"name" : "es-vm-ub2",
"transport_address" : "inet[/10.0.0.6:9300]",
"attributes" : { }
},
"uPXj-Gz1RpeEQkCNy7qiYg" : {
"name" : "es-vbm-ub1",
"transport_address" : "inet[/10.0.0.4:9300]",
"attributes" : { }
},
"oEpz-mFNSiSr9K_T8PdCFg" : {
"name" : "es-vm-ub3",
"transport_address" : "inet[/10.0.0.5:9300]",
"attributes" : { }
}
},
"metadata" : {
"templates" : { },
"indices" : { }
},
"routing_table" : {
"indices" : { }
},
"routing_nodes" : {
"unassigned" : [ ],
"nodes" : {
"Ufyhfc5-RyiK16EJie9lUg" : [ ],
"oEpz-mFNSiSr9K_T8PdCFg" : [ ],
"uPXj-Gz1RpeEQkCNy7qiYg" : [ ]
}
},
"allocations" : [ ]
}
This will show you the cluster state, and you should be able to tell if you are missing nodes. More on this here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-state.html.
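For a quicker day-to-day check, the cluster health API boils this down to a status color and a node count:

curl "http://localhost:9200/_cluster/health?pretty"

A "green" status with number_of_nodes at 3 means all three nodes found each other.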
Some advice
Congratulations, you now have your Elasticsearch cluster running on Azure. Here are a few more points for your consideration.
Use separate disks
In this guide we opted to use the VM's own storage for logs and data.
Instead, you should mount dedicated drives for the log and data files: attach one new disk to every VM and have Elasticsearch use it as its data storage. This way you can easily upgrade your Elasticsearch VMs on the go, destroying machines without copying files around, and it's also good practice for data redundancy. I'd consider one shared disk for logs (with separate folders, one per VM) a good practice as well; you can then have one logstash instance running on top of that single disk.
Changing the data and log paths is fairly simple: set path.data and path.logs in elasticsearch.yml.
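As a sketch, assuming you attached and mounted a data disk at /datadisk (the mount point is your choice), you'd prepare directories for the elasticsearch user the Debian package runs under, and point the config at them:

sudo mkdir -p /datadisk/elasticsearch/data /datadisk/elasticsearch/logs
sudo chown -R elasticsearch:elasticsearch /datadisk/elasticsearch

And in /etc/elasticsearch/elasticsearch.yml:

path.data: /datadisk/elasticsearch/data
path.logs: /datadisk/elasticsearch/logs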
Log management
Speaking of Elasticsearch logs, you should track them. If anything goes south, you should know where to look and that place should be nearby.
Network access
Our Elasticsearch cluster is inaccessible from the outside, and that's good. We don't want people to have access to our data, even if it's read-only. But how do you access it from your applications? You have two options:
- Deploy your application to Azure. You can create a new Cloud Service and have it run on the same Virtual Network, so the 9200 HTTP port (as well as the 9300 port, if your app is in Java) is open and accessible to you. That is the best approach. Unfortunately, Azure Websites do not currently support running on a Virtual Network, meaning you can only deploy websites as a Cloud Service if you want them to access the cluster this way.
- Alternatively, if you need to access the cluster from elsewhere (or you have to use Azure Websites and cannot convert them to a Cloud Service), you can open the 9200 port, but you should protect it with authentication (see https://github.com/Asquera/elasticsearch-http-basic and https://github.com/sonian/elasticsearch-jetty; a configuration sketch follows below).
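With elasticsearch-http-basic, for instance, the setup is a couple of settings in elasticsearch.yml; the setting names below are taken from the plugin's README at the time of writing, so verify them against the version you install:

http.basic.enabled: true
http.basic.user: "my_user"          # pick your own credentials, obviously
http.basic.password: "my_password"  # this protects port 9200 only, not the 9300 transport port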
Use a good monitoring tool
See BigDesk, ElasticHQ and Marvel (the first two are free; the latter needs to be purchased). Either way, unless you use HTTP authentication you will not be able to install them as site plugins; you will have to deploy them via a secure host deployed to Azure as a Cloud Service.
Use certificates for logging in to the VMs
Instead of typing a username/password combination to log into your VMs, you can upload a certificate and authenticate using a private key. This tends to be more trouble on Windows, but it is indeed possible. You can also generate a certificate from your OpenSSH keypair (which is probably what you use for git anyway). See here for more details: http://azure.microsoft.com/en-us/documentation/articles/linux-use-ssh-key/.
Using the Azure plugin
Instead of specifying the IPs of the nodes yourself in the elasticsearch.yml config file (i.e. using unicast), you can use the official Elasticsearch Azure plugin to figure out what nodes are available. I preferred not to use it because for small, static clusters it adds friction I'd rather not have, but for large, frequently changing clusters it will probably help a lot.
For more information on that plugin, see https://github.com/elasticsearch/elasticsearch-cloud-azure and http://www.elasticsearch.org/blog/azure-cloud-plugin-for-elasticsearch/.
Another benefit of the plugin (which you can leverage separately, even if you don't use its Azure discovery) is enabling snapshot/restore to Azure Storage.
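To give you an idea, with the cloud-azure plugin installed you put your storage credentials in elasticsearch.yml and register a snapshot repository of type "azure". The setting names below are from the plugin's documentation for the 1.x line - double-check them against your plugin version:

cloud:
    azure:
        storage_account: your_storage_account_name
        storage_key: your_storage_account_key

curl -XPUT "http://localhost:9200/_snapshot/my_backup" -d '{ "type": "azure" }'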
Don't use a load balancer
It's tempting to put a load balancer in front of your cluster, but it really doesn't make much sense in the case of Elasticsearch. Most clients (especially the official ones) will do round-robin and fail-over on their own, and in addition implement sniffing, to detect new nodes when they join the cluster. Your load balancer (or Azure's) can only be so smart; Elasticsearch clients will be smarter, as long as you make sure to use an official, recent client (or one that was implemented with this in mind).
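If you want to see the raw material a sniffing client works from, ask any node for the cluster's HTTP endpoints - the client periodically refreshes this list and round-robins requests across it:

curl "http://localhost:9200/_nodes/http?pretty"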
Automatic provisioning
The manual provisioning steps outlined above can be automated pretty easily; I just haven't had the opportunity to try that out yet. I'd look at a Puppet master or similar, or at automatically running a custom provisioning script downloaded from storage. Either way, treat this guide as boilerplate, and use whatever works for you.
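As a starting point, the manual steps above collapse into a short script. This is just the commands from this guide strung together (the heap-size sed carries the same assumption as earlier, and you'd still drop in a per-node elasticsearch.yml):

#!/bin/bash
# provision-es.sh - boilerplate provisioning for a single Elasticsearch node
set -e

# Oracle JDK, accepting the license up front so the install is unattended
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get install -y oracle-java7-installer oracle-java7-set-default

# Elasticsearch from the official Debian package, registered to start on boot
curl -s https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.1.deb -o elasticsearch-1.2.1.deb
sudo dpkg -i elasticsearch-1.2.1.deb
sudo update-rc.d elasticsearch defaults 95 10

# Heap size: 50% of RAM (14g on a 28GB VM - adjust to your size)
sudo sed -i 's/^#ES_HEAP_SIZE=2g/ES_HEAP_SIZE=14g/' /etc/init.d/elasticsearch

# Drop in this node's elasticsearch.yml (e.g. downloaded from storage alongside this script)
# sudo cp elasticsearch.yml /etc/elasticsearch/elasticsearch.yml

sudo /etc/init.d/elasticsearch start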