Using Elasticsearch as a storage backend for git repositories
A few days ago, while in Paris, I had a late-night beer & code with Emeric Fermas, the guardian angel of libgit2 and libgit2sharp. If you don't know those projects, look them up - libgit2 is basically a rewrite of Linus's hacks, also known as git, as a reusable library, and libgit2sharp is its .NET binding.
Anyway, as Em is an extremely passionate and talented developer and also very knowledgeable about git, we just ended up sitting down to put together a module that uses Elasticsearch as a backend to git.
Why would I want that, you ask? Well, I can give you many reasons, but here are two great use cases:
Roll-your-own highly available git storage
GitHub store their hosted git repositories on disk, and they replicate each one across multiple servers. I'd assume they have some complicated logic behind that, probably using home-brewed tools. This approach has its limits - for example, huge repositories (grown over time, or with many large blobs) are probably going to be hard to replicate, and making sure the cluster stays evenly balanced and properly utilized is a big challenge. I've been there myself.
Because Elasticsearch is great at distributing data, this backend implementation can pretty much do all that for free. You can now have Elasticsearch manage the balancing of your cluster (using the disk-based allocation decider, for example, or any other type of decider), easily replicate git repositories, and even shard large repositories. This works because git blobs are now pretty much just Elasticsearch documents.
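To make the "blobs are just documents" point a bit more concrete, here is a minimal Python sketch (the actual module is written in C#). The object ID calculation is the way git itself hashes a blob; everything else, including the `to_es_document` helper, the field names, and routing by repository name, is an assumption for illustration and not the module's real mapping.

```python
import base64
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # git hashes "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def to_es_document(repo: str, content: bytes) -> dict:
    """Describe one git blob as a document (hypothetical layout)."""
    return {
        "id": git_blob_id(content),  # content-addressed: same bytes, same ID
        "routing": repo,             # assumption: keep a repository's objects together
        "source": {
            "repo": repo,
            "type": "blob",
            "size": len(content),
            # Blobs are arbitrary bytes, so they are base64-encoded for JSON.
            "data": base64.b64encode(content).decode("ascii"),
        },
    }


if __name__ == "__main__":
    doc = to_es_document("demo-repo", b"hello from git\n")
    print(doc["id"], doc["source"]["size"])
```

Because the ID is derived from the content, indexing the same blob twice simply overwrites an identical document, which fits git's own content-addressed object model nicely.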
Leveraging git from an Elasticsearch instance acting as a document store
Elasticsearch is great at storing documents, searching on them, and scaling out large data sets. Documents are things that may change often. Git is great at comparing things, and keeping track of changes. You see where I'm going with this?
The idea is very simple. Instead of doing ninja moves to compare different versions of a document, or storing an audit trail manually, you could now just use Elasticsearch to do this for you. You "commit" documents to Elasticsearch, which then knows the current version of the document. But it can also remember the previous version, and the one before that. And other branches. And it could merge them. Or just diff.
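As a rough illustration of the "commit, remember, diff" idea, here is a small Python sketch. It is deliberately simplified: an in-memory dictionary stands in for the Elasticsearch index, and versions are kept as a plain list per document rather than as real git commits and trees. The `VersionedStore` class and its method names are hypothetical and are not part of the module.

```python
import difflib


class VersionedStore:
    """In-memory stand-in for an index that keeps every committed version."""

    def __init__(self):
        # Maps a document ID to the list of its versions, oldest first.
        self._history = {}

    def commit(self, doc_id, text):
        """Store a new version and return its version number (0-based)."""
        versions = self._history.setdefault(doc_id, [])
        versions.append(text)
        return len(versions) - 1

    def get(self, doc_id, version=-1):
        """Return a given version; the latest one by default."""
        return self._history[doc_id][version]

    def diff(self, doc_id, old, new):
        """Return a unified diff between two versions as a list of lines."""
        a = self._history[doc_id][old].splitlines()
        b = self._history[doc_id][new].splitlines()
        return list(
            difflib.unified_diff(
                a, b,
                fromfile=f"{doc_id}@{old}",
                tofile=f"{doc_id}@{new}",
                lineterm="",
            )
        )


if __name__ == "__main__":
    store = VersionedStore()
    v0 = store.commit("wiki/home", "Welcome\nDraft text")
    v1 = store.commit("wiki/home", "Welcome\nFinal text")
    print("\n".join(store.diff("wiki/home", v0, v1)))
```

With a real git backend the history, branching, and merging would come from git itself, so none of this bookkeeping would need to be written by hand.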
Just imagine the possibilities.
What's next
This is just a PoC. It works, it's tested, but it's not 100% usable just yet.
I have more coming. First I'm going to stabilize this and make sure I put all the tweaks I can into it. Next up I want to enable search on those git repositories hosted on Elasticsearch, and then showcase some really cool use cases like building a wiki and an audit trail for documents, all using native Elasticsearch.
Until then, the code is available here for you to play with: https://github.com/synhershko/libgit2sharp.Elasticsearch
It's implemented in C# using libgit2sharp, but it's fairly easy to port it to Ruby or any other language with proper bindings to libgit2.