Orev: The Apache OpenRelevance Viewer

English posts, HebMorph, Lucene, Lucene.Net, OpenRelevance, Orev


It has been quite some time since I said I'd be working on this, as I got caught up in other pressing matters and had to drop it for a while. But it was all for the best: the technology I used for this new version is a perfect fit for this application, and it wasn't available back then. I'll be addressing the technical aspects later in this post, and in some follow-up posts as well.

My first interest in the OpenRelevance project, and one of the main reasons I created Orev, was the HebMorph project. Using Orev, I'm hoping to create an environment where Hebrew IR tools can be tested and compared, with the goal of producing the ultimate Hebrew analyzer for Lucene and other libraries as well.

Before anything else, the complete source code is available at https://github.com/synhershko/Orev.

I also have a hosted version, and I will publish a link to it soon, once I get a few things sorted out and gather some feedback from other people who were involved in this project.

What is this?

The OpenRelevance project is an Apache project aimed at producing materials for relevance testing in information retrieval (IR), machine learning, and natural language processing (NLP). Think TREC, but open-source.

These materials require a lot of management work and many human hours: collecting corpora and topics, and then judging them. Without going into too much detail about the actual process, it essentially means crowd-sourcing a great deal of work, and even that assumes the OpenRelevance project has proper tools to offer the people recruited for it.

Since no such tool existed, the Viewer - Orev - is meant to be exactly that, and so to minimize the overhead required from both the project managers and the people doing the actual work. By providing simple facilities for adding new topics and corpora, and for feeding documents into a corpus, it makes the surrounding infrastructure easy to manage. And with a clean web UI for judging documents, the recruits' work becomes very easy to pick up.

More technical details

Orev is multi-lingual from the ground up, and is heavily user-based. Every user can view the available topics and corpora, and make judgments in the languages they speak. Managers can add new topics, create new corpora and feed them with documents. Documents can also be added to a corpus, or updated, at a later time. We will probably let users submit topics as well, and so on.
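To make that a bit more concrete, here is a minimal sketch of what the underlying documents could look like. The class and property names are my own illustration for this post, not Orev's actual schema:

```csharp
// Illustrative POCOs only - not Orev's actual classes or property names.
public class Topic
{
    public string Id { get; set; }           // assigned by the document store
    public string Language { get; set; }     // e.g. "he", "en"
    public string Title { get; set; }
    public string Description { get; set; }
}

public class Corpus
{
    public string Id { get; set; }
    public string Language { get; set; }
    public string Name { get; set; }
}

public class CorpusDocument
{
    public string Id { get; set; }
    public string CorpusId { get; set; }     // the corpus this document belongs to
    public string Title { get; set; }
    public string Content { get; set; }      // full text, stored in the DB itself
}

public class Judgment
{
    public string Id { get; set; }
    public string UserId { get; set; }
    public string TopicId { get; set; }
    public string DocumentId { get; set; }
    public bool IsRelevant { get; set; }     // binary relevance, for simplicity
}
```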

Even more technical details

When I started working on this I was using NHibernate, and I spent some time designing a DB schema, fighting with ASP.NET MVC and all that. Now that MVC 3 is out and RavenDB is rocking worlds, getting this all started again from scratch was a matter of a few hours. Using a schema-less DB is what made it possible to do in so few hours, aside from some dilemmas and frustrations which I will be blogging about soon.

In the original design I intended to load corpus documents from external sources, or to store them on the file-system. Now that Orev uses RavenDB, which is a document-oriented database, storing the documents in the DB itself actually makes sense. This is also what lets us update a corpus with new documents later, or patch existing ones.
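For the curious, this is roughly what feeding a document into a corpus looks like with the RavenDB .NET client. The entity type is the illustrative CorpusDocument from the sketch above, and the server URL is just a local example, not how Orev is actually wired up:

```csharp
using Raven.Client;
using Raven.Client.Document;

public static class CorpusFeeder
{
    // Rough sketch only; in a real application the DocumentStore is created
    // once and shared, not opened per call.
    public static void AddDocument(string corpusId, string title, string content)
    {
        using (IDocumentStore store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        using (IDocumentSession session = store.OpenSession())
        {
            var doc = new CorpusDocument
            {
                CorpusId = corpusId,
                Title = title,
                Content = content        // the full text lives in the DB itself
            };

            session.Store(doc);          // RavenDB assigns the document id automatically
            session.SaveChanges();       // a single round-trip persists the unit of work
        }
    }
}
```

Updating a document later is then just a matter of loading it in a session, changing it, and calling SaveChanges again.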

What's next

We need to run a lot of tests, get a lot of feedback and improve accordingly. The first step is obviously gathering content and raising interest, so if you find this post / project interesting - please spread the word.

Orev currently uses the default ASP.NET MVC theme. If there is an HTML5/CSS designer and magic worker out there who can take on the task of making it more inviting and easier to work with - that is something we can definitely use.

I have enabled the GitHub issue tracker on the Orev source repository. Please use it to report bugs or request features.

When the dust settles and actual judging commences on a regular basis, we will start working on code to output stats and statistical computations, in preparation for the original goal of the OpenRelevance project: measuring the performance of IR software (plus NLP and ML, of course), and being able to produce bleeding-edge analyzers for various languages.
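To give a taste of the kind of computation I have in mind, here is a small sketch of precision-at-k for a single topic, computed from a ranked result list and the set of document ids judged relevant. This is not Orev code yet, just an illustration of where the judgments would feed in:

```csharp
using System;
using System.Collections.Generic;

public static class RelevanceMetrics
{
    // Precision@k: of the top k results a system returned for a topic,
    // what fraction were judged relevant?
    public static double PrecisionAtK(IList<string> rankedResultIds,
                                      ISet<string> judgedRelevantIds,
                                      int k)
    {
        if (k <= 0)
            return 0.0;

        int cutoff = Math.Min(k, rankedResultIds.Count);
        int relevantFound = 0;

        for (int i = 0; i < cutoff; i++)
        {
            if (judgedRelevantIds.Contains(rankedResultIds[i]))
                relevantFound++;
        }

        // Divide by k (not cutoff) so systems returning fewer than k results
        // are not rewarded for it.
        return (double)relevantFound / k;
    }
}
```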

