Hacking with RavenDB's multi-maps

A couple of months ago I blogged about Orev - OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox that let us measure and compare the relevance of different full-text search methods.

In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.

Orev was built using RavenDB as its backing store, and in this post I'm going to show a nice approach we used to facilitate the judging process.

The Model

To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.

A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for several reasons: they are different transactional units (if I fix a typo in a document, the Corpus itself doesn't really change), and we want to retrieve one single document at a time when judging. Also, embedding all documents within a parent Corpus document would bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.

And this is how they look:

    public class Corpus
    {
        public string Id { get; set; }

        [Required]
        public string Name { get; set; }

        [Required]
        public string Description { get; set; }

        [Required]
        [StringLength(5, MinimumLength = 5)]
        public string Language { get; set; } // a language identifier string, en-US for example
    }

    public class CorpusDocument
    {
        public string Id { get; set; }

        public string CorpusId { get; set; }
        public string Title { get; set; }
        public string Content { get; set; }
        public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
    }

    public class Topic
    {
        public string Id { get; set; }

        [Required]
        public string Title { get; set; }

        [Required]
        [DataType(DataType.MultilineText)]
        public string Description { get; set; }

        [Required]
        [DataType(DataType.MultilineText)]
        public string Narrator { get; set; }

        [Required]
        [StringLength(5, MinimumLength = 5)]
        public string Language { get; set; } // a language identifier string, en-US for example

        /// <summary>
        /// Id of user submitting this topic
        /// </summary>
        public string UserId { get; set; }
    }

    public class Judgment
    {
        public enum Verdict
        {
            Relevant,
            NotRelevant,
            Skip,
        };

        [Required]
        public string CorpusId { get; set; }

        [Required]
        public string DocumentId { get; set; }

        [Required]
        public string TopicId { get; set; }

        [Required]
        public string UserId { get; set; }

        [Required]
        public Verdict UserJudgement { get; set; }
    }

The Problem

We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.

Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.

The Solution

At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?

And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look at all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...

So this is where RavenDB's multi-maps come in. I select all Judgments, each with its Topic ID wrapped in an array, and all CorpusDocuments, each with an array holding a single empty string. This results in one big set of rows, with each row containing the CorpusDocument identifier (the document ID plus the Corpus ID) and at most one Topic ID for which a Judgment exists. The reason I'm selecting Topics as an array at this stage is to match the format the Reduce step produces; RavenDB requires all Map and Reduce functions to have the same output shape.

It is important to note that ALL corpus documents will be listed, including those with no judged topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.
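To make the shape of those rows concrete, here is a minimal in-memory sketch of the two maps using plain LINQ to Objects rather than RavenDB itself. The sample IDs and the tuple shapes are made up for illustration, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;

// Hypothetical sample data; these IDs are invented for the example.
var corpusDocs = new[]
{
    (Id: "corpusdocuments/1", CorpusId: "corpus/1"),
    (Id: "corpusdocuments/2", CorpusId: "corpus/1"),
};
var judgments = new[]
{
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", TopicId: "topics/1"),
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", TopicId: "topics/2"),
};

// First map: every CorpusDocument yields one row with an empty-string placeholder topic.
var docRows = corpusDocs.Select(d => (DocumentId: d.Id, d.CorpusId, Topics: new[] { string.Empty }));

// Second map: every Judgment yields one row carrying its Topic ID.
var judgmentRows = judgments.Select(j => (j.DocumentId, j.CorpusId, Topics: new[] { j.TopicId }));

// Both maps produce the same shape, so their output can be treated as one set of rows.
var mapRows = docRows.Concat(judgmentRows).ToList();

foreach (var row in mapRows)
    Console.WriteLine($"{row.DocumentId} | {row.CorpusId} | [{string.Join(", ", row.Topics)}]");
```

Document 2 was never judged, yet it still shows up thanks to its placeholder row, which is what lets us find it later.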

The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows by the CorpusDocument identifier (which includes the Corpus ID), we get a smaller set of rows, where each CorpusDocument is represented only once, along with all the Topic IDs there are Judgments for. So, where we previously had a lot of rows, each with one CorpusDocument identifier and one Topic identifier, we have now consolidated all the data for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.
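The grouping can be sketched in memory as well. This is only a LINQ to Objects illustration of the idea, with hypothetical rows standing in for the map output; it is not how RavenDB executes the reduce internally.

```csharp
using System;
using System.Linq;

// Hypothetical map output: document 1 has its placeholder row plus two judgment rows,
// document 2 has only its placeholder row.
var mapRows = new[]
{
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", Topics: new[] { "" }),
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", Topics: new[] { "topics/1" }),
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", Topics: new[] { "topics/2" }),
    (DocumentId: "corpusdocuments/2", CorpusId: "corpus/1", Topics: new[] { "" }),
};

// Group by the CorpusDocument identifier and merge all topic arrays into one.
var reduced = mapRows
    .GroupBy(r => (r.DocumentId, r.CorpusId))
    .Select(g => (g.Key.DocumentId, g.Key.CorpusId,
                  Topics: g.SelectMany(r => r.Topics).Distinct().ToArray()))
    .ToList();

foreach (var row in reduced)
    Console.WriteLine($"{row.DocumentId} | {row.CorpusId} | {row.Topics.Length} topic entries");
```

Note that the empty-string placeholder survives the merge, so every reduced row has at least one entry in Topics.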

Hence, we write this index:

    public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
    {
        public class ReduceResult
        {
            public string DocumentId { get; set; }
            public string CorpusId { get; set; }
            public string[] Topics { get; set; }
        }

        public CorpusDocuments_ByNextUnrated()
        {
            AddMap<CorpusDocument>(docs => from corpusDoc in docs
                                           select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] {string.Empty} }
                                           );

            AddMap<Judgment>(judgments => from j in judgments
                                          select new { DocumentId = j.DocumentId, j.CorpusId, Topics = new[] { j.TopicId } });

            Reduce = results => from result in results
                                group result by new { result.DocumentId, result.CorpusId }
                                into g
                                select new
                                {
                                    DocumentId = g.Key.DocumentId,
                                    CorpusId = g.Key.CorpusId,
                                    Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
                                };

            TransformResults = (db, results) => from result in results
                                                let doc = db.Load<CorpusDocument>(result.DocumentId)
                                                select doc;
        }
    }

Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of Topics it has Judgments for. Every CorpusDocument appears in the index, even if it was never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:

            var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
                .Where("Topics:*") // match all docs
                .AndAlso()
                .WhereEquals("CorpusId", corpusId)
                .AndAlso()
                .Not
                .WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
                .RandomOrdering()
                .FirstOrDefault();

This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1

This query will first match all index entries from the given corpus, and then remove all CorpusDocuments which have the given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made on it for that Topic; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.
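The same filtering logic, expressed over an in-memory list, looks roughly like this. The index entries below are hypothetical, and this is LINQ to Objects rather than the Lucene query RavenDB actually runs; it only illustrates which documents remain after the exclusion.

```csharp
using System;
using System.Linq;

// Hypothetical reduced index entries.
var indexEntries = new[]
{
    (DocumentId: "corpusdocuments/1", CorpusId: "corpus/1", Topics: new[] { "", "topics/1" }),
    (DocumentId: "corpusdocuments/2", CorpusId: "corpus/1", Topics: new[] { "" }),
    (DocumentId: "corpusdocuments/3", CorpusId: "corpus/1", Topics: new[] { "", "topics/2" }),
    (DocumentId: "corpusdocuments/4", CorpusId: "corpus/2", Topics: new[] { "" }),
};

var corpusId = "corpus/1";
var topicId = "topics/1";

// Keep entries from the selected corpus, then drop those already judged for the topic.
var candidates = indexEntries
    .Where(e => e.CorpusId == corpusId)
    .Where(e => !e.Topics.Contains(topicId))
    .Select(e => e.DocumentId)
    .ToList();

Console.WriteLine(string.Join(", ", candidates));
```

Here document 1 is excluded because it was already judged for topics/1, document 4 belongs to another corpus, and documents 2 and 3 remain as candidates.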

And just to spice things up a bit, I threw in RandomOrdering().

Comments

  • Ayende Rahien

    You can optimize the query to be just:

    CorpusId:corpus/1 AND -Topics:topics/1
