Modelling hierarchical data with RavenDB

RavenDB, Software design, English posts Comments (1)

A very common requirement with any database engine is to be able to store and query hierarchical data. This usually comes up in the context of categories for products in an e-commerce website, in a site-structure context (this page needs to be a parent of that page) or when trying to model company hierarchies tree (for deciding on permissions, for example).

While RavenDB applications need to solve the same kind of problems, a good solution often times looks very different from what you'd expect.

The immediate solution seems quite obvious, especially if you've solved a similar problem before with other types of databases (relational, for example). A Category document (shown here as a POCO) will contain its parent ID, and using some advanced indexing techniques one can get standard hierarchical queries working:

public class Category
{
    public string Id { get; set; }
    public string ParentId { get; set; }
    public string Name { get; set; }
}

However, there's a significant indexing latency and performance hit involved using this technique. All-in-all, this is usually a method unfit for document databases, as it employs relational thinking and takes no advantage whatsoever of the features available provided by the document database.

Instead, try thinking of a hierarchy tree as an entity of its own.

With a document-database, hierarchies are usually best modeled in one document that defines the hierarchy. In our scenario of categories that would be to define the categories tree, where the categories themselves can be represented by standalone documents (and thus hold Name, Description etc, and allow for other collection to reference them), or not if you don't need them to exist separately (less common, and will somewhat bloat the tree document).

Modeled from code, a Category document would look something like this:

public class Category
{
    public string Id { get; set; }
    public string Name { get; set; }
    // other meta-data that you want to store per category, like image etc
}

And the hierarchy tree document can be serialized from a class like the following, where this class can have methods for making nodes in it easily accessible:

public class CategoriesHierarchyTree
{
    public class Node
    {
       public string CategoryId { get; set; }
       public List<Node> Children { get; set; }
    }

    public List<Node> RootCategories { get; set; }

    // various methods for looking up and updating tree structure
}

This approach of hierarchy-tree has several important advantages:

  1. One transactional scope - when the tree changes, the tree changes in one transaction, always. You cannot get affected by multiple concurrent changes to the tree since you can leverage optimistic concurrency when editing this one document. Using the approach you propose it is impossible to guarantee that therefore harder to guarantee the completeness and correctness of the hierarchy tree over time. If you think of a hierarchy as a tree, it actually makes a lot of sense to have each change lock the entire tree until it completes. The hierarchy tree is one entity.
  2. Caching - the entire hierarchy can be quickly and efficiently cached, even using aggressive caching which will minimize the times the server is accessed with queries on hierarchy.
  3. All operations are done entirely in-memory - since its one document, aka object, all queries on the hierarchy (whose the parent of, list of children etc) are made entirely in-memory and effectively cost close to nothing to perform. Using an index with Recurse() to answer such queries is order of magnitude costlier (network costs and computational). If performance is your biggest concern - this is definitely a winner.
  4. Multiple parents per category, no denormalization - if a category document is saved outside the hierarchy tree, like demonstrated above, you can effectively put a category under multiple parents without the need to denormalize. All category data is in one place, in a document outside of the tree, and the tree only holds a reference to the category.

I highly recommend going with this approach. It is a bit of a shift from the relational mindset, but its so worth it, even when the tree grows big.


Comments

Gareth Thackeray
Posted by Gareth Thackeray on

This is exactly what we did a few months ago - good to know we're thinking like the master ;-)

Additionally we write a put a QuickLookup property on the CategoryHierarchy class and use a DocumentStoreListener to traverse the whole tree and add detailed contextual information about each category in there so that you can find out a node's ancestral path with just a dictionary lookup.

E.g.

class NodeInfo{ public Node[] Path {get;set;} public Node Node {get;set;} }

class CategoryHierarchy { List RootCategories {get;set;} Dictionary<string, NodeInfo> QuickLookup {get;set;} }

As an added bonus you can also use the QuickLookup in Results Transformers and indexes.



Add a new comment
Name
E-mail
Website (optional)
Your comment
2 * 4 = ?