Open-source Hebrew information retrieval (HebMorph, part 3)

June 7th, 2010 .NET, English posts, HebMorph, IR, Lucene

11 min read

Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, I have pointed out that they are not necessarily providing the best results. Either way, there is no freely available solution allowing to index Hebrew even at the very basic level. HebMorph was started with this in mind. It is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals. During the work on this project, we will try and come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is providing various IR libraries with the best Hebrew IR capabilities possible. Apache Lucene has been selected to be our planning and testing framework. This is thanks to its advanced capabilities, flexibility, and the author's familiarity with it. During these initial steps, .NET code is being written and used with Lucene.Net (a .Net port of Java Lucene). Once the project stabilizes enough, ports to other languages will be followed. Following is the roadmap for this project. More general details, including licensing info, code repository location, mailing list and latest updates (including roadmap updates!), can be found at http://www.code972.com/blog/hebmorph/.

Phase 1 - Hebrew Morphological Analyzer for Lucene (done)

The first milestone in this project's agenda is having a proper Hebrew lemmatisation. To get this started I needed to have a way of "understanding" Hebrew words, so some level of disambiguation could be done. I preferred going with a dictionary, to avoid the hassle of finding a decent algorithmic lemmatizer for Hebrew. After some research, I decided to go with hspell, mainly due to its maturity and large vocabulary (~460k words in 2009's release). Also, hspell follows the spelling rules set by the Academy of the Hebrew Language; this is important since it can make the disambiguation process easier. Although hspell was originally meant to be used as a spell-checker, it contains enough linguistics data to be utilized as a morphological tool. But working with hspell wasn't an easy task. Many design choices were made in it that I had to either reverse or ignore for now (for example, the word type flags are auto-generated but then compiled to the application instead of residing side by side with the vocabulary files, or the linginfo part being loaded as a raw string into memory and its lookup method separate from the actual word-checking mechanism). I may be writing more on those later. I only cared about being able to read hspell's data into memory in a workable manner. So, instead of porting hspell's API, I implemented my own radix tree, hspell data files reader and words lookup. Words are looked up in 3 stages:

Exact - the word is being searched in the radix tree for an exact match.
Prefixes omitted - the Hebrew prefix letters M$H WKLB are being omitted one-by-one (while exist), and whenever a word is found its linguistic payload is checked against the prefix omitted. The word is only added to the list of candidates if the prefix is legal for the word.
Toleration - an optional tolerative lookup is being performed, where letters in the key are being skipped over or added based on a set of rules, so many spellings incompatible with the Academy's spelling rules can still be identified and accepted. For each valid Hebrew prefixes, the prefix letters are removed and the process is repeated. Words identified using tolerators are being given a lower rank, so exact matches will be marked as more probable. Unlike the original hspell's code, multiple toleration methods are supported in a single lookup operation.

In addition to that, the .Net code base also includes a Hebrew tokenizer and a (very basic) list of stop words. The tokenizer is able to tag tokens as Hebrew, NonHebrew, Numerics, Hebrew constructs (Smichut) and Acronyms - useful info which could be used also for disambiguation. The Lucene.Net integration provided at this stage is utilizing the Hebrew Tokenizer and morphological analyzer, and is already doing a good job in making Hebrew texts searchable. It also ignores Niqqud characters, and handles non-Hebrew words, numbers, and OOV cases. Test applications are also included, providing GUIs allowing to perform morphological analysis on texts, and to index files and execute Hebrew-enabled queries on them using Lucene.Net. Code in .Net 2.0 compatible (intentionally not 3+), and was compiled and tested on VS2005 and Mono.

Phase 2 - Improving and stabilizing the morphological analyzer

Before doing anything else, we need the morphological analyzer to be at its best. Therefore, the following tasks need to be completed:

Test hspell's coverage against Wikipedia and possibly other Hebrew corporas, and reporting missing words to the original hspell project.
Find the best way to handle OOV cases. Since hspell seem to recognize most verbs, perhaps we can assume all OOV cases to be nouns and strip affixes associated with nouns? Another option would be to analyze the tokens stream and look for similar occurrences, taking into account different spellings and missing prefixes.
Try to resolve ambiguities using a similar approach as above (by looking at the stream of tokens), or by using synonyms.
Decide on a maximum number of lemmas to index per term and a minimum probability ranking so the index stays relatively small and precision stays high.
Generate a more complete list of stop words, by looking at the highest frequency terms he-wiki that have little value.
Improve tolerance mechanism. While it is meant to prevent unnecessary OOV cases and to provide other likely meanings of a word, it can also be a source of many unwanted ambiguities, and result in adding many incorrect tokens to the index. Therefore:
- Only tolerate consonants used as vowels (primarily Waw and Yud), and non-doubled consonant Waw letters.
- Tweak tolerator functions to be less greedy, and avoid tolerations where they are less likely to be needed.
- Make sure this process only makes valid expansions, and never create spelling errors.
- While tolerating, never try and guess what the user was looking for, but rather what other probable meaning could this word have.
- Research at what is the best ranking formula to be used in tolerators, so decision making of the disambiguation process can be more effective.
Make the morphological analyzer more flexible, so it gives control over word lookup sensitivity (exact / with toleration / morphological / expanded).

Phase 3 - OpenRelevance

From the OpenRelevance site:

The Open Relevance Project (ORP) is a new Apache Lucene sub-project aimed at making materials for doing relevance testing for Information Retrieval (IR), Machine Learning and Natural Language Processing (NLP) into open source. Our initial focus is on creating collections, judgments, queries and tools for the Lucene ecosystem of projects (Lucene Java, Solr, Nutch, Mahout, etc.) that can be used to judge relevance in a free, repeatable manner.

This is actually a testbed for performing relevance testing, like done by TREC, CLEF and FIRE, but freely. Since there are no collections and judgments available for testing Hebrew relevance, we will have to create one before we can perform any such tests.

Phase 4 - Creating and comparing various Analyzers

As we have shown in the previous post, just because an Analyzer is sophisticated, doesn't convert to better relevance. At this phase, we will be testing a (hopefully) large variety of analyzers and indexing approaches, using the Hebrew OpenRelevance package created in the previous phase. Following is a list of possibilities and thoughts, just to give you an idea of what could work. The general assumption at this stage is that indexing units don't have to be meaningful. Once we actually get to this phase, I will post more details on each as we plan and write them.

Simple n-grams analyzer. 4/5-grams have shown to work pretty well for Arabic, and this may be the case for Hebrew as well. See http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf, mainly the section "Reasons for N-gram Effectiveness".
Other variations of n-grams, such as word-internal n-grams.
n-grams may not work as well for Hebrew as they do for Arabic, due to writing inconsistencies caused by Niqqud-less spelling (mainly in addition or omission of vowels), and more ambiguous prefixes. Therefore, several n-gram variations such as skipgrams, or some sort of vowels normalization, may work better than simple n-grams.
Full scale algorithmic stemmers, such as Segal’s publicly available morphological analyzer, or this one (seem to be used also by the Mila center). Ideally one without any dictionary files. This is probably going to yield similar results to a dictionary-based approach, but worths a try since the different error rates may result in different relevancy. Perhaps making a Hebrew Snowball while we are at it.
"Light stemmers", algorithms producing indexing units out of words by just removing prefix and doing other simple normalization operations. This is the best method today for many languages, also proven to work with Arabic (even though it doesn't handle nonlinear morphology, such as broken plurals).
Since for Hebrew we will be building one from scratch, there's going to be a lot of trials and errors. A good place to start is probably by learning from the experience of this research: Building a Shallow Arabic Morphological Analyzer in One Day. Some research work is probably going to be required in order to decide which prefix letters can be removed and which should be kept. Good thing is that Hebrew doesn't have the amount of broken plural cases as Arabic does, so this might actually work very well. Extremely light stemmers, such as the Harman "s-stemmer", can be a good starting point.
Simple character truncation has slim chances with Hebrew, but yet it worths a shot.
Since nouns and adjectives typically matter the most, the morphological lemmatisation process can ignore, low-rank or normalize difficult verbal forms. This may help both precision and index size.
Adjust the scoring given in the lemmatisation process (by tolerators or filters) based on relevancy tests.
Storing the original word (with a unique suffix, $ for example) alongside the lemma or other indexing unit and querying for it as well.
Since queries usually contain no real context (which is important for the disambiguation process done by the morph analyzer), it may pay using different analyzers for search and indexing, where the latter doesn't try to do much disambiguation (assuming the queried word is exact). The user can always refine his search.
Multi-field searches. one field with n-grams and the other with lemmas for example, using ConjunctionScorer.
... whatever else we could come up with ...

Phase 5 - Wide availability

It is important for us to make HebMorph's findings available to all, and usable from as many platforms and tools as possible. In order to achieve that, we will be porting the "winning" methods to other languages, such as Java and C/C++, so they could be enjoyed on other platforms as well. Once language ports become available, we also intend on providing HebMorph integration with more IR technologies, such as Xapian, Sphynx, SQLite FTS3 amd more.

Dictionary enhancements

Some work is required to make the dictionary used by the morph analyzer more standardized and distributable. That includes:

Proof-testing of hspell's data against various corporas.
Adding a method allowing to save the dictionary contained in the radix in a versioned format. This is necessary to allow for its easy distribution with an index and / or IR code.
Loading of external dictionaries (user data for example).
Several hspell related code updates so linguistic data is attached to the original dictionary files and isn't compiled to the application.

Future Work

There will always be more stuff to do. Here's some of what comes to mind:

More improvements to the morphological lemmatisation process powering the first analyzer created in this project, including: using Niqqud (where provided, even partially) for disambiguation, POS tagging during tokenization or later by processing the token vector, etc.
Saving terms with Niqqud (full or partial) by performing Hebrew vowel restoration. May require some serious Hebrew NLP work.
Looking for more researches done on Arabic IR, and trying implementing approaches they suggest with Hebrew.
Using Lucene's AutomatonQuery with automates that "understand" Hebrew. For an example for one, see Finite-State Registered Automata for Non-Concatenative Morphology.
Supporting Hebrew synonyms.

Code 972