Finding Hebrew lemmas (HebMorph, part 2)
As shown in the previous post, building a Hebrew-aware search engine is not trivial. Several attempts (mainly commercial) were made to deal with that. In this post I'm going to try and draw a complete picture of what they did, and show other routes that may exist. In the next post I'll discuss HebMorph itself.
Kudos to Robert Muir for several fruitful discussions and ideas.
A Few Definitions
- In the IR world, Precision and Recall are numbers used to measure retrieval quality of a given search engine. Precision is measured by looking at the number of relevant documents retrieved, in relation to the total of items retrieved (Fallout being the rest). Recall measures how many items we "missed" (if at all), by comparing the number of relevant items retrieved to the sum of total relevant items in the collection. This is only doable in a closed and controlled environment, where this information is absolute and known.
- Relevance is a subjective term (to the user querying the index) applied on a document, which asks: is this document relevant to the query you have entered, and does it help you with the info you are looking for. A good IR product will then return scored results, where the highest scored document tries to be the most relevant.
- Root in our context is a Semitic root, usually contains 3 letters, very rarely 2, 4 or 5. See the previous post for more info on this.
- Stem and Lemma originally have different meanings, but for Semitic languages they seem to be used interchangeably. For the sake of clarity, the term Lemma will be used whenever I'll refer to a base word grouping together all its different inflected forms. A Stem is a shorter and normalized state of a word, done by an automated process.
What to Index?This is the most important question of all. Identifying the "indexing unit" - the actual term to be saved in the index - is the cornerstone for and good and efficient search operation. With Hebrew, the following options exist: (after ruling out the option of indexing the term as-is):
- The raw term - we have already ruled that out as a possibility. Using query wildcards won't do so well, in case you wondered. For example, *שיר* may return השירים (the songs), השירות (the service), שיריים (food remains) and so on...
- A Hebrew Root - escaping a triliteral out of any given word. This approach will most probably result in poor precision and relevance, since a Semitic root is just too vague. Here's an example for one, where you can see the how one triliteral can represent nouns, verbs and adjectives of different meanings all together.
- A lemma - identifying one (or more) possible meaning of the word, and using that as the index term.
- Pseudo-lemma (or: a stem) - performing string operations to remove concatenated morphology (prefix letters M$H WKLB, suffixes like IM and WT, and so on), so the index term is not necessarily meaningful or disambiguated properly, but is definitely less ambiguous than #1, and the process is definitely faster and less dependent than #2.
Hebrew NLP MethodsThe most common approach assumes that in order to provide relevant search results the correct lemma has to be indexed. To arrive at a lemma, it is required to disambiguate the Hebrew word first. This is generally done in one of 2 ways, as follows. Many systems will perform several sequential NLP operations to disambiguate it even further.
- Dictionary based - the word is being looked up in a list of words compiled by hand, from a corpus, or using expanding algorithms. The word is usually checked numerous times - first for an exact match, and a few several times with suspected prefixes removed from it, or with some editing applied to cope with spelling differences. Each word in the dictionary should hold a list of prefixes allowed to be used with it.
- Algorithmic approach - a rule-based system identifies all the possible meanings of a word, and eliminates those that are less likely to be correct.In addition to detecting morphological inflections, some systems can also ignore phonetic differences very common in Niqqud-less spelling (loss of short vowels, for example) and in loanwords (Music is written as either מוזיקה or מוסיקה, for example).
- Morphological precision, meaning the total number of standard language words recognized, and whether special noun cases like duals and broken plurals, and assimilation or 4-5 letters root for verbs, are correctly detected.
- How special words are treated. This includes slang and loanwords, and also nouns and names. Such a vocabulary is growing endlessly.
- Toleration of spelling differences: prefix particles, missing or extra short vowels, consonant Waws, and the rest of the Niqqud-less spelling issues.
- Level of disambiguation provided naturally (number of false positives; lexicographic disambiguations / POS; ranking abilities).
While dictionaries allow to perform only a strict lookup for a word, it is quite easy to make an algorithm to tolerate morphological phenomenons like prefixes, different spellings of a word, missing vowels and consonant Waws. However, rule-based systems are more likely to handle special cases incorrectly, and to accept illegal words - what could create additional ambiguities. Slang and loanwords will not be processed correctly (or not at all) unless they correspond with a rule.
External files or a database are usually required for dictionary-based approaches to work. A lemmatizer built based on a set of rules is usually implemented in code only, hence works out of the box and requires no prerequisites whatsoever. Another thing to note regarding Hebrew dictionaries, is the spelling of the words in it - it can either contain all the possible spellings for a word, or follow a specific set of spelling rules. Whichever is the case, and what are the rules it follows, are important for the disambiguation process.
Dictionaries can be either hand-crafted (possibly with some help of automated tools), or scraped from a large corpora. Dictionaries of the first type generally contain some morphological information of a word, for example what prefixes are allowed and what are its correct stem and root. The ones that are created from a corpora can use statistical data for disambiguation. Quality of statistical disambiguation depends on the amount of texts used to create it, and their quality and context. For a good corpora, it should match that of a rule-based systems.
Whatever path we take, at this stage more than one lemma exists for most Hebrew words. It is possible to filter some of them, for example based on probability ranking if provided by the disambiguation process. There are ways to disambiguate it even further, many of which rely on the original context of the word. Alon Itai and Erel Segal gave an overview of several approaches, while presenting their own:
Choueka and Luisignan (Chueka and Lusingnan, 1985) proposed to consider the immediate context of a word and to take advantage of the observation that quite often if a word appears more than once in the same document the same analysis will be the correct analysis throughout the document. Orly Albeck (1992) attempted to mimic the way humans analyze texts by manually constructing rules that would allow finding the right analysis without backtracking. Levinger (1992) gathered statistics to find the probability of each word, and then used hand crafted rules to rule out ungrammatical analyses. Carmel and Maarek (1999) used statistics on our tags to partially disambiguate words and then indexed the disambiguated text for use in a search engine. The first stage of our work, the word stage, was based on Levinger (1995), but like Carmel (1999) used the morphological tags independently of the lemmas. Ornan and Katz built a disambiguation system for Hebrew based on the phonemic script and handcrafted semantic clues (Ornan 1994).Complete POS solution (contextual or statistical) is the only likely way to provide complete and reliable disambiguation for Hebrew texts. Building such a tool, however, is a large task in itself, and even then ambiguities exist. For example, the phrase "המראה של מטוסים ריקים" can mean either "the look of empty planes [...]" or "take off of empty planes".
NLP-based Text RetrievalAfter the disambiguation process - either partial or full - IR systems will index the provided lemmas. Before doing that, several tasks are likely to be performed:
- If more than one lemma exists for a word (for example, if no contextual / POS tools were used), try and index as few lemmas as possible, to maintain high retrieval quality. Using probability ranking (if any) to high-score more probable lemmas is a good way to manage several lemmas in the same position. Posing a ranking threshold is a good idea too.
- Handle Out-Of-Vocabulary cases. Those will occur for both dictionary-based and algorithmic approaches. Unreckognized words (OOV) will usually be indexed as-is, without attempting to remove linear morphology particles.
- Some systems will attempt to identify OOV cases by looking at a list of names and addresses, so they could be grouped together despite spelling differences.
- Removal of stop-words and noise words. POS tagging can help with disambiguating the cases where a term can be either a valid term or a stop / noise word.
- Term expansion, like soundex and synonyms. Both will require additional dictionaries and/or morphological tools.
Other Retrieval MethodsThe assumption we have just discussed, of not being able to achieve good search results without complete morphological processing, has been disproved several times. Multiple researches show how good precision and recall are maintained even without complex morphological analysis, and in some cases the alternative is even better. Such non-morphological approaches include several word-based models (such as rule-based stemming, Snowball, light stemming, soundex and word truncation) and non-word approaches (character n-grams, word internal n-grams, skipgrams and character truncation).
As long as these methods provide sufficient results they are usually preferred, since they are faster and save the overhead of using morphological tools. The common trade-offs for most of them are larger index, longer index look-up times (for some) and the inability to expand queries with synonyms, or perform word decoupling.
Many studies were conducted on a variety of Latin languages, and have proven these methods to be very efficient while discussing why they are better than the morphological alternative. Apparently, IR doesn't care about understanding the language, just about statistics and redundancy. This could also be an effect of a certain error rate which exists in any lemmatisation or disambiguation process.
Although the case with Semitic language was expected to be different, 4-grams have shown to work very well for Arabic too. Researchers also showed how a light-stemming algorithm (light-10) can outperform Buckwalter's morphological approach (used by AraMorph). Since Arabic shares many language characteristics with Hebrew, this draws to the conclusion that the case for Hebrew is most probably no different.
For Hebrew and Arabic (and probably other Semitic languages as well) there may also be another reason why non-morphological approaches may have the edge. For lemmatisation to work correctly, a good context has to be provided. While indexing text files or even short sentences, that is not a real problem. However, often when a short query is given lemmatisation would fail to provide correct lemmas, and therefore result in an irrelevant query. Although this is not always the case, it is not rare enough to dismiss. All commercial solutions providing search engines with Semitic languages support seem to rely on morphological tools. And as far as I could tell, they all ignore this pitfall.
The Best Retrieval Method For HebrewSince Hebrew has its own perks, whatever works for Arabic isn't guaranteed to also work for Hebrew. The best Hebrew IR solution is expected to use a variation or combination of the methods mentioned above. Put differently: morphological analysis, disambiguation, and the vast use of NLP methods are very likely ineffective compared to other solutions, which require no similar overhead. However, without having an environment for training and testing all this, we could never discover what the best approach really is.
To the best of my knowledge, there is no such an effort today, one that is striving to advance Hebrew IR so Hebrew texts can be properly searchable. This is where HebMorph comes in - its sole purpose is to brainstorm and provide as many indexing methods as possible, and then to pick the best one out of that collection, and make it usable from as many IR technologies as possible. HebMorph is all about creating these tools (including Hebrew stemmers, full-scale and light), and providing tools to compare them to one another.
In the next post I'll describe HebMorph's current status, and provide more details about its roadmap. Stay tuned!