Challenges with indexing Hebrew texts (HebMorph, part 1)
Unfortunately, there is no magic trick for correctly indexing and searching Hebrew texts. Semitic languages like Hebrew, Arabic, and Aramaic are the hardest to morphologically analyze and disambiguate, and as a result creating a perfect IR solution for them, if at all possible, requires a lot of research and a very long process of trial and error. Some claim Hebrew is the most complex language of all from an NLP perspective. I don't know other Semitic languages well enough to comment on this, but I do know Hebrew to be complicated enough...
Since someone had to do this lengthy and tiresome work someday, I decided to go forward and do the heavy lifting myself instead of waiting for someone else to pick it up. That, and the fact I needed such a solution for another product I'm working on. This effort - HebMorph - is all about making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. As of this writing, it is still in a design phase, and is available from the github repository.
In a series of posts, I'm going to investigate this subject, and hopefully draw a complete picture. I'll start by explaining Hebrew morphology and how it affects common IR methods. From there, I'll present several possible ways to attack the problem, and finally discuss what exactly HebMorph does and what are its goals and roadmap.
Hebrew Transliteration
Before we start, here is a transliteration table (based on the one found at http://www.cs.technion.ac.il/~erelsgl/bxi/hmntx/tqstim/tatiq.html). We will use this simple convention throughout this post, in a left-to-right reading order:
|א||ב||ג||ד||ה||ו||ז||ח||ט||י||כ / ך|
|ל||מ / ם||נ / ן||ס||ע||פ / ף||צ / ץ||ק||ר||ש||ת|
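To make the convention concrete, here is a minimal Python sketch of such a transliteration. The mapping is an assumption on my side: it only covers the letters whose Latin counterpart can be read off the examples used later in this post (BIT, M$H WKLB, HC@LBWT and so on), and any other character, including Hebrew letters not used in those examples, is passed through unchanged. The names `HEBREW_TO_LATIN` and `transliterate` are made up for illustration.

```python
# Partial mapping, inferred from the examples in this post (an assumption,
# not the full table). Final-form letters share the symbol of the base letter.
HEBREW_TO_LATIN = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H", "ו": "W", "ז": "Z",
    "ח": "X", "ט": "@", "י": "I", "כ": "K", "ך": "K", "ל": "L",
    "מ": "M", "ם": "M", "נ": "N", "ן": "N", "פ": "P", "ף": "P",
    "צ": "C", "ץ": "C", "ר": "R", "ש": "$", "ת": "T",
}


def transliterate(text: str) -> str:
    """Map each known Hebrew letter to its Latin symbol; keep anything else."""
    return "".join(HEBREW_TO_LATIN.get(ch, ch) for ch in text)


print(transliterate("בית"))       # BIT
print(transliterate("משה וכלב"))  # M$H WKLB
```

Since Python strings hold characters in logical order, the Latin output naturally reads left to right even though the Hebrew input is displayed right to left.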
Crash-course on Hebrew Morphology
As in other Semitic languages, most Hebrew roots consist of only three letters, hence they are often referred to as "triliterals". Some Hebrew roots are a sequence of 4 consonants, and very few are only 2 characters long. Roots of 5 or 6 consonants exist, but are very rare and mostly the result of accepting a loanword into the language. Patterns are applied to a root to form a verb or a noun. Some nouns, mainly loanwords and some personal names, do not have a root. A root by itself doesn't have a real meaning as a word, although for many roots the base form of a verb formed with Binyan Qal consists of its three letters only.
The pattern system used to build verbs out of a given triliteral is called "Binyanim", and consists of 7 basic groups. While there are numerous exceptions, it is fairly easy to use those rules to create most verbs correctly out of a root. Verbs are inflected by gender, person, number, mood and tense. All verbs can be written in a one-word infinitive form, which isolates the verb itself from tense and the other morphological categories. For example, ללכת ("to go") is the infinitive form of הלך ("he went") and ילכו ("they will go").
Hebrew nouns are inflected by gender, number and sometimes by possession, but not by case. Nouns are generally correlated to verbs (by shared roots), in which case their forming is about as systematic. The pattern system used to create nouns is called "Mishkalim", and is quite similar to the one being used for verbs. However, many nouns are not made of these patterns, often due to loan words from foreign languages, or other cases where there is no obvious root for the word.
Hebrew is written and read from right to left, and has 5 vowels: /a/, /e/, /i/, /o/ and /u/. Vowels are represented by Niqqud - diacritical signs added to consonants to specify their reading, so for example סָ is pronounced S/a/, and סֵ S/e/. Each vowel can be pronounced in 2 types of syllables (open and closed), and there is a Niqqud sign for each, plus a few more for mid-cases (called Chataf). With Niqqud, words are read consonant-first: the consonant is pronounced, then the vowel specified by the Niqqud sign, and then the next consonant. However, Niqqud is not mandatory, and a Niqqud-less spelling exists in which words are readable in context. Most texts available today use the Niqqud-less spelling; modern texts add Niqqud only when they are meant for children, or in order to disambiguate a word.
The letters M$H WKLB (משה וכלב) are added to a word as prefixes, for uses where English would use a dedicated word. From Wikipedia:
Hebrew uses a number of one-letter prefixes that are added to words for various purposes. These are called "Letters of Use" (Hebrew: אותיות השימוש, Otiyot HaShimush). Such items include: the definite article ha- (/ha/) (="the"); prepositions be- (/bə/) (="in"), le- (/lə/) (="to"), mi- (/mi/) (="from"; a shortened version of the preposition min); conjunctions ve- (/və/) (="and"), she- (/ʃe/) (="that"), ke- (/kə/) (="as", "like").
The vowel accompanying each of these letters may differ from those listed above, depending on the first letter or vowel following it. The rules governing these changes are hardly observed in colloquial speech, as most speakers tend to employ the regular form. The correct form may be heard in more formal circumstances. For example, if a preposition is put before a word which begins with a moving Shva, then the preposition takes the vowel /i/ (and the initial consonant may be weakened): colloquial be-kfar (="in a village") corresponds to the more formal bi-khfar.
The definite article may be inserted between a preposition or a conjunction and the word it refers to, creating composite words like mé-ha-kfar (="from the village"). The latter also demonstrates the change in the vowel of mi-. With be and le, the definite article is assimilated into the prefix, which then becomes ba or la. Thus *be-ha-matos becomes ba-matos (="in the plane"). Note that this does not happen to mé (the form of "min" or "mi-" used before the letter "he"), therefore mé-ha-matos is a valid form, which means "from the airplane".
Like in English, suffixes are used for plurals. Unlike English, there is also a special form of plurals - duals. Duals are treated as plurals, but their spelling rules are a bit different. Suffixes are also used to mark the feminine form of verbs, nouns and adjectives. There are special suffixes for feminine plurals, and as with all rules - there are quite a few exceptions. Possession may also be written as a suffix instead of as a separate word, so עט שלי (my pen) may also be written as עטי.
Hebrew and IR technologies
Undiacriticized Hebrew poses many difficulties for IR systems striving to search Hebrew texts effectively. In the following sub-sections I'll detail the problems IR and NLP tools have with the language, and in subsequent posts I'll detail possible ways to resolve them.
Although some of the issues listed here exist in other languages as well, I still felt I should mention them, for two reasons. First, I couldn't find a comprehensive list of problems to solve when writing Hebrew IR solutions, and I thought this would be a good starting point for introducing HebMorph. Second, some of the challenges are much greater with Hebrew due to its complex morphology and great deal of ambiguity. Approaches that work for simpler languages are not guaranteed to work for Hebrew, so these challenges are worth bringing up specifically.
Prefix ambiguity
Terms are stored in indexes in lexicographic order, hence it is important that the first letter(s) of an indexed term actually be part of the word the term represents, or the term won't be found while searching. This is the case for most of the world's languages. However, in Hebrew there are seven consonants (M$H WKLB) which can be added as prefixes that only change the context of the word and not its meaning. As an example, take the noun BIT (בית, home) which can be used with several prefixes, like this: LBIT (to the home), BBIT (in the home), $BBIT (that in the home), K$HBIT (when the home), and so on.
As you can see from the example above, several prefixes can be used on one word, without changing its meaning. For all the cases above, we would want to index the word "home" - BIT. Indexing the other prefixed terms without stripping their prefix will prevent them from coming up in searches, although they are valid results.
So, you ask, why not just strip those letters? Well, because in too many cases they can also be a part of a word. Take the following example:
רותי פספסה את הרכבת
הרכבת המוצר מסובכת להפליא
In the first sentence, HRKBT is a noun and the letter H is being used as a prefix, forming the word "the train". In the second sentence, HRKBT is a verbal noun meaning "assembling (of)". Both words share the same root (RKB), but the meaning is different enough for you to want to strip the definite article H from the word in the first sentence before indexing it.
I can give many more examples to demonstrate this. Here are two more just so you get the idea (a plus sign separates the stem from its affixes):
KLBI (כלבי) can mean:
- K+LBI - as my heart
- KLB+I - my dog
$BTW (שבתו) can mean:
- $+B+TW - that in a (musical) note (pronounced She-Be-Tav)
- $+BT+W - that his daughter (pronounced She-Bi-To)
- $BT+W - they were on a strike (pronounced Shav-Tu)
- $BT+W - (the action of) him being seated (pronounced Shiv-To)
- $BT+W - His Shabbat (pronounced Sha-Ba-To)
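To see why blind stripping is not an option, here is a small Python sketch that enumerates every way a transliterated word could be split into a run of M$H WKLB prefix letters and a remaining stem. It is purely illustrative and is not what HebMorph does: the function name `prefix_splits` and the `min_stem_len` cut-off are my own assumptions, and no dictionary is consulted, so most candidates it returns are not real words.

```python
PREFIX_LETTERS = set("M$HWKLB")  # the seven prefix letters, transliterated


def prefix_splits(word: str, min_stem_len: int = 2) -> list[tuple[str, str]]:
    """Return (prefix, stem) candidates, starting with the unsplit word."""
    splits = [("", word)]
    for i, ch in enumerate(word):
        if ch not in PREFIX_LETTERS:
            break  # a non-prefix letter ends the possible prefix run
        stem = word[i + 1:]
        if len(stem) < min_stem_len:
            break  # assumed cut-off: do not leave a stem shorter than this
        splits.append((word[: i + 1], stem))
    return splits


for w in ("HRKBT", "KLBI", "$BTW"):
    print(w, prefix_splits(w))
```

For HRKBT this yields both the unsplit form and H + RKBT, and nothing in the word itself says which one is right. Choosing between the candidates takes a lexicon and, quite often, the surrounding context.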
Niqqud-less (dotless, undiacriticized) spelling
As mentioned above, we need Niqqud signs to pronounce a word correctly. Every word has its own unique set of Niqqud signs, so when they are present no context is necessary to correctly analyze the meaning of the word. Without them it's just a pile of consonants whose intended meaning can only be understood in context, by someone with a good grasp of the language.
Without Niqqud and without context, the ambiguity is just too great. It is estimated that about 50% of words - meaning every other word - have at least one other legitimate meaning. Even a human will not be able to determine the correct meaning of such a word, and sometimes even a short context is not enough to help with that.
As an example, the word XBL (חבל) can mean a rope (XeVel), pity (XaVal), wounded (verb, XaVal) or sabotaged (XiBel). Note how the vowels repeat themselves; with Niqqud, however, the actual words are very different: חֶבֶל (rope), חֲבָל (pity), חָבַל (wounded), חִבֵּל (sabotaged).
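The collapse is easy to reproduce. The following Python sketch removes the Hebrew combining marks (Niqqud and cantillation signs, U+0591 to U+05C7 with Unicode category Mn) from a string, using only the standard `unicodedata` module. The function name `strip_niqqud` is mine, and this only shows what Niqqud-less text loses; real Niqqud-less spelling may also add vowel letters, which this does not attempt.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Drop Hebrew non-spacing marks, keeping letters and punctuation."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


pointed = ["חֶבֶל", "חֲבָל", "חָבַל", "חִבֵּל"]  # rope, pity, wounded, sabotaged
print({strip_niqqud(word) for word in pointed})
```

All four pointed words end up as the same three consonants, חבל, which is exactly the string an indexer gets to see in most real-world texts.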
Spelling with Niqqud signs has very strict rules (which are based on syllable types), and is highly standardized. Niqqud-less spelling, on the other hand, is a different story. Since vowels aren't written, characters like Yud and Waw are sometimes added to the word to make up for that. The Academy of the Hebrew Language has compiled a set of rules for Niqqud-less spelling, which define when to add vowel letters like Yud and Waw, when to double a consonantal Waw and so on. But these rules are very hard to memorize, some aren't widely agreed upon, and most of them aren't widely used, at least not consistently. As a result, spelling inconsistencies are very common, and add plenty more reading options for many words.
For example, the Hebrew word for drawer should be spelled MGIRA according to the Academy decision, and although all common dictionaries agree, the spelling MGRA (without Yud) is still very common. The word MGRA is by itself quite ambiguous, with about 10 different meanings, and taking the loss of short vowels into account doesn't make disambiguation easier.
Even partial Niqqud could help with any of the above - for example by eliminating possibilities where a vowel doesn't match the Niqqud given for that position, or by deciding that a short vowel was not omitted. The same goes for letters like $, which can be pronounced as either Sh or S, and for the Niqqud sign Dagesh, which can transform B from sounding like V/e/ to B/e/, P from F/e/ to P/e/, and K from X/e/ to K/e/. This is what could make the difference between XeVel and XiBel, and greatly help disambiguation. For those, partial Niqqud is a bit more common, but it still isn't used often enough.
Suffixes
Hebrew suffixes are used for much more than just plurals, and plural suffixes themselves come in different flavors (for example, the plural of masculine nouns ends with -IM, while feminine nouns have a -IOT suffix). Suffixes also mark possession for nouns, the speaker for verbs, and pronouns, and all of these can combine with the plural suffix (for example: DLT - a door, DLTNW - our door, DLTWTNW - our doors), which makes it much tougher to correctly identify and remove the suffix.
Ambiguities exist with suffixes as well. For example, XBLH (חבלה) can mean "her rope" with a possessive suffix H ("hers"), damage (without any suffix; a noun with Mishkal Katala), or wounded (a verb, feminine past).
Hebrew also has several cases of broken plurals, but apparently not more than Arabic has. These require the suffix identification process to take the exceptions into account; the more exceptions exist, the more complex the algorithm is.
Consonants Assimilation and Intersection
Although a pattern is applied to a Semitic root to form a noun or a verb, the resulting words don't always follow the pattern exactly. In many cases letters are replaced or assimilated so the word becomes easier to pronounce. For example, intersection in Hebrew is "Hitz-talvut" (הצטלבות, HC@LBWT), whereas the word obtained by applying the relevant morphological template is "Hit-tzalvut" (התצלבות, HTCLBWT), which is a bit harder to pronounce. For that and several other reasons, the consonants in this word have been both crossed and assimilated.
This phenomenon is very likely to make automatic identification of a word a bit harder.
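The HTCLBWT case follows a well-known rule for the HT- template: when the first root letter is a sibilant, it swaps places with the T, and after C or Z the T is additionally replaced (by @ and D respectively). A rough Python sketch of just that rule, on transliterated input, could look like this. The function name `surface_form` is hypothetical, "S" is assumed to stand for ס, and the code does not check that the input really is an HT- template form, so treat it as an illustration only.

```python
# First root letter -> the letter that replaces T after the swap.
# "S" for the letter Samekh is an assumed transliteration symbol.
SWAPPED_T = {"S": "T", "$": "T", "C": "@", "Z": "D"}


def surface_form(template_form: str) -> str:
    """Apply the sibilant swap to a form built with the HT- prefix."""
    if template_form.startswith("HT") and len(template_form) > 2:
        first_root_letter = template_form[2]
        if first_root_letter in SWAPPED_T:
            replacement = SWAPPED_T[first_root_letter]
            return "H" + first_root_letter + replacement + template_form[3:]
    return template_form  # no sibilant after HT: nothing changes


print(surface_form("HTCLBWT"))  # HC@LBWT
```

An analyzer has to run this in reverse: when it meets HC@LBWT it must consider that the @ may be a displaced T before it can recover the template and the root.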
Stop-words ambiguity
Stop words, or stop lists, are the most frequent terms in a set of text documents; they are more likely to hurt relevance than to aid searches, hence many IR tools filter them out. Usually these are words without a real meaning; English examples would be words like "the", "and" and "or", and many other pronouns and prepositions.
Compiling such a basic list in Hebrew is not a straightforward task. First, for each "stop word" we need to take into account all the possible prefixes it may get, so each stop word immediately becomes two or three, which essentially bloats the dictionary used for this process. For example, the word גם ("also", "too") is definitely a stop word, and so is the word וגם ("and also").
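The bloat is easy to picture with a few lines of Python. This sketch simply prepends every given prefix to every stop word. Which prefixes a particular stop word can really take is a linguistic question the code does not answer, so the prefix list passed in is an assumption of the caller, and `expand_stop_words` is a name invented for this example.

```python
from itertools import product


def expand_stop_words(stop_words: set[str], prefixes: list[str]) -> set[str]:
    """Return the stop words plus every prefix+word combination."""
    expanded = set(stop_words)
    for prefix, word in product(prefixes, stop_words):
        expanded.add(prefix + word)
    return expanded


# "גם" with the conjunction prefix "ו" gives {"גם", "וגם"}
print(expand_stop_words({"גם"}, ["ו"]))
```

With n stop words and p applicable prefixes the list grows to as many as n * (p + 1) entries, before even considering stacked prefixes.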
In many cases, Hebrew stop-words are ambiguous as well - with or without a prefix. Examples of such cases include the preposition אשר, which can also be a biblical name (Asher) that is still widely used today; the preposition אף, which can also mean a nose; the preposition כדי, which can also mean "my vase"; and plenty more. By removing those, one risks removing a meaningful term from the index. By not filtering them out, relevance is likely to be hurt.
In some cases words are meaningless only when they are part of a specific phrase, but when they are not they may actually have an important meaning. This phenomenon also exists in other languages (for example, "no one" in English), but it seems to be more common in Hebrew ("על ידי", "אי פעם", "אף על פי", "שום דבר"...). For these phrases, the reverse dilemma exists.
Constructs
Hebrew frequently uses phrases in which the spelling of the words changes slightly (usually with a suffix), and so does their meaning. This is called a Construct. The first word in פי התהום - the verge of the abyss - when looked at separately, can either mean "my mouth", "the mouth of-", or a preposition used when discussing multiplication. By looking at just that word, we would probably go and index the term "mouth". Indeed, the literal meaning of that phrase is "the mouth of the abyss", however the term "mouth" alone has no real meaning in that context. Since Hebrew constructs are likely to cause quite a few false positives in searches, it is a good idea to try and index them differently - perhaps as a whole, or only the words in them that do not have more than one meaning.
Geresh and Gershayim
Double quote (Gershayim) and single quote (Geresh) characters are often added to a Hebrew word, also mid-word, not as punctuation but rather to mark abbreviations and acronyms. A single quote may also follow one of the letters XCGZ (חצץ גז), to change the way they are pronounced. This usually happens in loanwords and non-Hebrew personal names, and is not considered part of the Niqqud rules, but rather an "expansion" of the letter. As a result, these characters are practically part of the dictionary.
Ambiguities may exist where a word terminates with a single quote after a letter, in which case some context is required to correctly identify whether it is an abbreviation or not. For example, the word AINC' (אינצ') could either mean an inch, or an abbreviation of Encyclopedia.
Most texts do not use the Unicode Geresh and Gershayim characters, but rather the common single quote (') and double quote (") characters, which are easier to type. These are considered punctuation characters by many tokenizers, and are also used in many query syntaxes, hence this could lead to a lot of confusion if no prior handling is done.
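One possible form of such prior handling is to rewrite the ASCII quotes into the dedicated Unicode characters (Geresh U+05F3, Gershayim U+05F4) before tokenizing. The Python sketch below is a heuristic of my own and not a description of HebMorph: it converts a double quote only when it sits between two Hebrew letters, and a single quote whenever it directly follows a Hebrew letter. The latter will also catch a closing single quote at the end of a quoted Hebrew word, which is exactly the kind of ambiguity described above.

```python
import re

HEBREW_LETTERS = "\u05D0-\u05EA"  # Alef through Tav
GERESH = "\u05F3"
GERSHAYIM = "\u05F4"

# A double quote with a Hebrew letter on both sides (mid-word acronym mark).
_MID_WORD_DOUBLE = re.compile(f'(?<=[{HEBREW_LETTERS}])"(?=[{HEBREW_LETTERS}])')
# A single quote right after a Hebrew letter (abbreviation or letter expansion).
_AFTER_LETTER_SINGLE = re.compile(f"(?<=[{HEBREW_LETTERS}])'")


def normalize_quotes(text: str) -> str:
    """Replace ASCII quotes that look like Gershayim / Geresh."""
    text = _MID_WORD_DOUBLE.sub(GERSHAYIM, text)
    return _AFTER_LETTER_SINGLE.sub(GERESH, text)


print(normalize_quotes("אינצ'"))
```

Double quotes that wrap a whole Hebrew word are left alone, since neither of them has a Hebrew letter on both sides.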
Various written dialects
The Hebrew language has changed a lot over the many years it has been spoken, and although the modern (or Israeli) dialect is the most common today, many dialects are still widely used in written texts. Biblical and Talmudic dialects (including the Mishnaic, early-rabbinic, post-biblical and some Medieval variations) are used in Torah studies, and several variations of the Medieval dialect are still used in poems and literature. Some even claim the modern dialect differs so much from the original Hebrew that it should be considered a brand new language.
Although many languages have different dialects, in Hebrew that appears to be of real significance because of its morphology. The dialects don't just differ in their grammar and have their own vocabulary; many times they inflect verbs and nouns differently, or give them a totally different meaning. Then you end up having several more ambiguities per term, or several extra Out-Of-Vocabulary cases. For example, the word חמר would mean wine in a Mishnaic / Talmudic context; in modern Hebrew it would be understood as a type of soil.
Wrapping up
Triliterals, non-concatenative morphology, and the facts that most of the vowels are not written, that many particles are attached to the word without a space, that a double consonant is written with one letter, and that some letters signify vowels and consonants interchangeably, all cause great ambiguity in the Hebrew language. Almost every string of characters may designate many words, the average being about three different meanings per term. Not helping are the variant spelling conventions, which make it harder to decide what the most probable meaning of a word is. Having multiple possible terms per word in an index obviously affects IR systems trying to make this text searchable.
In the next post I'll show common approaches used to resolve this, and right after that what HebMorph is trying to do with all that.