More flexible Hebrew indexing with HebMorph
In the past week I've been working on making Hebrew indexing with HebMorph more flexible. It is now possible to perform different types of searches, and also to control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done.
HebMorph now contains two new entities: Lucene.Analysis.Hebrew.SimpleAnalyzer and HebMorph.LemmaFilter.
The Hebrew SimpleAnalyzer
Lucene.Analysis.Hebrew.SimpleAnalyzer performs, as its name suggests, simple analysis only. It calls HebMorph.Tokenizer to perform text tokenization, and then passes the tokens through a NiqqudFilter (to remove Niqqud characters), a StopFilter (along with a list of Hebrew stop-words) and a LowerCaseFilter (to normalize non-Hebrew tokens). The tokenization process also tries to remove certain noise cases, unique to Hebrew, where it's obvious a token is not a real word.
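To picture the filter chain, here is a simplified, self-contained sketch of the same sequence of per-token steps. The class, method names and the stop-word list below are illustrative only, not HebMorph's actual API; niqqud removal is approximated by stripping Unicode combining marks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SimplePipelineSketch {
    // Hypothetical stand-in for a real Hebrew stop-word list.
    static final Set<String> STOP_WORDS = Set.of("של", "את", "על");

    // NiqqudFilter step: strip combining marks (niqqud) from a token.
    static String removeNiqqud(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // The full chain: niqqud removal -> stop-word removal -> lowercasing.
    static List<String> analyze(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String cleaned = removeNiqqud(t);
            if (STOP_WORDS.contains(cleaned)) continue; // StopFilter step
            out.add(cleaned.toLowerCase());             // LowerCaseFilter step
        }
        return out;
    }
}
```

Lowercasing is a no-op for Hebrew letters, so in practice it only affects the non-Hebrew tokens, as in the real analyzer.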
The MorphAnalyzer has a similar process in place, but it also makes use of metadata returned for each Token. In addition, it uses HebMorph.Lemmatizer to compile a list of possible lemmas, and indexes them along with the original term (which is marked as such). By default, the original term is stored only when more than one lemma is returned for a term.
To enable dual searches the morphological analyzer needs to store the original tokens for all cases (since SimpleAnalyzer performs no lemmatization), and to mark them accordingly. SimpleAnalyzer, in turn, needs to be aware of what is an original term and what is not. To achieve that, SimpleAnalyzer now uses AddSuffixFilter to "stick" a $ sign to the end of each analyzed term. MorphAnalyzer, in turn, is asked to append this char to all original terms, and not only to cases where there is more than one possible lemma.
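The marking scheme above can be sketched as follows. The "$" suffix and the index layout come from the post itself, but the helper functions are an illustrative sketch, not HebMorph's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class MarkedOriginalSketch {
    static final char ORIGINAL_MARKER = '$';

    // Emit the index terms for one input word: its lemmas as-is, plus the
    // original surface form marked with a trailing '$'. With the
    // alwaysSaveMarkedOriginal behavior, the marked original is always
    // emitted, not only when there is more than one possible lemma.
    static List<String> indexTerms(String original, List<String> lemmas) {
        List<String> terms = new ArrayList<>(lemmas);
        terms.add(original + ORIGINAL_MARKER);
        return terms;
    }

    // An exact (SimpleAnalyzer-style) search targets only the marked form,
    // while a morphological search matches the unmarked lemmas.
    static String exactQueryTerm(String word) {
        return word + ORIGINAL_MARKER;
    }
}
```

Because lemmas are indexed unmarked and originals are indexed with the "$" suffix, both search flavors coexist in a single field without indexing the contents twice.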
To use this analyzer duality, fields need to be created with MorphAnalyzer, with MorphAnalyzer.alwaysSaveMarkedOriginal = true (it is false by default). When using the same analyzer for search, it is recommended to set it back to false. To use Hebrew.SimpleAnalyzer to search these fields, register all the Hebrew tokens with the "$" suffix as follows:
SimpleAnalyzer an = new SimpleAnalyzer();
Flexible lemma filtering
Until today, no filtering was done to the collection of lemmas returned from the lemmatizer. Since the toleration mechanism often returns some very wrong lemmas, we needed a way of moderating the lemmas accepted by the morphological analyzer.
This is where HebMorph.LemmaFilter comes in. It is a class accepting a collection of HebMorph.Token objects, and returning only the tokens which passed the checks done by its member function IsValidToken. A LemmaFilter is "pluggable" into MorphAnalyzer, and can also be used to create more focused searches (or expand them further if required). Since the LemmaFilter object is aware of all token properties returned from the lemmatizer, it is quite a powerful mechanism, allowing the consumer fine-grained control over which lemmas are accepted.
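The filter contract could be sketched like this. The Token fields and the score-based rule below are assumptions made for illustration; the real HebMorph.Token carries richer metadata from the lemmatizer:

```java
import java.util.ArrayList;
import java.util.List;

public class LemmaFilterSketch {
    // Minimal stand-in for HebMorph.Token: a lemma plus a disambiguation
    // score (a hypothetical field, used here only to drive the example).
    record Token(String lemma, double score) {}

    // The filter contract: keep only the tokens that pass isValidToken.
    interface LemmaFilter {
        boolean isValidToken(Token t);

        default List<Token> filter(List<Token> tokens) {
            List<Token> kept = new ArrayList<>();
            for (Token t : tokens) {
                if (isValidToken(t)) kept.add(t);
            }
            return kept;
        }
    }

    // Example plug-in: reject low-confidence lemmas such as those produced
    // by the toleration mechanism (the threshold is an illustrative choice).
    static final LemmaFilter MIN_SCORE = t -> t.score() >= 0.5;
}
```

A consumer can plug any such implementation into the analyzer, or apply a stricter one at search time to narrow a query without re-indexing.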
Using these new features, preliminary results for searches run on a corpus of about 15,000 Hebrew documents look pretty good. Indexing is done with MorphAnalyzer to create one field only, and then searches are performed in one of two flavors: exact (using Hebrew.SimpleAnalyzer) to find names or exact phrases like שיר השירים, and morphological (using MorphAnalyzer) to find topics or non-exact words.