Testing hspell's language coverage using Wikipedia

As part of the HebMorph project, I needed to test hspell's dictionary against a large modern corpus. Knowing how many words it can recognize is very important, and below I'll explain exactly why. The project, along with usage instructions, is released under the GNU GPL and is available from here. The report (zipped XML) is available here. The words listed in this report are only those unknown to hspell, i.e. out-of-vocabulary (OOV) cases. There are 460,116 such words (minus 3,596 words that could be automatically identified as spelling errors), each bundled with the ID of the Wiki page it was first found in, so the word can easily be located in context.
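For readers curious about the mechanics, here is a minimal, self-contained sketch of this kind of scan. It assumes the dictionary is available as a plain word list and the corpus as plain text with one page per line; the real tool works against a Wikipedia XML dump and hspell's own data files, and the file names below are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CoverageScan {
    public static void main(String[] args) throws IOException {
        // Stand-in for hspell's dictionary: one recognized word per line.
        Set<String> dict = new HashSet<>(Files.readAllLines(Paths.get("hspell-words.txt")));

        // Each unknown word is stored with the ID of the first page it was
        // seen on, so it can later be located in context.
        Map<String, Integer> oov = new HashMap<>();
        int pageId = 0;
        for (String page : Files.readAllLines(Paths.get("corpus.txt"))) {
            pageId++;
            for (String word : page.split("\\s+")) {
                if (!word.isEmpty() && !dict.contains(word)) {
                    oov.putIfAbsent(word, pageId);
                }
            }
        }
        System.out.println("OOV word types: " + oov.size());
    }
}
```

A HashSet is obviously a simplification of hspell's actual data structures, but it is enough to reproduce the counting logic.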

Each entry can represent one of four very different cases:

  1. The word is invalid (just a pile of letters).
  2. It is a spelling error.
  3. It uses a different spelling convention, one that does not follow the Academy of the Hebrew Language's decisions. Since hspell's dictionary is based on the Academy's rules, it will not recognize those words (or will recognize them incorrectly, but that case is not covered by this test).
  4. The word is valid but missing from hspell's dictionary.
With the first two there is nothing I can do except notify the corpus's authors; these actually make up the lion's share of the words in this report. What I'm most interested in, for two different reasons, are cases #3 and #4. Finding them among all the words in the report is very important, but it is also quite a lengthy process...
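For concreteness, the four cases and the report entries could be modeled like this. This is an illustrative representation only; the released tool just lists the words and page IDs, and assigning a case to each word is the manual part:

```java
// An illustrative model only, not part of the released tool.
enum OovCase {
    INVALID,           // case 1: just a pile of letters
    SPELLING_ERROR,    // case 2: a typo in the corpus
    SPELLING_VARIANT,  // case 3: valid, but not the Academy's standard spelling
    COVERAGE_GAP       // case 4: valid standard spelling missing from hspell
}

class ReportEntry {
    final String word;
    final int firstPageId;  // the Wiki page where the word was first seen
    OovCase cause;          // to be determined by manual review

    ReportEntry(String word, int firstPageId) {
        this.word = word;
        this.firstPageId = firstPageId;
    }
}
```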

Perfecting the toleration mechanism

HebMorph supports looking up a word in the dictionary while tolerating certain spelling errors, but in the current lemmatization process this is done only if no lemma is found for the word. This keeps the number of lemmas returned per token to a minimum, and even then a filtering mechanism is applied. Many words not recognized by this coverage test are written in a non-standard spelling, which is sometimes so common that the crowd considers it the correct spelling. These words, estimated at tens of thousands in the report linked above, are exactly where toleration should kick in. Using words from this report, we can now perfect the toleration mechanism so that it recognizes more cases, and does so with better precision. This alone should improve search relevance.
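To make the lookup order concrete, here is a minimal sketch of that flow. The type and method names (HebrewDictionary, exactLookup, tolerantLookup) are hypothetical stand-ins, not HebMorph's actual API:

```java
import java.util.List;

// Hypothetical types standing in for HebMorph's real API.
interface HebrewDictionary {
    List<String> exactLookup(String word);     // lemmas for an exact match
    List<String> tolerantLookup(String word);  // lookup tolerating common
                                               // spelling variations
}

class Lemmatizer {
    private final HebrewDictionary dict;

    Lemmatizer(HebrewDictionary dict) {
        this.dict = dict;
    }

    List<String> lemmatize(String word) {
        List<String> lemmas = dict.exactLookup(word);
        if (lemmas.isEmpty()) {
            // Toleration kicks in only when the exact lookup found nothing,
            // keeping the number of lemmas per token to a minimum...
            lemmas = dict.tolerantLookup(word);
            // ...and even then the candidates are filtered.
            lemmas = filter(lemmas);
        }
        return lemmas;
    }

    private List<String> filter(List<String> candidates) {
        // Placeholder for the filtering step, e.g. keeping only the most
        // plausible candidates; details omitted here.
        return candidates;
    }
}
```

The ordering matters: toleration is strictly a fallback, so correctly spelled, recognized words never pay its cost, and the extra lemmas it can produce never dilute exact matches.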

Valid OOV cases

Other words in the report may be valid words that are simply absent from the hspell dictionary; in other words, gaps in hspell's language coverage. From a quick look I had at the results, it seems there are not that many of them, and most will be proper names and loanwords, with hardly any verb inflections or the like. Words found to be valid will be reported to the hspell project, and are likely to be added to its dictionary as soon as they are found.

Coverage percentage

The plain numbers say coverage is at about 50%. Numbers don't lie, but they do mislead. Of the 460k words hspell couldn't "understand", only a small fraction are spelled correctly and should indeed be in the dictionary. Tolerators should handle the valid words that are spelled differently than hspell recognizes (see the Niqqud-less spelling section in my first post on HebMorph).

Since not all proper names should be recognized, and spelling errors (not differences!) should not be counted as coverage gaps at all, I think it is safe to say current coverage is pretty good, probably over 90%. To illustrate: a raw 50% figure implies roughly 900k distinct word types in the corpus, so even if a few tens of thousands of the 460k unknowns turn out to be genuine gaps, effective coverage still comes out well above 90%. Until we are able to analyze the results in full, I think it is safe to say hspell is the right choice for HebMorph.

