All blog posts from June 2011

Practical Hebrew search - Open2011 presentation

English posts, HebMorph, IR, Lucene, Lucene.Net

Attached to this post is the presentation I gave today at Open2011 in Tel-Aviv.

The sample app can be found here: http://hebmorph.code972.com/. It is also going to be HebMorph's home in a few weeks, once I'm done generating all the necessary content.

As promised, I will be posting more details on some interesting findings on Hebrew search, along with comparisons with Google search. I want those posts to be a bit more comprehensive, so they will be up in a few weeks' time.


Some words on HebMorph's licensing

English posts, HebMorph, Open source

Without being a lawyer, and trying real hard not to become one, it is not easy being the author of an open-source project. Apparently it takes quite a lot of thought, and definitely a lot of reading, to make sure the code you release carries a license that correctly expresses your intent. If you don't pay enough attention, you will probably end up with a license that doesn't enforce what you intended at all.

This is what happened to me with HebMorph, and this post is here to clarify everything that needs clarifying, and to explain the reasoning behind the recent change to HebMorph's license.

Like I said in a recent e-mail conversation on HebMorph's mailing list, this project is all about research and the sharing of information. We WILL reach our goals, some sooner than others, and when we do, the knowledge we gathered will be free for all to learn from and use. However, since we have a very long road ahead of us, I needed to make sure this project can support itself. I have spent a lot of time researching options, charting a path, writing code, testing approaches and more, and to be able to keep doing that in large blocks of time (and not just occasionally) we needed income.

This is when I decided to charge for any commercial use made of code released by the HebMorph project. It is actually pretty simple and very fair: I release my work for all to see and use without any charge. If, however, you make a profit from my work, I'd like you to support the project. Since we are aiming at quite a small market, relying on donations won't cut it, so I decided to use a license which would allow me to enforce that.

I have explicitly stated more than once, and in more than one place, that I'm not after anyone's money. This project grew out of sheer interest, and it will definitely continue to evolve. This is why HebMorph doesn't have a price tag; if you want to use it in a commercial product, contact me and we'll figure something out - an arrangement that is fair for both parties.

Unaware of many legal details, I chose GPLv2 as HebMorph's license. It seemed promising: any derivative work would require the consuming application to be released under GPLv2 as well, and since most companies would want to avoid that, they would pay for a commercial license. It was also the license hspell uses, and since some parts of HebMorph are definitely a derivative work of hspell, HebMorph had to be released under GPLv2 or a compatible license. Problem solved - or so I thought.

Following a recent user inquiry, I found out my license of choice was in fact not suitable at all. First, it has many flaws and loopholes that make it quite ineffective at enforcing what I wanted it to. It is practically the last license I would choose for any modern software; here's a good read on why.

Second, and no less important, any GPLv2 software is incompatible with Lucene/Solr, which is released under the Apache Software License. Since our main platform is Lucene, we can't afford that.

Now that I have realized all this, I've changed HebMorph's license to AGPLv3. This license is based on GPLv3 (itself an improvement over GPLv2), but adds a paragraph that defines "use" in a way that also covers websites and web services, thereby sealing off the infamous GPLv2 loophole. Since AGPLv3 isn't compatible with GPLv2, I had to get explicit permission from hspell's authors to still be able to use their work, and they granted it - with the restriction that the hspell files distributed with HebMorph may only be used for search purposes.

Now, you may notice how frequently I used the word "fair" when describing the license selection process. This is because I'm not here to chase down and seal loopholes, or to make sure everyone making a profit from my work is paying back; I enjoy doing other things. I expect users to be fair: if they profit from a product that uses HebMorph in one way or another, I expect them to give back. There are probably thousands of ways to bypass any license, AGPL included, so I'm making it clear that I release HebMorph under the AGPL and also under the expectation of fairness. At some point I actually considered using the RPL, but decided it was too restrictive and would probably create more problems than it would solve. So I selected AGPLv3, and let me say this again: please act in good faith.

And just to make sure: as far as I'm concerned, using any HebMorph code through Solr is just the same as using it through Lucene. Solr dynamically links the jars in what falls under the very definition of "derivative work", and in case that was in doubt, it isn't now. I'm explicitly spelling this out, so even if there is a loophole here (and I'm quite certain there is not), it is now covered by the license's definition of "use": if your application uses Solr, and Solr uses HebMorph, your application is effectively using AGPLv3 software and needs to be AGPLv3 as well.

Hopefully this clarifies some things about HebMorph, and as always I'd love to hear any thoughts on this.

Due to the unintended conflict of licenses, any previous version of HebMorph being used with Lucene/Solr has to move to the new license.

As before, OSS projects and non-profit closed-source projects are welcome to use HebMorph free of charge, but the latter should contact me in advance to discuss terms.


FastVectorHighlighter issues revisited

In a previous post I described how to use FVH to highlight content that went through filters / readers like HTMLStripCharFilter during the analysis process. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew any CharFilter or Tokenizer implementation should store term positions and offsets that take into account any skips made in the content, but since it didn't work for me I didn't care to look any deeper; I just made that workaround and ran to tell.

So, don't use that. Instead, rely on your analyzer to store correct positions and offsets, and on FVH to use them when highlighting. As it happens, the custom analyzers I used suffered from a nasty bug that prevented them from accounting for skips. Now that I have fixed that, it all works like a charm.
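To illustrate what "accounting for skips" means, here is a conceptual sketch (plain C#, not the actual Lucene API): when HTMLStripCharFilter turns "<b>foo</b> bar" into "foo bar", it must record how many characters were skipped at each point, so that token offsets in the filtered text can be corrected back into the original HTML - which is exactly what FVH needs to cut valid fragments:

// Conceptual sketch only - the real bookkeeping happens inside the CharFilter.
// original: "<b>foo</b> bar"  ->  filtered: "foo bar"
// key = offset in the filtered text; value = cumulative chars skipped up to that point
var corrections = new SortedDictionary<int, int> { { 0, 3 }, { 3, 7 } };
Func<int, int> correct = off =>
{
    var diff = 0;
    foreach (var kv in corrections)
    {
        if (kv.Key > off) break;
        diff = kv.Value;
    }
    return off + diff;
};
Console.WriteLine(correct(0)); // 3  - 'f' of "foo" in the original HTML
Console.WriteLine(correct(4)); // 11 - 'b' of "bar" in the original HTML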

However, two issues still remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases a fragment that ends up on your webpage will ruin the page layout because of a stubborn misplaced </div> tag that found its way into it. Escaping all <'s and >'s is not a really good solution either - you don't want your fragments to contain ugly-looking HTML tags.

The second issue was duplicate content. I wanted to process the content more than once - index it with two or more analyzers - but didn't want to store it more than once, since it was exactly the same content. To still be able to highlight on those other fields as well, I needed FVH to allow me to specify a field name to pull the stored contents from.

Solving the first problem was quite easy, and required nothing more than a simple extension method, called on the fragment string after receiving it from FVH. To be on the safe side, I made sure to ask for a larger fragment than I originally intended, so even if a lot of HTML noise is present, some context will remain in the fragment:

public static string HtmlStripFragment(this string fragment)
{
    if (string.IsNullOrEmpty(fragment)) return string.Empty;

    var sb = new StringBuilder(fragment.Length);
    bool withinHtml = false, first = true; // `first` detects a fragment that starts mid-tag
    foreach (var c in fragment)
    {
        if (c == '>')
        {
            if (first) sb.Length = 0;
            withinHtml = false;
            first = false;
            continue;
        }
        if (withinHtml)
            continue;
        if (c == '<')
        {
            first = false;
            withinHtml = true;
            continue;
        }
        sb.Append(c);
    }

    // FVH was instantiated with "[b]" and "[/b]" as pre- and post-tags for highlighting,
    // so they won't get lost in translation
    return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
}
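A quick usage sketch (the input strings are made up for illustration; recall FVH was configured with "[b]"/"[/b]" as highlight tags):

// A fragment that starts in the middle of a tag - everything before the first '>' is discarded:
var a = "ss=\"header\">Hello [b]world[/b]".HtmlStripFragment();
// a == "Hello <b>world</b>..."

// Complete and partial tags are stripped, and the highlight markers become real <b> tags:
var b = "</div><p>some [b]highlighted[/b] text".HtmlStripFragment();
// b == "some <b>highlighted</b> text..."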

The second issue was solved by subclassing BaseFragmentsBuilder, only this time it was a bit less intrusive:

public class CustomFragmentsBuilder : BaseFragmentsBuilder
{
    public string ContentFieldName { get; protected set; }

    /// <summary>
    /// a constructor.
    /// </summary>
    public CustomFragmentsBuilder()
    {
    }

    public CustomFragmentsBuilder(string contentFieldName)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// a constructor.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public CustomFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }

    public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// Do nothing; return the source list as-is.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }

    protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
    {
        var field = ContentFieldName ?? fieldName;
        var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
        return doc.GetFields(field); // according to Document class javadoc, this never returns null
    }
}
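Wiring it up looks much like the FVH configuration shown in the previous post, only now highlighting on one field while pulling the stored text from another. The field names "Content" and "Content_he" below are made up for illustration - "Content_he" stands in for a second, differently-analyzed field that is indexed but not stored:

var fvh = new FastVectorHighlighter(FastVectorHighlighter.DEFAULT_PHRASE_HIGHLIGHT,
                    FastVectorHighlighter.DEFAULT_FIELD_MATCH,
                    new SimpleFragListBuilder(),
                    new CustomFragmentsBuilder("Content", new[] { "[b]" }, new[] { "[/b]" }));

var fq = fvh.GetFieldQuery(query);
// Highlight matches found on the "Content_he" field, but build the fragment
// from the text stored once in the "Content" field
var fragment = fvh.GetBestFragment(fq, searcher.GetIndexReader(), hits[i].doc, "Content_he", 300);
if (fragment != null)
    fragment = fragment.HtmlStripFragment();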

And as always, the usual disclaimer applies: this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve it, if such exist.


Custom tokenization and Lucene's FastVectorHighlighter

NOTE: The approach described below is wrong; you may want to read the follow-up post instead.

Perhaps you have tackled this before: you want to use Lucene's FastVectorHighlighter (aka FVH), but since you have a custom CharFilter in your analysis chain, the highlighter fails to produce valid fragments.

In my particular case, I used HTMLStripCharFilter (available to Lucene.Net through my pet contrib project) to extract text content from HTML pages and then pass it through the rest of the analysis process. This confused FVH, since it was taking the full content from the store, where the HTML was still present, while the token positions did not take that into account. Any other custom CharFilter added to the analysis chain is going to cause the same trouble.

To overcome this, I needed to make sure FVH is aware of all the content-stripping operations made before or during tokenization. All I had to do was implement a custom FragmentsBuilder, which looks as follows (.NET code; a Java version would look almost identical):

public class HtmlFragmentsBuilder : BaseFragmentsBuilder
{
    /// <summary>
    /// a constructor.
    /// </summary>
    public HtmlFragmentsBuilder()
        : base()
    {
    }

    /// <summary>
    /// a constructor.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public HtmlFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }

    /// <summary>
    /// Do nothing; return the source list as-is.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }

    protected override String GetFragmentSource(StringBuilder buffer, int[] index, Field[] values, int startOffset, int endOffset)
    {
        string fieldText;
        while (buffer.Length < endOffset && index[0] < values.Length)
        {
            fieldText = GetFilteredFieldText(values[index[0]]);
            if (index[0] > 0 && values[index[0]].IsTokenized() && fieldText.Length > 0)
                buffer.Append(' ');
            buffer.Append(fieldText);
            ++(index[0]);
        }
        var eo = buffer.Length < endOffset ? buffer.Length : endOffset;
        return buffer.ToString().Substring(startOffset, eo - startOffset);
    }

    /// <summary>
    /// Gets the field text, after applying custom filtering
    /// </summary>
    /// <param name="field"></param>
    /// <returns></returns>
    protected string GetFilteredFieldText(Field field)
    {
        var reader = CharReader.Get(new StringReader(field.StringValue()));
        reader = new HTMLStripCharFilter(reader);

        int r;
        var sb = new StringBuilder();
        while ((r = reader.Read()) != -1)
        {
            sb.Append((char)r);
        }
        return sb.ToString();
    }
}

FVH will then need to be configured to use it:

var fvh = new FastVectorHighlighter(FastVectorHighlighter.DEFAULT_PHRASE_HIGHLIGHT,
                    FastVectorHighlighter.DEFAULT_FIELD_MATCH,
                    new SimpleFragListBuilder(), new HtmlFragmentsBuilder());
// ...
var fq = fvh.GetFieldQuery(query);
var fragment = fvh.GetBestFragment(fq, searcher.GetIndexReader(), hits[i].doc, "Content", 300);

If you're using Lucene.Net, you'll have to make sure this patch is applied to your FVH before this will compile.

That was the easiest and fastest way to get this working. Perhaps I could make it more generic, or change the original implementation to allow it and submit that as a patch. Maybe I'll do it someday. Or you could...


Announcing: Lucene.Net.Contrib

Whenever you start doing real-world stuff with Lucene you find yourself hacking and extending. That's the beauty of Lucene - it has so many extension points, and you can write almost every part of it from scratch to match your requirements.

Lately I've been working on some stuff relating to both RavenDB and HebMorph (separately...), and it became quite annoying to keep track of Lucene.Net extensions that are not part of the core project. In fact, several contrib packages (rather: projects) that are part of the original Lucene.Net project are hardly maintained and are not so friendly to use.

So I thought it was time to give all those a home. I created a new GitHub repository called Lucene.Net.Contrib, where all those enhancements, large or small, should go. Once there's enough to go on, I'll create a NuGet package and make it easily accessible.

Having a centralized location for all of those has only benefits. Bugs can be found and fixed, a lot of time can be saved by checking whether someone has already ported or written the stuff you need, and most important of all: finding new opportunities. Java Lucene has had all that for quite some time now, and since I've been doing a lot of Lucene.Net work lately, I thought I'd make my small contribution...

This is not trying to compete with Lucene.Net's contrib section; it is just intended to be a much more flexible, fast-growing community of extensions, most of which will probably be small in size.

What's currently there (not much - and only analysis/search related):

  • HTMLStripCharFilter - by plugging this into the analysis chain you can have any analyzer strip all HTML tags and take the stripped positions into consideration (useful for later highlighting).
  • ReverseStringFilter - reverses a string; useful for cases where you need to allow leading wildcards but never trailing wildcards (see the sketch after this list).
  • BinaryCoordSimilarity - a Lucene Similarity implementation which, in a multi-word query scenario, penalizes all results that do not contain ALL search terms.
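To make the leading-wildcard trick concrete, here is a minimal sketch (plain C# using System.Linq; the terms are made up): index each term reversed, then rewrite a leading-wildcard query into a trailing-wildcard query over the reversed field, which Lucene can serve efficiently from its term dictionary:

// Indexing time: ReverseStringFilter stores "apple" as "elppa"
var indexedTerm = new string("apple".Reverse().ToArray());                      // "elppa"

// Query time: the slow leading-wildcard query *pple on the original field
// becomes a fast trailing-wildcard query on the reversed field
var userQuery = "*pple";
var rewritten = new string(userQuery.TrimStart('*').Reverse().ToArray()) + "*"; // "elpp*"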

Other stuff that is probably going to be included (or makes sense to):

All code is released under the same Apache license as Lucene and Lucene.Net, unless otherwise specified (but only permissive licenses are allowed in).

Have you put your Lucene.Net extensions in yet? Fork away!

GitHub repo: https://github.com/synhershko/Lucene.Net.Contrib


Some updates on NAppUpdate

.NET, English posts, NAppUpdate, WinForms, WPF

After several issues with their previous auto-update mechanism, the Hibernating Rhinos profilers were updated two weeks ago to use NAppUpdate. Once again it proved to be a very flexible and robust library, and several updates have already been pushed to hundreds (thousands?) of users without any problems.

Before the profilers could start using NAppUpdate I had to make some updates to the library, namely: catch and expose the last error thrown (if any); fix an issue with UAC prompts popping up for updates on Windows 7 and Vista; better support for promptly cancelling a download mid-way; and a few other fixes and updates. These fixes are already available on GitHub, and probably invalidate the 0.1 release...

Implementing NAppUpdate required custom implementations of a FeedReader and a Task, and the whole process didn't take more than an hour to code (testing is another story...). The profiler's AutoUpdateFeedReader makes a simple check against a very simple one-liner feed containing the profiler's current version, and its AutoUpdateTask downloads the latest build as a zip file from the server, extracts it to a temporary folder and, when told to, overwrites the old files with the new ones in bulk.
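The feed reader itself is not shown in this post, but a minimal sketch could look something like the following. This assumes NAppUpdate's IUpdateFeedReader contract of the time (a Read method turning the raw feed string into a list of tasks); the version-comparison details are made up for illustration:

public class AutoUpdateFeedReader : IUpdateFeedReader
{
    public IList<IUpdateTask> Read(string feed)
    {
        var tasks = new List<IUpdateTask>();

        // The feed is a one-liner containing just the latest version, e.g. "1.2.3"
        var latestVersion = new Version(feed.Trim());
        var currentVersion = Assembly.GetExecutingAssembly().GetName().Version;

        if (latestVersion > currentVersion)
            tasks.Add(new AutoUpdateTask()); // one task: download and extract the new build

        return tasks;
    }
}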

The actual task looks something like this - note the logical separation into steps, which are executed sequentially:

public bool Prepare(IUpdateSource source)
{
    // Clear temp folder
    if (Directory.Exists(updateDirectory))
    {
        try
        {
            Directory.Delete(updateDirectory, true);
        }
        catch {}
    }

    Directory.CreateDirectory(updateDirectory);

    // Download the zip to a temp file that is deleted automatically when the app exits
    string zipLocation = null;
    try
    {
        if (!source.GetData(LatestVersionDownloadUrl, string.Empty, ref zipLocation))
            return false;
    }
    catch (Exception ex)
    {
        Log.Error("Cannot get update package from source", ex);
        throw new UpdateProcessFailedException("Couldn't get Data from source", ex);
    }

    if (string.IsNullOrEmpty(zipLocation))
        return false;

    // Unzip to temp folder; no need to delete the zip file as this will be done by the OS
    return Extract(zipLocation);
}

public bool Execute()
{
    // since all we do is a cold update, nothing other than backup needs to happen here

    return true;
}

public IEnumerator<KeyValuePair<string, object>> GetColdUpdates()
{
    if (filesList == null)
        yield break;

    foreach (var file in filesList)
    {
        Log.DebugFormat("Registering file {0} to be updated with {1}", file, Path.Combine(updateDirectory, file));
        yield return new KeyValuePair<string, object>(file, Path.Combine(updateDirectory, file));
    }
}
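The Extract method referenced from Prepare isn't shown above; here is a hedged sketch of what it might look like, using SharpZipLib's FastZip and assuming filesList is a List<string> (the original may well have used a different zip library):

private bool Extract(string zipLocation)
{
    // Unzip the downloaded package into the temp update folder
    new FastZip().ExtractZip(zipLocation, updateDirectory, null); // null = no file filter

    // Record the extracted files, relative to the temp folder, for GetColdUpdates()
    filesList = Directory.GetFiles(updateDirectory, "*", SearchOption.AllDirectories)
                         .Select(f => f.Substring(updateDirectory.Length + 1))
                         .ToList();

    return filesList.Count > 0;
}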

Triggering the actual check for updates is a one-liner (after configuring the UpdateManager instance with a feed URL, a FeedReader and all that; the task is created and returned by the custom FeedReader):

UpdateManager.Instance.CheckForUpdateAsync(StartDownloadingUpdate);

// ...

private void StartDownloadingUpdate(int updates)
{
    if (updates == 0) // no updates are available
        return;

    if (updates < 0) // an error has occurred
    {
        Log.ErrorFormat("Error while checking for updates: {0}", UpdateManager.Instance.LatestError);
        return;
    }

    // If updates are found, start downloading them async
    UpdateManager.Instance.PrepareUpdatesAsync(success =>
    {
        if (!success)
        {
            if (UpdateManager.Instance.LatestError != null)
            {
                Log.ErrorFormat("Error downloading updates: {0}", UpdateManager.Instance.LatestError);
            }
            return;
        }

        // Notify the user of the update, and call UpdateManager.Instance.ApplyUpdates() when ready
    });
}

It couldn't be simpler than that, and it just works...

This has triggered some interest in the project; the wheels are now in motion again, and hopefully new features will be introduced soon, followed by a 0.2 release.

As always, you can grab the sources and file bugs here. Bugs and feature requests can also be submitted to the mailing list.

