Synonyms and stemming revisited

Christian Vogler Sat, 30 Aug 2008 10:04:24 -0700

I apologize for beating a dead horse, but upon searching the archives,
I found no satisfactory resolution. In multiple messages there, Hoss
recommends that the synonym filter be put before the stemmer, and that
synonym stemming at query time should then work as expected.
Unfortunately, this is only true for the first word that appears in the
synonym list.


Consider the following simplified index-time configuration:

      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="test_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>


Furthermore, consider the following synonym definition:

reise,urlaub

(These mean travel and vacation, respectively)

Both words can appear with many different endings, such as:

reise, reisen, reist, ...
urlaub,urlaube,urlauben, ...

The stemmer reduces all these to "reis" and "urlaub", respectively.

Now, suppose that a document contains "reise" at index time. According
to the filter order, this will be expanded by the synonym filter to:

reise urlaub

and then stemmed to:

reis urlaub

So far, so good. In this case, queries for urlaube, reisen, etc., will
all hit the indexed document.

However, consider a document that contains "reisen" at index time. As
the synonym filter comes first and only sees the unstemmed form, there
is no match for the synonym, and the analyzer proceeds to index this
document with "reisen" -> "reis" only, with "urlaub" missing.

Hence, queries such as "reisen" or "reist" will hit, but "urlaub",
"urlaube", etc. will not.

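To make the mismatch concrete, here is a minimal Python sketch of the filter order above. It is not Solr code: toy_stem is a crude hypothetical stand-in for the German2 Snowball stemmer that only handles the endings mentioned in this message, and index_tokens, query_tokens and matches are names made up for illustration. It also assumes that the query side only lowercases and stems, since the query analyzer is not shown here.

```python
# Toy model of the index-time chain: synonym expansion first, then stemming.
# Nothing here is Solr code; all names are hypothetical.

SYNONYM_GROUPS = [("reise", "urlaub")]  # the single definition from this message


def toy_stem(token):
    """Crude stand-in for the German2 Snowball stemmer (an assumption, not the real algorithm)."""
    for suffix in ("en", "e", "t"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def expand_synonyms(token):
    """Return the whole synonym group if the exact token is listed, else just the token."""
    for group in SYNONYM_GROUPS:
        if token in group:
            return list(group)
    return [token]


def index_tokens(text):
    """Expand synonyms on the surface form, then stem (the order used in the config)."""
    tokens = set()
    for word in text.lower().split():  # lowercasing mimics ignoreCase="true"
        for expanded in expand_synonyms(word):
            tokens.add(toy_stem(expanded))
    return tokens


def query_tokens(text):
    """Assumed query-side analysis: lowercase and stem only, no synonym expansion."""
    return {toy_stem(word) for word in text.lower().split()}


def matches(document, query):
    """A document hits if it shares at least one indexed token with the query."""
    return bool(index_tokens(document) & query_tokens(query))


for doc in ("reise", "reisen"):
    print(doc, sorted(index_tokens(doc)), matches(doc, "urlaube"))
```

In this toy model, the document "reise" is indexed with both stems and is found by the query "urlaube", while the document "reisen" is indexed with "reis" only and is missed.
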
I see two solutions:

1. Put all possible endings in the synonym file. I do not really like
   this solution, as it would make the file very large, and it is too
   easy to miss some specific ending.

2. Run the stemmer before the synonym filter, in which case the synonym
   definitions need to appear in their stemmed forms.

For the second option, am I missing something, or does the conversion
of the synonym text file need to be done by hand at the moment? I
suppose that it would not be too difficult to write some code that does
this conversion automatically, so that the synonym definition

reise,urlaub

is converted to

reis,urlaub

which should then solve all problems.
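
The automatic conversion could look roughly like the following Python sketch. It rewrites each line of a synonym file with every term replaced by its stem. All function names are hypothetical, and toy_stem is again only a crude placeholder: a real conversion would have to call the same German2 Snowball stemmer that the analyzer uses, otherwise the stems in the file and in the index would not agree. The sketch assumes the usual synonym file layout of comma-separated groups, "=>" mappings and "#" comment lines.

```python
# Sketch: stem every term in a synonym file so it can be used after the stemmer.
# All names are hypothetical; toy_stem is only a placeholder for the real stemmer.


def toy_stem(token):
    """Crude stand-in for the German2 Snowball stemmer (an assumption, not the real algorithm)."""
    for suffix in ("en", "e", "t"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def stem_terms(terms, stem):
    """Stem each comma-separated term word by word, dropping duplicates but keeping order."""
    seen = []
    for term in terms.split(","):
        stemmed = " ".join(stem(word) for word in term.strip().lower().split())
        if stemmed and stemmed not in seen:
            seen.append(stemmed)
    return ",".join(seen)


def stem_synonym_line(line, stem=toy_stem):
    """Convert one line; comments and blank lines pass through unchanged."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line.rstrip("\n")
    if "=>" in stripped:  # explicit mapping: stem both sides separately
        left, right = stripped.split("=>", 1)
        return stem_terms(left, stem) + " => " + stem_terms(right, stem)
    return stem_terms(stripped, stem)


def convert_synonyms(text, stem=toy_stem):
    """Convert the full contents of a synonym file, line by line."""
    return "\n".join(stem_synonym_line(line, stem) for line in text.splitlines())


print(convert_synonyms("# travel\nreise,urlaub"))
```

With this placeholder stemmer, the definition "reise,urlaub" comes out as "reis,urlaub". Terms that collapse to the same stem are merged, so a line such as "reise,reisen,urlaub" yields the same result.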

Best regards
- Christian
-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing
Athens, Greece
