Thanks for the help -- we're hoping to switch from query-time to index-time synonym expansion for all of the reasons listed on the wiki <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46>, so it will be great to get this resolved.
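For context, this is roughly the field type we're planning to move to -- a sketch rather than our final config, and the type name and synonyms file name are placeholders:

```xml
<fieldType name="text_syn" class="solr.TextField">
  <!-- expand synonyms at index time only -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <!-- no synonym filter at query time -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```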
I created SOLR-1445 <https://issues.apache.org/jira/browse/SOLR-1445>, though the problem seems to be caused by LUCENE-1919 <https://issues.apache.org/jira/browse/LUCENE-1919>, as you noted. Is there a recommended workaround that avoids combining the new and old APIs? Would a version of SynonymFilter that also implemented incrementToken() be helpful?

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog <goks...@gmail.com> wrote:
> > Please add a Jira issue for this. It will get more attention there.
> >
> > BTW, thanks for creating such a precise bug report.
>
> +1
>
> Thanks, I had missed this. This is serious, and looks due to a Lucene
> back compat break.
> I've added the testcase and can confirm the bug.
>
> -Yonik
> http://www.lucidimagination.com
>
> > On Mon, Sep 14, 2009 at 1:52 PM, Gregg Donovan <gregg...@gmail.com> wrote:
> >> I'm running into an odd issue with multi-word synonyms in Solr (using
> >> the latest [9/14/09] nightly). Things generally seem to work as
> >> expected, but I sometimes see words that are the leading term in a
> >> multi-word synonym being replaced with the token that follows them in
> >> the stream when they should just be ignored (i.e. there's no synonym
> >> match for just that token). When I preview the analysis at
> >> admin/analysis.jsp it looks fine, but at runtime I see problems like
> >> the one in the unit test below. It's a simple case, so I assume I'm
> >> making some sort of configuration and/or usage error.
> >>
> >> package org.apache.solr.analysis;
> >>
> >> import java.io.*;
> >> import java.util.*;
> >> import org.apache.lucene.analysis.WhitespaceTokenizer;
> >> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
> >>
> >> public class TestMultiWordSynonyms extends junit.framework.TestCase {
> >>
> >>   public void testMultiWordSynonyms() throws IOException {
> >>     List<String> rules = new ArrayList<String>();
> >>     rules.add( "a b c,d" );
> >>     SynonymMap synMap = new SynonymMap( true );
> >>     SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null );
> >>
> >>     SynonymFilter ts = new SynonymFilter(
> >>         new WhitespaceTokenizer( new StringReader("a e") ), synMap );
> >>     TermAttribute termAtt = (TermAttribute)
> >>         ts.getAttribute( TermAttribute.class );
> >>
> >>     ts.reset();
> >>     List<String> tokens = new ArrayList<String>();
> >>     while (ts.incrementToken()) tokens.add( termAtt.term() );
> >>
> >>     // This fails because ["e","e"] is the value of the token stream
> >>     assertEquals( Arrays.asList("a","e"), tokens );
> >>   }
> >> }
> >>
> >> Any help would be much appreciated. Thanks.
> >>
> >> --Gregg
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
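For anyone reasoning about the failure mode: here's a self-contained toy sketch in plain Java (no Lucene or Solr classes -- the class name, method, and hardcoded rule are mine, purely for illustration) of what correct multi-word matching should do. The key invariant is that when the lookahead for a multi-word rule like "a b c" => "d" fails partway through, the buffered tokens must be replayed unchanged -- which is exactly the invariant the bug violates when "a e" comes back as ["e","e"].

```java
import java.util.*;

// Toy sketch of multi-word synonym replacement over a token list.
// A single rule, "a b c" -> "d", is hardcoded for illustration.
public class SynonymSketch {
    static final List<String> PATTERN = Arrays.asList("a", "b", "c");
    static final String REPLACEMENT = "d";

    static List<String> expand(List<String> input) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < input.size()) {
            // Try to match the whole multi-word pattern starting at i.
            int j = 0;
            while (j < PATTERN.size() && i + j < input.size()
                    && input.get(i + j).equals(PATTERN.get(j))) {
                j++;
            }
            if (j == PATTERN.size()) {
                out.add(REPLACEMENT);  // full match: emit the synonym
                i += j;
            } else {
                // Partial or no match: the buffered token is replayed
                // unchanged, never overwritten by the lookahead token.
                out.add(input.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("a", "e")));       // [a, e]
        System.out.println(expand(Arrays.asList("a", "b", "c")));  // [d]
    }
}
```

A real filter has to do this buffering incrementally inside incrementToken(), which is what makes mixing the old next(Token) API with the new attribute-based API so fragile here.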