Thanks for the help -- we're hoping to switch from query-time to index-time synonym expansion for all of the reasons listed on the wiki <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46>, so it will be great to get this resolved.
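For context, this is roughly the field type we're planning to move to -- a sketch rather than our final config, and the type name and synonyms file name are placeholders:

```xml
<fieldType name="text_syn" class="solr.TextField">
  <!-- expand synonyms at index time only -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <!-- no synonym filter at query time -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```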
I created SOLR-1445 <https://issues.apache.org/jira/browse/SOLR-1445>, though the problem seems to be caused by LUCENE-1919 <https://issues.apache.org/jira/browse/LUCENE-1919>, as you noted. Is there a recommended workaround that avoids combining the new and old APIs? Would a version of SynonymFilter that also implemented incrementToken() be helpful?

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog <goks...@gmail.com> wrote:
> > Please add a Jira issue for this. It will get more attention there.
> >
> > BTW, thanks for creating such a precise bug report.
>
> +1
>
> Thanks, I had missed this. This is serious, and looks due to a Lucene
> back compat break.
> I've added the testcase and can confirm the bug.
>
> -Yonik
> http://www.lucidimagination.com
>
> > On Mon, Sep 14, 2009 at 1:52 PM, Gregg Donovan <gregg...@gmail.com> wrote:
> >> I'm running into an odd issue with multi-word synonyms in Solr (using
> >> the latest [9/14/09] nightly). Things generally seem to work as
> >> expected, but I sometimes see words that are the leading term in a
> >> multi-word synonym being replaced with the token that follows them in
> >> the stream when they should just be ignored (i.e. there's no synonym
> >> match for just that token). When I preview the analysis at
> >> admin/analysis.jsp it looks fine, but at runtime I see problems like
> >> the one in the unit test below. It's a simple case, so I assume I'm
> >> making some sort of configuration and/or usage error.
> >>
> >> package org.apache.solr.analysis;
> >>
> >> import java.io.*;
> >> import java.util.*;
> >> import org.apache.lucene.analysis.WhitespaceTokenizer;
> >> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
> >>
> >> public class TestMultiWordSynonyms extends junit.framework.TestCase {
> >>
> >>   public void testMultiWordSynonyms() throws IOException {
> >>     List<String> rules = new ArrayList<String>();
> >>     rules.add( "a b c,d" );
> >>     SynonymMap synMap = new SynonymMap( true );
> >>     SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null );
> >>
> >>     SynonymFilter ts = new SynonymFilter(
> >>         new WhitespaceTokenizer( new StringReader("a e") ), synMap );
> >>     TermAttribute termAtt = (TermAttribute)
> >>         ts.getAttribute( TermAttribute.class );
> >>
> >>     ts.reset();
> >>     List<String> tokens = new ArrayList<String>();
> >>     while (ts.incrementToken()) tokens.add( termAtt.term() );
> >>
> >>     // This fails because ["e","e"] is the value of the token stream
> >>     assertEquals( Arrays.asList("a","e"), tokens );
> >>   }
> >> }
> >>
> >> Any help would be much appreciated. Thanks.
> >>
> >> --Gregg
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
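For anyone reasoning about the failure mode: here's a self-contained toy sketch in plain Java (no Lucene or Solr classes -- the class name, method, and hardcoded rule are mine, purely for illustration) of what correct multi-word matching should do. The key invariant is that when the lookahead for a multi-word rule like "a b c" => "d" fails partway through, the buffered tokens must be replayed unchanged -- which is exactly the invariant the bug violates when "a e" comes back as ["e","e"].

```java
import java.util.*;

// Toy sketch of multi-word synonym replacement over a token list.
// A single rule, "a b c" -> "d", is hardcoded for illustration.
public class SynonymSketch {
    static final List<String> PATTERN = Arrays.asList("a", "b", "c");
    static final String REPLACEMENT = "d";

    static List<String> expand(List<String> input) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < input.size()) {
            // Try to match the whole multi-word pattern starting at i.
            int j = 0;
            while (j < PATTERN.size() && i + j < input.size()
                    && input.get(i + j).equals(PATTERN.get(j))) {
                j++;
            }
            if (j == PATTERN.size()) {
                out.add(REPLACEMENT);  // full match: emit the synonym
                i += j;
            } else {
                // Partial or no match: the buffered token is replayed
                // unchanged, never overwritten by the lookahead token.
                out.add(input.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("a", "e")));       // [a, e]
        System.out.println(expand(Arrays.asList("a", "b", "c")));  // [d]
    }
}
```

A real filter has to do this buffering incrementally inside incrementToken(), which is what makes mixing the old next(Token) API with the new attribute-based API so fragile here.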