Re: Can't mix Synonyms with Shingles?

Robert Muir Wed, 10 Aug 2011 17:13:36 -0700

On Wed, Aug 10, 2011 at 7:10 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
> After some further playing around, I think I understand what's going on. 
> Because the SynonymFilterFactory pays attention to term position when it 
> inserts a multi-word synonym, I had assumed it scanned for matches in a way 
> that respected term position as well. (i.e., for a two-word synonym, I assumed
> it would try to find the second word in position n+1 if it found the first
> word in position n.)
>
> This does not appear to be the case. It appears to find multi-word synonym 
> matches by simply walking the list of terms, exhausting all the terms in 
> position one before looking at any terms in position two.


This is correct, and I think doing otherwise would cause seriously bad
performance. Say you have a tokenstream like (A B C) (D E F) (G H I) ...,
where each parenthesized group is the set of tokens at one position, and
you are matching multi-word synonyms. The work can potentially explode,
at least in terms of CPU time and all the state saving/restoring/copying
the filter would need once it starts treating the tokenstream as more of
a token confusion network. It gets worse if you think about position
increments > 1.
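
To make the difference concrete, here is a rough Python sketch. It is not
the SynonymFilter code. The (term, position increment) token model and
the names flat_match, position_aware_match and path_count are all made up
for illustration, and the sketch ignores the attribute state
capture/restore a real filter would need. It shows a two-word rule
missing on a shingled stream when tokens are walked in stream order,
matching when positions are respected, and how many token sequences the
(A B C) (D E F) (G H I) stream contains.

```python
# Hypothetical token model: (term, position_increment).
# An increment of 0 stacks the token on the previous token's position.

def group_by_position(tokens):
    """Map absolute position -> list of terms at that position."""
    by_pos = {}
    pos = 0
    for term, inc in tokens:
        pos += inc
        by_pos.setdefault(pos, []).append(term)
    return by_pos

def flat_match(tokens, phrase):
    """Match phrase against consecutive tokens in stream order, ignoring positions."""
    terms = [term for term, _ in tokens]
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))

def position_aware_match(tokens, phrase):
    """Match phrase word k at position p + k for some start position p."""
    by_pos = group_by_position(tokens)
    return any(
        all(phrase[k] in by_pos.get(p + k, []) for k in range(len(phrase)))
        for p in by_pos
    )

def path_count(tokens):
    """Number of distinct one-term-per-position sequences through the stream."""
    count = 1
    for terms in group_by_position(tokens).values():
        count *= len(terms)
    return count

# "new york city" after a shingle filter: shingles share the position of
# their first word.
shingled = [("new", 1), ("new york", 0), ("york", 1), ("york city", 0), ("city", 1)]

# The stream from the example above: three tokens at each of three positions.
lattice = [("A", 1), ("B", 0), ("C", 0),
           ("D", 1), ("E", 0), ("F", 0),
           ("G", 1), ("H", 0), ("I", 0)]

print(flat_match(shingled, ["new", "york"]))            # stream order: no match
print(position_aware_match(shingled, ["new", "york"]))  # by position: match
print(path_count(lattice))                              # 3 * 3 * 3 sequences
```

The path count grows multiplicatively with each stacked position, which
is the kind of growth a position-aware matcher would have to keep under
control.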

The limitation is documented, at least in recent svn trunk:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymFilter.java

-- 
lucidimagination.com
