On Wed, Aug 10, 2011 at 7:10 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
> After some further playing around, I think I understand what's going on.
> Because the SynonymFilterFactory pays attention to term position when it
> inserts a multi-word synonym, I had assumed it scanned for matches in a way
> that respected term position as well. (ie, for a two-word synonym, I assumed
> it would try to find the second word in position n+1 if it found the first
> word in position n)
>
> This does not appear to be the case. It appears to find multi-word synonym
> matches by simply walking the list of terms, exhausting all the terms in
> position one before looking at any terms in position two.

This is correct, and I think anything else would cause seriously bad
performance. If you have a token stream like (A B C) (D E F) (G H I) ...
and are matching multi-word synonyms, the work can explode, at least in
terms of CPU time and all the state saving, restoring, and copying the
filter would need once it starts treating the token stream as more of a
token confusion network. It gets worse if you think about position
increments > 1.

At least recently in svn, the limitation is documented:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymFilter.java

--
lucidimagination.com
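
To make the difference concrete, here is a rough toy model in Python. It is
not Lucene code: the (term, position increment) tuples and the names
flat_match, lattice_match, and count_paths are made up for illustration, and
it treats any increment > 0 as "next position", so gaps from larger
increments are ignored. It only sketches the two matching strategies
discussed above and how many term sequences a position-aware matcher could
in principle have to consider.

```python
# Toy model of a token stream: (term, position_increment).
# An increment of 0 stacks the term on the previous term's position.
STREAM = [("A", 1), ("B", 0), ("C", 0),
          ("D", 1), ("E", 0), ("F", 0),
          ("G", 1), ("H", 0), ("I", 0)]


def flat_match(stream, phrase):
    """Match a phrase by walking terms in arrival order, ignoring positions."""
    terms = [term for term, _ in stream]
    n = len(phrase)
    return any(terms[i:i + n] == list(phrase)
               for i in range(len(terms) - n + 1))


def group_by_position(stream):
    """Group stacked terms: [[A, B, C], [D, E, F], [G, H, I]]."""
    positions = []
    for term, inc in stream:
        # Assumption: any increment > 0 starts a new position (no gaps).
        if inc > 0 or not positions:
            positions.append([])
        positions[-1].append(term)
    return positions


def lattice_match(stream, phrase):
    """Match word k of the phrase against any term at position n + k."""
    positions = group_by_position(stream)
    n = len(phrase)
    return any(all(phrase[k] in positions[i + k] for k in range(n))
               for i in range(len(positions) - n + 1))


def count_paths(stream):
    """Number of distinct term sequences through the whole lattice."""
    total = 1
    for terms in group_by_position(stream):
        total *= len(terms)
    return total


print(flat_match(STREAM, ("A", "D")))     # False: B and C sit between them
print(flat_match(STREAM, ("C", "D")))     # True: adjacent in the flat list
print(lattice_match(STREAM, ("A", "D")))  # True: positions 1 and 2
print(count_paths(STREAM))                # 27 = 3 * 3 * 3
```

With three stacked terms per position the path count is 3^n for n
positions, which is the kind of growth I mean; a real filter would also have
to save and restore attribute state along each partial match.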