Hmmm, certainly only the output of the last filter makes it into the index. Consider a stop filter being the last filter in the chain: you'd expect the stopwords to be removed.
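To make that concrete, here is a rough Python sketch of what ends up in the index for the 21-character token from the question. It is only a simplified model of front edge n-gramming, not the actual Lucene code; the helper name edge_ngrams and its defaults (taken from the minGramSize="2" / maxGramSize="20" settings in the field type) are made up for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Simplified model: front-edge n-grams with lengths min_gram..max_gram.

    Hypothetical helper for illustration only; it does not call Solr/Lucene.
    Note that the original token is NOT kept unless its length <= max_gram.
    """
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Index side: the last filter in the chain is the edge n-gram filter,
# so only these grams are stored for the 21-character token.
indexed = set(edge_ngrams("111950520904110959061"))

# Query side: no n-gram filter, so the query term is the whole token.
print("11195052090" in indexed)            # True  (11-char prefix is a gram)
print("111950520904110959061" in indexed)  # False (21 chars > max gram of 20)
print(max(len(g) for g in indexed))        # 20
```

The same model applied to "awesomeness" with a minimum of 1 and a maximum of 5 gives "a" through "aweso" and nothing longer, which is why the full token never matches.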
There's nothing that I know of that'll do what you're asking; the code for
ENGTF (EdgeNGramFilterFactory) doesn't have any "preserve original" option
that I see. This seems like a useful addition though, and you've done a nice
job of characterizing the problem. Want to raise a JIRA and/or do a patch?

I'd guess your only real short-term workaround would be to increase the max
gram size. I suppose you could do a copyField into a field that doesn't do
the n-gramming and search against that too, but that feels kind of kludgy...

Best,
Erick


On Wed, Aug 28, 2013 at 7:16 AM, heaven <aheave...@gmail.com> wrote:

> Hi, please help me figure out what's going on. I have the next field type:
>
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
>             ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>             maxGramSize="20" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
>             ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
> And the next string indexed:
> http://plus.google.com/111950520904110959061/profile
>
> Here is what the analyzer shows:
> http://img607.imageshack.us/img607/5074/fn1.png
>
> Then I do the next query:
> fq=type:Site&
> sort=score desc&
> q=https\\:\\/\\/plus.google.com\\/111950520904110959061\\/profile&
> fl=* score&
> qf=url_words_ngram&
> defType=edismax&
> start=0&
> rows=20&
> mm=1
>
> And have no results.
>
> These queries do match:
> 1. https://plus.google
> 2. https://plus.google.com
> 3. 11195052090
>
> And these do not:
> 1. https://plus.google.com/111950520904110959061/profile
> 2. 111950520904110959061/profile
> 3. 111950520904110959061
>
> The reason is that "111950520904110959061" length is 21 when I have max gram
> size set to 20. Tried to increase max gram size to 200 and it works, but is
> there any way to match given query without doing that? The query analyzer
> show there are exact matches at PT, SF and LCF or does it work that way so
> in index we have only the output from the last filter factory (ENGTF in my
> example)? If so, is there an option to preserve the original tokens also?
>
> So that for maxGramSize="5" and indexed string awesomeness I'd have:
> "a", "aw", "awe", "awes", "aweso", "awesomeness"
>
> Best,
> Alex
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-tp4086967.html
> Sent from the Solr - User mailing list archive at Nabble.com.