Hmmm, certainly only the output of the last filter makes it into
the index. Consider stopwords being the last filter: you'd expect
the stopwords to be removed.
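
In your chain the EdgeNGramFilter is the last filter in the index analyzer,
so only its grams reach the index. Roughly, assuming your
minGramSize=2/maxGramSize=20 settings, the long token would end up as
something like:

  input token:    111950520904110959061                     (21 chars)
  indexed terms:  11, 111, 1119, ..., 11195052090411095906  (lengths 2 through 20)
  not indexed:    111950520904110959061                     (21 chars > maxGramSize)

which is why the exact-token queries miss.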

There's nothing that I know of that'll do what you're asking; the
code for ENGTF (EdgeNGramFilterFactory) doesn't have any "preserve
original" option that I can see. This seems like a useful addition
though, and you've done a nice job of characterizing the problem.
Want to raise a JIRA and/or submit a patch?

I'd guess your only real short-term workaround would be to
increase the max gram size.
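
E.g., something like this in the index analyzer (25 is an arbitrary guess
here, pick whatever covers your longest expected token):

  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" />

The trade-off is a bigger index, since more grams get emitted for every
long token.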

I suppose you could do a copyField into a field that doesn't
do the n-gramming and search against that too, but that
feels kind of kludgy...
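
Something along these lines, untested and with made-up field/type names
(words_plain, url_words_plain), just to sketch the copyField idea:

  <!-- same analysis as words_ngram, but without the EdgeNGramFilter -->
  <fieldType name="words_plain" class="solr.TextField" omitNorms="false">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
      <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>

  <field name="url_words_plain" type="words_plain" indexed="true" stored="false" />
  <!-- copyField copies the raw input, so the plain field keeps the full, un-ngrammed tokens -->
  <copyField source="url_words_ngram" dest="url_words_plain" />

Then query both fields, e.g. qf=url_words_ngram url_words_plain, so short
prefixes still match via the n-gram field and full tokens match via the
plain one.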

Best,
Erick


On Wed, Aug 28, 2013 at 7:16 AM, heaven <aheave...@gmail.com> wrote:

> Hi, please help me figure out what's going on. I have the following field type:
>
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
> ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> maxGramSize="20" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
> ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
> And the following string is indexed:
> http://plus.google.com/111950520904110959061/profile
>
> Here is what the analyzer shows:
> http://img607.imageshack.us/img607/5074/fn1.png
>
> Then I run the following query:
> fq=type:Site&
> sort=score desc&
> q=https\\:\\/\\/plus.google.com\\/111950520904110959061\\/profile&
> fl=* score&
> qf=url_words_ngram&
> defType=edismax&
> start=0&
> rows=20&
> mm=1
>
> And get no results.
>
> These queries do match:
> 1. https://plus.google
> 2. https://plus.google.com
> 3. 11195052090
>
> And these do not:
> 1. https://plus.google.com/111950520904110959061/profile
> 2. 111950520904110959061/profile
> 3. 111950520904110959061
>
> The reason is that "111950520904110959061" has length 21 while I have
> maxGramSize set to 20. I tried increasing maxGramSize to 200 and it works,
> but is there any way to match the given query without doing that? The query
> analyzer shows there are exact matches at PT, SF and LCF, or does it work
> in such a way that only the output of the last filter factory (ENGTF in my
> example) ends up in the index? If so, is there an option to preserve the
> original tokens as well?
>
> So that for maxGramSize="5" and the indexed string "awesomeness" I'd have:
> "a", "aw", "awe", "awes", "aweso", "awesomeness"
>
> Best,
> Alex
>
>
>
