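The query parser first splits the query string on whitespace, before any
analysis happens.

So each individual word of your query (short, red, evil, fox) gets its own
TokenStream, and therefore isn't shingled: the ShingleFilter never sees two
adjacent words in the same stream.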
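
To make the difference concrete, here is a rough sketch in plain Python. It is
a toy model, not Lucene or Solr code: shingle() and analyze() are hypothetical
helpers that only imitate the query analyzer from your schema (whitespace
tokenizer, lowercase, bigram shingles, outputUnigrams="false",
outputUnigramIfNoNgram="true"), and the "_" separator just follows the
notation in your mail.

```python
def shingle(tokens, size=2, sep="_", output_unigram_if_no_ngram=True):
    """Toy bigram shingle filter over ONE token stream (unigrams suppressed)."""
    shingles = [sep.join(tokens[i:i + size])
                for i in range(len(tokens) - size + 1)]
    # With a single token there is nothing to pair it with, so fall back
    # to the bare token if configured to do so.
    if not shingles and output_unigram_if_no_ngram:
        return list(tokens)
    return shingles


def analyze(text):
    """Toy query analyzer: whitespace tokenize, lowercase, then shingle."""
    return shingle(text.lower().split())


query = "short red evil fox"

# Analysis page: the whole string goes through the analyzer as one stream.
whole = analyze(query)

# Query parser: split on whitespace first, then analyze each word alone.
per_word = [tok for word in query.split() for tok in analyze(word)]

print(whole)     # ['short_red', 'red_evil', 'evil_fox']
print(per_word)  # ['short', 'red', 'evil', 'fox']
```

The first result is what the analysis page shows you; the second is closer to
what the query parser hands to the field, which is why no shingles turn up in
the parsed query.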

On Fri, Jun 4, 2010 at 6:21 PM, Greg Bowyer <gbow...@shopzilla.com> wrote:

> Hi all
>
> Interesting and by the looks of things very solid project you have here
> with SOLR, however ..
>
> I have an index that contains a large number of "phrases" that I need to
> search over. Each of these phrases is fairly small, being on average about
> 4 words long.
>
> The search terms that I am given to search these phrases are very long and
> quite arbitrary; sometimes the search terms will be up to 25 words long.
>
> As such, the performance of my index when built naively is sporadic:
> sometimes searches are very fast, but on average they are somewhat slower.
>
> I have attempted to improve this situation by using shingling for the
> phrases and the related search queries. In my schema I have the following:
>
>
>    <fieldType name="bigramed_phrase" class="solr.TextField"
>               positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>                outputUnigramIfNoNgram="true" />
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
>                outputUnigramIfNoNgram="true" />
>      </analyzer>
>    </fieldType>
>
> In the indexes, as seen with Luke, I do indeed have a large range of
> shingled terms.
>
> When I run the analyser for either query or index terms I also see the
> breakdown with the shingled terms correctly displayed.
>
> However, when I attempt to use this in a query I do not see the terms
> applied in the debug output. For example, with the term "short red evil fox"
> I would expect to see the shingles
> 'short_red' 'red_evil' 'evil_fox'
>
> but instead I get the following
>
> "debug":{
>  "rawquerystring":"short red evil fox",
>  "querystring":"short red evil fox",
>  "parsedquery":"+() ()",
>  "parsedquery_toString":"+() ()",
>  "explain":{},
>  "QParser":"DisMaxQParser",
>  "altquerystring":null,
>  "boostfuncs":null,
>  "filter_queries":["atomId:(8235 100000914 100000911 )"],
>  "parsed_filter_queries":["atomId:8235 atomId:100000914 atomId:100000911"],
>  "timing":{ ......
>
> Does anyone know what I could be doing wrong here? Is it a bug in the debug
> output, a stupid mistake, misconception or piece of idiocy on my part, or
> something else?
>
>
> Many thanks
>
> -- Greg Bowyer
>
>
>


-- 
Robert Muir
rcm...@gmail.com
