Re: Phrase matching on a text field

Phil Chadwick Thu, 07 May 2009 17:47:23 -0700

Hi Jay

Thank you for your response.


The data relating to the string (s_title) defines *exactly* what was
fed into the SOLR indexing.  The string is not otherwise relevant to
the question.

The essence of my question is why can the indexed text (t_title) not
be phrase matched by the query on the text when the word "for" is
present in the query.

The following work (and I would expect them to work):

    q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
    q=t_title:"future directions"
    q=t_title:"integrated catchment"

The following does not work (and I would expect it to work):

    q=t_title:"directions for integrated"

The following does not work (not sure if I expect it to work or not):

    q=t_title:"directions integrated"

My reading is that if the "FOR" is removed in the text indexing, it
should also be removed for the text query!
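
A minimal Python sketch of one way this could arise. It is not Solr code,
and the mechanism is an assumption: that the index analyzer leaves a
position gap where "for" was removed, while the parsed query places the
remaining terms next to each other. The names analyze, phrase_matches and
keep_gaps are hypothetical, and stemming is left out to keep it short.

```python
# Models token positions only; this is NOT Solr's real analysis chain.
STOPWORDS = {"for"}  # hypothetical one-word stop list


def analyze(text, keep_gaps):
    """Return (term, position) pairs after lowercasing and stop word removal.

    keep_gaps=True models a removed stop word still consuming a position;
    keep_gaps=False packs the surviving terms next to each other.
    """
    result = []
    position = 0
    for word in text.lower().split():
        if word in STOPWORDS:
            if keep_gaps:
                position += 1  # leave a hole where the stop word was
            continue
        result.append((word, position))
        position += 1
    return result


def phrase_matches(indexed, query):
    """True if the query terms occur in the index at the same relative offsets."""
    if not query:
        return False
    index_positions = {}
    for term, pos in indexed:
        index_positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query[0]
    for start in index_positions.get(first_term, ()):
        if all((start + pos - first_pos) in index_positions.get(term, ())
               for term, pos in query):
            return True
    return False


indexed = analyze("FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT", keep_gaps=True)

for phrase in ("future directions",
               "directions for integrated",
               "directions integrated"):
    for keep_gaps in (False, True):
        query = analyze(phrase, keep_gaps)
        print(phrase, keep_gaps, phrase_matches(indexed, query))
```

Under that assumption, a query analyzed without gaps fails for both
"directions for integrated" and "directions integrated", which is the
pattern listed above. It does not show whether SOLR actually behaves
this way.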

I also added 'enablePositionIncrements="true"' to the text query analyzer
to make it the same as the text index analyzer:

    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"/>

There was no change in the outcome.
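
To re-run each case with "&debugQuery=true" after a schema change, the
request URLs can be generated with a short script. This is only a sketch:
the base URL is an assumption (a local example instance), and debug_url
is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to the real instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

# The phrase queries discussed in this thread.
PHRASES = [
    "future directions",
    "integrated catchment",
    "directions for integrated",
    "directions integrated",
]


def debug_url(phrase, field="t_title"):
    """Build a phrase query URL against one field with debugQuery enabled."""
    params = {"q": f'{field}:"{phrase}"', "debugQuery": "true"}
    return SOLR_SELECT + "?" + urlencode(params)


for phrase in PHRASES:
    print(debug_url(phrase))
```

The parsedquery element in each response shows what the query analyzer
produced for that phrase.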

The definitions for text and string were exactly as in the SOLR 1.3
example schema.  The section of that schema for "text" is shown below.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
      ignoreCase="true"
      words="stopwords.txt"
      enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
      protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
      synonyms="synonyms.txt"
      ignoreCase="true"
      expand="true"/>
    <!-- enablePositionIncrements="true" is not set here in the example schema -->
    <filter class="solr.StopFilterFactory"
      ignoreCase="true"
      words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
      protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

</fieldType>


Cheers,


-- 
Phil

The art of being wise is the art of knowing what to overlook.
        -- William James



Jay Hill wrote:
>
> The string fieldtype is not being tokenized, while the text fieldtype is
> tokenized. So the stop word "for" is being removed by a stop word filter,
> which doesn't happen with the string field type (no tokenizing).
> 
> Have a look at the schema.xml in the example dir and look at the default
> configuration for both the text and string fieldtypes. The string
> fieldtype is not analyzed whereas the text fieldtype has a number of
> different filters that take action.

> On Wed, May 6, 2009 at 11:09 PM, Phil Chadwick
> <p.chadw...@internode.on.net> wrote:
> 
> > Hi,
> >
> > I'm trying to figure out why phrase matching on a text field only works
> > some of the time.
> >
> > I have a SOLR index containing a document titled "FUTURE DIRECTIONS FOR
> > INTEGRATED CATCHMENT".  The "FOR" seems to be causing a problem...
> >
> > The title field is indexed as both s_title and t_title (string and text,
> > as defined in the demo schema), thus:
> >
> >    <field name="title" type="string" indexed="false" stored="false"
> >        multiValued="false" />
> >    <field name="s_title" type="string" indexed="true" stored="true"
> >        multiValued="false" />
> >    <field name="t_title" type="text" indexed="true" stored="false"
> >        multiValued="false" />
> >    <copyField source="title" dest="s_title" />
> >    <copyField source="title" dest="t_title" />
> >
> > I can match the document with an exact query on the string:
> >
> >    q=s_title:"FUTURE DIRECTIONS FOR INTEGRATED CATCHMENT"
> >
> > I can match the document with this phrase query on the text:
> >
> >    q=t_title:"future directions"
> >
> > which uses the parsedquery shown by "&debugQuery=true":
> >
> >    <str name="rawquerystring">t_title:"future directions"</str>
> >    <str name="querystring">t_title:"future directions"</str>
> >    <str name="parsedquery">PhraseQuery(t_title:"futur direct")</str>
> >    <str name="parsedquery_toString">t_title:"futur direct"</str>
> >
> > Similarly, I can match the document with this query:
> >
> >    q=t_title:"integrated catchment"
> >
> > which uses the parsedquery shown by "&debugQuery=true":
> >
> >    <str name="rawquerystring">t_title:"integrated catchment"</str>
> >    <str name="querystring">t_title:"integrated catchment"</str>
> >    <str name="parsedquery">PhraseQuery(t_title:"integr catchment")</str>
> >    <str name="parsedquery_toString">t_title:"integr catchment"</str>
> >
> > But I cannot match the document with the query:
> >
> >    q=t_title:"future directions for integrated catchment"
> >
> > which uses the phrase query shown by "&debugQuery=true":
> >
> >    <str name="rawquerystring">
> >        t_title:"future directions for integrated catchment"</str>
> >    <str name="querystring">
> >        t_title:"future directions for integrated catchment"</str>
> >    <str name="parsedquery">
> >        PhraseQuery(t_title:"futur direct integr catchment")</str>
> >    <str name="parsedquery_toString">
> >        t_title:"futur direct integr catchment"</str>
> >
> > Any wisdom gratefully accepted.
> >
> > Cheers,
> >
> >
> > --
> > Phil
> >
> > 640K ought to be enough for anybody.
> >        -- Bill Gates, in 1981
> >
