The default setting should index BOTH "wi fi" and "wifi". A query for "wi-fi", either with or without quotes, will query for the phrase "wi fi". Incidentally, that behavior is known as "autoGeneratePhraseQueries".

-- Jack Krupansky

-----Original Message----- From: johnmu...@aol.com
Sent: Thursday, November 08, 2012 6:20 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about schema.xml


Thank you everyone for your explanation. So for WordDelimiterFilter, let me see if I got it right.


Given that the out-of-the-box setting for catenateWords is "0" for query but "1" for index, I don't see how this will give me any hits. That is, if my document has "wi-fi", at index time it will be stored as "wifi". Well, then at query time if I type "wi-fi" (without quotes) I will be searching for "wi fi" and thus won't get a hit, no?


What about when I *do* quote my search, i.e., I search for "wi-fi" with quotes: what am I sending to the searcher then, "wi-fi", "wi fi" or "wifi"? Again, this is using the default out-of-the-box setting per the above.


The same applies for catenateNumbers.


Btw, I'm looking at this link for the above values: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


--MJ





-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
To: solr-user <solr-user@lucene.apache.org>
Sent: Thu, Nov 8, 2012 6:57 pm
Subject: Re: Questions about schema.xml


And, in fact, you do NOT need to have two. If they are both identical, just
specify one analysis chain with no qualifier, i.e.
<analyzer>
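
As a rough sketch of that single-chain form: the field type name "text_simple" and the choice of filters here are only illustrative assumptions, borrowed from the classes in the schema quoted further down, not a recommended configuration.

```xml
<!-- Hypothetical field type; the name "text_simple" and the filter list are illustrative only. -->
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: this one chain is applied at both index time and query time. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```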


On Thu, Nov 8, 2012 at 9:44 AM, Jack Krupansky <j...@basetechnology.com> wrote:

Many token filters will be used 100% identically for both "index" and
"query" analysis, but WordDelimiterFilter is a rare exception. The issue is
that at index time it has the ability to generate multiple tokens at the
same position (the "catenate" options), any of which can be queried, but at
query time it can be problematic to have these "extra" terms (except in
some conditions), so the WDF settings suppress generation of the extra
terms.
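
To make the asymmetry concrete, here is a minimal sketch that keeps only the WordDelimiterFilter from each chain. The field type name "text_wdf" is a made-up placeholder, and the token lists in the comments describe what the settings are expected to produce for the input "wi-fi", not captured output from a running Solr.

```xml
<!-- Hypothetical field type; "text_wdf" is an illustrative name. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="1": "wi-fi" is expected to yield the parts "wi" and "fi"
         plus the catenated extra term "wifi". -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="0": "wi-fi" is expected to yield only "wi" and "fi",
         so no extra term is generated on the query side. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, a query for "wi-fi" can match the indexed parts "wi" and "fi", while a query for "wifi" can match the catenated term.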

Another example is synonyms - generate extra terms at index time for
greater precision of searches, but limit the query terms to exclude the
"extra" terms.

That's the reason for the occasional asymmetry between index-time and
query-time analyzers.

-- Jack Krupansky

-----Original Message----- From: johnmu...@aol.com
Sent: Wednesday, November 07, 2012 7:13 PM
To: solr-user@lucene.apache.org
Subject: Questions about schema.xml



Hi,


Can someone help me understand the meaning of <analyzer type="index"> and
<analyzer type="query"> in schema.xml, how they are used and what do I get
back when the values are not the same?


For example, given:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
  <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>


If I make the entire content of "index" the same as "query" (or the other
way around) how will that impact my search?  And why would I want to not
make those two blocks the same?


Thanks!!!


-MJ


