Re: Questions about schema.xml

Erick Erickson Sat, 10 Nov 2012 09:55:45 -0800

You should get familiar with the admin/analysis page. It's invaluable for
understanding _exactly_ what your analysis chain does with various inputs.
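
As a rough starting point, here is a stripped-down field type you could add
to schema.xml and select on the analysis page. It is only a sketch: the name
text_wdf_test is made up, and it keeps just the tokenizer, WordDelimiterFilter
and lowercase steps from the schema quoted below so that the WDF difference
stands out. The filter attributes are copied from that quoted schema.

```xml
<!-- Hypothetical test-only field type; the name is an assumption, not part of the stock schema. -->
<fieldType name="text_wdf_test" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="1": also emit the joined form of split words at index time -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords="0": only the split parts are produced at query time -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Enter wi-fi in both the index and query boxes. If Jack's description below
holds, the index side should list wi, fi and wifi, while the query side should
list only wi and fi.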


Best
Erick


On Thu, Nov 8, 2012 at 9:49 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> The default setting should index BOTH "wi fi" and "wifi". A query for
> "wi-fi", either with or without quotes, will query for "wi fi".
> Incidentally, that behavior is controlled by "autoGeneratePhraseQueries".
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: johnmu...@aol.com
> Sent: Thursday, November 08, 2012 6:20 PM
> To: solr-user@lucene.apache.org
>
> Subject: Re: Questions about schema.xml
>
>
> Thank you everyone for your explanation.  So for WordDelimiterFilter, let
> me see if I got it right.
>
>
> Given that the out-of-the-box setting for catenateWords is "0" for query but
> is "1" for index, I don't see how this will give me any hits.  That
> is, if my document has "wi-fi", at index time it will be stored as "wifi".
> Well, then at query time if I type "wi-fi" (without quotes) I will be
> searching for "wi fi" and thus won't get a hit.  No?
>
>
> What about when I *do* quote my search, i.e.: I search for "wi-fi" with
> quotes, now what am I sending to the searcher, "wi-fi", "wi fi" or "wifi"?
> Again, this is using the default out-of-the-box setting per the above.
>
>
> The same applies for catenateNumbers.
>
>
> Btw, I'm looking at this link for the above values:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
>
> --MJ
>
>
>
>
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Thu, Nov 8, 2012 6:57 pm
> Subject: Re: Questions about schema.xml
>
>
> And, in fact, you do NOT need to have two. If they are both identical, just
> specify one analysis chain with no qualifier, i.e.
> <analyzer>
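>
> A minimal sketch of what that could look like, assuming a made-up field
> type name (text_same) and a deliberately short filter chain:
>
> ```xml
> <!-- Hypothetical field type; the name and the choice of filters are assumptions. -->
> <fieldType name="text_same" class="solr.TextField" positionIncrementGap="100">
>   <!-- No type attribute: this single chain is used for both index and query. -->
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> ```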
>
>
> On Thu, Nov 8, 2012 at 9:44 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Many token filters will be used 100% identically for both "index" and
>> "query" analysis, but WordDelimiterFilter is a rare exception. The issue is
>> that at index time it has the ability to generate multiple tokens at the
>> same position (the "catenate" options), any of which can be queried, but at
>> query time it can be problematic to have these "extra" terms (except in
>> some conditions), so the WDF settings suppress generation of the extra
>> terms.
>>
>> Another example is synonyms - generate extra terms at index time for
>> greater precision of searches, but limit the query terms to exclude the
>> "extra" terms.
>>
>> That's the reason for the occasional asymmetry between index-time and
>> query-time analyzers.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: johnmu...@aol.com
>> Sent: Wednesday, November 07, 2012 7:13 PM
>> To: solr-user@lucene.apache.org
>> Subject: Questions about schema.xml
>>
>>
>>
>> Hi,
>>
>>
>> Can someone help me understand the meaning of <analyzer type="index"> and
>> <analyzer type="query"> in schema.xml, how they are used and what do I get
>> back when the values are not the same?
>>
>>
>> For example, given:
>>
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>> autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" enablePositionIncrements="true" />
>>      <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>      <filter class="solr.LowerCaseFilterFactory"/>
>>      <filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords.txt"/>
>>      <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>      <filter class="solr.SynonymFilterFactory"
>> synonyms="synonyms.txt"
>> ignoreCase="true" expand="true"/>
>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" enablePositionIncrements="true" />
>>      <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>      <filter class="solr.LowerCaseFilterFactory"/>
>>      <filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords.txt"/>
>>      <filter class="solr.PorterStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>> If I make the entire content of "index" the same as "query" (or the other
>> way around) how will that impact my search?  And why would I want to not
>> make those two blocks the same?
>>
>>
>> Thanks!!!
>>
>>
>> -MJ
>>
>>
>
>
