Hi Alex,

Thanks for reporting back with concrete details of what worked for you - very 
helpful for others with similar projects.

Steve

-----Original Message-----
From: Alex Willmer [mailto:al.will...@logica.com] 
Sent: Monday, April 23, 2012 5:35 AM
To: solr-user@lucene.apache.org
Subject: Re: StandardTokenizer and domain names containing digits

Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
>
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             preserveOriginal="1" />

Steve, thank you very much for this reply; it helped immensely. In the end I
went with your suggestion, plus swapping StandardTokenizer for
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The
fieldType now looks like this:

<fieldType name="text_general" class="solr.TextField" 
positionIncrementGap="100" 
autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" 
            expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" 
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I set autoGeneratePhraseQueries so that the tokens the query analyzer
generates from a single term behave more like the terms of a space-delimited
query. So "ns1.define.logica.com" finds a similar set of documents to
"ns1 define logica com" (i.e. "ns1 AND define AND logica AND com"), rather
than to "ns1 OR define OR logica OR com".

Many thanks, Alex
