Hi Alex,

Thanks for reporting back with concrete details of what worked for you - very helpful for others with similar projects.
Steve

-----Original Message-----
From: Alex Willmer [mailto:al.will...@logica.com]
Sent: Monday, April 23, 2012 5:35 AM
To: solr-user@lucene.apache.org
Subject: Re: StandardTokenizer and domain names containing digits

Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary
> rules from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and
> domain names, in addition to the UAX#29 Word Boundary rules:
> UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and
> querying) that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
>
> <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="0"
>         splitOnNumerics="0"
>         stemEnglishPossessive="0"
>         generateWordParts="1"
>         preserveOriginal="1" />

Steve,

Thank you very much for this reply, it helped immensely. In the end I've gone for your suggestion, plus a swap of StandardTokenizer -> UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true".
The fieldType now looks like

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

autoGeneratePhraseQueries is set so that the tokens generated in the query analyzer behave more like tokens from a space delimited query. So "ns1.define.logica.com" finds a similar set of documents to "ns1 define logica com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR logica OR com".

Many thanks,

Alex