StandardTokenizer and domain names containing digits

2012-04-19 Thread Alex Willmer
TL;DR: How should I make Solr treat "ns1.define.logica.com" as a single token, 
in the same way "ns.define.logica.com" would be?

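We are just starting to use Solr 3.5.0 in production and have run into 
slightly surprising behaviour involving the query "ns1.define.logica.com", 
sent through an edismax handler with "q.op"=AND, defined with these defaults:

 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <str name="defType">edismax</str>
   <str name="q.op">AND</str>
   <str name="qf">
body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
author^10.9 changed created oneline^0.7
   </str>
   <str name="pf">
body^0.2 tags^1.1 title^1.5
   </str>
 </lst>
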
The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:
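
The markup of the definition did not survive in this archived copy. For reference, the text_general type shipped in the Solr 3.x example schema.xml looks approximately like the sketch below. It is reproduced from the stock example as an assumption, not recovered from this message, so attribute details may differ from the schema actually in use.

```xml
<!-- Approximate text_general definition from the Solr 3.x example schema.xml.
     Assumption: the poster's schema matches the stock example unchanged. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The point relevant to this question is that both analyzers use StandardTokenizerFactory, which is where "ns1.define.logica.com" gets split.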


  



  
  




  


The search string is being tokenised to "ns1", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) | 
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | 
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | 
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1 
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com", which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex




Re: StandardTokenizer and domain names containing digits

2012-04-23 Thread Alex Willmer
Steven A Rowe  syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules 
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29. These rules don't 
> include recognition of URLs or domain names.
> 
> Lucene/Solr includes another tokenizer that does recognize URLs and domain 
> names, in addition to the UAX#29 Word Boundary rules: 
> UAX29URLEmailTokenizer. (Stand-alone domain names are recognized as URLs.)
> 
> My suggestion is that you add a filter (for both the indexing and querying) 
> that splits tokens containing periods, WordDelimiterFilterFactory, 
> something like (untested!):
> 
> <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="0"
>         splitOnNumerics="0"
>         stemEnglishPossessive="0"
>         generateWordParts="1"
>         preserveOriginal="1" />

Steve, Thank you very much for this reply, it helped immensely. In the end I've 
gone for your suggestion, plus a swap of StandardTokenizer -> 
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The 
fieldType now looks like
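
The markup of the revised definition was also lost in this archived copy. The sketch below shows one way the three changes described (the UAX29URLEmailTokenizer swap, the suggested WordDelimiterFilterFactory settings, and autoGeneratePhraseQueries="true") might fit together. The filter order and the stop-word, synonym and lower-case filters are assumptions carried over from the stock text_general type, not taken from the original message.

```xml
<!-- Sketch only. Assumptions: filter order, and the stop/synonym/lowercase
     filters, which are copied from the stock example text_general type. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <!-- Keeps a whole domain name such as ns1.define.logica.com as one token -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- Adds the period-separated parts while preserving the original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After a change like this the affected fields need to be reindexed, since the index-time analysis chain is different from the one that produced the existing index.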


  





  
  





  


autoGeneratePhraseQueries is set so that the tokens generated in the query 
analyzer behave more like tokens from a space delimited query. So 
"ns1.define.logica.com" finds a similar set of documents to "ns1 define logica 
com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR 
logica OR com". 

Many thanks, Alex