StandardTokenizer and domain names containing digits

2012-04-19 Thread Alex Willmer

TLDR; How should I make Solr treat "ns1.define.logica.com" as a single token in 
the same way "ns.define.logica.com" would be?

We are just starting to use Solr 3.5.0 in production and have run into a 
slightly surprising behaviour involving the query "ns1.define.logica.com", 
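through an edismax handler with "q.op"=AND defined with

  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="q.op">AND</str>
    <str name="qf">
      body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
      author^10.9 changed created oneline^0.7
    </str>
    <str name="pf">
      body^0.2 tags^1.1 title^1.5
    </str>
  </lst>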


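The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>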


The search string is being tokenised to "ns1" and "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) | 
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | 
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | 
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1 
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com", which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?

Many thanks, Alex
-- 
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom 
M: +44 7557 752744
al.will...@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom

Re: StandardTokenizer and domain names containing digits

2012-04-23 Thread Alex Willmer

Steven A Rowe writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules
> from Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29. These rules don't
> include recognition of URLs or domain names.
>
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules:
> UAX29URLEmailTokenizer. (Stand-alone domain names are recognized as URLs.)
>
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods, something like (untested!):
>
> <filter class="solr.WordDelimiterFilterFactory"
>         splitOnCaseChange="0"
>         splitOnNumerics="0"
>         stemEnglishPossessive="0"
>         generateWordParts="1"
>         preserveOriginal="1" />

Steve, Thank you very much for this reply, it helped immensely. In the end I've 
gone for your suggestion, plus a swap of StandardTokenizer -> 
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The 
fieldType now looks like
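
The definition itself is not preserved in this copy of the message. As a 
minimal sketch of what the description implies, and not the actual 
configuration: the name text_general, the positionIncrementGap value, the 
single shared analyzer and the trailing LowerCaseFilter are assumptions 
carried over from the example schema, and the stop word and synonym filters 
are left out for brevity. Only the tokenizer swap, the WordDelimiterFilter 
settings and autoGeneratePhraseQueries="true" come from the thread.

```xml
<!-- Hypothetical reconstruction, not the original fieldType.
     Assumed: name, positionIncrementGap, one analyzer for index and query,
     and the LowerCaseFilter. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Keeps a stand-alone domain name such as ns1.define.logica.com
         together as one token -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- Splits that token at the periods and also keeps the unsplit
         original; splitOnNumerics="0" leaves "ns1" in one piece -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

An analyzer element without a type attribute applies at both index and query 
time, which matches the advice to add the filter on both sides.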

autoGeneratePhraseQueries is set so that the tokens generated in the query 
analyzer behave more like tokens from a space delimited query. So 
"ns1.define.logica.com" finds a similar set of documents to "ns1 define logica 
com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR 
logica OR com". 

Many thanks, Alex
