Hello,

Another common and useful technique for data like hostnames, email addresses,
URLs, etc. is to reverse them on their separators (e.g. www.foo.com becomes
com.foo.www) and then tokenize them so that multiple prefix tokens get indexed:
com, com.foo, com.foo.www. The benefit is that queries like site:com vs.
site:com.foo vs. site:com.foo.www become possible and cheap, since each one is
a plain term match against an indexed token.
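
To make the token expansion concrete, here is a minimal sketch in Python. It
only illustrates which tokens would end up in the index; it is not Solr code,
and the function name reversed_host_tokens is made up for this example.

```python
def reversed_host_tokens(host, sep="."):
    """Reverse the labels of a host name and return the cumulative prefixes."""
    # Lowercase and split on the separator, dropping empty labels
    # (e.g. from a trailing dot).
    labels = [label for label in host.lower().split(sep) if label]
    labels.reverse()
    # Emit one token per prefix length: com, com.foo, com.foo.www
    return [sep.join(labels[:i]) for i in range(1, len(labels) + 1)]


print(reversed_host_tokens("www.foo.com"))
# → ['com', 'com.foo', 'com.foo.www']
```

With tokens like these indexed, a query for site:com.foo is an exact lookup of
one term rather than a wildcard or suffix search.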

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Ahmet Arslan <iori...@yahoo.com>
> To: solr-user@lucene.apache.org
> Sent: Mon, November 22, 2010 5:22:10 PM
> Subject: Re: What tokenizer is good for breaking host names
> 
> > I have a "host" field in my documents which keeps the host from which
> > the page was crawled, for example yahoo.com or sports.yahoo.com. I want
> > this field to be searchable, so that if I search for yahoo I can find
> > sports.yahoo.com.
> > 
> > I have used this tokenizer and these filters, and it does not work:
> > <tokenizer class="solr.StandardTokenizerFactory"/>
> > <filter class="solr.LowerCaseFilterFactory"/>
> > <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > It seems they do not break the host name at the dots, so a search for
> > yahoo does not match sports.yahoo.com.
> > What tokenizer should I use so that it breaks the host name at the dots?
> 
> LetterTokenizerFactory, or MappingCharFilterFactory with "." => " "
> 
