Hello,

Another common and useful technique for data like hostnames, email addresses, URLs, etc. is to reverse them on their separators (e.g. www.foo.com becomes com.foo.www) and then tokenize them so that you end up with multiple tokens indexed: com, com.foo, and com.foo.www. The benefit is that queries like site:com vs. site:com.foo vs. site:com.foo.www become possible and cheap, since each level of the hierarchy is indexed as its own token.
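To make the idea concrete, here is a minimal sketch in plain Python of the reverse-then-prefix step. It is only an illustration of the token stream you would want to end up with, not Solr or Lucene code; the function name reversed_prefix_tokens and its sep parameter are made up for this example, and lowercasing the input is an assumption on my part.

```python
def reversed_prefix_tokens(host, sep="."):
    """Reverse `host` on `sep` and return the cumulative prefixes.

    Hypothetical helper for illustration only; not part of Solr or Lucene.
    """
    # Split on the separator, dropping empty pieces (e.g. from a trailing dot).
    # Lowercasing is an assumption, since hostnames are case-insensitive.
    parts = [p for p in host.lower().split(sep) if p]
    # www.foo.com -> ["com", "foo", "www"]
    parts.reverse()
    # Emit one token per level: com, com.foo, com.foo.www
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


print(reversed_prefix_tokens("www.foo.com"))
# ['com', 'com.foo', 'com.foo.www']
```

With tokens like these in the index, a query for site:com.foo only has to match the single token com.foo, and it will hit every host under foo.com.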

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Ahmet Arslan <iori...@yahoo.com>
> To: solr-user@lucene.apache.org
> Sent: Mon, November 22, 2010 5:22:10 PM
> Subject: Re: What tokenizer is good for breaking host names
>
> > I have a "host" field in my documents which keep the host from which the page
> > was crawled. for example, yahoo.com, or sports.yahoo.com. I want this field to
> > be searchable so if I search yahoo, I can find sports.yahoo.com.
> >
> > I have used these tokenizers and it does not work:
> > <tokenizer class="solr.StandardTokenizerFactory"/>
> > <filter class="solr.LowerCaseFilterFactory"/>
> > <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > Now, it seems they do not break the host name at the dots and does not match
> > find yahoo in sports.yahoo.com.
> > What tokenizer should I use so it breaks the host name at dots?
>
> LetterTokenizerFactory or MappingCharFilterFactory with "."=> " "