Tokenising on Each Letter

2010-08-20 Thread Scottie

Just getting ready to launch Solr on one of our websites.

Unfortunately, we can't work out one little issue: how do I configure Solr
so that it can search our model numbers easily? For example:

ADS12P2

If somebody searched for ADS it would match, because the model number is
currently split into tokens wherever letters meet numbers; ADS12 would also
work, and so on.

But if somebody searches for ADS1, there are currently no results.

Does anybody know how I should configure Solr so that it will split a
certain field on each letter, support wildcards, etc.?
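For illustration, the splitting behaviour described above can be sketched in a few lines of Python (a rough approximation of what Solr's WordDelimiterFilterFactory does at letter/number boundaries, not the actual analyzer):

```python
import re

def split_on_type_change(term):
    # Split wherever a run of letters meets a run of digits,
    # roughly how the model number gets tokenised at index time.
    return re.findall(r"[A-Za-z]+|[0-9]+", term)

print(split_on_type_change("ADS12P2"))  # ['ADS', '12', 'P', '2']
```

A query term like ADS matches the first token, but ADS1 straddles the letter/number boundary, so no indexed token equals it and the search finds nothing.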

Kind Regards

Scott
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1247113.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tokenising on Each Letter

2010-08-23 Thread Scottie

Probably a good idea to post the relevant information! I guess I thought it
would be a really obvious answer, but it seems it's a bit more complex ;)

[The schema.xml snippet was stripped by the mailing-list archive; per the
reply below, the field used the example textTight field type.]
It seems you may be correct about the catenateAll option, but I'm not sure
adding a wildcard at the end of every search would be a great idea. This is
meant to be applied to a general search box, while still retaining
flexibility for model numbers. Right now we are using MySQL '%...%'
wildcards, so it matches pretty much anything in the model number, whether
you cut off the start or the end, and I wanted to retain that.
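For comparison, the MySQL behaviour being replaced is plain substring matching; a minimal sketch (hypothetical helper, assuming the default case-insensitive collation):

```python
def like_contains(model, query):
    # Rough equivalent of: model LIKE CONCAT('%', query, '%')
    # under MySQL's default case-insensitive collation.
    return query.lower() in model.lower()

print(like_contains("ADS12P2", "ADS1"))   # True
print(like_contains("ADS12P2", "S12P2"))  # True: start cut off
```

Note that edge n-grams (suggested below in the thread) only index prefixes of each token, so matching with the start cut off would need something like solr.NGramFilterFactory instead.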

Could you elaborate about N gram for me, based on my schema?

The main reason I picked textTight was for model numbers like
EQW-500DBE-1AVER etc.; I thought it would produce better results.

Thanks a lot for the detailed reply.

Scott


Re: Tokenising on Each Letter

2010-08-23 Thread Scottie

Nikolas, thanks a lot for that. I've just given it a quick test and it
definitely seems to work for the examples I gave.

Thanks again,

Scott


From: Nikolas Tautenhahn [via Lucene] 
Sent: Monday, August 23, 2010 3:14 PM
To: Scottie 
Subject: Re: Tokenising on Each Letter


Hi Scottie, 

> Could you elaborate about N gram for me, based on my schema? 

just a quick reply: 


> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>       generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0"
>       splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>       maxGramSize="30"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>       ignoreCase="true" expand="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>       generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"
>       splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>

This will produce edge n-grams from 2 up to 30 characters; for more info check
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

Be sure to adjust those sizes (minGramSize/maxGramSize) so that 
maxGramSize is big enough to keep the whole original serial number/model 
number and minGramSize is not so small that you fill your index with 
useless information. 
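The filter's effect can be sketched in Python (a hypothetical re-implementation of the edge n-gram expansion, assuming the minGramSize="2"/maxGramSize="30" settings above):

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    # Emit front-anchored n-grams, as solr.EdgeNGramFilterFactory
    # does with minGramSize="2" maxGramSize="30".
    top = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, top + 1)]

print(edge_ngrams("ads12p2"))
# ['ad', 'ads', 'ads1', 'ads12', 'ads12p', 'ads12p2']
```

Because "ads1" is now an indexed token, the query ADS1 matches after lower-casing, without any wildcard at query time.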

Best regards, 
Nikolas Tautenhahn 







