Tokenising on Each Letter
Just getting ready to launch Solr on one of our websites. Unfortunately, we can't work out one little issue: how do I configure Solr so that it can search our model numbers easily?

For example: ADS12P2. If somebody searches for ADS it matches, because the field is currently split into tokens at the boundaries between letters and numbers; ADS12 also works, etc. But if somebody searches for ADS1, there are currently no results. Does anybody know how I should configure Solr so that it splits a certain field at each letter, or supports wildcards, etc.?

Kind regards,
Scott

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1247113.html
Sent from the Solr - User mailing list archive at Nabble.com.
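[For illustration, the splitting behaviour Scott describes can be sketched in Python. This is a rough approximation of a word-delimiter-style analyzer, not Solr's actual implementation; the function name is made up for the example.]

```python
import re

def split_model_number(token):
    """Rough sketch of word-delimiter splitting: break a model number
    at every transition between letters and digits."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)

parts = split_model_number("ADS12P2")
print(parts)  # ['ADS', '12', 'P', '2']
# "ADS" is an indexed token, and "ADS12" can match as an adjacent pair,
# but no token equals "ADS1", so that query returns nothing.
```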
Re: Tokenising on Each Letter
Probably a good idea to post the relevant information! I guessed it would be a really obvious answer, but it seems it's a bit more complex ;)

It seems you may be correct about the catenateAll option, but I'm not sure that adding a wildcard at the end of every search would be a great idea. This is meant to be applied to a general search box, but still retain flexibility for model numbers. Right now we are using MySQL % wildcards on both sides, so it matches pretty much anything in the model number, whether you cut off the start or the end, and I wanted to retain that.

Could you elaborate about n-grams for me, based on my schema? The main reason I picked textTight was for model numbers like EQW-500DBE-1AVER; I thought it would produce better results.

Thanks a lot for the detailed reply.
Scott
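[A sketch of the distinction at issue here: edge n-grams index only prefixes of each token, while full n-grams index every substring, which is the closer analogue of MySQL's %term% matching that Scott wants to keep. The helper names below are illustrative Python, not Solr API.]

```python
def edge_ngrams(token, lo=2, hi=30):
    """Prefixes only -- roughly what an edge n-gram filter emits."""
    return {token[:n] for n in range(lo, min(len(token), hi) + 1)}

def all_ngrams(token, lo=2, hi=30):
    """Every substring -- roughly what a full n-gram filter emits,
    closer to MySQL's '%term%' behaviour."""
    return {token[i:i + n]
            for n in range(lo, min(len(token), hi) + 1)
            for i in range(len(token) - n + 1)}

token = "eqw500dbe1aver"
print("dbe" in edge_ngrams(token))  # False: edge grams cover prefixes only
print("dbe" in all_ngrams(token))   # True: full n-grams match mid-string
```

So if searches must also match fragments cut from the middle or end of a model number, a full n-gram filter is needed; edge n-grams only cover queries that are prefixes of a token.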
Re: Tokenising on Each Letter
Nikolas, thanks a lot for that. I've just given it a quick test and it definitely seems to work for the examples I gave.

Thanks again,
Scott

From: Nikolas Tautenhahn [via Lucene]
Sent: Monday, August 23, 2010 3:14 PM
To: Scottie
Subject: Re: Tokenising on Each Letter

Hi Scottie,

> Could you elaborate about N gram for me, based on my schema?

Just a quick reply:

> <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>

This will produce edge n-grams from 2 up to 30 characters; for more info, check
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

Be sure to adjust those sizes (minGramSize/maxGramSize) so that maxGramSize is big enough to keep the whole original serial number/model number, and minGramSize is not so small that you fill your index with useless information.

Best regards,
Nikolas Tautenhahn
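[A quick sketch of why the analyzer Nikolas posted fixes Scott's original ADS1 case: the edge n-gram step indexes every prefix of the lower-cased token, and ads1 is one of them. Plain Python approximation of solr.EdgeNGramFilterFactory, not Solr itself.]

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Approximation of an edge n-gram filter: emit every prefix of the
    token between min_gram and max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

grams = edge_ngrams("ads12p2")
print(grams)
# ['ad', 'ads', 'ads1', 'ads12', 'ads12p', 'ads12p2']
# 'ads1' is now an indexed term, so the query that previously returned
# nothing will match.
```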