Jack: That's a consequence of the keyword tokenizer I hadn't thought of before...
Erick

On Wed, Aug 21, 2013 at 11:17 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> The reason that a query of "bestbuy" matches indexing of "best buy" in
> this case is that the keyword tokenizer treats the entire input text as one
> token, including the space between "best" and "buy" and then the WDF treats
> any embedded white space as if it were punctuation and then the catenateAll
> attribute causes "best" and "buy" to be concatenated to form "bestbuy".
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Erick Erickson
> Sent: Wednesday, August 21, 2013 11:12 AM
> To: solr-user@lucene.apache.org
> Subject: Re: What filter to use to search with spaces omitted/included
> between words?
>
> Keyword tokenizer will probably cause you problems, since you'll never
> match "best", and searching name:best AND name:buy would fail as well.
>
> And I'm surprised this is working at all, I'd really scrutinize why bestbuy
> matches an index with Best Buy, that makes no sense on the surface.
>
> If you have a relatively small vocabulary, synonyms might work for you.
>
> Best,
> Erick
>
> On Tue, Aug 20, 2013 at 8:04 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> Let me take that back, this actually works. q=bestbuy matches "Best Buy"
>> and documents are returned.
>>
>> <fieldType name="rl_keywords" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer type="index">
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="1"
>>             catenateNumbers="1"
>>             catenateAll="0"
>>             preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="1"
>>             catenateNumbers="1"
>>             catenateAll="0"
>>             preserveOriginal="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> I was using <tokenizer class="solr.StandardTokenizerFactory"/>, replacing
>> it with <tokenizer class="solr.KeywordTokenizerFactory"/> did the trick.
>> Not sure how it worked. The field value I am searching is "Best Buy", but
>> when I search for "bestbuy", it returns a result.
>>
>> Thanks,
>> -Utkarsh
>>
>> On Tue, Aug 20, 2013 at 4:48 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>>
>> > Thanks Tamanjit and Erick.
>> > I tried out the filters, most of the use cases work except "q=bestbuy".
>> > As mentioned by Erick, that is a hard one to crack.
>> >
>> > I am looking into DictionaryCompoundWordTokenFilterFactory but compound
>> > words like these:
>> > http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words
>> > and generic english words, it won't cover my need of custom compound words
>> > of store names like BestBuy, WalMart or CircuitCity.
>> >
>> > Thanks,
>> > -Utkarsh
>> >
>> > On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>> >
>> >> You could either have a synonym filter to replace "bestbuy" with "best
>> >> buy" or use DictionaryCompoundWordTokenFilterFactory to do the same.
>> >>
>> >> See:
>> >> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
>> >>
>> >> There are some examples in my book, but they are for German compound
>> >> words since that was the original primary intent for this filter. But
>> >> it should work for any words since it is a simple dictionary.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -----Original Message-----
>> >> From: Erick Erickson
>> >> Sent: Tuesday, August 20, 2013 7:21 AM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: What filter to use to search with spaces omitted/included
>> >> between words?
>> >>
>> >> Also consider WordDelimiterFilterFactory, which will break up the
>> >> tokens on upper/lower case transitions.
>> >>
>> >> To get relevance, consider edismax-style query parsers and adding
>> >> automatic phrase generation (with boosts usually).
>> >>
>> >> This one will be a problem:
>> >> q=bestbuy
>> >>
>> >> There's no good generic way to get this to split up. One
>> >> possibility is to use synonyms if the list is known, but
>> >> otherwise there's no information here to distinguish it
>> >> from "legitimate" words.
>> >>
>> >> edgeNgrams work on _tokens_, not words so I doubt
>> >> they would help in this case either since there is only
>> >> one token.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bin...@yahoo.co.in <tamanjit.bin...@yahoo.co.in> wrote:
>> >>
>> >>> Additionally, if you don't want results like q=best and result=bestbuy,
>> >>> you can use <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >>> pattern="\W+" replacement=""/> to actually replace whitespaces with
>> >>> nothing.
>> >>>
>> >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html
>> >>> Sent from the Solr - User mailing list archive at Nabble.com.
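
For anyone reading this thread later, here is a rough Python sketch of the analysis chain discussed above. It is only a toy simulation under stated assumptions, not Lucene's actual code: the function names are made up for illustration, the splitting rule covers just non-alphanumeric characters and lower-to-upper case transitions, and number handling is ignored. It may help show why "bestbuy" matches "Best Buy" with the keyword tokenizer but not when the text is split on whitespace first. Note that in the posted fieldType catenateAll is "0" and catenateWords is "1", so as far as I can tell it is catenateWords that produces the joined token there.

```python
import re

def keyword_tokenize(text):
    # Keyword tokenizer: the whole input is one token, spaces included.
    return [text]

def whitespace_tokenize(text):
    # Stand-in for a tokenizer that splits words apart before filtering.
    return text.split()

def word_delimiter(token, catenate_words=True, preserve_original=True):
    # Toy version of the word delimiter filter (hypothetical helper):
    # split on non-alphanumeric runs and on lower-to-upper transitions.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+|(?<=[a-z])(?=[A-Z])", token) if p]
    out = list(parts)
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))   # "Best" + "Buy" -> "BestBuy"
    if preserve_original:
        out.append(token)            # keep "Best Buy" as well
    return out

def analyze(text, tokenizer):
    # Tokenize, apply the word delimiter step per token, then lowercase.
    tokens = set()
    for tok in tokenizer(text):
        for part in word_delimiter(tok):
            tokens.add(part.lower())
    return tokens

def matches(query, indexed, tokenizer):
    # A hit needs at least one token shared by query and indexed value.
    return bool(analyze(query, tokenizer) & analyze(indexed, tokenizer))

if __name__ == "__main__":
    print(sorted(analyze("Best Buy", keyword_tokenize)))
    print(matches("bestbuy", "Best Buy", keyword_tokenize))
    print(matches("bestbuy", "Best Buy", whitespace_tokenize))
```

The point of the sketch is that the word delimiter step only sees one token at a time, so it can join "Best" and "Buy" only when the tokenizer has left them inside the same token.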