Re: What filter to use to search with spaces omitted/included between words?

Erick Erickson Wed, 21 Aug 2013 08:29:25 -0700

Jack:

That's a consequence of keyword tokenizer I hadn't thought of before....


Erick


On Wed, Aug 21, 2013 at 11:17 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> The reason that a query of "bestbuy" matches indexing of "best buy" in
> this case is that the keyword tokenizer treats the entire input text as one
> token, including the space between "best" and "buy". The WDF then treats
> any embedded white space as if it were punctuation and splits the token,
> and the catenateWords attribute (catenateAll is "0" in the posted config)
> causes "best" and "buy" to be concatenated to form "bestbuy".
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Erick Erickson
> Sent: Wednesday, August 21, 2013 11:12 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: What filter to use to search with spaces omitted/included
> between words?
>
> Keyword tokenizer will probably cause you problems, since you'll never
> match "best", and searching name:best AND name:buy would fail as well.
>
> And I'm surprised this is working at all. I'd really scrutinize why bestbuy
> matches an index with Best Buy; that makes no sense on the surface.
>
> If you have a relatively small vocabulary, synonyms might work for you.
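
The synonym suggestion could look roughly like the field type below. This is a sketch under assumptions, not a tested configuration: the field type name "store_name_syn" and the file "store_synonyms.txt" are made up for illustration, and the file is assumed to contain the line `bestbuy, best buy`. Synonyms are applied at index time only, since the query parser may split a multi-word synonym on whitespace before analysis.

```xml
<!-- Sketch only: "store_name_syn" and "store_synonyms.txt" are hypothetical names. -->
<!-- store_synonyms.txt is assumed to contain the line: bestbuy, best buy -->
<fieldType name="store_name_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand="true" indexes every variant listed on a synonym line. -->
    <filter class="solr.SynonymFilterFactory" synonyms="store_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup like this, indexing "Best Buy" should also add the token "bestbuy", so q=bestbuy and q=best would both be expected to match. The cost is that every store name has to be listed in the synonyms file.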
>
> Best,
> Erick
>
>
> On Tue, Aug 20, 2013 at 8:04 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> Let me take that back, this actually works. q=bestbuy matches "Best Buy"
>> and documents are returned.
>>
>>         <fieldType name="rl_keywords" class="solr.TextField"
>>                    positionIncrementGap="100">
>>             <analyzer type="index">
>>                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>>                 <filter class="solr.WordDelimiterFilterFactory"
>>                         generateWordParts="1" generateNumberParts="1"
>>                         catenateWords="1" catenateNumbers="1"
>>                         catenateAll="0" preserveOriginal="1"/>
>>                 <filter class="solr.LowerCaseFilterFactory"/>
>>             </analyzer>
>>             <analyzer type="query">
>>                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>>                 <filter class="solr.WordDelimiterFilterFactory"
>>                         generateWordParts="1" generateNumberParts="1"
>>                         catenateWords="1" catenateNumbers="1"
>>                         catenateAll="0" preserveOriginal="1"/>
>>                 <filter class="solr.LowerCaseFilterFactory"/>
>>             </analyzer>
>>         </fieldType>
>>
>> I was using <tokenizer class="solr.StandardTokenizerFactory"/>;
>> replacing it with <tokenizer class="solr.KeywordTokenizerFactory"/> did
>> the trick. Not sure how it worked. The field value I am searching is
>> "Best Buy", but when I search for "bestbuy", it returns a result.
>>
>> Thanks,
>> -Utkarsh
>>
>>
>>
>> On Tue, Aug 20, 2013 at 4:48 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>>
>> > Thanks Tamanjit and Erick.
>> > I tried out the filters; most of the use cases work except "q=bestbuy".
>> > As mentioned by Erick, that is a hard one to crack.
>> >
>> > I am looking into DictionaryCompoundWordTokenFilterFactory, but it
>> > targets compound words like these:
>> > http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words
>> > and generic English words, so it won't cover my need for custom compound
>> > words of store names like BestBuy, WalMart or CircuitCity.
>> >
>> > Thanks,
>> > -Utkarsh
>> >
>> >
>> > On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>> >
>> >> You could either have a synonym filter to replace "bestbuy" with "best
>> >> buy" or use DictionaryCompoundWordTokenFilterFactory to do the same.
>> >>
>> >> See:
>> >> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
>> >>
>> >> There are some examples in my book, but they are for German compound
>> >> words since that was the original primary intent for this filter. But
>> >> it should work for any words since it is a simple dictionary.
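
As a rough illustration of the dictionary-based approach with store names instead of German compounds, a field type might look like the sketch below. The field type name "store_name_split" and the dictionary file "store_words.txt" are assumptions made up for this example; the file is assumed to list one lowercase subword per line (best, buy, wal, mart, and so on). The size attributes are the commonly documented values, not tuned settings.

```xml
<!-- Sketch only: "store_name_split" and "store_words.txt" are hypothetical names. -->
<!-- store_words.txt is assumed to hold one lowercase subword per line, e.g. best, buy. -->
<fieldType name="store_name_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so tokens can match the lowercase dictionary entries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keeps the original token and adds any dictionary subwords found inside it. -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="store_words.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

With such a chain, the query token "bestbuy" would be expected to come out as "bestbuy", "best" and "buy" at the same position, so it can match a document indexed as "Best Buy". It would probably also match documents that contain only "best" or only "buy", which may or may not be acceptable.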
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -----Original Message----- From: Erick Erickson
>> >> Sent: Tuesday, August 20, 2013 7:21 AM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: What filter to use to search with spaces omitted/included
>> >> between words?
>> >>
>> >>
>> >> Also consider WordDelimiterFilterFactory, which will break up the
>> >> tokens on upper/lower case transitions.
>> >>
>> >> To get relevance, consider edismax-style query parsers and adding
>> >> automatic phrase generation (with boosts usually).
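
One way to read the edismax suggestion is a request handler whose defaults search a field term by term and additionally boost documents where the terms occur together as a phrase. The fragment below is only a sketch for solrconfig.xml: the handler name "/stores" and the boost value are assumptions, and the field name "name" is borrowed from the name:best example elsewhere in this thread.

```xml
<!-- Sketch only: the handler name "/stores" and the boost value 10 are hypothetical. -->
<requestHandler name="/stores" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields searched for the individual query terms. -->
    <str name="qf">name</str>
    <!-- Phrase fields: boost documents where the query terms occur together as a phrase. -->
    <str name="pf">name^10</str>
  </lst>
</requestHandler>
```

With defaults like these, q=best buy would still match documents containing either term, but a document whose name holds the phrase "best buy" should score higher.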
>> >>
>> >> This one will be a problem:
>> >> q=bestbuy
>> >>
>> >> There's no good generic way to get this to split up. One
>> >> possibility is to use synonyms if the list is known, but
>> >> otherwise there's no information here to distinguish it
>> >> from "legitimate" words.
>> >>
>> >> edgeNgrams work on _tokens_, not words so I doubt
>> >> they would help in this case either since there is only
>> >> one token.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >>
>> >> On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bin...@yahoo.co.in <
>> >> tamanjit.bin...@yahoo.co.in> wrote:
>> >>
>> >>> Additionally, if you don't want results like q=best and result=bestbuy,
>> >>> you can use <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >>> pattern="\W+" replacement=""/> to actually replace whitespace with
>> >>> nothing.
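
Placed in a complete analyzer, the char filter suggestion might look like the sketch below. The field type name "store_name_nospace" is an assumption for illustration, and the keyword tokenizer plus lowercasing are one possible pairing rather than the only one.

```xml
<!-- Sketch only: "store_name_nospace" is a hypothetical field type name. -->
<fieldType name="store_name_nospace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it. -->
    <!-- \W+ removes every run of non-word characters, not only spaces; \s+ would strip whitespace alone. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions "Best Buy" would be indexed as the single token "bestbuy", so q=bestbuy should match and q=best should not. An unquoted query such as best buy may be split on whitespace by the query parser before this analyzer runs, so it may need to be sent as a quoted phrase.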
>> >>>
>> >>>
>> >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html
>> >>> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>>
>> >>>
>> >>
>> >
>> >
>> > --
>> > Thanks,
>> > -Utkarsh
>> >
>>
>>
>>
>> --
>> Thanks,
>> -Utkarsh
>>
>>
>
