Re: What filter to use to search with spaces omitted/included between words?

Jack Krupansky Wed, 21 Aug 2013 08:18:47 -0700

The reason that a query of "bestbuy" matches indexing of "best buy" in this case is that the keyword tokenizer treats the entire input text as one token, including the space between "best" and "buy", and then the WDF treats any embedded white space as if it were punctuation, and then the catenateAll attribute causes "best" and "buy" to be concatenated to form "bestbuy".
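
The chain described above can be approximated with a short Python sketch. This is a simplified model and not Lucene's implementation: the function names are made up for illustration, the delimiter step splits only on non-alphanumeric characters (the real filter also reacts to case and letter/digit transitions), and concatenation is modelled as one flag named after the catenateWords="1" setting in the posted config.

```python
import re

def keyword_tokenize(text):
    # KeywordTokenizer-style behaviour: the whole input becomes one token.
    return [text]

def word_delimiter(token, catenate_words=True, preserve_original=True):
    # Simplified stand-in for the word delimiter filter (hypothetical helper):
    # split on runs of non-alphanumeric characters, so a space acts like punctuation.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)           # keep the untouched token
    out.extend(parts)               # the individual word parts
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))  # "Best" + "Buy" -> "BestBuy"
    # Drop duplicates while keeping order.
    unique = []
    for tok in out:
        if tok not in unique:
            unique.append(tok)
    return unique

def analyze(text):
    # Keyword tokenizer, then delimiter step, then lowercasing.
    tokens = []
    for tok in keyword_tokenize(text):
        tokens.extend(word_delimiter(tok))
    return [t.lower() for t in tokens]

indexed = analyze("Best Buy")
query = analyze("bestbuy")
matches = any(term in indexed for term in query)
```

In this model the indexed value yields the original token, both word parts, and the concatenated form, so the single query token "bestbuy" finds a match.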


-- Jack Krupansky

-----Original Message----- From: Erick Erickson

Sent: Wednesday, August 21, 2013 11:12 AM
To: solr-user@lucene.apache.org

Subject: Re: What filter to use to search with spaces omitted/included between words?


Keyword tokenizer will probably cause you problems, since you'll never
match "best", and searching name:best AND name:buy would fail as well.

And I'm surprised this is working at all. I'd really scrutinize why bestbuy
matches an index with Best Buy; that makes no sense on the surface.

If you have a relatively small vocabulary, synonyms might work for you.
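
As a rough illustration of the synonym idea, here is a minimal Python sketch. In Solr this would normally be handled by a synonym filter reading a synonyms file; the code below only models the effect. The table contents and the function name are hypothetical, using the store names mentioned elsewhere in this thread.

```python
# Hypothetical synonym table: run-together store names mapped to their word parts.
STORE_SYNONYMS = {
    "bestbuy": ["best", "buy"],
    "walmart": ["wal", "mart"],
    "circuitcity": ["circuit", "city"],
}

def expand_query(query):
    # Lowercase, split on whitespace, and replace known compounds with their parts.
    terms = []
    for term in query.lower().split():
        terms.extend(STORE_SYNONYMS.get(term, [term]))
    return terms

expanded = expand_query("BestBuy coupons")
```

The approach only scales as far as the list can be maintained by hand, which is why it suits a relatively small vocabulary.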

Best,
Erick

On Tue, Aug 20, 2013 at 8:04 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

Let me take that back, this actually works. q=bestbuy matches "Best Buy"
and documents are returned.

        <fieldType name="rl_keywords" class="solr.TextField"
                   positionIncrementGap="100">
            <analyzer type="index">
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1"
                        catenateWords="1" catenateNumbers="1"
                        catenateAll="0" preserveOriginal="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <tokenizer class="solr.KeywordTokenizerFactory"/>
            </analyzer>
            <analyzer type="query">
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1"
                        catenateWords="1" catenateNumbers="1"
                        catenateAll="0" preserveOriginal="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <tokenizer class="solr.KeywordTokenizerFactory"/>
            </analyzer>
        </fieldType>

I was using <tokenizer class="solr.StandardTokenizerFactory"/>, replacing
it with <tokenizer class="solr.KeywordTokenizerFactory"/> did the trick.
Not sure how it worked. The field value I am searching is "Best Buy", but
when I search for "bestbuy", it returns a result.

Thanks,
-Utkarsh



On Tue, Aug 20, 2013 at 4:48 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Thanks Tamanjit and Erick.
>
> I tried out the filters; most of the use cases work except "q=bestbuy". As
> mentioned by Erick, that is a hard one to crack.
>
> I am looking into DictionaryCompoundWordTokenFilterFactory, but with
> compound words like these:
> http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words
> and generic English words, it won't cover my need for custom compound words
> of store names like BestBuy, WalMart or CircuitCity.
>
> Thanks,
> -Utkarsh
>
>
> On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> You could either have a synonym filter to replace "bestbuy" with "best
>> buy" or use DictionaryCompoundWordTokenFilterFactory to do the same.
>>
>> See:
>> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
>>
>> There are some examples in my book, but they are for German compound
>> words since that was the original primary intent for this filter. But it
>> should work for any words since it is a simple dictionary.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Erick Erickson
>> Sent: Tuesday, August 20, 2013 7:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: What filter to use to search with spaces omitted/included
>> between words?
>>
>>
>> Also consider WordDelimiterFilterFactory, which will break up the
>> tokens on upper/lower case transitions.
>>
>> To get relevance, consider edismax-style query parsers and adding
>> automatic phrase generation (with boosts usually).
>>
>> This one will be a problem:
>> q=bestbuy
>>
>> There's no good generic way to get this to split up. One
>> possibility is to use synonyms if the list is known, but
>> otherwise there's no information here to distinguish it
>> from "legitimate" words.
>>
>> edgeNgrams work on _tokens_, not words so I doubt
>> they would help in this case either since there is only
>> one token.
>>
>> Best
>> Erick
>>
>>
>> On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bin...@yahoo.co.in <
>> tamanjit.bin...@yahoo.co.in> wrote:
>>
>>> Additionally, if you don't want results like q=best and result=bestbuy,
>>> you can use <charFilter class="solr.PatternReplaceCharFilterFactory"
>>> pattern="\W+" replacement=""/> to actually replace whitespaces with
>>> nothing.
>>>
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>
>
> --
> Thanks,
> -Utkarsh
>



--
Thanks,
-Utkarsh
