Re: A little help with indexing joined words

Avlesh Singh Mon, 05 Oct 2009 12:25:48 -0700

Zambrano, I was too quick to respond to your idf explanation. I definitely
did not mean that "idf" and "length-norms" are the same thing.


Andrew, this is how I would do it.
First, I would create a field type called "prefix_text" in my schema.xml,
as shown underneath:
<fieldType name="prefix_text" class="solr.TextField">
    <!-- Index time: treat the whole value as one token, lowercase it, strip
         everything except a-z and 0-9, then emit every leading prefix. -->
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z0-9])" replacement="" replace="all"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory"
                minGramSize="1" maxGramSize="100"/>
    </analyzer>
    <!-- Query time: same normalization, but no n-gramming; input longer
         than 20 characters is cut to its first 20 characters. -->
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z0-9])" replacement="" replace="all"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Second, I would declare a field of this type and populate it from the
original field (using copyField) while indexing.
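
As a rough sketch of that second step, the declarations in schema.xml could
look something like this. The field names "product_name" and
"product_name_prefix" are made up for illustration, and omitNorms="true" is
an assumption based on the omitNorms remark quoted further down this thread:

```xml
<!-- Hypothetical field names; adjust them to your own schema. -->
<field name="product_name" type="text" indexed="true" stored="true"/>

<!-- Search-only copy of the product name: not stored, norms omitted. -->
<field name="product_name_prefix" type="prefix_text"
       indexed="true" stored="false" omitNorms="true"/>

<!-- Fill the prefix field from the original field at index time. -->
<copyField source="product_name" dest="product_name_prefix"/>
```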

Third, while querying I would query on both fields, boosting matches on the
original field well above matches on the n-grammed field. In scenarios where
"Dragon Fly" is expected to match "Dragonfly" in the index, the query on the
original field would not give you any matches, so the results from the
prefix_text field end up right at the top.
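
One way to express that third step is a dismax handler in solrconfig.xml
that searches both fields with different boosts. This is only a sketch: the
handler name "/products", the field names and the boost of 10 are all
assumptions, not values taken from this thread.

```xml
<!-- Hypothetical handler; field names and boost are illustrative only. -->
<requestHandler name="/products" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <!-- Original field weighted well above the n-grammed field. -->
        <str name="qf">product_name^10 product_name_prefix</str>
    </lst>
</requestHandler>
```

Keep in mind that the query parser splits the input on whitespace before
the field analyzers run, so a two-word input such as Dragon Fly most likely
has to be sent as a quoted phrase for the KeywordTokenizer to see it as a
single value. The analysis page in the admin section is a good place to
check what actually gets matched.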

Hope this helps.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 11:10 PM, Christian Zambrano <czamb...@gmail.com> wrote:

> Would you mind explaining how omitNorms has any effect on the IDF problem I
> described earlier?
>
> I agree with your second sentence. I had to use the NGramTokenFilter to
> accommodate partial matches.
>
>
> On 10/05/2009 12:11 PM, Avlesh Singh wrote:
>
>>> Using synonyms might be a better solution because the use of
>>> EdgeNGramTokenizerFactory has the potential of creating a large number of
>>> tokens, which will artificially increase the number of tokens in the index,
>>> which in turn will affect the IDF score.
>>>
>>>
>>>
>> Well, I don't see a reason as to why someone would need a length based
>> normalization on such matches. I always have done omitNorms while using
>> fields with this filter.
>>
>> Yes, synonyms might be an answer when you have a limited number of such
>> words (phrases) and their possible combinations.
>>
>> Cheers
>> Avlesh
>>
>> On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambrano <czamb...@gmail.com> wrote:
>>
>>
>>
>>> Using synonyms might be a better solution because the use of
>>> EdgeNGramTokenizerFactory has the potential of creating a large number of
>>> tokens, which will artificially increase the number of tokens in the index,
>>> which in turn will affect the IDF score.
>>>
>>> A query for "borderland" should have returned results though. It is
>>> difficult to troubleshoot why it didn't without knowing what query you
>>> used and what kind of analysis is taking place.
>>>
>>> Have you tried using the analysis page on the admin section to see what
>>> tokens get generated for 'Borderlands'?
>>>
>>> Christian
>>>
>>>
>>> On 10/05/2009 11:01 AM, Avlesh Singh wrote:
>>>
>>>
>>>
>>>>> We have indexed a product database and have come across some search
>>>>> terms where zero results are returned.  There are products in the index
>>>>> with 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
>>>>> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
>>>>> respectively.
>>>>>
>>>> "Borderland" should have worked for a regular text field. For all other
>>>> desired matches you can use EdgeNGramTokenizerFactory.
>>>>
>>>> Cheers
>>>> Avlesh
>>>>
>>>> On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe <eupe...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hi
>>>>> I am hoping someone can point me in the right direction with regards to
>>>>> indexing words that are concatenated together to make other words or
>>>>> product names.
>>>>>
>>>>> We have indexed a product database and have come across some search
>>>>> terms where zero results are returned.  There are products in the index
>>>>> with 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
>>>>> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
>>>>> respectively.
>>>>>
>>>>> Where do I look to resolve this?  The product name field is indexed
>>>>> using a text field type.
>>>>>
>>>>> Thanks in advance
>>>>> Andrew
