Re: Question on Exact Matches - edismax

Sandeep Mestry Thu, 04 Apr 2013 08:39:54 -0700

Another problem I see in the Solr analysis is that a query term that matches
the tokenized field does not match the case-insensitive field.
For example, when I search for 'coast to coast', the tokenized series
title (pg_series_title) matches, but the case-insensitive (ci) field,
pg_series_title_ci, does not.
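
One possible explanation, offered as an untested sketch: in Solr 4.x the edismax parser splits the query on whitespace before analysing each term against the qf fields, so a KeywordTokenizer field is asked for 'coast', 'to' and 'coast' separately and never for the single token 'coast to coast'. The pf fields, by contrast, are analysed with the whole query as one phrase. Under that assumption, keeping the ci fields in pf only might look like the following. The handler name assetdismax_sketch is hypothetical, and the field list is trimmed to the title fields for brevity.

```xml
<!-- Sketch only: "assetdismax_sketch" is a hypothetical handler name;
     the qf list is trimmed to the two tokenized title fields. -->
<requestHandler name="assetdismax_sketch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: tokenized fields only; each whitespace-separated term is analysed on its own -->
    <str name="qf">pg_series_title^200 title^25</str>
    <!-- pf: the whole query is analysed as one phrase, so the keyword-tokenized
         ci fields see a single lowercased token -->
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

Whether this gives the desired ranking would still need checking with debugQuery=true against real data.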


The definitions of both field types and fields are as below:

<fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                splitOnNumerics="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                splitOnNumerics="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true"
           omitNorms="true" compressThreshold="10">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="pg_series_title" type="text_wc" indexed="true" stored="true"
       multiValued="false"/>
<field name="pg_series_title_ci" type="string_ci" indexed="true"
       stored="true" multiValued="false"/>

<copyField source="pg_series_title" dest="pg_series_title_ci" />

Can this copyField directive be the issue? Should it be the other way round,
or does the direction not matter?
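
For what it is worth, copyField copies the raw incoming value before any analysis runs, so the destination field is analysed by its own field type regardless of how the source is tokenized. The annotated sketch below restates the existing directive on that basis; it introduces no new fields, and the note about reversing the direction is an assumption about how the indexing client populates the document.

```xml
<!-- Sketch only: the same directive as in the schema, annotated. -->
<!-- The raw (not yet analysed) value sent for pg_series_title is copied, so
     pg_series_title_ci is analysed by string_ci independently of text_wc. -->
<copyField source="pg_series_title" dest="pg_series_title_ci" />
<!-- Assumption: reversing source and dest would only make sense if the
     indexing client populated pg_series_title_ci instead of pg_series_title. -->
```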

Thanks,
Sandeep





On 4 April 2013 10:38, Sandeep Mestry <sanmes...@gmail.com> wrote:

> Hi Jan,
>
> Thanks for your reply. I have defined string_ci like below:
>
> <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true"
> omitNorms="true" compressThreshold="10">
>             <analyzer>
>                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>             </analyzer>
>         </fieldType>
>
> When I analysed the query in Solr, I saw that a document containing
> pg_series_title_ci:"funny" matches a search for
> pg_series_title_ci:"funny games" and is ranked higher than the document
> containing the exact match. I could use the default string data type, but
> then the match would be case-sensitive.
>
> Thanks,
> Sandeep
>
>
> On 3 April 2013 22:20, Jan Høydahl <jan....@cominvent.com> wrote:
>
>> Can you show us your *_ci field type? Solr does not really have a way to
>> tell whether a match is "exact" or only partial, but you could hack around
>> it with the fieldType. See https://github.com/cominvent/exactmatch for a
>> possible solution.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> 3. apr. 2013 kl. 15:55 skrev Sandeep Mestry <sanmes...@gmail.com>:
>>
>> > Hi All,
>> >
>> > I have a requirement wherein exact matches on 2 fields (Series Title,
>> > Title) should be ranked higher than partial matches. The configuration
>> > looks like below:
>> >
>> > <requestHandler name="assetdismax" class="solr.SearchHandler" >
>> >        <lst name="defaults">
>> >            <str name="defType">edismax</str>
>> >            <str name="echoParams">explicit</str>
>> >            <float name="tie">0.01</float>
>> >            <str name="qf">pg_series_title_ci^500 title_ci^300
>> > pg_series_title^200 title^25 classifications^15 classifications_texts^15
>> > parent_classifications^10 synonym_classifications^5 pg_brand_title^5
>> > pg_series_working_title^5 p_programme_title^5 p_item_title^5
>> > p_interstitial_title^5 description^15 pg_series_description annotations^0.1
>> > classification_notes^0.05 pv_program_version_number^2
>> > pv_program_version_number_ci^2 pv_program_number^2 pv_program_number_ci^2
>> > p_program_number^2 ma_version_number^2 ma_recording_location
>> > ma_contributions^0.001 rel_pg_series_title rel_programme_title
>> > rel_programme_number rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5
>> > pv_uuid^0.5 ma_uuid^0.5</str>
>> >            <str name="pf">pg_series_title_ci^500 title_ci^500</str>
>> >            <int name="ps">0</int>
>> >            <str name="q.alt">*:*</str>
>> >            <str name="mm">100%</str>
>> >            <str name="q.op">AND</str>
>> >            <str name="facet">true</str>
>> >            <str name="facet.limit">-1</str>
>> >            <str name="facet.mincount">1</str>
>> >        </lst>
>> >    </requestHandler>
>> >
>> > As you can see above, the search runs against many fields. What I want is
>> > for documents that have exact matches on the series title and title fields
>> > to rank higher than the rest.
>> >
>> > I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for
>> > series title and title, and have boosted them above the tokenized fields
>> > and the rest. I have also implemented a similarity class to override idf;
>> > however, I still get documents with partial matches in title and other
>> > fields ranking higher than an exact match in pg_series_title_ci.
>> >
>> > Many Thanks,
>> > Sandeep
>>
>>
>
