Re: Possible issue in edismax?

Felipe Lahti Wed, 30 Jan 2013 09:55:45 -0800

Let me check that I understand your problem.

From your first e-mail, I think you are worried about the order in which
Solr returns documents. Is that correct? If so, as I said before, boosting
is not the only thing that influences that order. Term frequency and IDF
(inverse document frequency) also contribute. If I understood your first
e-mail correctly, you want to get rid of IDF. For that, you can create a
NoIDFSimilarity class that overrides the default similarity.
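
To make the interaction of those factors concrete, here is a rough sketch in
Python (not Solr code) of the classic Lucene TF-IDF weight for one term in
one field, plus the dismax combination with the tie from your config. The
function names are mine, the document counts and field lengths are made-up
numbers, and queryNorm, coord and the lossy norm encoding are left out, so
the absolute values will not match debugQuery output. Passing use_idf=False
mimics what a NoIDFSimilarity would do (idf fixed at 1).

```python
import math


def tf(freq):
    """Classic Lucene term-frequency factor: sqrt of occurrences in the field."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Classic Lucene IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields weigh more: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def field_score(boost, freq, doc_freq, num_docs, num_terms, use_idf=True):
    """Weight of one term in one field; idf enters twice (query and field side).

    use_idf=False fixes idf at 1, as a NoIDFSimilarity would.
    """
    i = idf(doc_freq, num_docs) if use_idf else 1.0
    return boost * tf(freq) * i * i * length_norm(num_terms)


def dismax(field_scores, tie=0.01):
    """DisjunctionMaxQuery: best field score plus tie times the other fields."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


if __name__ == "__main__":
    NUM_DOCS = 1000000  # hypothetical index size
    # Hypothetical statistics: 'news' is common in series_title, rare in contribution.
    for use_idf in (True, False):
        series = field_score(500, 1, 400000, NUM_DOCS, 4, use_idf)
        contrib = field_score(1, 1, 10, NUM_DOCS, 2, use_idf)
        print("use_idf=%s series_title=%.2f contribution=%.2f"
              % (use_idf, series, contrib))
```

The point is only that the boost is multiplied by the other factors rather
than overriding them; which factor dominates for a real document is something
only the debugQuery output can show.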


Can you paste the score calculation (the debugQuery explanation) for one
document here?


On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry <sanmes...@gmail.com> wrote:

> (Sorry for the incomplete reply in my previous mail; I didn't know Ctrl F
> sends an email in Gmail.. ;-))
>
> Thanks Felipe, yes I have seen that, and my requirement falls under:
>
> How can I make exact-case matches score higher
>
> Example: a query of "Penguin" should score documents containing "Penguin"
> higher than docs containing "penguin".
>
> The general strategy is to index the content twice, using different fields
> with different fieldTypes (and different analyzers associated with those
> fieldTypes). One analyzer will contain a lowercase filter for
> case-insensitive matches, and one will preserve case for exact-case
> matches.
>
> Use copyField <http://wiki.apache.org/solr/SchemaXml#copyField> commands in
> the schema to index a single input field multiple times.
>
> Once the content is indexed into multiple fields that are analyzed
> differently, query across both fields
> <http://wiki.apache.org/solr/SolrRelevancyFAQ#multiFieldQuery>.
>
> I have also added a case insensitive field to score the exact matches
> higher; however, the result is not even considering the matches in the
> field, never mind the exact matching part.
>
> I have also tried the debugQuery option, as mentioned in my previous mail,
> and I have posted the parsed queries. From the debug query, I see that the
> field with the smaller boost factor (contribution) still scores higher
> than the one with the larger boost factor (series_title).
>
>
> Thanks,
>
> Sandeep
>
>
>
>
> On 30 January 2013 16:02, Sandeep Mestry <sanmes...@gmail.com> wrote:
>
> > Thanks Felipe, yes I have seen that and my requirement somewhere falls for
> >
> >
> > On 30 January 2013 15:53, Felipe Lahti <fla...@thoughtworks.com> wrote:
> >
> >> Hi Sandeep,
> >>
> >> The quick answer is that the boost you define in your requestHandler is
> >> not the only thing used to calculate the score of each document. There
> >> are other factors that contribute to the score calculation. You can read
> >> about them at http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, with
> >> debugQuery=true you can see the score calculation for each document
> >> returned.
> >>
> >> Let me know if you need anything else.
> >>
> >>
> >>
> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry <sanmes...@gmail.com>
> >> wrote:
> >>
> >> > Hi All,
> >> >
> >> > I'm facing an issue with the relevancy calculation in the dismax query
> >> > parser. The boost factor does not work as expected in certain cases
> >> > when the keyword is generic; by generic I mean that the keyword appears
> >> > many times in the document as well as in the index.
> >> >
> >> > I have parser configuration as below:
> >> >
> >> > <requestHandler name="querydismax" class="solr.SearchHandler" >
> >> >         <lst name="defaults">
> >> >             <str name="defType">edismax</str>
> >> >             <str name="echoParams">explicit</str>
> >> >             <float name="tie">0.01</float>
> >> >             <str name="qf">series_title^500 title^100 description^15 contribution</str>
> >> >             <str name="pf">series_title^200</str>
> >> >             <int name="ps">0</int>
> >> >             <str name="q.alt">*:*</str>
> >> >         </lst>
> >> > </requestHandler>
> >> >
> >> > As you can see above, I'd expect documents with matches in series
> >> > title to rank higher than those with matches in contribution.
> >> >
> >> > This works well if I type in a query like 'wonderworld', which is a
> >> > rarely occurring term: the series titles rank higher. But if I type in
> >> > a keyword like 'news', which is the most common term in the index, I
> >> > get hits in contributions even though I have lots of documents with
> >> > the word 'news' in the series title.
> >> >
> >> > The field definition is as below:
> >> >
> >> > <field name="series_title" type="text_wc" indexed="true" stored="true"
> >> > multiValued="false" />
> >> > <field name="title" type="text_wc" indexed="true" stored="true"
> >> > multiValued="false" />
> >> > <field name="description" type="text_wc" indexed="true" stored="true"
> >> > multiValued="false" />
> >> > <field name="contribution" type="text" indexed="true" stored="true"
> >> > multiValued="true" />
> >> >
> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> >> >            compressThreshold="10">
> >> >             <analyzer type="index">
> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >                         generateWordParts="1" generateNumberParts="1"
> >> >                         catenateWords="1" catenateNumbers="1"
> >> >                         catenateAll="0" splitOnCaseChange="1"/>
> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >             </analyzer>
> >> >             <analyzer type="query">
> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >                         generateWordParts="1" generateNumberParts="1"
> >> >                         catenateWords="0" catenateNumbers="0"
> >> >                         catenateAll="0" splitOnCaseChange="1"/>
> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >             </analyzer>
> >> >         </fieldType>
> >> >
> >> > <fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100">
> >> >             <analyzer type="index">
> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >                         stemEnglishPossessive="0" generateWordParts="1"
> >> >                         generateNumberParts="1" catenateWords="1"
> >> >                         catenateNumbers="1" catenateAll="1"
> >> >                         splitOnCaseChange="1" splitOnNumerics="0"
> >> >                         preserveOriginal="1" />
> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >             </analyzer>
> >> >             <analyzer type="query">
> >> >                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> >                 <filter class="solr.WordDelimiterFilterFactory"
> >> >                         stemEnglishPossessive="0" generateWordParts="1"
> >> >                         generateNumberParts="1" catenateWords="1"
> >> >                         catenateNumbers="1" catenateAll="1"
> >> >                         splitOnCaseChange="1" splitOnNumerics="0"
> >> >                         preserveOriginal="1" />
> >> >                 <filter class="solr.LowerCaseFilterFactory"/>
> >> >             </analyzer>
> >> >  </fieldType>
> >> >
> >> > I have tried debugging, and when I use the query term 'news' I see that
> >> > matches for contributions are ranked higher than series title. The
> >> > parsed queries look like this:
> >> > (Note that I have edited the query: in reality I have a lot of
> >> > searchable fields, and I have only listed the ones containing text
> >> > data; the rest all contain UUIDs.)
> >> >
> >> > <str name="parsedquery">
> >> > (+DisjunctionMaxQuery((description:news^15.0 | title:news^100.0 |
> >> > contributions:news | series_title:news^500.0)~0.01) () () () () () ()
> >> > () ()
> >> > () () () () () () () () () () () () () () () () () () () ())/no_coord
> >> > </str>
> >> > <str name="parsedquery_toString">
> >> > +(description:news^15 | title:news^100.0 | contributions:news |
> >> > series_title:news^500.0)~0.01 () () () () () () () () () () () () ()
> >> > () ()
> >> > () () () () () () () () () () () () ()
> >> >
> >> >
> >> > Could you guide me in the right direction, please?
> >> >
> >> > Many Thanks,
> >> > Sandeep
> >> >
> >>
> >>
> >>
> >> --
> >> Felipe Lahti
> >> Consultant Developer - ThoughtWorks Porto Alegre
> >>
> >
> >
>



-- 
Felipe Lahti
Consultant Developer - ThoughtWorks Porto Alegre
