Re: Possible issue in edismax?

Sandeep Mestry Wed, 30 Jan 2013 23:31:54 -0800

Thanks, Felipe.
Can you point me to an example, please?

Also, forgive me, but if a document has matches in more searchable fields,
should it not rank higher?
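
On the question of whether matching more fields should lift a document: the explain output in this thread combines the per-field scores as "max plus 0.01 times others", so with tie=0.01 the additional fields contribute very little. Below is a minimal sketch of that arithmetic, using the per-field scores from the explainOther entry quoted further down. The helper name dismax_score is made up for illustration and is not a Solr API.

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores as 'max plus <tie> times others'."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Per-field scores for doc 967895, copied from the explainOther output.
field_scores = {
    "description": 3.304414,
    "pg_series_title": 5.913381,
    "p_programme_title": 0.18680073,
    "pg_series_title_ci": 6.464123,
    "title_ci": 1.6107484,
}

scores = list(field_scores.values())
with_tie = dismax_score(scores, tie=0.01)  # the tie value from the config
summed = dismax_score(scores, tie=1.0)     # tie=1.0 turns the max into a plain sum

print(f"tie=0.01 -> {with_tie:.4f}")
print(f"tie=1.0  -> {summed:.4f}")
```

With tie=0.01 this reproduces the 6.5742764 reported for that document. Even with tie=1.0, where all field scores are summed, the total is about 17.48, still far below the 2675.7844 of the top hit, so the number of matching fields is not what decides the order here.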


Thanks,
Sandeep
On 30 Jan 2013 19:30, "Felipe Lahti" <fla...@thoughtworks.com> wrote:

> If you compare the scores of the first and last documents, you will see that
> the last one matches more fields than the first one, so you may be wondering
> why it ranks lower. The first doc only matches the "contributions" field and
> the last matches a bunch of fields, so if you want the ranking to behave more
> like your qf setting (<str name="qf">series_title^500 title^100
> description^15 contribution</str>), you have to override methods of
> DefaultSimilarity.
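
To see which factor actually separates those two documents, the explain output can be recomputed by hand. The sketch below assumes the structure the explain output itself shows (score = queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm). The function name term_score is hypothetical; the numbers are copied from the thread.

```python
QUERY_NORM = 3.093982e-4  # queryNorm, the same for every clause in the explain output


def term_score(boost, idf, tf, field_norm, query_norm=QUERY_NORM):
    """Recompute one 'weight(field:term)' entry of the explain output."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# Top hit: contributions:news in doc 63298 (no query boost).
top_hit = term_score(boost=1.0, idf=14.530705, tf=1.0, field_norm=40960.0)

# Best clause of the document that "should rank higher":
# pg_series_title_ci:news^500.0 in doc 967895.
expected = term_score(boost=500.0, idf=6.4641423, tf=1.0, field_norm=1.0)

# Ratio of each factor (top hit / expected document).
boost_ratio = 1.0 / 500.0
idf_ratio = (14.530705 / 6.4641423) ** 2  # idf enters the score twice
norm_ratio = 40960.0 / 1.0

print(f"top hit  = {top_hit:.2f}")
print(f"expected = {expected:.4f}")
print(f"boost x{boost_ratio}, idf^2 x{idf_ratio:.2f}, fieldNorm x{norm_ratio}")
```

The 500x query boost is outweighed mostly by the fieldNorm of 40960.0 on the contributions field; IDF accounts for only a factor of about 5. The thread does not say where that fieldNorm comes from. A value that far above 1 suggests an index-time boost, but that is an assumption.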
>
>
> On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry <sanmes...@gmail.com>
> wrote:
>
> > I have pasted it below. It differs slightly from the dismax
> > configuration I mentioned above, as I was playing with all sorts of
> > boost values; however, it looks more like below:
> >
> > <str name="c208c2ca-4270-27b8-e040-a8c00409063a">
> > 2675.7844 = (MATCH) sum of:
> >   2675.7844 = (MATCH) max plus 0.01 times others of:
> >     2675.7844 = (MATCH) weight(contributions:news in 63298) [DefaultSimilarity], result of:
> >       2675.7844 = score(doc=63298,freq=1.0 = termFreq=1.0 ), product of:
> >         0.004495774 = queryWeight, product of:
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         595177.7 = fieldWeight in 63298, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           40960.0 = fieldNorm(doc=63298)
> > </str>
> > <str name="c208c2a9-66bc-27b8-e040-a8c00409063a">
> > 2317.297 = (MATCH) sum of:
> >   2317.297 = (MATCH) max plus 0.01 times others of:
> >     2317.297 = (MATCH) weight(contributions:news in 9826415) [DefaultSimilarity], result of:
> >       2317.297 = score(doc=9826415,freq=3.0 = termFreq=3.0 ), product of:
> >         0.004495774 = queryWeight, product of:
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         515439.0 = fieldWeight in 9826415, product of:
> >           1.7320508 = tf(freq=3.0), with freq of:
> >             3.0 = termFreq=3.0
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           20480.0 = fieldNorm(doc=9826415)
> > </str>
> > <str name="c208c2aa-1806-27b8-e040-a8c00409063a">
> > 2140.6274 = (MATCH) sum of:
> >   2140.6274 = (MATCH) max plus 0.01 times others of:
> >     2140.6274 = (MATCH) weight(contributions:news in 9882325) [DefaultSimilarity], result of:
> >       2140.6274 = score(doc=9882325,freq=1.0 = termFreq=1.0 ), product of:
> >         0.004495774 = queryWeight, product of:
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         476142.16 = fieldWeight in 9882325, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           32768.0 = fieldNorm(doc=9882325)
> > </str>
> > <str name="c208c2b0-5165-27b8-e040-a8c00409063a">
> > 1605.4707 = (MATCH) sum of:
> >   1605.4707 = (MATCH) max plus 0.01 times others of:
> >     1605.4707 = (MATCH) weight(contributions:news in 220007) [DefaultSimilarity], result of:
> >       1605.4707 = score(doc=220007,freq=1.0 = termFreq=1.0 ), product of:
> >         0.004495774 = queryWeight, product of:
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         357106.62 = fieldWeight in 220007, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           24576.0 = fieldNorm(doc=220007)
> > </str>
> > <str name="c208c2cc-d01b-27b8-e040-a8c00409063a">
> > 1605.4707 = (MATCH) sum of:
> >   1605.4707 = (MATCH) max plus 0.01 times others of:
> >     1605.4707 = (MATCH) weight(contributions:news in 241151) [DefaultSimilarity], result of:
> >       1605.4707 = score(doc=241151,freq=1.0 = termFreq=1.0 ), product of:
> >         0.004495774 = queryWeight, product of:
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         357106.62 = fieldWeight in 241151, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> >           24576.0 = fieldNorm(doc=241151)
> > </str>
> > </lst>
> > <str name="otherQuery">id:c208c2b4-1b3e-27b8-e040-a8c00409063a</str>
> > <lst name="explainOther">
> > <str name="*c208c2b4-1b3e-27b8-e040-a8c00409063a*"> <!-- this should rank higher -->
> > 6.5742764 = (MATCH) sum of:
> >   6.5742764 = (MATCH) max plus 0.01 times others of:
> >     3.304414 = (MATCH) weight(description:news^25.0 in 967895) [DefaultSimilarity], result of:
> >       3.304414 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> >         0.042727955 = queryWeight, product of:
> >           25.0 = boost
> >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         77.33611 = fieldWeight in 967895, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> >           14.0 = fieldNorm(doc=967895)
> >     5.913381 = (MATCH) weight(pg_series_title:news^50.0 in 967895) [DefaultSimilarity], result of:
> >       5.913381 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> >         0.080834694 = queryWeight, product of:
> >           50.0 = boost
> >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         73.154 = fieldWeight in 967895, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> >           14.0 = fieldNorm(doc=967895)
> >     0.18680073 = (MATCH) weight(p_programme_title:news in 967895) [DefaultSimilarity], result of:
> >       0.18680073 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> >         0.002031815 = queryWeight, product of:
> >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         91.93787 = fieldWeight in 967895, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> >           14.0 = fieldNorm(doc=967895)
> >     6.464123 = (MATCH) weight(pg_series_title_ci:news^500.0 in 967895) [DefaultSimilarity], result of:
> >       6.464123 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> >         0.99999696 = queryWeight, product of:
> >           500.0 = boost
> >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         6.4641423 = fieldWeight in 967895, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> >           1.0 = fieldNorm(doc=967895)
> >     1.6107484 = (MATCH) weight(title_ci:news^100.0 in 967895) [DefaultSimilarity], result of:
> >       1.6107484 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> >         0.22324038 = queryWeight, product of:
> >           100.0 = boost
> >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> >           3.093982E-4 = queryNorm
> >         7.2153096 = fieldWeight in 967895, product of:
> >           1.0 = tf(freq=1.0), with freq of:
> >             1.0 = termFreq=1.0
> >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> >           1.0 = fieldNorm(doc=967895)
> > </str>
> >
> >
> > On 30 January 2013 17:55, Felipe Lahti <fla...@thoughtworks.com> wrote:
> >
> > > Let me see if I understood your problem:
> > >
> > > From your first e-mail, I think you are worried about the order in which
> > > Solr returns documents. Is that correct? If so, as I said before,
> > > boosting is not the only thing that influences that order. There is also
> > > term frequency, IDF (inverse document frequency), and so on. If I
> > > understood your first e-mail correctly, you are interested in getting rid
> > > of IDF. For that, you can create a NoIDFSimilarity class to override the
> > > default similarity.
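
For reference, the tf and idf values in the explain output are consistent with the classic formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq). The sketch below checks that against numbers from this thread. It only illustrates the arithmetic: a real NoIDFSimilarity would be a Java class plugged into Solr, and the Python function names here are made up.

```python
import math


def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms get a larger value."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def no_idf(doc_freq, max_docs):
    """What a 'no IDF' similarity would return instead: a constant."""
    return 1.0


MAX_DOCS = 11282414  # maxDocs from the explain output

print(idf(14, MAX_DOCS))     # contributions:news, a rare term in that field
print(idf(47791, MAX_DOCS))  # pg_series_title_ci:news, a common term
print(tf(3.0))               # freq=3.0 in doc 9826415
print(no_idf(14, MAX_DOCS))
```

These agree with the 14.530705, 6.4641423 and 1.7320508 shown in the explain output. Replacing idf with a constant removes the advantage a rare term gets, but it leaves fieldNorm and the boosts untouched.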
> > >
> > > Can you paste here the score calculation for one document?
> > >
> > >
> > > On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry <sanmes...@gmail.com>
> > > wrote:
> > >
> > >> (Sorry for the incomplete reply in my previous mail; I didn't know
> > >> Ctrl F sends an email in Gmail.. ;-))
> > >>
> > >> Thanks Felipe, yes, I have seen that, and my requirement falls under:
> > >>
> > >> How can I make exact-case matches score higher
> > >>
> > >> Example: a query of "Penguin" should score documents containing
> > >> "Penguin" higher than docs containing "penguin".
> > >>
> > >> The general strategy is to index the content twice, using different
> > >> fields with different fieldTypes (and different analyzers associated
> > >> with those fieldTypes). One analyzer will contain a lowercase filter
> > >> for case-insensitive matches, and one will preserve case for exact-case
> > >> matches.
> > >>
> > >> Use copyField <http://wiki.apache.org/solr/SchemaXml#copyField>
> > >> commands in the schema to index a single input field multiple times.
> > >>
> > >> Once the content is indexed into multiple fields that are analyzed
> > >> differently, query across both fields
> > >> <http://wiki.apache.org/solr/SolrRelevancyFAQ#multiFieldQuery>.
> > >>
> > >> I have added a case-insensitive field too, so that exact matches score
> > >> higher; however, the results do not even take the field matches into
> > >> account, never mind the exact-matching part.
> > >>
> > >> I have also tried the debugQuery option, as mentioned in my previous
> > >> mail, and posted the parsed queries. From the debug output, I see that
> > >> the field with the lower boost factor (contribution) still scores
> > >> higher than the one with the higher boost factor (series_title).
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> Sandeep
> > >>
> > >>
> > >>
> > >>
> > >> On 30 January 2013 16:02, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > >>
> > >> > Thanks Felipe, yes I have seen that and my requirement somewhere
> > >> > falls for
> > >> >
> > >> >
> > >> > On 30 January 2013 15:53, Felipe Lahti <fla...@thoughtworks.com>
> > >> > wrote:
> > >> >
> > >> >> Hi Sandeep,
> > >> >>
> > >> >> The quick answer is that the boost you define in your
> > >> >> requestHandler is not the only thing used to calculate each
> > >> >> document's score. Other factors contribute to the score calculation
> > >> >> as well. You can read about them at
> > >> >> http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, with
> > >> >> debugQuery=true you can see the score calculation for each document
> > >> >> returned.
> > >> >>
> > >> >> Let me know if you need anything else.
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry
> > >> >> <sanmes...@gmail.com> wrote:
> > >> >>
> > >> >> > Hi All,
> > >> >> >
> > >> >> > I'm facing an issue with the relevancy calculation in the dismax
> > >> >> > query parser. The boost factor does not work as expected when the
> > >> >> > keyword is generic, by which I mean that it appears many times in
> > >> >> > the document as well as in the index.
> > >> >> >
> > >> >> > My parser configuration is as below:
> > >> >> >
> > >> >> > <requestHandler name="querydismax" class="solr.SearchHandler">
> > >> >> >     <lst name="defaults">
> > >> >> >         <str name="defType">edismax</str>
> > >> >> >         <str name="echoParams">explicit</str>
> > >> >> >         <float name="tie">0.01</float>
> > >> >> >         <str name="qf">series_title^500 title^100 description^15 contribution</str>
> > >> >> >         <str name="pf">series_title^200</str>
> > >> >> >         <int name="ps">0</int>
> > >> >> >         <str name="q.alt">*:*</str>
> > >> >> >     </lst>
> > >> >> > </requestHandler>
> > >> >> >
> > >> >> > As you can see above, I'd expect documents with matches in the
> > >> >> > series title to rank higher than those with matches in
> > >> >> > contribution.
> > >> >> >
> > >> >> > This works well if I type in a query like 'wonderworld', which is
> > >> >> > a rarely occurring term: the series titles rank higher. But if I
> > >> >> > type in a keyword like 'news', which is the most common term in
> > >> >> > the index, I get hits in contributions even though I have lots of
> > >> >> > documents with the word news in the series title.
> > >> >> >
> > >> >> > The field definition is as below:
> > >> >> >
> > >> >> > <field name="series_title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > >> >> > <field name="title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > >> >> > <field name="description" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > >> >> > <field name="contribution" type="text" indexed="true" stored="true" multiValued="true" />
> > >> >> >
> > >> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100" compressThreshold="10">
> > >> >> >     <analyzer type="index">
> > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >> >> >         <filter class="solr.WordDelimiterFilterFactory"
> > >> >> >                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >> >> >                 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > >> >> >     </analyzer>
> > >> >> >     <analyzer type="query">
> > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >> >> >         <filter class="solr.WordDelimiterFilterFactory"
> > >> >> >                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >> >> >                 catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > >> >> >     </analyzer>
> > >> >> > </fieldType>
> > >> >> >
> > >> >> > <fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100">
> > >> >> >     <analyzer type="index">
> > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >> >> >         <filter class="solr.WordDelimiterFilterFactory"
> > >> >> >                 stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
> > >> >> >                 catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
> > >> >> >                 splitOnNumerics="0" preserveOriginal="1" />
> > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > >> >> >     </analyzer>
> > >> >> >     <analyzer type="query">
> > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >> >> >         <filter class="solr.WordDelimiterFilterFactory"
> > >> >> >                 stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
> > >> >> >                 catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
> > >> >> >                 splitOnNumerics="0" preserveOriginal="1" />
> > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > >> >> >     </analyzer>
> > >> >> > </fieldType>
> > >> >> >
> > >> >> > I have tried debugging, and when I use the query term news, I see
> > >> >> > that matches for contributions are ranked higher than those for
> > >> >> > series title. The parsed queries look like below:
> > >> >> > (Note that I have edited the query: in reality I have a lot of
> > >> >> > searchable fields, and I have only included the ones containing
> > >> >> > text data; the rest all contain uuids.)
> > >> >> >
> > >> >> > <str name="parsedquery">
> > >> >> > (+DisjunctionMaxQuery((description:news^15.0 | title:news^100.0 |
> > >> >> > contributions:news | series_title:news^500.0)~0.01) () () () ()
> ()
> > ()
> > >> >> () ()
> > >> >> > () () () () () () () () () () () () () () () () () () ()
> > ())/no_coord
> > >> >> > </str>
> > >> >> > <str name="parsedquery_toString">
> > >> >> > +(description:news^15 | title:news^100.0 | contributions:news |
> > >> >> > series_title:news^500.0)~0.01 () () () () () () () () () () () ()
> > ()
> > >> ()
> > >> >> ()
> > >> >> > () () () () () () () () () () () () ()
> > >> >> >
> > >> >> >
> > >> >> > Could you guide me in right direction please?
> > >> >> >
> > >> >> > Many Thanks,
> > >> >> > Sandeep
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Felipe Lahti
> > >> >> Consultant Developer - ThoughtWorks Porto Alegre
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Felipe Lahti
> > > Consultant Developer - ThoughtWorks Porto Alegre
> > >
> >
>
>
>
> --
> Felipe Lahti
> Consultant Developer - ThoughtWorks Porto Alegre
>
