Re: Possible issue in edismax?

Sandeep Mestry Fri, 01 Feb 2013 11:01:12 -0800

Hi..

Could you tell me if changing the default similarity to a custom implementation
will require me to rebuild the index? Or will it be used only at query time?


thanks,
Sandeep
 On 31 Jan 2013 13:55, "Felipe Lahti" <fla...@thoughtworks.com> wrote:

> So, it depends on your business requirements, right? If a document has
> matches in more searchable fields, then at least for me that document is
> more important than another document that has fewer matches.
>
> Example:
> Put this in your schema:
> <similarity class="com.your.namespace.NoIDFSimilarity" />
>
> And create a class on your Solr classpath:
>
> package com.your.namespace;
>
> import org.apache.lucene.search.similarities.DefaultSimilarity;
>
> // Custom similarity that ignores inverse document frequency.
> public class NoIDFSimilarity extends DefaultSimilarity {
>
>     // Return a constant so that term rarity no longer affects the score.
>     @Override
>     public float idf(long docFreq, long numDocs) {
>         return 1.0f;
>     }
> }
>
>
> It will "neutralize" the idf (which is the rarity of a term).
>
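For readers wondering what the override changes numerically, here is a minimal sketch. It assumes the Lucene 4.x DefaultSimilarity idf formula, 1 + ln(numDocs / (docFreq + 1)), which reproduces the idf values in the explain output quoted further down this thread. The class name IdfSketch and its method names are hypothetical and are not part of Lucene or Solr.

```java
// Sketch only: mirrors the idf formula assumed for Lucene 4.x DefaultSimilarity.
// IdfSketch, defaultIdf and noIdf are hypothetical names used for illustration.
final class IdfSketch {

    // Default behaviour: rare terms (small docFreq) get a large idf.
    static float defaultIdf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // What the NoIDFSimilarity override returns: the same value for every term.
    static float noIdf(long docFreq, long numDocs) {
        return 1.0f;
    }

    public static void main(String[] args) {
        long maxDocs = 11282414L; // maxDocs as reported in the explain output
        System.out.println(defaultIdf(14L, maxDocs));    // contributions:news
        System.out.println(defaultIdf(47791L, maxDocs)); // pg_series_title_ci:news
        System.out.println(noIdf(14L, maxDocs));
    }
}
```

With the default formula the rare contributions term gets an idf of roughly 14.53 and the series-title term roughly 6.46; with the override both become 1, so term rarity stops influencing the ranking.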
>
> On Thu, Jan 31, 2013 at 5:31 AM, Sandeep Mestry <sanmes...@gmail.com>
> wrote:
>
> > Thanks Felipe..
> > Can you point me to an example please?
> >
> > Also forgive me but if a document has matches in more searchable fields
> > then should it not rank higher?
> >
> > Thanks,
> > Sandeep
> > On 30 Jan 2013 19:30, "Felipe Lahti" <fla...@thoughtworks.com> wrote:
> >
> > > If you compare the first and last document scores you will see that the
> > > last one matches more fields than the first one. So you may be thinking:
> > > why? The first doc only matches the "contributions" field and the last
> > > matches a bunch of fields, so if you want it to behave more like
> > > (<str name="qf">series_title^500 title^100 description^15 contribution</str>)
> > > you have to override the idf method of DefaultSimilarity.
> > >
> > >
> > > On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry <sanmes...@gmail.com>
> > > wrote:
> > >
> > > > I have pasted it below. It varies slightly from the dismax
> > > > configuration I mentioned above, as I was playing with all sorts of
> > > > boost values, but it looks more or less like this:
> > > >
> > > > <str name="c208c2ca-4270-27b8-e040-a8c00409063a">
> > > > 2675.7844 = (MATCH) sum of:
> > > >   2675.7844 = (MATCH) max plus 0.01 times others of:
> > > >     2675.7844 = (MATCH) weight(contributions:news in 63298) [DefaultSimilarity], result of:
> > > >       2675.7844 = score(doc=63298,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.004495774 = queryWeight, product of:
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         595177.7 = fieldWeight in 63298, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           40960.0 = fieldNorm(doc=63298)
> > > > </str>
> > > > <str name="c208c2a9-66bc-27b8-e040-a8c00409063a">
> > > > 2317.297 = (MATCH) sum of:
> > > >   2317.297 = (MATCH) max plus 0.01 times others of:
> > > >     2317.297 = (MATCH) weight(contributions:news in 9826415) [DefaultSimilarity], result of:
> > > >       2317.297 = score(doc=9826415,freq=3.0 = termFreq=3.0 ), product of:
> > > >         0.004495774 = queryWeight, product of:
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         515439.0 = fieldWeight in 9826415, product of:
> > > >           1.7320508 = tf(freq=3.0), with freq of:
> > > >             3.0 = termFreq=3.0
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           20480.0 = fieldNorm(doc=9826415)
> > > > </str>
> > > > <str name="c208c2aa-1806-27b8-e040-a8c00409063a">
> > > > 2140.6274 = (MATCH) sum of:
> > > >   2140.6274 = (MATCH) max plus 0.01 times others of:
> > > >     2140.6274 = (MATCH) weight(contributions:news in 9882325) [DefaultSimilarity], result of:
> > > >       2140.6274 = score(doc=9882325,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.004495774 = queryWeight, product of:
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         476142.16 = fieldWeight in 9882325, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           32768.0 = fieldNorm(doc=9882325)
> > > > </str>
> > > > <str name="c208c2b0-5165-27b8-e040-a8c00409063a">
> > > > 1605.4707 = (MATCH) sum of:
> > > >   1605.4707 = (MATCH) max plus 0.01 times others of:
> > > >     1605.4707 = (MATCH) weight(contributions:news in 220007) [DefaultSimilarity], result of:
> > > >       1605.4707 = score(doc=220007,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.004495774 = queryWeight, product of:
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         357106.62 = fieldWeight in 220007, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           24576.0 = fieldNorm(doc=220007)
> > > > </str>
> > > > <str name="c208c2cc-d01b-27b8-e040-a8c00409063a">
> > > > 1605.4707 = (MATCH) sum of:
> > > >   1605.4707 = (MATCH) max plus 0.01 times others of:
> > > >     1605.4707 = (MATCH) weight(contributions:news in 241151) [DefaultSimilarity], result of:
> > > >       1605.4707 = score(doc=241151,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.004495774 = queryWeight, product of:
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         357106.62 = fieldWeight in 241151, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > >           24576.0 = fieldNorm(doc=241151)
> > > > </str>
> > > > </lst>
> > > > <str name="otherQuery">id:c208c2b4-1b3e-27b8-e040-a8c00409063a</str>
> > > > <lst name="explainOther">
> > > > <str name="c208c2b4-1b3e-27b8-e040-a8c00409063a"> <!-- this should rank higher -->
> > > > 6.5742764 = (MATCH) sum of:
> > > >   6.5742764 = (MATCH) max plus 0.01 times others of:
> > > >     3.304414 = (MATCH) weight(description:news^25.0 in 967895) [DefaultSimilarity], result of:
> > > >       3.304414 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.042727955 = queryWeight, product of:
> > > >           25.0 = boost
> > > >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         77.33611 = fieldWeight in 967895, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> > > >           14.0 = fieldNorm(doc=967895)
> > > >     5.913381 = (MATCH) weight(pg_series_title:news^50.0 in 967895) [DefaultSimilarity], result of:
> > > >       5.913381 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.080834694 = queryWeight, product of:
> > > >           50.0 = boost
> > > >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         73.154 = fieldWeight in 967895, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> > > >           14.0 = fieldNorm(doc=967895)
> > > >     0.18680073 = (MATCH) weight(p_programme_title:news in 967895) [DefaultSimilarity], result of:
> > > >       0.18680073 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.002031815 = queryWeight, product of:
> > > >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         91.93787 = fieldWeight in 967895, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> > > >           14.0 = fieldNorm(doc=967895)
> > > >     6.464123 = (MATCH) weight(pg_series_title_ci:news^500.0 in 967895) [DefaultSimilarity], result of:
> > > >       6.464123 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.99999696 = queryWeight, product of:
> > > >           500.0 = boost
> > > >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         6.4641423 = fieldWeight in 967895, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> > > >           1.0 = fieldNorm(doc=967895)
> > > >     1.6107484 = (MATCH) weight(title_ci:news^100.0 in 967895) [DefaultSimilarity], result of:
> > > >       1.6107484 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > >         0.22324038 = queryWeight, product of:
> > > >           100.0 = boost
> > > >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> > > >           3.093982E-4 = queryNorm
> > > >         7.2153096 = fieldWeight in 967895, product of:
> > > >           1.0 = tf(freq=1.0), with freq of:
> > > >             1.0 = termFreq=1.0
> > > >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> > > >           1.0 = fieldNorm(doc=967895)
> > > > </str>
> > > >
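To make the explain output above easier to follow, here is a small sketch that recomputes a single term score from the printed factors. It assumes the classic TF-IDF layout that the output itself shows (score = queryWeight * fieldWeight, with tf equal to the square root of the term frequency). The class and method names are hypothetical; only the numeric inputs come from the thread.

```java
// Sketch only: recomputes one term score from the factors printed by
// debugQuery=true. TermScoreSketch and its methods are hypothetical names.
final class TermScoreSketch {

    // The explain output shows tf(freq=3.0) = 1.7320508, i.e. sqrt(freq).
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Query-side factor: boost * idf * queryNorm (boost is 1.0 when none is set).
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    // Document-side factor: tf * idf * fieldNorm.
    static double fieldWeight(double freq, double idf, double fieldNorm) {
        return tf(freq) * idf * fieldNorm;
    }

    static double termScore(double boost, double freq, double idf,
                            double queryNorm, double fieldNorm) {
        return queryWeight(boost, idf, queryNorm) * fieldWeight(freq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 3.093982E-4;
        // contributions:news in doc 63298 (top hit, no query boost)
        System.out.println(termScore(1.0, 1.0, 14.530705, queryNorm, 40960.0));
        // pg_series_title_ci:news^500.0 in doc 967895
        System.out.println(termScore(500.0, 1.0, 6.4641423, queryNorm, 1.0));
    }
}
```

Run against the numbers in the thread, the first call comes out near 2675.78 and the second near 6.46. Note that idf enters the product twice, once in each factor, and that the fieldNorm values (40960.0 against 1.0) differ far more than the idf values do.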
> > > >
> > > > On 30 January 2013 17:55, Felipe Lahti <fla...@thoughtworks.com> wrote:
> > > >
> > > > > Let me see if I understood your problem:
> > > > >
> > > > > From your first e-mail I think you are worried about the order of
> > > > > the documents returned by Solr. Is that correct? If so, as I said
> > > > > before, it's not only the boosting that influences the order of the
> > > > > returned documents. There's also term frequency, IDF (inverse
> > > > > document frequency)... If I understood your first e-mail correctly,
> > > > > you are interested in getting rid of IDF. For that, you can create a
> > > > > NoIDFSimilarity class to override the default similarity.
> > > > >
> > > > > Can you paste the score calculation for one document here?
> > > > >
> > > > >
> > > > > On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > > > >
> > > > > >> (Sorry for the incomplete reply in my previous mail, I didn't know
> > > > > >> Ctrl F sends an email in Gmail.. ;-))
> > > > > >>
> > > > > >> Thanks Felipe, yes I have seen that, and my requirement falls under:
> > > > > >>
> > > > > >> How can I make exact-case matches score higher
> > > > > >>
> > > > > >> Example: a query of "Penguin" should score documents containing
> > > > > >> "Penguin" higher than docs containing "penguin".
> > > > > >>
> > > > > >> The general strategy is to index the content twice, using different
> > > > > >> fields with different fieldTypes (and different analyzers associated
> > > > > >> with those fieldTypes). One analyzer will contain a lowercase filter
> > > > > >> for case-insensitive matches, and one will preserve case for
> > > > > >> exact-case matches.
> > > > > >>
> > > > > >> Use copyField <http://wiki.apache.org/solr/SchemaXml#copyField>
> > > > > >> commands in the schema to index a single input field multiple times.
> > > > > >>
> > > > > >> Once the content is indexed into multiple fields that are analyzed
> > > > > >> differently, query across both fields
> > > > > >> <http://wiki.apache.org/solr/SolrRelevancyFAQ#multiFieldQuery>.
> > > > > >>
> > > > > >> I have added a case-insensitive field too so that exact matches
> > > > > >> score higher, however the result is not even considering the matches
> > > > > >> in the field - forget the exact matching part.
> > > > > >>
> > > > > >> And I have tried the debugQuery option as mentioned in my previous
> > > > > >> mail, and I have also posted the parsed queries. From the debug
> > > > > >> query, I see that the field boosted with the lesser factor
> > > > > >> (contribution) still ranks higher than the one with the higher boost
> > > > > >> factor (series_title).
> > > > >>
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Sandeep
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > > >> On 30 January 2013 16:02, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > > > >>
> > > > > >> > Thanks Felipe, yes I have seen that and my requirement somewhere
> > > > > >> > falls for
> > > > >> >
> > > > >> >
> > > > > >> > On 30 January 2013 15:53, Felipe Lahti <fla...@thoughtworks.com> wrote:
> > > > >> >
> > > > > >> >> Hi Sandeep,
> > > > > >> >>
> > > > > >> >> The quick answer is that the boost you define in your
> > > > > >> >> requestHandler is not the only thing used to calculate the score
> > > > > >> >> of each document. There are other factors that contribute to the
> > > > > >> >> score calculation. You can take a look at
> > > > > >> >> http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, using
> > > > > >> >> debugQuery=true you can see the score calculation for each
> > > > > >> >> document returned.
> > > > > >> >>
> > > > > >> >> Let me know if you need something else.
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > > >> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > > > >> >>
> > > > > >> >> > Hi All,
> > > > > >> >> >
> > > > > >> >> > I'm facing an issue with the relevancy calculation by the dismax
> > > > > >> >> > query parser. The boost factor applied does not work as expected
> > > > > >> >> > in certain cases when the keyword is generic, and by generic I
> > > > > >> >> > mean that the keyword appears many times in the document as well
> > > > > >> >> > as in the index.
> > > > > >> >> >
> > > > > >> >> > I have the parser configuration below:
> > > > >> >> >
> > > > > >> >> > <requestHandler name="querydismax" class="solr.SearchHandler" >
> > > > > >> >> >         <lst name="defaults">
> > > > > >> >> >             <str name="defType">edismax</str>
> > > > > >> >> >             <str name="echoParams">explicit</str>
> > > > > >> >> >             <float name="tie">0.01</float>
> > > > > >> >> >             <str name="qf">series_title^500 title^100 description^15 contribution</str>
> > > > > >> >> >             <str name="pf">series_title^200</str>
> > > > > >> >> >             <int name="ps">0</int>
> > > > > >> >> >             <str name="q.alt">*:*</str>
> > > > > >> >> >         </lst>
> > > > > >> >> > </requestHandler>
> > > > >> >> >
> > > > > >> >> > As you can see above, I'd expect the documents containing
> > > > > >> >> > matches in series title to rank higher than the ones in
> > > > > >> >> > contribution.
> > > > > >> >> >
> > > > > >> >> > This works well if I type in a query like 'wonderworld', which
> > > > > >> >> > is a less frequently occurring term, and the series titles rank
> > > > > >> >> > higher. But if I type in a keyword like 'news', which is the
> > > > > >> >> > most common term in the index, I get hits in contributions even
> > > > > >> >> > though I have lots of documents with the word news in the
> > > > > >> >> > series title.
> > > > > >> >> >
> > > > > >> >> > The field definition is as below:
> > > > >> >> >
> > > > > >> >> > <field name="series_title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="description" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="contribution" type="text" indexed="true" stored="true" multiValued="true" />
> > > > >> >> >
> > > > > >> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100" compressThreshold="10">
> > > > > >> >> >     <analyzer type="index">
> > > > > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >     </analyzer>
> > > > > >> >> >     <analyzer type="query">
> > > > > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > > > > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >     </analyzer>
> > > > > >> >> > </fieldType>
> > > > >> >> >
> > > > > >> >> > <fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100" >
> > > > > >> >> >     <analyzer type="index">
> > > > > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >         <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1" />
> > > > > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >     </analyzer>
> > > > > >> >> >     <analyzer type="query">
> > > > > >> >> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >         <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1" />
> > > > > >> >> >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >     </analyzer>
> > > > > >> >> > </fieldType>
> > > > >> >> >
> > > > > >> >> > I have tried debugging, and when I use the query term news, I
> > > > > >> >> > see that matches for contributions are ranked higher than
> > > > > >> >> > series title. The parsed queries look like below:
> > > > > >> >> > (Note that I have edited the query, as in reality I have a lot
> > > > > >> >> > of searchable fields and I have only mentioned the fields
> > > > > >> >> > containing text data - the rest all contain uuids)
> > > > >> >> >
> > > > >> >> > <str name="parsedquery">
> > > > >> >> > (+DisjunctionMaxQuery((description:news^15.0 |
> > title:news^100.0 |
> > > > >> >> > contributions:news | series_title:news^500.0)~0.01) () () ()
> ()
> > > ()
> > > > ()
> > > > >> >> () ()
> > > > >> >> > () () () () () () () () () () () () () () () () () () ()
> > > > ())/no_coord
> > > > >> >> > </str>
> > > > >> >> > <str name="parsedquery_toString">
> > > > >> >> > +(description:news^15 | title:news^100.0 |
> contributions:news |
> > > > >> >> > series_title:news^500.0)~0.01 () () () () () () () () () ()
> ()
> > ()
> > > > ()
> > > > >> ()
> > > > >> >> ()
> > > > >> >> > () () () () () () () () () () () () ()
> > > > >> >> >
> > > > >> >> >
> > > > > >> >> > Could you guide me in the right direction please?
> > > > >> >> >
> > > > >> >> > Many Thanks,
> > > > >> >> > Sandeep
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Felipe Lahti
> > > > >> >> Consultant Developer - ThoughtWorks Porto Alegre
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Felipe Lahti
> > > > > Consultant Developer - ThoughtWorks Porto Alegre
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Felipe Lahti
> > > Consultant Developer - ThoughtWorks Porto Alegre
> > >
> >
>
>
>
> --
> Felipe Lahti
> Consultant Developer - ThoughtWorks Porto Alegre
>
