Thanks Em, Robert, Chris for your time and valuable advice. We'll run some tests and let you know soon.
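For concreteness while we run those tests: Em's suggested structure (quoted below) would look roughly like this inside our QParser. This is only a sketch against the Lucene/Solr trunk API of this period; class and package names (e.g. FunctionQuery, FloatFieldSource) have moved between packages across versions, and `query_score` is the stored score field mentioned later in the thread:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.FloatFieldSource;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ScoredUserQueryBuilder {
    /**
     * User query as MUST (restricts the match set); function query as
     * SHOULD (only contributes to the score of docs that already match).
     */
    public static Query build() {
        // The user's query: a DisjunctionMaxQuery over two TermQueries,
        // as in Em's example (tie-break factor 0.0f).
        DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
        userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
        userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

        BooleanQuery bq = new BooleanQuery();
        bq.add(userQuery, BooleanClause.Occur.MUST);
        bq.add(new FunctionQuery(new FloatFieldSource("query_score")),
               BooleanClause.Occur.SHOULD);
        return bq;
    }
}
```

(Not runnable without the Lucene jars, so treat it as pseudocode grounded in the real API rather than a tested implementation.)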
On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
> Hello Carlos,
>
> I think we misunderstood each other.
>
> As an example:
> BooleanQuery (
>   clauses: (
>     MustMatch(
>       DisjunctionMaxQuery(
>         TermQuery("stopword_field", "barcelona"),
>         TermQuery("stopword_field", "hoteles")
>       )
>     ),
>     ShouldMatch(
>       FunctionQuery(
>         *please insert your function here*
>       )
>     )
>   )
> )
>
> Explanation:
> You construct an artificial BooleanQuery which wraps your user's query
> as well as your function query.
> Your user's query - in that case - is just a DisjunctionMaxQuery
> consisting of two TermQueries.
> In the real world you might construct another BooleanQuery around your
> DisjunctionMaxQuery in order to have more flexibility.
> However, the interesting part of the given example is that we specify
> the user's query as a MustMatch condition of the BooleanQuery and the
> FunctionQuery just as a ShouldMatch.
> Constructed that way, I expect the FunctionQuery to score only those
> documents which fit the MustMatch condition.
>
> I conclude that from the fact that the FunctionQuery class also has a
> skipTo method, and I would expect the scorer to use it to score only
> matching documents (however, I did not check where and how it might
> get called).
>
> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
> see, the author of that class) can tell us what the intention was in
> constructing an every-time-match-all function query.
>
> Can you validate whether your QueryParser constructs a query in the form
> I drew above?
>
> Regards,
> Em
>
> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
> > Hello Em:
> >
> > 1) Here's a printout of an example DisMax query (as you can see, mostly
> > MUST terms, except for some SHOULD terms used for boosting scores for
> > stopwords):
> >
> > ((+stopword_shortened_phrase:hoteles
> > +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
> > (+stopword_phrase:hoteles +stopword_phrase:barcelona
> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> > +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
> > (+stopword_phrase:hoteles +stopword_phrase:barcelona
> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> > +wildcard_stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> > +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
> > (+stopword_shortened_phrase:hoteles
> > +wildcard_stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> > +wildcard_stopword_phrase:barcelona stopword_phrase:en))
> >
> > 2) The collector is inserted in the SolrIndexSearcher (replacing the
> > TimeLimitingCollector). We trigger it through the SOLR interface by
> > passing the timeAllowed parameter. We know this is a hack, but AFAIK
> > there's no out-of-the-box way to specify custom collectors yet
> > (https://issues.apache.org/jira/browse/SOLR-1680). In any case, the
> > collector part works perfectly as of now, so clearly this is not the
> > problem.
> >
> > 3) Re: your sentence:
> >
> > *I* would expect that with a shrinking set of matching documents to
> > the overall-query, the function query only checks those documents
> > that are guaranteed to be within the result set.
> >
> > Yes, I agree with this, but this snippet of code in FunctionQuery.java
> > seems to say otherwise:
> >
> > // instead of matching all docs, we could also embed a query.
> > // the score could either ignore the subscore, or boost it.
> > // Containment: floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> > // Boost: foo:myTerm^floatline("myFloatField",1.0,0.0f)
> > @Override
> > public int nextDoc() throws IOException {
> >   for (;;) {
> >     ++doc;
> >     if (doc >= maxDoc) {
> >       return doc = NO_MORE_DOCS;
> >     }
> >     if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> >     return doc;
> >   }
> > }
> >
> > It seems that the author also thought of maybe embedding a query in
> > order to restrict matches, but this doesn't seem to be in place as of
> > now (or maybe I'm not understanding how the whole thing works :) ).
> >
> > Thanks,
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> > On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
> >
> >> Hello Carlos,
> >>
> >>> We have some more tests on that matter: now we're moving from issuing
> >>> this large query through the SOLR interface to creating our own
> >>> QueryParser. The initial tests we've done in our QParser (that
> >>> internally creates multiple queries and inserts them inside a
> >>> DisjunctionMaxQuery) are very good; we're getting very good response
> >>> times and high-quality answers. But when we've tried to wrap the
> >>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>> QueryValueSource that wraps the DisMaxQuery), the times move from
> >>> 10-20 msec to 200-300 msec.
> >> I reviewed the source code and yes, the FunctionQuery iterates over
> >> the whole index, however... let's see!
> >>
> >> In relation to the DisMaxQuery you create within your parser: what
> >> kind of clause is the FunctionQuery, and what kind of clause are your
> >> other queries (MUST, SHOULD, MUST_NOT...)?
> >>
> >> *I* would expect that with a shrinking set of matching documents to
> >> the overall-query, the function query only checks those documents
> >> that are guaranteed to be within the result set.
> >>
> >>> Note that we're using early termination of queries (via a custom
> >>> collector), and therefore (as shown by the numbers I included above),
> >>> even if the query is very complex, we're getting very fast answers.
> >>> The only situation where the response time explodes is when we
> >>> include a FunctionQuery.
> >> Could you give us some details about how/where you plugged in the
> >> Collector, please?
> >>
> >> Kind regards,
> >> Em
> >>
> >> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
> >>> Hello Em:
> >>>
> >>> Thanks for your answer.
> >>>
> >>> Yes, we initially also thought that the excessive increase in
> >>> response time was caused by the several queries being executed, and
> >>> we did another test. We executed one of the subqueries that I've
> >>> shown to you directly in the "q" parameter, and then we tested this
> >>> same subquery (only this one, without the others) with the function
> >>> query "query($q1)" in the "q" parameter.
> >>>
> >>> Theoretically the times for these two queries should be more or less
> >>> the same, but the second one is several times slower than the first
> >>> one. After this observation we learned more about function queries,
> >>> and we learned from the code and from some comments in the forums [1]
> >>> that FunctionQueries are expected to match all documents.
> >>>
> >>> We have some more tests on that matter: now we're moving from issuing
> >>> this large query through the SOLR interface to creating our own
> >>> QueryParser. The initial tests we've done in our QParser (that
> >>> internally creates multiple queries and inserts them inside a
> >>> DisjunctionMaxQuery) are very good; we're getting very good response
> >>> times and high-quality answers. But when we've tried to wrap the
> >>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>> QueryValueSource that wraps the DisMaxQuery), the times move from
> >>> 10-20 msec to 200-300 msec.
> >>>
> >>> Note that we're using early termination of queries (via a custom
> >>> collector), and therefore (as shown by the numbers I included above),
> >>> even if the query is very complex, we're getting very fast answers.
> >>> The only situation where the response time explodes is when we
> >>> include a FunctionQuery.
> >>>
> >>> Re: your question of what we're trying to achieve... We're
> >>> implementing a powerful query autocomplete system, and we use
> >>> several fields to a) improve performance on wildcard queries and
> >>> b) have very precise control over the score.
> >>>
> >>> Thanks a lot for your help,
> >>> Carlos
> >>>
> >>> [1]:
> >>> http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> >>>
> >>> Carlos Gonzalez-Cadenas
> >>> CEO, ExperienceOn - New generation search
> >>> http://www.experienceon.com
> >>>
> >>> Mobile: +34 652 911 201
> >>> Skype: carlosgonzalezcadenas
> >>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >>>
> >>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de>
> >>> wrote:
> >>>
> >>>> Hello Carlos,
> >>>>
> >>>> Well, you must take into account that you are executing up to 8
> >>>> queries per request instead of one query per request.
> >>>>
> >>>> I am not totally sure about the details of the implementation of
> >>>> the max function query, but I guess it first iterates over the
> >>>> results of the first max-query, then over the results of the second
> >>>> max-query, and so on. This is a much higher complexity than in the
> >>>> case of a normal query.
> >>>>
> >>>> I would suggest that you optimize your request. I don't think that
> >>>> this particular function query is matching *all* docs.
> >>>> Instead, I think it just matches those docs specified by your
> >>>> inner-query (although I might be wrong about that).
> >>>>
> >>>> What are you trying to achieve with your request?
> >>>>
> >>>> Regards,
> >>>> Em
> >>>>
> >>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
> >>>>> Hello Em:
> >>>>>
> >>>>> The URL is quite large (w/ shards, ...), maybe it's best if I
> >>>>> paste the relevant parts.
> >>>>>
> >>>>> Our "q" parameter is:
> >>>>>
> >>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
> >>>>>
> >>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >>>>>
> >>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> >>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> >>>>> (stopword_phrase:las AND stopword_phrase:de)"
> >>>>>
> >>>>> We've executed the subqueries q3-q8 independently and they're very
> >>>>> fast, but when we introduce the function queries as described
> >>>>> below, it all goes 10X slower.
> >>>>>
> >>>>> Let me know if you need anything else.
> >>>>>
> >>>>> Thanks,
> >>>>> Carlos
> >>>>>
> >>>>> Carlos Gonzalez-Cadenas
> >>>>> CEO, ExperienceOn - New generation search
> >>>>> http://www.experienceon.com
> >>>>>
> >>>>> Mobile: +34 652 911 201
> >>>>> Skype: carlosgonzalezcadenas
> >>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >>>>>
> >>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello Carlos,
> >>>>>>
> >>>>>> Could you show us what your Solr call looks like?
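A side note on the nested `_val_` expression quoted above: Solr also ships a `{!boost}` query parser that multiplies a wrapped query's score by a function, and it scores only the documents matching the wrapped query. A sketch (the `qq` parameter name is illustrative, and the exact syntax should be verified against the Solr version in use):

```
q={!boost b=query_score v=$qq}
qq=+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en
```

This computes score = (IR score of qq) × query_score without iterating over the whole index, which is essentially the "containment" behavior the FunctionQuery comment hints at.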
> >>>>>>
> >>>>>> Regards,
> >>>>>> Em
> >>>>>>
> >>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
> >>>>>>> Hello all:
> >>>>>>>
> >>>>>>> We'd like to score the matching documents using a combination of
> >>>>>>> SOLR's IR score with another application-specific score that we
> >>>>>>> store within the documents themselves (i.e. a float field
> >>>>>>> containing the app-specific score). In particular, we'd like to
> >>>>>>> calculate the final score by doing some operations with both
> >>>>>>> numbers (i.e. product, sqrt, ...).
> >>>>>>>
> >>>>>>> According to what we know, there are two ways to do this in SOLR:
> >>>>>>>
> >>>>>>> A) Sort by function [1]: We've tested an expression like
> >>>>>>> "sort=product(score, query_score)" in the SOLR query, where
> >>>>>>> score is the common SOLR IR score and query_score is our own
> >>>>>>> precalculated score, but it seems that SOLR can only do this
> >>>>>>> with stored/indexed fields (and obviously "score" is not
> >>>>>>> stored/indexed).
> >>>>>>>
> >>>>>>> B) Function queries: We've used _val_ and function queries like
> >>>>>>> max, sqrt and query, and we've obtained the desired results from
> >>>>>>> a functional point of view. However, our index is quite large
> >>>>>>> (400M documents) and the performance degrades heavily, given
> >>>>>>> that function queries AFAIK match all the documents.
> >>>>>>>
> >>>>>>> I have two questions:
> >>>>>>>
> >>>>>>> 1) Apart from the two options I mentioned, is there any other
> >>>>>>> (simple) way to achieve this that we're not aware of?
> >>>>>>>
> >>>>>>> 2) If we have to choose the function queries path, would it be
> >>>>>>> very difficult to modify the actual implementation so that it
> >>>>>>> doesn't match all the documents, that is, to pass a query so
> >>>>>>> that it only operates over the documents matching the query?
> >>>>>>> Looking at the FunctionQuery.java source code, there's a comment
> >>>>>>> that says "// instead of matching all docs, we could also embed
> >>>>>>> a query. the score could either ignore the subscore, or boost
> >>>>>>> it", which is giving us some hope that maybe it's possible and
> >>>>>>> even desirable to go in this direction. If you can give us some
> >>>>>>> directions about how to go about this, we may be able to do the
> >>>>>>> actual implementation.
> >>>>>>>
> >>>>>>> BTW, we're using Lucene/SOLR trunk.
> >>>>>>>
> >>>>>>> Thanks a lot for your help.
> >>>>>>> Carlos
> >>>>>>>
> >>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
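The nextDoc() loop quoted in this thread is the crux of the performance question. A tiny self-contained model (invented names; deliberately not Lucene code) shows why a scorer written this way examines every docid up to maxDoc instead of skipping to matches:

```java
import java.util.BitSet;

/**
 * Minimal stand-in for the FunctionQuery scorer loop quoted in the thread.
 * It mimics nextDoc(): advance one docid at a time, skipping only docs
 * absent from acceptDocs (e.g. deleted docs). Every accepted doc "matches",
 * so the work done is proportional to maxDoc, not to the match count.
 * All names here are invented for illustration.
 */
public class AllDocsScorerModel {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;
    private final BitSet acceptDocs; // null means "accept everything"
    private int doc = -1;
    int docsVisited = 0;             // counts docids examined, i.e. work done

    AllDocsScorerModel(int maxDoc, BitSet acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    int nextDoc() {
        for (;;) {
            ++doc;
            if (doc >= maxDoc) {
                return doc = NO_MORE_DOCS;
            }
            docsVisited++;
            if (acceptDocs != null && !acceptDocs.get(doc)) continue;
            return doc;
        }
    }

    public static void main(String[] args) {
        // 10 docs in the "index", only 2 accepted (live).
        BitSet live = new BitSet();
        live.set(3);
        live.set(7);
        AllDocsScorerModel scorer = new AllDocsScorerModel(10, live);

        int matches = 0;
        while (scorer.nextDoc() != NO_MORE_DOCS) {
            matches++;
        }
        // Only 2 docs match, yet all 10 docids were examined.
        System.out.println("matches=" + matches + " visited=" + scorer.docsVisited);
    }
}
```

With a 400M-document index, this linear scan is exactly why wrapping a fast query in a FunctionQuery pushed response times from 10-20 msec to 200-300 msec, and why Em's MUST/SHOULD wrapping (which lets the mandatory clause drive iteration) is attractive.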