Re: custom scoring

Em Mon, 20 Feb 2012 06:12:49 -0800

Could you please send me the original request (the HTTP request)?
I am a little confused as to what "query_score" refers to.
As far as I can see, it isn't a magic value.


Kind regards,
Em

On 20.02.2012 14:05, Carlos Gonzalez-Cadenas wrote:
> Yeah Em, it helped a lot :)
> 
> Here it is (for the user query "hoteles"):
> 
> +(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
> wildcard_stopword_shortened_phrase:hoteles |
> wildcard_stopword_phrase:hoteles)
>
> product(pow(query((stopword_shortened_phrase:hoteles |
> stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
> wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))
> 
> Thanks a lot for your help.
> 
> Carlos
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Mon, Feb 20, 2012 at 1:50 PM, Em <mailformailingli...@yahoo.de> wrote:
> 
>> Carlos,
>>
>> nice to hear that the approach helped you!
>>
>> Could you show us what your query request looks like after reworking?
>>
>> Regards,
>> Em
>>
>> On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
>>> Hello all:
>>>
>>> We've done some tests with Em's approach of putting a BooleanQuery in front
>>> of our user query, that means:
>>>
>>> BooleanQuery
>>>     must (DismaxQuery)
>>>     should (FunctionQuery)
>>>
>>> The FunctionQuery obtains the SOLR IR score by means of a QueryValueSource,
>>> then takes the square root of this value, and then multiplies it by our
>>> custom "query_score" float, which it pulls by means of a FieldCacheSource.
>>>
>>> In particular, we've proceeded in the following way:
>>>
>>>    - we've loaded the whole index in the page cache of the OS to make sure
>>>    we don't have disk IO problems that might affect the benchmarks (our
>>>    machine has enough memory to load all the index in RAM)
>>>    - we've executed an out-of-benchmark query 10-20 times to make sure that
>>>    everything is jitted and that Lucene's FieldCache is properly populated
>>>    - we've disabled all the caches (filter query cache, document cache,
>>>    query cache)
>>>    - we've executed 8 different user queries with and without
>>>    FunctionQueries, with early termination in both cases (our collector stops
>>>    after collecting 50 documents per shard)
>>>
>>> Em was correct, the query is much faster with the BooleanQuery in front,
>>> but it's still 30-40% slower than the query without FunctionQueries.
>>>
>>> Although one may think that it's reasonable that the query response time
>>> increases because of the extra computations, we believe that the increase
>>> is too big, given that we're collecting just 500-600 documents due to the
>>> early query termination techniques we currently use.
>>>
>>> Any ideas on how to make it faster?
>>>
>>> Thanks a lot,
>>> Carlos
>>>
>>>
>>> On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
>>> c...@experienceon.com> wrote:
>>>
>>>> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
>>>> some tests and will let you know soon.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>
>>>>> Hello Carlos,
>>>>>
>>>>> I think we misunderstood each other.
>>>>>
>>>>> As an example:
>>>>> BooleanQuery (
>>>>>  clauses: (
>>>>>     MustMatch(
>>>>>               DisjunctionMaxQuery(
>>>>>                   TermQuery("stopword_field", "barcelona"),
>>>>>                   TermQuery("stopword_field", "hoteles")
>>>>>               )
>>>>>     ),
>>>>>     ShouldMatch(
>>>>>                  FunctionQuery(
>>>>>                    *please insert your function here*
>>>>>                 )
>>>>>     )
>>>>>  )
>>>>> )
>>>>>
>>>>> Explanation:
>>>>> You construct an artificial BooleanQuery which wraps your user's query
>>>>> as well as your function query.
>>>>> Your user's query - in that case - is just a DisjunctionMaxQuery
>>>>> consisting of two TermQueries.
>>>>> In the real world you might construct another BooleanQuery around your
>>>>> DisjunctionMaxQuery in order to have more flexibility.
>>>>> However, the interesting part of the given example is that we specify
>>>>> the user's query as a MustMatch condition of the BooleanQuery and the
>>>>> FunctionQuery just as a ShouldMatch.
>>>>> Constructed that way, I expect the FunctionQuery to score only those
>>>>> documents which fit the MustMatch condition.
>>>>>
>>>>> I conclude that from the fact that the FunctionQuery class also has a
>>>>> skipTo method, and I would expect the scorer to use it to score
>>>>> only matching documents (however, I did not search where and how it might
>>>>> get called).
>>>>>
>>>>> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
>>>>> see the author of that class) can tell us what the intention was behind
>>>>> constructing a function query that always matches all documents.
>>>>>
>>>>> Can you validate whether your QueryParser constructs a query in the form
>>>>> I drew above?
>>>>>
>>>>> Regards,
>>>>> Em
>>>>>
>>>>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
>>>>>> Hello Em:
>>>>>>
>>>>>> 1) Here's a printout of an example DisMax query (as you can see mostly MUST
>>>>>> terms except for some SHOULD terms used for boosting scores for stopwords)
>>>>>>
>>>>>> ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
>>>>>> | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
>>>>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
>>>>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))
>>>>>>
>>>>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
>>>>>> TimeLimitingCollector). We trigger it through the SOLR interface by passing
>>>>>> the timeAllowed parameter. We know this is a hack but AFAIK there's no
>>>>>> out-of-the-box way to specify custom collectors by now
>>>>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
>>>>>> part works perfectly as of now, so clearly this is not the problem.
>>>>>>
>>>>>> 3) Re: your sentence:
>>>>>>
>>>>>> "*I* would expect that with a shrinking set of matching documents to
>>>>>> the overall-query, the function query only checks those documents that are
>>>>>> guaranteed to be within the result set."
>>>>>>
>>>>>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
>>>>>> seems to say otherwise:
>>>>>>
>>>>>>     // instead of matching all docs, we could also embed a query.
>>>>>>     // the score could either ignore the subscore, or boost it.
>>>>>>     // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>>>>>>     // Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
>>>>>>     @Override
>>>>>>     public int nextDoc() throws IOException {
>>>>>>       for(;;) {
>>>>>>         ++doc;
>>>>>>         if (doc>=maxDoc) {
>>>>>>           return doc=NO_MORE_DOCS;
>>>>>>         }
>>>>>>         if (acceptDocs != null && !acceptDocs.get(doc)) continue;
>>>>>>         return doc;
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>> It seems that the author also thought of maybe embedding a query in order
>>>>>> to restrict matches, but this doesn't seem to be in place as of now (or
>>>>>> maybe I'm not understanding how the whole thing works :) ).
>>>>>>
>>>>>> Thanks
>>>>>> Carlos
>>>>>>
>>>>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>
>>>>>>> Hello Carlos,
>>>>>>>
>>>>>>>> We have some more tests on that matter: now we're moving from issuing this
>>>>>>>> large query through the SOLR interface to creating our own QueryParser. The
>>>>>>>> initial tests we've done in our QParser (that internally creates multiple
>>>>>>>> queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
>>>>>>>> getting very good response times and high quality answers. But when we've
>>>>>>>> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move from
>>>>>>>> 10-20 msec to 200-300msec.
>>>>>>> I reviewed the source code and yes, the FunctionQuery iterates over the
>>>>>>> whole index, however... let's see!
>>>>>>>
>>>>>>> In relation to the DisMaxQuery you create within your parser: What kind
>>>>>>> of clause is the FunctionQuery and what kind of clause are your other
>>>>>>> queries (MUST, SHOULD, MUST_NOT...)?
>>>>>>>
>>>>>>> *I* would expect that with a shrinking set of matching documents to the
>>>>>>> overall-query, the function query only checks those documents that are
>>>>>>> guaranteed to be within the result set.
>>>>>>>
>>>>>>>> Note that we're using early termination of queries (via a custom
>>>>>>>> collector), and therefore (as shown by the numbers I included above) even
>>>>>>>> if the query is very complex, we're getting very fast answers. The only
>>>>>>>> situation where the response time explodes is when we include a
>>>>>>>> FunctionQuery.
>>>>>>> Could you give us some details about how and where you plugged in the
>>>>>>> Collector, please?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Em
>>>>>>>
>>>>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
>>>>>>>> Hello Em:
>>>>>>>>
>>>>>>>> Thanks for your answer.
>>>>>>>>
>>>>>>>> Yes, we initially also thought that the excessive increase in response time
>>>>>>>> was caused by the several queries being executed, and we did another test.
>>>>>>>> We executed one of the subqueries that I've shown to you directly in the
>>>>>>>> "q" parameter and then we tested this same subquery (only this one, without
>>>>>>>> the others) with the function query "query($q1)" in the "q" parameter.
>>>>>>>>
>>>>>>>> Theoretically the times for these two queries should be more or less the
>>>>>>>> same, but the second one is several times slower than the first one. After
>>>>>>>> this observation we learned more about function queries and we learned from
>>>>>>>> the code and from some comments in the forums [1] that the FunctionQueries
>>>>>>>> are expected to match all documents.
>>>>>>>>
>>>>>>>> We have some more tests on that matter: now we're moving from issuing this
>>>>>>>> large query through the SOLR interface to creating our own QueryParser. The
>>>>>>>> initial tests we've done in our QParser (that internally creates multiple
>>>>>>>> queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
>>>>>>>> getting very good response times and high quality answers. But when we've
>>>>>>>> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move from
>>>>>>>> 10-20 msec to 200-300msec.
>>>>>>>>
>>>>>>>> Note that we're using early termination of queries (via a custom
>>>>>>>> collector), and therefore (as shown by the numbers I included above) even
>>>>>>>> if the query is very complex, we're getting very fast answers. The only
>>>>>>>> situation where the response time explodes is when we include a
>>>>>>>> FunctionQuery.
>>>>>>>>
>>>>>>>> Re: your question of what we're trying to achieve ... We're implementing a
>>>>>>>> powerful query autocomplete system, and we use several fields to a) improve
>>>>>>>> performance on wildcard queries and b) have a very precise control over the
>>>>>>>> score.
>>>>>>>>
>>>>>>>> Thanks a lot for your help,
>>>>>>>> Carlos
>>>>>>>>
>>>>>>>> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>>>
>>>>>>>>> Hello Carlos,
>>>>>>>>>
>>>>>>>>> well, you must take into account that you are executing up to 8 queries
>>>>>>>>> per request instead of one query per request.
>>>>>>>>>
>>>>>>>>> I am not totally sure about the details of the implementation of the
>>>>>>>>> max-function-query, but I guess it first iterates over the results of
>>>>>>>>> the first max-query, afterwards over the results of the second max-query
>>>>>>>>> and so on. This is a much higher complexity than in the case of a normal
>>>>>>>>> query.
>>>>>>>>>
>>>>>>>>> I would suggest optimizing your request. I don't think that this
>>>>>>>>> particular function query is matching *all* docs. Instead I think it
>>>>>>>>> just matches those docs specified by your inner-query (although I might
>>>>>>>>> be wrong about that).
>>>>>>>>>
>>>>>>>>> What are you trying to achieve by your request?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Em
>>>>>>>>>
>>>>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>> Hello Em:
>>>>>>>>>>
>>>>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
>>>>>>>>>> relevant parts.
>>>>>>>>>>
>>>>>>>>>> Our "q" parameter is:
>>>>>>>>>>
>>>>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
>>>>>>>>>>
>>>>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>>>>>>>>
>>>>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
>>>>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
>>>>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
>>>>>>>>>>
>>>>>>>>>> We've executed the subqueries q3-q8 independently and they're very fast,
>>>>>>>>>> but when we introduce the function queries as described below, it all goes
>>>>>>>>>> 10X slower.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need anything else.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Carlos
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello carlos,
>>>>>>>>>>>
>>>>>>>>>>> could you show us what your Solr call looks like?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Em
>>>>>>>>>>>
>>>>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>>>> Hello all:
>>>>>>>>>>>>
>>>>>>>>>>>> We'd like to score the matching documents using a combination of SOLR's IR
>>>>>>>>>>>> score with another application-specific score that we store within the
>>>>>>>>>>>> documents themselves (i.e. a float field containing the app-specific
>>>>>>>>>>>> score). In particular, we'd like to calculate the final score doing some
>>>>>>>>>>>> operations with both numbers (e.g. product, sqrt, ...).
>>>>>>>>>>>>
>>>>>>>>>>>> According to what we know, there are two ways to do this in SOLR:
>>>>>>>>>>>>
>>>>>>>>>>>> A) Sort by function [1]: We've tested an expression like
>>>>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where score is the
>>>>>>>>>>>> common SOLR IR score and query_score is our own precalculated score, but it
>>>>>>>>>>>> seems that SOLR can only do this with stored/indexed fields (and obviously
>>>>>>>>>>>> "score" is not stored/indexed).
>>>>>>>>>>>>
>>>>>>>>>>>> B) Function queries: We've used _val_ and function queries like max, sqrt
>>>>>>>>>>>> and query, and we've obtained the desired results from a functional point
>>>>>>>>>>>> of view. However, our index is quite large (400M documents) and the
>>>>>>>>>>>> performance degrades heavily, given that function queries are AFAIK
>>>>>>>>>>>> matching all the documents.
>>>>>>>>>>>>
>>>>>>>>>>>> I have two questions:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Apart from the two options I mentioned, is there any other (simple) way
>>>>>>>>>>>> to achieve this that we're not aware of?
>>>>>>>>>>>>
>>>>>>>>>>>> 2) If we have to choose the function queries path, would it be very
>>>>>>>>>>>> difficult to modify the actual implementation so that it doesn't match all
>>>>>>>>>>>> the documents, that is, to pass a query so that it only operates over the
>>>>>>>>>>>> documents matching the query? Looking at the FunctionQuery.java source
>>>>>>>>>>>> code, there's a comment that says "// instead of matching all docs, we
>>>>>>>>>>>> could also embed a query. the score could either ignore the subscore, or
>>>>>>>>>>>> boost it", which is giving us some hope that maybe it's possible and even
>>>>>>>>>>>> desirable to go in this direction. If you can give us some directions about
>>>>>>>>>>>> how to go about this, we may be able to do the actual implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for your help.
>>>>>>>>>>>> Carlos
>>>>>>>>>>>>
>>>>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
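
A note on the iteration behaviour discussed in this thread. The sketch below is a toy Java model, not Lucene code. It assumes that, in a BooleanQuery with a MUST clause and a SHOULD FunctionQuery, the MUST clause drives the iteration and the function scorer is only positioned on matching documents through an advance/skipTo-style call, which is what Em expected and what Carlos's benchmark seems to suggest. All names here (LeapfrogSketch, MatchAllIterator, visitedStandalone, visitedWhenDriven) are hypothetical and exist only for illustration. The model simply counts how many doc ids a match-all iterator is positioned on in each case.

```java
import java.util.BitSet;

/** Toy model of doc-id iteration; hypothetical names, not Lucene classes. */
public class LeapfrogSketch {

    /** Mimics a match-all function scorer: every doc id below maxDoc is a hit. */
    static final class MatchAllIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int maxDoc;
        private int doc = -1;
        int visited = 0; // number of doc ids this iterator was positioned on

        MatchAllIterator(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        /** Step to the next doc id, as a standalone match-all query would. */
        int nextDoc() {
            return advance(doc + 1);
        }

        /** Jump straight to target without touching the ids in between. */
        int advance(int target) {
            if (target >= maxDoc) {
                return doc = NO_MORE_DOCS;
            }
            visited++;
            return doc = target;
        }
    }

    /** Function query on its own: nextDoc() is called until exhaustion. */
    static int visitedStandalone(int maxDoc) {
        MatchAllIterator it = new MatchAllIterator(maxDoc);
        while (it.nextDoc() != MatchAllIterator.NO_MORE_DOCS) {
            // a real scorer would compute the function value here
        }
        return it.visited;
    }

    /** MUST + SHOULD (assumed behaviour): MUST matches drive the iteration. */
    static int visitedWhenDriven(BitSet mustMatches, int maxDoc) {
        MatchAllIterator it = new MatchAllIterator(maxDoc);
        for (int d = mustMatches.nextSetBit(0); d >= 0; d = mustMatches.nextSetBit(d + 1)) {
            it.advance(d); // only matching docs are ever scored
        }
        return it.visited;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        BitSet must = new BitSet(maxDoc);
        for (int d = 0; d < maxDoc; d += 2_000) {
            must.set(d); // 500 documents match the MUST clause
        }
        System.out.println("standalone: " + visitedStandalone(maxDoc));
        System.out.println("driven by MUST: " + visitedWhenDriven(must, maxDoc));
    }
}
```

In this model the standalone count equals maxDoc, while the driven count equals the number of MUST matches. It does not attempt to explain the remaining 30-40% overhead reported above.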
