Re: custom scoring

Em Mon, 20 Feb 2012 07:43:49 -0800

Hi Carlos,

> "query_score" is a field that is indexed and stored
> with every document.
Thanks for clarifying that, now the whole query-string makes more sense
to me.


Did you check whether query() - without product() and pow() - is also
much slower than a normal query?

I guess that if the performance decrease without product() and pow() is
not that large, you are hitting the small overhead that comes with every
function query.
It would be nice if you could check that.

However, let's take a step back and look at what you really want to
achieve instead of how you are trying to achieve it right now.

You want to influence the score of your actual query by a value that
combines some static values with a measure of how well the query matches
a document.

From your query, I can see that you are using the same fields in your
FunctionQuery and in your MainQuery (let's call the q-param
"MainQuery").
This means that the scores of your query() function and your MainQuery
should be identical.
Let's call this value just "score" and rename your field "query_score"
to "popularity".

I don't know how you are combining the FunctionQuery with the MainQuery
(boost by multiplication or boost by addition), but it seems clear to me
that your formula looks like this:

score <op> (score^0.5 * popularity), where <op> is some operator (+, *, ...)

Why don't you reduce it to

score * boost(log(popularity))?

This is a trade-off between precision and performance.
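
To get a feeling for how the two formulas above treat the same documents,
here is a small Java sketch. It assumes that the FunctionQuery score is
added to the main score (that is, <op> is +), which I have not verified
for your setup. The class name, method names and sample values are made
up for illustration only.

```java
import java.util.Locale;

final class ScoreFormulas {

    // Current formula. ASSUMPTION: the function score is added to the main
    // score, i.e. score + (score^0.5 * popularity).
    static double current(double score, double popularity) {
        return score + Math.sqrt(score) * popularity;
    }

    // Proposed reduction: score * log(popularity), natural logarithm.
    static double reduced(double score, double popularity) {
        return score * Math.log(popularity);
    }

    public static void main(String[] args) {
        // Hypothetical {score, popularity} pairs, not taken from a real index.
        double[][] samples = { {2.0, 1000.0}, {4.0, 10.0} };
        for (double[] s : samples) {
            System.out.println(String.format(Locale.ROOT,
                "score=%.1f popularity=%.1f current=%.3f reduced=%.3f",
                s[0], s[1], current(s[0], s[1]), reduced(s[0], s[1])));
        }
    }
}
```

Note that log(popularity) is 0 for popularity = 1 and negative below
that, so the reduced form only makes sense if popularity is always
greater than 1 (or is shifted accordingly).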

You could even improve on the above by setting the doc's boost equal to
log(popularity) at indexing time.
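
If you want to try the index-time variant, the following Java sketch
builds a Solr XML update message whose document boost is log(popularity).
It relies on the boost attribute of <doc> in the XML update format. The
class name, the use of "resulting_phrase" as the phrase field and the
sample values are my assumptions, and the sketch does no XML escaping.

```java
import java.util.Locale;

final class IndexTimeBoost {

    // Builds an <add> message with boost = ln(popularity).
    // ASSUMPTIONS: popularity > 1 (so the boost is positive), the phrase
    // contains no characters that need XML escaping, and the field names
    // match the schema.
    static String addMessage(String phrase, double popularity) {
        double boost = Math.log(popularity);
        return String.format(Locale.ROOT,
            "<add><doc boost=\"%.4f\">"
            + "<field name=\"resulting_phrase\">%s</field>"
            + "<field name=\"query_score\">%s</field>"
            + "</doc></add>",
            boost, phrase, popularity);
    }

    public static void main(String[] args) {
        // Hypothetical document, for illustration only.
        System.out.println(addMessage("hoteles en barcelona", 1000.0));
    }
}
```

As far as I know, an index-time boost is folded into the field norms,
which are stored with reduced precision, so this is again precision
traded for speed.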

What do you think about that?

Regards,
Em



On 20.02.2012 15:37, Carlos Gonzalez-Cadenas wrote:
> Hi Em:
> 
> The HTTP request is not gonna help you a lot because we use a custom
> QParser (that builds the query that I've pasted before). In any case, here
> it is:
> 
> http://localhost:8080/solr/core0/select?shards=…(shards
> here)…&indent=on&wt=exon&timeAllowed=50&fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlighting&start=0&rows=16&limit=20&q=%7B!exonautocomplete%7Dhoteles<http://localhost:8080/solr/core0/select?shards=exp302%3A8983%2Fsolr%2Fcore0%2Cexp302%3A8983%2Fsolr%2Fcore1%2Cexp302%3A8983%2Fsolr%2Fcore2%2Cexp302%3A8983%2Fsolr%2Fcore3%2Cexp302%3A8983%2Fsolr%2Fcore4%2Cexp302%3A8983%2Fsolr%2Fcore5%2Cexp302%3A8983%2Fsolr%2Fcore6%2Cexp302%3A8983%2Fsolr%2Fcore7%2Cexp302%3A8983%2Fsolr%2Fcore8%2Cexp302%3A8983%2Fsolr%2Fcore9%2Cexp302%3A8983%2Fsolr%2Fcore10%2Cexp302%3A8983%2Fsolr%2Fcore11&sort=score%20desc%2C%20query_score%20desc&indent=on&wt=exon&timeAllowed=50&fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlighting&start=0&vrows=4&rows=16&limit=20&q=%7B!exonautocomplete%7DBARCELONA&gyvl7cn3>
> 
> We're implementing a query autocomplete system, therefore our Lucene
> documents are queries. "query_score" is a field that is indexed and stored
> with every document. It expresses how popular a given query is (i.e. common
> queries like "hotels in barcelona" have a bigger query_score than less
> common queries like "hotels in barcelona near the beach").
> 
> Let me know if you need something else.
> 
> Thanks,
> Carlos
> 
> 
> 
> 
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Mon, Feb 20, 2012 at 3:12 PM, Em <mailformailingli...@yahoo.de> wrote:
> 
>> Could you please provide me the original request (the HTTP-request)?
>> I am a little bit confused to what "query_score" refers.
>> As far as I can see it isn't a magic-value.
>>
>> Kind regards,
>> Em
>>
>> On 20.02.2012 14:05, Carlos Gonzalez-Cadenas wrote:
>>> Yeah Em, it helped a lot :)
>>>
>>> Here it is (for the user query "hoteles"):
>>>
>>> +(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
>>> wildcard_stopword_shortened_phrase:hoteles |
>>> wildcard_stopword_phrase:hoteles)
>>>
>>> product(pow(query((stopword_shortened_phrase:hoteles |
>>> stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
>>> wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))
>>>
>>> Thanks a lot for your help.
>>>
>>> Carlos
>>>
>>>
>>> On Mon, Feb 20, 2012 at 1:50 PM, Em <mailformailingli...@yahoo.de>
>> wrote:
>>>
>>>> Carlos,
>>>>
>>>> nice to hear that the approach helped you!
>>>>
>>>> Could you show us what your query request looks like after reworking?
>>>>
>>>> Regards,
>>>> Em
>>>>
>>>> On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
>>>>> Hello all:
>>>>>
>>>>> We've done some tests with Em's approach of putting a BooleanQuery
>>>>> in front of our user query, that is:
>>>>>
>>>>> BooleanQuery
>>>>>     must (DismaxQuery)
>>>>>     should (FunctionQuery)
>>>>>
>>>>> The FunctionQuery obtains the SOLR IR score by means of a
>>>>> QueryValueSource, then takes the square root of this value, and then
>>>>> multiplies it by our custom "query_score" float, which it pulls by
>>>>> means of a FieldCacheSource.
>>>>>
>>>>> In particular, we've proceeded in the following way:
>>>>>
>>>>>    - we've loaded the whole index into the page cache of the OS to
>>>>>    make sure we don't have disk IO problems that might affect the
>>>>>    benchmarks (our machine has enough memory to hold the whole index
>>>>>    in RAM)
>>>>>    - we've executed an out-of-benchmark query 10-20 times to make sure
>>>>>    that everything is jitted and that Lucene's FieldCache is properly
>>>>>    populated
>>>>>    - we've disabled all the caches (filter query cache, document cache,
>>>>>    query cache)
>>>>>    - we've executed 8 different user queries with and without
>>>>>    FunctionQueries, with early termination in both cases (our collector
>>>>>    stops after collecting 50 documents per shard)
>>>>>
>>>>> Em was correct, the query is much faster with the BooleanQuery in
>>>>> front, but it's still 30-40% slower than the query without
>>>>> FunctionQueries.
>>>>>
>>>>> Although one may think it's reasonable that the query response time
>>>>> increases because of the extra computations, we believe that the
>>>>> increase is too big, given that we're collecting just 500-600
>>>>> documents due to the early query termination techniques we currently
>>>>> use.
>>>>>
>>>>> Any ideas on how to make it faster?
>>>>>
>>>>> Thanks a lot,
>>>>> Carlos
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
>>>>> c...@experienceon.com> wrote:
>>>>>
>>>>>> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
>>>>>> some tests and will let you know soon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de>
>>>> wrote:
>>>>>>
>>>>>>> Hello Carlos,
>>>>>>>
>>>>>>> I think we misunderstood each other.
>>>>>>>
>>>>>>> As an example:
>>>>>>> BooleanQuery (
>>>>>>>  clauses: (
>>>>>>>     MustMatch(
>>>>>>>               DisjunctionMaxQuery(
>>>>>>>                   TermQuery("stopword_field", "barcelona"),
>>>>>>>                   TermQuery("stopword_field", "hoteles")
>>>>>>>               )
>>>>>>>     ),
>>>>>>>     ShouldMatch(
>>>>>>>                  FunctionQuery(
>>>>>>>                    *please insert your function here*
>>>>>>>                 )
>>>>>>>     )
>>>>>>>  )
>>>>>>> )
>>>>>>>
>>>>>>> Explanation:
>>>>>>> You construct an artificial BooleanQuery which wraps your user's
>>>>>>> query as well as your function query.
>>>>>>> Your user's query - in that case - is just a DisjunctionMaxQuery
>>>>>>> consisting of two TermQueries.
>>>>>>> In the real world you might construct another BooleanQuery around
>>>>>>> your DisjunctionMaxQuery in order to have more flexibility.
>>>>>>> However, the interesting part of the given example is that we
>>>>>>> specify the user's query as a MustMatch condition of the
>>>>>>> BooleanQuery and the FunctionQuery just as a ShouldMatch.
>>>>>>> Constructed that way, I expect the FunctionQuery to score only
>>>>>>> those documents which fit the MustMatch condition.
>>>>>>>
>>>>>>> I conclude that from the fact that the FunctionQuery class also has
>>>>>>> a skipTo method, and I would expect that the scorer will use it to
>>>>>>> score only matching documents (however, I did not search where and
>>>>>>> how it might get called).
>>>>>>>
>>>>>>> If my conclusion is wrong, then hopefully Robert Muir (as far as I
>>>>>>> can see the author of that class) can tell us what the intention was
>>>>>>> in constructing an every-time-match-all function query.
>>>>>>>
>>>>>>> Can you validate whether your QueryParser constructs a query in the
>>>>>>> form I drew above?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Em
>>>>>>>
>>>>>>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
>>>>>>>> Hello Em:
>>>>>>>>
>>>>>>>> 1) Here's a printout of an example DisMax query (as you can see,
>>>>>>>> mostly MUST terms except for some SHOULD terms used for boosting
>>>>>>>> scores for stopwords):
>>>>>>>>
>>>>>>>> ((+stopword_shortened_phrase:hoteles
>>>>>>>> +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona
>>>>>>>> stopword_phrase:en)
>>>>>>>> | (+stopword_shortened_phrase:hoteles
>>>>>>>> +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>>>>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona
>>>>>>>> stopword_phrase:en)
>>>>>>>> | (+stopword_shortened_phrase:hoteles
>>>>>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>>>>>> stopword_shortened_phrase:en)
>>>>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
>>>>>>>> stopword_phrase:en)
>>>>>>>> | (+stopword_shortened_phrase:hoteles
>>>>>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>>>>>> stopword_shortened_phrase:en)
>>>>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
>>>>>>>> stopword_phrase:en))
>>>>>>>>
>>>>>>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
>>>>>>>> TimeLimitingCollector). We trigger it through the SOLR interface by
>>>>>>>> passing the timeAllowed parameter. We know this is a hack, but AFAIK
>>>>>>>> there's no out-of-the-box way to specify custom collectors for now
>>>>>>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case the
>>>>>>>> collector part works perfectly as of now, so clearly this is not the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> 3) Re: your sentence:
>>>>>>>>
>>>>>>>> "*I* would expect that with a shrinking set of matching documents to
>>>>>>>> the overall-query, the function query only checks those documents
>>>>>>>> that are guaranteed to be within the result set."
>>>>>>>>
>>>>>>>> Yes, I agree with this, but this snippet of code in
>>>>>>>> FunctionQuery.java seems to say otherwise:
>>>>>>>>
>>>>>>>>     // instead of matching all docs, we could also embed a query.
>>>>>>>>     // the score could either ignore the subscore, or boost it.
>>>>>>>>     // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>>>>>>>>     // Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
>>>>>>>>     @Override
>>>>>>>>     public int nextDoc() throws IOException {
>>>>>>>>       for(;;) {
>>>>>>>>         ++doc;
>>>>>>>>         if (doc>=maxDoc) {
>>>>>>>>           return doc=NO_MORE_DOCS;
>>>>>>>>         }
>>>>>>>>         if (acceptDocs != null && !acceptDocs.get(doc)) continue;
>>>>>>>>         return doc;
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>
>>>>>>>> It seems that the author also thought of maybe embedding a query in
>>>>>>>> order to restrict matches, but this doesn't seem to be in place as
>>>>>>>> of now (or maybe I'm not understanding how the whole thing works :) ).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Carlos
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Carlos,
>>>>>>>>>
>>>>>>>>>> We have some more tests on that matter: now we're moving from
>>>>>>>>>> issuing this large query through the SOLR interface to creating
>>>>>>>>>> our own QueryParser. The initial tests we've done in our QParser
>>>>>>>>>> (that internally creates multiple queries and inserts them inside
>>>>>>>>>> a DisjunctionMaxQuery) are very good, we're getting very good
>>>>>>>>>> response times and high quality answers. But when we've tried to
>>>>>>>>>> wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
>>>>>>>>>> from 10-20 msec to 200-300msec.
>>>>>>>>> I reviewed the source code and yes, the FunctionQuery iterates
>>>>>>>>> over the whole index, however... let's see!
>>>>>>>>>
>>>>>>>>> In relation to the DisMaxQuery you create within your parser: What
>>>>>>>>> kind of clause is the FunctionQuery and what kind of clause are
>>>>>>>>> your other queries (MUST, SHOULD, MUST_NOT...)?
>>>>>>>>>
>>>>>>>>> *I* would expect that with a shrinking set of matching documents to
>>>>>>>>> the overall-query, the function query only checks those documents
>>>>>>>>> that are guaranteed to be within the result set.
>>>>>>>>>
>>>>>>>>>> Note that we're using early termination of queries (via a custom
>>>>>>>>>> collector), and therefore (as shown by the numbers I included
>>>>>>>>>> above) even if the query is very complex, we're getting very fast
>>>>>>>>>> answers. The only situation where the response time explodes is
>>>>>>>>>> when we include a FunctionQuery.
>>>>>>>>> Could you give us some details about how/where you plugged in the
>>>>>>>>> Collector, please?
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>> Em
>>>>>>>>>
>>>>>>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>> Hello Em:
>>>>>>>>>>
>>>>>>>>>> Thanks for your answer.
>>>>>>>>>>
>>>>>>>>>> Yes, we initially also thought that the excessive increase in
>>>>>>>>>> response time was caused by the several queries being executed,
>>>>>>>>>> and we did another test. We executed one of the subqueries that
>>>>>>>>>> I've shown to you directly in the "q" parameter and then we tested
>>>>>>>>>> this same subquery (only this one, without the others) with the
>>>>>>>>>> function query "query($q1)" in the "q" parameter.
>>>>>>>>>>
>>>>>>>>>> Theoretically the times for these two queries should be more or
>>>>>>>>>> less the same, but the second one is several times slower than the
>>>>>>>>>> first one. After this observation we learned more about function
>>>>>>>>>> queries, and we learned from the code and from some comments in
>>>>>>>>>> the forums [1] that FunctionQueries are expected to match all
>>>>>>>>>> documents.
>>>>>>>>>>
>>>>>>>>>> We have some more tests on that matter: now we're moving from
>>>>>>>>>> issuing this large query through the SOLR interface to creating
>>>>>>>>>> our own QueryParser. The initial tests we've done in our QParser
>>>>>>>>>> (that internally creates multiple queries and inserts them inside
>>>>>>>>>> a DisjunctionMaxQuery) are very good, we're getting very good
>>>>>>>>>> response times and high quality answers. But when we've tried to
>>>>>>>>>> wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
>>>>>>>>>> from 10-20 msec to 200-300msec.
>>>>>>>>>>
>>>>>>>>>> Note that we're using early termination of queries (via a custom
>>>>>>>>>> collector), and therefore (as shown by the numbers I included
>>>>>>>>>> above) even if the query is very complex, we're getting very fast
>>>>>>>>>> answers. The only situation where the response time explodes is
>>>>>>>>>> when we include a FunctionQuery.
>>>>>>>>>>
>>>>>>>>>> Re: your question of what we're trying to achieve ... We're
>>>>>>>>>> implementing a powerful query autocomplete system, and we use
>>>>>>>>>> several fields to a) improve performance on wildcard queries and
>>>>>>>>>> b) have very precise control over the score.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your help,
>>>>>>>>>> Carlos
>>>>>>>>>>
>>>>>>>>>> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de
>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Carlos,
>>>>>>>>>>>
>>>>>>>>>>> well, you must take into account that you are executing up to 8
>>>>>>>>>>> queries per request instead of one query per request.
>>>>>>>>>>>
>>>>>>>>>>> I am not totally sure about the details of the implementation of
>>>>>>>>>>> the max-function-query, but I guess it first iterates over the
>>>>>>>>>>> results of the first max-query, afterwards over the results of
>>>>>>>>>>> the second max-query and so on. This is a much higher complexity
>>>>>>>>>>> than in the case of a normal query.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that you optimize your request. I don't think that
>>>>>>>>>>> this particular function query is matching *all* docs. Instead I
>>>>>>>>>>> think it just matches those docs specified by your inner-query
>>>>>>>>>>> (although I might be wrong about that).
>>>>>>>>>>>
>>>>>>>>>>> What are you trying to achieve by your request?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Em
>>>>>>>>>>>
>>>>>>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>>>> Hello Em:
>>>>>>>>>>>>
>>>>>>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I
>>>>>>>>>>>> paste the relevant parts.
>>>>>>>>>>>>
>>>>>>>>>>>> Our "q" parameter is:
>>>>>>>>>>>>
>>>>>>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
>>>>>>>>>>>>
>>>>>>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>>>>>>>>>>
>>>>>>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
>>>>>>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
>>>>>>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
>>>>>>>>>>>>
>>>>>>>>>>>> We've executed the subqueries q3-q8 independently and they're
>>>>>>>>>>>> very fast, but when we introduce the function queries as
>>>>>>>>>>>> described below, it all goes 10X slower.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know if you need anything else.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Carlos
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <
>> mailformailingli...@yahoo.de
>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello carlos,
>>>>>>>>>>>>>
>>>>>>>>>>>>> could you show us what your Solr call looks like?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Em
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>>>>>> Hello all:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We'd like to score the matching documents using a combination
>>>>>>>>>>>>>> of SOLR's IR score with another application-specific score
>>>>>>>>>>>>>> that we store within the documents themselves (i.e. a float
>>>>>>>>>>>>>> field containing the app-specific score). In particular, we'd
>>>>>>>>>>>>>> like to calculate the final score doing some operations with
>>>>>>>>>>>>>> both numbers (i.e. product, sqrt, ...).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> According to what we know, there are two ways to do this in
>>>>>>>>>>>>>> SOLR:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A) Sort by function [1]: We've tested an expression like
>>>>>>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where
>>>>>>>>>>>>>> score is the common SOLR IR score and query_score is our own
>>>>>>>>>>>>>> precalculated score, but it seems that SOLR can only do this
>>>>>>>>>>>>>> with stored/indexed fields (and obviously "score" is not
>>>>>>>>>>>>>> stored/indexed).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> B) Function queries: We've used _val_ and function queries
>>>>>>>>>>>>>> like max, sqrt and query, and we've obtained the desired
>>>>>>>>>>>>>> results from a functional point of view. However, our index
>>>>>>>>>>>>>> is quite large (400M documents) and the performance degrades
>>>>>>>>>>>>>> heavily, given that function queries are AFAIK matching all
>>>>>>>>>>>>>> the documents.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have two questions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Apart from the two options I mentioned, is there any other
>>>>>>>>>>>>>> (simple) way to achieve this that we're not aware of?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) If we have to choose the function queries path, would it be
>>>>>>>>>>>>>> very difficult to modify the actual implementation so that it
>>>>>>>>>>>>>> doesn't match all the documents, that is, to pass a query so
>>>>>>>>>>>>>> that it only operates over the documents matching the query?
>>>>>>>>>>>>>> Looking at the FunctionQuery.java source code, there's a
>>>>>>>>>>>>>> comment that says "// instead of matching all docs, we could
>>>>>>>>>>>>>> also embed a query. the score could either ignore the
>>>>>>>>>>>>>> subscore, or boost it", which is giving us some hope that
>>>>>>>>>>>>>> maybe it's possible and even desirable to go in this
>>>>>>>>>>>>>> direction. If you can give us some directions about how to go
>>>>>>>>>>>>>> about this, we may be able to do the actual implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, we're using Lucene/SOLR trunk.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for your help.
>>>>>>>>>>>>>> Carlos
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>>>>>>>>>>>>>>
