Thanks Em, Robert, Chris for your time and valuable advice. We'll run some tests and let you know soon.
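For concreteness while we run those tests: Em's suggested structure (quoted below) would look roughly like this inside our QParser. This is only a sketch against the Lucene/Solr trunk API of this period; class and package names (e.g. FunctionQuery, FloatFieldSource) have moved between packages across versions, and `query_score` is the stored score field mentioned later in the thread:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.FloatFieldSource;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ScoredUserQueryBuilder {
    /**
     * User query as MUST (restricts the match set); function query as
     * SHOULD (only contributes to the score of docs that already match).
     */
    public static Query build() {
        // The user's query: a DisjunctionMaxQuery over two TermQueries,
        // as in Em's example (tie-break factor 0.0f).
        DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
        userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
        userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

        BooleanQuery bq = new BooleanQuery();
        bq.add(userQuery, BooleanClause.Occur.MUST);
        bq.add(new FunctionQuery(new FloatFieldSource("query_score")),
               BooleanClause.Occur.SHOULD);
        return bq;
    }
}
```

(Not runnable without the Lucene jars, so treat it as pseudocode grounded in the real API rather than a tested implementation.)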
On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
> Hello Carlos,
>
> I think we misunderstood each other.
>
> As an example:
> BooleanQuery (
>   clauses: (
>     MustMatch(
>       DisjunctionMaxQuery(
>         TermQuery("stopword_field", "barcelona"),
>         TermQuery("stopword_field", "hoteles")
>       )
>     ),
>     ShouldMatch(
>       FunctionQuery(
>         *please insert your function here*
>       )
>     )
>   )
> )
>
> Explanation:
> You construct an artificial BooleanQuery which wraps your user's query
> as well as your function query.
> Your user's query - in that case - is just a DisjunctionMaxQuery
> consisting of two TermQueries.
> In the real world you might construct another BooleanQuery around your
> DisjunctionMaxQuery in order to have more flexibility.
> However, the interesting part of the given example is that we specify
> the user's query as a MustMatch condition of the BooleanQuery and the
> FunctionQuery just as a ShouldMatch.
> Constructed that way, I expect the FunctionQuery to score only those
> documents which fit the MustMatch condition.
>
> I conclude that from the fact that the FunctionQuery class also has a
> skipTo method, and I would expect the scorer to use it to score only
> matching documents (however, I did not check where and how it might
> get called).
>
> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
> see, the author of that class) can tell us what the intention was in
> constructing an every-time-match-all function query.
>
> Can you validate whether your QueryParser constructs a query in the form
> I drew above?
>
> Regards,
> Em
>
> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
> > Hello Em:
> >
> > 1) Here's a printout of an example DisMax query (as you can see, mostly
> > MUST terms, except for some SHOULD terms used for boosting scores for
> > stopwords):
> >
> > ((+stopword_shortened_phrase:hoteles
> > +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
> > (+stopword_phrase:hoteles +stopword_phrase:barcelona
> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> > +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
> > (+stopword_phrase:hoteles +stopword_phrase:barcelona
> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> > +wildcard_stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> > +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
> > (+stopword_shortened_phrase:hoteles
> > +wildcard_stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> > +wildcard_stopword_phrase:barcelona stopword_phrase:en))
> >
> > 2) The collector is inserted in the SolrIndexSearcher (replacing the
> > TimeLimitingCollector). We trigger it through the SOLR interface by
> > passing the timeAllowed parameter. We know this is a hack, but AFAIK
> > there's no out-of-the-box way to specify custom collectors yet
> > (https://issues.apache.org/jira/browse/SOLR-1680). In any case, the
> > collector part works perfectly as of now, so clearly this is not the
> > problem.
> >
> > 3) Re: your sentence:
> >
> > *I* would expect that with a shrinking set of matching documents to
> > the overall-query, the function query only checks those documents
> > that are guaranteed to be within the result set.
> >
> > Yes, I agree with this, but this snippet of code in FunctionQuery.java
> > seems to say otherwise:
> >
> > // instead of matching all docs, we could also embed a query.
> > // the score could either ignore the subscore, or boost it.
> > // Containment: floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> > // Boost: foo:myTerm^floatline("myFloatField",1.0,0.0f)
> > @Override
> > public int nextDoc() throws IOException {
> >   for (;;) {
> >     ++doc;
> >     if (doc >= maxDoc) {
> >       return doc = NO_MORE_DOCS;
> >     }
> >     if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> >     return doc;
> >   }
> > }
> >
> > It seems that the author also thought of maybe embedding a query in
> > order to restrict matches, but this doesn't seem to be in place as of
> > now (or maybe I'm not understanding how the whole thing works :) ).
> >
> > Thanks,
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> > On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
> >
> >> Hello Carlos,
> >>
> >>> We have some more tests on that matter: now we're moving from issuing
> >>> this large query through the SOLR interface to creating our own
> >>> QueryParser. The initial tests we've done in our QParser (that
> >>> internally creates multiple queries and inserts them inside a
> >>> DisjunctionMaxQuery) are very good; we're getting very good response
> >>> times and high-quality answers. But when we've tried to wrap the
> >>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>> QueryValueSource that wraps the DisMaxQuery), the times move from
> >>> 10-20 msec to 200-300 msec.
> >> I reviewed the source code and yes, the FunctionQuery iterates over
> >> the whole index, however... let's see!
> >>
> >> In relation to the DisMaxQuery you create within your parser: what
> >> kind of clause is the FunctionQuery, and what kind of clause are your
> >> other queries (MUST, SHOULD, MUST_NOT...)?
> >>
> >> *I* would expect that with a shrinking set of matching documents to
> >> the overall-query, the function query only checks those documents
> >> that are guaranteed to be within the result set.
> >>
> >>> Note that we're using early termination of queries (via a custom
> >>> collector), and therefore (as shown by the numbers I included above),
> >>> even if the query is very complex, we're getting very fast answers.
> >>> The only situation where the response time explodes is when we
> >>> include a FunctionQuery.
> >> Could you give us some details about how/where you plugged in the
> >> Collector, please?
> >>
> >> Kind regards,
> >> Em
> >>
> >> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
> >>> Hello Em:
> >>>
> >>> Thanks for your answer.
> >>>
> >>> Yes, we initially also thought that the excessive increase in
> >>> response time was caused by the several queries being executed, and
> >>> we did another test. We executed one of the subqueries that I've
> >>> shown to you directly in the "q" parameter, and then we tested this
> >>> same subquery (only this one, without the others) with the function
> >>> query "query($q1)" in the "q" parameter.
> >>>
> >>> Theoretically the times for these two queries should be more or less
> >>> the same, but the second one is several times slower than the first
> >>> one. After this observation we learned more about function queries,
> >>> and we learned from the code and from some comments in the forums [1]
> >>> that FunctionQueries are expected to match all documents.
> >>>
> >>> We have some more tests on that matter: now we're moving from issuing
> >>> this large query through the SOLR interface to creating our own
> >>> QueryParser. The initial tests we've done in our QParser (that
> >>> internally creates multiple queries and inserts them inside a
> >>> DisjunctionMaxQuery) are very good; we're getting very good response
> >>> times and high-quality answers. But when we've tried to wrap the
> >>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>> QueryValueSource that wraps the DisMaxQuery), the times move from
> >>> 10-20 msec to 200-300 msec.
> >>>
> >>> Note that we're using early termination of queries (via a custom
> >>> collector), and therefore (as shown by the numbers I included above),
> >>> even if the query is very complex, we're getting very fast answers.
> >>> The only situation where the response time explodes is when we
> >>> include a FunctionQuery.
> >>>
> >>> Re: your question of what we're trying to achieve... We're
> >>> implementing a powerful query autocomplete system, and we use
> >>> several fields to a) improve performance on wildcard queries and
> >>> b) have very precise control over the score.
> >>>
> >>> Thanks a lot for your help,
> >>> Carlos
> >>>
> >>> [1]:
> >>> http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> >>>
> >>> Carlos Gonzalez-Cadenas
> >>> CEO, ExperienceOn - New generation search
> >>> http://www.experienceon.com
> >>>
> >>> Mobile: +34 652 911 201
> >>> Skype: carlosgonzalezcadenas
> >>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >>>
> >>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de>
> >>> wrote:
> >>>
> >>>> Hello Carlos,
> >>>>
> >>>> Well, you must take into account that you are executing up to 8
> >>>> queries per request instead of one query per request.
> >>>>
> >>>> I am not totally sure about the details of the implementation of
> >>>> the max function query, but I guess it first iterates over the
> >>>> results of the first max-query, then over the results of the second
> >>>> max-query, and so on. This is a much higher complexity than in the
> >>>> case of a normal query.
> >>>>
> >>>> I would suggest that you optimize your request. I don't think that
> >>>> this particular function query is matching *all* docs.
> >>>> Instead, I think it just matches those docs specified by your
> >>>> inner-query (although I might be wrong about that).
> >>>>
> >>>> What are you trying to achieve with your request?
> >>>>
> >>>> Regards,
> >>>> Em
> >>>>
> >>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
> >>>>> Hello Em:
> >>>>>
> >>>>> The URL is quite large (w/ shards, ...), maybe it's best if I
> >>>>> paste the relevant parts.
> >>>>>
> >>>>> Our "q" parameter is:
> >>>>>
> >>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
> >>>>>
> >>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >>>>>
> >>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> >>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> >>>>> (stopword_phrase:las AND stopword_phrase:de)"
> >>>>>
> >>>>> We've executed the subqueries q3-q8 independently and they're very
> >>>>> fast, but when we introduce the function queries as described
> >>>>> below, it all goes 10X slower.
> >>>>>
> >>>>> Let me know if you need anything else.
> >>>>>
> >>>>> Thanks,
> >>>>> Carlos
> >>>>>
> >>>>> Carlos Gonzalez-Cadenas
> >>>>> CEO, ExperienceOn - New generation search
> >>>>> http://www.experienceon.com
> >>>>>
> >>>>> Mobile: +34 652 911 201
> >>>>> Skype: carlosgonzalezcadenas
> >>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >>>>>
> >>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello Carlos,
> >>>>>>
> >>>>>> Could you show us what your Solr call looks like?
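A side note on the nested `_val_` expression quoted above: Solr also ships a `{!boost}` query parser that multiplies a wrapped query's score by a function, and it scores only the documents matching the wrapped query. A sketch (the `qq` parameter name is illustrative, and the exact syntax should be verified against the Solr version in use):

```
q={!boost b=query_score v=$qq}
qq=+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en
```

This computes score = (IR score of qq) × query_score without iterating over the whole index, which is essentially the "containment" behavior the FunctionQuery comment hints at.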
> >>>>>>
> >>>>>> Regards,
> >>>>>> Em
> >>>>>>
> >>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
> >>>>>>> Hello all:
> >>>>>>>
> >>>>>>> We'd like to score the matching documents using a combination of
> >>>>>>> SOLR's IR score with another application-specific score that we
> >>>>>>> store within the documents themselves (i.e. a float field
> >>>>>>> containing the app-specific score). In particular, we'd like to
> >>>>>>> calculate the final score by doing some operations with both
> >>>>>>> numbers (i.e. product, sqrt, ...).
> >>>>>>>
> >>>>>>> According to what we know, there are two ways to do this in SOLR:
> >>>>>>>
> >>>>>>> A) Sort by function [1]: We've tested an expression like
> >>>>>>> "sort=product(score, query_score)" in the SOLR query, where
> >>>>>>> score is the common SOLR IR score and query_score is our own
> >>>>>>> precalculated score, but it seems that SOLR can only do this
> >>>>>>> with stored/indexed fields (and obviously "score" is not
> >>>>>>> stored/indexed).
> >>>>>>>
> >>>>>>> B) Function queries: We've used _val_ and function queries like
> >>>>>>> max, sqrt and query, and we've obtained the desired results from
> >>>>>>> a functional point of view. However, our index is quite large
> >>>>>>> (400M documents) and the performance degrades heavily, given
> >>>>>>> that function queries AFAIK match all the documents.
> >>>>>>>
> >>>>>>> I have two questions:
> >>>>>>>
> >>>>>>> 1) Apart from the two options I mentioned, is there any other
> >>>>>>> (simple) way to achieve this that we're not aware of?
> >>>>>>>
> >>>>>>> 2) If we have to choose the function queries path, would it be
> >>>>>>> very difficult to modify the actual implementation so that it
> >>>>>>> doesn't match all the documents, that is, to pass a query so
> >>>>>>> that it only operates over the documents matching the query?
> >>>>>>> Looking at the FunctionQuery.java source code, there's a comment
> >>>>>>> that says "// instead of matching all docs, we could also embed
> >>>>>>> a query. the score could either ignore the subscore, or boost
> >>>>>>> it", which is giving us some hope that maybe it's possible and
> >>>>>>> even desirable to go in this direction. If you can give us some
> >>>>>>> directions about how to go about this, we may be able to do the
> >>>>>>> actual implementation.
> >>>>>>>
> >>>>>>> BTW, we're using Lucene/SOLR trunk.
> >>>>>>>
> >>>>>>> Thanks a lot for your help.
> >>>>>>> Carlos
> >>>>>>>
> >>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
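The nextDoc() loop quoted in this thread is the crux of the performance question. A tiny self-contained model (invented names; deliberately not Lucene code) shows why a scorer written this way examines every docid up to maxDoc instead of skipping to matches:

```java
import java.util.BitSet;

/**
 * Minimal stand-in for the FunctionQuery scorer loop quoted in the thread.
 * It mimics nextDoc(): advance one docid at a time, skipping only docs
 * absent from acceptDocs (e.g. deleted docs). Every accepted doc "matches",
 * so the work done is proportional to maxDoc, not to the match count.
 * All names here are invented for illustration.
 */
public class AllDocsScorerModel {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;
    private final BitSet acceptDocs; // null means "accept everything"
    private int doc = -1;
    int docsVisited = 0;             // counts docids examined, i.e. work done

    AllDocsScorerModel(int maxDoc, BitSet acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    int nextDoc() {
        for (;;) {
            ++doc;
            if (doc >= maxDoc) {
                return doc = NO_MORE_DOCS;
            }
            docsVisited++;
            if (acceptDocs != null && !acceptDocs.get(doc)) continue;
            return doc;
        }
    }

    public static void main(String[] args) {
        // 10 docs in the "index", only 2 accepted (live).
        BitSet live = new BitSet();
        live.set(3);
        live.set(7);
        AllDocsScorerModel scorer = new AllDocsScorerModel(10, live);

        int matches = 0;
        while (scorer.nextDoc() != NO_MORE_DOCS) {
            matches++;
        }
        // Only 2 docs match, yet all 10 docids were examined.
        System.out.println("matches=" + matches + " visited=" + scorer.docsVisited);
    }
}
```

With a 400M-document index, this linear scan is exactly why wrapping a fast query in a FunctionQuery pushed response times from 10-20 msec to 200-300 msec, and why Em's MUST/SHOULD wrapping (which lets the mandatory clause drive iteration) is attractive.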