I was thinking along these lines:

1. Retrieve the top result for one query.
2. Take the resulting document and evaluate the score that it would get under another query.
3. If the scores are similar, then the queries most likely overlap.

For example, given two simple query strings, "archive crash" and "archiving failure", I could:

1. Use the query ?q="archive crash"&rows=1, which will return one result (if any).
2. Read the score of the returned document.
3. Read the value of the unique identifier field; let's say it has the field name 'URI' and the value '50d1c07b-a635-4f9a-a6eb-f2e3ebcb6b55'.
4. Use the query ?q="archiving failure"&qf=URI:50d1c07b-a635-4f9a-a6eb-f2e3ebcb6b55&rows=1
5. Read the score of the returned document (the document will be the same as the one returned in step 1, but the score will be different, evaluated against the second query).
6. Evaluate how similar the scores are.

My question about this approach is: is the score calculated in step 4 affected by the subquery, whose role is solely to select a specific result?

I'm using the dismax handler, by the way. Should I use the standard handler instead? Would it make a difference?

Thanks,
Gert.
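The six steps above could be sketched roughly as follows in Python. This only builds the request parameters and compares two scores that have already been read from the responses; it does not talk to Solr. The helper names (top_hit_params, rescoring_params, score_similarity) are hypothetical. Two things here are assumptions that differ from the text above: the second request restricts the result with fq (a filter query) instead of qf, because in dismax qf lists the fields to search, while a filter query only narrows the result set and does not contribute to the score; and fl=URI,score is added so that the score is returned. Lucene scores are not normalized across queries, so the ratio is a crude signal at best.

```python
from urllib.parse import urlencode


def top_hit_params(query):
    """Steps 1-3: ask for the single best hit, its URI and its score."""
    return urlencode({"q": query, "fl": "URI,score", "rows": 1})


def rescoring_params(query, uri):
    """Step 4: score one known document against a different query.

    Assumption: the document is selected with fq, which filters the
    result set without taking part in scoring.
    """
    return urlencode({
        "q": query,
        "fq": f'URI:"{uri}"',
        "fl": "URI,score",
        "rows": 1,
    })


def score_similarity(score_a, score_b):
    """Step 6: ratio of the smaller score to the larger one, in [0, 1]."""
    if score_a <= 0 or score_b <= 0:
        return 0.0
    return min(score_a, score_b) / max(score_a, score_b)
```

Usage would be to send top_hit_params('"archive crash"') to the select handler, read the URI and score from the response, send rescoring_params('"archiving failure"', uri), and pass both scores to score_similarity. Any threshold for "similar" would have to be tuned by experiment.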
________________________________
From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
Sent: Fri 4/23/2010 8:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Comparing two queries

Or, use facet.query to get the overlap. Here's the request:

?q=<query1>&facet=on&facet.query=<query2>

You'll get the hit count from query #1 in the results, and the count that overlaps with query #2 in the facet query response.

Erik - http://www.lucidimagination.com

On Apr 23, 2010, at 11:01 AM, Otis Gospodnetic wrote:

> Hello Gert,
>
> I think you'd have to apply custom heuristics that involve looking
> at the top N hits for each query and at the % overlap.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
>> From: "Villemos, Gert" <gert.ville...@logica.com>
>> To: solr-user@lucene.apache.org
>> Sent: Fri, April 23, 2010 10:20:54 AM
>> Subject: Comparing two queries
>>
>> We want to let a user register interest in information, based on
>> a query he has defined himself. For example, he types in a query,
>> presses a save button, provides his email, and the system will
>> then email him a daily digest.
>>
>> As part of this, it would be nice to be able to tell the user that
>> the same or a similar query is already being monitored by another
>> user, as the users will likely have the same interests. I would
>> therefore like to evaluate whether two queries will return
>> (almost) the same set of results.
>>
>> But how can I compare two queries to determine if they will return
>> (almost) the same set of results?
>>
>> Thanks,
>> Gert.
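Both suggestions in this thread, the facet.query count and the overlap of the top N hits, reduce to simple arithmetic once the responses are in hand. The sketch below is in Python and makes no network calls. The function names are hypothetical, rows=0 is an added assumption (only the counts are needed), and measuring top-N overlap as shared IDs over all distinct IDs is just one possible choice of heuristic.

```python
from urllib.parse import urlencode


def facet_overlap_params(query1, query2):
    """Request whose hit count is for query1 and whose facet.query
    count is the number of those hits that also match query2."""
    return urlencode({
        "q": query1,
        "rows": 0,  # assumption: only counts are needed, no documents
        "facet": "on",
        "facet.query": query2,
    })


def overlap_share(hits_query1, overlapping_hits):
    """Share of query1's hits that also match query2, in [0, 1]."""
    if hits_query1 <= 0:
        return 0.0
    return overlapping_hits / hits_query1


def top_n_overlap(ids_a, ids_b):
    """Shared document IDs divided by all distinct IDs in two top-N lists."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

overlap_share is not symmetric: running the request a second time with the two queries swapped gives the share from the other side, and two queries are close to equivalent only when both shares are high.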