On Fri, Mar 16, 2012 at 8:38 PM, Carlos Gonzalez-Cadenas < c...@experienceon.com> wrote:
> On Fri, Mar 16, 2012 at 9:26 AM, Mikhail Khludnev < > mkhlud...@griddynamics.com> wrote: > >> Hello Carlos, >> > >> so, search all terms with MUST first, you've got the best result in terms >> of precision and recall if you've got something. Otherwise you still have a >> lot of time. You need to drop one of the words or switch ones of them into >> SHOULD. > > > Agree, this is precisely what we're trying to do (the idea of having > multiple queries, from narrow to broad). My question was more of a > practical nature, that is, how can we do these queries without really > having to do independent SOLR queries. > Sorry I forget this question. I did a tiny DelegateRequestHandler which sequentially iterates through list of other request handlers until the first not empty results has been found. Every of slave RequestHandlers has own QParser params e.g. I search for conjunction, then apply spelling correction, and after that I search for disjunction. You can see from my "theory" that the total time is equal to the time of the last successful search. > Now we use DisjunctionMaxQuery, but it has the problems that I described > in my former email w.r.t. early-termination. > > This morning we found two potential directions that might work (we're > testing them as of now): > > 1. Implement a custom RequestHandler and execute several queries > within SOLR (https://issues.apache.org/jira/browse/SOLR-1093). This is > better than executing them from outside and having all the network / HTTP / > ... overhead, but still not very good. > 2. Modify DisjunctionMaxQuery. In particular, modifying > DisjunctionMaxScorer > so that it doesn't use a min heap for the subscorers. We'll try several > strategies to collect documents from the child subscorers, like round-robin > or collecting the narrower subscorers first and then go broader until the > upstream collector stops the collection. This looks like the most > interesting option. > > > Enumerating all combinations is NPcomplete task I believe. But you have a >> good heuristics: >> * zero docFreq means that you can drop this term off or pass it through >> spell correction >> * if you have a instant suggest like app and has zero result for some >> phrase, maybe dropping the last word gives you the phrase which had some >> results before, and present in cache. >> * otherwise excluding less frequent term from conjunction probably gives >> non-zero results >> > > This is not a problem in practice. We're using a bunch of heuristics in > our QueryParser (including a lot of info extracted from the TermsEnum, > stopword lists, etc ...) to severely cut the space. > > Thanks > Carlos > > > >> >> Regards >> >> >> On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas < >> c...@experienceon.com> wrote: >> >>> Hello all, >>> >>> We have a SOLR index filled with user queries and we want to retrieve the >>> ones that are more similar to a given query entered by an end-user. It is >>> kind of a "related queries" system. >>> >>> The index is pretty big and we're using early-termination of queries >>> (with >>> the index sorted so that the "more popular" queries have lower docids and >>> therefore the termination yields higher-quality results) >>> >>> Clearly, when the user enters a user-level query into the search box, >>> i.e. >>> "cheap hotels barcelona offers", we don't know whether there exists a >>> document (query) in the index that contains these four words or not. >>> Therefore, when we're building the SOLR query, the first intuition would >>> be to do a query like this "cheap OR hotels OR barcelona OR offers". >>> >>> If all the documents in the index where evaluated, the results of this >>> query would be good. For example, if there is no query in the index with >>> these four words but there's a query in the index with the text "cheap >>> hotels barcelona", it will probably be one of the top results, which is >>> precisely what we want. >>> >>> The problem is that we're doing early termination and therefore this >>> query >>> will exhaust very fast the top-K result limit (our custom collector >>> limits >>> on the number of evaluated documents), given that queries like "hotels in >>> madrid" or "hotels in NYC" will match the OR expression described above >>> (because they all match "hotels"). >>> >>> Our next step was to think in a DisjunctionMaxQuery, trying to write a >>> query like this: >>> >>> DisjunctionMaxQuery: >>> 1) +cheap +hotels +barcelona +offers >>> 2) +cheap +hotels +barcelona >>> 3) +cheap +hotels >>> 4) +hotels >>> >>> We were thinking that perhaps the sub-queries within the >>> DisjunctionMaxQuery were going to get evaluated in "parallel" given that >>> they're separated queries, but in fact from a runtime perspective it does >>> behave in a similar way than the OR query that we described above. >>> >>> Our desired behavior is to try match documents with each subquery within >>> the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then >>> score them and return them all together (therefore we don't want all the >>> matches being done by a single sub-query, like it's happening now). >>> >>> Clearly, we could create a script external to SOLR that just runs the >>> several sub-queries as standalone queries and then joins all the results >>> together, but before going for this we'd like to know if you have any >>> ideas >>> on how to solve this problem within SOLR. We do have our own QParser, and >>> therefore we'd be able to implement any arbitrary query construction that >>> you can come up with, or even create a new Query type if it's needed. >>> >>> Thanks a lot for your help, >>> Carlos >>> >>> >>> Carlos Gonzalez-Cadenas >>> CEO, ExperienceOn - New generation search >>> http://www.experienceon.com >>> >>> Mobile: +34 652 911 201 >>> Skype: carlosgonzalezcadenas >>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas >>> >> >> >> >> -- >> Sincerely yours >> Mikhail Khludnev >> Lucid Certified >> Apache Lucene/Solr Developer >> Grid Dynamics >> >> <http://www.griddynamics.com> >> <mkhlud...@griddynamics.com> >> >> > -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics <http://www.griddynamics.com> <mkhlud...@griddynamics.com>