Hi Andy,

It seems like a common type of operation, and I would also be curious what others think. My take on this is to create a compressed intbitset and send it as a query filter; the handler then decompresses/deserializes it and uses it as a filter query. We have already done experiments with intbitsets, and they are fast to send/receive.
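To make the idea concrete, here is a minimal stdlib-only sketch of that approach. It is an illustration, not the actual implementation: the intbitset library would replace the hand-rolled bitmap with its own compressed representation, and the function names and base64 transport encoding are assumptions for the example.

```python
import base64
import zlib

def ids_to_param(ids, max_id):
    """Client side: pack a set of integer IDs into a bitmap, compress it,
    and base64-encode it so it can travel as a single query parameter."""
    bitmap = bytearray((max_id // 8) + 1)
    for i in ids:
        bitmap[i // 8] |= 1 << (i % 8)
    return base64.urlsafe_b64encode(zlib.compress(bytes(bitmap))).decode("ascii")

def param_to_filter(param):
    """Server side: decode and decompress once per request; after that,
    each membership test is a constant-time bit lookup."""
    bitmap = zlib.decompress(base64.urlsafe_b64decode(param))
    return lambda i: bool(bitmap[i // 8] & (1 << (i % 8)))
```

On the Solr side the decoding would live in a custom request handler (or a post-filter) that tests each candidate document's FLRID against the bitmap instead of expanding the IDs into a huge boolean query.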
Look at page 20: http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component

It is not on my immediate list of tasks, but if you want to help, it can be done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester <a...@petdance.com> wrote:
> We've got an 11,000,000-document index. Most documents have a unique ID
> called "flrid", plus a different ID called "solrid" that is Solr's PK.
> For some searches, we need to be able to limit the searches to a subset
> of documents defined by a list of FLRID values. The list of FLRID values
> can change between every search, and it will be rare enough to call it
> "never" that any two searches will have the same set of FLRIDs to limit
> on.
>
> What we're doing right now is, roughly:
>
> q=title:dogs AND
>   (flrid:(123 125 139 .... 34823) OR
>    flrid:(34837 ... 59091) OR
>    ... OR
>    flrid:(101294813 ... 103049934))
>
> Each of those FQ parentheticals can be 1,000 FLRIDs strung together. We
> have to subgroup to get past Solr's limitations on the number of terms
> that can be ORed together.
>
> The problem with this approach (besides that it's clunky) is that it
> seems to perform at O(N^2) or so. With 1,000 FLRIDs, the search comes
> back in 50ms or so. If we have 10,000 FLRIDs, it comes back in
> 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want
> it to be on the order of 1000-2000ms at most in all cases up to 100,000
> FLRIDs.
>
> How can we do this better?
>
> Things we've tried or considered:
>
> * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.
>   No improvement.
> * Tried: Putting the FLRIDs into the fq instead of the q. No
>   improvement.
> * Considered: Dumping all the FLRIDs for a given search into another
>   core and doing a join between it and the main core, but if we do five
>   or ten searches per second, it seems like Solr would die from all the
>   commits. The set of FLRIDs is unique between searches, so there is no
>   reuse possible.
> * Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID
>   instead, so that Solr doesn't have to hit the documents in order to
>   translate FLRID->SolrID to do the matching.
>
> What we're hoping for:
>
> * An efficient way to pass a long set of IDs, or for Solr to be able to
>   pull them from the app's Oracle database.
> * Have Solr do big ORs as a set operation, not as (what we assume is)
>   naive one-at-a-time matching.
> * A way to create a match vector that gets passed to the query, because
>   strings of fqs in the query seem to be a suboptimal way to do it.
>
> I've searched SO and the web and found people asking about this type of
> situation a few times, but no answers that I see beyond what we're
> doing now.
>
> * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
> * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
> * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
> * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
>
> Thanks,
> Andy
>
> --
> Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
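For reference, the subgrouped query Andy describes can be generated mechanically; this sketch just codifies the current (admittedly clunky) approach, with the flrid field name and the 1,000-ID group size taken from the mail:

```python
def build_flrid_query(base_query, flrids, chunk_size=1000):
    """Build the subgrouped OR query described above: FLRIDs are split
    into groups of at most chunk_size terms to stay under Solr's
    per-clause limit, and the groups are then ORed together."""
    groups = [
        "flrid:(%s)" % " ".join(str(i) for i in flrids[n:n + chunk_size])
        for n in range(0, len(flrids), chunk_size)
    ]
    return "%s AND (%s)" % (base_query, " OR ".join(groups))
```

This is exactly the construction whose cost blows up with the ID count, which is why pushing the ID set down as a single compressed bitmap is attractive.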