We've got an 11,000,000-document index.  Most documents have a unique ID called 
"flrid", plus a different ID called "solrid" that is Solr's PK.  For some 
searches, we need to be able to limit the search to a subset of documents 
defined by a list of FLRID values.  The list of FLRID values can change between 
searches, and it will be rare enough to call it "never" for any two searches to 
have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

    q=title:dogs AND 
        (flrid:(123 125 139 .... 34823) OR 
         flrid:(34837 ... 59091) OR 
         ... OR 
         flrid:(101294813 ... 103049934))

Each of those flrid:(...) parentheticals can contain 1,000 FLRIDs strung 
together.  We have to subgroup like this to get past Solr's limit 
(maxBooleanClauses) on the number of terms that can be ORed together.
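
For concreteness, the query string above gets built by chunking the FLRID list, 
roughly like this minimal Python sketch (chunk size of 1,000 as described; the 
function name is just for illustration):

    # Sketch: AND the user's query with subgrouped flrid:(...) clauses.
    # Each group holds at most 1,000 FLRIDs so that no single OR group
    # exceeds Solr's limit on boolean clauses.
    def build_query(user_query, flrids, chunk_size=1000):
        groups = []
        for i in range(0, len(flrids), chunk_size):
            chunk = flrids[i:i + chunk_size]
            groups.append("flrid:(%s)" % " ".join(str(f) for f in chunk))
        return "%s AND (%s)" % (user_query, " OR ".join(groups))

    # e.g. q = build_query("title:dogs", list_of_100000_flrids)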

The problem with this approach (besides that it's clunky) is that it seems to 
perform at O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms or 
so.  With 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000 FLRIDs, that 
jumps up to about 75,000ms.  We want it to be on the order of 1,000-2,000ms at 
most in all cases up to 100,000 FLRIDs.

How can we do this better?

Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No 
improvement.
* Tried: Putting the FLRIDs into the fq instead of the q (see the sketch after 
this list).  No improvement.
* Considered: Dumping all the FLRIDs for a given search into another core and 
doing a join between it and the main core, but if we do five or ten searches 
per second, it seems like Solr would die from all the commits.  The set of 
FLRIDs is unique between searches, so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs in the app and then limiting on 
SolrID instead, so that Solr doesn't have to hit the documents to translate 
FLRID->SolrID in order to do the matching.
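
For reference, the fq variant mentioned above looks roughly like the sketch 
below.  It's a minimal example that assumes the Python requests library, a 
made-up Solr URL, and a placeholder FLRID list; it performed the same as 
putting the clauses into q.

    import requests

    flrids = [123, 125, 139]  # placeholder; in practice up to 100,000 IDs

    # Same subgrouped flrid:(...) clause as before, but passed as a filter
    # query (fq) instead of being ANDed into q.
    flrid_clause = " OR ".join(
        "flrid:(%s)" % " ".join(str(f) for f in flrids[i:i + 1000])
        for i in range(0, len(flrids), 1000)
    )

    resp = requests.get(
        "http://localhost:8983/solr/select",  # made-up Solr URL
        params={"q": "title:dogs", "fq": flrid_clause, "wt": "json"},
    )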

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull 
them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) naive 
one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings 
of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of 
situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks,
Andy

--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
