Great question. The set could be in the millions. I oversimplified the use case somewhat to protect the innocent :-}. A user is querying a large set of documents (for the sake of argument let's say high tens of millions, though it could be in the small billions) and wants to mark a result set, or a subset of those docs, with a label/tag and use that label/tag later. Now let's throw in that it's a multi-tenant system and we don't want to keep re-indexing documents to add these tags. Really what I want to do is execute a query that filters by this labeled set: the server fetches the labeled set out of local cache, over the wire, or off disk, and then incorporates it by one means or another as a filter (a docset, or a hashtable consulted in the hit collector). Something like the sketch below is what I'm picturing.
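(Untested and written from memory against the 3.x-era Solr APIs, so treat the class names as approximate; the "myid" field and the LabelSetQuery name are just illustrative. I believe the PostFilter / DelegatingCollector hooks added in SOLR-2429 give you roughly this.)

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Hypothetical post filter: keeps only hits whose "myid" value is in the
// labeled set. Assumes "myid" is a single-valued, indexed string field.
public class LabelSetQuery extends ExtendedQueryBase implements PostFilter {
  private final Set<String> labeledIds; // fetched from cache / disk / over the wire

  public LabelSetQuery(Set<String> labeledIds) {
    this.labeledIds = labeledIds;
    setCache(false); // don't run this through the filterCache
    setCost(100);    // cost >= 100 makes Solr run it as a post filter
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private String[] ids; // per-segment FieldCache values for "myid"

      @Override
      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        ids = FieldCache.DEFAULT.getStrings(reader, "myid");
        super.setNextReader(reader, docBase);
      }

      @Override
      public void collect(int doc) throws IOException {
        String id = ids[doc]; // doc is segment-relative here
        if (id != null && labeledIds.contains(id)) {
          super.collect(doc); // pass only labeled docs down the chain
        }
      }
    };
  }
}

You'd expose it through a small QParserPlugin so a request can say fq={!labelset tag=...}, with the plugin resolving the tag to the set of IDs from wherever it lives.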
Personally I think the dictionary approach wouldn't be a good one. It may produce the most efficient filter, but constructing the OpenBitSet will cost a bunch. In a prior company I built a more generic version of this, for not only filtering but also sorting, aggregate stats, etc. We didn't use Solr. I was curious whether there is any methodology for plugging in such a scheme without taking a branch of Solr and hacking at it.

That was a multi-tenant system of about 7B documents, where we were producing aggregate graphs, filtering, and ranking by things such as entity-level sentiment, so we built a rather generic solution that, as you pointed out, reinvented some things that smell similar. Users were able to override these "features" at the document level, which was necessary so their counts, sorts, etc. worked correctly. Given how long it took me to build and debug that, if I can take something close to off-the-shelf... well, you know the rest of the story :-}

C

On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote:

> I guess my question is "what advantage are you trying
> to get here?"
>
> At the start, this feels like an "XY" problem. How are
> you intending to use the fq after you've built it? Because
> if there's any way to just create an "fq" clause, Solr
> will take care of it for you: caching it, autowarming
> it when searchers are re-opened, etc. Otherwise, it seems
> to me you're going to be re-inventing a bunch of stuff;
> you'll have to intercept the queries coming in in order
> to apply the filter from the cache, etc.
>
> Which may be another way of asking "How big
> is this set of document IDs?" If it's in the 100s, I'd
> just go with an fq. If it's more than that, I'd index
> some kind of set identifier that you could create for
> your fqs.
>
> And if this is gibberish, ignore me <G>.
>
> Best
> Erick
>
> On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins <ch...@geekychris.com> wrote:
>> Hi, I am a long-time Lucene user but new to Solr. I would like to use
>> something like the filterCache, but build such a cache not from a query
>> but from custom code. I'll ask my question using techniques and vocabulary
>> I am familiar with. I'm not sure it's actually the right way, so I
>> apologize if it's just the wrong approach.
>>
>> The scenario is that I would like to filter a result set by a set of
>> labeled documents; I will call that set L.
>> L contains app-specific document IDs that are indexed as literals in the
>> Lucene field "myid".
>> I imagine I could build an OpenBitSet by enumerating the term docs and
>> looking for the IDs that intersect my label set.
>> Then I'd have a bit set that I assume I could use in a filter.
>>
>> Another approach would be to implement a hit collector, compute a
>> FieldCache from that "myid" field, and look for the intersection in a
>> hashtable of L at scoring time, throwing out results that are not
>> contained in the hashtable.
>>
>> Of course I am working within the confines / concepts that Solr has laid
>> out. Without going completely off the reservation, is there a neat way of
>> doing such a thing with Solr?
>>
>> Glad to clarify if my question makes absolutely no sense.
>>
>> Best
>>
>> C
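P.S. For completeness, here's roughly what I meant in my original mail by building an OpenBitSet from the term docs (again a rough, untested sketch, this time against the plain Lucene 3.x API; "myid" and the helper name are just illustrative):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Build a bit set of the documents whose "myid" term is in the label set L.
// One term seek per labeled ID; cheap per ID, but it adds up for large L.
public static OpenBitSet buildLabelBitSet(IndexReader reader, Set<String> labelSet)
    throws IOException {
  OpenBitSet bits = new OpenBitSet(reader.maxDoc());
  TermDocs termDocs = reader.termDocs();
  try {
    for (String id : labelSet) {
      termDocs.seek(new Term("myid", id));
      while (termDocs.next()) {
        bits.set(termDocs.doc()); // mark every doc carrying this id
      }
    }
  } finally {
    termDocs.close();
  }
  return bits;
}

One wrinkle if you wrap this in a Filter: getDocIdSet is called once per segment reader, so you'd either build a bit set per segment like this or offset by each segment's docBase.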