Great question. The set could be in the millions. I oversimplified the use case somewhat to protect the innocent :-}. A user is querying a large set of documents (for the sake of argument let's say high tens of millions, though it could be in the small billions) and wants to mark a result set, or a subset of those docs, with a label/tag and use that label/tag later. Now let's throw in that it's a multi-tenant system and we don't want to keep re-indexing documents to add these tags. Really what I want to do is execute a query that filters by this labeled set: the server fetches the labeled set out of local cache, over the wire, or off disk, and then incorporates it by one means or another as a filter (a docset, or a hashtable consulted in the hit collector). Something like the sketch below is what I'm picturing.
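(Untested and written from memory against the 3.x-era Solr APIs, so treat the class names as approximate; the "myid" field and the LabelSetQuery name are just illustrative. I believe the PostFilter / DelegatingCollector hooks added in SOLR-2429 give you roughly this.)

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Hypothetical post filter: keeps only hits whose "myid" value is in the
// labeled set. Assumes "myid" is a single-valued, indexed string field.
public class LabelSetQuery extends ExtendedQueryBase implements PostFilter {
  private final Set<String> labeledIds; // fetched from cache / disk / over the wire

  public LabelSetQuery(Set<String> labeledIds) {
    this.labeledIds = labeledIds;
    setCache(false); // don't run this through the filterCache
    setCost(100);    // cost >= 100 makes Solr run it as a post filter
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private String[] ids; // per-segment FieldCache values for "myid"

      @Override
      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        ids = FieldCache.DEFAULT.getStrings(reader, "myid");
        super.setNextReader(reader, docBase);
      }

      @Override
      public void collect(int doc) throws IOException {
        String id = ids[doc]; // doc is segment-relative here
        if (id != null && labeledIds.contains(id)) {
          super.collect(doc); // pass only labeled docs down the chain
        }
      }
    };
  }
}

You'd expose it through a small QParserPlugin so a request can say fq={!labelset tag=...}, with the plugin resolving the tag to the set of IDs from wherever it lives.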
Personally I think the dictionary approach wouldn't be a good one. It may produce the most efficient filter, but constructing the OpenBitSet will cost a bunch. In a prior company I built a more generic version of this, for not only filtering but also sorting, aggregate stats, etc. We didn't use Solr. I was curious whether there is any methodology for plugging in such a scheme without taking a branch of Solr and hacking at it.

That was a multi-tenant system of about 7B documents, where we were producing aggregate graphs, filtering, and ranking by things such as entity-level sentiment, so we built a rather generic solution that, as you pointed out, reinvented some things that smell similar. Users were able to override these "features" at the document level, which was necessary so their counts, sorts, etc. worked correctly. Given how long it took me to build and debug that, if I can take something close to off-the-shelf... well, you know the rest of the story :-}

C

On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote:

> I guess my question is "what advantage are you trying
> to get here?"
>
> At the start, this feels like an "XY" problem. How are
> you intending to use the fq after you've built it? Because
> if there's any way to just create an "fq" clause, Solr
> will take care of it for you: caching it, autowarming
> it when searchers are re-opened, etc. Otherwise, it seems
> to me you're going to be re-inventing a bunch of stuff;
> you'll have to intercept the queries coming in in order
> to apply the filter from the cache, etc.
>
> Which may be another way of asking "How big
> is this set of document IDs?" If it's in the 100s, I'd
> just go with an fq. If it's more than that, I'd index
> some kind of set identifier that you could create for
> your fqs.
>
> And if this is gibberish, ignore me <G>.
>
> Best
> Erick
>
> On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins <ch...@geekychris.com> wrote:
>> Hi, I am a long-time Lucene user but new to Solr. I would like to use
>> something like the filterCache, but build such a cache not from a query
>> but from custom code. I'll ask my question using techniques and vocabulary
>> I am familiar with. I'm not sure it's actually the right way, so I
>> apologize if it's just the wrong approach.
>>
>> The scenario is that I would like to filter a result set by a set of
>> labeled documents; I will call that set L.
>> L contains app-specific document IDs that are indexed as literals in the
>> Lucene field "myid".
>> I imagine I could build an OpenBitSet by enumerating the term docs and
>> looking for the IDs that intersect my label set.
>> Then I'd have a bit set that I assume I could use in a filter.
>>
>> Another approach would be to implement a hit collector, compute a
>> FieldCache from that "myid" field, and look for the intersection in a
>> hashtable of L at scoring time, throwing out results that are not
>> contained in the hashtable.
>>
>> Of course I am working within the confines / concepts that Solr has laid
>> out. Without going completely off the reservation, is there a neat way of
>> doing such a thing with Solr?
>>
>> Glad to clarify if my question makes absolutely no sense.
>>
>> Best
>>
>> C
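P.S. For completeness, here's roughly what I meant in my original mail by building an OpenBitSet from the term docs (again a rough, untested sketch, this time against the plain Lucene 3.x API; "myid" and the helper name are just illustrative):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Build a bit set of the documents whose "myid" term is in the label set L.
// One term seek per labeled ID; cheap per ID, but it adds up for large L.
public static OpenBitSet buildLabelBitSet(IndexReader reader, Set<String> labelSet)
    throws IOException {
  OpenBitSet bits = new OpenBitSet(reader.maxDoc());
  TermDocs termDocs = reader.termDocs();
  try {
    for (String id : labelSet) {
      termDocs.seek(new Term("myid", id));
      while (termDocs.next()) {
        bits.set(termDocs.doc()); // mark every doc carrying this id
      }
    }
  } finally {
    termDocs.close();
  }
  return bits;
}

One wrinkle if you wrap this in a Filter: getDocIdSet is called once per segment reader, so you'd either build a bit set per segment like this or offset by each segment's docBase.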