Pesky users. Life would be sooooo much easier if they'd just leave
devs alone <G>....


Right. Well, you can certainly create your own SearchComponent and attach your
custom filter at that point, though note that I'm skimping on the details here.

From left field, you might create a custom FunctionQuery that returns 0 for
excluded documents. Since that gets multiplied into the score, the resulting
score is 0. Returning 1 for docs that should be kept leaves their scores
unchanged.
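A toy sketch of that multiply-into-the-score trick. This is not the actual Solr FunctionQuery/ValueSource API, just the arithmetic it relies on; the class name and label set here are made up for illustration:

```java
import java.util.Set;

// Sketch: multiplying each document's score by 1.0 for kept docs and
// 0.0 for excluded docs zeroes out the exclusions without disturbing
// the ranking of the docs that remain.
public class ZeroOutFilter {
    // hypothetical label set of app-specific doc IDs to keep
    private final Set<String> keep;

    public ZeroOutFilter(Set<String> keep) {
        this.keep = keep;
    }

    // what such a custom FunctionQuery would effectively return per doc
    public float factor(String docId) {
        return keep.contains(docId) ? 1.0f : 0.0f;
    }

    public float adjustedScore(String docId, float rawScore) {
        return rawScore * factor(docId);
    }

    public static void main(String[] args) {
        ZeroOutFilter f = new ZeroOutFilter(Set.of("a", "c"));
        System.out.println(f.adjustedScore("a", 2.5f)); // kept: score unchanged
        System.out.println(f.adjustedScore("b", 2.5f)); // excluded: zeroed out
    }
}
```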

But other than that, I'll leave it to the folks in the code. Chris,
you there? <G>..

Best
Erick

On Wed, Apr 18, 2012 at 5:14 PM, Chris Collins <ch...@geekychris.com> wrote:
> Great question.
>
> The set could be in the millions.  I over-simplified the use case somewhat to 
> protect the innocent :-}.  If a user is querying a large set of documents 
> (for the sake of argument let's say it's high tens of millions, but it could 
> be in the small billions), they want to potentially mark a result set or 
> subset of those docs with a label/tag and use that label/tag later. Now let's 
> throw in that it's a multi-tenant system and we don't want to keep re-indexing 
> documents to add these tags.  Really what I would want to do is execute a 
> query filtering by this labeled set: the server fetches the labeled set out 
> of local cache, or over the wire, or off disk, and then incorporates it by 
> one means or another as a filter (a docset, or a hashtable in the hit 
> collector).
>
> Personally I think the dictionary approach wouldn't be a good one.  It may 
> produce the most efficient filter mechanism, but it will cost a bunch to 
> construct the OpenBitSet.
>
> In a prior company I built a more generic version of this, for not only 
> filtering but for sorting, aggregate stats, etc.   We didn't use Solr.   I 
> was curious whether there was any methodology for plugging in such a scheme 
> without taking a branch of Solr and hacking at it.  That was a multi-tenant 
> system where we were producing aggregate graphs, filtering, and ranking by 
> things such as entity-level sentiment, so we produced a rather generic 
> solution that, as you pointed out, reinvented perhaps some things that 
> smell similar.  It was about 7B docs.  Users were able to override these 
> "features" on a document level, which was necessary so their counts, sorts, 
> etc. worked correctly.  Seeing how long it took me to build and debug it, if 
> I can take something close off the shelf.....well, you know the rest of the 
> story :-}
>
> C
>
>
> On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote:
>
>> I guess my question is "what advantage are you trying
>> to get here?"
>>
>> At the start, this feels like an "XY" problem. How are
>> you intending to use the fq after you've built it? Because
>> if there's any way to just create an "fq" clause, Solr
>> will take care of it for you: caching it, autowarming
>> it when searchers are re-opened, etc. Otherwise, it seems
>> to me you're going to be re-inventing a bunch of stuff;
>> you'll have to intercept the queries coming in in order
>> to apply the filter from the cache, and so on.
>>
>> Which also may be another way of asking "How big
>> is this set of document IDs?" If it's in the 100s, I'd
>> just go with an fq. If it's more than that, I'd index
>> some kind of set identifier that you could create for
>> your fqs.
>>
>> And if this is gibberish, ignore me <G>..
>>
>> Best
>> Erick
>>
>> On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins <ch...@geekychris.com> wrote:
>>> Hi, I am a long-time Lucene user but new to Solr.  I would like to use 
>>> something like the filterCache, but build such a cache not from a query 
>>> but from custom code.  I'll ask my question using techniques and vocab I 
>>> am familiar with; I'm not sure it's actually the right way, so I 
>>> apologize if it's just the wrong approach.
>>>
>>> The scenario is that I would like to filter a result set by a set of 
>>> labeled documents, which I will call L.
>>> L contains app-specific document IDs that are indexed as literals in the 
>>> Lucene field "myid".
>>> I imagine I could build an OpenBitSet by enumerating the term docs 
>>> and looking for the intersecting IDs in my label set.
>>> Now I have my bitset, which I assume I could use in a filter.
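A minimal sketch of that bitset construction, using java.util.BitSet in place of Lucene's OpenBitSet and a plain map standing in for the TermDocs enumeration of the "myid" field; the names here are illustrative, not Lucene or Solr APIs:

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Walk the indexed "myid" terms; for every term that appears in the
// label set L, set the bits of the internal doc numbers it maps to.
// The resulting BitSet plays the role of the filter's OpenBitSet.
public class LabelFilterBuilder {
    public static BitSet build(Map<String, List<Integer>> myidTermDocs,
                               Set<String> labelSet,
                               int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, List<Integer>> e : myidTermDocs.entrySet()) {
            if (labelSet.contains(e.getKey())) {   // this "myid" term is in L
                for (int doc : e.getValue()) {
                    bits.set(doc);                 // mark the matching doc
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet b = build(Map.of("x", List.of(0, 2), "y", List.of(1)),
                         Set.of("x"), 4);
        System.out.println(b); // bits set only for docs carrying a label in L
    }
}
```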
>>>
>>> Another approach would be to implement a hit collector: compute a 
>>> field cache from that "myid" field and look for the intersection in a 
>>> hashtable of L at scoring time, throwing out results that are not 
>>> contained in the hashtable.
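A hedged sketch of that collector approach, with a plain string array standing in for the Lucene field cache of "myid" values (indexed by internal doc number) and a HashSet standing in for L; again, these are illustrative names, not the Lucene Collector API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// At collection time, look each hit's "myid" up in the cached array and
// keep the hit only if that id is in the label hashtable L.
public class LabelCollector {
    private final String[] myidCache;   // index: internal doc number
    private final Set<String> labelSet; // the label set L
    private final List<Integer> kept = new ArrayList<>();

    public LabelCollector(String[] myidCache, Set<String> labelSet) {
        this.myidCache = myidCache;
        this.labelSet = labelSet;
    }

    // called once per hit, in the spirit of Collector.collect(int doc)
    public void collect(int doc) {
        if (labelSet.contains(myidCache[doc])) {
            kept.add(doc);              // throw out anything not in L
        }
    }

    public List<Integer> kept() {
        return kept;
    }

    public static void main(String[] args) {
        LabelCollector c =
            new LabelCollector(new String[] {"a", "b", "a"}, Set.of("a"));
        c.collect(0);
        c.collect(1);
        c.collect(2);
        System.out.println(c.kept()); // only docs whose "myid" is in L survive
    }
}
```

Compared to the bitset approach, this trades the up-front cost of building the filter for a hashtable lookup on every collected hit.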
>>>
>>> Of course I am working within the confines/concepts that Solr has laid 
>>> out.  Without going completely off the reservation, is there a neat way 
>>> of doing such a thing with Solr?
>>>
>>> Glad to clarify if my question makes absolutely no sense.
>>>
>>> Best
>>>
>>> C
>>
>
