Ok, think I got it. The issue was that I can't modify the offset and start params when the search is a distributed one; otherwise the correct offset and max are lost. A simple check in prepare fixed this.
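
For anyone hitting the same thing, a minimal sketch of that kind of guard is below. It is not the actual component: it uses a plain Map in place of Solr's request params so it runs on its own, and the names PagingGuard, isDistributed, prepare and overFetchFactor are hypothetical. It assumes the component rewrites start/rows to over-fetch for post-filtering, and that a distributed request can be recognised by the `shards` param (top-level request) or `isShard=true` (per-shard sub-request); how a given SolrCloud build flags this may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class PagingGuard {

    // Assumption: "shards" marks the coordinating request and
    // "isShard=true" marks a per-shard sub-request.
    static boolean isDistributed(Map<String, String> params) {
        return "true".equals(params.get("isShard")) || params.containsKey("shards");
    }

    // Hypothetical stand-in for the paging logic in prepare():
    // over-fetch only when the request is NOT distributed.
    static Map<String, String> prepare(Map<String, String> params, int overFetchFactor) {
        Map<String, String> out = new HashMap<>(params);
        if (isDistributed(params)) {
            // Leave start/rows alone so the merge step keeps the
            // offsets and counts it asked the shards for.
            return out;
        }
        int start = Integer.parseInt(params.getOrDefault("start", "0"));
        int rows = Integer.parseInt(params.getOrDefault("rows", "10"));
        // Fetch from the top so there is room to drop filtered documents.
        out.put("start", "0");
        out.put("rows", String.valueOf((start + rows) * overFetchFactor));
        return out;
    }
}
```

In a real SearchComponent the same test would sit at the top of prepare(), reading the params from the request rather than a Map.
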
On Thu, Sep 1, 2011 at 11:10 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> Ok, so I feel like I'm 90% of the way there. For standard queries
> things work fine, but for distributed queries I'm running into a bit
> of an issue. Right now the queries run fine but when doing
> distributed queries (using SolrCloud) the numFound is always getting
> set to the number of requested rows. Can anyone shed some light on
> why this might be happening?
>
> On Tue, Aug 30, 2011 at 8:53 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>> This might work in conjunction with post-processing to help
>> pare down the results, but the logic for the actual access to the data
>> is too complex to have entirely in Solr.
>>
>> On Mon, Aug 29, 2011 at 2:02 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>> It's reasonable, but post-filtering is often difficult, you have
>>> too many documents to wade through. If you can see any way
>>> at all to just include a clause in the query, you'll save a world
>>> of effort...
>>>
>>> Is there any way you can include a value in some kind of
>>> "permissions" field? Let's say you have a document that
>>> is only to be visible for "tier 1" customers. If your permissions
>>> field contained the tiers (e.g. tier0, tier1), then a simple
>>> AND permissions:tier1 would do the trick...
>>>
>>> I know this is a trivial example, but you see where this is headed.
>>> The documents can contain as many of these tokens in permissions
>>> as you want. As long as you can string together a clause
>>> like "AND permissions:(A OR B OR C)" and not have the clause
>>> get ridiculously long (as in thousands of values), that works best.
>>>
>>> Any such scheme depends upon being able to assign the documents
>>> some kind of code that doesn't change too often (because when it does
>>> you have to re-index) and figure out, at query time, what permissions
>>> a user has.
>>>
>>> Using FieldCache or low-level Lucene routines can answer the question
>>> "Does doc X contain token Y in field Z" reasonably easily. What it has
>>> a hard time doing is answering "For document X, what are all the values
>>> in the inverted index in field Z".
>>>
>>> If this doesn't make sense, could you explain a bit more about your
>>> permissions model?
>>>
>>> Hope this helps
>>> Erick
>>>
>>> On Mon, Aug 29, 2011 at 11:46 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>> Thanks guys, perhaps I am just going about this the wrong way. So let
>>>> me explain my problem and perhaps there is a more appropriate
>>>> solution. What I need to do is basically hide certain results based
>>>> on some passed in user parameter (say their service tier for
>>>> instance). What I'd like to do is have some way to plug in my custom
>>>> logic to basically remove certain documents from the result set using
>>>> this information. Now that being said I technically don't need to
>>>> remove the documents from the full result set, I really only need to
>>>> remove them from the current page (but still ensuring that a page is
>>>> filled and sorted). At present I'm trying to see if there is a way
>>>> for me to add this type of logic after the QueryComponent has
>>>> executed, perhaps by going through the DocIdandSet at this point and
>>>> then intersecting the DocIdSet with a DocIdSet which would filter out
>>>> the stuff I don't want seen. Does this sound reasonable or like a
>>>> fool's errand?
>>>>
>>>>
>>>>
>>>> On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher <erik.hatc...@gmail.com>
>>>> wrote:
>>>>> I haven't followed the details, but what I'm guessing you want here is
>>>>> Lucene's FieldCache.
>>>>> Perhaps something along the lines of how faceting
>>>>> uses it (in SimpleFacets.java) -
>>>>>
>>>>>   FieldCache.DocTermsIndex si =
>>>>>       FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>>>>>
>>>>> Erik
>>>>>
>>>>> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>>>>>
>>>>>> If you're asking whether there's a way to find, say,
>>>>>> all the values for the "auth" field associated with
>>>>>> a document... no. The nature of an inverted
>>>>>> index makes this hard (think of finding all
>>>>>> the definitions in a dictionary where the word
>>>>>> "earth" was in the definition).
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>> Thanks Erick, if I did not know the token up front that could be in
>>>>>>> the index is there not an efficient way to get the field for a
>>>>>>> specific document and do some custom processing on it?
>>>>>>>
>>>>>>> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson
>>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>>> Start here I think:
>>>>>>>>
>>>>>>>> http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson <jej2...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Thanks for the reply. The fields I want are indexed, but how would I
>>>>>>>>> go directly at the fields I wanted?
>>>>>>>>>
>>>>>>>>> In regards to indexing the auth tokens I've thought about this and am
>>>>>>>>> trying to get confirmation if that is reasonable given our
>>>>>>>>> constraints.
>>>>>>>>>
>>>>>>>>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson
>>>>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>>>>> Yeah, loading the document inside a Collector is a
>>>>>>>>>> definite no-no. Have you tried going directly
>>>>>>>>>> at the fields you want (assuming they're
>>>>>>>>>> indexed)?
>>>>>>>>>> That *should* be much faster, but
>>>>>>>>>> whether it'll be fast enough is a good question. I'm
>>>>>>>>>> thinking some of the Terms methods here. You
>>>>>>>>>> *might* get some joy out of making sure lazy
>>>>>>>>>> field loading is enabled (and make sure the
>>>>>>>>>> fields you're accessing for your logic are
>>>>>>>>>> indexed), but I'm not entirely sure about
>>>>>>>>>> that bit.
>>>>>>>>>>
>>>>>>>>>> This kind of problem is sometimes handled
>>>>>>>>>> by indexing "auth tokens" with the documents
>>>>>>>>>> and including an OR clause on the query
>>>>>>>>>> with the authorizations for a particular
>>>>>>>>>> user, but that works best if there is an upper
>>>>>>>>>> limit (in the 100s) of tokens that a user can possibly
>>>>>>>>>> have, often this works best with some kind of
>>>>>>>>>> grouping. Making this work when a user can
>>>>>>>>>> have tens of thousands of auth tokens is...er...
>>>>>>>>>> contra-indicated...
>>>>>>>>>>
>>>>>>>>>> Hope this helps a bit...
>>>>>>>>>> Erick
>>>>>>>>>>
>>>>>>>>>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson <jej2...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Just a bit more information. Inside my class which extends
>>>>>>>>>>> FilteredDocIdSet all of the time seems to be getting spent in
>>>>>>>>>>> retrieving the document from the readerCtx, doing this
>>>>>>>>>>>
>>>>>>>>>>>   Document doc = readerCtx.reader.document(docid);
>>>>>>>>>>>
>>>>>>>>>>> If I comment out this and just return true things fly along as I
>>>>>>>>>>> expect. My query is returning a total of 2 million documents also.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson <jej2...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> I have a need to post process Solr results based on some access
>>>>>>>>>>>> controls which are set up outside of Solr, currently we've written
>>>>>>>>>>>> something that extends SearchComponent and in the prepare method
>>>>>>>>>>>> I'm doing something like this
>>>>>>>>>>>>
>>>>>>>>>>>>   QueryWrapperFilter qwf = new QueryWrapperFilter(rb.getQuery());
>>>>>>>>>>>>   Filter filter = new CustomFilter(qwf);
>>>>>>>>>>>>   FilteredQuery fq = new FilteredQuery(rb.getQuery(), filter);
>>>>>>>>>>>>   rb.setQuery(fq);
>>>>>>>>>>>>
>>>>>>>>>>>> Inside my CustomFilter I have a FilteredDocIdSet which checks if
>>>>>>>>>>>> the document should be returned. This works as I expect but for some
>>>>>>>>>>>> reason is very very slow. Even if I take out any of the machinery
>>>>>>>>>>>> which does any logic with the document and only return true in the
>>>>>>>>>>>> FilteredDocIdSet's match method the query still takes an inordinate
>>>>>>>>>>>> amount of time as compared to not including this custom filter.
>>>>>>>>>>>> So my question, is this the most appropriate way of handling this?
>>>>>>>>>>>> What should the performance out of such a setup be expected to be?
>>>>>>>>>>>> Any information/pointers would be greatly appreciated.
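
As a footnote to Erick's suggestion quoted above (a clause like "AND permissions:(A OR B OR C)"), here is a rough sketch of how such a clause could be assembled from a user's tokens at query time. The class and method names (PermissionsClause, build, quote) are made up for illustration, the field name "permissions" is the one from Erick's example, and quoting each token as a phrase is just one simple way to keep odd characters from being parsed as query syntax. The resulting string could be sent as a filter query (fq) or ANDed onto the main query.

```java
import java.util.ArrayList;
import java.util.List;

public class PermissionsClause {

    // Wrap a token in double quotes, escaping backslashes and quotes,
    // so it is treated as a literal term by the query parser.
    static String quote(String token) {
        return "\"" + token.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Build e.g. permissions:("tier0" OR "tier1") from the user's tokens.
    static String build(String field, List<String> tokens) {
        if (tokens.isEmpty()) {
            // Deliberately refuse to build a clause that matches everything.
            throw new IllegalArgumentException("user has no permission tokens");
        }
        List<String> quoted = new ArrayList<>();
        for (String t : tokens) {
            quoted.add(quote(t));
        }
        return field + ":(" + String.join(" OR ", quoted) + ")";
    }
}
```

As Erick notes, this only stays practical while the token list is short (hundreds, not thousands).
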