Well, you've at least confirmed what I was thinking :). I am now using payloads for this, and I think I have something very basic working. Results are not dropped when their score is 0, so I also had to write a custom collector that drops docs with a 0 score and plug it in through the AnalyticsQuery API (maybe there is somewhere better).
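
The zero-score filtering can be sketched independently of Solr. What follows is a minimal, library-free Python sketch under stated assumptions, not the Lucene or Solr collector API: `Hit`, `ListCollector`, `PositiveScoresOnly`, and `collect_positive` are hypothetical names. The wrapper forwards a hit to its delegate only when the score is greater than 0.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a scored document."""
    doc_id: int
    score: float


class ListCollector:
    """Collects every hit it is handed."""

    def __init__(self) -> None:
        self.hits: List[Hit] = []

    def collect(self, hit: Hit) -> None:
        self.hits.append(hit)


class PositiveScoresOnly:
    """Wraps another collector and forwards only hits with score > 0."""

    def __init__(self, delegate: ListCollector) -> None:
        self.delegate = delegate

    def collect(self, hit: Hit) -> None:
        if hit.score > 0.0:
            self.delegate.collect(hit)


def collect_positive(hits: Iterable[Hit]) -> List[Hit]:
    """Run all hits through the filtering wrapper and return what survives."""
    inner = ListCollector()
    outer = PositiveScoresOnly(inner)
    for hit in hits:
        outer.collect(hit)
    return inner.hits
```

Wrapping a delegate, rather than filtering the result list afterwards, keeps the dropped docs out of any counting or sorting the inner collector does.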
On a side note, it would be really nice to be able to plug in a custom collector somewhere; I couldn't find anywhere to do that without using the AnalyticsQuery API. I had hoped to use PositiveScoresOnlyCollector so that I would not have to write anything myself, but I didn't see where I could do that.

Again, I really appreciate all of the feedback on this!

On Thu, Jul 23, 2015 at 12:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Your "ugly problem" is my situation I think ;)
>
> No, your problem is much worse ;(
>
> The _contents_ of fields are restricted, which is
> horrible.
>
> OK, here's another idea out of waaaaaaay left field: Payloads.
>
> It hinges on there being an OK number of possible combinations
> which seems to be the case here. "OK" here means < 1B say. It
> also hinges on being able to pre-calculate the access rights for
> each term as you index it.
>
> Then you attach a payload to each term which is, in effect, the
> authorization token for that term that expresses your possibilities,
> A, B, A&B, A|B, whatever. Payloads are simply a float that
> gets carried along with the term and is accessible at scoring
> time.
>
> Now at scoring time, you "drop out" any terms that have "bad"
> auth tokens. WARNING: this is totally off the top of my head,
> so I'm sure there are gotchas in here. Like does returning 0
> from the scoring negate the search.....
>
> No clue whether this can work for you, but here's some sample
> code that could give you an idea of how it all works:
> https://lucidworks.com/blog/end-to-end-payload-example-in-solr/
>
> Good Luck. You're going places Solr wasn't designed to deal
> with so whatever you do will be "exciting". And you're right,
> creating huge clauses will be a performance issue, the payloads
> thing may help you tame that.
>
> Best,
> Erick
>
> On Thu, Jul 23, 2015 at 7:30 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> > Sorry for being vague, I'll try to explain more.
> > In my use case a particular field does not have a security control, it's
> > the data in the field. So for instance if I had a schema with a field
> > called name, there could be data that should be secured at A, B, A&B,
> > A|B, etc within that field. So again it's not the field that has this
> > control it's the data in the field. My thought based on your suggestion
> > was to dynamically generate the fields based on the authorizations, this
> > way the user would only see name, but it would get translated to the
> > fields in the index that they can see. So at index time if a field was
> > added to the solr document that said name:foo with authorizations A&B I
> > would need to translate that to name_A&B_txt:foo. Then subsequently on
> > search I would check what fields in the index the user should be able to
> > see and rewrite queries that said name:foo to name_A&B_txt:foo (assuming
> > the user can see A&B).
> >
> > We do not explicitly control the fields the user or calling application
> > has access to because I don't want to expose the name_A&B_txt:foo fields
> > to calling applications, they know that a field "name" exists, based on
> > that I need to translate a name:foo query into the appropriately
> > controlled version. Does that make sense?
> >
> > My biggest concern with this (beyond the query rewrite) is how it will
> > impact scoring (especially in the case information is available with
> > multiple markings, i.e. name_A_txt has a value of foo and name_B_txt has
> > a value of foo and the user has authorizations A and B) and possibly
> > bumping up against the maximum clause limit as we expand the query.
> >
> > These reasons were why I thought it best to use payloads to make terms
> > with authorizations a user can't see not impact the score and then
> > resolve the actual object the user can see using a store that already
> > supports this type of access pattern (specifically Accumulo in this case).
> >
> > Your "ugly problem" is my situation I think ;)
> >
> > On Thu, Jul 23, 2015 at 12:06 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> I'm not quite getting it here. I'm guessing that you do not
> >> allow fielded queries or you strictly control the fields a user
> >> sees to pick from. Otherwise your security stuff goes out the
> >> window, say you have a drop-down list of fields to choose from
> >> or something.
> >>
> >> Assuming you do NOT have such a thing, the user is just typing
> >> words in a box, then you have to figure out, once at the
> >> app layer, what fields they have access to and just append a
> >> qf=field_secure1,field_secure2.....
> >> parameter to the query.
> >>
> >> That's it. You do not have to rewrite the user query at all, the q
> >> parameter is just passed through as is.
> >>
> >> bq: I guess in a search component I could look up all of the fields
> >> that are in the index and only run queries against fields they should be
> >> able to see once I know what is in the index (this is what you're
> >> suggesting right?).
> >>
> >> Kind of, except not in a search component. You have to have modeled
> >> the access rights somewhere, so I'm not getting why you can't just use
> >> that model to generate the list of restricted fields the user has
> >> access to. You haven't explained that model other than to say it's
> >> "complex". So I have no clue whether you're talking about not _knowing_
> >> what fields are in the docs in the first place (quite possible with
> >> dynamic fields) or whether you do know the complete field list but
> >> calculating the user's access rights to which fields is complex.
> >>
> >> But I should emphasize again that my assumption is that once calculated,
> >> this list is invariant so it does not need to be done for every request.
> >> Indeed, what I'm envisioning is not writing any Solr code at all, all
> >> done in the app layer.
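
The app-layer approach described above (work out the user's searchable fields once, then append a qf parameter and pass q through unchanged) can be sketched roughly as follows. This is a minimal sketch under assumptions, not code from the thread: the `FIELD_REQUIREMENTS` access model, `allowed_fields`, and `build_query_string` are hypothetical, the field names are borrowed from the examples in the thread, and edismax is assumed as the query parser.

```python
from functools import lru_cache
from typing import Dict, FrozenSet, Tuple
from urllib.parse import urlencode

# Hypothetical access model: field name -> authorizations needed to search it.
# An empty set means everyone may search the field.
FIELD_REQUIREMENTS: Dict[str, FrozenSet[str]] = {
    "title": frozenset(),
    "field_secure1": frozenset({"A"}),
    "field_secure2": frozenset({"A", "B"}),
}


@lru_cache(maxsize=1024)
def allowed_fields(user_auths: FrozenSet[str]) -> Tuple[str, ...]:
    """Fields whose requirements the user fully holds.

    Cached per authorization set, so the calculation runs once rather
    than on every request.
    """
    return tuple(
        sorted(f for f, req in FIELD_REQUIREMENTS.items() if req <= user_auths)
    )


def build_query_string(user_query: str, user_auths: FrozenSet[str]) -> str:
    """Pass q through untouched and append qf for the permitted fields."""
    params = {
        "defType": "edismax",  # assumption: edismax handles the request
        "q": user_query,
        # Joined with spaces, the form the Solr reference guide documents for qf.
        "qf": " ".join(allowed_fields(user_auths)),
    }
    return urlencode(params)
```

For a user holding only authorization A, `build_query_string("foo bar", frozenset({"A"}))` yields a query string whose qf lists `field_secure1` and `title` but not `field_secure2`. Fielded queries typed by the user would still need separate handling, since an explicit field qualifier bypasses qf.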
> >>
> >> As far as extra work, there isn't any as far as Solr is concerned.
> >> It's exactly as though you were specifying this in, say, the request
> >> handler. So I don't get your concern about lots and lots of fields.
> >> Now, I'm assuming a simple document model with some number
> >> of fields. The access rights to which of those fields a user can
> >> see may be a complex calculation, but again you only need to do it
> >> once. For that matter, you could pre-calculate that set of fields
> >> or otherwise cache it.
> >>
> >> Now, this breaks down if the document model isn't that simple,
> >> say the same field in doc1 can be seen by userX, but userX
> >> can't see the _same_ field in doc2. That's an ugly problem...
> >>
> >> And let's further say there are a number of fields that _everyone_
> >> can see. They can be placed in an <appends> section of the request
> >> handler so you don't have to specify them for each request.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Jul 22, 2015 at 4:12 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >> > Looks like this may be what I'm looking for
> >> >
> >> > *SolrRequestInfo*
> >> >
> >> > I have not tried this yet but looks promising.
> >> >
> >> > Assuming this works, thinking about your suggestion I would need to
> >> > rewrite the user's query with the appropriate fields, are there any
> >> > utilities for doing this? I'd be looking to rewrite a fielded query
> >> > like +field:value possibly to something like
> >> > +(field.secure:value field.secure2:value)
> >> >
> >> > Again thanks for suggestions
> >> > On Jul 22, 2015 5:20 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >> >
> >> >> I answered my own question, looks like the field infos are always read
> >> >> within the IndexSearcher so that cost is already being paid.
> >> >>
> >> >> I would potentially have to duplicate information in multiple fields
> >> >> if it was present at multiple authorization levels, is there a limit
> >> >> to the number of fields within a document? I'm also concerned this
> >> >> might skew my search results as terms that had more authorizations
> >> >> would appear in more fields and would result in more matches on
> >> >> query. I'll play with this a little but I am still wondering about
> >> >> my original question.
> >> >>
> >> >> On Wed, Jul 22, 2015 at 4:45 PM, Jamie Johnson <jej2...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> I had thought about this in the past, but thought it might be too
> >> >>> expensive. I guess in a search component I could look up all of the
> >> >>> fields that are in the index and only run queries against fields
> >> >>> they should be able to see once I know what is in the index (this
> >> >>> is what you're suggesting right?).
> >> >>>
> >> >>> My concern would be that the number of fields per document would
> >> >>> grow too large to support this. Our controls aren't simple like
> >> >>> user or admin they are complex combinations of authorizations so I
> >> >>> would think there might be a large number of fields that are
> >> >>> generated using this approach. Would retrieving all field infos
> >> >>> from Solr be expensive on each request to see what they should be
> >> >>> able to query?
> >> >>>
> >> >>> On Wed, Jul 22, 2015 at 4:19 PM, Erick Erickson
> >> >>> <erickerick...@gmail.com> wrote:
> >> >>>
> >> >>>> Why don't you handle it all at the app level? Here's what I mean:
> >> >>>>
> >> >>>> I'm assuming that you're using edismax here, but the same principle
> >> >>>> applies if not.
> >> >>>>
> >> >>>> Your handler (say the "/select" handler) has a "qf" parameter which
> >> >>>> defines the fields that are searched over in the absence of a field
> >> >>>> qualifier, e.g.
> >> >>>> q=whatever&qf=title,description
> >> >>>>
> >> >>>> causes the search term to be looked for in the two fields "title"
> >> >>>> and "description".
> >> >>>> You can also set up the qf fields in the "/select" handler as one
> >> >>>> of the items in the <defaults> section....
> >> >>>>
> >> >>>> But, the qf param in the <defaults> section is just that... a
> >> >>>> default. So individual queries can override it. What I have in mind
> >> >>>> is that you'd look up the user's field-access list and append that
> >> >>>> list as necessary to the query and just pass it on through.
> >> >>>>
> >> >>>> Things to watch out for:
> >> >>>> 1> if the user specifies a field, you'll have to strip that off if
> >> >>>> they don't have rights,
> >> >>>> i.e. q=field1:whatever whenever
> >> >>>> ignores the qf parameter for "whatever" but does respect the qf
> >> >>>> param for "whenever".
> >> >>>> 2> If you have some kind of date field say that you want to facet
> >> >>>> over, you'd have to control that.
> >> >>>> 3> if you have a "bag of words" where you use copyField to add a
> >> >>>> bunch of fields' data to an uber-field then the user can infer some
> >> >>>> things from that info, so you probably want to be careful about
> >> >>>> what copyFields you use.
> >> >>>>
> >> >>>> Best,
> >> >>>> Erick
> >> >>>>
> >> >>>> On Wed, Jul 22, 2015 at 12:21 PM, Jamie Johnson <jej2...@gmail.com>
> >> >>>> wrote:
> >> >>>> > I am looking for a way to prevent fields that users shouldn't be
> >> >>>> > able to know exist from contributing to the score. The goal is to
> >> >>>> > provide a way to essentially hide certain fields from requests
> >> >>>> > based on an access level provided on the query.
> >> >>>> > I have managed to make terms that users shouldn't
> >> >>>> > be able to see not impact the score by implementing a custom
> >> >>>> > Similarity class that looks at the term's payloads and returns 0
> >> >>>> > for the score if they shouldn't know the field exists. The issue
> >> >>>> > however is that I don't have access to the request at this point,
> >> >>>> > so getting the user's access level is proving problematic. Is there
> >> >>>> > a way to get the current request that is being processed via some
> >> >>>> > thread local variable or something similar that Solr maintains? If
> >> >>>> > not, is there another approach that I could be using to access
> >> >>>> > information from the request within my Similarity implementation?
> >> >>>> > Any thoughts on this would be greatly appreciated.
> >> >>>> >
> >> >>>> > -Jamie
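
The payload idea that runs through this thread (each term carries an authorization token, and terms the user may not see contribute 0 to the score) can be illustrated with a small library-free sketch. Everything here is an assumption made for illustration and not Lucene's Similarity or payload API: the `VISIBILITY` table, the token numbering, and the function names are hypothetical. The user's authorizations are passed in explicitly; how to obtain them at scoring time is the open question in the thread, with SolrRequestInfo mentioned as a candidate.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical table: payload token -> visibility expression in disjunctive
# normal form. Each inner set is one AND-group; groups are OR-ed together.
VISIBILITY: Dict[int, List[FrozenSet[str]]] = {
    1: [frozenset({"A"})],                    # A
    2: [frozenset({"B"})],                    # B
    3: [frozenset({"A", "B"})],               # A&B
    4: [frozenset({"A"}), frozenset({"B"})],  # A|B
}


def can_see(token: int, user_auths: FrozenSet[str]) -> bool:
    """True if the user's authorizations satisfy at least one AND-group."""
    groups = VISIBILITY.get(token, [])  # unknown tokens stay hidden
    return any(group <= user_auths for group in groups)


def payload_factor(token: int, user_auths: FrozenSet[str]) -> float:
    """Multiplier for a term's score: 1.0 if visible, 0.0 otherwise."""
    return 1.0 if can_see(token, user_auths) else 0.0


def score_doc(
    term_hits: Iterable[Tuple[float, int]], user_auths: FrozenSet[str]
) -> float:
    """Sum raw term scores, zeroing out terms the user may not see.

    term_hits holds (raw_score, payload_token) pairs for one document.
    """
    return sum(raw * payload_factor(tok, user_auths) for raw, tok in term_hits)
```

With authorizations `{"A"}`, a document matching one term marked A (raw score 2.0) and one marked A&B (raw score 3.0) scores 2.0, and a document matching only a B-marked term scores 0, which is why a collector that drops zero-score docs is still needed afterwards.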