Re: Using payloads and user provided data in score

Jamie Johnson Thu, 23 Jul 2015 14:22:45 -0700

Well you've at least confirmed what I was thinking :).

I am using payloads now for this and I think I have something very basic
working.  Results don't get dropped when their score is 0, so I also had
to write a custom collector that drops docs with a 0 score and plug it in
through the AnalyticsQueryAPI (maybe there is somewhere better).
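
A minimal sketch of the zero-score filtering idea, for illustration only. It
uses a hypothetical HitCollector interface as a stand-in and is not Lucene's
Collector API or the actual collector described here; all class and method
names below are assumptions. The filter wraps another collector and forwards
only hits with a strictly positive score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a search collector; this is NOT Lucene's Collector API.
interface HitCollector {
    void collect(int docId, float score);
}

// Gathers the id of every hit it is handed, in arrival order.
final class ListCollector implements HitCollector {
    final List<Integer> docIds = new ArrayList<>();

    @Override
    public void collect(int docId, float score) {
        docIds.add(docId);
    }
}

// Wraps another collector and forwards only hits whose score is strictly positive.
final class PositiveScoreFilter implements HitCollector {
    private final HitCollector delegate;

    PositiveScoreFilter(HitCollector delegate) {
        this.delegate = delegate;
    }

    @Override
    public void collect(int docId, float score) {
        if (score > 0.0f) {
            delegate.collect(docId, score);
        }
    }
}

final class PositiveScoreSketch {
    // Feeds parallel arrays of doc ids and scores through the filter
    // and returns the ids that survive.
    static List<Integer> keepPositive(int[] docIds, float[] scores) {
        ListCollector sink = new ListCollector();
        HitCollector filter = new PositiveScoreFilter(sink);
        for (int i = 0; i < docIds.length; i++) {
            filter.collect(docIds[i], scores[i]);
        }
        return sink.docIds;
    }

    public static void main(String[] args) {
        // Doc 2 scored 0 (e.g. its terms were not visible to the user), so it is dropped.
        System.out.println(keepPositive(new int[] {1, 2, 3}, new float[] {0.7f, 0.0f, 1.2f}));
    }
}
```

In a real Solr plugin the same check would sit in whatever collector wraps
the default one; where exactly to register it is the open question here.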


On a side note, it would be really nice to be able to plug in a custom
collector somewhere; I couldn't find any way to do that without using the
AnalyticsQueryAPI.  I had hoped to use the PositiveScoresOnlyCollector so I
wouldn't have to write anything, but I didn't see where I could plug it in.

Again I really appreciate all of the feedback on this!

On Thu, Jul 23, 2015 at 12:30 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Your "ugly problem" is my situation I think ;)
>
> No, your problem is much worse ;(
>
> The _contents_ of fields are restricted, which is
> horrible.
>
> OK, here's another idea out of waaaaaaay left field: Payloads.
>
> It hinges on there being an OK number of possible combinations
> which seems to be the case here. "OK" here means < 1B say. It
> also hinges on being able to pre-calculate the access rights for
> each term as you index it.
>
> Then you attach a payload to each term which is, in effect, the
> authorization token for that term that expresses your possibilities,
> A, B, A&B, A|B, whatever. Payloads are simply bytes (typically an
> encoded float) that get carried along with the term and are accessible
> at scoring time.
>
> Now at scoring time, you "drop out" any terms that have "bad"
> auth tokens. WARNING: this is totally off the top of my head,
> so I'm sure there are gotchas in here. Like does returning 0
> from the scoring negate the search.....
>
> No clue whether this can work for you, but here's some sample
> code that could give you an idea of how it all works:
> https://lucidworks.com/blog/end-to-end-payload-example-in-solr/
>
> Good Luck. You're going places Solr wasn't designed to deal
> with so whatever you do will be "exciting". And you're right,
> creating huge clauses will be a performance issue, the payloads
> thing may help you tame that.
>
> Best,
> Erick
>
> On Thu, Jul 23, 2015 at 7:30 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> > Sorry for being vague, I'll try to explain more.  In my use case a
> > particular field does not have a security control, it's the data in the
> > field.  So for instance if I had a schema with a field called name, there
> > could be data that should be secured at A, B, A&B, A|B, etc within that
> > field.  So again it's not the field that has this control it's the data in
> > the field.  My thought based on your suggestion was to dynamically generate
> > the fields based on the authorizations, this way the user would only see
> > name, but it would get translated to the fields in the index that they can
> > see.  So at index time if a field was added to the solr document that said
> > name:foo with authorizations A&B I would need to translate that to
> > name_A&B_txt:foo.  Then subsequently on search I would check what fields in
> > the index the user should be able to see and rewrite queries that said
> > name:foo to name_A&B_txt:foo (assuming the user can see A&B).
> >
> > We do not explicitly control the fields the user or calling application has
> > access to because I don't want to expose the name_A&B_txt:foo fields to
> > calling applications, they know that a field "name" exists, based on that I
> > need to translate a name:foo query into the appropriately controlled
> > version.  Does that make sense?
> >
> > My biggest concern with this (beyond the query rewrite) is how it will
> > impact scoring (especially in the case information is available with
> > multiple markings, i.e. name_A_txt has a value of foo and name_B_txt has a
> > value of foo and the user has authorizations A and B) and possibly bumping
> > up against the maximum clause limit as we expand the query.
> >
> > These reasons were why I thought it best to use payloads to make terms with
> > authorizations a user can't see not impact the score and then resolve the
> > actual object the user can see using a store that already supports this
> > type of access pattern (specifically Accumulo in this case).
> >
> > Your "ugly problem" is my situation I think ;)
> >
> > On Thu, Jul 23, 2015 at 12:06 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> I'm not quite getting it here. I'm guessing that you do not
> >> allow fielded queries or you strictly control the fields a user
> >> sees to pick from. Otherwise your security stuff goes out the
> >> window, say you have a drop-down list of fields to choose from
> >> or something.
> >>
> >> Assuming you do NOT have such a thing, the user is just typing
> >> words in a box, then you have to figure out, once at the
> >> app layer, what fields they have access to and just append a
> >> qf=field_secure1,field_secure2.....
> >> parameter to the query.
> >>
> >> That's it. You do not have to rewrite the user query at all, the q
> >> parameter is just passed through as is.
> >>
> >> bq:  I guess in a search component I could look up all of the fields
> >> that are in the index and only run queries against fields they should be
> >> able to see once I know what is in the index (this is what you're
> >> suggesting right?).
> >>
> >> Kind of, except not in a search component. You have to have modeled
> >> the access rights somewhere, so I'm not getting why you can't just use
> >> that model to generate the list of restricted fields the user has access
> >> to. You haven't explained that model other than to say it's "complex". So I
> >> have no clue whether you're talking about not _knowing_ what fields are
> >> in the docs in the first place (quite possible with dynamic fields) or
> >> whether you do know the complete field list but calculating the user's
> >> access rights to which fields is complex.
> >>
> >> But I should emphasize again that my assumption is that once calculated,
> >> this list is invariant so it does not need to be done for every request.
> >> Indeed, what I'm envisioning is not writing any Solr code at all, all done
> >> in the app layer.
> >>
> >> As far as extra work, there isn't any as far as Solr is concerned.
> >> It's exactly as though you were specifying this in, say, the request
> >> handler. So I don't get your concern about lots and lots of fields.
> >> Now, I'm assuming a simple document model with some number
> >> of fields. The access rights to which of those fields a user can
> >> see may be a complex calculation, but again you only need to do it
> >> once. For that matter, you could pre-calculate that set of fields
> >> or otherwise cache it.
> >>
> >> Now, this breaks down if the document model isn't that simple,
> >> say the same field in doc1 can be seen by userX, but userX
> >> can't see the _same_ field in doc2. That's an ugly problem...
> >>
> >> And let's further say there are a number of fields that _everyone_
> >> can see. They can be placed in an <appends> section of the request
> >> handler so you don't have to specify them for each request.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Jul 22, 2015 at 4:12 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >> > Looks like this may be what I'm looking for
> >> >
> >> > *SolrRequestInfo*
> >> >
> >> > I have not tried this yet but looks promising.
> >> >
> >> > Assuming this works, thinking about your suggestion I would need to rewrite
> >> > the users query with the appropriate fields, are there any utilities for
> >> > doing this?  I'd be looking to rewrite a fielded query like +field:value
> >> > possibly to something like +(field.secure:value field.secure2:value)
> >> >
> >> > Again thanks for suggestions
> >> > On Jul 22, 2015 5:20 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >> >
> >> >> I answered my own question, looks like the field infos are always read
> >> >> within the IndexSearcher so that cost is already being paid.
> >> >>
> >> >> I would potentially have to duplicate information in multiple fields if it
> >> >> was present at multiple authorization levels, is there a limit to the
> >> >> number of fields within a document?  I'm also concerned this might skew my
> >> >> search results as terms that had more authorizations would appear in more
> >> >> fields and would result in more matches on query.  I'll play with this a
> >> >> little but I am still wondering about my original question.
> >> >>
> >> >> On Wed, Jul 22, 2015 at 4:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >> >>
> >> >>> I had thought about this in the past, but thought it might be too
> >> >>> expensive.  I guess in a search component I could look up all of the fields
> >> >>> that are in the index and only run queries against fields they should be
> >> >>> able to see once I know what is in the index (this is what you're
> >> >>> suggesting right?).
> >> >>>
> >> >>> My concern would be that the number of fields per document would grow too
> >> >>> large to support this.  Our controls aren't simple like user or admin they
> >> >>> are complex combinations of authorizations so I would think there might be
> >> >>> a large number of fields that are generated using this approach.  Would
> >> >>> retrieving all field infos from Solr be expensive on each request to see
> >> >>> what they should be able to query?
> >> >>>
> >> >>> On Wed, Jul 22, 2015 at 4:19 PM, Erick Erickson <erickerick...@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>> Why don't you handle it all at the app level? Here's what I mean:
> >> >>>>
> >> >>>> I'm assuming that you're using edismax here, but the same principle
> >> >>>> applies if not.
> >> >>>>
> >> >>>> Your handler (say the "/select" handler) has a "qf" parameter which
> >> >>>> defines the fields that are searched over in the absence of a field
> >> >>>> qualifier, e.g.
> >> >>>> q=whatever&qf=title,description
> >> >>>>
> >> >>>> causes the search term to be looked for in the two fields "title" and
> >> >>>> "description"
> >> >>>> You can also set up the qf fields in the "/select" handler as one of
> >> >>>> the items in the <defaults> section....
> >> >>>>
> >> >>>> But, the qf param in the <defaults> section is just that... a default.
> >> >>>> So individual queries can override it. What I have in mind is that you'd
> >> >>>> look up the user's field-access list and append that list as necessary
> >> >>>> to the query and just pass it on through.
> >> >>>>
> >> >>>> Things to watch out for:
> >> >>>> 1> if the user specifies a field, you'll have to strip that off if
> >> >>>> they don't have rights, i.e. q=field1:whatever whenever
> >> >>>> ignores the qf parameter for "whatever" but does respect the qf param
> >> >>>> for "whenever".
> >> >>>> 2> If you have some kind of date field say that you want to facet
> >> >>>> over, you'd have to control that.
> >> >>>> 3> if you have a "bag of words" where you use copyField to add a bunch
> >> >>>> of fields' data to an uber-field then the user can infer some things from
> >> >>>> that info, so you probably want to be careful about what copyFields you use.
> >> >>>>
> >> >>>> Best,
> >> >>>> Erick
> >> >>>>
> >> >>>> On Wed, Jul 22, 2015 at 12:21 PM, Jamie Johnson <jej2...@gmail.com>
> >> >>>> wrote:
> >> >>>> > I am looking for a way to prevent fields that users shouldn't be able to
> >> >>>> > know exist from contributing to the score.  The goal is to provide a way to
> >> >>>> > essentially hide certain fields from requests based on an access level
> >> >>>> > provided on the query.  I have managed to make terms that users shouldn't
> >> >>>> > be able to see not impact the score by implementing a custom Similarity
> >> >>>> > class that looks at the terms payloads and returns 0 for the score if they
> >> >>>> > shouldn't know the field exists.  The issue however is that I don't have
> >> >>>> > access to the request at this point so getting the users access level is
> >> >>>> > proving problematic.  Is there a way to get the current request that is
> >> >>>> > being processed via some thread local variable or something similar that
> >> >>>> > Solr maintains?  If not is there another approach that I could be using to
> >> >>>> > access information from the request within my Similarity implementation?
> >> >>>> > Any thoughts on this would be greatly appreciated.
> >> >>>> >
> >> >>>> > -Jamie
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >>
>
