Re: ACL implementation: Pseudo-join performance & Atomic Updates

Erick Erickson Tue, 16 Jul 2013 05:02:44 -0700

Roman:

Did this ever make it into a JIRA? Somehow I missed it if it did, and this would
be pretty cool....


Erick

On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com> wrote:
>
>> Hello Erick,
>>
>> > Join performance is most sensitive to the number of values
>> > in the field being joined on. So if you have lots and lots of
>> > distinct values in the corpus, join performance will be affected.
>> Yep, we have a list of unique IDs that we get by first searching for
>> records where loggedInUser IS IN (userIDs)
>> This corpus is stored in memory I suppose? (not a problem) and then the
>> bottleneck is to match this huge set with the core where I'm searching?
>>
>> Somewhere in maillist archive people were talking about "external list of
>> Solr unique IDs"
>> but didn't find if there is a solution.
>> Back in 2010 Yonik posted a comment:
>> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
>>
>
> sorry, I haven't read the previous thread in its entirety, but a few weeks
> back Yonik's proposal got implemented, it seems ;)
>
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>
> You could use this to send a very large bitset filter (which can be
> translated into any set of integers, if you can come up with a mapping
> function).
>
> roman
>
>
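A minimal sketch of the "mapping function" idea above, assuming the permitted document IDs can be mapped to small non-negative integers. The function names and the zlib + base64 wire format are assumptions made for illustration; they are not the format used by the patch linked above.

```python
import base64
import zlib


def ids_to_bitset(ids, size):
    """Pack non-negative integer ids (each < size) into a bytes bitset."""
    bits = bytearray((size + 7) // 8)
    for i in ids:
        if not 0 <= i < size:
            raise ValueError(f"id {i} is outside 0..{size - 1}")
        # Bit (i mod 8) of byte (i div 8) marks id i as present.
        bits[i >> 3] |= 1 << (i & 7)
    return bytes(bits)


def bitset_to_ids(bits):
    """Return the sorted list of ids whose bit is set."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def encode_for_transport(bits):
    """Compress and base64-encode the bitset (hypothetical wire format)."""
    return base64.urlsafe_b64encode(zlib.compress(bits)).decode("ascii")


def decode_from_transport(text):
    """Inverse of encode_for_transport."""
    return zlib.decompress(base64.urlsafe_b64decode(text))
```

A sparse bitset compresses well, so even a filter covering 100,000 IDs stays small enough to send with a request.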
>>
>> > bq: I suppose the delete/reindex approach will not change soon
>> > There is ongoing work (search the JIRA for "Stacked Segments")
>> Ah, ok, I was feeling it affects the architecture, ok, now the only hope is
>> Pseudo-Joins ))
>>
>> > One way to deal with this is to implement a "post filter", sometimes
>> > called a "no cache" filter.
>> thanks, will have a look, but as you describe it, it's not the best option.
>>
>> The approach
>> "too many documents, man. Please refine your query. Partial results below"
>> means faceting will not work correctly?
>>
>> ... I have in mind a hybrid approach, comments welcome:
>> Most of the time users are not searching, but browsing content, so our
>> "virtual filesystem" stored in SOLR will use only the index with the Id of
>> the file and the list of users that have access to it. i.e. not touching
>> the fulltext index at all.
>>
>> Files may have metadata (EXIF info for images for ex) that we'd like to
>> filter by, calculate facets.
>> Meta will be stored in both indexes.
>>
>> In case of a fulltext query:
>> 1. search FT index (the fulltext index), get only the number of search
>> results, let it be Rf
>> 2. search DAC index (the index with permissions), get number of search
>> results, let it be Rd
>>
>> let maxR be the maximum size of the corpus for the pseudo-join.
>> *That was actually my question: what is a reasonable number? 10, 100, 1000?*
>>
>> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto the
>> second one.
>> this happens when (only a few documents contain the search query) OR (user
>> has access to a small number of files).
>>
>> In case none of these happens, we can use the
>> "too many documents, man. Please refine your query. Partial results below"
>> but first searching the FT index, because we want relevant results first.
>>
>> What do you think?
>>
>> Regards,
>> Oleg
>>
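The hybrid approach Oleg describes can be sketched as a small decision function. This is only an illustration of the rule as stated in his message; the function name and the returned labels are hypothetical, and the right value for maxR is exactly the open question in the thread.

```python
def choose_join_plan(rf, rd, max_r):
    """Pick a strategy from the two hit counts.

    rf    -- number of hits in the fulltext (FT) index
    rd    -- number of hits in the permissions (DAC) index
    max_r -- largest result set considered cheap enough to join from
    """
    if rf < max_r or rd < max_r:
        # Join from the smaller result set onto the other index.
        return "join_from_fulltext" if rf <= rd else "join_from_acl"
    # Neither side is small: search the FT index first for relevance
    # and show partial results with a "please refine your query" notice.
    return "partial_results"
```

For example, with an arbitrary cutoff of 1000, `choose_join_plan(50, 100000, 1000)` selects the join from the fulltext side, because only a few documents match the query.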
>>
>>
>>
>> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> > Join performance is most sensitive to the number of values
>> > in the field being joined on. So if you have lots and lots of
>> > distinct values in the corpus, join performance will be affected.
>> >
>> > bq: I suppose the delete/reindex approach will not change soon
>> >
>> > There is ongoing work (search the JIRA for "Stacked Segments")
>> > on actually doing something about this, but it's been "under
>> > consideration" for at least 3 years so your guess is as good as mine.
>> >
>> > bq: notice that the worst situation is when everyone has access to all
>> > the files, it means the first filter will be the full index.
>> >
>> > One way to deal with this is to implement a "post filter", sometimes
>> > called a "no cache" filter. The distinction here is that
>> > 1> it is not cached (duh!)
>> > 2> it is only called for documents that have made it through all the
>> >      other "lower cost" filters (and the main query of course).
>> > 3> "lower cost" means either a standard, cached filter or any
>> >     "no cache" filter with a cost (explicitly stated in the query)
>> >     lower than this one's.
>> >
>> > Critically, and unlike "normal" filter queries, the result set is NOT
>> > calculated for all documents ahead of time....
>> >
>> > You _still_ have to deal with the sysadmin doing a *:* query as you
>> > are well aware. But one can mitigate that by having the post-filter
>> > fail all documents after some arbitrary N, and display a message in the
>> > app like "too many documents, man. Please refine your query. Partial
>> > results below". Of course this may not be acceptable, but....
>> >
>> > HTH
>> > Erick
>> >
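To make the post-filter behaviour Erick describes concrete, here is a plain-Python simulation, not Solr code. It assumes the candidate documents have already passed the main query and all cheaper filters, and it gives up after an arbitrary N checks, as suggested above. The names `apply_post_filter`, `is_allowed` and `max_docs` are hypothetical. In Solr itself, a filter query is usually marked this way with the `cache=false` and `cost` local params.

```python
def apply_post_filter(candidates, is_allowed, max_docs):
    """Check ACLs one document at a time, stopping after max_docs checks.

    candidates -- iterable of doc ids that survived the cheaper filters
    is_allowed -- callable(doc_id) -> bool, the (expensive) ACL check
    max_docs   -- arbitrary cap N on how many documents get checked

    Returns (accepted_ids, truncated). truncated is True when the cap was
    hit, so the application can show "please refine your query".
    """
    accepted = []
    truncated = False
    for checked, doc_id in enumerate(candidates):
        if checked >= max_docs:
            # Fail every remaining document instead of checking it.
            truncated = True
            break
        if is_allowed(doc_id):
            accepted.append(doc_id)
    return accepted, truncated
```

Nothing is computed for the whole index ahead of time: the ACL check runs only for documents that reach the filter, which is the key difference from a cached filter query.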
>> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
>> > <j...@basetechnology.com> wrote:
>> > > Take a look at LucidWorks Search and its access control:
>> > >
>> > > http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
>> > >
>> > > Role-based security is an easier nut to crack.
>> > >
>> > > Karl Wright of ManifoldCF had a Solr patch for document access control
>> > > at one point:
>> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
>> > > security at search time
>> > > https://issues.apache.org/jira/browse/SOLR-1895
>> > >
>> > > http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>> > >
>> > > For some other thoughts:
>> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>> > >
>> > > I'm not sure if external file fields will be of any value in this
>> > > situation.
>> > >
>> > > There is also a proposal for bitwise operations:
>> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
>> > > Bitwise Operations on Integer Fields
>> > > https://issues.apache.org/jira/browse/SOLR-1913
>> > >
>> > > But the bottom line is that clearly updating all documents in the index
>> > > is a non-starter.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > -----Original Message----- From: Oleg Burlaca
>> > > Sent: Sunday, July 14, 2013 11:02 AM
>> > > To: solr-user@lucene.apache.org
>> > > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
>> > >
>> > >
>> > > Hello all,
>> > >
>> > > Situation:
>> > > We have a collection of files in SOLR with ACL applied: each file has a
>> > > multi-valued field that contains the list of userID's that can read it:
>> > >
>> > > here is sample data:
>> > > Id | content   | userId
>> > > 1  | text text | 4,5,6,2
>> > > 2  | text text | 4,5,9
>> > > 3  | text text | 4,2
>> > >
>> > > Problem:
>> > > when ACL is changed for a big folder, we compute the ACL for all child
>> > > items and reindex in SOLR using atomic updates (updating only the
>> > > 'userId' field), but because it deletes/reindexes the record, the
>> > > performance is very poor.
>> > >
>> > > Question:
>> > > I suppose the delete/reindex approach will not change soon (probably
>> > > due to the current SOLR architecture)?
>> > >
>> > > Possible solution: assuming atomic updates will be super fast on an
>> > > index without fulltext, keep a separate ACLIndex and FullTextIndex and
>> > > use Pseudo-Joins:
>> > >
>> > > Example: searching 'foo' as user '999'
>> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
>> > >
>> > > Question: what about performance here? what if the index is 100,000
>> > > records?
>> > > notice that the worst situation is when everyone has access to all the
>> > > files, it means the first filter will be the full index.
>> > >
>> > > Would be happy to get any links that deal with the issue of Pseudo-join
>> > > performance for large datasets (i.e. initial filtered set of IDs).
>> > >
>> > > Regards,
>> > > Oleg
>> > >
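A small sketch of how the pseudo-join request from the example above could be built with proper URL encoding. The index names `ACLIndex` and `FullTextIndex` and the fields `Id` and `userId` come from the original message; the function name and the base URL in the usage note are assumptions.

```python
from urllib.parse import urlencode


def build_acl_join_url(base_url, text_query, user_id):
    """Build a select URL that filters fulltext hits through the ACL index."""
    params = {
        "q": text_query,
        # Keep only documents whose Id appears in ACLIndex for this user.
        "fq": f"{{!join fromIndex=ACLIndex from=Id to=Id}}userId:{int(user_id)}",
    }
    # urlencode escapes the braces, spaces and '=' inside the fq value.
    return base_url + "?" + urlencode(params)
```

For example, `build_acl_join_url("http://localhost:8983/solr/FullTextIndex/select", "foo", 999)` produces the request for searching 'foo' as user '999'. Encoding matters here because the local-params block contains spaces and braces that should not go on the wire unescaped.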
>> > > P.S. We found that storing the list of all users that have access to
>> > > each record is better overall, because there are many more read
>> > > requests (people accessing the library) than write requests (a user
>> > > is added/removed).
>> >
>>
