On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com> wrote:
> Hello Erick,
>
> > Join performance is most sensitive to the number of values
> > in the field being joined on. So if you have lots and lots of
> > distinct values in the corpus, join performance will be affected.
>
> Yep, we have a list of unique Id's that we get by first searching for
> records where loggedInUser IS IN (userIDs).
> This corpus is stored in memory I suppose? (not a problem) and then the
> bottleneck is to match this huge set with the core where I'm searching?
>
> Somewhere in the mailing list archive people were talking about an
> "external list of Solr unique IDs", but I didn't find whether there is a
> solution. Back in 2010 Yonik posted a comment:
> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd

Sorry, I haven't read the previous thread in its entirety, but a few weeks
back Yonik's proposal got implemented, it seems ;)

http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter

You could use this to send a very large bitset filter (which can be
translated into any integers, if you can come up with a mapping function).

roman

> > bq: I suppose the delete/reindex approach will not change soon
> > There is ongoing work (search the JIRA for "Stacked Segments")
>
> Ah, ok, I had a feeling it affects the architecture, so now the only hope
> is Pseudo-Joins ))
>
> > One way to deal with this is to implement a "post filter", sometimes
> > called a "no cache" filter.
>
> Thanks, will have a look, but as you describe it, it's not the best option.
>
> The approach
> "too many documents, man. Please refine your query. Partial results below"
> means faceting will not work correctly?
>
> ... I have in mind a hybrid approach, comments welcome:
> Most of the time users are not searching but browsing content, so our
> "virtual filesystem" stored in SOLR will use only the index with the Id of
> the file and the list of users that have access to it, i.e. not touching
> the fulltext index at all.
>
> Files may have metadata (EXIF info for images, for example) that we'd like
> to filter by and calculate facets on. Meta will be stored in both indexes.
>
> In case of a fulltext query:
> 1. search the FT index (the fulltext index), get only the number of search
>    results, let it be Rf
> 2. search the DAC index (the index with permissions), get the number of
>    search results, let it be Rd
>
> Let maxR be the maximum size of the corpus for the pseudo-join.
> *That was actually my question: what is a reasonable number? 10, 100,
> 1000?*
>
> If (Rf < maxR) or (Rd < maxR), then use the smaller corpus to join onto
> the second one. This happens when (only a few documents contain the search
> query) OR (the user has access to a small number of files).
>
> In case neither of these happens, we can use the
> "too many documents, man. Please refine your query. Partial results below"
> approach, but searching the FT index first, because we want relevant
> results first.
>
> What do you think?
>
> Regards,
> Oleg
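To make the decision step in the quoted message above concrete, here is a
rough SolrJ sketch. Everything in it is illustrative: the core names
FullTextIndex and ACLIndex and the Id/userId fields come from the sample
data further down the thread, user 999 and the maxR value of 1000 are made
up, and the branching is just one way to express "use the smaller corpus to
join onto the second one".

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class HybridAclSearch {

    // The open question from the message above: a placeholder value.
    private static final long MAX_R = 1000;

    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer ftCore  = new HttpSolrServer("http://localhost:8983/solr/FullTextIndex");
        HttpSolrServer aclCore = new HttpSolrServer("http://localhost:8983/solr/ACLIndex");

        String textQuery  = "foo";        // the fulltext query
        String userFilter = "userId:999"; // the logged-in user

        // Step 1: Rf = number of fulltext matches (rows=0, counts only).
        long rf = count(ftCore, textQuery);
        // Step 2: Rd = number of documents this user may read.
        long rd = count(aclCore, userFilter);

        if (rd < MAX_R && rd <= rf) {
            // Small ACL corpus: join it onto the fulltext core.
            SolrQuery q = new SolrQuery(textQuery);
            q.addFilterQuery("{!join fromIndex=ACLIndex from=Id to=Id}" + userFilter);
            System.out.println(ftCore.query(q).getResults().getNumFound() + " hits");
        } else if (rf < MAX_R) {
            // Small fulltext corpus: join it onto the ACL core instead.
            SolrQuery q = new SolrQuery(userFilter);
            q.addFilterQuery("{!join fromIndex=FullTextIndex from=Id to=Id}" + textQuery);
            System.out.println(aclCore.query(q).getResults().getNumFound() + " hits");
        } else {
            // Both corpora are large: fall back to partial results / refusal.
            System.out.println("Too many documents, please refine the query.");
        }
    }

    private static long count(HttpSolrServer core, String query) throws SolrServerException {
        SolrQuery q = new SolrQuery(query);
        q.setRows(0); // only numFound is needed
        return core.query(q).getResults().getNumFound();
    }
}

The two rows=0 queries only fetch counts, so the extra round trips are
typically cheap compared to the join itself.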
>
> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Join performance is most sensitive to the number of values
> > in the field being joined on. So if you have lots and lots of
> > distinct values in the corpus, join performance will be affected.
> >
> > bq: I suppose the delete/reindex approach will not change soon
> >
> > There is ongoing work (search the JIRA for "Stacked Segments")
> > on actually doing something about this, but it's been "under
> > consideration" for at least 3 years, so your guess is as good as mine.
> >
> > bq: notice that the worst situation is when everyone has access to all
> > the files, it means the first filter will be the full index.
> >
> > One way to deal with this is to implement a "post filter", sometimes
> > called a "no cache" filter. The distinction here is that
> > 1> it is not cached (duh!)
> > 2> it is only called for documents that have made it through all the
> >    other "lower cost" filters (and the main query of course).
> > 3> "lower cost" means the filter is either a standard, cached filter
> >    or a "no cache" filter with a cost (explicitly stated in the query)
> >    lower than this one's.
> >
> > Critically, and unlike "normal" filter queries, the result set is NOT
> > calculated for all documents ahead of time....
> >
> > You _still_ have to deal with the sysadmin doing a *:* query as you
> > are well aware. But one can mitigate that by having the post-filter
> > fail all documents after some arbitrary N, and display a message in the
> > app like "too many documents, man. Please refine your query. Partial
> > results below". Of course this may not be acceptable, but....
> >
> > HTH
> > Erick
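As a small illustration of the invocation side Erick describes:
cache=false together with a cost of 100 or more is what tells Solr to run a
filter as a post filter, provided the underlying query class implements the
PostFilter interface. The {!acl} parser below is purely hypothetical; the
actual ACL logic would live in its Java implementation.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PostFilterInvocation {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/FullTextIndex");

        SolrQuery q = new SolrQuery("foo");
        // cache=false + cost >= 100: run this filter last, only against
        // documents that survived the main query and the cheaper filters.
        // "acl" is a hypothetical custom query parser; its query class
        // would need to implement PostFilter for this to take effect.
        q.addFilterQuery("{!acl cache=false cost=200 user=999}");

        System.out.println(solr.query(q).getResults().getNumFound() + " hits");
    }
}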
> >
> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> > > Take a look at LucidWorks Search and its access control:
> > > http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> > >
> > > Role-based security is an easier nut to crack.
> > >
> > > Karl Wright of ManifoldCF had a Solr patch for document access control
> > > at one point:
> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> > > security at search time
> > > https://issues.apache.org/jira/browse/SOLR-1895
> > >
> > > http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> > >
> > > For some other thoughts:
> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> > >
> > > I'm not sure if external file fields will be of any value in this
> > > situation.
> > >
> > > There is also a proposal for bitwise operations:
> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> > > Bitwise Operations on Integer Fields
> > > https://issues.apache.org/jira/browse/SOLR-1913
> > >
> > > But the bottom line is that clearly updating all documents in the index
> > > is a non-starter.
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Oleg Burlaca
> > > Sent: Sunday, July 14, 2013 11:02 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
> > >
> > > Hello all,
> > >
> > > Situation:
> > > We have a collection of files in SOLR with ACL applied: each file has a
> > > multi-valued field that contains the list of userID's that can read it.
> > >
> > > Here is sample data:
> > > Id | content   | userId
> > > 1  | text text | 4,5,6,2
> > > 2  | text text | 4,5,9
> > > 3  | text text | 4,2
> > >
> > > Problem:
> > > When the ACL is changed for a big folder, we compute the ACL for all
> > > child items and reindex them in SOLR using atomic updates (updating only
> > > the 'userId' field), but because it deletes/reindexes the record, the
> > > performance is very poor.
> > >
> > > Question:
> > > I suppose the delete/reindex approach will not change soon (probably
> > > it's due to the actual SOLR architecture)?
> > >
> > > Possible solution: assuming atomic updates will be super fast on an
> > > index without fulltext, keep a separate ACLIndex and FullTextIndex and
> > > use Pseudo-Joins.
> > >
> > > Example: searching 'foo' as user '999'
> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
> > >
> > > Question: what about performance here? What if the index is 100,000
> > > records?
> > > Notice that the worst situation is when everyone has access to all the
> > > files; it means the first filter will be the full index.
> > >
> > > Would be happy to get any links that deal with the issue of pseudo-join
> > > performance for large datasets (i.e. the initial filtered set of IDs).
> > >
> > > Regards,
> > > Oleg
> > >
> > > P.S. We found that having the list of all users that have access for
> > > each record is better overall, because there are many more read requests
> > > (people accessing the library) than write requests (a new user is
> > > added/removed).
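For completeness, a minimal SolrJ sketch of the atomic update described
under "Problem" above, i.e. replacing only the userId list of one document.
The core URL, document id and user list are illustrative; note that Solr
still rewrites the whole document internally, which is exactly the cost
being discussed.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AclAtomicUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/ACLIndex");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("Id", "1"); // unique key of the file whose ACL changed

        // Atomic update: "set" replaces the stored userId list without
        // resending the rest of the document. Internally Solr still
        // re-indexes the whole document from its stored fields.
        Map<String, Object> newAcl = new HashMap<String, Object>();
        newAcl.put("set", Arrays.asList(4, 5, 6, 7));
        doc.addField("userId", newAcl);

        solr.add(doc);
        solr.commit();
    }
}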