Re: ACL implementation: Pseudo-join performance & Atomic Updates

Roman Chyla Wed, 17 Jul 2013 14:55:20 -0700

Hello Oleg,


On Wed, Jul 17, 2013 at 3:49 PM, Oleg Burlaca <oburl...@gmail.com> wrote:

> Hello Roman and all,
>
> > sorry, haven't read the previous thread in its entirety, but a few weeks
> > back Yonik's proposal got implemented, it seems ;)
>
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>
> In that post I see a reference to your plugin BitSetQParserPlugin, right ?
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
>
> I understood it as follows:
> 1. query the core and get ALL search results,
>    search results == (id1, id2, id7 .. id28263)   // a long array of
> unique IDs
> 2. Generate a bitset from this array of IDs
> 3. search a core using a bitsetfilter
>
> Correct?
>

yes, the BitSetQParserPlugin does the 3rd step

the unit test may explain it better:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
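
For steps 1 and 2, here is a minimal sketch in Python of turning a list of
integer IDs into a bitset and encoding it for transport. It is not taken from
the plugin: the function names, the bit order and the Base64 encoding are all
assumptions made for illustration, and the IDs are assumed to already be
(or be mapped to) small non-negative integers.

```python
import base64


def ids_to_bitset(doc_ids, size):
    """Return a bytearray with one bit set per integer ID in doc_ids."""
    bits = bytearray((size + 7) // 8)
    for doc_id in doc_ids:
        if not 0 <= doc_id < size:
            raise ValueError(f"id {doc_id} outside 0..{size - 1}")
        # Assumption: least significant bit first within each byte.
        bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bits


def bitset_to_ids(bits):
    """Inverse of ids_to_bitset: list the positions of all set bits."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def encode_for_transport(bits):
    """Base64-encode the bitset so it can travel as a request parameter."""
    return base64.b64encode(bytes(bits)).decode("ascii")


# Hypothetical result of the first query: a list of integer IDs.
ids = [1, 2, 7, 28263]
bitset = ids_to_bitset(ids, size=28264)
payload = encode_for_transport(bitset)
```

The point of the bitset is size: 28,264 possible IDs fit in about 3.5 KB
before encoding, no matter how many of them are set.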



>
> I was thinking that pseudo-joins can help with exactly this situation
> (actually, I haven't even tried pseudo-joins yet; I'm still watching the
> mailing list),
> i.e. to make the first step efficient and at the same time perform a second
> query without sending a lot of data to the client and then receiving this
> data back.
>
> I have a feeling that this situation, where a list of unique IDs from query1
> participates in a filter in query2,
> happens frequently, and it would be very useful if SOLR had an optimized
> approach to handle it.
> Hmm, that would turn the pseudo-join into a real JOIN as in the SQL world.
>
> I think I'll just test the performance of pseudo-joins with large
> datasets (I was waiting to find the perfect solution).
>

I'd be very curious; if you do some experiments, please let us know. Thanks,

roman


>
> Thanks for all the ideas/links, now I have a better view of the situation.
>
> Regards.
>
>
>
>
> On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson <erickerick...@gmail.com
> >wrote:
>
> > Roman:
> >
> > I think that SOLR-1913 is completely different. It's
> > about having a field in a document and being able
> > to do bitwise operations on it. So say I have a
> > field in a Solr doc with the value 6 in it. I can then
> > form a query like
> > {!bitwise field=myfield op=AND source=2}
> > and it would match.
> >
> > You're talking about a much different operation as I
> > understand it.
> >
> > In which case, go ahead and open up a JIRA, there's
> > no harm in it.
> >
> > Best
> > Erick
> >
> > On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla <roman.ch...@gmail.com>
> > wrote:
> > > Erick,
> > >
> > > I wasn't sure this issue is important, so I wanted to solicit some
> > > feedback first. You and Otis expressed interest, and I could create the
> > > JIRA. However, as Alexandre points out, SOLR-1913 seems similar (actually,
> > > closer to Otis's request to have the Elasticsearch named filter), but
> > > SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering
> > > whether this new feature (somewhat overlapping, but still different from
> > > SOLR-1913) is something people would really want and whether the effort
> > > on the JIRA would be well spent. What's your view?
> > >
> > > Thanks,
> > >
> > >   roman
> > >
> > >
> > >
> > >
> > > On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
> > > <arafa...@gmail.com>wrote:
> > >
> > >> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
> > >>
> > >> Regards,
> > >>    Alex.
> > >>
> > >> Personal website: http://www.outerthoughts.com/
> > >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > >> - Time is the quality of nature that keeps events from happening all
> at
> > >> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> > book)
> > >>
> > >>
> > >> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <
> > erickerick...@gmail.com
> > >> >wrote:
> > >>
> > >> > Roman:
> > >> >
> > >> > Did this ever make into a JIRA? Somehow I missed it if it did, and
> > this
> > >> > would
> > >> > be pretty cool....
> > >> >
> > >> > Erick
> > >> >
> > >> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com
> >
> > >> > wrote:
> > >> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com
> >
> > >> > wrote:
> > >> > >
> > >> > >> Hello Erick,
> > >> > >>
> > >> > >> > Join performance is most sensitive to the number of values
> > >> > >> > in the field being joined on. So if you have lots and lots of
> > >> > >> > distinct values in the corpus, join performance will be
> affected.
> > >> > >> Yep, we have a list of unique Id's that we get by first searching
> > for
> > >> > >> records
> > >> > >> where loggedInUser IS IN (userIDs)
> > >> > >> This corpus is stored in memory I suppose? (not a problem) and
> then
> > >> the
> > >> > >> bottleneck is to match this huge set with the core where I'm
> > >> searching?
> > >> > >>
> > >> > >> Somewhere in maillist archive people were talking about "external
> > list
> > >> > of
> > >> > >> Solr unique IDs"
> > >> > >> but didn't find if there is a solution.
> > >> > >> Back in 2010 Yonik posted a comment:
> > >> > >>
> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
> > >> > >>
> > >> > >
> > >> > > sorry, haven't read the previous thread in its entirety, but a few
> > >> > > weeks back Yonik's proposal got implemented, it seems ;)
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
> > >> > >
> > >> > > You could use this to send very large bitset filter (which can be
> > >> > > translated into any integers, if you can come up with a mapping
> > >> > function).
> > >> > >
> > >> > > roman
> > >> > >
> > >> > >
> > >> > >>
> > >> > >> > bq: I suppose the delete/reindex approach will not change soon
> > >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> > >> > >> Ah, ok, I was feeling it affects the architecture, ok, now the
> only
> > >> > hope is
> > >> > >> Pseudo-Joins ))
> > >> > >>
> > >> > >> > One way to deal with this is to implement a "post filter",
> > sometimes
> > >> > >> called
> > >> > >> > a "no cache" filter.
> > >> > >> thanks, will have a look, but as you describe it, it's not the
> best
> > >> > option.
> > >> > >>
> > >> > >> The approach
> > >> > >> "too many documents, man. Please refine your query. Partial
> results
> > >> > below"
> > >> > >> means faceting will not work correctly?
> > >> > >>
> > >> > >> ... I have in mind a hybrid approach, comments welcome:
> > >> > >> Most of the time users are not searching, but browsing content,
> so
> > our
> > >> > >> "virtual filesystem" stored in SOLR will use only the index with
> > the
> > >> Id
> > >> > of
> > >> > >> the file and the list of users that have access to it. i.e. not
> > >> touching
> > >> > >> the fulltext index at all.
> > >> > >>
> > >> > >> Files may have metadata (EXIF info for images for ex) that we'd
> > like
> > >> to
> > >> > >> filter by, calculate facets.
> > >> > >> Meta will be stored in both indexes.
> > >> > >>
> > >> > >> In case of a fulltext query:
> > >> > >> 1. search FT index (the fulltext index), get only the number of
> > search
> > >> > >> results, let it be Rf
> > >> > >> 2. search DAC index (the index with permissions), get number of
> > search
> > >> > >> results, let it be Rd
> > >> > >>
> > >> > >> let maxR be the maximum size of the corpus for the pseudo-join.
> > >> > >> *That was actually my question: what is a reasonable number? 10,
> > 100,
> > >> > 1000
> > >> > >> ?
> > >> > >> *
> > >> > >>
> > >> > >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join
> > onto
> > >> > the
> > >> > >> second one.
> > >> > >> this happens when (only a few documents contains the search
> query)
> > OR
> > >> > (user
> > >> > >> has access to a small number of files).
> > >> > >>
> > >> > >> In case none of these happens, we can use the
> > >> > >> "too many documents, man. Please refine your query. Partial
> results
> > >> > below"
> > >> > >> but first searching the FT index, because we want relevant
> results
> > >> > first.
> > >> > >>
> > >> > >> What do you think?
> > >> > >>
> > >> > >> Regards,
> > >> > >> Oleg
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <
> > >> > erickerick...@gmail.com
> > >> > >> >wrote:
> > >> > >>
> > >> > >> > Join performance is most sensitive to the number of values
> > >> > >> > in the field being joined on. So if you have lots and lots of
> > >> > >> > distinct values in the corpus, join performance will be
> affected.
> > >> > >> >
> > >> > >> > bq: I suppose the delete/reindex approach will not change soon
> > >> > >> >
> > >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> > >> > >> > on actually doing something about this, but it's been "under
> > >> > >> consideration"
> > >> > >> > for at least 3 years so your guess is as good as mine.
> > >> > >> >
> > >> > >> > bq: notice that the worst situation is when everyone has access
> > to
> > >> all
> > >> > >> the
> > >> > >> > files, it means the first filter will be the full index.
> > >> > >> >
> > >> > >> > One way to deal with this is to implement a "post filter",
> > sometimes
> > >> > >> called
> > >> > >> > a "no cache" filter. The distinction here is that
> > >> > >> > 1> it is not cached (duh!)
> > >> > >> > 2> it is only called for documents that have made it through
> all
> > the
> > >> > >> >      other "lower cost" filters (and the main query of course).
> > >> > >> > 3> "lower cost" means the filter is either a standard, cached
> > >> filters
> > >> > >> >     and any "no cache" filters with a cost (explicitly stated
> in
> > the
> > >> > >> query)
> > >> > >> >     lower than this one's.
> > >> > >> >
> > >> > >> > Critically, and unlike "normal" filter queries, the result set
> is
> > >> NOT
> > >> > >> > calculated for all documents ahead of time....
> > >> > >> >
> > >> > >> > You _still_ have to deal with the sysadmin doing a *:* query as
> > you
> > >> > >> > are well aware. But one can mitigate that by having the
> > post-filter
> > >> > >> > fail all documents after some arbitrary N, and display a
> message
> > in
> > >> > the
> > >> > >> > app like "too many documents, man. Please refine your query.
> > Partial
> > >> > >> > results below". Of course this may not be acceptable, but....
> > >> > >> >
> > >> > >> > HTH
> > >> > >> > Erick
> > >> > >> >
> > >> > >> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
> > >> > >> > <j...@basetechnology.com> wrote:
> > >> > >> > > Take a look at LucidWorks Search and its access control:
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> > >> > >> > >
> > >> > >> > > Role-based security is an easier nut to crack.
> > >> > >> > >
> > >> > >> > > Karl Wright of ManifoldCF had a Solr patch for document
> access
> > >> > control
> > >> > >> at
> > >> > >> > > one point:
> > >> > >> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing
> > >> > ManifoldCF
> > >> > >> > > security at search time
> > >> > >> > > https://issues.apache.org/jira/browse/SOLR-1895
> > >> > >> > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> > >> > >> > >
> > >> > >> > > For some other thoughts:
> > >> > >> > >
> > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> > >> > >> > >
> > >> > >> > > I'm not sure if external file fields will be of any value in
> > this
> > >> > >> > situation.
> > >> > >> > >
> > >> > >> > > There is also a proposal for bitwise operations:
> > >> > >> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering
> > >> Based
> > >> > on
> > >> > >> > > Bitwise Operations on Integer Fields
> > >> > >> > > https://issues.apache.org/jira/browse/SOLR-1913
> > >> > >> > >
> > >> > >> > > But the bottom line is that clearly updating all documents in
> > the
> > >> > index
> > >> > >> > is a
> > >> > >> > > non-starter.
> > >> > >> > >
> > >> > >> > > -- Jack Krupansky
> > >> > >> > >
> > >> > >> > > -----Original Message----- From: Oleg Burlaca
> > >> > >> > > Sent: Sunday, July 14, 2013 11:02 AM
> > >> > >> > > To: solr-user@lucene.apache.org
> > >> > >> > > Subject: ACL implementation: Pseudo-join performance & Atomic
> > >> > Updates
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > Hello all,
> > >> > >> > >
> > >> > >> > > Situation:
> > >> > >> > > We have a collection of files in SOLR with ACL applied: each
> > file
> > >> > has a
> > >> > >> > > multi-valued field that contains the list of userID's that
> can
> > >> read
> > >> > it:
> > >> > >> > >
> > >> > >> > > here is sample data:
> > >> > >> > > Id | content  | userId
> > >> > >> > > 1  | text text | 4,5,6,2
> > >> > >> > > 2  | text text | 4,5,9
> > >> > >> > > 3  | text text | 4,2
> > >> > >> > >
> > >> > >> > > Problem:
> > >> > >> > > when ACL is changed for a big folder, we compute the ACL for
> > all
> > >> > child
> > >> > >> > > items and reindex in SOLR using atomic updates (updating only
> > >> > 'userIds'
> > >> > >> > > column), but because it deletes/reindexes the record, the
> > >> > performance
> > >> > >> is
> > >> > >> > > very poor.
> > >> > >> > >
> > >> > >> > > Question:
> > >> > >> > > I suppose the delete/reindex approach will not change soon
> > >> (probably
> > >> > >> it's
> > >> > >> > > due to actual SOLR architecture), ?
> > >> > >> > >
> > >> > >> > > Possible solution: assuming atomic updates will be super fast
> > on
> > >> an
> > >> > >> index
> > >> > >> > > without fulltext, keep a separate ACLIndex and FullTextIndex
> > and
> > >> use
> > >> > >> > > Pseudo-Joins:
> > >> > >> > >
> > >> > >> > > Example: searching 'foo' as user '999'
> > >> > >> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex
> > >> > from=Id
> > >> > >> > to=Id
> > >> > >> > > }userId:999
> > >> > >> > >
> > >> > >> > > Question: what about performance here? what if the index is
> > >> 100,000
> > >> > >> > > records?
> > >> > >> > > notice that the worst situation is when everyone has access
> to
> > all
> > >> > the
> > >> > >> > > files, it means the first filter will be the full index.
> > >> > >> > >
> > >> > >> > > Would be happy to get any links that deal with the issue of
> > >> > Pseudo-join
> > >> > >> > > performance for large datasets (i.e. initial filtered set of
> > IDs).
> > >> > >> > >
> > >> > >> > > Regards,
> > >> > >> > > Oleg
> > >> > >> > >
> > >> > >> > > P.S. we found that having the list of all users that have
> > access
> > >> for
> > >> > >> each
> > >> > >> > > record is better overall, because there are much more read
> > >> requests
> > >> > >> > (people
> > >> > >> > > accessing the library) then write requests (a new user is
> > >> > >> added/removed).
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
>
