On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com> wrote:
> Hello Erick,
>
> > Join performance is most sensitive to the number of values
> > in the field being joined on. So if you have lots and lots of
> > distinct values in the corpus, join performance will be affected.
>
> Yep, we have a list of unique Id's that we get by first searching for
> records where loggedInUser IS IN (userIDs).
> This corpus is stored in memory I suppose? (not a problem) and then the
> bottleneck is to match this huge set with the core where I'm searching?
>
> Somewhere in the mailing list archive people were talking about an
> "external list of Solr unique IDs", but I didn't find whether there is a
> solution. Back in 2010 Yonik posted a comment:
> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd

Sorry, I haven't read the previous thread in its entirety, but a few weeks
back Yonik's proposal got implemented, it seems ;)

http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter

You could use this to send a very large bitset filter (which can be
translated into any integers, if you can come up with a mapping function).

roman

> > bq: I suppose the delete/reindex approach will not change soon
> > There is ongoing work (search the JIRA for "Stacked Segments")
>
> Ah, ok, I had a feeling it affects the architecture, so now the only hope
> is Pseudo-Joins ))
>
> > One way to deal with this is to implement a "post filter", sometimes
> > called a "no cache" filter.
>
> Thanks, will have a look, but as you describe it, it's not the best option.
>
> The approach
> "too many documents, man. Please refine your query. Partial results below"
> means faceting will not work correctly?
>
> ... I have in mind a hybrid approach, comments welcome:
> Most of the time users are not searching but browsing content, so our
> "virtual filesystem" stored in SOLR will use only the index with the Id of
> the file and the list of users that have access to it, i.e. not touching
> the fulltext index at all.
>
> Files may have metadata (EXIF info for images, for example) that we'd like
> to filter by and calculate facets on. Meta will be stored in both indexes.
>
> In case of a fulltext query:
> 1. search the FT index (the fulltext index), get only the number of search
>    results, let it be Rf
> 2. search the DAC index (the index with permissions), get the number of
>    search results, let it be Rd
>
> Let maxR be the maximum size of the corpus for the pseudo-join.
> *That was actually my question: what is a reasonable number? 10, 100,
> 1000?*
>
> If (Rf < maxR) or (Rd < maxR), then use the smaller corpus to join onto
> the second one. This happens when (only a few documents contain the search
> query) OR (the user has access to a small number of files).
>
> In case neither of these happens, we can use the
> "too many documents, man. Please refine your query. Partial results below"
> approach, but searching the FT index first, because we want relevant
> results first.
>
> What do you think?
>
> Regards,
> Oleg
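To make the decision step in the quoted message above concrete, here is a
rough SolrJ sketch. Everything in it is illustrative: the core names
FullTextIndex and ACLIndex and the Id/userId fields come from the sample
data further down the thread, user 999 and the maxR value of 1000 are made
up, and the branching is just one way to express "use the smaller corpus to
join onto the second one".

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class HybridAclSearch {

    // The open question from the message above: a placeholder value.
    private static final long MAX_R = 1000;

    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer ftCore  = new HttpSolrServer("http://localhost:8983/solr/FullTextIndex");
        HttpSolrServer aclCore = new HttpSolrServer("http://localhost:8983/solr/ACLIndex");

        String textQuery  = "foo";        // the fulltext query
        String userFilter = "userId:999"; // the logged-in user

        // Step 1: Rf = number of fulltext matches (rows=0, counts only).
        long rf = count(ftCore, textQuery);
        // Step 2: Rd = number of documents this user may read.
        long rd = count(aclCore, userFilter);

        if (rd < MAX_R && rd <= rf) {
            // Small ACL corpus: join it onto the fulltext core.
            SolrQuery q = new SolrQuery(textQuery);
            q.addFilterQuery("{!join fromIndex=ACLIndex from=Id to=Id}" + userFilter);
            System.out.println(ftCore.query(q).getResults().getNumFound() + " hits");
        } else if (rf < MAX_R) {
            // Small fulltext corpus: join it onto the ACL core instead.
            SolrQuery q = new SolrQuery(userFilter);
            q.addFilterQuery("{!join fromIndex=FullTextIndex from=Id to=Id}" + textQuery);
            System.out.println(aclCore.query(q).getResults().getNumFound() + " hits");
        } else {
            // Both corpora are large: fall back to partial results / refusal.
            System.out.println("Too many documents, please refine the query.");
        }
    }

    private static long count(HttpSolrServer core, String query) throws SolrServerException {
        SolrQuery q = new SolrQuery(query);
        q.setRows(0); // only numFound is needed
        return core.query(q).getResults().getNumFound();
    }
}

The two rows=0 queries only fetch counts, so the extra round trips are
typically cheap compared to the join itself.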
>
> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Join performance is most sensitive to the number of values
> > in the field being joined on. So if you have lots and lots of
> > distinct values in the corpus, join performance will be affected.
> >
> > bq: I suppose the delete/reindex approach will not change soon
> >
> > There is ongoing work (search the JIRA for "Stacked Segments")
> > on actually doing something about this, but it's been "under
> > consideration" for at least 3 years, so your guess is as good as mine.
> >
> > bq: notice that the worst situation is when everyone has access to all
> > the files, it means the first filter will be the full index.
> >
> > One way to deal with this is to implement a "post filter", sometimes
> > called a "no cache" filter. The distinction here is that
> > 1> it is not cached (duh!)
> > 2> it is only called for documents that have made it through all the
> >    other "lower cost" filters (and the main query of course).
> > 3> "lower cost" means the filter is either a standard, cached filter
> >    or a "no cache" filter with a cost (explicitly stated in the query)
> >    lower than this one's.
> >
> > Critically, and unlike "normal" filter queries, the result set is NOT
> > calculated for all documents ahead of time....
> >
> > You _still_ have to deal with the sysadmin doing a *:* query as you
> > are well aware. But one can mitigate that by having the post-filter
> > fail all documents after some arbitrary N, and display a message in the
> > app like "too many documents, man. Please refine your query. Partial
> > results below". Of course this may not be acceptable, but....
> >
> > HTH
> > Erick
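As a small illustration of the invocation side Erick describes:
cache=false together with a cost of 100 or more is what tells Solr to run a
filter as a post filter, provided the underlying query class implements the
PostFilter interface. The {!acl} parser below is purely hypothetical; the
actual ACL logic would live in its Java implementation.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PostFilterInvocation {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/FullTextIndex");

        SolrQuery q = new SolrQuery("foo");
        // cache=false + cost >= 100: run this filter last, only against
        // documents that survived the main query and the cheaper filters.
        // "acl" is a hypothetical custom query parser; its query class
        // would need to implement PostFilter for this to take effect.
        q.addFilterQuery("{!acl cache=false cost=200 user=999}");

        System.out.println(solr.query(q).getResults().getNumFound() + " hits");
    }
}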
> >
> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> > > Take a look at LucidWorks Search and its access control:
> > > http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> > >
> > > Role-based security is an easier nut to crack.
> > >
> > > Karl Wright of ManifoldCF had a Solr patch for document access control
> > > at one point:
> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> > > security at search time
> > > https://issues.apache.org/jira/browse/SOLR-1895
> > >
> > > http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> > >
> > > For some other thoughts:
> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> > >
> > > I'm not sure if external file fields will be of any value in this
> > > situation.
> > >
> > > There is also a proposal for bitwise operations:
> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> > > Bitwise Operations on Integer Fields
> > > https://issues.apache.org/jira/browse/SOLR-1913
> > >
> > > But the bottom line is that clearly updating all documents in the index
> > > is a non-starter.
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Oleg Burlaca
> > > Sent: Sunday, July 14, 2013 11:02 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
> > >
> > > Hello all,
> > >
> > > Situation:
> > > We have a collection of files in SOLR with ACL applied: each file has a
> > > multi-valued field that contains the list of userID's that can read it.
> > >
> > > Here is sample data:
> > > Id | content   | userId
> > > 1  | text text | 4,5,6,2
> > > 2  | text text | 4,5,9
> > > 3  | text text | 4,2
> > >
> > > Problem:
> > > When the ACL is changed for a big folder, we compute the ACL for all
> > > child items and reindex them in SOLR using atomic updates (updating only
> > > the 'userId' field), but because it deletes/reindexes the record, the
> > > performance is very poor.
> > >
> > > Question:
> > > I suppose the delete/reindex approach will not change soon (probably
> > > it's due to the actual SOLR architecture)?
> > >
> > > Possible solution: assuming atomic updates will be super fast on an
> > > index without fulltext, keep a separate ACLIndex and FullTextIndex and
> > > use Pseudo-Joins.
> > >
> > > Example: searching 'foo' as user '999'
> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
> > >
> > > Question: what about performance here? What if the index is 100,000
> > > records?
> > > Notice that the worst situation is when everyone has access to all the
> > > files; it means the first filter will be the full index.
> > >
> > > Would be happy to get any links that deal with the issue of pseudo-join
> > > performance for large datasets (i.e. the initial filtered set of IDs).
> > >
> > > Regards,
> > > Oleg
> > >
> > > P.S. We found that having the list of all users that have access for
> > > each record is better overall, because there are many more read requests
> > > (people accessing the library) than write requests (a new user is
> > > added/removed).
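For completeness, a minimal SolrJ sketch of the atomic update described
under "Problem" above, i.e. replacing only the userId list of one document.
The core URL, document id and user list are illustrative; note that Solr
still rewrites the whole document internally, which is exactly the cost
being discussed.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AclAtomicUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/ACLIndex");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("Id", "1"); // unique key of the file whose ACL changed

        // Atomic update: "set" replaces the stored userId list without
        // resending the rest of the document. Internally Solr still
        // re-indexes the whole document from its stored fields.
        Map<String, Object> newAcl = new HashMap<String, Object>();
        newAcl.put("set", Arrays.asList(4, 5, 6, 7));
        doc.addField("userId", newAcl);

        solr.add(doc);
        solr.commit();
    }
}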