Hello Oleg,
On Wed, Jul 17, 2013 at 3:49 PM, Oleg Burlaca <oburl...@gmail.com> wrote:

> Hello Roman and all,
>
> > sorry, haven't the previous thread in its entirety, but few weeks back that
> > Yonik's proposal got implemented, it seems ;)
> > http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>
> In that post I see a reference to your plugin BitSetQParserPlugin, right ?
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
>
> I understood it as follows:
> 1. query the core and get ALL search results,
>    search results == (id1, id2, id7 .. id28263) // a long arrays of Unique IDs
> 2. Generate a bitset from this array of IDs
> 3. search a core using a bitsetfilter
>
> Correct?

Yes, the BitSetQParserPlugin does the 3rd step. The unit test may explain it
better:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

> I was thinking that pseudo-joins can help exactly with this situation
> (actually didn't even tried yet pseudo-joins, still watching the mail list).
> i.e. to make the first step efficient and at the same time perform a second
> query without to send a lot of data to the client and then receiving this
> data back.
>
> I have a feeling that such a situation: a list of Unique IDs from query1
> participates in filter in query2
> happens frequently, and would be very useful if SOLR has an optimized
> approach to handle it.

Hmm, that would transform the pseudo-join into a real JOIN, as in the SQL world.

> I think I'll just test to see the performance of pseudo-joins with large
> datasets (was waiting to find the perfect solution).

I'd be very curious; if you do some experiments, please let us know.

Thanks,

roman

> Thanks for all the ideas/links, now I have a better view of the situation.
>
> Regards.
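
For illustration only, step 2 above (turning an array of IDs into a bitset) could be sketched roughly as follows. This assumes the unique IDs are already small non-negative integers, or have been mapped to such. The function names, the bit order, and the base64 transport encoding are assumptions made for this sketch, not the wire format BitSetQParserPlugin expects; the unit test linked above is the place to check that.

```python
import base64


def ids_to_bitset(doc_ids, size):
    """Pack integer IDs into a bitset of `size` bits (bit i set <=> ID i present)."""
    bits = bytearray((size + 7) // 8)
    for doc_id in doc_ids:
        if not 0 <= doc_id < size:
            raise ValueError("ID %d does not fit in a bitset of %d bits" % (doc_id, size))
        # Assumed layout: lowest bit of each byte holds the lowest ID of that byte.
        bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bytes(bits)


def bitset_to_ids(bits):
    """Inverse of ids_to_bitset: list the positions of all set bits, ascending."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if (byte >> bit) & 1
    ]


def encode_for_transport(bits):
    """Base64-encode the raw bitset so it can travel as a request parameter."""
    return base64.b64encode(bits).decode("ascii")


if __name__ == "__main__":
    bitset = ids_to_bitset([1, 2, 7, 28263], size=28264)
    print(len(bitset), "bytes for", len(bitset_to_ids(bitset)), "IDs")
    print(encode_for_transport(bitset)[:16], "...")
```

A fixed-size bitset costs one bit per possible ID regardless of how many IDs are set, so it is attractive when the ID list is long and the ID space is dense.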
>
> On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Roman:
> >
> > I think that SOLR-1913 is completely different. It's
> > about having a field in a document and being able
> > to do bitwise operations on it. So say I have a
> > field in a Solr doc with the value 6 in it. I can then
> > form a query like
> > {!bitwise field=myfield op=AND source=2}
> > and it would match.
> >
> > You're talking about a much different operation as I
> > understand it.
> >
> > In which case, go ahead and open up a JIRA, there's
> > no harm in it.
> >
> > Best
> > Erick
> >
> > On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> > > Erick,
> > >
> > > I wasn't sure this issue is important, so I wanted first solicit some
> > > feedback. You and Otis expressed interest, and I could create the JIRA -
> > > however, as Alexandre, points out, the SOLR-1913 seems similar (actually,
> > > closer to the Otis request to have the elasticsearch named filter) but the
> > > SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering
> > > whether this new feature (somewhat overlapping, but still different from
> > > SOLR-1913) is something people would really want and the effort on the JIRA
> > > is well spent. What's your view?
> > >
> > > Thanks,
> > >
> > > roman
> > >
> > > On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >
> > >> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
> > >>
> > >> Regards,
> > >> Alex.
> > >>
> > >> Personal website: http://www.outerthoughts.com/
> > >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > >> - Time is the quality of nature that keeps events from happening all at
> > >> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
> > >>
> > >> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >>
> > >> > Roman:
> > >> >
> > >> > Did this ever make into a JIRA? Somehow I missed it if it did, and this
> > >> > would be pretty cool....
> > >> >
> > >> > Erick
> > >> >
> > >> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> > >> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com> wrote:
> > >> > >
> > >> > >> Hello Erick,
> > >> > >>
> > >> > >> > Join performance is most sensitive to the number of values
> > >> > >> > in the field being joined on. So if you have lots and lots of
> > >> > >> > distinct values in the corpus, join performance will be affected.
> > >> > >> Yep, we have a list of unique Id's that we get by first searching for
> > >> > >> records where loggedInUser IS IN (userIDs)
> > >> > >> This corpus is stored in memory I suppose? (not a problem) and then the
> > >> > >> bottleneck is to match this huge set with the core where I'm searching?
> > >> > >>
> > >> > >> Somewhere in maillist archive people were talking about "external list of
> > >> > >> Solr unique IDs"
> > >> > >> but didn't find if there is a solution.
> > >> > >> Back in 2010 Yonik posted a comment:
> > >> > >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
> > >> > >>
> > >> > >
> > >> > > sorry, haven't the previous thread in its entirety, but few weeks back that
> > >> > > Yonik's proposal got implemented, it seems ;)
> > >> > >
> > >> > > http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
> > >> > >
> > >> > > You could use this to send very large bitset filter (which can be
> > >> > > translated into any integers, if you can come up with a mapping function).
> > >> > >
> > >> > > roman
> > >> > >
> > >> > >>
> > >> > >> > bq: I suppose the delete/reindex approach will not change soon
> > >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> > >> > >> Ah, ok, I was feeling it affects the architecture, ok, now the only hope is
> > >> > >> Pseudo-Joins ))
> > >> > >>
> > >> > >> > One way to deal with this is to implement a "post filter", sometimes
> > >> > >> > called a "no cache" filter.
> > >> > >> thanks, will have a look, but as you describe it, it's not the best option.
> > >> > >>
> > >> > >> The approach
> > >> > >> "too many documents, man. Please refine your query. Partial results below"
> > >> > >> means faceting will not work correctly?
> > >> > >>
> > >> > >> ... I have in mind a hybrid approach, comments welcome:
> > >> > >> Most of the time users are not searching, but browsing content, so our
> > >> > >> "virtual filesystem" stored in SOLR will use only the index with the Id of
> > >> > >> the file and the list of users that have access to it. i.e. not touching
> > >> > >> the fulltext index at all.
> > >> > >>
> > >> > >> Files may have metadata (EXIF info for images for ex) that we'd like to
> > >> > >> filter by, calculate facets.
> > >> > >> Meta will be stored in both indexes.
> > >> > >>
> > >> > >> In case of a fulltext query:
> > >> > >> 1. search FT index (the fulltext index), get only the number of search
> > >> > >> results, let it be Rf
> > >> > >> 2. search DAC index (the index with permissions), get number of search
> > >> > >> results, let it be Rd
> > >> > >>
> > >> > >> let maxR be the maximum size of the corpus for the pseudo-join.
> > >> > >> *That was actually my question: what is a reasonable number? 10, 100, 1000 ?*
> > >> > >>
> > >> > >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto the
> > >> > >> second one.
> > >> > >> this happens when (only a few documents contains the search query) OR (user
> > >> > >> has access to a small number of files).
> > >> > >>
> > >> > >> In case none of these happens, we can use the
> > >> > >> "too many documents, man. Please refine your query. Partial results below"
> > >> > >> but first searching the FT index, because we want relevant results first.
> > >> > >>
> > >> > >> What do you think?
> > >> > >>
> > >> > >> Regards,
> > >> > >> Oleg
> > >> > >>
> > >> > >> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >> > >>
> > >> > >> > Join performance is most sensitive to the number of values
> > >> > >> > in the field being joined on. So if you have lots and lots of
> > >> > >> > distinct values in the corpus, join performance will be affected.
> > >> > >> >
> > >> > >> > bq: I suppose the delete/reindex approach will not change soon
> > >> > >> >
> > >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> > >> > >> > on actually doing something about this, but it's been "under consideration"
> > >> > >> > for at least 3 years so your guess is as good as mine.
> > >> > >> >
> > >> > >> > bq: notice that the worst situation is when everyone has access to all the
> > >> > >> > files, it means the first filter will be the full index.
> > >> > >> >
> > >> > >> > One way to deal with this is to implement a "post filter", sometimes called
> > >> > >> > a "no cache" filter. The distinction here is that
> > >> > >> > 1> it is not cached (duh!)
> > >> > >> > 2> it is only called for documents that have made it through all the
> > >> > >> >    other "lower cost" filters (and the main query of course).
> > >> > >> > 3> "lower cost" means the filter is either a standard, cached filters
> > >> > >> >    and any "no cache" filters with a cost (explicitly stated in the query)
> > >> > >> >    lower than this one's.
> > >> > >> >
> > >> > >> > Critically, and unlike "normal" filter queries, the result set is NOT
> > >> > >> > calculated for all documents ahead of time....
> > >> > >> >
> > >> > >> > You _still_ have to deal with the sysadmin doing a *:* query as you
> > >> > >> > are well aware. But one can mitigate that by having the post-filter
> > >> > >> > fail all documents after some arbitrary N, and display a message in the
> > >> > >> > app like "too many documents, man. Please refine your query. Partial
> > >> > >> > results below". Of course this may not be acceptable, but....
> > >> > >> >
> > >> > >> > HTH
> > >> > >> > Erick
> > >> > >> >
> > >> > >> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> > >> > >> > > Take a look at LucidWorks Search and its access control:
> > >> > >> > > http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> > >> > >> > >
> > >> > >> > > Role-based security is an easier nut to crack.
> > >> > >> > >
> > >> > >> > > Karl Wright of ManifoldCF had a Solr patch for document access control at
> > >> > >> > > one point:
> > >> > >> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> > >> > >> > > security at search time
> > >> > >> > > https://issues.apache.org/jira/browse/SOLR-1895
> > >> > >> > >
> > >> > >> > > http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> > >> > >> > >
> > >> > >> > > For some other thoughts:
> > >> > >> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> > >> > >> > >
> > >> > >> > > I'm not sure if external file fields will be of any value in this
> > >> > >> > > situation.
> > >> > >> > >
> > >> > >> > > There is also a proposal for bitwise operations:
> > >> > >> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> > >> > >> > > Bitwise Operations on Integer Fields
> > >> > >> > > https://issues.apache.org/jira/browse/SOLR-1913
> > >> > >> > >
> > >> > >> > > But the bottom line is that clearly updating all documents in the index
> > >> > >> > > is a non-starter.
> > >> > >> > >
> > >> > >> > > -- Jack Krupansky
> > >> > >> > >
> > >> > >> > > -----Original Message-----
> > >> > >> > > From: Oleg Burlaca
> > >> > >> > > Sent: Sunday, July 14, 2013 11:02 AM
> > >> > >> > > To: solr-user@lucene.apache.org
> > >> > >> > > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
> > >> > >> > >
> > >> > >> > > Hello all,
> > >> > >> > >
> > >> > >> > > Situation:
> > >> > >> > > We have a collection of files in SOLR with ACL applied: each file has a
> > >> > >> > > multi-valued field that contains the list of userID's that can read it:
> > >> > >> > >
> > >> > >> > > here is sample data:
> > >> > >> > > Id | content   | userId
> > >> > >> > > 1  | text text | 4,5,6,2
> > >> > >> > > 2  | text text | 4,5,9
> > >> > >> > > 3  | text text | 4,2
> > >> > >> > >
> > >> > >> > > Problem:
> > >> > >> > > when ACL is changed for a big folder, we compute the ACL for all child
> > >> > >> > > items and reindex in SOLR using atomic updates (updating only 'userIds'
> > >> > >> > > column), but because it deletes/reindexes the record, the performance is
> > >> > >> > > very poor.
> > >> > >> > >
> > >> > >> > > Question:
> > >> > >> > > I suppose the delete/reindex approach will not change soon (probably it's
> > >> > >> > > due to actual SOLR architecture), ?
> > >> > >> > >
> > >> > >> > > Possible solution: assuming atomic updates will be super fast on an index
> > >> > >> > > without fulltext, keep a separate ACLIndex and FullTextIndex and use
> > >> > >> > > Pseudo-Joins:
> > >> > >> > >
> > >> > >> > > Example: searching 'foo' as user '999'
> > >> > >> > > /solr/FullTextIndex/select/?q=foo&fq{!join fromIndex=ACLIndex from=Id to=Id }userId:999
> > >> > >> > >
> > >> > >> > > Question: what about performance here? what if the index is 100,000
> > >> > >> > > records?
> > >> > >> > > notice that the worst situation is when everyone has access to all the
> > >> > >> > > files, it means the first filter will be the full index.
> > >> > >> > >
> > >> > >> > > Would be happy to get any links that deal with the issue of Pseudo-join
> > >> > >> > > performance for large datasets (i.e. initial filtered set of IDs).
> > >> > >> > >
> > >> > >> > > Regards,
> > >> > >> > > Oleg
> > >> > >> > >
> > >> > >> > > P.S. we found that having the list of all users that have access for each
> > >> > >> > > record is better overall, because there are much more read requests (people
> > >> > >> > > accessing the library) then write requests (a new user is added/removed).
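
For reference, the hybrid decision quoted above (Rf, Rd, maxR) and the cross-core join filter from the original post could be sketched like this. It is only a rough illustration: the function names and the plan labels are made up for the sketch, the core and field names (ACLIndex, Id, userId) are taken from the example in the original post, and a sensible value for maxR is exactly the open question in the thread, so the number used below is a placeholder.

```python
from urllib.parse import urlencode


def choose_join_plan(rf, rd, max_r):
    """Pick a strategy from the hit counts of the two indexes.

    rf    -- number of hits in the fulltext (FT) index
    rd    -- number of hits in the permissions (DAC/ACL) index
    max_r -- largest "from" side considered acceptable for a pseudo-join
    """
    if rf < max_r or rd < max_r:
        # Join from whichever side is smaller onto the other one.
        return "join_from_fulltext" if rf <= rd else "join_from_acl"
    # Neither side is small enough: search the FT index first and cut off.
    return "partial_results"


def acl_join_params(text_query, user_id):
    """Request parameters for a fulltext search filtered by a join on the ACL core."""
    return {
        "q": text_query,
        "fq": "{!join fromIndex=ACLIndex from=Id to=Id}userId:%d" % user_id,
    }


if __name__ == "__main__":
    # max_r=1000 is a placeholder, not a recommendation.
    print(choose_join_plan(rf=120, rd=50000, max_r=1000))
    print("/solr/FullTextIndex/select/?" + urlencode(acl_join_params("foo", 999)))
```

Building the parameters with urlencode also takes care of escaping the braces, spaces and colon in the join filter, which is easy to get wrong when the URL is assembled by hand.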