Roman:

I think that SOLR-1913 is completely different. It's about having a field in a document and being able to do bitwise operations on it. So say I have a field in a Solr doc with the value 6 in it. I can then form a query like {!bitwise field=myfield op=AND source=2} and it would match.
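To make the bitwise example above concrete, here is a minimal Python sketch of the check for op=AND. The function name is hypothetical, and the match rule (every bit set in the source must also be set in the field value) is an assumption; the actual SOLR-1913 plugin may define a match differently. For the values in the example (field 6, source 2), this rule and a simple "result is non-zero" rule agree.

```python
def bitwise_and_matches(field_value: int, source: int) -> bool:
    """Return True when every bit set in `source` is also set in `field_value`.

    Assumption: a document matches an AND query when (field & source) == source.
    The real SOLR-1913 plugin may use a different criterion.
    """
    return (field_value & source) == source


# The example from the message: field value 6 (binary 110), source 2 (binary 010).
print(bitwise_and_matches(6, 2))
# A field value of 4 (binary 100) does not have the source bit set.
print(bitwise_and_matches(4, 2))
```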
You're talking about a much different operation as I understand it.

In which case, go ahead and open up a JIRA, there's no harm in it.

Best
Erick

On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> Erick,
>
> I wasn't sure this issue is important, so I wanted first to solicit some
> feedback. You and Otis expressed interest, and I could create the JIRA -
> however, as Alexandre points out, the SOLR-1913 seems similar (actually,
> closer to the Otis request to have the elasticsearch named filter) but the
> SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering
> whether this new feature (somewhat overlapping, but still different from
> SOLR-1913) is something people would really want and the effort on the JIRA
> is well spent. What's your view?
>
> Thanks,
>
> roman
>
>
> On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>
>> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
>>
>> Regards,
>> Alex.
>>
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>
>>
>> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> > Roman:
>> >
>> > Did this ever make it into a JIRA? Somehow I missed it if it did, and this
>> > would be pretty cool....
>> >
>> > Erick
>> >
>> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com> wrote:
>> > >
>> > >> Hello Erick,
>> > >>
>> > >> > Join performance is most sensitive to the number of values
>> > >> > in the field being joined on. So if you have lots and lots of
>> > >> > distinct values in the corpus, join performance will be affected.
>> > >> Yep, we have a list of unique Id's that we get by first searching for
>> > >> records where loggedInUser IS IN (userIDs)
>> > >> This corpus is stored in memory I suppose? (not a problem) and then the
>> > >> bottleneck is to match this huge set with the core where I'm searching?
>> > >>
>> > >> Somewhere in maillist archive people were talking about "external list of
>> > >> Solr unique IDs"
>> > >> but didn't find if there is a solution.
>> > >> Back in 2010 Yonik posted a comment:
>> > >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
>> > >>
>> > >
>> > > sorry, haven't read the previous thread in its entirety, but a few weeks
>> > > back Yonik's proposal got implemented, it seems ;)
>> > >
>> > > http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>> > >
>> > > You could use this to send a very large bitset filter (which can be
>> > > translated into any integers, if you can come up with a mapping function).
>> > >
>> > > roman
>> > >
>> > >
>> > >>
>> > >> > bq: I suppose the delete/reindex approach will not change soon
>> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
>> > >> Ah, ok, I was feeling it affects the architecture, ok, now the only hope is
>> > >> Pseudo-Joins ))
>> > >>
>> > >> > One way to deal with this is to implement a "post filter", sometimes called
>> > >> > a "no cache" filter.
>> > >> thanks, will have a look, but as you describe it, it's not the best option.
>> > >>
>> > >> The approach
>> > >> "too many documents, man. Please refine your query. Partial results below"
>> > >> means faceting will not work correctly?
>> > >>
>> > >> ...
>> > >> I have in mind a hybrid approach, comments welcome:
>> > >> Most of the time users are not searching, but browsing content, so our
>> > >> "virtual filesystem" stored in SOLR will use only the index with the Id of
>> > >> the file and the list of users that have access to it. i.e. not touching
>> > >> the fulltext index at all.
>> > >>
>> > >> Files may have metadata (EXIF info for images for ex) that we'd like to
>> > >> filter by, calculate facets.
>> > >> Meta will be stored in both indexes.
>> > >>
>> > >> In case of a fulltext query:
>> > >> 1. search FT index (the fulltext index), get only the number of search
>> > >>    results, let it be Rf
>> > >> 2. search DAC index (the index with permissions), get number of search
>> > >>    results, let it be Rd
>> > >>
>> > >> let maxR be the maximum size of the corpus for the pseudo-join.
>> > >> *That was actually my question: what is a reasonable number? 10, 100, 1000 ?*
>> > >>
>> > >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto the
>> > >> second one.
>> > >> this happens when (only a few documents contains the search query) OR (user
>> > >> has access to a small number of files).
>> > >>
>> > >> In case none of these happens, we can use the
>> > >> "too many documents, man. Please refine your query. Partial results below"
>> > >> but first searching the FT index, because we want relevant results first.
>> > >>
>> > >> What do you think?
>> > >>
>> > >> Regards,
>> > >> Oleg
>> > >>
>> > >>
>> > >> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> > >>
>> > >> > Join performance is most sensitive to the number of values
>> > >> > in the field being joined on. So if you have lots and lots of
>> > >> > distinct values in the corpus, join performance will be affected.
>> > >> >
>> > >> > bq: I suppose the delete/reindex approach will not change soon
>> > >> >
>> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
>> > >> > on actually doing something about this, but it's been "under consideration"
>> > >> > for at least 3 years so your guess is as good as mine.
>> > >> >
>> > >> > bq: notice that the worst situation is when everyone has access to all the
>> > >> > files, it means the first filter will be the full index.
>> > >> >
>> > >> > One way to deal with this is to implement a "post filter", sometimes called
>> > >> > a "no cache" filter. The distinction here is that
>> > >> > 1> it is not cached (duh!)
>> > >> > 2> it is only called for documents that have made it through all the
>> > >> >    other "lower cost" filters (and the main query of course).
>> > >> > 3> "lower cost" means either a standard, cached filter or any
>> > >> >    "no cache" filter with a cost (explicitly stated in the query)
>> > >> >    lower than this one's.
>> > >> >
>> > >> > Critically, and unlike "normal" filter queries, the result set is NOT
>> > >> > calculated for all documents ahead of time....
>> > >> >
>> > >> > You _still_ have to deal with the sysadmin doing a *:* query as you
>> > >> > are well aware. But one can mitigate that by having the post-filter
>> > >> > fail all documents after some arbitrary N, and display a message in the
>> > >> > app like "too many documents, man. Please refine your query. Partial
>> > >> > results below". Of course this may not be acceptable, but....
>> > >> >
>> > >> > HTH
>> > >> > Erick
>> > >> >
>> > >> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
>> > >> > <j...@basetechnology.com> wrote:
>> > >> > > Take a look at LucidWorks Search and its access control:
>> > >> > >
>> > >> > > http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
>> > >> > >
>> > >> > > Role-based security is an easier nut to crack.
>> > >> > >
>> > >> > > Karl Wright of ManifoldCF had a Solr patch for document access control at
>> > >> > > one point:
>> > >> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
>> > >> > > security at search time
>> > >> > > https://issues.apache.org/jira/browse/SOLR-1895
>> > >> > >
>> > >> > > http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>> > >> > >
>> > >> > > For some other thoughts:
>> > >> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>> > >> > >
>> > >> > > I'm not sure if external file fields will be of any value in this situation.
>> > >> > >
>> > >> > > There is also a proposal for bitwise operations:
>> > >> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
>> > >> > > Bitwise Operations on Integer Fields
>> > >> > > https://issues.apache.org/jira/browse/SOLR-1913
>> > >> > >
>> > >> > > But the bottom line is that clearly updating all documents in the index is a
>> > >> > > non-starter.
>> > >> > >
>> > >> > > -- Jack Krupansky
>> > >> > >
>> > >> > > -----Original Message----- From: Oleg Burlaca
>> > >> > > Sent: Sunday, July 14, 2013 11:02 AM
>> > >> > > To: solr-user@lucene.apache.org
>> > >> > > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
>> > >> > >
>> > >> > >
>> > >> > > Hello all,
>> > >> > >
>> > >> > > Situation:
>> > >> > > We have a collection of files in SOLR with ACL applied: each file has a
>> > >> > > multi-valued field that contains the list of userID's that can read it:
>> > >> > >
>> > >> > > here is sample data:
>> > >> > > Id | content   | userId
>> > >> > > 1  | text text | 4,5,6,2
>> > >> > > 2  | text text | 4,5,9
>> > >> > > 3  | text text | 4,2
>> > >> > >
>> > >> > > Problem:
>> > >> > > when ACL is changed for a big folder, we compute the ACL for all child
>> > >> > > items and reindex in SOLR using atomic updates (updating only the 'userId'
>> > >> > > column), but because it deletes/reindexes the record, the performance is
>> > >> > > very poor.
>> > >> > >
>> > >> > > Question:
>> > >> > > I suppose the delete/reindex approach will not change soon (probably it's
>> > >> > > due to actual SOLR architecture)?
>> > >> > >
>> > >> > > Possible solution: assuming atomic updates will be super fast on an index
>> > >> > > without fulltext, keep a separate ACLIndex and FullTextIndex and use
>> > >> > > Pseudo-Joins:
>> > >> > >
>> > >> > > Example: searching 'foo' as user '999'
>> > >> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
>> > >> > >
>> > >> > > Question: what about performance here? what if the index is 100,000
>> > >> > > records?
>> > >> > > notice that the worst situation is when everyone has access to all the
>> > >> > > files, it means the first filter will be the full index.
>> > >> > >
>> > >> > > Would be happy to get any links that deal with the issue of Pseudo-join
>> > >> > > performance for large datasets (i.e. initial filtered set of IDs).
>> > >> > >
>> > >> > > Regards,
>> > >> > > Oleg
>> > >> > >
>> > >> > > P.S. we found that having the list of all users that have access for each
>> > >> > > record is better overall, because there are many more read requests (people
>> > >> > > accessing the library) than write requests (a new user is added/removed).
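
For readers following the thread, the pseudo-join request and the hybrid plan quoted above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the plan labels are hypothetical, the core names (FullTextIndex, ACLIndex) and field names (Id, userId) are taken from the original post, and the threshold maxR is left to the caller because the thread never settles on a value. The sketch only builds the request URL and picks a plan; it does not contact Solr.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def choose_plan(rf: int, rd: int, max_r: int) -> str:
    """Pick a strategy from the two hit counts described in the hybrid approach.

    rf: number of hits in the fulltext index
    rd: number of hits in the permissions index
    max_r: largest set considered acceptable for a pseudo-join (open question in the thread)
    The returned labels are hypothetical names used only in this sketch.
    """
    if rf < max_r or rd < max_r:
        # Join from whichever side produced the smaller set.
        return "join_from_fulltext" if rf <= rd else "join_from_acl"
    # Neither side is small enough: fall back to the "partial results" behaviour.
    return "partial_results"


def build_acl_join_url(term: str, user_id: int) -> str:
    """Build the select URL from the original post.

    Searches the fulltext core and filters through the ACL core with a join.
    """
    params = {
        "q": term,
        "fq": "{!join fromIndex=ACLIndex from=Id to=Id}userId:%d" % user_id,
    }
    # urlencode percent-escapes the braces, '!' and spaces in the local params.
    return "/solr/FullTextIndex/select/?" + urlencode(params)


url = build_acl_join_url("foo", 999)
print(url)
# Decoding the query string recovers the original filter text.
print(parse_qs(urlsplit(url).query)["fq"][0])
print(choose_plan(rf=5, rd=10000, max_r=100))
```

Encoding the filter with urlencode keeps the local-params syntax intact when the URL is sent over HTTP, which is easy to get wrong when the query is assembled by hand.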