Searching with access controls

2006-08-10 Thread Martyn Smith
I'm trying to index data in a system that implements some rather nasty
access controls on the data.

Basically, there are users, and communities, and users are members of
the communities. Potentially a user could be a member of hundreds or
even thousands of communities (there's no enforced upper limit).

Now I'm trying for a solution such that a user only gets documents that
are either "public" or belong to a community that they're a member of.

I figure there are two approaches (if there are other/better ones,
please let me know).

1) For each document in the index, I store userid in a multivalued
field. I simply store every single userid that IS allowed access to the
document. This has the advantage of the query being quite simple (e.g.
useraccess:MYUSERID) but I will have to store HEAPS of data, and
potentially have to do many more updates (as users join/leave
communities).

2) For each document in the index, store the community id that it
belongs to. The obvious advantage here is fewer updates and less
storage. HOWEVER, this means queries get bigger and bigger as users are
in more and more communities (e.g. communityid:(myCID1 OR myCID2 OR
myCID3)).
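
As a rough sketch of the two query shapes (Python purely for
illustration; the field names follow the examples in the two options
above, while the function names are made up and not part of Solr):

```python
# Sketch only: builds the query strings for the two options above.
# The helper names are hypothetical; nothing here talks to Solr.

def option1_query(user_id: str) -> str:
    """Option 1: each document lists every user id allowed to see it."""
    return f"useraccess:{user_id}"


def option2_query(community_ids: list[str]) -> str:
    """Option 2: each document carries its community id, and the query
    ORs together every community the user belongs to."""
    if not community_ids:
        raise ValueError("user belongs to no communities")
    return "communityid:(" + " OR ".join(community_ids) + ")"


print(option1_query("MYUSERID"))
print(option2_query(["myCID1", "myCID2", "myCID3"]))
```

The option 1 query stays the same size no matter how many communities
exist, while the option 2 query grows by one clause per membership.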

Does anyone have any thoughts on this? Are there blindingly obvious
options I'm missing that would take all this complication away? What
performance implications does each of these methods have?

Many thanks in advance for any comments or helpful suggestions :)


--
Martyn





Re: Searching with access controls

2006-08-10 Thread Martyn Smith
I was just reading about the limit on boolean operators in a query (it
seems to default to 1024 in Solr).

Using option 2 would mean that a user can't be in any more than 1024
communities (assuming no other boolean logic in the query).
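
To make that arithmetic concrete, a tiny sketch, assuming each
community contributes exactly one clause to the query (the function
name and the other_clauses parameter are made up for illustration):

```python
# Sketch only: checks whether an option 2 query would fit under the
# boolean clause limit mentioned above (1024 by default).
MAX_BOOLEAN_CLAUSES = 1024


def fits_in_one_query(community_ids, other_clauses=0,
                      limit=MAX_BOOLEAN_CLAUSES):
    """True if one clause per community, plus any other boolean
    clauses in the query, stays within the limit."""
    return len(community_ids) + other_clauses <= limit


print(fits_in_one_query(["c"] * 1024))                   # exactly at the limit
print(fits_in_one_query(["c"] * 1024, other_clauses=1))  # one clause too many
```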

Potentially a huge number of communities (10,000+?). Each community
could easily have, say, 100 documents, and there are some other
"global" type documents too.

Say 500,000 - 1,000,000 documents?

What do you mean by "You could also store user documents in the
collection to avoid passing the security info" ?

I'm not really a Java programmer of any significance, but I work with
people who are, and I can bully them into helping out. (I'm a Perl guy
myself).

Thanks,



--
Martyn


On Thu, 2006-08-10 at 23:43 -0400, Yonik Seeley wrote:
> On 8/10/06, Martyn Smith <[EMAIL PROTECTED]> wrote:
> > I'm trying to index data in a system that implements some rather nasty
> > access controls on the data.
> >
> > Basically, there are users, and communities, and users are members of
> > the communities. Potentially a user could be a member of hundreds or
> > even thousands of communities (there's no enforced upper limit).
> 
> I think option 2 (storing the community id with the document) is the way
> to go.
> If it's not fast enough, custom query handlers and using filters may help.
> You could also store user documents in the collection to avoid passing
> the security info (this would definitely require a custom query
> handler).
> 
> What are the number of documents, and number of communities?
> 
> -Yonik
> 



Re: Searching with access controls

2006-08-10 Thread Martyn Smith
We're not really sure how big the userbase is going to get, but it could
become huge. I think initially we need to be able to cope with several
thousand users, and probably only several thousand communities.

I'll certainly have a look at "faceted browsing" :), and yeah, a query
handler that does that sounds quite useful.

I think I need to have a read on what "filters" actually are :)

Thanks though, it looks like I've got some more reading to do ...

--
Martyn


On Fri, 2006-08-11 at 00:07 -0400, Yonik Seeley wrote:
> On 8/10/06, Martyn Smith <[EMAIL PROTECTED]> wrote:
> > I was just reading about the limit on boolean operators in a query (it
> > seems to default to 1024 in Solr).
> >
> > Using option 2 would mean that a user can't be in any more than 1024
> > communities (assuming no other boolean logic in the query).
> >
> > Potentially a huge number of communities (10,000+ ?). Each community
> > could easily have say 100 documents each, and there's some other
> > "global" type documents too.
> >
> > Say 500,000 - 1,000,000 documents?
> 
> How many users for this system?
> 
> > What do you mean by "You could also store user documents in the
> > collection to avoid passing the security info" ?
> 
> Store a document of type "user" that contains the communities they belong to.
> Create a custom query handler that takes a base query in addition to
> the user id.
> Get the user document, get a filter for each community they belong to
> from the filter cache, union them all, and then do a filtered query.
> 
> If the number of users is low, you could cache the resulting filter
> from unioning all the communities.  If the number of users is high
> compared to the number of communities, cache the community filters
> instead.
> 
> Search the archives for faceted browsing... many of the techniques may
> be applicable.
> 
> -Yonik
> 



Re: Faceted Searching Presentation @ ApacheCon US

2006-08-15 Thread Martyn Smith
Will this be available on-line anywhere after your presentation?

I'd be very interested to see it :)

--
Martyn

On Tue, 2006-08-15 at 18:13 -0700, Chris Hostetter wrote:
> I'm stoked to announce that I'll be presenting at this year's ApacheCon US,
> in Austin, Texas on October 13th.
> 
> I'll be discussing how CNET uses Solr to power our Faceted searching
> pages, showing some examples of how you can use the Solr RequestHandler
> API to implement very customized Faceted searching plugins, and
> (hopefully) demonstrating the new general purpose Faceted searching
> functionality in the Standard and DisMax request handlers (assuming I have
> time to write it)
> 
> More info can be found at the ApacheCon website...
> http://www.us.apachecon.com/html/sessions.html#FR26
> 
> 
> -Hoss
> 




Re: Faceted Searching Presentation @ ApacheCon US

2006-08-15 Thread Martyn Smith
I can't make it to Texas very easily :(


On Tue, 2006-08-15 at 22:03 -0700, Chris Hostetter wrote:
> : Will this be available on-line anywhere after your presentation?
> :
> : I'd be very interested to see it :)
> 
> The slides, or the code?
> 
> If I have time to write the code, it will be in Subversion.
> 
> As for the slides, I think so -- but I can't make any promises; besides:
> 
>   1) I'm a very animated speaker ... my slides typically don't contain
>  most of the juicy stuff I talk about.
>   2) If I say yes, then what's your incentive to come to the conference? :)
> 
> 
> 
> -Hoss
> 
>