On 30-Apr-08, at 5:31 PM, Kevin Osborn wrote:

I have an index of about 3,000,000 products and about 8,500 customers. Each customer has access to between about 50 and about 500,000 of the products.

Our current method is to use a bitset in the filter. So, for each customer, there is a bitset in the cache. For each docId that they have access to, the bit is set. This is probably the best performance-wise for searches, but it consumes a lot of memory, especially because each document that they don't have access to also consumes space (a 0 bit). It is also probably the cause of our problems when either these customer access lists (stored in files) or the index is updated.
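
For concreteness, our cache looks roughly like this (class and method names here are just illustrative, using plain java.util.BitSet):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the per-customer access cache: one BitSet per customer,
// with one bit per document in the index (set = customer may see that doc).
public class AccessCache {
    private final int maxDoc;                        // total docs in the index
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();

    public AccessCache(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // Build and cache the bitset for one customer from its list of docIds.
    public void load(String customerId, int[] accessibleDocIds) {
        BitSet bits = new BitSet(maxDoc);            // allocates maxDoc bits, mostly zeros
        for (int docId : accessibleDocIds) {
            bits.set(docId);
        }
        cache.put(customerId, bits);
    }

    public BitSet filterFor(String customerId) {
        return cache.get(customerId);
    }
}

With ~3,000,000 docs each bitset is about 375 KB regardless of how few products the customer can see, so 8,500 of them would be on the order of 3 GB, which is where the memory goes.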

Is there a better way to manage access control? I was thinking of storing the user access list as a specific document type in the index. Basically, a single multi-value field. But I'm not quite sure where to go from here.
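
What I had in mind is roughly this (field names made up, plain Lucene API just to illustrate the multi-valued field):

import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch of an "access list" document: one per customer, with every
// accessible product id added to the same (multi-valued) field.
public class AccessListDoc {
    public static Document build(String customerId, List<String> productIds) {
        Document doc = new Document();
        doc.add(new Field("doc_type", "access_list",
                          Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("customer_id", customerId,
                          Field.Store.NO, Field.Index.UN_TOKENIZED));
        for (String productId : productIds) {
            // adding the same field name repeatedly makes it multi-valued;
            // stored so the list could be read back at query time
            doc.add(new Field("product_id", productId,
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        return doc;
    }
}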

The best way to go about this is to refactor the problem into the true constraints that exist. It is unlikely that ~2,125,000,000 customer-product pairs were manually created. Surely these resulted from groups with less fine-grained control. Could these groups be the filters you use?
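
For example, if each product carried a multi-valued "group" field and each customer belonged to a modest number of groups, the per-customer filter collapses to a filter query over those groups, and customers with the same groups share one cached filter. A rough sketch (field and class names are made up, and this assumes SolrJ):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

// Restrict a query to the groups a customer belongs to: a product is
// visible if it is in any of the customer's groups, hence the OR.
public class GroupFilter {
    public static SolrQuery restrict(SolrQuery query, List<String> groups) {
        StringBuilder fq = new StringBuilder("group:(");
        for (int i = 0; i < groups.size(); i++) {
            if (i > 0) fq.append(" OR ");
            fq.append(groups.get(i));
        }
        fq.append(")");
        query.addFilterQuery(fq.toString());
        return query;
    }
}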

Another option is to look for ways to transform the data based on its intrinsic characteristics. Even if there are no longer explicit control categories that you can leverage, you can look for groups of documents that many users share access to, or large groups of docs that few users have access to, and compose a single query's filter out of such groups. This is probably pretty hard. A simpler application of the idea is to look for a partitioning of the documents such that few users with access to one set also have access to the other. Put these in two separate solrs/cores. Assuming a perfect partitioning, that halves memory consumption.
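
A minimal sketch of the routing side, assuming you maintain a user-to-partition map (core names and URL are made up):

import java.util.Map;

// Route each customer to the core holding the partition they can access,
// so their filter bitset only has to cover that core's documents.
public class CoreRouter {
    private final Map<String, String> userToCore;   // e.g. "cust-42" -> "products-a"

    public CoreRouter(Map<String, String> userToCore) {
        this.userToCore = userToCore;
    }

    public String coreUrlFor(String customerId) {
        String core = userToCore.get(customerId);
        return "http://localhost:8983/solr/" + (core != null ? core : "products-a");
    }
}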

Also consider that currently, filters matching fewer than 3000 docs are stored as hash sets (whose size is proportional to the number of docs in the filter) rather than bitsets, and thus consume much less memory. That threshold is configurable (but don't push it too high).
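
To see the trade-off, compare a hash set of matching docIds (grows with the number of matches) against a bitset (always covers every doc in the index). This standalone example is only an illustration of the idea, not Solr's actual small-filter implementation:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class FilterSizes {
    public static void main(String[] args) {
        int maxDoc = 3000000;        // docs in the index
        int matches = 2000;          // docs in one small customer's filter

        Set<Integer> asHash = new HashSet<Integer>();
        for (int docId = 0; docId < matches; docId++) {
            asHash.add(docId);       // ~matches entries, regardless of maxDoc
        }

        BitSet asBits = new BitSet(maxDoc);
        for (int docId = 0; docId < matches; docId++) {
            asBits.set(docId);       // still maxDoc/8 bytes = ~375 KB of bits
        }
        System.out.println("hash entries: " + asHash.size()
                + ", bitset bytes: " + (asBits.size() / 8));
    }
}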

-Mike
