On 30-Apr-08, at 5:31 PM, Kevin Osborn wrote:
I have an index of about 3,000,000 products and about 8500
customers. Each customer has access to anywhere from about 50 to
about 500,000 of the products.
Our current method uses a bitset in the filter. Each customer has a
bitset in the cache, and for each docId they have access to, the
corresponding bit is set. This is probably the best approach
performance-wise for searches, but it consumes a lot of memory,
especially because every document a customer does not have access to
also consumes space (a 0). It is also probably the cause of our
problems when either the customer access lists (stored in files) or
the index is updated.
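Roughly, the per-customer filter looks something like this sketch
(simplified; the class and method names are just illustrative):

    import java.util.BitSet;

    public class CustomerAccessBits {
        // One bit per document in the index, whether the customer can see it or not.
        static BitSet buildAccessBits(int maxDoc, int[] allowedDocIds) {
            BitSet bits = new BitSet(maxDoc);
            for (int docId : allowedDocIds) {
                bits.set(docId);
            }
            return bits;
        }
        // Memory: ~3,000,000 bits / 8 = ~375 KB per customer,
        // times ~8500 customers = roughly 3 GB of cache just for access control.
    }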
Is there a better way to manage access control? I was thinking of
storing each user's access list as a specific document type in the
index, basically a single multi-valued field, but I'm not quite sure
where to go from here.
The best way to go about this is to refactor the problem into the true
constraints that exist. It is unlikely that the ~2,125,000,000
customer-product pairs (8500 customers times an average of roughly
250,000 products each) were manually created. Surely they resulted
from groups of less fine-grained control. Could those groups be the
filters you use?
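For example, if access is really driven by membership in a modest
number of product groups, you could index a group field on each
product and build each customer's filter out of group terms. A rough
sketch, assuming a hypothetical "group" field (not something from your
setup):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    public class GroupAccessFilter {
        // A customer's access filter becomes an OR over their group terms. Reuse one
        // Filter instance per distinct group set (e.g. via a map keyed by the sorted
        // group list) so customers with the same groups share a cached filter instead
        // of each holding a full per-customer bitset.
        static Filter forGroups(String[] customerGroups) {
            BooleanQuery groups = new BooleanQuery();
            for (String g : customerGroups) {
                groups.add(new TermQuery(new Term("group", g)), BooleanClause.Occur.SHOULD);
            }
            return new CachingWrapperFilter(new QueryWrapperFilter(groups));
        }
    }

In Solr terms this is just a filter query on that field (e.g.
group:(a OR b)), which the filterCache will reuse for every customer
with the same groups.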
Another option is to look for ways to transform the data based on its
intrinsic characteristics. Even if there are no longer explicit
control categories that you can leverage, you can look for groups of
documents that many users share access to, or large groups of docs
that only a few users have access to, and compose a single query's
filter out of those groups. This is probably pretty hard. A simpler
application of the idea is to look for a partitioning of the documents
such that few users with access to one set also have access to the
other. Put those sets in two separate Solr instances/cores. Assuming a
perfect partitioning, that halves filter memory consumption.
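To see why, a quick back-of-the-envelope calculation with the numbers
from your post split evenly:

    public class PartitionMath {
        public static void main(String[] args) {
            long customers = 8500;
            long wholeIndexDocs = 3000000L;  // today, one bitset spans the whole index
            long perCoreDocs = 1500000L;     // after a perfect 50/50 split, a customer's
                                             // bitset only spans the core they actually use
            long beforeBytes = customers * wholeIndexDocs / 8;
            long afterBytes = customers * perCoreDocs / 8;
            System.out.println("before: " + beforeBytes / (1024 * 1024) + " MB");
            System.out.println("after:  " + afterBytes / (1024 * 1024) + " MB");
        }
    }

That prints roughly 3000 MB before versus roughly 1500 MB after.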
Also consider that filters matching fewer than 3000 docs are currently
stored as hash sets (size proportional to the number of matching docs)
rather than bitsets, and thus consume less memory. That threshold is
configurable (but don't go too high).
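In Solr that threshold lives in solrconfig.xml; the stock example
config has an entry along these lines (check your version for the
exact element and defaults):

    <!-- inside the <query> section: doc sets smaller than maxSize are kept
         as hash sets rather than bitsets -->
    <HashDocSet maxSize="3000" loadFactor="0.75"/>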
-Mike