The document set I am talking about is much smaller: around 300K documents.
I have some predefined queries (around 10K in total) that I want to expose as facets on the result set of a given query, returning the top N (say 100) documents for each. My plan is to precompute the DocSet for each of these predefined queries and keep them in memory; then, for every incoming query, intersect the query's result set with each predefined DocSet and extract the top N (after sorting, of course).

Before I go down this path, I want to understand the memory usage of 10K DocSets. http://lucene.apache.org/solr/api/org/apache/solr/search/DocSet.html lists three implementations: BitDocSet (a conventional bitset), HashDocSet (for sparse sets), and DocSlice (not sure what that one is for). Most of my 10K DocSets would fall into the sparse category. A quick back-of-the-envelope: a bitset over 300K documents takes about 37 KB no matter how many bits are set, so 10K BitDocSets would be roughly 370 MB, whereas a HashDocSet holding only a handful of doc ids should be tiny.

I am curious which DocSet implementation is chosen when Solr builds a result DocSet. Does it automatically pick one based on the density of the set? For example, if more than about 1/8th of the bits are set, a BitDocSet might be fine, but for a set with only 10 documents out of a possible 300K, a HashDocSet seems better. Where in the Solr source code can I look to understand more about this? Thanks.
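For concreteness, here is a rough sketch of what I had in mind. The class and its wiring are hypothetical (mine, not Solr's); I'm assuming SolrIndexSearcher.getDocSet, DocSet.intersectionSize, and the filtered form of SolrIndexSearcher.getDocList behave the way the javadocs suggest:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.solr.search.DocList;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // Hypothetical helper class; name and structure are just for illustration.
    public class PrecomputedFacetSets {

        // predefined query name -> its cached DocSet (rebuilt for each new searcher)
        private final Map<String, DocSet> cache = new HashMap<String, DocSet>();

        // Run once per searcher (e.g. from a newSearcher listener) to precompute
        // the DocSet of every predefined query.
        public void warm(SolrIndexSearcher searcher, Map<String, Query> predefined)
                throws IOException {
            cache.clear();
            for (Map.Entry<String, Query> e : predefined.entrySet()) {
                cache.put(e.getKey(), searcher.getDocSet(e.getValue()));
            }
        }

        // Facet count = size of the intersection between the user query's DocSet
        // and one precomputed DocSet.
        public int count(SolrIndexSearcher searcher, Query userQuery, String facet)
                throws IOException {
            DocSet userDocs = searcher.getDocSet(userQuery);
            return userDocs.intersectionSize(cache.get(facet));
        }

        // Top-N docs of the user query restricted to one precomputed DocSet,
        // passing the precomputed set as a filter so sorting only happens
        // over the intersection.
        public DocList topN(SolrIndexSearcher searcher, Query userQuery, String facet,
                            Sort sort, int n) throws IOException {
            return searcher.getDocList(userQuery, cache.get(facet), sort, 0, n);
        }
    }

The idea is to use intersectionSize() only where I need counts, and the filtered getDocList() for the actual top-100 lists. Whether this is a sensible approach obviously depends on how much memory those 10K cached DocSets end up taking, hence the question.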