Hello all,

We are thinking about using Solr Field Collapsing on a rather large scale
and wonder if anyone has experience with performance when doing Field
Collapsing on millions or billions of documents (details below).  Are
there performance issues with grouping large result sets?

Details:
We have a collection of the full text of 10 million books/journals.  This
is spread across 12 shards with each shard holding about 800,000
documents.  When a query matches a journal article, we would like to group
all the matching articles from the same journal together (there is a
unique id field identifying the journal).  Similarly, when there is a match
in multiple copies of the same book, we would like to group all results for
the same book together (again, we have a unique id field we can group on).
Sometimes a short query against the OCR field will result in over one
million hits.  Are there known performance issues when field collapsing
result sets containing a million hits?
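
For concreteness, the kind of grouped query we have in mind is roughly the
following, using the standard result grouping parameters (journal_id here
is just a stand-in for our actual unique id field):

  .../select?q=<user query against the OCR field>
     &group=true
     &group.field=journal_id
     &group.limit=5
     &group.ngroups=true

We assume group.ngroups (counting the distinct journals matched) is where
much of the cost would show up on a million-hit result set, but please
correct us if that is the wrong place to worry.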

We currently index the entire book as one Solr document.  We would like to
investigate the feasibility of indexing each page as a Solr document with a
field indicating the book id.  We could then offer our users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages.  We have approximately 3 billion pages.   Does anyone
have experience using field collapsing on this sort of scale?
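
If we went the page-per-document route, the "books containing the most
relevant pages" view would presumably be a grouped query along these lines
(book_id is a placeholder for the id field we would add to each page
document):

  .../select?q=<user query against the page OCR field>
     &group=true
     &group.field=book_id
     &group.limit=1

i.e. collapse to the single best-scoring page per book, while the ungrouped
form of the same query would give the "most relevant pages" view.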

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search
