Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

Lance Norskog Tue, 21 Aug 2012 23:41:40 -0700

How do you distribute the documents among the shards? Can you set up the
shards so that each "collapse group" lives on a single shard? Then you
would never have to do distributed grouping.
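
A minimal sketch of what I mean, in Python. It is only an illustration of
routing at index time, not anything Solr provides: the field name `book_id`,
the helper names, and the choice of MD5 as the hash are all assumptions, and
the shard count of 12 is taken from the message quoted below. The idea is to
pick the shard from the grouping key rather than from the document id, so
every page of a book (or every article of a journal) is sent to the same
shard.

```python
import hashlib

NUM_SHARDS = 12  # assumption: the 12 shards mentioned in the quoted message


def shard_for_group(group_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a grouping key (e.g. a book or journal id) to a shard index.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and so is unsuitable for routing.
    """
    digest = hashlib.md5(group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(docs, group_field="book_id", num_shards=NUM_SHARDS):
    """Split documents into per-shard batches keyed by shard index.

    `group_field` is a hypothetical field name holding the collapse key.
    Documents sharing that key always end up in the same batch.
    """
    batches = {i: [] for i in range(num_shards)}
    for doc in docs:
        batches[shard_for_group(doc[group_field], num_shards)].append(doc)
    return batches
```

Each batch would then be posted to its own shard by whatever indexing client
you already use. Note that changing the number of shards changes the mapping,
so a reshard means re-routing the documents.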


On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
<tchatter...@commvault.com> wrote:
> This won't work; see my thread on Solr 3.6 field collapsing.
> Thanks,
> Tirthankar
>
> -----Original Message-----
> From: Tom Burton-West <tburt...@umich.edu>
> Date: Tue, 21 Aug 2012 18:39:25
> To: solr-user@lucene.apache.org<solr-user@lucene.apache.org>
> Reply-To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Cc: William Dueber<dueb...@umich.edu>; Phillip Farber<pfar...@umich.edu>
> Subject: Scalability of Solr Result Grouping/Field Collapsing:
>  Millions/Billions of documents?
>
> Hello all,
>
> We are thinking about using Solr Field Collapsing on a rather large scale
> and wonder if anyone has experience with performance when doing Field
> Collapsing on millions or billions of documents (details below).  Are
> there performance issues with grouping large result sets?
>
> Details:
> We have a collection of the full text of 10 million books/journals.  This
> is spread across 12 shards with each shard holding about 800,000
> documents.  When a query matches a journal article, we would like to group
> all the matching articles from the same journal together. (There is a
> unique id field identifying the journal.)  Similarly, when there is a match
> in multiple copies of the same book we would like to group all results for
> the same book together (again we have a unique id field we can group on).
> Sometimes a short query against the OCR field will result in over one
> million hits.  Are there known performance issues when field collapsing
> result sets containing a million hits?
>
> We currently index the entire book as one Solr document.  We would like to
> investigate the feasibility of indexing each page as a Solr document with a
> field indicating the book id.  We could then offer our users the choice of
> a list of the most relevant pages, or a list of the books containing the
> most relevant pages.  We have approximately 3 billion pages.   Does anyone
> have experience using field collapsing on this sort of scale?
>
> Tom
>
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Service
> University of Michigan Library
> http://www.hathitrust.org/blogs/large-scale-search



-- 
Lance Norskog
goks...@gmail.com
