How do you distribute the documents among the shards? Can you set up the shards so that each "collapse group" lives entirely on a single shard? That way you would never have to do distributed grouping.
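To make the idea concrete, here is a minimal sketch of routing at index time by the grouping key, so that every document in a group ends up on the same shard. It is not Solr code: the function names, the `book_id` field name, and the choice of CRC32 as the hash are assumptions for illustration only. The shard count of 12 comes from the quoted message.

```python
import zlib

NUM_SHARDS = 12  # shard count taken from the quoted message


def shard_for_group(group_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a group key (e.g. a journal or book id) to a shard index.

    CRC32 is used because it is stable across processes; Python's built-in
    hash() for strings is randomized per process and would not be.
    """
    return zlib.crc32(group_id.encode("utf-8")) % num_shards


def route(docs, group_field: str = "book_id", num_shards: int = NUM_SHARDS):
    """Assign each document to a shard based only on its group key.

    `group_field` is a hypothetical field name; use whatever id field
    you intend to collapse on.
    """
    shards = {i: [] for i in range(num_shards)}
    for doc in docs:
        shards[shard_for_group(doc[group_field], num_shards)].append(doc)
    return shards


if __name__ == "__main__":
    # Three pages from two books: pages of the same book share a shard.
    pages = [
        {"id": "p1", "book_id": "book-A"},
        {"id": "p2", "book_id": "book-A"},
        {"id": "p3", "book_id": "book-B"},
    ]
    for shard, docs in route(pages).items():
        if docs:
            print(shard, [d["id"] for d in docs])
```

With this kind of routing each shard can collapse its own groups locally, and the merge step only has to combine already-complete groups. The trade-off is that very large groups can make shard sizes uneven.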
On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
> This won't work, see my thread on Solr 3.6 Field collapsing
> Thanks,
> Tirthankar
>
> -----Original Message-----
> From: Tom Burton-West <tburt...@umich.edu>
> Date: Tue, 21 Aug 2012 18:39:25
> To: solr-user@lucene.apache.org<solr-user@lucene.apache.org>
> Reply-To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Cc: William Dueber<dueb...@umich.edu>; Phillip Farber<pfar...@umich.edu>
> Subject: Scalability of Solr Result Grouping/Field Collapsing:
> Millions/Billions of documents?
>
> Hello all,
>
> We are thinking about using Solr Field Collapsing on a rather large scale
> and wonder if anyone has experience with performance when doing Field
> Collapsing on millions or billions of documents (details below). Are
> there performance issues with grouping large result sets?
>
> Details:
> We have a collection of the full text of 10 million books/journals. This
> is spread across 12 shards with each shard holding about 800,000
> documents. When a query matches a journal article, we would like to group
> all the matching articles from the same journal together. (There is a
> unique id field identifying the journal.) Similarly, when there is a match
> in multiple copies of the same book, we would like to group all results for
> the same book together (again, we have a unique id field we can group on).
> Sometimes a short query against the OCR field will result in over one
> million hits. Are there known performance issues when field collapsing
> result sets containing a million hits?
>
> We currently index the entire book as one Solr document. We would like to
> investigate the feasibility of indexing each page as a Solr document with a
> field indicating the book id. We could then offer our users the choice of
> a list of the most relevant pages, or a list of the books containing the
> most relevant pages. We have approximately 3 billion pages. Does anyone
> have experience using field collapsing on this sort of scale?
>
> Tom
>
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Service
> University of Michigan Library
> http://www.hathitrust.org/blogs/large-scale-search
> ******************Legal Disclaimer***************************
> "This communication may contain confidential and privileged
> material for the sole use of the intended recipient. Any
> unauthorized review, use or distribution by others is strictly
> prohibited. If you have received the message in error, please
> advise the sender by reply email and delete the message. Thank
> you."
> *********************************************************

--
Lance Norskog
goks...@gmail.com