Hello all,

We are thinking about using Solr Field Collapsing on a rather large scale and wonder if anyone has experience with its performance when collapsing millions or billions of documents (details below). Are there known performance issues with grouping very large result sets?
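For context, the kind of grouped request we have in mind would look something like the sketch below, using Solr's result grouping parameters (the field names ocr and journal_id are just placeholders for our actual schema fields):

  q=ocr:"some phrase"
  &group=true
  &group.field=journal_id
  &group.limit=3
  &group.ngroups=true
  &rows=10

i.e., return the top 10 journals by relevance, show up to 3 matching articles per journal, and report the total number of distinct journals matched.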
Details:

We have a collection of the full text of 10 million books and journals, spread across 12 shards, with each shard holding about 800,000 documents. When a query matches journal articles, we would like to group all the matching articles from the same journal together (there is a unique id field identifying the journal). Similarly, when there are matches in multiple copies of the same book, we would like to group all the results for that book together (again, we have a unique id field to group on). A short query against the OCR field can sometimes return over one million hits. Are there known performance issues when field collapsing result sets containing a million hits?

We currently index each entire book as one Solr document. We would like to investigate the feasibility of indexing each page as a Solr document, with a field indicating the book id. We could then offer our users the choice of a list of the most relevant pages, or a list of the books containing the most relevant pages. We have approximately 3 billion pages. Does anyone have experience using field collapsing at this sort of scale?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search