In both cases, for better performance, I would first load just the IDs, and then load each document as it comes up during processing. As for the incremental requirement, it should not be difficult to write a hash function that maps a non-numerical ID to a value; see the sketch after the quoted message below.

On Jul 27, 2013 7:03 AM, "Joe Zhang" <smartag...@gmail.com> wrote:
> Dear list:
>
> I have an ever-growing solr repository, and I need to process every single
> document to extract statistics. What would be a reasonable process that
> satisfies the following properties:
>
> - Exhaustive: I have to traverse every single document
> - Incremental: in other words, it has to allow me to divide and conquer ---
>   if I have processed the first 20k docs, next time I can start with 20001.
>
> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> fact, given that the processing will take very long, and the repository
> keeps growing, it is not even clear that exhaustiveness is achieved.
>
> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> yet. But I guess the same issues would still hold even in a solr cloud
> environment, right, say within each shard?
>
> Any help would be greatly appreciated.
>
> Joe
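
For concreteness, here is a minimal sketch of that approach (Python 3, standard library only). It assumes Solr is reachable at http://localhost:8983/solr, that the schema's unique key field is named "id", and that IDs contain no characters needing query escaping; adjust all of that for your setup. It walks the index in ID order by re-issuing a range query from the last ID seen, which keeps the traversal restartable and avoids the cost of deep paging with large start offsets:

import hashlib
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/select"  # assumed single-core URL
PAGE = 1000  # IDs fetched per request

def iter_ids(last_id=None):
    """Yield every document ID in sorted order, resuming after last_id."""
    while True:
        # Exclusive lower bound: everything strictly after the checkpoint.
        q = "id:{%s TO *]" % last_id if last_id else "*:*"
        params = urllib.parse.urlencode({
            "q": q, "fl": "id", "sort": "id asc",
            "rows": PAGE, "wt": "json"})
        with urllib.request.urlopen(SOLR + "?" + params) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            return
        for doc in docs:
            yield doc["id"]
        last_id = docs[-1]["id"]

def fetch_doc(doc_id):
    """Load one full document by its ID, only when it is processed."""
    params = urllib.parse.urlencode({
        "q": 'id:"%s"' % doc_id, "rows": 1, "wt": "json"})
    with urllib.request.urlopen(SOLR + "?" + params) as resp:
        return json.load(resp)["response"]["docs"][0]

def id_to_number(doc_id):
    """Map a non-numeric ID to a stable integer, e.g. to bucket the
    work across several runs or workers: id_to_number(x) % n_workers."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16)

# Example: process everything after a saved checkpoint.
checkpoint = None  # or the last ID handled in the previous run
for doc_id in iter_ids(checkpoint):
    doc = fetch_doc(doc_id)
    # ... extract statistics from doc ...
    checkpoint = doc_id  # persist this somewhere durable

Persisting the checkpoint after each batch gives the divide-and-conquer property even while the index grows: documents whose IDs sort after the checkpoint are picked up as the walk continues, though documents added with IDs sorting before it are only caught by starting a fresh pass.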