Thanks for your kind reply, Shawn.

On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 7/26/2013 11:02 PM, Joe Zhang wrote:
> > I have an ever-growing solr repository, and I need to process every
> > single document to extract statistics. What would be a reasonable
> > process that satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer ---
> >   if I have processed the first 20k docs, next time I can start with 20001.
>
> If your index isn't very big, a *:* query with rows and start parameters
> is perfectly acceptable. Performance is terrible for this method when
> the index gets huge, though.

==> Essentially we are doing pagination here, right? If performance is not
the concern, given that the index is dynamic, does the order of entries
remain stable over time?

> If "id" is your uniqueKey field, here's how you can do it. If that's
> not your uniqueKey field, substitute your uniqueKey field for id. This
> method doesn't work properly if you don't use a field with values that
> are guaranteed to be unique.
>
> For the first query, send a query with these parameters, where NNNNNN is
> the number of docs you want to retrieve at once:
> q=*:*&rows=NNNNNN&sort=id asc
>
> For each subsequent query, use the following parameters, where XXX is
> the highest id value seen in the previous query:
> q={XXX TO *}&rows=NNNNNN&sort=id asc

==> This approach seems to require that the id field is numerical, right? I
have a text-based id that is unique.

==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't the query
be matched against the default search field, which could be "content", for
example? How would that do the job?

> As soon as you see a numFound value less than NNNNNN, you will know that
> there's no more data.
>
> Generally speaking, you'd want to avoid updating the index while doing
> these queries. If you never replace existing documents and you can
> guarantee that the value in the uniqueKey field for new documents will
> always be higher than any previous value, then you could continue
> updating the index. A database autoincrement field would qualify for
> that condition.
>
> Thanks,
> Shawn
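
For concreteness, here is a minimal Python sketch of one reading of the procedure quoted above. Several things in it are assumptions rather than anything stated in the thread: the range is scoped to the uniqueKey field explicitly (id:{... TO *}) instead of the bare q={XXX TO *}, the bound is quoted so that text ids with spaces or punctuation survive query parsing, range order on the id field is assumed to agree with its sort order, and the loop stops when a page comes back with fewer than "rows" documents rather than by inspecting numFound. The names page_params, walk and make_http_fetch, and the select URL in the usage note, are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional


def page_params(last_id: Optional[str], rows: int, key: str = "id") -> Dict[str, str]:
    """Build the query parameters for one page of a uniqueKey-ordered walk."""
    if last_id is None:
        query = "*:*"  # first page: match everything
    else:
        # Assumption: scope the range to the key field and quote the bound.
        # Curly braces make the lower bound exclusive.
        escaped = str(last_id).replace("\\", "\\\\").replace('"', '\\"')
        query = f'{key}:{{"{escaped}" TO *}}'
    return {"q": query, "rows": str(rows), "sort": f"{key} asc"}


def walk(
    fetch: Callable[[Dict[str, str]], List[dict]],
    rows: int,
    key: str = "id",
    start_after: Optional[str] = None,
) -> Iterator[dict]:
    """Yield every document in key order, one page at a time.

    Pass the last key processed in an earlier run as start_after to resume.
    """
    last_id = start_after
    while True:
        docs = fetch(page_params(last_id, rows, key))
        yield from docs
        if len(docs) < rows:  # short page: nothing left to read
            return
        last_id = docs[-1][key]


def make_http_fetch(select_url: str) -> Callable[[Dict[str, str]], List[dict]]:
    """Return a fetch function that calls a Solr select handler over HTTP."""
    def fetch(params: Dict[str, str]) -> List[dict]:
        qs = urllib.parse.urlencode({**params, "wt": "json"})
        with urllib.request.urlopen(f"{select_url}?{qs}") as resp:
            return json.load(resp)["response"]["docs"]
    return fetch
```

Usage would look something like walk(make_http_fetch("http://localhost:8983/solr/select"), rows=1000), with the URL adjusted to the actual installation. Keeping the HTTP call behind a fetch callable means the paging logic can be exercised against an in-memory list before pointing it at a live index.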