Hi Erick,

Ah, yes, I guess you are correct in that I could just avoid using cursorMark this way... the only (smallish, I think) issue is that I would need to extract the last id from the CSV output. Oh, and I am using DataStax DSE, so the uniqueKey is a combination of two fields... but I think I can manage by using a field I know is unique, even if it's not the uniqueKey. Something like the sketch below is what I have in mind.
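A minimal, untested sketch of that loop in Python; the collection URL and the unique field name (my_unique_field) are placeholders for my real setup, and I'm assuming the id values are safe to drop straight into a range query:

import csv
import io
import urllib.parse
import urllib.request

# Untested sketch. SOLR_URL and UNIQ_FIELD are placeholders; UNIQ_FIELD is a
# field known to hold unique values, even though it is not the uniqueKey.
SOLR_URL = "http://localhost:8983/solr/oldcollection/select"
UNIQ_FIELD = "my_unique_field"
CHUNK = 1000000

def fetch_chunk(last_id=None):
    params = {
        "q": "*:*",
        "sort": UNIQ_FIELD + " asc",
        "start": "0",          # always 0; the fq does the paging
        "rows": str(CHUNK),
        "wt": "csv",
    }
    if last_id is not None:
        # Exclusive lower bound {... TO *] so the last doc of the previous
        # chunk is not exported twice. Assumes ids need no query escaping.
        params["fq"] = "{!cache=false}%s:{%s TO *]" % (UNIQ_FIELD, last_id)
    url = SOLR_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

last_id = None
chunk_no = 0
while True:
    body = fetch_chunk(last_id)
    rows = list(csv.reader(io.StringIO(body)))
    if len(rows) <= 1:                    # header line only: we are done
        break
    id_col = rows[0].index(UNIQ_FIELD)    # locate the unique field's column
    last_id = rows[-1][id_col]            # the "extract the last id" step
    chunk_no += 1
    with open("chunk_%05d.csv" % chunk_no, "w", encoding="utf-8") as f:
        f.write(body)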
thanks!
Xavier

On Tue, Jun 21, 2016 at 2:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> The CursorMark stuff has to deal with shards: what happens when more
> than one document on different shards has the same sort value, what
> if all the docs in the response packet have the same sort value, what
> happens when you want to return docs by score, and the like.
>
> For your case you can use a sort criterion that avoids all these issues
> and be OK. You can think of it as a specialized CursorMark.
>
> You should be able to just sort by <uniqueKey> and send each query
> through with a range filter query, so the first query would look
> something like this (assuming "id" is your <uniqueKey>):
>
> q=*:*&sort=id asc&start=0&rows=1000
>
> then the rest would be
>
> q=*:*&sort=id asc&fq={!cache=false}id:[last_id_returned_from_previous_query TO *]&start=0&rows=1000
>
> This avoids the "deep paging" problem that CursorMark solves more cheaply,
> because the <uniqueKey> guarantees that there is one and only one doc with
> that value. Note that the start parameter is always 0.
>
> Or your second query could even be just
>
> q=id:[last_id_returned_from_previous_query TO *]&sort=id asc&start=0&rows=1000
>
> Best,
> Erick
>
> On Mon, Jun 20, 2016 at 12:37 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> > Hi,
> >
> > I need to index into a new schema 800M docs that exist in an older Solr.
> > As all fields are stored, I thought I was very lucky, as I could:
> >
> > - use wt=csv
> > - combined with cursorMark
> >
> > to easily script out something that would export/index in chunks of 1M
> > docs or so. CSV output is very efficient for this sort of thing, I think.
> >
> > But, sadly, I found that there is no way to get the nextCursorMark after
> > the first request, as the CSV writer just outputs plain CSV data for the
> > fields, excluding all other info in the response!
> >
> > This is unfortunate, as csv/cursorMark seems like the perfect fit to
> > reindex this huge index (it's a one-time thing).
> >
> > Does anyone see a way to still be able to use this? I would prefer not
> > to have to write some Java code just to get the nextCursorMark.
> >
> > So far I have thought of:
> > - use json, but I would need to postprocess the returned json to remove
> > the response info etc. before reindexing; a pain.
> > - send two calls for each chunk (sending the same cursorMark both times):
> > one wt=csv to get the data, another wt=json to get the nextCursorMark
> > (and ignore the data, maybe using fl=id only to avoid fetching much
> > data). I did some tests and this seems like it should work.
> >
> > I guess I will go with the 2nd, but does anyone have a better idea?
> > thanks
> > xavier
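PS: for completeness, the two-call variant from my original message would look roughly like this (also untested; it assumes "id" is the uniqueKey as cursorMark requires sorting on it, so in my DSE case I'd have to substitute the real key):

import json
import urllib.parse
import urllib.request

# Untested sketch; SOLR_URL is a placeholder.
SOLR_URL = "http://localhost:8983/solr/oldcollection/select"
CHUNK = 1000000

def solr_get(extra):
    params = {"q": "*:*", "sort": "id asc", "rows": str(CHUNK)}
    params.update(extra)
    url = SOLR_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

cursor = "*"
chunk_no = 0
while True:
    # Call 1: the actual data, all stored fields, as CSV.
    data = solr_get({"wt": "csv", "cursorMark": cursor})
    # Call 2: same cursorMark, JSON with fl=id only, just to read
    # nextCursorMark from the response.
    meta = json.loads(solr_get({"wt": "json", "fl": "id", "cursorMark": cursor}))
    next_cursor = meta["nextCursorMark"]
    chunk_no += 1
    with open("chunk_%05d.csv" % chunk_no, "w", encoding="utf-8") as f:
        f.write(data)
    if next_cursor == cursor:   # the mark stops changing at the end of the set
        break
    cursor = next_cursor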