Hi Erick,

Ah, yes, I guess you are correct in that I could just avoid using cursorMark this way... the only (smallish, I think) issue is that I would need to extract the last id from the CSV output. Oh, and I am using DataStax DSE, so the uniqueKey is a combination of two fields... but I think I can manage by using a field I know is unique, even if it's not the uniqueKey. Something like the sketch below is what I have in mind.
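A minimal, untested sketch of that loop in Python; the collection URL and the unique field name (my_unique_field) are placeholders for my real setup, and I'm assuming the id values are safe to drop straight into a range query:

import csv
import io
import urllib.parse
import urllib.request

# Untested sketch. SOLR_URL and UNIQ_FIELD are placeholders; UNIQ_FIELD is a
# field known to hold unique values, even though it is not the uniqueKey.
SOLR_URL = "http://localhost:8983/solr/oldcollection/select"
UNIQ_FIELD = "my_unique_field"
CHUNK = 1000000

def fetch_chunk(last_id=None):
    params = {
        "q": "*:*",
        "sort": UNIQ_FIELD + " asc",
        "start": "0",          # always 0; the fq does the paging
        "rows": str(CHUNK),
        "wt": "csv",
    }
    if last_id is not None:
        # Exclusive lower bound {... TO *] so the last doc of the previous
        # chunk is not exported twice. Assumes ids need no query escaping.
        params["fq"] = "{!cache=false}%s:{%s TO *]" % (UNIQ_FIELD, last_id)
    url = SOLR_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

last_id = None
chunk_no = 0
while True:
    body = fetch_chunk(last_id)
    rows = list(csv.reader(io.StringIO(body)))
    if len(rows) <= 1:                    # header line only: we are done
        break
    id_col = rows[0].index(UNIQ_FIELD)    # locate the unique field's column
    last_id = rows[-1][id_col]            # the "extract the last id" step
    chunk_no += 1
    with open("chunk_%05d.csv" % chunk_no, "w", encoding="utf-8") as f:
        f.write(body)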
thanks!
Xavier

On Tue, Jun 21, 2016 at 2:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> The CursorMark stuff has to deal with shards: what happens when more
> than one document on different shards has the same sort value, what
> if all the docs in the response packet have the same sort value, what
> happens when you want to return docs by score, and the like.
>
> For your case you can use a sort criterion that avoids all these issues
> and be OK. You can think of it as a specialized CursorMark.
>
> You should be able to just sort by <uniqueKey> and send each query
> through with a range filter query, so the first query would look
> something like this (assuming "id" is your <uniqueKey>):
>
> q=*:*&sort=id asc&start=0&rows=1000
>
> then the rest would be
>
> q=*:*&sort=id asc&fq={!cache=false}id:[last_id_returned_from_previous_query TO *]&start=0&rows=1000
>
> This avoids the "deep paging" problem that CursorMark solves more cheaply,
> because the <uniqueKey> guarantees that there is one and only one doc with
> that value. Note that the start parameter is always 0.
>
> Or your second query could even be just
>
> q=id:[last_id_returned_from_previous_query TO *]&sort=id asc&start=0&rows=1000
>
> Best,
> Erick
>
> On Mon, Jun 20, 2016 at 12:37 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> > Hi,
> >
> > I need to index into a new schema 800M docs that exist in an older Solr.
> > As all fields are stored, I thought I was very lucky, as I could:
> >
> > - use wt=csv
> > - combined with cursorMark
> >
> > to easily script out something that would export/index in chunks of 1M
> > docs or so. CSV output is very efficient for this sort of thing, I think.
> >
> > But, sadly, I found that there is no way to get the nextCursorMark after
> > the first request, as the CSV writer just outputs plain CSV data for the
> > fields, excluding all other info in the response!
> >
> > This is unfortunate, as csv/cursorMark seems like the perfect fit to
> > reindex this huge index (it's a one-time thing).
> >
> > Does anyone see a way to still be able to use this? I would prefer not
> > to have to write some Java code just to get the nextCursorMark.
> >
> > So far I have thought of:
> > - use json, but I would need to postprocess the returned json to remove
> > the response info etc. before reindexing; a pain.
> > - send two calls for each chunk (sending the same cursorMark both times):
> > one wt=csv to get the data, another wt=json to get the nextCursorMark
> > (and ignore the data, maybe using fl=id only to avoid fetching much
> > data). I did some tests and this seems like it should work.
> >
> > I guess I will go with the 2nd, but does anyone have a better idea?
> > thanks
> > xavier
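PS: for completeness, the two-call variant from my original message would look roughly like this (also untested; it assumes "id" is the uniqueKey as cursorMark requires sorting on it, so in my DSE case I'd have to substitute the real key):

import json
import urllib.parse
import urllib.request

# Untested sketch; SOLR_URL is a placeholder.
SOLR_URL = "http://localhost:8983/solr/oldcollection/select"
CHUNK = 1000000

def solr_get(extra):
    params = {"q": "*:*", "sort": "id asc", "rows": str(CHUNK)}
    params.update(extra)
    url = SOLR_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

cursor = "*"
chunk_no = 0
while True:
    # Call 1: the actual data, all stored fields, as CSV.
    data = solr_get({"wt": "csv", "cursorMark": cursor})
    # Call 2: same cursorMark, JSON with fl=id only, just to read
    # nextCursorMark from the response.
    meta = json.loads(solr_get({"wt": "json", "fl": "id", "cursorMark": cursor}))
    next_cursor = meta["nextCursorMark"]
    chunk_no += 1
    with open("chunk_%05d.csv" % chunk_no, "w", encoding="utf-8") as f:
        f.write(data)
    if next_cursor == cursor:   # the mark stops changing at the end of the set
        break
    cursor = next_cursor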