In Solr 5 the /export handler wasn't escaping JSON text fields, which would produce JSON parse exceptions. This was fixed in Solr 6.0.
Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmm, that should work fine. Let us know what the logs show, if anything,
> because this is weird.
>
> Best,
> Erick
>
> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > Hi Erick,
> >
> > This is how I use the streaming approach.
> >
> > Here is the solrconfig block:
> >
> > <requestHandler name="/export" class="solr.SearchHandler">
> >   <lst name="invariants">
> >     <str name="rq">{!xport}</str>
> >     <str name="wt">xsort</str>
> >     <str name="distrib">false</str>
> >   </lst>
> >   <arr name="components">
> >     <str>query</str>
> >   </arr>
> > </requestHandler>
> >
> > And here is the code in which SolrJ is being used:
> >
> > String zkHost = args[0];
> > String collection = args[1];
> >
> > Map props = new HashMap();
> > props.put("q", "*:*");
> > props.put("qt", "/export");
> > props.put("sort", "fieldA asc");
> > props.put("fl", "fieldA,fieldB,fieldC");
> >
> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
> >
> > And then I iterate through the cloud stream (TupleStream).
> > So I am using streaming expressions (SolrJ).
> >
> > I have not looked at the Solr logs when I started getting the JSON parsing
> > exceptions, but I will let you know what I see the next time I run into the
> > same exceptions.
> >
> > Thanks
> >
> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Hmmm, export is supposed to handle 10s of millions of results. I know
> >> of a situation where the Streaming Aggregation functionality back-ported
> >> to Solr 4.10 processes on that scale. So do you have any clue what
> >> exactly is failing? Is there anything in the Solr logs?
> >>
> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> >> just the raw xport handler? It might be worth trying to do this from
> >> SolrJ if you're not; it should be a very quick program to write, just
> >> to test; we're talking 100 lines max.
> >>
> >> You could always roll your own cursor mark stuff by partitioning the
> >> data amongst N threads/processes if you have any reasonable
> >> expectation that you could form filter queries that partition the
> >> result set anywhere near evenly.
> >>
> >> For example, let's say you have a field with random numbers between 0
> >> and 100. You could spin off 10 cursorMark-aware processes, each with
> >> its own fq clause like
> >>
> >> fq=partition_field:[0 TO 10}
> >> fq=partition_field:[10 TO 20}
> >> ....
> >> fq=partition_field:[90 TO 100]
> >>
> >> Note the use of inclusive/exclusive end points....
> >>
> >> Each one would be totally independent of all the others with no
> >> overlapping documents. And since the fq's would presumably be cached,
> >> you should be able to go as fast as you can drive your cluster. Of
> >> course you lose query-wide sorting and the like; if that's important
> >> you'd need to figure something out there.
> >>
> >> Do be aware of a potential issue. When regular doc fields are
> >> returned, for each document returned, a 16K block of data will be
> >> decompressed to get the stored field data. Streaming Aggregation
> >> (/xport) reads docValues entries, which are held in MMapDirectory space,
> >> so it will be much, much faster. As of Solr 5.5 you can override the
> >> decompression stuff for fields that are both stored and docValues; see
> >> https://issues.apache.org/jira/browse/SOLR-8220.
> >>
> >> Best,
> >> Erick
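To make the partitioning idea above concrete, here is a rough SolrJ sketch of what Erick describes: ten workers, each restricted to its own fq range and running its own cursorMark loop, so no two workers ever fetch the same document. This is illustrative only; the ZooKeeper address, collection name, the numeric partition_field, and the assumption that the uniqueKey field is named "id" are all placeholders, and real error handling is omitted.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class PartitionedCursorPaging {

    public static void main(String[] args) throws Exception {
        // Placeholders: adjust zkHost and collection for your cluster.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            // Inclusive lower bound, exclusive upper bound, except the last range which is closed.
            String upper = (i == 9) ? "100]" : ((i + 1) * 10) + "}";
            final String fq = "partition_field:[" + (i * 10) + " TO " + upper;
            pool.submit(() -> pageThroughPartition(client, fq));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        client.close();
    }

    // Each worker runs an ordinary cursorMark loop over its own fq slice.
    private static void pageThroughPartition(CloudSolrClient client, String fq) {
        try {
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery(fq);
            q.setFields("fieldA", "fieldB", "fieldC");
            q.setRows(100000);
            q.setSort("id", SolrQuery.ORDER.asc);   // cursorMark requires a sort on the uniqueKey
            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // process doc ...
                }
                String next = rsp.getNextCursorMark();
                done = cursorMark.equals(next);      // unchanged mark means all pages are consumed
                cursorMark = next;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}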
> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> > Thanks Yonik for the explanation.
> >> >
> >> > Hi Erick,
> >> > I was using the /xport functionality, but it hasn't been stable (Solr
> >> > 5.5.0). I started running into runtime exceptions (JSON parsing
> >> > exceptions) while reading the stream of Tuples. This started happening as
> >> > the size of my collection increased 3 times and I started running queries
> >> > that return millions of documents (>10MM). I don't know if it is the query
> >> > result size or the actual data size (total number of docs in the
> >> > collection) that is causing the instability.
> >> >
> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> > char=5,position=110938
> >> > BEFORE='uuid":"0lG99s8vyaKB2I/I","space":"uuid","timestamp":15'
> >> > AFTER='DB6 474294954},{"uuid":"0lG99sHT8P5e'
> >> >
> >> > I won't be able to move to Solr 6.0 due to some constraints in our
> >> > production environment and hence am moving back to the cursor approach. Do
> >> > you have any other suggestion for me?
> >> >
> >> > Thanks,
> >> > Chetas.
> >> >
> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Have you considered the /xport functionality?
> >> >>
> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> > They are the serialized representation of the last sort values
> >> >> > encountered (hence not known ahead of time).
> >> >> >
> >> >> > -Yonik
> >> >> >
> >> >> >
> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> >> >> >> my queries return millions of results. Is there a way I can read the pages
> >> >> >> in parallel? Is there a way I can get all the cursors well in advance?
> >> >> >>
> >> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> >> Can I have multiple threads iterating over different pages, like
> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> ......
> >> >> >> ......
> >> >> >>
> >> >> >> For this to happen, can I get all the cursorMarks for a given query so that
> >> >> >> I can leverage the following code in parallel?
> >> >> >>
> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >>
> >> >> >> Thank you,
> >> >> >> Chetas.
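Since the thread moves between cursor paging and the /export streaming approach, here is also, for reference, a minimal sketch of the short SolrJ test program Erick suggests, filled out from the CloudSolrStream snippet Chetas posted above. It assumes Solr 5.x-era SolrJ; the zkHost, collection, and field names are placeholders, and every field used in sort and fl must have docValues for /export to work.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class ExportStreamTest {

    public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181";  // placeholder ZooKeeper ensemble
        String collection = "collection1";             // placeholder collection name

        Map<String, String> props = new HashMap<String, String>();
        props.put("q", "*:*");
        props.put("qt", "/export");               // send the query to the /export handler
        props.put("sort", "fieldA asc");          // /export sorts on docValues fields only
        props.put("fl", "fieldA,fieldB,fieldC");  // returned fields must also have docValues

        CloudSolrStream stream = new CloudSolrStream(zkHost, collection, props);
        try {
            stream.open();
            while (true) {
                Tuple tuple = stream.read();
                if (tuple.EOF) {                  // the stream ends with an EOF marker tuple
                    break;
                }
                System.out.println(tuple.getString("fieldA"));
            }
        } finally {
            stream.close();
        }
    }
}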