In Solr 5 the /export handler wasn't escaping JSON text fields, which would produce JSON parse exceptions. This was fixed in Solr 6.0.
Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmm, that should work fine. Let us know what the logs show, if anything,
> because this is weird.
>
> Best,
> Erick
>
> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > Hi Erick,
> >
> > This is how I use the streaming approach.
> >
> > Here is the solrconfig block:
> >
> > <requestHandler name="/export" class="solr.SearchHandler">
> >   <lst name="invariants">
> >     <str name="rq">{!xport}</str>
> >     <str name="wt">xsort</str>
> >     <str name="distrib">false</str>
> >   </lst>
> >   <arr name="components">
> >     <str>query</str>
> >   </arr>
> > </requestHandler>
> >
> > And here is the code in which SolrJ is being used:
> >
> > String zkHost = args[0];
> > String collection = args[1];
> >
> > Map props = new HashMap();
> > props.put("q", "*:*");
> > props.put("qt", "/export");
> > props.put("sort", "fieldA asc");
> > props.put("fl", "fieldA,fieldB,fieldC");
> >
> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
> >
> > And then I iterate through the cloud stream (TupleStream).
> > So I am using streaming expressions (SolrJ).
> >
> > I have not looked at the Solr logs when I started getting the JSON parsing
> > exceptions, but I will let you know what I see the next time I run into the
> > same exceptions.
> >
> > Thanks
> >
> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Hmmm, export is supposed to handle 10s of millions of results. I know
> >> of a situation where the Streaming Aggregation functionality back-ported
> >> to Solr 4.10 processes on that scale. So do you have any clue what
> >> exactly is failing? Is there anything in the Solr logs?
> >>
> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or
> >> just the raw xport handler? It might be worth trying to do this from
> >> SolrJ if you're not; it should be a very quick program to write, just
> >> to test; we're talking 100 lines max.
> >>
> >> You could always roll your own cursor mark stuff by partitioning the
> >> data amongst N threads/processes if you have any reasonable
> >> expectation that you could form filter queries that partition the
> >> result set anywhere near evenly.
> >>
> >> For example, let's say you have a field with random numbers between 0
> >> and 100. You could spin off 10 cursorMark-aware processes, each with
> >> its own fq clause like
> >>
> >> fq=partition_field:[0 TO 10}
> >> fq=partition_field:[10 TO 20}
> >> ....
> >> fq=partition_field:[90 TO 100]
> >>
> >> Note the use of inclusive/exclusive end points....
> >>
> >> Each one would be totally independent of all the others with no
> >> overlapping documents. And since the fq's would presumably be cached,
> >> you should be able to go as fast as you can drive your cluster. Of
> >> course you lose query-wide sorting and the like; if that's important
> >> you'd need to figure something out there.
> >>
> >> Do be aware of a potential issue. When regular doc fields are
> >> returned, for each document returned, a 16K block of data will be
> >> decompressed to get the stored field data. Streaming Aggregation
> >> (/xport) reads docValues entries, which are held in MMapDirectory space,
> >> so it will be much, much faster. As of Solr 5.5 you can override the
> >> decompression stuff for fields that are both stored and docValues; see
> >> https://issues.apache.org/jira/browse/SOLR-8220.
> >>
> >> Best,
> >> Erick
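To make the partitioning idea above concrete, here is a rough SolrJ sketch of what Erick describes: ten workers, each restricted to its own fq range and running its own cursorMark loop, so no two workers ever fetch the same document. This is illustrative only; the ZooKeeper address, collection name, the numeric partition_field, and the assumption that the uniqueKey field is named "id" are all placeholders, and real error handling is omitted.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class PartitionedCursorPaging {

    public static void main(String[] args) throws Exception {
        // Placeholders: adjust zkHost and collection for your cluster.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            // Inclusive lower bound, exclusive upper bound, except the last range which is closed.
            String upper = (i == 9) ? "100]" : ((i + 1) * 10) + "}";
            final String fq = "partition_field:[" + (i * 10) + " TO " + upper;
            pool.submit(() -> pageThroughPartition(client, fq));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        client.close();
    }

    // Each worker runs an ordinary cursorMark loop over its own fq slice.
    private static void pageThroughPartition(CloudSolrClient client, String fq) {
        try {
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery(fq);
            q.setFields("fieldA", "fieldB", "fieldC");
            q.setRows(100000);
            q.setSort("id", SolrQuery.ORDER.asc);   // cursorMark requires a sort on the uniqueKey
            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // process doc ...
                }
                String next = rsp.getNextCursorMark();
                done = cursorMark.equals(next);      // unchanged mark means all pages are consumed
                cursorMark = next;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}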
> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> > Thanks Yonik for the explanation.
> >> >
> >> > Hi Erick,
> >> > I was using the /xport functionality, but it hasn't been stable (Solr
> >> > 5.5.0). I started running into runtime exceptions (JSON parsing
> >> > exceptions) while reading the stream of Tuples. This started happening as
> >> > the size of my collection increased 3 times and I started running queries
> >> > that return millions of documents (>10MM). I don't know if it is the query
> >> > result size or the actual data size (total number of docs in the
> >> > collection) that is causing the instability.
> >> >
> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> > char=5,position=110938
> >> > BEFORE='uuid":"0lG99s8vyaKB2I/I","space":"uuid","timestamp":15'
> >> > AFTER='DB6 474294954},{"uuid":"0lG99sHT8P5e'
> >> >
> >> > I won't be able to move to Solr 6.0 due to some constraints in our
> >> > production environment and hence am moving back to the cursor approach. Do
> >> > you have any other suggestion for me?
> >> >
> >> > Thanks,
> >> > Chetas.
> >> >
> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Have you considered the /xport functionality?
> >> >>
> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> > They are the serialized representation of the last sort values
> >> >> > encountered (hence not known ahead of time).
> >> >> >
> >> >> > -Yonik
> >> >> >
> >> >> >
> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> >> >> >> my queries return millions of results. Is there a way I can read the pages
> >> >> >> in parallel? Is there a way I can get all the cursors well in advance?
> >> >> >>
> >> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> >> Can I have multiple threads iterating over different pages, like
> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> ......
> >> >> >> ......
> >> >> >>
> >> >> >> For this to happen, can I get all the cursorMarks for a given query so that
> >> >> >> I can leverage the following code in parallel?
> >> >> >>
> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >>
> >> >> >> Thank you,
> >> >> >> Chetas.
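Since the thread moves between cursor paging and the /export streaming approach, here is also, for reference, a minimal sketch of the short SolrJ test program Erick suggests, filled out from the CloudSolrStream snippet Chetas posted above. It assumes Solr 5.x-era SolrJ; the zkHost, collection, and field names are placeholders, and every field used in sort and fl must have docValues for /export to work.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class ExportStreamTest {

    public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181";  // placeholder ZooKeeper ensemble
        String collection = "collection1";             // placeholder collection name

        Map<String, String> props = new HashMap<String, String>();
        props.put("q", "*:*");
        props.put("qt", "/export");               // send the query to the /export handler
        props.put("sort", "fieldA asc");          // /export sorts on docValues fields only
        props.put("fl", "fieldA,fieldB,fieldC");  // returned fields must also have docValues

        CloudSolrStream stream = new CloudSolrStream(zkHost, collection, props);
        try {
            stream.open();
            while (true) {
                Tuple tuple = stream.read();
                if (tuple.EOF) {                  // the stream ends with an EOF marker tuple
                    break;
                }
                System.out.println(tuple.getString("fieldA"));
            }
        } finally {
            stream.close();
        }
    }
}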