Hi Joel,

I don't have any Solr documents that have NULL values for the sort fields I use in my queries.
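For reference, a check along these lines would surface any documents that are missing a sort field (the ZooKeeper hosts, collection name, and field name below are placeholders, not my actual ones; the same fq pattern can be repeated for each sort field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MissingSortFieldCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("my_collection");

        // Match only documents that have NO value in the sort field;
        // numFound should be 0 if every document has it populated.
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("-my_sort_field:[* TO *]");
        query.setRows(0);

        QueryResponse rsp = client.query(query);
        System.out.println("docs missing my_sort_field: " + rsp.getResults().getNumFound());
        client.close();
    }
}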
Thanks!

On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein <joels...@gmail.com> wrote:

> Ok, based on the stack trace I suspect one of your sort fields has NULL
> values, which in the 5x branch could produce null pointers if a segment
> had no values for a sort field. This is also fixed in the Solr 6x branch.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>
> > Here is the stack trace.
> >
> > java.lang.NullPointerException
> >   at org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85)
> >   at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92)
> >   at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:30)
> >   at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:45)
> >   at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:33)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:396)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:381)
> >   at java.util.TreeMap.put(TreeMap.java:560)
> >   at java.util.TreeSet.add(TreeSet.java:255)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:366)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)
> >   at *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.scala:101)
> >   at java.lang.Thread.run(Thread.java:745)
> >
> > 16/11/17 13:04:31 ERROR SolrStreamResultIterator:missing exponent number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >
> > org.noggit.JSONParser$ParseException: missing exponent number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >   at org.noggit.JSONParser.err(JSONParser.java:356)
> >   at org.noggit.JSONParser.readExp(JSONParser.java:513)
> >   at org.noggit.JSONParser.readNumber(JSONParser.java:419)
> >   at org.noggit.JSONParser.next(JSONParser.java:845)
> >   at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
> >   at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
> >   at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
> >   at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
> >   at org.apache.solr.client.solrj.io.stream.JSONTupleStream.next(JSONTupleStream.java:84)
> >   at org.apache.solr.client.solrj.io.stream.SolrStream.read(SolrStream.java:147)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.next(CloudSolrStream.java:413)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:365)
> >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)
> >
> > Thanks!
> >
> > On Fri, Dec 16, 2016 at 11:45 PM, Reth RM <reth.ik...@gmail.com> wrote:
> >
> > > If you could provide the json parse exception stack trace, it might help
> > > to predict issue there.
> > >
> > > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > The only NON-alphanumeric characters I have in my data are '+' and '/'.
> > > > I don't have any backslashes.
> > > >
> > > > If the special characters were the issue, I should get the JSON parsing
> > > > exceptions every time, irrespective of the index size and of the
> > > > available memory on the machine. That is not the case here. The
> > > > Streaming API successfully returns all the documents when the index
> > > > size is small and fits in the available memory. That's the reason I am
> > > > confused.
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein <joels...@gmail.com> wrote:
> > > >
> > > > > The Streaming API may have been throwing exceptions because the JSON
> > > > > special characters were not escaped. This was fixed in Solr 6.0.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am running Solr 5.5.0.
> > > > > > It is a SolrCloud of 50 nodes, and I have the following config for
> > > > > > all the collections:
> > > > > > maxShardsPerNode: 1
> > > > > > replicationFactor: 1
> > > > > >
> > > > > > I was using the Streaming API to get results back from Solr. It
> > > > > > worked fine for a while, until the index data size grew beyond 40 GB
> > > > > > per shard (i.e. per node). It then started throwing JSON parsing
> > > > > > exceptions while reading the TupleStream data. FYI: I have other
> > > > > > services (YARN, Spark) deployed on the same boxes on which the Solr
> > > > > > shards are running. The Spark jobs also use a lot of disk cache, so
> > > > > > the free disk cache available on a box varies a lot depending on
> > > > > > what else is running on it.
> > > > > >
> > > > > > Due to this issue, I moved to the cursor approach. It works fine,
> > > > > > but, as we all know, it is much slower than the streaming approach.
> > > > > >
> > > > > > Currently the index size per shard is 80 GB. (The machine has 512 GB
> > > > > > of RAM, which is shared by different services/programs: heap,
> > > > > > off-heap, and the disk cache requirements.)
> > > > > >
> > > > > > When I have enough RAM available on the machine (more than 80 GB, so
> > > > > > that all the index data can fit in memory), the Streaming API
> > > > > > succeeds without running into any exceptions.
> > > > > >
> > > > > > Questions:
> > > > > > How does the index data caching mechanism (for HDFS) differ between
> > > > > > the Streaming API and the cursorMark approach?
> > > > > > Why does the cursor work every time, while streaming works only when
> > > > > > there is a lot of free disk cache?
> > > > > >
> > > > > > Thank you.
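For context, the cursor approach mentioned in the quoted thread is the standard cursorMark deep-paging loop. A minimal SolrJ sketch of it follows, assuming uuid is the uniqueKey field and with the ZooKeeper hosts, collection name, and sort fields as placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkExport {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("my_collection");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(1000);
        // The sort must be deterministic and end with the uniqueKey field
        // (assumed here to be "uuid") as a tie-breaker.
        query.setSort(SortClause.asc("timestamp"));
        query.addSort(SortClause.asc("uuid"));

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                // process each document ...
            }
            String nextCursorMark = rsp.getNextCursorMark();
            // When the cursor stops advancing, every matching document has been read.
            done = cursorMark.equals(nextCursorMark);
            cursorMark = nextCursorMark;
        }
        client.close();
    }
}

Every page here is an ordinary query round-trip, which lines up with the "works fine but slower" behaviour described above.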