Hi Joel,

The only non-alphanumeric characters I have in my data are '+' and '/'. I
don't have any backslashes.

If special characters were the issue, I would get the JSON parsing
exceptions every time, irrespective of the index size and of the available
memory on the machine. That is not the case here: the streaming API
successfully returns all the documents when the index is small enough to
fit in the available memory. That's why I am confused.
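
For reference, the cursor approach mentioned below can be sketched like
this (a minimal, self-contained sketch: the real Solr HTTP request is
replaced by a pluggable fetch function, and all names and fake data here
are illustrative, not from my actual setup):

```python
# Minimal sketch of Solr's cursorMark pagination loop. The real HTTP
# request to Solr is replaced by a pluggable fetch(params) callable so
# this stays self-contained; names and fake data are illustrative.

def cursor_pages(fetch, base_params):
    """Yield pages of docs until Solr returns the same cursorMark back."""
    # Note: cursorMark requires a sort clause ending on the uniqueKey field.
    cursor = "*"  # "*" is the initial cursor token
    while True:
        page = fetch(dict(base_params, cursorMark=cursor))
        yield page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor => no more results
            break
        cursor = next_cursor

# Fake fetch standing in for a real Solr request, for demonstration only.
def fake_fetch(params):
    data = {
        "*":  (["a", "b"], "c1"),
        "c1": (["c"], "c2"),
        "c2": ([], "c2"),  # final page: cursor stops advancing
    }
    docs, nxt = data[params["cursorMark"]]
    return {"response": {"docs": docs}, "nextCursorMark": nxt}

all_docs = [doc
            for page in cursor_pages(fake_fetch,
                                     {"q": "*:*", "sort": "id asc", "rows": 2})
            for doc in page]
```

In real code, fetch would issue the /select request (via SolrJ or plain
HTTP) with the cursorMark parameter; the key points are starting from "*"
and stopping when nextCursorMark stops changing.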

Thanks!

On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein <joels...@gmail.com> wrote:

> The Streaming API may have been throwing exceptions because the JSON
> special characters were not escaped. This was fixed in Solr 6.0.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <chetas.jo...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I am running Solr 5.5.0.
> > It is a SolrCloud of 50 nodes, and I have the following config for all
> > the collections:
> > maxShardsPerNode: 1
> > replicationFactor: 1
> >
> > I was using the Streaming API to get results back from Solr. It worked
> > fine for a while, until the index data size grew beyond 40 GB per shard
> > (i.e. per node). At that point it started throwing JSON parsing
> > exceptions while reading the TupleStream data. FYI: I have other
> > services (Yarn, Spark) deployed on the same boxes that the Solr shards
> > run on. Spark jobs also use a lot of disk cache, so the free disk cache
> > available on a box varies a lot depending on what else is running there.
> >
> > Due to this issue, I moved to the cursor approach, which works fine,
> > but as we all know it is much slower than the streaming approach.
> >
> > Currently the index size per shard is 80 GB. (Each machine has 512 GB
> > of RAM, shared by the different services/programs: heap, off-heap, and
> > disk cache requirements.)
> >
> > When I have enough RAM (more than 80 GB so that all the index data could
> > fit in memory) available on the machine, the streaming API succeeds
> without
> > running into any exceptions.
> >
> > Questions:
> > How does the index data caching mechanism (for HDFS) differ between the
> > Streaming API and the cursorMark approach?
> > Why does the cursor approach work every time, while streaming works
> > only when there is a lot of free disk cache?
> >
> > Thank you.
> >
>
