The Streaming API may have been throwing exceptions because JSON special characters in the response were not being escaped. This was fixed in Solr 6.0.
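For reference, reading a TupleStream from SolrJ looks roughly like the sketch below. This is a minimal example assuming the SolrJ 6.x API; the ZooKeeper host string, collection name, and field names are placeholders, not values from this thread.

import java.io.IOException;

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamRead {
  public static void main(String[] args) throws IOException {
    // Placeholders: point these at your own cluster and collection.
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";
    String collection = "mycollection";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "id");
    params.set("sort", "id asc");
    params.set("qt", "/export"); // stream the full sorted result set

    CloudSolrStream stream = new CloudSolrStream(zkHost, collection, params);
    StreamContext context = new StreamContext();
    SolrClientCache clientCache = new SolrClientCache();
    context.setSolrClientCache(clientCache);
    stream.setStreamContext(context);

    try {
      stream.open();
      while (true) {
        // Pre-6.0, unescaped JSON in the response could make this read fail.
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;
        }
        System.out.println(tuple.getString("id"));
      }
    } finally {
      stream.close();
      clientCache.close();
    }
  }
}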
Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> I am running Solr 5.5.0. It is a SolrCloud of 50 nodes, and I have the
> following config for all the collections:
>
> maxShardsPerNode: 1
> replicationFactor: 1
>
> I was using the Streaming API to get results back from Solr. It worked
> fine for a while, until the index data size grew beyond 40 GB per shard
> (i.e. per node). At that point it started throwing JSON parsing exceptions
> while reading the TupleStream data. FYI: I have other services (YARN,
> Spark) deployed on the same boxes on which the Solr shards are running.
> The Spark jobs also use a lot of disk cache, so the free disk cache
> available on each box varies a lot depending on what else is running
> there.
>
> Because of this issue, I moved to the cursor approach. It works fine,
> but as we all know it is much slower than the streaming approach.
>
> Currently the index size per shard is 80 GB (the machine has 512 GB of
> RAM, shared by different services/programs: heap, off-heap, and the disk
> cache requirements).
>
> When there is enough RAM available on the machine (more than 80 GB, so
> that all the index data can fit in memory), the Streaming API succeeds
> without running into any exceptions.
>
> Questions:
> How does the index data caching mechanism (for HDFS) differ between the
> Streaming API and the cursorMark approach?
> Why does the cursor work every time, while streaming works only when
> there is a lot of free disk cache?
>
> Thank you.
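For comparison, the cursorMark approach mentioned above is typically driven by a loop like the following sketch, assuming the SolrJ 5.5 API (where CloudSolrClient is built directly from a ZooKeeper host string); again, the hosts, collection, and fields are placeholders.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorRead {
  public static void main(String[] args) throws SolrServerException, IOException {
    // Placeholders: point these at your own cluster and collection.
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("mycollection");

    SolrQuery query = new SolrQuery("*:*");
    query.setFields("id");
    query.setRows(1000);
    // Cursors require a stable sort that includes the uniqueKey field.
    query.setSort(SolrQuery.SortClause.asc("id"));

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(query);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("id"));
      }
      String next = rsp.getNextCursorMark();
      if (cursorMark.equals(next)) {
        break; // the cursor stops advancing once the result set is exhausted
      }
      cursorMark = next;
    }
    client.close();
  }
}

Unlike the /export-based stream, each cursor page is a normal paginated query, which may explain why it is slower but less sensitive to disk cache pressure.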