We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on
our way to 7 and I'm getting surprising /stream results.
In one example I ran /select (wt=csv) and /stream [using
search(...,wt=javabin)] with a query that returns a result set of 541
tuples. The select comes back in under a second. The stream takes 70
seconds. Should I expect this much difference?
I then ran /select and /stream over a query with a result set of 3.5M
documents. The select takes 14 minutes. The stream takes just under 7
minutes using `curl`. When I use SolrJ I get:
Truncated chunk ( expected size: 32768; actual size: 13830)","trace":"org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 13830)
    at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
    at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
    at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
    at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
    at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
    ...
I found a reference to this being caused by a timeout of the HTTP session
in CloudSolrStream but couldn't find a bug in Jira on the topic. Digging
around in the source (yay OSS) I found that I could get hold of the
CloudSolrClient and increase the SOTimeout, so that's working now.
The documentation describes /stream as "returning data as soon as
available" but there seems to be a HUGE startup latency. Any thoughts on
how to reduce that?
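
To put a number on the startup latency, it may help to separate the time
to the first byte from the total transfer time. Below is a minimal Python
sketch of that measurement, standard library only. The name
`measure_stream` and its `opener` parameter are my own and not part of
Solr or SolrJ. It assumes the handler accepts the expression as a
form-encoded `expr` parameter in a POST, which is how
`curl --data-urlencode` sends it.

```python
import time
import urllib.parse
import urllib.request


def measure_stream(url, params, timeout=600, chunk_size=8192,
                   opener=urllib.request.urlopen):
    """POST form-encoded params to url; time the first byte and the full body.

    `opener` is a hypothetical hook (defaults to urllib's urlopen) so the
    function can be exercised without a running server.
    """
    body = urllib.parse.urlencode(params).encode("utf-8")
    start = time.monotonic()
    with opener(url, data=body, timeout=timeout) as resp:
        # Block until the server sends the first byte of the response body.
        first = resp.read(1)
        first_byte_s = time.monotonic() - start
        total_bytes = len(first)
        # Drain the rest of the response without holding it in memory.
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    total_s = time.monotonic() - start
    return {"first_byte_s": first_byte_s, "total_s": total_s,
            "bytes": total_bytes}
```

A call would look something like
`measure_stream("http://localhost:8983/solr/mycollection/stream", {"expr": 'search(mycollection, q="*:*", fl="id", sort="id asc")'})`,
where the host, collection name and expression are placeholders. If
`first_byte_s` is close to `total_s`, nearly all of the time goes by
before any data is sent, rather than in the transfer itself.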