We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on
our way to 7 and I'm getting surprising /stream results.

In one example I /select (wt=csv) and /stream [using
search(...,wt=javabin)] with a query that gives a resultset size of 541
tuples.  The select comes back in under a second.  The stream takes 70
seconds.  Should I expect this much difference?

I then /select and /stream over a query with a resultset size of 3.5M
documents.  The select takes 14 minutes.  The stream takes just under 7
minutes using `curl`.  When I use SolrJ I get:

Truncated chunk ( expected size: 32768; actual size: 13830)","trace":"org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 13830)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
        at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
        at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        ...

I found a reference to this being from a timeout of the HTTP session in
CloudSolrStream but couldn't find a bug in Jira on the topic.  Digging
around in the source (yay OSS) I found that I could get hold of the
CloudSolrClient and up the SOTimeout so that's working now.

The documentation describes /stream as "returning data as soon as
available" but there seems to be a HUGE startup latency.  Any thoughts on
how to reduce that?
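
To put numbers on that startup latency, it may help to time the first byte
separately from the full response.  Below is a minimal sketch using only the
JDK (no SolrJ).  The class name StreamLatencyProbe and the 10-minute read
timeout are my own placeholders, not anything from Solr; the base URL and the
streaming expression are passed on the command line, and the only Solr detail
assumed is that /stream takes the expression in the `expr` parameter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper: times the first byte and the full body of a response.
class StreamLatencyProbe {

    /** Timings for one response body, measured from the given start time. */
    static final class Result {
        final long firstByteNanos; // elapsed time until the first byte arrived
        final long totalNanos;     // elapsed time until end of stream
        final long totalBytes;     // number of body bytes read

        Result(long firstByteNanos, long totalNanos, long totalBytes) {
            this.firstByteNanos = firstByteNanos;
            this.totalNanos = totalNanos;
            this.totalBytes = totalBytes;
        }
    }

    /** Drains the stream, recording when the first byte and the end arrive. */
    static Result measure(InputStream in, long startNanos) {
        long firstByte = -1;
        long total = 0;
        byte[] buf = new byte[32768];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (n > 0 && firstByte < 0) {
                    firstByte = System.nanoTime() - startNanos;
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long elapsed = System.nanoTime() - startNanos;
        if (firstByte < 0) {
            firstByte = elapsed; // empty body: first byte never arrived
        }
        return new Result(firstByte, elapsed, total);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: StreamLatencyProbe <collection base URL> <expr>");
            return;
        }
        // args[0] is e.g. http://host:port/solr/<collection> (placeholder).
        URL url = new URL(args[0] + "/stream?expr=" + URLEncoder.encode(args[1], "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setReadTimeout(600_000); // assumed 10 minutes; adjust as needed
        long start = System.nanoTime(); // the clock starts before the request is sent
        try (InputStream in = conn.getInputStream()) {
            Result r = measure(in, start);
            System.out.printf("first byte: %d ms, total: %d ms, bytes: %d%n",
                    r.firstByteNanos / 1_000_000, r.totalNanos / 1_000_000, r.totalBytes);
        } finally {
            conn.disconnect();
        }
    }
}
```

Running it once against /stream for the 541-tuple query would show whether
the 70 seconds is spent before the first byte arrives or while the tuples are
being transferred.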
