Re: Streaming and large resultsets

Joel Bernstein Thu, 09 Nov 2017 09:17:47 -0800

Can you post the exact streaming query you are using? The size of the index
and the field types will help me understand the issue as well. Also, are you
seeing different performance behavior after the upgrade, or are you just
testing streaming for the first time on 6.6.1?

When using the /export handler to stream, the results are pulled from the
DocValues caches, which have in-memory structures that need to be built. If
your query returns a large number of string fields, you could see a delay as
those caches are built for the first time or rebuilt following a commit.
You can use static warming queries to warm the fields in the background
following a commit. Once the caches are built, subsequent queries should
start streaming immediately. The 70 second response time with only 541
tuples sounds like it might be caused by the caches being rebuilt.
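
One way to check whether the delay is cache warm-up rather than streaming
throughput is to measure the time to the first chunk separately from the
total time, and to repeat the same request once the caches should be warm.
The following is a minimal sketch under stated assumptions, not a definitive
tool: the host, the collection name, and the field names are hypothetical,
and it assumes the /export handler is given q, fl and sort parameters.

```python
import time
from urllib.parse import urlencode


def export_url(base_url, collection, query, fields, sort):
    """Build an /export request URL from q, fl and sort.

    base_url, collection and the field names are placeholders supplied by
    the caller; nothing here is specific to a particular cluster.
    """
    params = {"q": query, "fl": ",".join(fields), "sort": sort}
    return f"{base_url.rstrip('/')}/{collection}/export?{urlencode(params)}"


def time_to_first_chunk(chunks, clock=time.monotonic):
    """Consume an iterable of response chunks.

    Returns (seconds until the first chunk, seconds until the end of the
    stream, number of chunks). The first value is None for an empty stream.
    """
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock() - start
        count += 1
    total = clock() - start
    return first, total, count


# Hypothetical usage against a running node (not executed here):
#
#   from urllib.request import urlopen
#   url = export_url("http://localhost:8983/solr", "mycollection",
#                    "*:*", ["id", "name_s"], "id asc")
#   with urlopen(url) as resp:
#       chunks = iter(lambda: resp.read(8192), b"")
#       print(time_to_first_chunk(chunks))
```

If the first run shows a long wait before the first chunk and a second,
identical run starts almost immediately, that is consistent with the caches
being built on the first request.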

In general, though, the /export handler will slow down as you add more
fields to the field list and sort list. If you post your query and the
types of the fields in the field list, I can give you an idea of the
performance I would expect.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 9, 2017 at 9:54 AM, Lanny Ripple <la...@spotright.com> wrote:

> We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on
> our way to 7 and I'm getting surprising /stream results.
>
> In one example I /select (wt=csv) and /stream [using
> search(...,wt=javabin)] with a query that gives a resultset size of 541
> tuples.  The select comes back in under a second.  The stream takes 70
> seconds.  Should I expect this much difference?
>
> I then /select and /stream over a query with a resultset size of 3.5M
> documents.  The select takes 14 minutes.  The stream takes just under 7
> minutes using `curl`.  When I use solrj I get
>
> Truncated chunk ( expected size: 32768; actual size: 13830)","trace":"org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 13830)
>         at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
>         at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
>         at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
>         at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
>         at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>         ...
>
> I found a reference to this being from a timeout of the HTTP session in
> CloudSolrStream but couldn't find a bug in Jira on the topic.  Digging
> around in the source (yay OSS) I found that I could get hold of the
> CloudSolrClient and up the SOTimeout so that's working now.
>
> The documentation describes /stream as "returning data as soon as
> available" but there seems to be a HUGE startup latency.  Any thoughts on
> how to reduce that?
>
