Hi Lanny,

For long-running streaming queries with many shards and huge result sets, SolrJ's default settings for HTTP max connections / max connections per host may not be enough. If you are using the worker collection (/stream), it relies on SolrClientCache to dispense HTTP clients with the default limits. It could be useful to turn on debug logging and check.
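If you end up going that route, something roughly like this is what I mean (just a sketch; the HttpClientUtil property names and the SolrClientCache(HttpClient) constructor can vary a bit between SolrJ versions):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.solr.client.solrj.impl.HttpClientUtil;
    import org.apache.solr.client.solrj.io.SolrClientCache;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class StreamClientCacheSetup {
      // A SolrClientCache backed by an HttpClient with larger pool limits
      // than the SolrJ defaults, for use in a StreamContext.
      public static SolrClientCache buildClientCache() {
        ModifiableSolrParams httpParams = new ModifiableSolrParams();
        httpParams.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1024);         // total pool size
        httpParams.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 128); // per shard host
        CloseableHttpClient httpClient = HttpClientUtil.createClient(httpParams);
        return new SolrClientCache(httpClient);
      }
    }

If you drive the expression from SolrJ, that cache can be set on the StreamContext you pass to the stream.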
Thanks,
Susmit

On Thu, Nov 9, 2017 at 8:35 PM, Lanny Ripple <la...@spotright.com> wrote:

First, Joel, thanks for your help on this.

1) I have to admit we really haven't played with a lot of system tuning recently (before DocValues, for sure). We'll go through another tuning round.

2) At the time I ran these numbers this morning we were not indexing. We build this collection once a month and then client jobs can update it. I was watching our job queue and there were no jobs running at that time. It's possible someone else was querying against other collections, but they wouldn't have been updating this collection.

3) I'll try /export on each node. We're pretty cookie-cutter, with all nodes being the same and configuration controlled with puppet. We collect system metrics to a Graphite display panel and no host looks out of sorts relative to the others. That said, I wouldn't be surprised if a node was out of whack.

Thanks again.
-ljr

On Thu, Nov 9, 2017 at 2:34 PM Joel Bernstein <joels...@gmail.com> wrote:

In my experience this should be very fast:

    search(graph-october,
           q="outs:tokenA",
           fl="id,token",
           sort="id asc",
           qt="/export",
           wt=javabin)

When the DocValues cache is statically warmed for the two output fields I would see somewhere around 500,000 docs per second exported from a single node.

You have sixteen shards, which would give you 16 times the throughput. But of course the data is being sent back through the single aggregator node, so your throughput is only as fast as the aggregator node can process the results.

This does not explain the slowness that you are seeing. I see a couple of possible reasons:

1) The memory on the system is not tuned optimally. You allocated a large amount of memory to the heap and are not providing enough memory to the OS filesystem cache. Lucene DocValues use the OS filesystem cache for the DocValues caches, so I would bump down the size of the heap considerably.

2) Are you indexing while querying at all? If you are, you would need to be statically warming the DocValues caches for the id field, which is used for sorting. Following each commit there is a top-level DocValues cache that is rebuilt for sorting on string fields. If you use a static warming query, it will warm the cache before making the new searcher live for searches. I would also pause indexing if possible and run queries only, to see how it runs without indexing.

3) Try running a query directly to the /export handler on each node. Possibly one of your nodes is slow for some reason and that is causing the entire query to respond slowly.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 9, 2017 at 2:22 PM, Lanny Ripple <la...@spotright.com> wrote:

Happy to do so. I am testing streams for the first time, so we don't have any 5.x experience. The collection I'm testing was loaded after going to 6.6.1 and fixing up the solrconfig for lucene_version and removing the /export clause. The indexes run 57G per replica. We are using 64G hosts with 48G heaps using G1GC, but this isn't our only large collection. This morning as I'm running these our cluster is quiet. I realize some of the performance we are seeing is going to be our data size, so I'm not expecting any silver bullets.
We are storing 858M documents that are basically

    id: String
    token: String
    outs: String[]
    outsCount: Int

All stored=true, docvalues=true.

The `outs` reference a select number of tokens (1.26M). Here are current percentiles of our outsCount:

    `outsCount`
    50%        12
    85%       127
    98%       937
    99.9%  16,284

I'll show the /stream query, but I'm setting up the same thing in solrj. I'm going to name our small result set "tokenA" and our large one "tokenV".

    search(graph-october,
           q="outs:tokenA",
           fl="id,token",
           sort="id asc",
           qt="/export",
           wt=javabin)

I've placed this in file /tmp/expr and invoke it with

    curl -sSN -m 3600 --data-urlencode expr@/tmp/expr http://host/solr/graph-october/stream

The large result set query replaces "tokenA" with "tokenV".

My /select query is

    curl -sSN -m 3600 -d wt=csv -d rows=10000 -d q="outs:tokenA" -d fl="id,token" http://host/solr/graph-october/select

So the behavior I'm seeing (times are from running the commands back to back):

    /select tokenA  0.296s
    /search tokenA  8.787s
    /search tokenA  8.513s
    /select tokenA  0.292s

I was going to get some "tokenV" times here but curl is hitting "Truncated chunk" errors.

I can pick a "tokenB" which hasn't been run but which I know from the originating data has a result set size of about 5800. Times are

    /search tokenB  11.486s
    /search tokenB   0.960s
    /select tokenB   2.534s
    /select tokenB   2.117s
    /search tokenB   1.208s

So that's closer to the behavior I'd expect and desire. Back to "tokenA".

    /search tokenA   0.289s
    /search tokenA   0.249s
    /select tokenA   0.342s
    /select tokenA   0.329s
    /search tokenA  11.848s

Here's "tokenV" out of a solrj counting job.

_run1_
Token tokenV has 3497203 outs.
Time to first read: PT37.577S
Runtime: PT1M43.822S

_run2_
Token tokenV has 3497203 outs.
Time to first read: PT18.764S
Runtime: PT1M16.655S

So hmm. Given this I'd hazard there's nothing unexpected about /stream and the problem is the load on our cluster.

So I thought I'd run "tokenA" through the count job.

_run1_
Token tokenA has 513* outs.
Time to first read: PT12.105S
Runtime: PT12.114S

* The correct count. Last post I also counted the csv header (514) then transposed the digits. :/

_run2_
Token tokenA has 513 outs.
Time to first read: PT7.85S
Runtime: PT7.86S

_run3_
/select tokenA 0.229s

_run4_
Token tokenA has 513 outs.
Time to first read: PT11.485S
Runtime: PT11.495S

And we are back to not nearly as fast.

Here's my counting code.
https://gist.github.com/LannyRipple/8e47769155c766b33bf37bcf901b17e9

On Thu, Nov 9, 2017 at 11:17 AM Joel Bernstein <joels...@gmail.com> wrote:

Can you post the exact streaming query you are using? The size of the index and the field types will help understand the issue as well. Also, are you seeing different performance behaviors after the upgrade, or just testing the streaming for the first time on 6.6.1?
When using the /export handler to stream, the results are pulled from the DocValues caches, which have in-memory structures that need to be built. If your query returns a large number of string fields you could see a delay as those caches are built for the first time or rebuilt following a commit. You can use static warming queries to warm the fields in the background following a commit. After the caches are built, queries should start streaming immediately on subsequent requests. The 70 second response time with only 541 tuples sounds like it might be caused by the caches being rebuilt.

In general, though, the /export handler will slow down as you add more fields to the field list and sort list. But if you post your query and information on what types the fields in the field list are, I can give you an idea of the kind of performance I would expect.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 9, 2017 at 9:54 AM, Lanny Ripple <la...@spotright.com> wrote:

We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on our way to 7, and I'm getting surprising /stream results.

In one example I /select (wt=csv) and /stream [using search(..., wt=javabin)] with a query that gives a result set size of 541 tuples. The select comes back in under a second. The stream takes 70 seconds. Should I expect this much difference?

I then /select and /stream over a query with a result set size of 3.5M documents. The select takes 14 minutes. The stream takes just under 7 minutes using `curl`. When I use solrj I get

    Truncated chunk ( expected size: 32768; actual size: 13830)","trace":"org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 13830)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
        at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
        at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        ...

I found a reference to this being from a timeout of the HTTP session in CloudSolrStream but couldn't find a bug in Jira on the topic. Digging around in the source (yay OSS) I found that I could get hold of the CloudSolrClient and up the SOTimeout, so that's working now.

The documentation describes /stream as "returning data as soon as available" but there seems to be a HUGE startup latency. Any thoughts on how to reduce that?
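For what it's worth, one way to raise the socket timeout on the clients the stream uses is to hand it a client cache built on an HttpClient with a long socket timeout, roughly like this (a sketch, not my exact code; zkHost and the timeout value are placeholders):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.solr.client.solrj.impl.HttpClientUtil;
    import org.apache.solr.client.solrj.io.SolrClientCache;
    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
    import org.apache.solr.client.solrj.io.stream.StreamContext;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class StreamWithLongTimeout {
      public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr";   // placeholder

        // HttpClient with a long socket timeout so a slow /export response
        // is not cut off mid-chunk.
        ModifiableSolrParams httpParams = new ModifiableSolrParams();
        httpParams.set(HttpClientUtil.PROP_SO_TIMEOUT, 3600 * 1000);  // 1 hour, in ms
        CloseableHttpClient httpClient = HttpClientUtil.createClient(httpParams);

        StreamContext context = new StreamContext();
        context.setSolrClientCache(new SolrClientCache(httpClient));

        // Same expression as in /tmp/expr, built as a search stream.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "outs:tokenA");
        params.set("fl", "id,token");
        params.set("sort", "id asc");
        params.set("qt", "/export");
        params.set("wt", "javabin");

        CloudSolrStream stream = new CloudSolrStream(zkHost, "graph-october", params);
        stream.setStreamContext(context);

        long count = 0;
        try {
          stream.open();
          for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
            count++;
          }
        } finally {
          stream.close();
        }
        System.out.println("Read " + count + " tuples");
      }
    }

Setting the timeout on the HttpClient that backs the SolrClientCache means every per-shard client the stream opens inherits it.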