Hi Lanny,

For long-running streaming queries with many shards and huge result sets, SolrJ's default settings for HTTP max connections / max connections per host may not be enough. If you are using the worker collection (/stream), it relies on SolrClientCache to dispense HTTP clients with the default limits. It could be useful to turn on debug logging and check.
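If you end up going that route, something roughly like this is what I mean (just a sketch; the HttpClientUtil property names and the SolrClientCache(HttpClient) constructor can vary a bit between SolrJ versions):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.solr.client.solrj.impl.HttpClientUtil;
    import org.apache.solr.client.solrj.io.SolrClientCache;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class StreamClientCacheSetup {
      // A SolrClientCache backed by an HttpClient with larger pool limits
      // than the SolrJ defaults, for use in a StreamContext.
      public static SolrClientCache buildClientCache() {
        ModifiableSolrParams httpParams = new ModifiableSolrParams();
        httpParams.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1024);         // total pool size
        httpParams.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 128); // per shard host
        CloseableHttpClient httpClient = HttpClientUtil.createClient(httpParams);
        return new SolrClientCache(httpClient);
      }
    }

If you drive the expression from SolrJ, that cache can be set on the StreamContext you pass to the stream.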
Thanks,
Susmit

On Thu, Nov 9, 2017 at 8:35 PM, Lanny Ripple <la...@spotright.com> wrote:

First, Joel, thanks for your help on this.

1) I have to admit we really haven't played with a lot of system tuning recently (before DocValues, for sure). We'll go through another tuning round.

2) At the time I ran these numbers this morning we were not indexing. We build this collection once a month and then client jobs can update it. I was watching our job queue and there were no jobs running at that time. It's possible someone else was querying against other collections, but they wouldn't have been updating this collection.

3) I'll try /export on each node. We're pretty cookie-cutter, with all nodes being the same and configuration controlled with puppet. We collect system metrics to a Graphite display panel and no host looks out of sorts relative to the others. That said, I wouldn't be surprised if a node was out of whack.

Thanks again.
-ljr

On Thu, Nov 9, 2017 at 2:34 PM Joel Bernstein <joels...@gmail.com> wrote:

In my experience this should be very fast:

    search(graph-october,
           q="outs:tokenA",
           fl="id,token",
           sort="id asc",
           qt="/export",
           wt=javabin)

When the DocValues cache is statically warmed for the two output fields I would see somewhere around 500,000 docs per second exported from a single node.

You have sixteen shards, which would give you 16 times the throughput. But of course the data is being sent back through the single aggregator node, so your throughput is only as fast as the aggregator node can process the results.

This does not explain the slowness that you are seeing. I see a couple of possible reasons:

1) The memory on the system is not tuned optimally. You allocated a large amount of memory to the heap and are not providing enough memory to the OS filesystem cache. Lucene DocValues use the OS filesystem cache for the DocValues caches, so I would bump down the size of the heap considerably.

2) Are you indexing while querying at all? If you are, you would need to be statically warming the DocValues caches for the id field, which is used for sorting. Following each commit there is a top-level DocValues cache that is rebuilt for sorting on string fields. If you use a static warming query, it will warm the cache before making the new searcher live for searches. I would also pause indexing if possible and run queries only, to see how it runs without indexing.

3) Try running a query directly to the /export handler on each node. Possibly one of your nodes is slow for some reason and that is causing the entire query to respond slowly.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 9, 2017 at 2:22 PM, Lanny Ripple <la...@spotright.com> wrote:

Happy to do so. I am testing streams for the first time, so we don't have any 5.x experience. The collection I'm testing was loaded after going to 6.6.1 and fixing up the solrconfig for lucene_version and removing the /export clause. The indexes run 57G per replica. We are using 64G hosts with 48G heaps using G1GC, but this isn't our only large collection. This morning as I'm running these our cluster is quiet. I realize some of the performance we are seeing is going to be our data size, so I'm not expecting any silver bullets.
We are storing 858M documents that are basically

    id: String
    token: String
    outs: String[]
    outsCount: Int

All stored=true, docvalues=true.

The `outs` reference a select number of tokens (1.26M). Here are current percentiles of our outsCount:

    `outsCount`
    50%        12
    85%       127
    98%       937
    99.9%  16,284

I'll show the /stream query, but I'm setting up the same thing in solrj. I'm going to name our small result set "tokenA" and our large one "tokenV".

    search(graph-october,
           q="outs:tokenA",
           fl="id,token",
           sort="id asc",
           qt="/export",
           wt=javabin)

I've placed this in file /tmp/expr and invoke it with

    curl -sSN -m 3600 --data-urlencode expr@/tmp/expr http://host/solr/graph-october/stream

The large result set query replaces "tokenA" with "tokenV".

My /select query is

    curl -sSN -m 3600 -d wt=csv -d rows=10000 -d q="outs:tokenA" -d fl="id,token" http://host/solr/graph-october/select

So the behavior I'm seeing (times are from running the commands back to back):

    /select tokenA  0.296s
    /search tokenA  8.787s
    /search tokenA  8.513s
    /select tokenA  0.292s

I was going to get some "tokenV" times here but curl is hitting "Truncated chunk" errors.

I can pick a "tokenB" which hasn't been run but which I know from the originating data has a result set size of about 5800. Times are

    /search tokenB  11.486s
    /search tokenB   0.960s
    /select tokenB   2.534s
    /select tokenB   2.117s
    /search tokenB   1.208s

So that's closer to the behavior I'd expect and desire. Back to "tokenA".

    /search tokenA   0.289s
    /search tokenA   0.249s
    /select tokenA   0.342s
    /select tokenA   0.329s
    /search tokenA  11.848s

Here's "tokenV" out of a solrj counting job.

_run1_
Token tokenV has 3497203 outs.
Time to first read: PT37.577S
Runtime: PT1M43.822S

_run2_
Token tokenV has 3497203 outs.
Time to first read: PT18.764S
Runtime: PT1M16.655S

So hmm. Given this I'd hazard there's nothing unexpected about /stream and the problem is the load on our cluster.

So I thought I'd run "tokenA" through the count job.

_run1_
Token tokenA has 513* outs.
Time to first read: PT12.105S
Runtime: PT12.114S

* The correct count. Last post I also counted the csv header (514) then transposed the digits. :/

_run2_
Token tokenA has 513 outs.
Time to first read: PT7.85S
Runtime: PT7.86S

_run3_
/select tokenA 0.229s

_run4_
Token tokenA has 513 outs.
Time to first read: PT11.485S
Runtime: PT11.495S

And we are back to not nearly as fast.

Here's my counting code.
https://gist.github.com/LannyRipple/8e47769155c766b33bf37bcf901b17e9

On Thu, Nov 9, 2017 at 11:17 AM Joel Bernstein <joels...@gmail.com> wrote:

Can you post the exact streaming query you are using? The size of the index and the field types will help understand the issue as well. Also, are you seeing different performance behaviors after the upgrade, or just testing the streaming for the first time on 6.6.1?
When using the /export handler to stream, the results are pulled from the DocValues caches, which have in-memory structures that need to be built. If your query returns a large number of string fields you could see a delay as those caches are built for the first time or rebuilt following a commit. You can use static warming queries to warm the fields in the background following a commit. After the caches are built, queries should start streaming immediately on subsequent requests. The 70 second response time with only 541 tuples sounds like it might be caused by the caches being rebuilt.

In general, though, the /export handler will slow down as you add more fields to the field list and sort list. But if you post your query and information on what types the fields in the field list are, I can give you an idea of the kind of performance I would expect.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 9, 2017 at 9:54 AM, Lanny Ripple <la...@spotright.com> wrote:

We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on our way to 7, and I'm getting surprising /stream results.

In one example I /select (wt=csv) and /stream [using search(..., wt=javabin)] with a query that gives a result set size of 541 tuples. The select comes back in under a second. The stream takes 70 seconds. Should I expect this much difference?

I then /select and /stream over a query with a result set size of 3.5M documents. The select takes 14 minutes. The stream takes just under 7 minutes using `curl`. When I use solrj I get

    Truncated chunk ( expected size: 32768; actual size: 13830)","trace":"org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 13830)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
        at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
        at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        ...

I found a reference to this being from a timeout of the HTTP session in CloudSolrStream but couldn't find a bug in Jira on the topic. Digging around in the source (yay OSS) I found that I could get hold of the CloudSolrClient and up the SOTimeout, so that's working now.

The documentation describes /stream as "returning data as soon as available" but there seems to be a HUGE startup latency. Any thoughts on how to reduce that?
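For what it's worth, one way to raise the socket timeout on the clients the stream uses is to hand it a client cache built on an HttpClient with a long socket timeout, roughly like this (a sketch, not my exact code; zkHost and the timeout value are placeholders):

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.solr.client.solrj.impl.HttpClientUtil;
    import org.apache.solr.client.solrj.io.SolrClientCache;
    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
    import org.apache.solr.client.solrj.io.stream.StreamContext;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class StreamWithLongTimeout {
      public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr";   // placeholder

        // HttpClient with a long socket timeout so a slow /export response
        // is not cut off mid-chunk.
        ModifiableSolrParams httpParams = new ModifiableSolrParams();
        httpParams.set(HttpClientUtil.PROP_SO_TIMEOUT, 3600 * 1000);  // 1 hour, in ms
        CloseableHttpClient httpClient = HttpClientUtil.createClient(httpParams);

        StreamContext context = new StreamContext();
        context.setSolrClientCache(new SolrClientCache(httpClient));

        // Same expression as in /tmp/expr, built as a search stream.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "outs:tokenA");
        params.set("fl", "id,token");
        params.set("sort", "id asc");
        params.set("qt", "/export");
        params.set("wt", "javabin");

        CloudSolrStream stream = new CloudSolrStream(zkHost, "graph-october", params);
        stream.setStreamContext(context);

        long count = 0;
        try {
          stream.open();
          for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
            count++;
          }
        } finally {
          stream.close();
        }
        System.out.println("Read " + count + " tuples");
      }
    }

Setting the timeout on the HttpClient that backs the SolrClientCache means every per-shard client the stream opens inherits it.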