Thank you, Erick. I should have been clearer that this is a bulk load job into a write-only cluster (it becomes read-only once the load completes), and it is the write throughput I was chasing.
I made some changes and have managed to get it working more closely to
what I expect. I'll summarise them here in case anyone stumbles on this
thread, but please note this was just the result of a few tuning
experiments and is not definitive:

- Increased the shard count, so there were the same number of shards as
  virtual CPU cores on each machine
- Set ramBufferSizeMB to 2048
- Increased the parallelization in the loading job (i.e. ran the job
  across more Spark cores concurrently)
- Dropped to batches of 500 docs sent instead of 1000 (a rough sketch of
  the loading side is at the bottom of this mail, below the quoted thread)

On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson <erickerick...@gmail.com> wrote:

> The Apache mail server strips attachments pretty aggressively, so I can't
> see your attachment.
>
> About the only way to diagnose would be to take a thread dump of the
> machine that's running hot.
>
> There are a couple of places I'd look:
>
> 1> What happens if you don't return any non-docValue fields? To return
> stored fields, the doc must be fetched and decompressed. That doesn't fit
> very well with your observation that only one node runs hot, but it's
> worth checking.
>
> 2> Return one doc-value=true field and search only on a single field
> (with different values of course). Does that follow this pattern? What
> I'm wondering about here is whether the delays are because you're
> swapping index files in and out of memory. Again, that doesn't really
> explain high CPU utilization; if that were the case I'd expect you to be
> I/O bound.
>
> 3> I've seen indexes with this many fields perform reasonably well BTW.
>
> How many fields are you returning? One thing that happens is that when a
> query comes in to a node, sub-queries are sent out to one replica of each
> shard, and the results from each shard are sorted by one node and
> returned to the client. Unless you're returning lots and lots of fields
> and/or many rows, this shouldn't run "for many minutes", but it's
> something to look for.
>
> When this happens, what is your query response time like? I'm assuming
> it's very slow.
>
> But these are all shots in the dark; some thread dumps would be where I'd
> start.
>
> Best,
> Erick
>
> > On Mar 18, 2020, at 6:55 AM, Tim Robertson <timrobertson...@gmail.com> wrote:
> >
> > Hi all
> >
> > We load Solr (8.4.1) from Spark and are trying to grow the schema with
> > some dynamic fields that will result in around 500-600 indexed fields
> > per doc.
> >
> > Currently, we see ~300 fields/doc work very well into an 8-node Solr
> > cluster with CPU nicely balanced across the cluster and we saturate
> > our network.
> >
> > However, growing to ~500-600 fields we see incoming network traffic
> > drop to around a quarter, and in the Solr cluster we see low CPU on
> > most machines, but always one machine with high load (it is the Solr
> > process). That machine will stay high for many minutes, and then
> > another will go high - see CPU graph [1]. I've played with changing
> > shard counts, but beyond 32 didn't see any gains. There is only one
> > replica on each shard, each machine runs on AWS with an EFS-mounted
> > disk only running Solr 8, and ZK is on a different set of machines.
> >
> > Can anyone please throw out ideas of what you would do to tune Solr
> > for large amounts of dynamic fields?
> >
> > Does anyone have a guess on what the single high CPU node is doing
> > (some kind of metrics aggregation maybe?).
> >
> > Thank you all,
> > Tim
> >
> > [1]
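
P.S. In case it helps anyone copying the settings above, here is a rough
sketch of the loading side after the changes. It is illustrative only:
the collection name, ZooKeeper addresses and field names are placeholders
rather than our real ones, and our actual job runs this logic inside Spark
partitions rather than a single loop. The ramBufferSizeMB change goes in
solrconfig.xml under <indexConfig>, and the shard count is fixed when the
collection is created (numShards on the Collections API CREATE call).

// Sketch only: batched indexing with SolrJ (8.x), sending 500 docs per request.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoadSketch {
  private static final int BATCH_SIZE = 500; // dropped from 1000

  public static void main(String[] args) throws SolrServerException, IOException {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        List.of("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {
      client.setDefaultCollection("my_collection"); // placeholder collection name

      List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
      for (long i = 0; i < 1_000_000; i++) {  // stand-in for the real record stream
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Long.toString(i));
        doc.addField("example_field_s", "value"); // real docs carry ~500-600 dynamic fields
        batch.add(doc);

        if (batch.size() == BATCH_SIZE) {     // flush every 500 docs
          client.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);                    // flush the tail
      }
      client.commit();                        // single commit at the end of the bulk load
    }
  }
}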