What are your hard and soft commit settings? They can have a large impact on write throughput.
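For a write-only bulk load, a common starting point is to disable soft commits entirely and let hard commits run on an interval without opening a searcher, then issue one explicit commit when the job finishes. A minimal solrconfig.xml sketch of just the commit-related bits (the 60-second interval is only illustrative, not a recommendation for your cluster):

  <!-- solrconfig.xml (sketch): commit settings for a bulk load into a
       write-only collection; values are illustrative, not prescriptive -->
  <updateHandler class="solr.DirectUpdateHandler2">

    <!-- Hard commit: flush segments and roll the transaction log
         periodically, but don't open a new searcher during the load -->
    <autoCommit>
      <maxTime>60000</maxTime>          <!-- every 60 seconds -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- Soft commits only control search visibility, which doesn't matter
         until the cluster flips to read-only, so disable them -->
    <autoSoftCommit>
      <maxTime>-1</maxTime>
    </autoSoftCommit>

  </updateHandler>

A single explicit commit at the end of the Spark load then makes everything visible once you switch the cluster to read-only.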
Best,
Edward

On Wed, Mar 18, 2020 at 11:43 AM Tim Robertson <timrobertson...@gmail.com> wrote:
>
> Thank you Erick
>
> I should have been clearer that this is a bulk load job into a write-only
> cluster (until loaded, when it becomes read-only), and it is the write
> throughput I was chasing.
>
> I made some changes and have managed to get it working closer to what I
> expect. I'll summarise them here in case anyone stumbles on this thread,
> but please note this was just the result of a few tuning experiments and
> is not definitive:
>
> - Increased the shard count so there were as many shards as virtual CPU
>   cores on each machine
> - Set ramBufferSizeMB to 2048
> - Increased the parallelism of the loading job (i.e. ran the job across
>   more Spark cores concurrently)
> - Dropped to sending batches of 500 docs instead of 1000
>
>
> On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > The Apache mail server strips attachments pretty aggressively, so I can’t
> > see your attachment.
> >
> > About the only way to diagnose would be to take a thread dump of the
> > machine that’s running hot.
> >
> > There are a couple of places I’d look:
> >
> > 1> What happens if you don’t return any non-docValues fields? To return
> > stored fields, the doc must be fetched and decompressed. That doesn’t fit
> > very well with your observation that only one node runs hot, but it’s
> > worth checking.
> >
> > 2> Return one docValues=true field and search only on a single field
> > (with different values, of course). Does that follow this pattern? What
> > I’m wondering about here is whether the delays are because you’re
> > swapping index files in and out of memory. Again, that doesn’t really
> > explain high CPU utilization; if that were the case I’d expect you to be
> > I/O bound.
> >
> > 3> I’ve seen indexes with this many fields perform reasonably well, BTW.
> >
> > How many fields are you returning? One thing that happens is that when a
> > query comes in to a node, sub-queries are sent out to one replica of each
> > shard, and the results from each shard are sorted by one node and
> > returned to the client. Unless you’re returning lots and lots of fields
> > and/or many rows, this shouldn’t run “for many minutes”, but it’s
> > something to look for.
> >
> > When this happens, what is your query response time like? I’m assuming
> > it’s very slow.
> >
> > But these are all shots in the dark; some thread dumps would be where I’d
> > start.
> >
> > Best,
> > Erick
> >
> > > On Mar 18, 2020, at 6:55 AM, Tim Robertson <timrobertson...@gmail.com>
> > > wrote:
> > >
> > > Hi all
> > >
> > > We load Solr (8.4.1) from Spark and are trying to grow the schema with
> > > some dynamic fields that will result in around 500-600 indexed fields
> > > per doc.
> > >
> > > Currently, we see ~300 fields/doc work very well into an 8-node Solr
> > > cluster, with CPU nicely balanced across the cluster, and we saturate
> > > our network.
> > >
> > > However, growing to ~500-600 fields we see incoming network traffic
> > > drop to around a quarter, and in the Solr cluster we see low CPU on
> > > most machines but always one machine with high load (it is the Solr
> > > process). That machine will stay high for many minutes, and then
> > > another will go high - see CPU graph [1]. I've played with changing
> > > shard counts, but beyond 32 I didn't see any gains.
> > > There is only one replica per shard, each machine runs on AWS with an
> > > EFS-mounted disk and runs only Solr 8, and ZK is on a different set of
> > > machines.
> > >
> > > Can anyone please throw out ideas for how you would tune Solr for
> > > large numbers of dynamic fields?
> > >
> > > Does anyone have a guess at what the single high-CPU node is doing
> > > (some kind of metrics aggregation, maybe)?
> > >
> > > Thank you all,
> > > Tim
> > >
> > > [1]
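P.S. For anyone who finds this thread later: the ramBufferSizeMB setting Tim mentions lives in the <indexConfig> section of solrconfig.xml. A sketch for context only; 2048 is the value Tim reported using, not a general recommendation (the Solr default is 100):

  <indexConfig>
    <!-- In-memory buffer Lucene fills before flushing a new segment;
         larger buffers mean fewer, larger flushes during bulk indexing -->
    <ramBufferSizeMB>2048</ramBufferSizeMB>
  </indexConfig>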