Thank you, Erick

I should have been clearer that this is a bulk-load job into a write-only
cluster (it becomes read-only once the load finishes), and it is the write
throughput I was chasing.

I made some changes and have managed to get it working closer to what I
expect.  I'll summarise them here in case anyone stumbles on this thread,
but please note these changes were just the result of a few tuning
experiments and are not definitive:

- Increased the shard count so that there were as many shards as virtual
CPU cores on each machine
- Set ramBufferSizeMB to 2048 (in the <indexConfig> section of
solrconfig.xml)
- Increased the parallelism in the loading job (i.e. ran the job across
more Spark cores concurrently)
- Dropped to batches of 500 docs per update request instead of 1000 (a
rough sketch of the loading side follows below)
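
For anyone who finds this later, the loading side now looks roughly like
the sketch below. This is not our actual job, just a minimal Scala outline
of the parallelism and batching changes; the ZooKeeper addresses, the
"my_collection" name and the toy record source are placeholders, and
ramBufferSizeMB lives in solrconfig.xml rather than in the job itself.

import java.util.Optional
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.sql.SparkSession

object SolrBulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("solr-bulk-load-sketch").getOrCreate()

    // Placeholder source; the real job reads a much wider dataset with
    // several hundred (mostly dynamic) fields per record.
    val records = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => Map("id" -> i.toString, "field_a" -> s"value_$i"))

    // Write parallelism is controlled by the partition count; match it to
    // the number of Spark cores you want sending updates concurrently.
    val writerPartitions = 256

    records.repartition(writerPartitions).foreachPartition { rows =>
      // One SolrJ client per partition; ZK addresses and collection name
      // are placeholders.
      val client = new CloudSolrClient.Builder(
        List("zk1:2181,zk2:2181,zk3:2181").asJava, Optional.empty[String]()).build()
      client.setDefaultCollection("my_collection")
      try {
        // 500-doc batches rather than 1000.
        rows.grouped(500).foreach { batch =>
          val docs = batch.map { row =>
            val doc = new SolrInputDocument()
            row.foreach { case (k, v) => doc.addField(k, v) }
            doc
          }
          client.add(docs.asJava) // no explicit commit; rely on autoCommit
        }
      } finally {
        client.close()
      }
    }
    spark.stop()
  }
}

The two knobs from the list that show up here are the repartition count
(how many writers run at once) and grouped(500) (the batch size); the rest
is boilerplate and your mileage will certainly vary.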


On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> The Apache mail server strips attachments pretty aggressively, so I can’t
> see your attachment.
>
> About the only way to diagnose would be to take a thread dump of the
> machine that’s running hot.
>
> There are a couple of places I’d look:
>
> 1> what happens if you don’t return any non-docValue fields? To return
> stored fields, the doc must be fetched and decompressed. That doesn’t fit
> very well with your observation that only one node runs hot, but it’s worth
> checking.
>
> 2> Return one doc-value=true field and search only on a single field (with
> different values of course). Does that follow this pattern? What I’m
> wondering about here is whether the delays are because you’re swapping
> index files in and out of memory. Again, that doesn’t really explain high
> CPU utilization; if that were the case I’d expect you to be I/O bound.
>
> 3> I’ve seen indexes with this many fields perform reasonably well BTW.
>
> How many fields are you returning? One thing that happens is that when a
> query comes in to a node, sub-queries are sent out to one replica of each
> shard, and the results from each shard are sorted by one node and returned
> to the client. Unless you’re returning lots and lots of fields and/or many
> rows, this shouldn’t run “for many minutes”, but it’s something to look for.
>
> When this happens, what is your query response time like? I’m assuming
> it’s very slow.
>
> But these are all shots in the dark, some thread dumps would be where I’d
> start.
>
> Best,
> Erick
>
> > On Mar 18, 2020, at 6:55 AM, Tim Robertson <timrobertson...@gmail.com>
> > wrote:
> >
> > Hi all
> >
> > We load Solr (8.4.1) from Spark and are trying to grow the schema with
> > some dynamic fields that will result in around 500-600 indexed fields
> > per doc.
> >
> > Currently, we see ~300 fields/doc work very well into an 8-node Solr
> > cluster with CPU nicely balanced across the cluster and we saturate our
> > network.
> >
> > However, growing to ~500-600 fields we see incoming network traffic
> > drop to around a quarter, and in the Solr cluster we see low CPU on most
> > machines but always one machine with high load (it is the Solr process).
> > That machine will stay high for many minutes, and then another will go
> > high - see CPU graph [1]. I've played with changing shard counts but saw
> > no gains beyond 32. There is only one replica per shard; each machine
> > runs on AWS with an EFS-mounted disk running only Solr 8; ZK is on a
> > different set of machines.
> >
> > Can anyone please throw out ideas of what you would do to tune Solr
> > for large numbers of dynamic fields?
> >
> > Does anyone have a guess at what the single high-CPU node is doing
> > (some kind of metrics aggregation, maybe)?
> >
> > Thank you all,
> > Tim
> >
> > [1]
> >
> >
> >
>
>
