Thank you, Edward and Erick.

In this environment, hard commits run every 60s with openSearcher=false, and
soft commits are disabled.
We have the luxury of building the index, then opening searchers and adding
replicas afterward.

We'll monitor the segment merging and lengthen the commit time as suggested
- thank you!
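
For reference, a rough sketch of what those settings look like in
solrconfig.xml (the values just mirror the description above):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>-1</maxTime>               <!-- soft commits disabled -->
  </autoSoftCommit>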




On Wed, Mar 18, 2020 at 5:45 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Ah, ok. Then your spikes were probably being caused by segment merging,
> which would account for it being on different machines and running for a
> long time. Segment merging is a very expensive operation...
>
> As Edward mentioned, your commit settings come into play. You could easily
> be creating much smaller segments because of frequent commits. I'd check the
> indexes to see how small the smallest segments are while indexing; you can
> do that through the admin UI. If they're much smaller than your
> ramBufferSizeMB, lengthen the commit interval...
>
> The default merge policy caps segments at 5 GB, btw.
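>
> If you ever need to change that cap, it's the merge policy in
> solrconfig.xml's <indexConfig>, roughly (5000 MB is the default, so this is
> only illustrative):
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <double name="maxMergedSegmentMB">5000</double>
> </mergePolicyFactory>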
>
> Finally, indexing throughput should scale roughly linearly with the number of
> shards. You should be able to saturate the CPUs with enough client threads.
>
> Best,
> Erick
>
> On Wed, Mar 18, 2020, 12:04 Edward Ribeiro <edward.ribe...@gmail.com>
> wrote:
>
> > What are your hard and soft commit settings? They can have a large
> > impact on write throughput.
> >
> > Best,
> > Edward
> >
> > On Wed, Mar 18, 2020 at 11:43 AM Tim Robertson
> > <timrobertson...@gmail.com> wrote:
> > >
> > > Thank you Erick
> > >
> > > I should have been clearer that this is a bulk load job into a
> > > write-only cluster (until loading finishes, when it becomes read-only),
> > > and it is the write throughput I was chasing.
> > >
> > > I made some changes and have managed to get it working closer to what I
> > > expect. I'll summarise them here in case anyone stumbles on this thread,
> > > but please note this was just the result of a few tuning experiments and
> > > is not definitive:
> > >
> > > - Increased the shard count so that the number of shards matched the
> > >   number of virtual CPU cores on each machine
> > > - Set ramBufferSizeMB to 2048 (config sketch below)
> > > - Increased the parallelization in the loading job (i.e. ran the job
> > >   across more Spark cores concurrently)
> > > - Dropped to batches of 500 docs sent instead of 1000
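> > >
> > > For anyone finding this later, the ramBufferSizeMB change is the standard
> > > indexConfig setting in solrconfig.xml, something like:
> > >
> > > <indexConfig>
> > >   <ramBufferSizeMB>2048</ramBufferSizeMB>
> > > </indexConfig>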
> > >
> > >
> > > On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson <erickerick...@gmail.com>
> > > wrote:
> > >
> > > > The Apache mail server strips attachments pretty aggressively, so I
> > > > can’t see your attachment.
> > > >
> > > > About the only way to diagnose would be to take a thread dump of the
> > > > machine that’s running hot.
> > > >
> > > > There are a couple of places I’d look:
> > > >
> > > > 1> What happens if you don’t return any non-docValue fields? To return
> > > > stored fields, the doc must be fetched and decompressed. That doesn’t
> > > > fit very well with your observation that only one node runs hot, but
> > > > it’s worth checking (example query below).
> > > >
> > > > 2> Return one docValues=true field and search only on a single field
> > > > (with different values of course). Does that follow this pattern? What
> > > > I’m wondering about here is whether the delays are because you’re
> > > > swapping index files in and out of memory. Again, that doesn’t really
> > > > explain high CPU utilization; if that were the case I’d expect you to be
> > > > I/O bound.
> > > >
> > > > 3> I’ve seen indexes with this many fields perform reasonably well BTW.
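> > > >
> > > > For 1>, a quick check is a query that asks only for a docValues field,
> > > > something like this (field names here are just placeholders):
> > > >
> > > > http://host:8983/solr/yourCollection/select?q=some_field:foo&fl=some_dv_field&rows=10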
> > > >
> > > > How many fields are you returning? One thing that happens is that when
> > > > a query comes in to a node, sub-queries are sent out to one replica of
> > > > each shard, and the results from each shard are sorted by one node and
> > > > returned to the client. Unless you’re returning lots and lots of fields
> > > > and/or many rows, this shouldn’t run “for many minutes”, but it’s
> > > > something to look for.
> > > >
> > > > When this happens, what is your query response time like? I’m assuming
> > > > it’s very slow.
> > > >
> > > > But these are all shots in the dark; some thread dumps would be where
> > > > I’d start.
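> > > >
> > > > (e.g. run jstack <solr-pid> a few times while a node is hot, or use the
> > > > Thread Dump screen in the admin UI.)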
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Mar 18, 2020, at 6:55 AM, Tim Robertson <timrobertson...@gmail.com> wrote:
> > > > >
> > > > > Hi all
> > > > >
> > > > > We load Solr (8.4.1) from Spark and are trying to grow the schema
> > > > > with some dynamic fields that will result in around 500-600 indexed
> > > > > fields per doc.
> > > > >
> > > > > Currently, we see ~300 fields/doc work very well into an 8-node Solr
> > > > > cluster, with CPU nicely balanced across the cluster, and we saturate
> > > > > our network.
> > > > >
> > > > > However, growing to ~500-600 fields we see incoming network traffic
> > > > > drop to around a quarter, and in the Solr cluster we see low CPU on
> > > > > most machines but always one machine with high load (it is the Solr
> > > > > process). That machine will stay high for many minutes, and then
> > > > > another will go high - see the CPU graph [1]. I've played with
> > > > > changing shard counts, but beyond 32 I didn't see any gains. There is
> > > > > only one replica per shard; each machine runs on AWS with an
> > > > > EFS-mounted disk running only Solr 8; ZK is on a different set of
> > > > > machines.
> > > > >
> > > > > Can anyone please throw out ideas of what you would do to tune Solr
> > > > > for large amounts of dynamic fields?
> > > > >
> > > > > Does anyone have a guess at what the single high-CPU node is doing
> > > > > (some kind of metrics aggregation, maybe)?
> > > > >
> > > > > Thank you all,
> > > > > Tim
> > > > >
> > > > > [1] (CPU graph - the attachment was stripped by the mail server)
