Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Hi all

We load Solr (8.4.1) from Spark and are trying to grow the schema with some
dynamic fields that will result in around 500-600 indexed fields per doc.
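(For anyone unfamiliar with dynamic fields: rules like the following in the managed schema are what expand into that many concrete fields. The patterns and types below are only illustrative, not our actual schema.)

    <dynamicField name="*_s"  type="string" indexed="true" stored="true"/>
    <dynamicField name="*_i"  type="pint"   indexed="true" stored="true"/>
    <dynamicField name="*_dt" type="pdate"  indexed="true" stored="true"/>

Each distinct suffix-matching field name that appears in a document becomes its own indexed field.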

Currently, we see ~300 fields/doc load very well into an 8-node Solr
cluster, with CPU nicely balanced across the cluster, and we saturate our
network.

However, growing to ~500-600 fields, we see incoming network traffic drop to
around a quarter of that, and in the Solr cluster we see low CPU on most
machines but always one machine with high load (it is the Solr process). That
machine stays high for many minutes, and then another goes high -
see the CPU graph [1]. I've played with changing shard counts, but beyond 32
I didn't see any gains. There is only one replica per shard; each machine
runs on AWS with an EFS-mounted disk and runs only Solr 8, and ZooKeeper is on
a different set of machines.

Can anyone please throw out ideas of what you would do to tune Solr for
large numbers of dynamic fields?

Does anyone have a guess at what the single high-CPU node is doing (some
kind of metrics aggregation, maybe)?

Thank you all,
Tim

[1] [image: image.png - CPU utilization graph; attachment not preserved in the archive]


Re: Tuning for 500+ field schemas

2020-03-18 Thread Erick Erickson
The Apache mail server strips attachments pretty aggressively, so I can’t see 
your attachment.

About the only way to diagnose would be to take a thread dump of the machine 
that’s running hot.

There are a couple of places I’d look:

1> What happens if you don’t return any non-docValues fields? To return stored 
fields, the doc must be fetched and decompressed. That doesn’t fit very well 
with your observation that only one node runs hot, but it’s worth checking.

2> Return one docValues=true field and search only on a single field (with 
different values, of course). Does that follow this pattern? What I’m wondering 
about here is whether the delays are because you’re swapping index files in and 
out of memory. Again, that doesn’t really explain high CPU utilization; if that 
were the case I’d expect you to be I/O-bound.

3> I’ve seen indexes with this many fields perform reasonably well BTW.

How many fields are you returning? One thing that happens is that when a query 
comes in to a node, sub-queries are sent out to one replica of each shard, and 
the results from each shard are merged and sorted by that coordinating node before 
being returned to the client. Unless you’re returning lots and lots of fields and/or 
many rows, this shouldn’t run “for many minutes”, but it’s something to look for.

When this happens, what is your query response time like? I’m assuming it’s 
very slow.

But these are all shots in the dark, some thread dumps would be where I’d start.

Best,
Erick



Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Thank you Erick

I should have been clearer that this is a bulk load job into a write-only
cluster (it becomes read-only once loading finishes), and it is write
throughput I was chasing.

I made some changes and have managed to get it working closer to what
I expect.  I'll summarise them here in case anyone stumbles on
this thread, but please note this is just the result of a few tuning
experiments and is not definitive:

- Increased the shard count, so there were the same number of shards as virtual
CPU cores on each machine
- Set ramBufferSizeMB to 2048 (see the solrconfig.xml sketch below)
- Increased the parallelization in the loading job (i.e. ran the job across
more Spark cores concurrently)
- Dropped to batches of 500 docs instead of 1000
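For concreteness, the server-side pieces of that live in solrconfig.xml and at collection-creation time. These are sketches with illustrative names and counts, not our exact setup:

    <!-- solrconfig.xml, inside <indexConfig>: buffer more docs in memory before flushing a segment -->
    <indexConfig>
      <ramBufferSizeMB>2048</ramBufferSizeMB>
    </indexConfig>

    # Collections API: 32 shards spread over 8 nodes, one replica each
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=32&replicationFactor=1&maxShardsPerNode=4'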




Re: How do *you* restrict access to Solr?

2020-03-18 Thread Ryan W
On Tue, Mar 17, 2020 at 6:05 AM Jan Høydahl  wrote:

> You can consider upgrading to Solr 8.5 which is to be released in a couple
> of days, which makes it easy to whitelist IP addresses in solr.in.sh:
>

Thanks.  That is good news, though it won't help me this time around.  My
application framework (Drupal) doesn't support Solr 8.  I may try Solr 6
again, or take another stab at getting the Basic Authentication plugin to
work in Solr 7.  My Solr install isn't web-accessible, so the only threats
would come from inside the network.
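For anyone who does take that stab, a minimal security.json for the Basic Authentication plugin looks roughly like the following (this mirrors the stock example from the reference guide; the credentials entry is the documented hash for user "solr" / password "SolrRocks" and must be replaced with your own):

    {
      "authentication": {
        "blockUnknown": true,
        "class": "solr.BasicAuthPlugin",
        "credentials": {
          "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
        }
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "permissions": [{ "name": "security-edit", "role": "admin" }],
        "user-role": { "solr": "admin" }
      }
    }

In SolrCloud it is uploaded to ZooKeeper as /security.json; in standalone mode it sits in $SOLR_HOME.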



>
> # Allow IPv4/IPv6 localhost, the 192.168.0.x IPv4 network, and
> 2000:123:4:5:: IPv6 network.
> SOLR_IP_WHITELIST="127.0.0.1, [::1], 192.168.0.0/24, [2000:123:4:5::]/64"
>
>
> https://lucene.apache.org/solr/guide/8_5/securing-solr.html#enable-ip-access-control
>
> But please please do not expose Solr, even if secured, to untrusted
> networks and never to the public internet.
>
> Jan
>
> > 16. mar. 2020 kl. 16:46 skrev Ryan W :
> >
> > On Mon, Mar 16, 2020 at 10:51 AM Susheel Kumar 
> > wrote:
> >
> >> Basic auth should help you to start
> >>
> >>
> https://lucene.apache.org/solr/guide/8_1/basic-authentication-plugin.html
> >
> >
> >
> > Thanks.  I think I will give up on the plugin system.  I haven't been
> able
> > to get the plugin system to work, and it creates too many opportunities
> for
> > human error.  Even if I can get it working this week, what about 6 months
> > from now or a year from now when something goes wrong and I have to debug
> > it.  It seems like far too much overhead to provide the desired security
> > benefit, except perhaps in situations where an organization has Solr
> > specialists who can maintain the system.
>
>


Re: How do *you* restrict access to Solr?

2020-03-18 Thread Markus Kalkbrenner
> My application framework (Drupal) doesn't support Solr 8.

That's not true. As with Solr itself, you just have to update to recent versions
of the Drupal module.
As you can see at
https://travis-ci.org/github/mkalkbrenner/search_api_solr/builds/663153535, the
automated tests run against Solr 6.6.6, 7.7.2, and 8.4.1.

Best,
Markus



Re: Tuning for 500+ field schemas

2020-03-18 Thread Edward Ribeiro
What are your hard and soft commit settings? These can have a large
impact on write throughput.

Best,
Edward



Re: Tuning for 500+ field schemas

2020-03-18 Thread Erick Erickson
Ah, OK. Then your spikes were probably being caused by segment merging,
which would account for it being on different machines and running for a
long time. Segment merging is a very expensive operation...

As Edward mentioned, your commit settings come into play. You could easily
be creating much smaller segments than intended because of commits. I'd check
the index to see how small the smallest segments are while indexing; you can do
that through the admin UI. If they're much smaller than your ramBufferSizeMB,
lengthen the commit interval...

The default merge policy caps segments at 5GB, BTW.
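For reference, that cap lives in the merge policy configuration in solrconfig.xml; roughly the stock values look like this (a sketch for orientation, not a recommendation to change anything):

    <indexConfig>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <int name="maxMergeAtOnce">10</int>
        <int name="segmentsPerTier">10</int>
        <double name="maxMergedSegmentMB">5000</double> <!-- the ~5GB cap mentioned above -->
      </mergePolicyFactory>
    </indexConfig>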

Finally, indexing throughput should scale roughly linearly with the number of
shards. You should be able to saturate the CPUs with enough client threads.
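As a rough illustration of "enough client threads" -- not the actual Spark loading code, and all host, collection, and field names below are made up -- a plain SolrJ loader with a handful of worker threads and modest batches looks something like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoader {
      private static final int BATCH_SIZE = 500;  // smaller batches, as in the list above
      private static final int THREADS = 8;       // enough concurrency to keep shard CPUs busy

      public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder(
            List.of("zk1:2181,zk2:2181,zk3:2181"), Optional.empty()).build();
        client.setDefaultCollection("my_collection");

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
          final int worker = t;
          pool.submit(() -> {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = worker; i < 1_000_000; i += THREADS) {  // stand-in for the real data source
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc-" + i);
              doc.addField("example_s", "value-" + i);           // one of the many dynamic fields
              batch.add(doc);
              if (batch.size() == BATCH_SIZE) {
                client.add(batch);                               // no explicit commit; autoCommit handles it
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              client.add(batch);
            }
            return null;
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.close();
      }
    }

CloudSolrClient is thread-safe, so one instance can be shared across the workers; the point is only that indexing concurrency has to come from the client side.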

Best,
Erick



Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Thank you Edward, Erick,

In this environment, hard commits are at 60s with openSearcher=false, and soft
commits are off.
We have the luxury of building the index first, then opening searchers and adding
replicas afterward.
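In solrconfig.xml terms that is roughly the following, inside <updateHandler> (a sketch of the settings described above, not the exact file):

    <autoCommit>
      <maxTime>60000</maxTime>           <!-- hard commit every 60s -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>-1</maxTime>              <!-- soft commits disabled -->
    </autoSoftCommit>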

We'll monitor the segment merging and lengthen the commit time as suggested
- thank you!



