Some ideas:

1> What version of Solr? Solr 7.3 completely rewrote Leader Initiated
Recovery, and 7.5 has other improvements for recovery; we're hoping
that the recovery situation is much improved.

2> In the 7x code line there are TLOG and PULL replicas. As of 7.5,
you can set things up so that queries are served by replica type, see:
https://issues.apache.org/jira/browse/SOLR-11982. This might help you
out: it moves all the indexing to the leader and reserves the rest
of the nodes for queries only, using old-style replication. I'm
assuming from your commit rate that latency between when updates
happen and when they become searchable isn't a big concern.
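
For illustration, here's a rough SolrJ sketch of both pieces: creating a
collection with TLOG + PULL replicas and steering queries to the PULL
replicas with shards.preference. The collection name, configset and
ZooKeeper address are made up, and this assumes SolrJ 7.5 or later; treat
it as a sketch, not a drop-in.

import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TlogPullSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181"), Optional.empty()).build()) {

      // 4 shards, no NRT replicas, 1 TLOG replica per shard (becomes the
      // leader and does all the indexing) and 1 PULL replica per shard
      // (serves queries, kept up to date via old-style replication).
      // "products" and "myConfigSet" are hypothetical names.
      CollectionAdminRequest
          .createCollection("products", "myConfigSet", 4, 0, 1, 1)
          .process(client);

      // At query time, prefer PULL replicas so searches stay off the
      // indexing nodes (shards.preference is what SOLR-11982 added).
      SolrQuery q = new SolrQuery("some query terms");
      q.set("shards.preference", "replica.type:PULL");
      QueryResponse rsp = client.query("products", q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}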

3> Just because the CPU isn't 100% doesn't mean Solr is running flat
out. There are I/O waits while sub-requests are serviced, and the like.

4> As for how to add faceting without slowing down querying, there's
no way. Extra work is extra work. Depending on _what_ you're faceting
on, you may be able to do some tricks, but without details it's hard
to say. You need to hit your query rate target first, though ;)

5> OOMs. Hmm, you say you're doing complex sorts; are all fields
involved in those sorts defined with docValues=true? Fields used in
function queries have to be, of course, but what about any sort
fields that aren't? And what about your _version_ field?
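
If you want to double-check, here is a rough SolrJ sketch that asks the
Schema API for each field's properties. The ZooKeeper address, collection
name and field names are made up, and docValues may be inherited from the
fieldType, so an absent key here isn't conclusive on its own.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class DocValuesCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical field list: put your sort/function-query fields here.
    List<String> sortFields = Arrays.asList("price", "popularity", "_version_");
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181"), Optional.empty()).build()) {
      for (String name : sortFields) {
        SchemaResponse.FieldResponse rsp =
            new SchemaRequest.Field(name).process(client, "products");
        Map<String, Object> field = rsp.getField();
        // Without docValues, sorting/functions uninvert the field onto the
        // heap (FieldCache), which is a classic source of OOMs.
        System.out.println(name + " -> docValues=" + field.get("docValues"));
      }
    }
  }
}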

6> You say "...indexed 2 times/day, as fast as Solr allows...". One
experiment I'd run is to test your QPS rate when there is _no_
indexing going on. That would give you a hint as to whether the
TLOG/PULL configuration would be helpful. There's been talk of
separate thread pools for indexing and querying to give queries a
better shot at the CPU, but that's not in place yet.
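
A quick-and-dirty way to get a latency baseline with indexing stopped is
something like the single-threaded sketch below (ZooKeeper address,
collection and query are made up). It only gives you per-request latency;
to actually probe 300-500 qps you'd want a concurrent load tool such as
JMeter or Gatling, and you'd compare the numbers against a run taken
during an indexing pass.

import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class QueryOnlyBaseline {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181"), Optional.empty()).build()) {
      // Hypothetical query shaped like the ones described: a couple of
      // terms plus filter queries.
      SolrQuery q = new SolrQuery("laptop case");
      q.addFilterQuery("category:electronics", "in_stock:true");
      q.setRows(10);

      final int warmup = 100, measured = 1000;
      for (int i = 0; i < warmup; i++) client.query("products", q);

      long start = System.nanoTime();
      for (int i = 0; i < measured; i++) client.query("products", q);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;

      System.out.printf("avg latency: %.1f ms, ~%.0f qps (single thread)%n",
          (double) elapsedMs / measured, measured * 1000.0 / elapsedMs);
    }
  }
}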

7> G1GC may also help compared to CMS, but as you're well aware, GC
tuning "is more art than science" ;).

Good luck!
Erick

On Fri, Oct 26, 2018 at 8:55 AM Sofiya Strochyk <s...@interlogic.com.ua> wrote:
>
> Hi everyone,
>
> We have a SolrCloud setup with the following configuration:
>
> 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon E5-1650v2, 
> 12 cores, with SSDs)
> One collection, 4 shards, each has only a single replica (so 4 replicas in 
> total), using compositeId router
> Total index size is about 150M documents/320GB, so about 40M/80GB per node
> Zookeeper is on a separate server
> Documents consist of about 20 fields (most of them are both stored and 
> indexed), average document size is about 2kB
> Queries are mostly 2-3 words in the q field, with 2 fq parameters and a
> complex sort expression (containing IF functions)
> We don't use faceting for performance reasons, but we need to add it in the
> future
> The majority of the documents are reindexed 2 times/day, as fast as Solr
> allows, in batches of 1000-10000 docs. Some of the documents are also deleted
> (by id, not by query)
> autoCommit is set to maxTime of 1 minute with openSearcher=false and 
> autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from 
> clients are ignored.
> Heap size is set to 8GB.
>
> Target query rate is up to 500 qps, maybe 300, and we need to keep response 
> time at <200ms. But at the moment we only see very good search performance 
> with up to 100 requests per second. Whenever it grows to about 200, average 
> response time abruptly increases to 0.5-1 second. (Also it seems that request 
> rate reported by Solr in the admin metrics is 2x higher than the real one,
> because for every query, every shard receives 2 requests: one to obtain IDs
> and a second one to fetch the data by those IDs; so the target rate in Solr's
> metrics would be 1000 qps).
>
> During high request load, CPU usage increases dramatically on the Solr nodes.
> It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 
> server (random server each time, not the smallest one).
>
> The documentation mentions replication to spread the load between the
> servers, so we tested replicating to smaller servers (32GB RAM, Intel Core
> i7-4770). However, the replicas were going out of sync all the time
> (possibly during commits) and reported errors like "PeerSync Recovery was
> not successful - trying replication." They then proceeded with replication,
> which takes hours, and the leader handled all requests singlehandedly
> during that time. Also, both leaders and replicas started encountering OOM
> errors (heap space) for an unknown reason. Heap dump analysis shows that
> most of the memory is consumed by the [J (array of long) type; my best
> guess is that it is the "_version_" field, but it's still unclear why this
> happens. Also, even though with replication the request rate and CPU usage
> drop by a factor of 2, it doesn't seem to affect the mean_ms, stddev_ms or
> p95_ms numbers (p75_ms is much smaller on nodes with replication, but still
> not as low as under a load of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full GC 
> happens almost exclusively during those times. We have tried tuning GC 
> options as suggested here, but it didn't change things.
>
> My questions are:
>
> How do we increase throughput? Is replication the only solution?
> If yes, then why doesn't it affect response times, considering that CPU is
> not 100% used and the index fits into memory?
> How to deal with OOM and replicas going into recovery?
> Is memory or CPU the main problem? (When searching on the internet, I never
> see CPU as the main bottleneck for Solr, but our case might be different.)
> Or do we need smaller shards? Could segment merging be a problem?
> How to add faceting without slowing search queries down too much?
> How to diagnose these problems and narrow them down to the real cause in the
> hardware or setup?
>
> Any help would be much appreciated.
>
> Thanks!
>
> --
> Sofiia Strochyk
>
>
>
> s...@interlogic.com.ua
>
> www.interlogic.com.ua
>
>
