I haven't worked with AWS, but we recently tried to move some of our Solr
instances to Google's Cloud offering, and it did not go well. All of our
problems ended up stemming from the fact that the I/O is throttled. Any
sufficiently complex query required too many disk reads to return results
in a reasonable time while being throttled. SSDs were better, but not
practical cost-wise and still not as performant as our own bare metal.

I'm not sure that is what is happening in your case, since your CPU time
seemed to be mostly idle rather than stuck in I/O waits, but your situation
sounds a lot like ours when we started indexing on the cloud instances.
There might be an equivalent metric for AWS, but Google exposed the number
of throttled reads and writes (albeit through Stackdriver) that we could
track.
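
For reference, the Stackdriver metrics we watched were, if memory serves
(double-check the exact names in the metrics explorer):

    compute.googleapis.com/instance/disk/throttled_read_ops_count
    compute.googleapis.com/instance/disk/throttled_write_ops_count

I believe CloudWatch has a rough analogue for gp2 EBS volumes
(BurstBalance), but I haven't used it myself, so treat that as a guess.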

During the initial indexing, the indexing processes would reach a point
where updates were taking minutes to complete, and the cause was throttled
write ops.

A few things we did to get everything indexing at a reasonable rate for the
initial setup (a rough solrconfig.xml sketch of these settings follows the
list):
-- autoCommit set to something very, very low, like 10-15 seconds, with
openSearcher set to false
-- autoSoftCommit set to 1 hour or more (our indexing took days) to avoid
unnecessary read operations during indexing
-- left the RAM buffer/buffered-docs settings and maxIndexingThreads at
their defaults
-- set the max threads and max concurrent merges of the mergeScheduler to
1 (or very low), which prevented excessive I/O during indexing
-- kept only one copy of each shard to avoid duplicate writes/merges on the
follower replicas, and added the redundant copies back after the bulk
indexing
-- changed some setting on the storage objects to make them faster at the
expense of more CPU used (busy rather than waiting); it helped with
indexing but didn't make a difference in the long run
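
To make that concrete, here is a rough sketch of what the relevant parts of
our solrconfig.xml looked like during the bulk load. The exact values
(15-second hard commit, 1-hour soft commit, single merge thread) are
illustrative rather than a recommendation, so tune them for your own setup:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard commit often, but never open a searcher during the bulk load -->
      <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- soft commit rarely; nothing needs to search the data while it loads -->
      <autoSoftCommit>
        <maxTime>3600000</maxTime>
      </autoSoftCommit>
    </updateHandler>

    <indexConfig>
      <!-- a single merge thread keeps background merges from saturating write IOPS -->
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">1</int>
        <int name="maxThreadCount">1</int>
      </mergeScheduler>
    </indexConfig>

The single-copy-per-shard part is handled at the collection level rather
than in solrconfig.xml: we created the collection with replicationFactor=1
and used the Collections API ADDREPLICA action to add the redundant copies
once the bulk load finished.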

With regard to SPM: I haven't used it to troubleshoot this type of problem
before, but we use it for all of our Solr monitoring. The out-of-the-box
settings work very well for us, so I'm not sure how much metric
customization it allows beyond the initially configured ones.

Also, most of your attachments got filtered out by the mailing list,
particularly the images.

Best,
Chris

On Mon, Apr 23, 2018 at 5:38 PM Denis Demichev <demic...@gmail.com> wrote:

> I conducted another experiment today with local SSD drives, but this did
> not seem to fix my problem.
> Don't see any extensive I/O in this case:
>
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>
> xvda              1.76        88.83         5.52    1256191      77996
>
> xvdb             13.95       111.30     56663.93    1573961  801303364
>
> xvdb - is the device where SolrCloud is installed and data files are kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." running. Some of them are
> blocked, some of them are RUNNING
> - updateExecutor-N-thread-M threads are in parked mode and number of docs
> that I am able to submit is still low
> - Tried to change maxIndexingThreads, set it to something high. This seems
> to prolong the time when cluster is accepting new indexing requests and
> keeps CPU utilization a lot higher while the cluster is merging indexes
>
> Could anyone please point me to the right direction (documentation or Java
> classes) where I can read about how data is passed from updateExecutor
> thread pool to Merge Threads? I assume there should be some internal
> blocking queue or something similar.
> Still cannot wrap my head around how Solr blocks incoming connections. Non
> merged indexes are not kept in memory so I don't clearly understand why
> Solr cannot keep writing index file to HDD while other threads are merging
> indexes (since this is a continuous process anyway).
>
> Does anyone use SPM monitoring tool for that type of problems? Is it of
> any use at all?
>
>
> Thank you in advance.
>
> [image: image.png]
>
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to non-HVM instance may take
>> some time, so hopefully I will be able to share my observations in the next
>> couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious
>>> regarding this case and AWS instance flavours. fwiw since 7.4 we'll have
>>> ioThrottle=false option.
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com>
>>> wrote:
>>>
>>>> Mikhail, Erick,
>>>>
>>>> Thank you.
>>>>
>>>> What just occurred to me - we don't use local SSD but instead we're
>>>> using EBS volumes.
>>>> This was a wrong instance type that I looked at.
>>>> Will try to set up a cluster with SSD nodes and retest.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>>
>>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org>
>>>> wrote:
>>>>
>>>>> I'm not sure it's the right context, but here is one guy who shows a
>>>>> really low throttle boundary
>>>>>
>>>>> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>>>>>
>>>>>
>>>>> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Threads are hanging on merge io throttling
>>>>>>
>>>>>>         at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>>>>>         at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>>>>>         at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>>>>>         at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>>>>>
>>>>>> It seems odd. Please confirm that you don't commit on every update
>>>>>> request.
>>>>>> The only way to monitor io throttling is to enable infostream and
>>>>>> read a lot of logs.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Erick,
>>>>>>>
>>>>>>> Thank you for your quick response.
>>>>>>>
>>>>>>> I/O bottleneck: Please see another screenshot attached, as you can
>>>>>>> see disk r/w operations are pretty low or not significant.
>>>>>>> iostat==========
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           12.52    0.00    0.00    0.00    0.00   87.48
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>           12.51    0.00    0.00    0.00    0.00   87.49
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>> ==========================
>>>>>>>
>>>>>>> Merging threads: I don't see any modifications of a merging policy
>>>>>>> comparing to the default solrconfig.
>>>>>>> Index config:
>>>>>>> <ramBufferSizeMB>2000</ramBufferSizeMB><maxBufferedDocs>500000</maxBufferedDocs>
>>>>>>> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>> Could you please help me understand how I can validate this theory?
>>>>>>> Another note here. Even if I remove the stress from the cluster I
>>>>>>> still see that merging thread is consuming CPU for some time. It may 
>>>>>>> take
>>>>>>> hours and if I try to return the stress back nothing changes.
>>>>>>> If this is overloaded merging process, it should take some time to
>>>>>>> reduce the queue length and it should start accepting new indexing 
>>>>>>> requests.
>>>>>>> Maybe I am wrong, but I need some help to understand how to check it.
>>>>>>>
>>>>>>> AWS - Sorry, I don't have any physical hardware to replicate this
>>>>>>> test locally
>>>>>>>
>>>>>>> GC - I monitored GC closely. If you take a look at CPU utilization
>>>>>>> screenshot you will see a blue graph that is GC consumption. In 
>>>>>>> addition to
>>>>>>> that I am using Visual GC plugin from VisualVM to understand how GC
>>>>>>> performs under the stress and don't see any anomalies.
>>>>>>> There are several GC pauses from time to time but those are not
>>>>>>> significant. Heap utilization graph tells me that GC is not struggling a
>>>>>>> lot.
>>>>>>>
>>>>>>> Thank you again for your comments, hope the information above will
>>>>>>> help you understand the problem.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <
>>>>>>> erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you changed any of the merge policy parameters? I doubt it but
>>>>>>>> just asking.
>>>>>>>>
>>>>>>>> My guess: your I/O is your bottleneck. There are a limited number of
>>>>>>>> threads (tunable) that are used for background merging. When they're
>>>>>>>> all busy, incoming updates are queued up. This squares with your
>>>>>>>> statement that queries are fine and CPU activity is moderate.
>>>>>>>>
>>>>>>>> A quick test there would be to try this on a non-AWS setup if you
>>>>>>>> have
>>>>>>>> some hardware you can repurpose.
>>>>>>>>
>>>>>>>> an 80G heap is a red flag. Most of the time that's too large by far.
>>>>>>>> So one thing I'd do is hook up some GC monitoring, you may be
>>>>>>>> spending
>>>>>>>> a horrible amount of time in GC cycles.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > All,
>>>>>>>> >
>>>>>>>> > I would like to request some assistance with a situation
>>>>>>>> described below. My
>>>>>>>> > SolrCloud cluster accepts the update requests at a very low pace
>>>>>>>> making it
>>>>>>>> > impossible to index new documents.
>>>>>>>> >
>>>>>>>> > Cluster Setup:
>>>>>>>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>>>>>>>> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB
>>>>>>>> physical memory,
>>>>>>>> > 80GB Java Heap space, AWS
>>>>>>>> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment
>>>>>>>> (build
>>>>>>>> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed
>>>>>>>> mode)
>>>>>>>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>>>>>>>> >
>>>>>>>> > Symptoms:
>>>>>>>> > 1. 4 instances running 4 threads each are using SolrJ client to
>>>>>>>> submit
>>>>>>>> > documents to SolrCloud for indexing, do not perform any manual
>>>>>>>> commits. Each
>>>>>>>> > document  batch is 10 documents big, containing ~200 text fields
>>>>>>>> per
>>>>>>>> > document.
>>>>>>>> > 2. After some time (~20-30 minutes, by that time I see only
>>>>>>>> ~50-60K of
>>>>>>>> > documents in a collection, node restarts do not help) I notice
>>>>>>>> that clients
>>>>>>>> > cannot submit new documents to the cluster for indexing anymore,
>>>>>>>> each
>>>>>>>> > operation takes enormous amount of time
>>>>>>>> > 3. Cluster is not loaded at all, CPU consumption is moderate (I
>>>>>>>> am seeing
>>>>>>>> > that merging is performed all the time though), memory
>>>>>>>> consumption is
>>>>>>>> > adequate, but still updates are not accepted from external clients
>>>>>>>> > 4. Search requests are handled fine
>>>>>>>> > 5. I don't see any significant activity in SolrCloud logs
>>>>>>>> anywhere, just
>>>>>>>> > regular replication attempts only. No errors.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Additional information
>>>>>>>> > 1. Please see Thread Dump attached.
>>>>>>>> > 2. Please see SolrAdmin info with physical memory and file
>>>>>>>> descriptor
>>>>>>>> > utilization
>>>>>>>> > 3. Please see VisualVM screenshots with CPU and memory
>>>>>>>> utilization and CPU
>>>>>>>> > profiling data. Physical memory utilization is about 60-70
>>>>>>>> percent all the
>>>>>>>> > time.
>>>>>>>> > 4. Schema file contains ~10 permanent fields 5 of which are
>>>>>>>> mapped and
>>>>>>>> > mandatory and persisted, the rest of the fields are optional and
>>>>>>>> dynamic
>>>>>>>> > 5. Solr config configures autoCommit to be set to 2 minutes and
>>>>>>>> openSearcher
>>>>>>>> > set to false
>>>>>>>> > 6. Caches are set up with autoWarmCount = 0
>>>>>>>> > 7. GC was fine tuned and I don't see any significant CPU
>>>>>>>> utilization by GC
>>>>>>>> > or any lengthy pauses. Majority of the garbage is collected in
>>>>>>>> young gen
>>>>>>>> > space.
>>>>>>>> >
>>>>>>>> > My primary question: I see that the cluster is alive and performs
>>>>>>>> some
>>>>>>>> > merging and commits but does not accept new documents for
>>>>>>>> indexing. What is
>>>>>>>> > causing this slowdown and why it does not accept new submissions?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Denis
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>>
>>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>
