I haven't worked with AWS, but we recently tried to move some of our Solr instances to Google's cloud offering, and it did not go well. All of our problems ended up stemming from the fact that the I/O is throttled: any sufficiently complex query needed too many disk reads to return results in a reasonable time once throttled. SSDs were better, but not practical cost-wise and still not as performant as our own bare metal.
I'm not sure if that is what is happening in your case, since it seemed like your CPU time was mostly idle rather than in I/O waits, but your case sounds a lot like ours when we started indexing on the cloud instances. There might be an equivalent metric for AWS; Google exposed the number of throttled reads and writes (albeit through StackDriver), which we could track. When we were doing the initial indexing, the indexing processes would get to a point where updates were taking minutes to complete, and the cause was throttled write ops.

A few things we did to get everything indexing at a reasonable rate for the initial setup:

-- autoCommit set to something very low, like 10-15 seconds, with openSearcher set to false
-- autoSoftCommit set to 1 hour or more (our indexing took days) to avoid unnecessary read operations during indexing
-- left the RAM buffer/buffered docs settings and maxIndexingThreads at their defaults
-- set the max threads and max concurrent merges of the mergeScheduler to 1 (or very low); this prevented excessive I/O during indexing
-- kept only one copy of each shard to avoid duplicate writes/merges on the follower replicas, and added the redundant copies after the bulk indexing was done
-- there was some setting on the storage objects to make them faster at the expense of more CPU used (not waiting); it helped with indexing but didn't make a difference in the long run
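To make that concrete, the commit and merge parts of that setup would look roughly like the solrconfig.xml snippet below (the values are illustrative, not the exact ones we ran with):

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- hard commit frequently to flush the transaction log, but do not open searchers -->
    <autoCommit>
      <maxTime>15000</maxTime>            <!-- ~15 seconds -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- soft commit very rarely; visibility is not needed during the bulk load -->
    <autoSoftCommit>
      <maxTime>3600000</maxTime>          <!-- ~1 hour -->
    </autoSoftCommit>
  </updateHandler>

  <indexConfig>
    <!-- keep background merges from competing with indexing for the throttled I/O -->
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">1</int>
      <int name="maxThreadCount">1</int>
    </mergeScheduler>
  </indexConfig>

The idea is just to keep searchers closed and merge concurrency at a minimum while the bulk load runs, then put the normal settings back afterwards.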
With regards to SPM: I haven't used it to troubleshoot this type of problem before, but we use it for all of our Solr monitoring. The out-of-the-box settings work very well for us, so I'm not sure how much metric customization it allows beyond the ones set up initially.

Also, most of your attachments got filtered out by the mailing list, particularly the images.

Best,
Chris

On Mon, Apr 23, 2018 at 5:38 PM Denis Demichev <demic...@gmail.com> wrote:

> I conducted another experiment today with local SSD drives, but this did not seem to fix my problem.
> I don't see any extensive I/O in this case:
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read     kB_wrtn
> xvda              1.76        88.83         5.52    1256191       77996
> xvdb             13.95       111.30     56663.93    1573961   801303364
>
> xvdb is the device where SolrCloud is installed and the data files are kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." threads running. Some of them are blocked, some are RUNNING.
> - updateExecutor-N-thread-M threads are parked, and the number of docs that I am able to submit is still low.
> - Tried to change maxIndexingThreads and set it to something high. This seems to prolong the time the cluster accepts new indexing requests and keeps CPU utilization a lot higher while the cluster is merging indexes.
>
> Could anyone please point me in the right direction (documentation or Java classes) where I can read about how data is passed from the updateExecutor thread pool to the merge threads? I assume there should be some internal blocking queue or something similar.
> I still cannot wrap my head around how Solr blocks incoming connections. Non-merged indexes are not kept in memory, so I don't clearly understand why Solr cannot keep writing index files to disk while other threads are merging indexes (since this is a continuous process anyway).
>
> Does anyone use the SPM monitoring tool for this type of problem? Is it of any use at all?
>
> Thank you in advance.
>
> [image: image.png]
>
> Regards,
> Denis
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to a non-HVM instance may take some time, so hopefully I will be able to share my observations in the next couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious about this case and the AWS instance flavours. FWIW, since 7.4 we'll have an ioThrottle=false option.
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>
>>>> Mikhail, Erick,
>>>>
>>>> Thank you.
>>>>
>>>> What just occurred to me - we don't use local SSDs, we're using EBS volumes instead.
>>>> That was the wrong instance type I looked at.
>>>> Will try to set up a cluster with SSD nodes and retest.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org> wrote:
>>>>
>>>>> I'm not sure it's the right context, but here is one guy showing a really low throttle boundary:
>>>>> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>>>>>
>>>>> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org> wrote:
>>>>>
>>>>>> Threads are hanging on merge I/O throttling:
>>>>>>
>>>>>> at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>>>>> at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>>>>> at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>>>>> at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>>>>>
>>>>>> It seems odd. Please confirm that you don't commit on every update request.
>>>>>> The only way to monitor I/O throttling is to enable infoStream and read a lot of logs.
>>>>>>
>>>>>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>
>>>>>>> Erick,
>>>>>>>
>>>>>>> Thank you for your quick response.
>>>>>>>
>>>>>>> I/O bottleneck: please see another screenshot attached; as you can see, disk r/w operations are pretty low or insignificant.
>>>>>>>
>>>>>>> iostat ==========
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>>>           12.52   0.00     0.00     0.00    0.00  87.48
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>>>           12.51   0.00     0.00     0.00    0.00  87.49
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
>>>>>>> ==========================
>>>>>>>
>>>>>>> Merging threads: I don't see any modifications of the merge policy compared to the default solrconfig.
>>>>>>> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB> <maxBufferedDocs>500000</maxBufferedDocs>
>>>>>>> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>> Could you please help me understand how I can validate this theory?
>>>>>>> Another note here: even if I remove the stress from the cluster, I still see the merging thread consuming CPU for some time. It may take hours, and if I put the stress back nothing changes.
>>>>>>> If this is an overloaded merging process, it should take some time to reduce the queue length and then start accepting new indexing requests again.
>>>>>>> Maybe I am wrong, but I need some help to understand how to check it.
>>>>>>>
>>>>>>> AWS - sorry, I don't have any physical hardware to replicate this test locally.
>>>>>>>
>>>>>>> GC - I monitored GC closely. If you take a look at the CPU utilization screenshot you will see a blue graph, which is GC consumption. In addition to that, I am using the Visual GC plugin from VisualVM to understand how GC performs under the stress, and I don't see any anomalies.
>>>>>>> There are several GC pauses from time to time, but those are not significant. The heap utilization graph tells me that GC is not struggling a lot.
>>>>>>>
>>>>>>> Thank you again for your comments; I hope the information above will help you understand the problem.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you changed any of the merge policy parameters? I doubt it, but just asking.
>>>>>>>>
>>>>>>>> My guess: your I/O is your bottleneck. There are a limited number of threads (tunable) that are used for background merging. When they're all busy, incoming updates are queued up. This squares with your statement that queries are fine and CPU activity is moderate.
>>>>>>>>
>>>>>>>> A quick test there would be to try this on a non-AWS setup if you have some hardware you can repurpose.
>>>>>>>>
>>>>>>>> An 80G heap is a red flag. Most of the time that's too large by far. So one thing I'd do is hook up some GC monitoring; you may be spending a horrible amount of time in GC cycles.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > All,
>>>>>>>> >
>>>>>>>> > I would like to request some assistance with the situation described below. My SolrCloud cluster accepts update requests at a very low pace, making it impossible to index new documents.
>>>>>>>> >
>>>>>>>> > Cluster setup:
>>>>>>>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>>>>>>>> > Cluster - SolrCloud 7.2.1, 10 instances, r4.4xlarge, 120GB physical memory, 80GB Java heap space, AWS
>>>>>>>> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment (build 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>>>>>>>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>>>>>>>> >
>>>>>>>> > Symptoms:
>>>>>>>> > 1. 4 instances running 4 threads each use the SolrJ client to submit documents to SolrCloud for indexing and do not perform any manual commits.
>>>>>>>> > Each document batch is 10 documents big, containing ~200 text fields per document.
>>>>>>>> > 2. After some time (~20-30 minutes; by that time I see only ~50-60K documents in the collection, and node restarts do not help) I notice that clients cannot submit new documents to the cluster for indexing anymore; each operation takes an enormous amount of time.
>>>>>>>> > 3. The cluster is not loaded at all, CPU consumption is moderate (I see that merging is performed all the time, though), memory consumption is adequate, but updates are still not accepted from external clients.
>>>>>>>> > 4. Search requests are handled fine.
>>>>>>>> > 5. I don't see any significant activity in the SolrCloud logs anywhere, just regular replication attempts. No errors.
>>>>>>>> >
>>>>>>>> > Additional information:
>>>>>>>> > 1. Please see the thread dump attached.
>>>>>>>> > 2. Please see the SolrAdmin info with physical memory and file descriptor utilization.
>>>>>>>> > 3. Please see the VisualVM screenshots with CPU and memory utilization and CPU profiling data. Physical memory utilization is about 60-70 percent all the time.
>>>>>>>> > 4. The schema file contains ~10 permanent fields, 5 of which are mapped, mandatory, and persisted; the rest of the fields are optional and dynamic.
>>>>>>>> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher set to false.
>>>>>>>> > 6. Caches are set up with autoWarmCount = 0.
>>>>>>>> > 7. GC was fine-tuned and I don't see any significant CPU utilization by GC or any lengthy pauses. The majority of the garbage is collected in the young gen space.
>>>>>>>> >
>>>>>>>> > My primary question: I see that the cluster is alive and performs some merging and commits, but it does not accept new documents for indexing. What is causing this slowdown, and why does it not accept new submissions?
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Denis
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev