On Wed, Jun 10, 2020 at 8:35 PM Hup Chen <chai...@hotmail.com> wrote:
> I will check "dmesg" first, to find out any hardware error message.
>

Here is what I see toward the end of the output from dmesg:

[1521232.781785] [118857] 48 118857 108785 677 201 901 0 httpd
[1521232.781787] [118860] 48 118860 108785 710 201 881 0 httpd
[1521232.781788] [118862] 48 118862 113063 5256 210 725 0 httpd
[1521232.781790] [118864] 48 118864 114085 6634 212 703 0 httpd
[1521232.781791] [118871] 48 118871 139687 32323 262 620 0 httpd
[1521232.781793] [118873] 48 118873 108785 821 201 792 0 httpd
[1521232.781795] [118879] 48 118879 140263 32719 263 621 0 httpd
[1521232.781796] [118903] 48 118903 108785 812 201 771 0 httpd
[1521232.781798] [118905] 48 118905 113575 5606 211 660 0 httpd
[1521232.781800] [118906] 48 118906 113563 5694 211 626 0 httpd
[1521232.781801] Out of memory: Kill process 117529 (httpd) score 9 or sacrifice child
[1521232.782908] Killed process 117529 (httpd), UID 48, total-vm:675824kB, anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB

Is this a relevant "Out of memory" message? Does this suggest an OOM
situation is the culprit?

When I grep in the solr logs for oom, I see some entries like this...

./solr_gc.log.4.current:CommandLine flags: -XX:CICompilerCount=4 -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:ConcGCThreads=4 -XX:GCLogFileSize=20971520 -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912 -XX:MaxNewSize=134217728 -XX:MaxTenuringThreshold=8 -XX:MinHeapDeltaBytes=196608 -XX:NewRatio=3 -XX:NewSize=134217728 -XX:NumberOfGCLogFiles=9 -XX:OldPLABSize=16 -XX:OldSize=402653184 -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /opt/solr/server/logs -XX:ParallelGCThreads=4 -XX:+ParallelRefProcEnabled -XX:PretenureSizeThreshold=67108864 -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:ThreadStackSize=256 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC

Buried in there I see "OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh". But
I think this is just a setting that indicates what to do in case of an OOM.
And if I look in that oom_solr.sh file, I see it would write an entry to a
solr_oom_kill log. And there is no such log in the logs directory.

Many thanks.

> Then use some system admin tools to monitor that server,
> for instance, top, vmstat, lsof, iostat ... or simply install some nice
> free monitoring tool into this system, like monit, monitorix, nagios.
> Good luck!
>
> ________________________________
> From: Ryan W <rya...@gmail.com>
> Sent: Thursday, June 11, 2020 2:13 AM
> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
> Subject: Re: How to determine why solr stops running?
>
> Hi all,
>
> People keep suggesting I check the logs for errors. What do those errors
> look like? Does anyone have examples of the text of a Solr oom error?
> Or the text of any other errors I should be looking for the next time solr
> fails? Are there phrases I should grep for in the logs? Should I be
> looking in the Solr logs for an OOM error, or in the Apache logs?
>
> There is nothing failing on the server except for solr -- at least not that
> I can see. There is no apparent problem with the hardware or anything else
> on the server. The OS is Red Hat Enterprise Linux. The server has 16 GB of
> RAM and hosts one website that does not get a huge amount of traffic.
>
> When the start command is given to solr, does it first check to see if solr
> is running, or does it always start solr whether it is already running or
> not?
>
> Many thanks!
> Ryan
>
>
> On Tue, Jun 9, 2020 at 7:58 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > To add to what Dave said, if you have a particular machine that’s prone
> > to suddenly stopping, that’s usually a red flag that you should seriously
> > think about hardware issues.
> >
> > If the problem strikes different machines, then I agree with Shawn that
> > the first thing I’d be suspicious of is OOM errors.
> >
> > FWIW,
> > Erick
> >
> > > On Jun 9, 2020, at 6:05 AM, Dave <hastings.recurs...@gmail.com> wrote:
> > >
> > > I’ll add that whenever I’ve had a solr instance shut down, for me it’s
> > > been a hardware failure. Either the ram or the disk got a “glitch” and
> > > both of these are relatively fragile and wear and tear type parts of the
> > > machine, and should be expected to fail and be replaced from time to
> > > time. Solr is pretty aggressive with its logging so there are a lot of
> > > writes always happening and of course reads, if the disk has any issues
> > > or the memory it can lock it up and bring her down, more so if you have
> > > any spellcheck dictionaries or suggesters being built on start up.
> > >
> > > Just my experience with this, could be wrong (most likely wrong) but we
> > > always have extra drives and memory around the server room for this
> > > reason. At least once or twice a year we will have a disk failure in the
> > > raid and need to swap in a new one.
> > >
> > > Good luck though, also solr should be logging it’s failures so it would
> > > be good to look there too
> > >
> > >> On Jun 9, 2020, at 2:35 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> > >>
> > >> On 5/14/2020 7:22 AM, Ryan W wrote:
> > >>> I manage a site where solr has stopped running a couple times in the
> > >>> past week. The server hasn't been rebooted, so that's not the reason.
> > >>> What else causes solr to stop running? How can I investigate why this
> > >>> is happening?
> > >>
> > >> Any situation where Solr stops running and nobody requested the stop is
> > >> a result of a serious problem that must be thoroughly investigated. I
> > >> think it's a bad idea for Solr to automatically restart when it stops
> > >> unexpectedly. Chances are that whatever caused the crash is going to
> > >> simply make the crash happen again until the problem is solved.
> > >> Automatically restarting could hide problems from the system
> > >> administrator.
> > >>
> > >> The only way a Solr auto-restart would be acceptable to me is if it
> > >> sends a high priority alert to the sysadmin EVERY time it executes an
> > >> auto-restart. It really is that bad of a problem.
> > >>
> > >> The causes of Solr crashes (that I can think of) include the following.
> > >> I believe I have listed these four options from most likely to least
> > >> likely:
> > >>
> > >> * Java OutOfMemoryError exceptions. On non-windows systems, the
> > >> "bin/solr" script starts Solr with an option that results in Solr's
> > >> death anytime one of these exceptions occurs. We do this because program
> > >> operation is indeterminate and completely unpredictable when OOME
> > >> occurs, so it's far safer to stop running.
> > >> That exception can be caused by several things, some of which actually
> > >> do not involve memory at all. If you're running on Windows via the
> > >> bin\solr.cmd command, then this will not happen ... but OOME could still
> > >> cause a crash, because as I already mentioned, program operation is
> > >> unpredictable when OOME occurs.
> > >>
> > >> * The OS kills Solr because system memory is completely exhausted and
> > >> Solr is the process using the most memory. Linux calls this the
> > >> "oom-killer" ... I am pretty sure something like it exists on most
> > >> operating systems.
> > >>
> > >> * Corruption somewhere in the system. Could be in Java, the OS, Solr,
> > >> or data used by any of those.
> > >>
> > >> * A very serious bug in Solr's code that we haven't discovered yet.
> > >>
> > >> I included that last one simply for completeness. A bug that causes a
> > >> crash *COULD* exist, but as of right now, we have not seen any
> > >> supporting evidence.
> > >>
> > >> My guess is that Java OutOfMemoryError is the cause here, but I can't
> > >> be certain. If that is happening, then some resource (which might not
> > >> be memory) is fully depleted. We would need to see the full
> > >> OutOfMemoryError exception in order to determine why it is happening.
> > >> Sometimes the exception is logged in solr.log, sometimes it isn't. We
> > >> cannot predict what part of the code will be running when OOME occurs,
> > >> so it would be nearly impossible for us to guarantee logging. OOME can
> > >> happen ANYWHERE - even in code that the compiler thinks is immune to
> > >> exceptions.
> > >>
> > >> Side note to fellow committers: I wonder if we should implement an
> > >> uncaught exception handler in Solr. I have found in my own programs
> > >> that it helps figure out thorny problems. And while I am on the subject
> > >> of handlers that might not be general knowledge, I didn't find a
> > >> shutdown hook or a security manager outside of tests.
> > >>
> > >> Thanks,
> > >> Shawn
> > >
> >
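
For readers unfamiliar with the two JDK mechanisms named in Shawn's side
note, here is a minimal sketch of an uncaught exception handler and a
shutdown hook. This is not Solr code. The class name CrashLogging, its
method names, and the message format are made up for illustration; only
Thread.setDefaultUncaughtExceptionHandler and Runtime.addShutdownHook are
real JDK calls.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class; nothing here comes from the Solr code base.
public class CrashLogging {
    // Messages recorded by the handler, kept so the demo can print them.
    public static final List<String> RECORDED = new CopyOnWriteArrayList<>();

    // Builds the one-line summary recorded for a throwable that ended a thread.
    public static String describe(String threadName, Throwable t) {
        return "Uncaught " + t.getClass().getName()
                + " in thread '" + threadName + "': " + t.getMessage();
    }

    // Installs a JVM-wide default handler for exceptions that no code caught.
    public static void installHandler() {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            String line = describe(thread.getName(), throwable);
            RECORDED.add(line);
            System.err.println(line);
        });
    }

    // Registers a hook that runs during an orderly JVM shutdown.
    public static void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.err.println("JVM shutting down"),
                           "shutdown-logger"));
    }

    public static void main(String[] args) throws InterruptedException {
        installHandler();
        installShutdownHook();

        // Simulate a worker thread dying from an exception nobody caught.
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated failure");
        }, "worker-1");
        worker.start();
        worker.join();

        System.out.println(RECORDED);
    }
}
```

Two caveats apply. As Shawn says above, logging cannot be guaranteed when an
OutOfMemoryError occurs, so a handler like this may itself fail in that
situation. And shutdown hooks do not run when the process is killed with
SIGKILL, which is what the Linux oom-killer sends, so a hook like this would
not leave a trace for the kind of kill shown in the dmesg output.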