Hey guys, I am reaching out to the Solr list with a very vague issue: under high load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas (2 cores per instance), I eventually see failure messages related to transaction logs, and shortly after these stacktraces occur the cluster starts to fall apart.

To explain my setup:

- SolrCloud 4.3.1
- Jetty 9.x
- Oracle/Sun JDK 1.7.25 w/CMS
- RHEL 6.x 64-bit
- 3 instances, 1 per server
- 3 shards
- 2 replicas per shard

The transaction log error I receive after about 10-30 minutes of load testing is:

ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException] Failure to open existing log file (non fatal) /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.0000000000000000078:org.apache.solr.common.SolrException: java.io.EOFException
    at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
    at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
    at org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
    at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
    at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
    at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
    at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
    at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
    at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
    at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
    at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
    at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
    at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
    ... 25 more

Eventually, after a few of these stack traces, the cluster starts to lose shards and replicas fail. Jetty then creates hung threads until hitting OutOfMemory on native threads due to the maximum process ulimit.

I know this is quite a vague issue, so I'm not expecting a silver-bullet answer, but I was wondering if anyone has suggestions on where to look next? Does this sound Solr-related at all, or possibly system? Has anyone seen this issue before, or has any hypothesis how to find out more?

I will reply shortly with a thread dump, taken from 1 locked-up node.

Thanks for any suggestions!

Tim
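
One cheap thing to check, since the EOFException is thrown while reading the tlog header: whether any transaction log files on disk are empty or only a few bytes long. Below is a minimal Python sketch of such a scan. The function name find_short_tlogs, the min_bytes threshold of 4, and the assumption that logs sit in directories named "tlog" with file names starting with "tlog." are illustrative guesses based on the path in the error above, not anything taken from Solr itself.

```python
import os
import sys


def find_short_tlogs(root, min_bytes=4):
    """Yield (path, size) for tlog files smaller than min_bytes.

    min_bytes is a hypothetical threshold; the real header length
    of a Solr transaction log is not assumed here.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only inspect directories named "tlog" (assumed layout).
        if os.path.basename(dirpath) != "tlog":
            continue
        for name in sorted(filenames):
            if not name.startswith("tlog."):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size < min_bytes:
                yield path, size


if __name__ == "__main__":
    # Usage: python find_short_tlogs.py <solr home directory>
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_short_tlogs(root):
        print(f"{size:>8} bytes  {path}")
```

Running it against the Solr home on each node while the load test is in progress would show whether short files appear before the errors start or only after a core is reloaded. It only reports file sizes and does not try to parse the log format.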