Hello,

We're indexing a large set of files using Solr 6.1.0, running SolrCloud with ZooKeeper 3.4.8.
We have two ensembles, and both clusters run on three VMs of their own (CentOS 7). We first thought the error was caused by CDCR, since we were trying to index a large number of documents that had to be replicated to the target cluster. However, we got the same error even after turning off CDCR, which indicates that CDCR wasn't the problem after all.

After we index between 20 000 and 35 000 documents to the source cluster, the file descriptor count reaches 4096 for one of the Solr nodes, and that node crashes. The count grows roughly linearly over time. The remaining two nodes in the cluster are not affected at all, and their logs contain no relevant entries.

We found the following errors in the log of the crashing node:

2016-06-30 08:23:12.459 ERROR (updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error java.net.SocketException: Too many open files (...)

2016-06-30 08:23:12.460 ERROR (updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.u.StreamingSolrClients error java.net.SocketException: Too many open files (...)

2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update: Too many open files Too many open files (...)
2016-06-30 08:23:12.461 INFO (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/update params={version=2.2} status=-1 QTime=5

2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 2 Async exceptions during distributed update: Too many open files Too many open files (....)

2016-06-30 08:23:12.461 WARN (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1

2016-06-30 08:23:38.108 INFO (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1] webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=https://10.0.106.115:443/solr/DIPS_shard1_replica1/&rows=10&version=2&q=*:*&NOW=1467275018057&isShard=true&wt=javabin&_=1467275017220} hits=30218 status=0 QTime=1

Running netstat -n -p on the VM that throws the exceptions shows at least 1 800 TCP connections waiting to be closed (we did not count them exactly; the netstat output filled the entire PuTTY window with 2 000 lines):

tcp6 70 0 10.0.106.115:34531 10.0.106.114:443 CLOSE_WAIT 21658/java

We're running SolrCloud on port 443, and the IPs belong to the VMs. We also tried raising the ulimit for the machine to 100 000, without any effect.

Greetings,
Mads
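
P.S. For an exact count instead of a screenful of netstat output, something like the following could tally the connections per state and remote address. It is only a sketch: the function name count_tcp_states is made up for illustration, and it assumes the column layout of the netstat -n -p line quoted above (protocol, Recv-Q, Send-Q, local address, foreign address, state, PID/program).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connections by (state, foreign address).

    Assumes `netstat -n -p` style lines with the columns:
    proto, recv-q, send-q, local address, foreign address, state, pid/program.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP/UNIX socket lines, and anything too short to parse.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        state, foreign = fields[5], fields[4]
        counts[(state, foreign)] += 1
    return counts
```

Feeding it the saved netstat output and printing counts.most_common() would show whether the CLOSE_WAIT connections all point at one remote node or are spread across the cluster.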