Hello,
We're indexing a large set of files using Solr 6.1.0, running SolrCloud with 
ZooKeeper 3.4.8.

We have two ensembles, and each cluster runs on its own three VMs (CentOS 7). 
We first thought the error was due to CDCR, as we were trying to index a large 
number of documents which had to be replicated to the target cluster. However, 
we got the same error even after turning off CDCR, which indicates CDCR wasn't 
the problem after all.

After indexing between 20 000 and 35 000 documents to the source cluster, the 
file descriptor count reaches 4096 for one of the Solr nodes, and that node 
crashes. The count grows roughly linearly over time. The remaining two nodes in 
the cluster are not affected at all, and their logs had no relevant entries. 
We found the following errors in the crashing node's log:

2016-06-30 08:23:12.459 ERROR 
(updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1
 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
[c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
                (...)
2016-06-30 08:23:12.460 ERROR 
(updateExecutor-2-thread-22-processing-https:////10.0.106.168:443//solr//DIPS_shard3_replica1
 x:DIPS_shard1_replica1 r:core_node1 n:10.0.106.115:443_solr s:shard1 c:DIPS) 
[c:DIPS s:shard1 r:core_node1 x:DIPS_shard1_replica1] 
o.a.s.u.StreamingSolrClients error
java.net.SocketException: Too many open files
                (...)
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.h.RequestHandlerBase 
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 2 Async exceptions during distributed update:
Too many open files
Too many open files
                (...)
2016-06-30 08:23:12.461 INFO  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr 
path=/update params={version=2.2} status=-1 QTime=5
2016-06-30 08:23:12.461 ERROR (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall 
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
 2 Async exceptions during distributed update:
Too many open files
Too many open files
                (....)

2016-06-30 08:23:12.461 WARN  (qtp314337396-18) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.s.HttpSolrCall invalid return code: -1
2016-06-30 08:23:38.108 INFO  (qtp314337396-20) [c:DIPS s:shard1 r:core_node1 
x:DIPS_shard1_replica1] o.a.s.c.S.Request [DIPS_shard1_replica1]  webapp=/solr 
path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=https://10.0.106.115:443/solr/DIPS_shard1_replica1/&rows=10&version=2&q=*:*&NOW=1467275018057&isShard=true&wt=javabin&_=1467275017220}
 hits=30218 status=0 QTime=1
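
To see which kind of descriptor accounts for the growth described above, the 
entries under /proc/<pid>/fd can be grouped by type. The following is a minimal 
Python sketch, assuming Linux and permission to read the Solr process's /proc 
entry. The function name classify_fds and the proc_root parameter are 
illustrative only and are not part of Solr.

```python
import os
from collections import Counter


def classify_fds(pid, proc_root="/proc"):
    """Count a process's open descriptors by kind (socket, pipe, file, ...)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink such as "socket:[12345]" or a file path.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return kinds
```

Calling classify_fds with the PID of the Solr java process at intervals would 
show whether the socket count climbs while the file count stays flat, which 
would point at leaked connections rather than index files.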

Running netstat -n -p on the VM that yields the exceptions reveals that there 
are at least 1 800 TCP connections waiting to be closed (we did not count them 
exactly; the output filled the entire PuTTY scrollback of 2 000 lines):
tcp6      70      0 10.0.106.115:34531      10.0.106.114:443        CLOSE_WAIT  21658/java
We're running the SolrCloud on port 443, and the IPs belong to the VMs. We also 
tried raising the ulimit for the machine to 100 000, without any effect.
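
An exact count would be more useful than scrolling the terminal. The following 
is a minimal Python sketch that tallies TCP states from netstat -n -p text and 
groups the CLOSE_WAIT connections by remote address. It assumes the usual Linux 
column order (proto, recv-q, send-q, local, foreign, state, pid/program); the 
function name tally_tcp_states is illustrative.

```python
from collections import Counter


def tally_tcp_states(netstat_output):
    """Return (states, close_wait_peers) counters from `netstat -n -p` text."""
    states = Counter()
    close_wait_peers = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UNIX-socket rows and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        states[state] += 1
        if state == "CLOSE_WAIT":
            close_wait_peers[foreign] += 1
    return states, close_wait_peers
```

The input can be the saved output of netstat -n -p. The second counter shows 
which peer the lingering connections point at, for example whether they all go 
to one other node in the cluster.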

Greetings,
Mads
