Hello,

We are facing some problems when indexing with Solr 4.0.0 on more than one 
server node, and we can't find a way to solve them.
We have 2 Solr Cloud nodes.
They use a ZooKeeper ensemble (version 3.4.4) of 3 servers 
(another application is deployed on the third server).
We are trying to index a collection with 1 shard stored on the 2 nodes.
2 other collections with a single shard each have already been indexed. The 
logs for that first indexing have been lost, but there may have been only a 
single Solr node when it was done. Each collection contains about 3,000,000 
documents (16 GB).

When we start adding documents, failures occur very quickly, after maybe 2000 
documents, and the Solr servers can no longer be accessed.
I have attached part of the logs to this mail.

When we use Solr Cloud with only one node in a single ZooKeeper ensemble, we 
don't encounter any problems.



Some details about our configuration:
We send about 400 documents per minute.
The documents are added to Solr by two threads in our application, using the 
CloudSolrServer class.
These threads don't call the commit method. We rely only on the Solr 
configuration to commit. The solrconfig.xml currently defines:
<autoCommit><maxTime>15000</maxTime><openSearcher>false</openSearcher></autoCommit>
No soft commit is configured.
We have also tried:
<autoCommit><maxTime>600000</maxTime><openSearcher>false</openSearcher></autoCommit>
<autoSoftCommit><maxTime>1000</maxTime></autoSoftCommit>

The Solr servers are launched with these options:
-Xmx12G -Xms4G
-XX:MaxPermSize=256m -XX:MaxNewSize=356m
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC
-XX:+CMSClassUnloadingEnabled
-XX:MinHeapFreeRatio=10
-XX:MaxHeapFreeRatio=25
-DzkHost=server1:2188,server2:2188,server3:2188

The solr.xml contains zkClientTimeout="60000" and zoo.cfg defines a tickTime of 
3000 ms.
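
As a sanity check on these two values, here is a minimal Java sketch of how a 
ZooKeeper server bounds the session timeout requested by a client. It assumes 
that zoo.cfg does not set minSessionTimeout or maxSessionTimeout, in which case 
ZooKeeper uses 2 x tickTime and 20 x tickTime. The class and method names are 
ours, for illustration only.

```java
// Sketch only: reproduces the default session-timeout bounds of a ZooKeeper
// server, assuming minSessionTimeout/maxSessionTimeout are NOT set in zoo.cfg.
final class ZkSessionTimeout {

    // Default lower bound: 2 ticks.
    static int minSessionTimeoutMs(int tickTimeMs) {
        return 2 * tickTimeMs;
    }

    // Default upper bound: 20 ticks.
    static int maxSessionTimeoutMs(int tickTimeMs) {
        return 20 * tickTimeMs;
    }

    // The timeout actually granted is the requested value clamped to the bounds.
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = minSessionTimeoutMs(tickTimeMs);
        int max = maxSessionTimeoutMs(tickTimeMs);
        return Math.max(min, Math.min(requestedMs, max));
    }

    public static void main(String[] args) {
        int tickTimeMs = 3000;        // from our zoo.cfg
        int zkClientTimeoutMs = 60000; // from our solr.xml
        System.out.println("allowed range (ms): "
                + minSessionTimeoutMs(tickTimeMs) + " .. " + maxSessionTimeoutMs(tickTimeMs));
        System.out.println("granted timeout (ms): " + negotiate(zkClientTimeoutMs, tickTimeMs));
    }
}
```

Under these assumptions, a tickTime of 3000 ms gives an upper bound of 60000 ms, 
so our zkClientTimeout already sits at the maximum the servers would grant. 
Raising it further in solr.xml alone would have no effect.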

The Solr servers on which we are facing these problems also contain old 
collections and old cores created for some tests.



Could you give me some pointers?
Is this a problem in our Solr or ZooKeeper configuration?
How could we detect network problems?
Is there a problem with the JVM parameters? Should we analyse the garbage 
collection logs?
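
On the garbage collection question, one cheap first check that does not require 
parsing GC logs is to read the JVM's own GC counters through the standard 
java.lang.management API. The sketch below only measures the JVM it runs in. To 
say anything about Solr it would have to run inside the Solr JVM or read the 
same beans over remote JMX, which is an assumption about how our servers are 
set up. The class name GcTimeProbe is hypothetical, and the figure reported is 
accumulated collection time, not the length of individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch only: sums the accumulated GC time reported by the current JVM.
final class GcTimeProbe {

    // Total approximate GC time in ms so far, summed over all collectors.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcTimeMs();
        Thread.sleep(1000); // sampling interval, chosen arbitrarily
        long after = totalGcTimeMs();
        System.out.println("GC time during the last second (ms): " + (after - before));
    }
}
```

If the time spent in GC over an interval approaches the interval itself, the 
heap settings would be the first thing to suspect.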

Thanks in advance.

Joel Gaspard

Problems often begin with errors like:

server.log on the current leader :
11:55:30,015 ERROR [STDERR] Jan 22, 2013 11:55:30 AM 
org.apache.solr.core.SolrCore execute
INFO: [collection2] webapp=/solr path=/replication 
params={file=_6pe_nrm.cfs&command=filecontent&checksum=true&offset=1304428544&qt=/replication&generation=1839&wt=filestream}
 status=0 QTime=0 
11:55:30,047 ERROR [STDERR] Jan 22, 2013 11:55:30 AM 
org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Client session timed out, have not heard from server in 62416ms for 
sessionid 0x13c61924ade0000, closing socket connection and attempting reconnect
11:55:30,099 ERROR [STDERR] Jan 22, 2013 11:55:30 AM 
org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/get 
params={getVersions=100&distrib=false&qt=/get&wt=javabin&version=2} status=0 
QTime=0 
11:55:30,139 ERROR [STDERR] Jan 22, 2013 11:55:30 AM 
org.apache.solr.handler.ReplicationHandler$FileStream write
WARNING: Exception while writing response for params: 
file=_6pe_nrm.cfs&command=filecontent&checksum=true&generation=1839&qt=/replication&wt=filestream
ClientAbortException:  java.net.SocketException: Broken pipe
        at 
org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:366)
...
2013-01-22 11:55:31,764 ERROR [STDERR] Jan 22, 2013 11:55:31 AM 
org.apache.solr.common.cloud.ConnectionManager process
INFO: zkClient has disconnected
...
2013-01-22 11:55:32,429 ERROR [STDERR] Jan 22, 2013 11:55:32 AM 
org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)

zookeeper.log on the current leader :
2013-01-22 11:55:30,122 [myid:1] - INFO  [CommitProcessor:1:NIOServerCnxn@1001] 
- Closed socket connection for client /120.8.195.38:42931 which had sessionid 
0x13c61924ade0000
2013-01-22 11:55:30,122 [myid:1] - DEBUG [CommitProcessor:1:NIOServerCnxn@1017] 
- ignoring exception during output shutdown
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        
server.log on the current replica:
INFO: [collection2] webapp=/solr path=/update 
params={distrib.from=http://server1:8080/solr/collection2/&update.distrib=FROMLEADER&wt=javabin&version=2}
 status=0 QTime=5 
11:54:48,301 ERROR [STDERR] Jan 22, 2013 11:54:48 AM 
org.apache.solr.handler.SnapPuller$FileFetcher fetchPackets
WARNING: Error in fetching packets 
java.net.SocketTimeoutException: Read timed out
...
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)
11:55:08,311 ERROR [STDERR] Jan 22, 2013 11:55:08 AM 
org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: removing temporary index download directory 
/var/webfarm/server2/var/solr/solr_data_global/collection2/data/index.20130122115232777
11:55:08,360 ERROR [STDERR] Jan 22, 2013 11:55:08 AM 
org.apache.solr.common.SolrException log
SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Unable to 
download _6pe_nrm.cfs completely. Downloaded 1304428544!=2778392858
        at 
org.apache.solr.handler.SnapPuller$FileFetcher.cleanup(SnapPuller.java:1126)
        
zookeeper.log on the current replica:
2013-01-22 11:53:17,456 [myid:2] - WARN  
[NIOServerCxn.Factory:server2/120.8.195.39:2188:NIOServerCnxn@349] - caught end 
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x23c61926cad0002, likely client has closed socket
        at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:662)
2013-01-22 11:53:17,459 [myid:2] - INFO  
[NIOServerCxn.Factory:server2/120.8.195.39:2188:NIOServerCnxn@1001] - Closed 
socket connection for client /120.8.195.39:47642 which had sessionid 
0x23c61926cad0002
2013-01-22 11:53:17,459 [myid:2] - DEBUG 
[NIOServerCxn.Factory:server2/120.8.195.39:2188:NIOServerCnxn@1025] - ignoring 
exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
        ...

After that, neither Solr server answers any request.
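
To see how often and for how long a Solr node loses contact with ZooKeeper, the 
server.log can be scanned for the "have not heard from server in ...ms" 
messages shown above. Here is a minimal sketch in Java. The class name 
ZkTimeoutScanner and the threshold parameter are hypothetical, and the pattern 
only covers the message wording seen in our logs.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: extracts the silence durations reported by the ZooKeeper client.
final class ZkTimeoutScanner {

    private static final Pattern SILENCE =
            Pattern.compile("have not heard from server in (\\d+)ms");

    // Returns every reported silence duration, in ms, in order of appearance.
    static List<Long> silencesMs(String logText) {
        List<Long> out = new ArrayList<Long>();
        Matcher m = SILENCE.matcher(logText);
        while (m.find()) {
            out.add(Long.parseLong(m.group(1)));
        }
        return out;
    }

    // Counts the durations strictly above the given threshold.
    static int countAbove(List<Long> silencesMs, long thresholdMs) {
        int count = 0;
        for (long s : silencesMs) {
            if (s > thresholdMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to the line taken from our server.log.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "INFO: Client session timed out, have not heard from server in 62416ms for";
        List<Long> silences = silencesMs(text);
        System.out.println("silences (ms): " + silences);
        System.out.println("above 60000 ms: " + countAbove(silences, 60000L));
    }
}
```

Comparing the timestamps of these messages with GC activity on the same node 
may help separate long JVM pauses from real network problems.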

In some other tests run with other JVM parameters, no logs like these appear, 
but there are some messages like:
INFO: zkClient has disconnected
After a short time, an OutOfMemoryError (Java heap space) occurs:
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap 
space
        at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:469)
Or :
SEVERE: null:java.lang.IllegalStateException: this writer hit an 
OutOfMemoryError; cannot commit
        at 
org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2717)
