Hi all, I hope you can suggest a solution to a problem where our shards
start up with the wrong data directory.

 

We have 2 physical servers using Solr 4.3.0, each with 24 separate
tomcat instances (RedHat 6.4, java 1.7.0_10-b18, tomcat 7.0.34) with a
solr shard in each. This configuration means that each shard has its own
data directory declared. (Server OS, tomcat and solr, including shards,
created via automated builds.) 

 

That is, for example,

- tomcat instance, /var/local/tomcat/solrshard3/, port 8985

- corresponding solr instance, /usr/local/solrshard3/, with
/usr/local/solrshard3/collection1/conf/solrconfig.xml

- corresponding solr data directory,
/var/local/solrshard3/collection1/data/
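
The layout above follows one naming rule per shard, so the expected paths
can be derived from the shard number. A minimal Python sketch, assuming
every shard follows the same pattern as the shard 3 example; the function
name `shard_layout` and its dictionary keys are hypothetical and only for
illustration (the port is left out because only the port of shard 3 is
given here):

```python
from pathlib import PurePosixPath


def shard_layout(n):
    """Return the expected paths for shard number n.

    Assumes the naming pattern of the shard 3 example applies to all shards.
    """
    name = f"solrshard{n}"
    solr_home = PurePosixPath("/usr/local") / name
    return {
        "tomcat_base": PurePosixPath("/var/local/tomcat") / name,
        "solr_home": solr_home,
        "solrconfig": solr_home / "collection1" / "conf" / "solrconfig.xml",
        "data_dir": PurePosixPath("/var/local") / name / "collection1" / "data",
    }


if __name__ == "__main__":
    for key, path in shard_layout(3).items():
        print(f"{key:12} {path}")
```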

 

We process ~1.5 billion documents, which is why we use 48 shards (24
leaders, 24 replicas). These physical servers are rebooted regularly to
fsck their drives. After a reboot, we always see several (~10-20) shards
fail to start: the UI cloud view shows them as 'Down' or 'Recovering',
and they never recover without intervention. There is no pattern to
which shards fail; we haven't recorded any that always or never fail.
On inspection, the UI dashboard for these failed shards displays, for
example:

- Host      Server1

- Instance  /usr/local/solrshard3/collection1

- Data      /var/local/solrshard6/collection1/data

- Index     /var/local/solrshard6/collection1/data/index

That is, the shard 3 instance has come up pointing at the data directory
of shard 6.
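
A mismatch like this could also be detected automatically after a reboot
rather than by eye in the dashboard. The following is a rough Python
sketch, not something we run today. It assumes the CoreAdmin STATUS
response (requested with wt=json) lists `instanceDir` and `dataDir` for
each core under a top-level `status` key, and that Solr is deployed under
the `/solr` context path; the function names are hypothetical.

```python
import json
import re
import urllib.request

SHARD_RE = re.compile(r"solrshard(\d+)")


def shard_number(path):
    """Extract the shard number from a path such as /usr/local/solrshard3/..."""
    match = SHARD_RE.search(path or "")
    return int(match.group(1)) if match else None


def find_mismatches(cores):
    """Return (core, instanceDir, dataDir) for cores whose two paths disagree.

    `cores` maps core name to a dict with 'instanceDir' and 'dataDir',
    the shape assumed for the 'status' section of CoreAdmin STATUS.
    """
    bad = []
    for name, info in cores.items():
        inst = shard_number(info.get("instanceDir"))
        data = shard_number(info.get("dataDir"))
        # Flag paths we cannot parse as well as outright disagreements.
        if inst is None or data is None or inst != data:
            bad.append((name, info.get("instanceDir"), info.get("dataDir")))
    return bad


def fetch_cores(host, port):
    """Fetch core status from one Tomcat instance (assumed /solr context)."""
    url = f"http://{host}:{port}/solr/admin/cores?action=STATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["status"]
```

For example, `find_mismatches(fetch_cores("Server1", 8985))` would report
the shard shown above, since its instance path names shard 3 while its
data path names shard 6.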

 

To fix such failed shards, I manually restart the shard leader and
replicas, which resolves the issue. However, of course, I would like a
permanent fix for this rather than a workaround.

 

We use a separate ZooKeeper service, spread across 3 virtual machines
within our private network of ~200 servers (physical and virtual).
Network traffic is constant but light relative to the 1GB bandwidth
available.

 

Any advice or suggestions greatly appreciated.

Gil

 

Gil Hoggarth

Web Archiving Engineer

The British Library, Boston Spa, West Yorkshire, LS23 7BQ

 
