Hi,

I'd like your advice on this backup plan.
It's my first Solr deployment (4.0).

Production consists of 1 master and n frontend slaves
placed in different datacenters, replicating through HTTP.
Only the master is backed up; frontend slaves can die at any time
or go stale for a while, and that's fine.

Backups are performed daily. The steps are as follows (rough Python
sketch of the whole job after the list):

1) Ping the /replication handler with command=backup and numberToKeep=3,
   and verify that we get status=0 back

2) Check the replication handler with command=details and verify that
   we get a "snapshotCompletedAt". If not, spin and wait for it.

3) Once the snapshot is complete, rsync --delete everything to a
   different volume on the same host, to keep a complete archived
   *local* copy should the index SSD drive fail.

4) Once the rsync is finished, a standby machine downloads the archived
   copy from the master and rebuilds everything under a "restore" core.

5) The new "restore" core is started up through the /admin/cores handler
   (action=CREATE)

6) Nagios checks that we can query the restore core correctly
   and get back at least a document from it.
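
To make this concrete, here's roughly what the daily job boils down
to, as one Python sketch. Hostnames, core names and paths are all
made up, and in reality step 3 runs on the master while steps 4-6 run
on the standby:

#!/usr/bin/env python3
import json
import subprocess
import time
import urllib.request

MASTER = "http://master:8983/solr/collection1"  # made-up master core URL
STANDBY = "http://standby:8983/solr"            # made-up standby host

def solr_get(url):
    # GET a Solr URL (wt=json) and return the parsed response
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# 1) trigger a snapshot, keeping the last three
r = solr_get(MASTER + "/replication?command=backup&numberToKeep=3&wt=json")
assert r["responseHeader"]["status"] == 0, "backup command failed"

# 2) poll command=details until snapshotCompletedAt shows up; the JSON
#    shape of details.backup varies, so just look for the key anywhere
for _ in range(60):
    d = solr_get(MASTER + "/replication?command=details&wt=json")
    if "snapshotCompletedAt" in json.dumps(d):
        break
    time.sleep(10)
else:
    raise RuntimeError("snapshot never completed")

# 3) archive the snapshot to a second local volume (made-up paths)
subprocess.run(["rsync", "-a", "--delete",
                "/var/solr/data/", "/mnt/archive/solr/"], check=True)

# 4-5) on the standby, after pulling the archive into instanceDir
#      "restore", bring the core up via CoreAdmin
r = solr_get(STANDBY + "/admin/cores"
             "?action=CREATE&name=restore&instanceDir=restore&wt=json")
assert r["responseHeader"]["status"] == 0, "core CREATE failed"

# 6) the Nagios-style check: at least one document must come back
q = solr_get(STANDBY + "/restore/select?q=*:*&rows=1&wt=json")
assert q["response"]["numFound"] > 0, "restore core is empty"

The assertions are where the real job alerts; everything else is glue.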

In this way, I get:
- 3 (numberToKeep) quick snapshots taken by Solr itself; older ones
  are discarded automatically
- 1 full index copy on a secondary volume
- 1 "offsite" copy on another machine
- a daily automated restore that verifies that our backup is valid

It's been running reliably for a week or so now,
but surely someone out there must have done this before.

Did I miss something?

-- 
Cosimo
