On 12/4/2012 1:55 AM, Andy D'Arcy Jewell wrote:
Is there an easy way to tell (say from a shell script) when "all commits and merges [are] complete"?
One important bit of information I just thought of: A default Solr 4 config uses a new directory implementation called NRTCachingDirectory, which in some circumstances may keep part of the newest segment(s) in RAM. I *hope* that issuing an explicit hard commit will flush that to disk, but I am not sure. You *might* need to switch to the old directory implementation to be sure a hardlink backup is complete. Can one of the committers please comment on this? Assuming that we work out this detail, the rest of what I've said will be valid.
Detecting when commits are done has to be coordinated with your indexing program, so depending on how your system works, kicking off the "make hardlinks" process might need to be part of your indexing program. As for merges, that's a bit tougher, because Solr 4 and later will do merges in the background after informing your indexing program that the commit is complete.
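A minimal sketch of how the indexing side might sequence this: issue an explicit hard commit, then hardlink the index files. The function names, the Solr core URL, and the directory paths are all hypothetical, and whether the commit fully flushes NRTCachingDirectory is still the open question raised above.

```shell
#!/bin/sh
# Hypothetical helpers; the URL and paths are placeholders, not from this thread.

# Ask Solr for an explicit hard commit through the update handler.
hard_commit() {
    solr_core_url=$1    # e.g. http://localhost:8983/solr/collection1 (assumed)
    curl -sf "$solr_core_url/update?commit=true" >/dev/null
}

# Hardlink every regular file in the index directory into a snapshot directory.
# Both directories must live on the same filesystem, or ln will fail.
make_hardlink_snapshot() {
    index_dir=$1
    snapshot_dir=$2
    mkdir -p "$snapshot_dir" || return 1
    for f in "$index_dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and empty globs
        ln "$f" "$snapshot_dir/" || return 1
    done
}
```

An indexing job could then finish with something like `hard_commit "$core_url" && make_hardlink_snapshot "$index_dir" "$snapshot_dir"`, so the snapshot is only taken after Solr has acknowledged the commit.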
If you grab a hardlink copy while a merge is happening, I do not believe it will be corrupt in any way, but it may be larger than expected because it will contain the new segments from the merge. Those segments would not be referenced by the segments_N file, so I *think* that if you then load that index into Solr, it would ignore them. I am not sure about that, though. You might be able to use a command like the following to delete the newer files from the copy, but I would not do it without experimenting first to be sure it's actually required and that it never wipes out anything you actually need:
find . -type f -newer segments.gen -print0 | xargs -0 rm -f
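Before deleting anything, a dry run that only lists the candidate files makes that experimentation safer. This is a sketch; the function name and its argument (the path to the hardlink copy) are hypothetical.

```shell
# Dry run: print the files newer than segments.gen without deleting them.
# index_copy is a hypothetical path to the hardlink copy, not the live index.
list_newer_than_segments_gen() {
    index_copy=$1
    find "$index_copy" -type f -newer "$index_copy/segments.gen"
}
```

Only once that list looks right on a few test copies would it make sense to pipe the same selection into a delete.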
If I keep a replica solely for backup purposes, I assume I can "do what I like with it" - presumably replication will resume/catch-up when I resume it (I admit, I have a bit of reading to do wrt replication - I just skimmed that because it wasn't in my initial brief).
As long as the replica server isn't being actively updated or used for queries and you temporarily turn off replication, you should be able to do whatever you want with its index.
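One way to turn replication off temporarily is the replication handler's disablepoll and enablepoll commands, sent to the replica. The sketch below assumes the replica's core is reachable at a URL you pass in; the function names and any host or core name you use with them are placeholders.

```shell
# Build a replication handler URL from a core URL and a command name.
replication_url() {
    printf '%s/replication?command=%s\n' "$1" "$2"
}

# Stop the replica from polling the master (send this to the replica).
disable_polling() {
    curl -sf "$(replication_url "$1" disablepoll)" >/dev/null
}

# Resume polling once the backup work is finished.
enable_polling() {
    curl -sf "$(replication_url "$1" enablepoll)" >/dev/null
}
```

A backup script would call `disable_polling` with the replica's core URL, take its copy, then call `enable_polling` so the replica catches up.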
I'm assuming that because you're using hardlinks, Solr writes a "new" file when it updates (sort of copy-on-write style)? So we are relying on the principle that as long as at least one reference to the data remains, it's not deleted...
Yes. Lucene (which Solr uses under the hood) never touches segment files once they have been written. It only deletes segment files in two circumstances: 1) Every document in that segment has been deleted from the index. 2) The data in that segment has been written to a new segment. The combination of Lucene's update method and hardlink functionality will ensure that the hardlink copy is always good.
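The filesystem behaviour this relies on is easy to check with a throwaway file. In this sketch the function name is hypothetical and the file names are only stand-ins for a segment file and its hardlinked copy:

```shell
# Show that data stays readable through a hardlink after the original name is removed.
demo_hardlink_survival() {
    dir=$(mktemp -d) || return 1
    echo "segment data" > "$dir/_0.fdt"      # stand-in for a Lucene segment file
    ln "$dir/_0.fdt" "$dir/_0.fdt.backup"    # second name for the same inode
    rm "$dir/_0.fdt"                         # like Lucene dropping a merged segment
    cat "$dir/_0.fdt.backup"                 # the data is still there
    rm -rf "$dir"
}
```

Calling `demo_hardlink_survival` prints `segment data`, because the blocks are only freed when the last link to them goes away.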
Thanks, Shawn