Hi Alex, thanks for the feedback.

I still have the original data, so that is not a problem right now. What
worries me is whether I can trust the system even if I restore the data
now. I am using RAID5 and the discs are new. I formatted the disc space on
Thursday, so the file system is new as well.

What I found in syslog on the front end is:

mptbase: ioc0: LogInfo(0x11080000): F/W: Outbound DMA Overrun

and I get that a few times. So either the controller on the front end has a
problem, which I did not see with the older Infortrend box because it is
slower and hence the controller is less active, or the controller in the
Infortrend box has a problem.

I don't know whether the Infortrend box does scrubbing. I have not activated
anything here and I am just using the standard settings.

Regarding ZFS: is that available for Linux now? I have lost track a bit here.

All the best from London

Jörg

On Sunday 21 September 2014 you wrote:
> Hi Jörg,
>
> Sounds like a "typical" but very uncommon silent data corruption problem.
> If you have another copy of the data, compare to that? If you don't have
> another copy, accept the fact that some of your data maybe got silently
> corrupted.
>
> Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
> that?
>
> For the new system, consider using ZFS pointed at plain disks, as it may
> have more layers of checksums compared to your current system.
>
> Regards,
> Alex
>
> On Sunday, September 21, 2014, Jörg Saßmannshausen <
> j.sassmannshau...@ucl.ac.uk> wrote:
> > Dear all,
> >
> > I have a rather strange problem with one of my file servers, which I
> > recently upgraded in order to accommodate more disc space.
> >
> > The problem: I have copied the files from the old file space to a
> > temporary disc storage space using this rsync command:
> >
> > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> >
> > I have been doing this for some years now and never had any problems.
> >
> > As always, I am running md5sum afterwards to be sure there is not a
> > problem later and the user is not losing data. This time around, the
> > md5sum failed for a rather large file (around 16 GB) after I moved the
> > files from the temp space back to the new destination using the same
> > command as above.
> >
> > Still having access to the old file space, I decided to move this file
> > from the old file space. Strangely enough, rsync does not sync the file
> > again, so I had to delete the file. Even after deleting the file and
> > re-syncing it from the old source, the md5sum is wrong.
> >
> > Copying the file to a different file space did not cause this problem,
> > i.e. the md5sum is correct.
> > As it is a tar.gz file, I simply decided to decompress the original file
> > on the different file server. That worked. The file where the md5sum is
> > wrong did not decompress on the different file server but crashed with
> > an error message when I executed gunzip. So the file is broken.
> >
> > The setup:
> >
> > Originally I was using an old Infortrend box which had old PATA discs in
> > it. This box is connected via SCSI to a frontend server which exports
> > the file space via iSCSI. The backend for that, i.e. the one the user is
> > accessing, is on a different physical machine and it is a XEN guest. The
> > reason behind that setup is that the frontend is acting as a backup
> > server and I don't want people to have access to it.
> > I then exchanged the Infortrend box with a more recent model which has
> > SATA capabilities but still has a SCSI connection to the frontend. The
> > frontend is the same. I got a new controller for that box as the old one
> > was broken. There are no changes in the backend; that is still the same
> > XEN guest on the same hardware.
> >
> > What I cannot work out is why the old Infortrend box does not have any
> > problems with the new file, while the newer one has a problem here.
> > Also, when I copied over some files (again using the rsync command
> > above), a few files did not copy correctly (again according to md5sum)
> > at the first attempt but did so later.
> >
> > I find that highly alarming, as that means that at least for larger
> > and/or some binary files there seems to be a problem. However, I am not
> > sure where to look, as I am out of ideas.
> >
> > Could it be there is a problem with the 'new' controller?
> > In all cases I was using ext4 as a file system and I did not have any
> > problems with that.
> >
> > Anybody got some sentiments here?
> >
> > All the best from a sunny London
> >
> > Jörg
> >
> > P.S. To make things worse, I am off on a work-related trip from Monday
> > onwards and I have been working on that problem since Friday evening.
> >
> > --
> > *************************************************************
> > Dr. Jörg Saßmannshausen, MRSC
> > University College London
> > Department of Chemistry
> > Gordon Street
> > London
> > WC1H 0AJ
> >
> > email: j.sassmannshau...@ucl.ac.uk
> > web: http://sassy.formativ.net
> >
> > Please avoid sending me Word or PowerPoint attachments.
> > See http://www.gnu.org/philosophy/no-word-attachments.html

--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ

email: j.sassmannshau...@ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
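
P.S. For anyone wanting to repeat the check described in this thread (copy with rsync, then compare md5sums of source and copy), here is a minimal Python sketch. It is not the procedure I actually ran, only an illustration: the function names `md5_of` and `compare_trees` are made up for this example, and it assumes both trees are mounted on the machine running the script. Note also that a file read back right after copying may be served from the page cache rather than from the discs, so a clean result directly after the copy is not necessarily proof that the data on disc is good.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that a 16 GB file does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """Return relative paths whose copy is missing or has a different MD5.

    Hypothetical helper for illustration: walks every regular file under
    source_root and compares it with the file at the same relative path
    under copy_root.
    """
    source_root, copy_root = Path(source_root), Path(copy_root)
    mismatches = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        copy_file = copy_root / relative
        if not copy_file.is_file() or md5_of(source_file) != md5_of(copy_file):
            mismatches.append(relative.as_posix())
    return mismatches
```

Calling `compare_trees("/path/to/old", "/path/to/new")` (both paths are placeholders) returns a list of relative file names that differ or are missing; an empty list means every source file has a copy with a matching checksum. Running it several times over the same trees may help to tell a file that is consistently bad on disc from one that reads differently on each pass.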
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf