Bruce Allen wrote:
Hi Gerry,
Areca replacement; RAID rebuild (usually successful); backup; replacement
of the Areca with a 3Ware controller or a CoRAID (or JetStor) shelf;
creation of a new RAID instance; restore from backup.
Let's just say we lost confidence.
I understand. Was this with 'current generation' controllers and
firmware or was this two or three years ago? It's my impression that
(when used with compatible drives and drive backplanes) the latest
generation of Areca hardware is quite stable.
Within the last year with current drivers. The possibilities include a
pair of bad drives that failed at the same time, anomalies caused by a
near melt-down in our campus data center (ambient temps spiked past
130F), and almost any other random event that could have inflicted
itself on a large number of continuously rotating drives.
My sample's too small to indict Areca, but when we had two failures in a
10-day period, coincidence was enough to overwhelm any failure-statistics
analysis...
gerry
Bruce Allen wrote:
What was needed to fix the systems? Reboot? Hardware replacement?
On Wed, 16 Apr 2008, Gerry Creager wrote:
We've had two fail rather randomly. The failures did cause disk
corruption, but it wasn't of the undetected/undetectable sort. They
started throwing errors to syslog, then fell over and stopped
accessing disks.
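For illustration, here is a minimal sketch of the kind of syslog watching
that catches this sort of controller trouble early. The log path and the
error patterns (including the arcmsr tag used by the Linux Areca driver)
are assumptions for the example, not what these particular systems logged:

# scan_syslog.py -- rough sketch of scanning a system log for RAID
# controller and disk I/O errors. Adjust LOG_PATH and PATTERNS for your
# distribution and driver; these values are illustrative assumptions.
import re
import sys

LOG_PATH = "/var/log/messages"   # some distros use /var/log/syslog instead
PATTERNS = re.compile(
    r"arcmsr|I/O error|Buffer I/O error|SCSI error|medium error",
    re.IGNORECASE,
)

def scan(path: str = LOG_PATH) -> int:
    """Print matching lines and return how many were found."""
    hits = 0
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                hits += 1
                print(line.rstrip())
    return hits

if __name__ == "__main__":
    count = scan(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH)
    sys.exit(0 if count == 0 else 1)

In practice you would run something like this from cron or a log-watching
daemon rather than by hand; the point is only that these failures were
noisy in the logs rather than silent.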
gerry
Bruce Allen wrote:
Hi Gerry,
So far the only problem we have had is with one Areca card that had
a bad 2GB memory module. This generated lots of (correctable)
single bit errors but eventually caused real problems. Could you
say something about the reliability issues you have seen?
Cheers,
Bruce
On Wed, 16 Apr 2008, Gerry Creager wrote:
We've used AoE (CoRAID hardware) with pretty good success (modulo
one RAID shelf fire that was caused by a manufacturing defect and
dealt with promptly by CoRAID). We've had some reliability issues
with Areca cards but no data corruption on the systems we've built
that way.
gerry
Bruce Allen wrote:
Hi Xavier,
PPS: We've also been doing some experiments with putting
OpenSolaris+ZFS on some of our generic (Supermicro + Areca)
16-disk RAID systems, which were originally intended to run
Linux.
I think that DESY found some data corruption with such a
configuration, so they switched to OpenSolaris+ZFS.
I'm confused. I am also talking about OpenSolaris+ZFS. What
did DESY try, and what did they switch to?
Sorry, I was indeed not clear. As far as I know, DESY found data
corruption using Linux and Areca cards. They moved from Linux to
OpenSolaris and ZFS, which avoided further corruption. This has been
discussed in the HEPiX storage working group. However, I cannot speak
on their behalf at all. I'll try to put you in touch with someone more
aware of this issue, as my statements lack hard figures.
I think that would be very interesting to the entire Beowulf
mailing list, so please suggest that they respond to the entire
group, not just to me personally. Here is an LKML thread about
silent data corruption:
http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
So far we have not seen any signs of data corruption on
Linux+Areca systems (and our data files carry both internal and
external checksums, so we would be sensitive to this).
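For concreteness, a minimal sketch of what an external-checksum
verification pass might look like. The use of SHA-256 and the
".sha256" companion-file layout are assumptions for illustration, not
the actual scheme described above:

# verify_checksums.py -- minimal sketch of external-checksum verification.
# Hypothetical layout: each data file "foo.dat" has a companion
# "foo.dat.sha256" holding the hex digest recorded when it was written.
import hashlib
import sys
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path) -> bool:
    """Compare the file's current digest against the recorded one."""
    recorded = Path(str(path) + ".sha256").read_text().split()[0]
    actual = sha256_of(path)
    if actual != recorded:
        print(f"MISMATCH {path}: expected {recorded}, got {actual}")
        return False
    return True

if __name__ == "__main__":
    bad = [p for p in map(Path, sys.argv[1:]) if not verify(p)]
    sys.exit(1 if bad else 0)

Reading in chunks keeps memory use flat on multi-GB files, and the exit
status lets a batch job flag any silent corruption it turns up.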
Cheers,
Bruce
--
Gerry Creager -- [EMAIL PROTECTED]
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf