Launchpad has imported 23 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=429755.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2008-01-22T21:31:23+00:00 Thom wrote:

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) 
Gecko/20071127 Firefox/2.0.0.11

Description of problem:
CommuniGate Systems is reporting this case on behalf of two CommuniGate/RedHat 
customers running RedHat Enterprise Server 5.1 and seeing some problems related 
to file integrity on this platform. We have two customers who upgraded their 
CommuniGate Pro cluster nodes to RedHat 5.1, from an earlier RHES 4.1 version. 
In both these cases, the kernel reportedly in use is this: 2.6.18-53.el5

We also have reports of a possibly identical problem with a customer
running this kernel version, though we don't have the specifics of the
Linux OS version: kernel version 2.6.18-8.1.8

In both of these cases, the customers began to get what appear to be
"null bytes" in mailboxes. I will attach a PNG screenshot of one of
these mailboxes, as seen with vi.

The mount options used are:
tcp,rsize=32768,wsize=32768,hard,intr,timeo=600,bg,retrans=2,noatime
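
When comparing the requested options above with what a client actually reports in "mount -v" (one customer's output is quoted just below), it can help to parse both strings and diff the option names. The following is a minimal sketch in Python; parse_mount_options is a hypothetical helper name, and the two strings are simply copied from this report, so they may not come from the same machine.

```python
from typing import Dict, Optional


def parse_mount_options(text: str) -> Dict[str, Optional[str]]:
    """Parse "a,b=1,c" (optionally wrapped in parentheses) into a dict.

    Flags without a value (e.g. "hard") map to None.
    """
    options: Dict[str, Optional[str]] = {}
    for item in text.strip().strip("()").split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else None
    return options


# Option string as given in this report.
requested = parse_mount_options(
    "tcp,rsize=32768,wsize=32768,hard,intr,timeo=600,bg,retrans=2,noatime"
)
# Option list as printed by "mount -v" for one customer.
reported = parse_mount_options(
    "(rw,nfsvers=3,proto=tcp,rsize=32768,wsize=32768,timeo=600,"
    "hard,intr,bg,acregmax=6,addr=172.30.35.5)"
)

# Option names that appear on only one side of the comparison.
only_requested = sorted(set(requested) - set(reported))
only_reported = sorted(set(reported) - set(requested))

if __name__ == "__main__":
    print("only in requested options:", only_requested)
    print("only in mount -v output:  ", only_reported)
```

Note that the comparison is by option name only, so "tcp" and "proto=tcp" show up as different names even though they request the same transport.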

The output of "mount -v" for one of these customers showed the following:
>>>172.30.35.5:/vol/CGPweb on /CGPweb type nfs
>>>(rw,nfsvers=3,proto=tcp,rsize=32768,wsize=32768,timeo=600,hard,intr,bg,acregmax=6,addr=172.30.35.5)

The exact operating system in use was:
>>>Linux MSA 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 
>>>GNU/Linux Red Hat Enterprise Linux 5.

When I last researched this in detail, it appeared that the byte offsets
and total sizes were still correct after the null bytes were inserted;
only the contents of those bytes were replaced with null characters. So,
it appeared at first glance that it was a 1:1 replacement of valid data
with "corruption"-type data of some sort.

When analyzing CommuniGate Pro logs (which report the file sizes and
offsets of all messages), we found two types of symptoms:

1. a missing message with null bytes in its place (1:1 replacement of 
message bytes with NULL bytes)
2. no missing message, but null bytes between/within two messages, with 
some indication that parts of some messages are missing (and replaced with NULL 
bytes)

A key question that is not 100% clearly answered is whether there is any
indication of additional bytes ever being added, or if the null bytes
are simply byte-replacement of data. From the available evidence, it
appears that there is just 1:1 byte replacement.
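
One way to gather evidence on this question is to list every run of null bytes in a damaged file with its offset and length, and then compare those ranges against the offsets and sizes recorded in the application logs. The following is a minimal sketch in Python; find_null_runs is a hypothetical helper name and is not part of any CommuniGate tool.

```python
from typing import Iterator, Tuple


def find_null_runs(data: bytes, min_length: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) for each run of consecutive 0x00 bytes."""
    i = 0
    n = len(data)
    while i < n:
        # Jump to the next null byte, if any.
        start = data.find(b"\x00", i)
        if start == -1:
            return
        # Extend the run to the first non-null byte (or end of data).
        end = start
        while end < n and data[end] == 0:
            end += 1
        if end - start >= min_length:
            yield start, end - start
        i = end


if __name__ == "__main__":
    sample = b"ab\x00\x00\x00cd\x00"
    for offset, length in find_null_runs(sample):
        print(f"null run at offset {offset}, length {length}")
```

If each reported run lines up exactly with a logged message range, and the total file size still matches the logs, that would be consistent with 1:1 replacement rather than insertion.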

Also, while this is not 100% confirmed, CommuniGate Pro appears to be
getting the correct byte offsets from the file system, as noted in the
"math" parts of this document. This would suggest a problem that is
different from the previous pre-2.6.13 Linux NFS kernel problem. Also,
the time interval between some of the events that insert null bytes is
rather large, often 10+ minutes between events.

Of these two customers, both went back to RedHat 4.5, and the problem
immediately disappeared. We have many customers running RedHat ES 4.5 
successfully. RedHat versions earlier than this still have a different NFS 
kernel bug which can cause serious problems in a CommuniGate Pro Dynamic 
Cluster when NFS-based. A few years ago, we discovered a Linux
kernel bug in NFS client handling (specifically related to file-size
caching), which was fixed by Trond Myklebust at NetApp; these fixes were
put into the 2.6.13 and 2.6.14 kernels. If interested, you can read more
about this requirement here:
https://support.communigate.com/tickets/kb_article.php?ref=2908-TIOL-4737

Duplicating this problem will be challenging, though we believe possible
using a Dynamic Cluster on RedHat 5 under relatively high load. We would be 
glad to work with RedHat to try to replicate the issue. Because our customers have
since rolled back to RedHat 4.5, we don't have any customers actively
using RedHat 5 within an NFS-based cluster, to our knowledge. If we could get 
access to RHES 5 with the latest patches, we would also be glad to begin trying 
to replicate this problem in-house. We would need two RedHat 5 servers running 
with an NFS-based storage backend - we have the equipment available, but would 
need to get the latest RedHat 5 software.

We realize that reporting this bug with only partial evidence is
difficult. However, we felt it would be better to report the possible
bug and discover if there were possibly known causes, or if other RedHat
customers are experiencing anything similar. We are not aware of any
other CommuniGate customers running Linux-based NFS-based Dynamic
Clusters having this problem, including quite a few who run more recent
kernels.

Version-Release number of selected component (if applicable):
kernel-2.6.18-53.el5

How reproducible:
Didn't try


Steps to Reproduce:
The basic file access flow used here would be

1. Have two or more NFS clients mount the same logical volume.
2. Have one NFS client modify a non-binary, text file, using C++ operations 
such as lseek(), write(), and fsync() (all filehandles are properly fsync()'d 
when closed by an NFS client)
3. No less than 6 seconds later, have a second NFS client open the same file, 
modify it (lseek/write/fsync).
4. Repeat steps 2-3 repeatedly.

At some point in this file access pattern, null bytes may be inserted
into these files.
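
As a rough illustration of the per-client step in the flow above, the following is a minimal sketch in Python, whose os module exposes the same POSIX calls (lseek, write, fsync). The function name append_record and the file name are hypothetical; this is not CommuniGate Pro code and not a confirmed reproducer. In a real test, each NFS client would run it against the same file on the shared mount, at least 6 seconds apart.

```python
import os
import tempfile


def append_record(path: str, payload: bytes) -> int:
    """Append payload at the current end of file and fsync it.

    Returns the byte offset at which the payload was written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Seek to end-of-file; the offset reflects this client's view of the size.
        offset = os.lseek(fd, 0, os.SEEK_END)
        # Write the whole payload, handling short writes.
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Flush the data to stable storage before the handle is closed.
        os.fsync(fd)
        return offset
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Local demonstration only: a temporary file stands in for the shared NFS file.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "mailbox.mbox")
        print(append_record(demo, b"From client-a\nhello\n\n"))
        print(append_record(demo, b"From client-b\nworld\n\n"))
```

The returned offsets can be logged on each client and later compared with the positions of any null bytes found in the file.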

Actual Results:
We will attempt to do so, though we would like to request temporary access to 
the latest appropriate versions of RedHat Enterprise 5 in order to test.

Expected Results:
Files should be written without null bytes inserted.

Additional info:

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/0

------------------------------------------------------------------------
On 2008-01-22T21:45:57+00:00 Thom wrote:

Created attachment 292558
vi session screenshot of mailbox showing null bytes in message file

Please note that the screenshot has been modified to obfuscate potentially
proprietary data. These fields are clearly designated with red boxes.

This screenshot is quite representative of the problem, demonstrating how
entire or partial sections of messages are replaced with null bytes, in what
appear to be 1:1 byte replacements. The following message is properly
delineated in the message "mbox" file with a new special From line:

>From <...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/1

------------------------------------------------------------------------
On 2008-01-22T23:24:05+00:00 Wendy wrote:

Few questions:

1. What did these corrupted files look like from the server end? 
2. As stated in the problem statement, "RHEL 4.5" doesn't have this issue but 
   it is not clear when the platform was moved back to RHEL 4.5, was the 
   server moved too ? What are the OSs running on server with and without 
   the problem ? What is server's filesystem (GFS, ext3, etc) ?

Intuitively, if file size and offset are correct but file contents were 
partially filled with "NULL" characters, it normally implies the file 
space was allocated but the file contents are not there. We need to isolate 
whether this is really an NFS client issue as stated or a server 
(nfsd and/or filesystem) issue.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/2

------------------------------------------------------------------------
On 2008-01-22T23:34:10+00:00 Thom wrote:

The NAS servers reportedly in use here are the following:

1. Customer A: NetApp 3020c running OnTAP v7.2.2
2. Customer B: NetApp 3020c

So, both are using NetApp.

We have contacts at NetApp if they should be brought into the discussion.
However, one point that may not yet be in the case notes, but should be, is
that the "null bytes" problem disappeared when Customer A went to a single NFS
client (taking one CommuniGate Pro "Backend Server" offline). It is only with
two or more NFS clients (Backend Servers) online that the problem can occur.

NetApp uses a proprietary filesystem called "WAFL". I am unsure whether NetApp
can provide filesystem/shell-level access to WAFL directly from their NAS
device, but it may be possible.

If we can attempt to reproduce these tests in-house, we do have a NetApp device
on which to try this; although the model number and NetApp OS version may
differ, and would need to be researched.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/3

------------------------------------------------------------------------
On 2008-01-23T14:53:47+00:00 Wendy wrote:

ok, thanks ! Was wondering whether GFS clusters were involved. With above
info, I would say this does look like an NFS client issue at this moment.
Info will be passed to Red Hat NFS kernel folks.


Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/4

------------------------------------------------------------------------
On 2008-01-23T22:58:58+00:00 Tom wrote:

Thanks for the detailed BZ.

> Duplicating this problem will be challenging, though we believe
> possible ...

I am making arrangements to make RHEL 5.1 available to you. If you can reproduce
the problem that will be the first step. 

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/5

------------------------------------------------------------------------
On 2008-02-01T14:45:58+00:00 Jeff wrote:

Yes, thanks for the detailed bug report. I've looked over this and have
a question:

2. Have one NFS client modify a non-binary, text file, using C++ operations such
as lseek(), write(), and fsync() (all filehandles are properly fsync()'d when
closed by an NFS client)
3. No less than 6 seconds later, have a second NFS client open the same file,
modify it (lseek/write/fsync).

is there any sort of fcntl locking going on here? You don't mention any so I
assume not...

Would it be possible for you to write a small reproducer program and give us a
set of steps to duplicate this? Trying to troubleshoot this in the context of an
MTA is going to be tricky. It'll be much easier if we can reduce the reproducer
down.


Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/6

------------------------------------------------------------------------
On 2008-02-01T16:01:24+00:00 Thom wrote:

Ideally we would, yes, write such a program. We do not have one currently for
this particular problem.

For the previous NFS-related "file-caching" bug (fixed in the 2.6.13/2.6.14
kernels by Trond Myklebust), we did produce such a tool. However, that tool does
not appear to trigger this new problem.

I hope to be trying to reproduce this issue next week. If we can do so reliably,
we can write such an application.

Sincerely,
 -t

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/7

------------------------------------------------------------------------
On 2008-02-05T12:35:10+00:00 Jeff wrote:

Excellent. I'll set this to NEEDINFO for now. Go ahead and set it back to
ASSIGNED once you have more info to go on.


Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/8

------------------------------------------------------------------------
On 2008-02-08T01:24:06+00:00 Trond wrote:

Created attachment 294295
Trivial testcase that should demonstrate the problem

Instructions for the trivial testcase script:

Please edit the variables 'filename' and 'remote' depending on your
test environment.
The testcase should be run on NFS client number 1. '$remote' is another
NFS client that shares the same NFS namespace (and has access to the file
${filename})

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/9

------------------------------------------------------------------------
On 2008-02-08T01:25:29+00:00 Trond wrote:

Created attachment 294296
NFS: Fix a potential file corruption issue when writing

Proposed fix for the bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/10

------------------------------------------------------------------------
On 2008-02-08T11:26:34+00:00 Jeff wrote:

Thanks, Trond. Let me see what we can do about getting this in soon.


Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/11

------------------------------------------------------------------------
On 2008-02-08T11:51:22+00:00 Jeff wrote:

Yep, the reproducer here is indeed trivial and consistently fails. The patch
seems to fix it and looks sane.


Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/12

------------------------------------------------------------------------
On 2008-02-08T20:28:51+00:00 RHEL wrote:

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/13

------------------------------------------------------------------------
On 2008-02-08T20:30:43+00:00 Jeff wrote:

I've got some test kernels on my people page with this patch. Thom, would you be
able to test your product on them and let me know if they correct the issue?
Note that these kernels are based on development builds and aren't fully QA'ed,
so please only deploy them for testing purposes...

http://people.redhat.com/jlayton


Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/14

------------------------------------------------------------------------
On 2008-02-13T16:52:27+00:00 Thom wrote:

Excellent work all around, thank you folks. Thank you, Trond. I hope to test
with the new kernel today. 

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/15

------------------------------------------------------------------------
On 2008-02-13T20:28:50+00:00 Jeff wrote:

Excellent. I've just posted a new set of test kernels on my people page
(jtltest.20). I'd recommend using those instead of any earlier ones since those
kernels should also have the fixes for the vmsplice() local exploit that was
disclosed recently.

Let me know how it goes.


Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/16

------------------------------------------------------------------------
On 2008-02-13T20:51:49+00:00 Don wrote:

in 2.6.18-81.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/17

------------------------------------------------------------------------
On 2008-02-15T16:34:55+00:00 Thom wrote:

I have confirmed that the new kernel 2.6.18-81.el5 looks to have eliminated the
null bytes problem, when using the testcase that Trond provided.

I am also running a "SPECmail" test on this environment today, with the new
kernel in place, in order to verify proper behaviour under higher load. Thanks,
sincerely, -t

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/18

------------------------------------------------------------------------
On 2008-02-16T00:20:15+00:00 Thom wrote:

More supporting detail - I ran two SPECmail tests this afternoon. Each test used
2 Linux NFS client servers, attached to a shared NFS storage volume. The kernels
used were as follows, along with the results:

Test tool: SPECmail (spec.org) v1.01
Server application: CommuniGate Pro 5.2.0 x86-64, 0+2 Dynamic Cluster
NFS server: NetApp FAS270
OS: RedHat 5.1 x86_64 [RHEL5.1-Server-20071017.0-x86_64-DVD.iso]

Test 1:
Kernel vmlinuz-2.6.18-53.el5
Resulted in null bytes in CommuniGate Pro "mailbox" files
(I will attach a sample mailbox file demonstrating the null bytes.)

Test 2:
Kernel vmlinuz-2.6.18-81.el5
Resulted in no null bytes in mailboxes
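
A check of this kind can be automated after each test run by walking the mailbox directory tree and counting null bytes per file. The following is a minimal sketch in Python, not the procedure actually used for the tests above; the function names, the ".mbox" suffix, and the directory layout are assumptions for illustration.

```python
import os
from typing import Dict


def count_nulls_in_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Count 0x00 bytes in a file, reading it in fixed-size chunks."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\x00")
    return total


def scan_tree(root: str, suffix: str = ".mbox") -> Dict[str, int]:
    """Map each file under root ending in suffix to its null-byte count.

    Only files with at least one null byte are included in the result.
    """
    hits: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                nulls = count_nulls_in_file(path)
                if nulls:
                    hits[path] = nulls
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_nulls.py /path/to/mailbox/root
    for path, nulls in sorted(scan_tree(sys.argv[1]).items()):
        print(f"{nulls:10d}  {path}")
```

An empty result for a tree written under one kernel, and a non-empty result for the same workload under another, would match the outcome reported for Test 2 and Test 1 respectively.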

Thank you, please let me know if there are any questions. Sincerely.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/19

------------------------------------------------------------------------
On 2008-02-16T00:22:57+00:00 Thom wrote:

Created attachment 295054
INBOX mailbox with 1 message replaced with null bytes

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/20

------------------------------------------------------------------------
On 2008-02-28T23:13:43+00:00 Mike wrote:

Verified based on customer's report as well as the testcase.


Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/21

------------------------------------------------------------------------
On 2008-05-21T15:07:14+00:00 errata-xmlrpc wrote:


An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199037/comments/28


** Changed in: linux (Fedora)
   Importance: Unknown => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/199037

Title:
  Null bytes in files access by 2 or more NFS clients

Status in linux package in Ubuntu:
  Fix Released
Status in linux package in Fedora:
  Fix Released

Bug description:
  There's a bug in the linux NFS client where it's possible to corrupt
  files when the server is a NetApp filer and two (or more) clients have
  write access to the file. A good example of such a file is
  ~/.zhistory. This has been fixed upstream and on RHEL5:

  http://lkml.org/lkml/2008/2/7/642 (second changeset)

  and discussed here:

  https://bugzilla.redhat.com/show_bug.cgi?id=429755

  Please apply to the Hardy kernel, and possibly others, thanks!
