Hi Ted,

Thanks for the reply; see my responses inline below:

Excerpts from Theodore Tso's message of Sat Nov 14 16:21:51 -0500 2009:

> Do you have the full e2fsck transcript (it looks like what you
> submitted to BTS was only a partial transcript)?

Unfortunately, no. Although we were running this fsck within a screen'd
serial console, the output vastly exceeded the scrollback buffer of both
the xterm and the screen process. The fsck itself took a very long time
(it's a terabyte drive), so I was following it for some time and paid
very close attention to what I saw. Everything was typical up to the
point where it started asking me the questions I included in the bug
report. Those questions were the bulk of the log, the same question
repeated thousands of times. Then, twice, the PROGRAMMING BUG message
appeared in the middle of those questions, and I managed to capture
that part of the log.

> Also, can you tell me something about the files which got the
> PROGRAMMING BUG error?  It would be useful to see the pathname and
> inode breakdown of the inode(s) in question.  For example, for inode
> 223806323, the following debugfs commands will give the pathname and
> inode:
>
> % debugfs /dev/mapper/vg_hoopoe0-backups
> debugfs: ncheck <223806323>
> debugfs: stat <223806323>
> debugfs: quit

Sure, I would be happy to do this. However, this will have to wait until
the non-destructive read/write tests we have been doing on the drive
finish. As of this writing we are here:

75.30% done, 138:02:35 elapsed

Once it finishes, I'll provide this additional information.

> The other thing might be worth trying is re-running e2fsck and see
> what you see, via "e2fsck -f /dev/mapper/vg_hoopoe0-backups".  The
> PROGRAMMING BUG error can also result by having a hard drive returning
> different data when a particular inode table block is read at
> different times.  So if there is something flakey in your storage
> device --- for example, if you have a RAID 1 setup, and the two
> mirrors aren't synchronized, it could be that e2fsck would read from
> disk #1 during pass 1, and then later when pass #4, if the disk read
> comes from disk #2 returns different data, you will also get the
> PROGRAMMING BUG error.

I will also do as you suggest when I can. The system is *not* set up
with a RAID 1 configuration, but it does have file-system encryption set
up via dmcrypt, with the LVM layer on top of it.

> It should also be the case after a single run of e2fsck, if all
> answers are answered with 'yes', that a subsequent run of e2fsck
> should find no problems.  This, of course, is assuming that there are
> no e2fsck bugs and that storage device is reliable.  (That is, data
> written to a block will be read back when the block is read, and data
> read from a block at time T and data read time T+n will be the same,
> if there are no intervening writes to that block.)

Sounds reasonable. The odd thing about this particular system is that
this is the second time we have needed to do a fsck of this type on it
after a routine Debian kernel security upgrade. We aren't exactly sure
what is going on here, whether it's the disk that has an issue, the
controller, the memory, or something else. We have a typical burn-in
process to weed out bad memory before we deploy boxes (memtest86+ plus
some cpuburns or kernel compiles), but things can always change. This is
why we are doing the non-destructive read/write tests on the drive right
now. After they complete, I can obtain the information you have
requested, and perhaps we will attempt some stress tests.

thanks,
micah