On Friday, 19 March 1999 at 23:30:44 -0600, Russell Neeper wrote: > I decided to give vinum a try a few days ago and ran into the same > problem as Vallo - after using it for a short period of time it caused > a kernel panic due to a page fault. > > I spent some time with kgdb today and believe that I have found the bug.
This message came in literally two seconds after I send a reply to Vallo about the bug: Mar 20 16:00:54 allegro sendmail[7086]: QAA07084: to=<va...@matti.ee>, delay=00:00:07, xdelay=00:00:07, mailer=esmtp, relay=solaris.matti.ee. [194.126.98.135], stat=Sent (HAA07473 Message accepted for delivery) Mar 20 16:00:56 allegro sendmail[7087]: QAA07087: from=<russ...@newnet.tamu.edu>, size=5117, class=0, pri=35117, nrcpts=1, msgid=<19990319233044.a10...@net.tamu.edu>, bodytype=8BITMIME, proto=ESMTP, relay=newnet.tamu.edu [128.194.177.50] Mar 20 16:00:57 allegro sendmail[7088]: QAA07087: to=<g...@lemis.com>, delay=00:00:04, xdelay=00:00:01, mailer=local, stat=Sent Anyway, wonderful! Exactly right. I'm very impressed that you found it at effectively the same time as me. More comments further down. > On Thu, Mar 18, 1999 at 01:31:23PM +1030, Greg Lehey wrote: >> This is a problem I've seen before, but it completely baffles me. The >> request passed to launch_requests (frame 10) has been deallocated. >> Some of the debug code I put in caught it: >> >> (kgdb) p freeinfo[7] >> $2 = { >> time = { >> tv_sec = 921669613, >> tv_usec = 289712 >> }, >> seq = 24, >> size = 36, >> line = 174, >> address = 0xf0a3cb00 >> "ÞÀÞh\235\"ðÞÀÞÀÈ£ðÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞÞÀÞ", >> file = "vinuminterrupt.c" >> } >> >> This was called from freerq, which frees the complete request. freerq >> is called from only four places: one on completion of the request >> (which in this case is just about to be started), one if the request >> is aborted (which also sets bp->b_error, which is not set here), once >> in a read context (which is not applicable here: it's a write), and >> once just before the call to launch_requests in frame 11: > > The best that I can tell, the problem is with the first call that you > listed: "on completion of the request". The function 'complete_rqe' is > called asynchronously by an interrupt at the completion of the I/O > request. Correct. >> So where is this coming from? I'm completely baffled. It doesn't >> happen to most people, though I have had reports of one or two other >> cases. About the only clue is that the problem didn't occur when I >> removed the debug memory allocator, but I don't know whether it went >> away or into hiding. I'd really like to find out what's going on >> here. > > I think that removing the debug memory allocator just made it go into > hiding because it changed the timing of the code. Possibly. It would work in the right direction. > Freeing the request structure in the interrupt routine is causing a > race condition in the function 'launch_requests'. Interrupts must > be disabled around any and all code which refers to the request > chain and this wasn't being done. I have created a patch that seems > to fix the problem. However, there could be other places in the > code that refers to the request chain without disabling interrupts. > After looking at it for only a few hours, I'm not familiar enough > with it to tell. This is the only place. > Here's the patch: > > diff -u vinum/vinumrequest.c vinum-mod/vinumrequest.c > --- vinum/vinumrequest.c Thu Mar 18 20:21:46 1999 > +++ vinum-mod/vinumrequest.c Fri Mar 19 22:55:49 1999 > @@ -258,13 +258,8 @@ > biodone(bp); > freerq(rq); > return -1; > - } { /* XXX */ > - int result; > - int s = splhigh(); > - result = launch_requests(rq, reviveok); /* now start the > requests if we can */ > - splx(s); > - return result; > } > + return launch_requests(rq, reviveok); /* now start the requests > if we can */ > } else > /* > * This is a write operation. We write to all > @@ -366,6 +361,7 @@ > if (debug & DEBUG_LASTREQS) > logrq(loginfo_user_bpl, rq->bp, rq->bp); > #endif > + s = splbio(); > for (rqg = rq->rqg; rqg != NULL; rqg = rqg->next) { /* > through the whole request chain */ > rqg->active = rqg->count; /* they're all > active */ > rq->active++; /* one more > active request group */ > @@ -396,13 +392,13 @@ > logrq(loginfo_rqe, rqe, rq->bp); > #endif > /* fire off the request */ > - s = splbio(); > (*bdevsw[major(rqe->b.b_dev)]->d_strategy) (&rqe->b); > - splx(s); > } > /* XXX Do we need caching? Think about this more */ > } > } > + splx(s); > + > return 0; > } > > I remove the splhigh/splx from around the first call of launch_requests > because, as far as I can tell, it became redundant after adding > splbio/splx around the for loop in the launch_requests function. Yup, sloppy coding on my part. I had suspected a problem in the area, so I put that in to see if it would fix it. It seemed to, but I had only protected one of the two calls (the read call). *sigh*. It's interesting to note that this happened reliably and repeatedly on a system with IDE drives; in order for it to happen, the I/O request (there was only one) had to complete before the top half got back to the calling function. I suspect there might be a problem in that area. Here's an overview of the stages of the request. The format of this list is described in vinum(8) under the info -V command, though this comes from gdb (thus the unconverted times). Time Event Buf Dev Offset Bytes SD SDoff Doffset Goffset 921802787.741353 1VS Write 0xf1569f90 0x5b00 0xe0 8192 921802787.741381 2LR Write 0xf1569f90 0x5b00 0xe0 8192 921802787.741385 3RQ Write 0xf1569f90 0xc 0x1e9 8192 0 e0 0 0 921802787.742031 4DN Write 0xf1569f90 0xc 0x1e9 8192 0 e0 0 0 In other words, the request completed after 650 µs: it was issued (3RQ) at 741385 µs past the second, and completed at 742031 µs (4DN). Still, it's puzzling to know why the IDE driver didn't return control to the caller during that time. Before you ask, yes, it was PIO, but that shouldn't cause the top half to hold on to the caller. Greg -- See complete headers for address, home page and phone numbers finger g...@lemis.com for PGP public key To Unsubscribe: send mail to majord...@freebsd.org with "unsubscribe freebsd-current" in the body of the message