>-----Original Message-----
>From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
>Sent: Monday, September 20, 2010 6:08 PM
>To: Artur Linhart - Linux communication
>Cc: 'Ian Campbell'; 596...@bugs.debian.org
>Subject: Re: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64:
>causes a system hangup by the shutdown of the system, aacraid (sw raid)
>involved in hangup)
>
>> So, it worked if I have specified in Dom0 in the "baloon" mode by omitting
>> the specification of dom0_mem or, if dom0_mem is specified then also the
>> swiotlb=65536 must be specified.
>
>Wow. That implies that AACRAID uses quite a lot of buffers, and looking at the
>driver there are a bunch of quirks where it can only do DMA up to 2GB, so that
>would explain why it relies on SWIOTLB that much.
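
For a rough sense of scale of the swiotlb=65536 setting quoted above, here is
a minimal Python sketch of the arithmetic. It assumes that the swiotlb= value
is a count of slabs and that each slab is 2 KiB (IO_TLB_SHIFT = 11, as in
mainline Linux); I have not checked this against the Debian Xen kernel in
question, and the function names are made up for illustration only.

```python
# Assumption: one SWIOTLB slab is 2 KiB (IO_TLB_SHIFT = 11 in mainline Linux).
# This has not been verified against the Debian 2.6.32-5-xen-amd64 kernel.
IO_TLB_SHIFT = 11


def swiotlb_bytes(nslabs: int) -> int:
    """Bounce-buffer size in bytes for a given swiotlb=<nslabs> value."""
    return nslabs << IO_TLB_SHIFT


def swiotlb_mib(nslabs: int) -> float:
    """Same size expressed in MiB."""
    return swiotlb_bytes(nslabs) / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical comparison: the value from this report vs. half of it.
    for nslabs in (32768, 65536):
        print(f"swiotlb={nslabs}: {swiotlb_mib(nslabs):.0f} MiB")
```

Under those assumptions, swiotlb=65536 would correspond to about 128 MiB of
bounce buffers, which is twice the mainline default of 64 MiB (32768 slabs).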

Unfortunately, I did not try raising dom0_mem above 2 GB :-(.

>
>Based on what Ian analyzed it really looks that we just ran out of DMA buffers
>and the driver didn't try to retry but just bails out.
>
>We can narrow down who is using so many buffers by using the attached debug
>module that when loaded will print out who is using what buffers if
>CONFIG_DMA_API_DEBUG=y is set.
>
>But the proper workaround is the one you discovered - either raise the SWIOTLB
>buffer or raise the memory allocated for Dom0.
>
>>
>> I have noticed one interesting behavior - during the successfull suspension
>> of the domains during the shutdown the first one which is beeing suspended
>> writes very fast three "dots", then it stops to write the dots for some time
>> and then agfter some time very fast a lot of (possibly also all remaining)
>> "dots" are written on the screen. By the next suspensions the suspension
>> works continuously dot-by-dot smoothly without any delays. It looks like it
>> waits for something during the first suspension (memory allocation?).
>
>That usually means that is stuck waiting for the disks to write out all the
>data.

OK, I thought so too, but when I omitted dom0_mem or specified the higher
swiotlb value, this behaved differently, and I think it should behave the
same way, shouldn't it? At least that is what I would guess...

>>
>> Generally, it is for me very surpsrising, how the aacraid module works, I am
>> no C or kernel developer but I would expect something like this cannot
>> happen - the module should allocate its necessary memory in the start or, I
>> would understand there can fail some specific read or write operation if the
>> sw raid has not enough memory to execute them, but I would never expect this
>> will lead to the hangup and freeze of the whole system. The probability of
>
> Well, to be honest, we engineers aren't known for testing all of the failure
> paths as well as we should.
> That is why folks like you are quite helpful in finding
> bugs :-)

I am always very pleased to have the chance to help all of you, who are doing
such a great job, with at least some small piece of work, even if it cost me
an unexpectedly large amount of time :-)

In the meantime I have started using HW RAID on that server instead of SW
RAID, for other reasons. But for now I still have the HDD with the SW RAID
configuration, so I would be able to test something if you have any ideas or
want me to test something concrete on my configuration. If not, I want to
remove the software RAID completely sometime next week because I need this
HDD, so let me know before then if there is something you would need tested.
I do not know how difficult it would be for you to reproduce the error on
other machine(s). I think it should not be too difficult, but who knows....