----- Original Message ----- From: "Steven Hartland"
I've been investigating an issue for a user whose pool import hung after upgrading on FreeBSD. After digging around it turned out the issue was due to lack of free space on the pool. As the pool imports it writes, hence requiring space, but the pool had so little space that this was failing. The IO being a required IO, it retries, but obviously fails again, resulting in the pool being suspended, hence the hang. With the pool suspended during import it still holds the pool lock, so all attempts to query the status also hang, which is one problem in itself: the user can't tell why the hang has occurred.

During the debugging I mounted the pool read-only and sent a copy to another, empty pool, which resulted in ~1/2 capacity being recovered. This seemed odd, but I dismissed it at the time. The machine was then left with the pool not being accessed; however, I just received an alert from our monitoring for a pool failure. On looking I now see the new pool I created has 2 write errors and no free space. So just having the pool mounted, with no access happening, has managed to use the remaining 2GB on the 4GB pool.

Has anyone seen this before, or has any ideas what might be going on?

zdb -m -m -m -m <pool> shows allocations tied to transactions, e.g.:

	metaslab    100   offset   c8000000   spacemap   1453   free        0
	            segments          0   maxsize        0   freepct    0%
	In-memory histogram:
	On-disk histogram:              fragmentation 0
	[     0]    ALLOC: txg 417, pass 2
	[     1]    A  range: 00c8000000-00c8001600  size: 001600
	[     2]    ALLOC: txg 417, pass 3
	[     3]    A  range: 00c8001600-00c8003a00  size: 002400
	[     4]    ALLOC: txg 418, pass 2
	[     5]    A  range: 00c8003a00-00c8005000  size: 001600
	[     6]    ALLOC: txg 418, pass 3
	[     7]    A  range: 00c8005000-00c8006600  size: 001600
	[     8]    ALLOC: txg 419, pass 2
	[     9]    A  range: 00c8006600-00c8007c00  size: 001600
	[    10]    ALLOC: txg 419, pass 3

I tried destroying the pool and that hung too, presumably due to IO being suspended after the out-of-space errors.
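For anyone wanting to retrace the recovery step described above, the read-only copy looked roughly like this. This is a sketch, not the exact commands run: the source pool name "zfs" and snapshot name are taken from later in this message, while "newpool" is a hypothetical name for the scratch pool.

```shell
#!/bin/sh
# Sketch of the read-only recovery path. Assumes the stuck pool is
# named "zfs" and an empty scratch pool "newpool" already exists.
# Wrapped in a function so nothing executes just by sourcing this.
recover_readonly() {
    # A read-only import avoids the required write that otherwise
    # fails on the full pool and leaves it suspended mid-import.
    zpool import -o readonly=on zfs

    # Replicate everything under zfs/ROOT to the empty pool; on the
    # affected system this recovered roughly half the capacity.
    zfs send -R zfs/ROOT@auto-2014-09-19_22.30 | zfs recv -duF newpool
}
```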
After bisecting the kernel changes, the commit which seems to be causing this is:

https://svnweb.freebsd.org/base?view=revision&revision=268650
https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b7781941f0d5d1

which correlates to:

https://github.com/illumos/illumos-gate/commit/7fd05ac4dec0c343d2f68f310d3718b715ecfbaf

I've checked that the two make the same changes, so there doesn't seem to have been a downstream merge issue, at least not in this specific commit.

My test now consists of:
1. mdconfig -t malloc -s 4G -S 512
2. zpool create tpool md0
3. zfs recv -duF tpool < test.zfs
4. zpool list -p -o free zfs 5

With this commit present, free reduces every 5 seconds until the pool is out of space. Without it, after at most 3 reductions the pool settles and no further free-space reduction is seen. I've also found that creating the pool without async_destroy enabled also prevents the issue.

An image that shows the final result of the leak can be found here:

http://www.ijs.si/usr/mark/bsd/

On FreeBSD this image stalls on import unless imported read-only. Once imported, I used the following to create the test image used above:

zfs send -R zfs/ROOT@auto-2014-09-19_22.30 > test.zfs

Copying in the illumos zfs list to get more eyeballs, given it seems to be a quite serious issue.

Regards
Steve
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
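The numbered test steps above can be wrapped into a small watch loop. This is a hedged sketch of that procedure, not the exact script used: it assumes FreeBSD with mdconfig(8), a test.zfs stream in the current directory, and polls the tpool created in step 2; leak_delta is a hypothetical helper added here for the arithmetic.

```shell
#!/bin/sh
# Sketch of the repro from the numbered steps. Wrapped in functions so
# nothing executes just by sourcing this file.

# Pure helper (hypothetical): bytes of free space lost between two
# parseable (-p) readings from `zpool list`.
leak_delta() {
    echo $(($1 - $2))
}

repro_watch() {
    mdconfig -t malloc -s 4G -S 512       # step 1: creates md0
    zpool create tpool md0                # step 2
    zfs recv -duF tpool < test.zfs        # step 3
    prev=$(zpool list -Hp -o free tpool)  # -H drops the header line
    while :; do                           # step 4, by hand: poll every 5s
        sleep 5
        cur=$(zpool list -Hp -o free tpool)
        echo "free shrank by $(leak_delta "$prev" "$cur") bytes"
        prev=$cur
    done
}
```

With the offending commit present, the printed delta stays positive every interval until the pool is out of space; without it, the deltas drop to zero after a few iterations.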
