In message <a291c24c-9d7c-4e79-ad03-68ed910fc...@yahoo.com>, Mark Millard writes:
> [This just puts my prior reply's material into Cy's
> adjusted resend of the original. The To/Cc should
> be complete this time.]
> 
> On Apr 12, 2023, at 22:52, Cy Schubert <cy.schub...@cschubert.com> wrote:
> 
> > In message <c8e4a43b-9fc8-456e-adb3-13e7f40b2...@yahoo.com>, Mark Millard
> > writes:
> >> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> >> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> >> 
> >>> Charlie Li wrote:
> >>>> Mateusz Guzik wrote:
> >>>>> can you please test poudriere with
> >>>>> https://github.com/openzfs/zfs/pull/14739/files
> >>>>> 
> >>>> After applying, on the md(4)-backed pool regardless of block_cloning,
> >>>> the cy@ `cp -R` test reports no differing (i.e. corrupted) files. Will
> >>>> report back on poudriere results (no block_cloning).
> >>>> 
> >>> As for poudriere, build failures are still rolling in. These are (and
> >>> have been) entirely random on every run.
> >>> Some examples from this run:
> >>> 
> >>> lang/php81:
> >>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
> >>>   ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf ${STAGEDIR}/${PREFIX}/etc
> >>> - consumers fail to build due to corrupted php.conf packaged
> >>> 
> >>> devel/ninja:
> >>> - phase: stage
> >>> - install -s -m 555
> >>>   /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
> >>>   /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> >>> - consumers fail to build due to corrupted bin/ninja packaged
> >>> 
> >>> devel/netsurf-buildsystem:
> >>> - phase: stage
> >>> - mkdir -p
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> >>>   for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig
> >>>   Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
> >>>   cp makefiles/$M
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; \
> >>>   done
> >>> - graphics/libnsgif fails to build due to NUL characters in
> >>>   Makefile.{clang,subdir}, causing nothing to link
> >> 
> >> Summary: I have problems building ports into packages
> >> via poudriere-devel use despite being fully updated/patched
> >> (as of when I started the experiment), never having enabled
> >> block_cloning (still using openzfs-2.1-freebsd).
> >> 
> >> In other words, I can confirm other reports that have
> >> been made.
> >> 
> >> The details follow.
> >> 
> >> 
> >> [Written as I was working on setting up for the experiments
> >> and then executing those experiments, adjusting as I went
> >> along.]
> >> 
> >> I've run my own tests in a context that has never had the
> >> zpool upgrade and that jumped from before the openzfs import to
> >> after the existing commits for trying to fix openzfs on
> >> FreeBSD. I report on the sequence of activities getting to
> >> the point of testing as well.
> >> 
> >> By personal policy I keep my (non-temporary) pools compatible
> >> with what the most recent ??.?-RELEASE supports, using
> >> openzfs-2.1-freebsd for now. The pools involved below have
> >> never had a zpool upgrade from where they started. (I've no
> >> pools that have ever had a zpool upgrade.)
> >> 
> >> (Temporary pools are rare for me, such as in this investigation.
> >> But I'm not testing block_cloning or anything new this time.)
> >> 
> >> I'll note that I use zfs for bectl, not for redundancy. So
> >> my evidence is more limited in that respect.
> >> 
> >> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> >> The system has and supports ECC RAM; 64 GiBytes of RAM are
> >> present.
> >> 
> >> I started by duplicating my normal zfs environment to an
> >> external USB3 NVMe drive and adjusting the host name and such
> >> to produce the below. (Non-debug, although I do not strip
> >> symbols.):
> >> 
> >> # uname -apKU
> >> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
> >> main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
> >> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082
> >> 
> >> I then did: git fetch, stash push ., merge --ff-only, stash apply .:
> >> my normal procedure. I then also applied the patch from:
> >> 
> >> https://github.com/openzfs/zfs/pull/14739/files
> >> 
> >> Then I did: buildworld, buildkernel, installed them, and rebooted.
> >> 
> >> The result was:
> >> 
> >> # uname -apKU
> >> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91
> >> main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023
> >> root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086
> >> 
> >> The later poudriere-devel based build of packages from ports is
> >> based on:
> >> 
> >> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> >> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
> >> Author: John Baldwin <j...@freebsd.org>
> >> Commit: John Baldwin <j...@freebsd.org>
> >> CommitDate: 2023-03-25 00:06:40 +0000
> >> branch: main
> >> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> >> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> >> n613214 (--first-parent --count for merge-base)
> >> 
> >> poudriere attempted to build 476 packages, starting
> >> with pkg (in order to build the 56 that I explicitly
> >> indicate that I want). It is my normal set of ports.
> >> The form of building is biased to allowing a high
> >> load average compared to the number of hardware
> >> threads (same as cores here): each builder is allowed
> >> to use the full count of hardware threads. The build
> >> used USE_TMPFS="data" instead of the USE_TMPFS=all I
> >> normally use on the build machine involved.
> >> 
> >> And it produced some random errors during the attempted
> >> builds. A type of example that is easy to interpret
> >> without further exploration is:
> >> 
> >> pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse
> >> error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
> >> 
> >> A fair number of errors are of the form: the build
> >> installs a previously built package for use in the
> >> builder, but later the builder cannot find some file
> >> from the package's installation.
> >> 
> >> Another error reported was:
> >> 
> >> ld: error: /usr/local/lib/libblkid.a: unknown file type
> >> 
> >> For reference:
> >> 
> >> [main-CA72-bulk_a-default] [2023-04-12_20h45m32s] [committing:] Queued: 476 Built: 2… Tobuild: 0 Time: 00:37:52
> >> 
> >> I started another build that tried to build 224 packages:
> >> the 11 that failed and the 213 that were skipped.
> >> 
> >> Just 1 package built that failed before:
> >> 
> >> [00:04:58] [09] [00:04:15] Finished databases/sqlite3@default | sqlite3-3.41.0_1,1: Success
> >> 
> >> It seems to be the only one where the original failure was not
> >> an example of complaining about the missing/corrupted content
> >> of a package install used for building. So it is an example
> >> of randomly varying behavior.
> >> 
> >> That, in turn, allowed:
> >> 
> >> [00:04:58] [01] [00:00:00] Building security/nss | nss-3.89
> >> 
> >> to build, but everything else failed or was skipped.
> >> 
> >> The sqlite3 vs. other failure difference suggests that writes
> >> have random problems but later reads reliably see the problem
> >> that resulted (before the content is deleted).
> >> 
> >> 
> >> After the above:
> >> 
> >> # zpool status
> >>   pool: zroot
> >>  state: ONLINE
> >> config:
> >> 
> >> 	NAME        STATE     READ WRITE CKSUM
> >> 	zroot       ONLINE       0     0     0
> >> 	  da0p8     ONLINE       0     0     0
> >> 
> >> errors: No known data errors
> >> 
> >> # zpool scrub zroot
> >> # zpool status
> >>   pool: zroot
> >>  state: ONLINE
> >>   scan: scrub repaired 0B in 00:16:25 with 0 errors on Wed Apr 12 22:15:39 2023
> >> config:
> >> 
> >> 	NAME        STATE     READ WRITE CKSUM
> >> 	zroot       ONLINE       0     0     0
> >> 	  da0p8     ONLINE       0     0     0
> >> 
> >> errors: No known data errors
> >> 
> >> 
> >> ===
> >> Mark Millard
> >> marklmi at yahoo.com
> > 
> > 
> > Let's try this again. Claws-mail didn't include the list address in the
> > header. Trying to reply, again, using exmh instead.
> > 
> > 
> > Did your pools suffer the EXDEV problem? The EXDEV also corrupted files.
> 
> As I reported, this was a jump from before the import
> to as things are tonight (here). So: NO, unless the
> existing code as of tonight still has the EXDEV problem!
> 
> Prior to this experiment I'd not progressed any media
> beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
> 
> > I think, without sufficient investigation, we risk jumping to
> > conclusions. I've taken an extremely cautious approach, rolling back
> > snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
> > corruption was encountered.
> 
> Again: nothing between main-n261544-cee09bda03c8-dirty and
> main-n262122-2ef2c26f3f13-dirty was involved at any stage.
> 
> > 
> > I did not roll back any snapshots in my MH mail directory. Rolling back
> > snapshots of my MH maildir would result in loss of email. I have to
> > live with that corruption. Corrupted files in my outgoing sent email
> > directory remain:
> > 
> > slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> > 53
> > slippy$ 
> > 
> > There are 53 corrupted files in my note log of 9913 emails. Those files
> > will never be fixed. They were corrupted by the EXDEV bug. No new ZFS
> > code or ZFS patches can retroactively remove the corruption from those
> > files.
> > 
> > But my poudriere files, because the snapshots were rolled back, were
> > "repaired" by the rolled-back snapshots.
> > 
> > I'm not convinced that there is presently active corruption, since
> > the problem has been fixed. I am convinced that whatever corruption
> > was written at the time will remain forever, or until those files
> > are deleted or replaced -- just like my email files written to disk at
> > the time.
> 
> My test results and procedure just do not fit your conclusion
> that things are okay now if block_cloning is completely avoided.
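As a cross-check on counts like the `ugrep -cPa '\x00' ... | grep -c :1` pipeline above, here is a rough Python sketch of the same idea: count the files in a directory that contain at least one NUL byte, the corruption signature seen in both the mail files and the poudriere failures. All paths and file contents below are throwaway stand-ins, not the actual maildir.

```python
import os
import tempfile

def count_corrupted(maildir):
    """Count files that contain at least one NUL byte."""
    corrupted = 0
    for name in sorted(os.listdir(maildir)):
        path = os.path.join(maildir, name)
        with open(path, "rb") as f:
            if b"\x00" in f.read():
                corrupted += 1
    return corrupted

# Demo against throwaway files standing in for a real maildir
maildir = tempfile.mkdtemp()
with open(os.path.join(maildir, "1"), "wb") as f:
    f.write(b"Subject: ok\n\nbody\n")
with open(os.path.join(maildir, "2"), "wb") as f:
    f.write(b"Subject: damaged\n\n" + b"\x00" * 8)
print(count_corrupted(maildir))  # -> 1
```

Unlike the grep pipeline, this counts a file once no matter how many NUL runs it contains, which is what the "53 corrupted files" figure is after.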
Admitting I'm wrong: sending copies of my last reply to you back to
myself, again and again, three times, I've managed to reproduce the
corruption you are talking about. From my previous email to you:

header. Trying to reply
^^^^^^^^^

Here it is, nine additional bytes of garbage. In another instance
about 500 bytes were removed.

I can reproduce the corruption at will now. The EXDEV patch is
applied. Block_cloning is disabled.

> 
> ===
> Mark Millard
> marklmi at yahoo.com


-- 
Cheers,
Cy Schubert <cy.schub...@cschubert.com>
FreeBSD UNIX:  <c...@freebsd.org>   Web:  https://FreeBSD.org
NTP:           <c...@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0
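For pinpointing exactly where a resent copy diverges from the original (the duplicated nine-byte run shown above, or a removed ~500-byte run), a small sketch using Python's difflib can report each inserted or deleted byte run. The strings here are invented stand-ins for the actual mail text, chosen only to mimic the duplicated-prefix symptom.

```python
import difflib

def divergences(original, resent):
    """Return (op, bytes) for each run where resent differs from original."""
    sm = difflib.SequenceMatcher(None, original, resent, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":
            out.append((op, resent[j1:j2]))      # bytes added in the copy
        elif op == "delete":
            out.append((op, original[i1:i2]))    # bytes lost from the copy
        elif op == "replace":
            out.append((op, resent[j1:j2]))      # bytes rewritten in the copy
    return out

original = b"header. Trying to reply, again, using exmh instead."
mangled = b"header. Theader. Trying to reply, again, using exmh instead."
print(divergences(original, mangled))  # -> [('insert', b'header. T')]
```

Run against the real before/after files, this would show whether each corruption event is an insertion, a deletion, or an overwrite, which may help characterize the bug.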