In message <a291c24c-9d7c-4e79-ad03-68ed910fc...@yahoo.com>, Mark Millard writes:
> [This just puts my prior reply's material into Cy's
> adjusted resend of the original. The To/Cc should
> be complete this time.]
> 
> On Apr 12, 2023, at 22:52, Cy Schubert <cy.schub...@cschubert.com> wrote:
> 
> > In message <c8e4a43b-9fc8-456e-adb3-13e7f40b2...@yahoo.com>, Mark Millard
> > writes:
> >> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> >> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> >> 
> >>> Charlie Li wrote:
> >>>> Mateusz Guzik wrote:
> >>>>> can you please test poudriere with
> >>>>> https://github.com/openzfs/zfs/pull/14739/files
> >>>>> 
> >>>> After applying, on the md(4)-backed pool regardless of block_cloning,
> >>>> the cy@ `cp -R` test reports no differing (i.e. corrupted) files. Will
> >>>> report back on poudriere results (no block_cloning).
> >>>> 
> >>> As for poudriere, build failures are still rolling in. These are (and
> >>> have been) entirely random on every run.
> >>> Some examples from this run:
> >>> 
> >>> lang/php81:
> >>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
> >>>   ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf ${STAGEDIR}/${PREFIX}/etc
> >>> - consumers fail to build due to corrupted php.conf packaged
> >>> 
> >>> devel/ninja:
> >>> - phase: stage
> >>> - install -s -m 555
> >>>   /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
> >>>   /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> >>> - consumers fail to build due to corrupted bin/ninja packaged
> >>> 
> >>> devel/netsurf-buildsystem:
> >>> - phase: stage
> >>> - mkdir -p
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> >>>   for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig
> >>>   Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
> >>>   cp makefiles/$M
> >>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; \
> >>>   done
> >>> - graphics/libnsgif fails to build due to NUL characters in
> >>>   Makefile.{clang,subdir}, causing nothing to link
> >> 
> >> Summary: I have problems building ports into packages
> >> via poudriere-devel use despite being fully updated/patched
> >> (as of when I started the experiment), never having enabled
> >> block_cloning (still using openzfs-2.1-freebsd).
> >> 
> >> In other words, I can confirm other reports that have
> >> been made.
> >> 
> >> The details follow.
> >> 
> >> 
> >> [Written as I was working on setting up for the experiments
> >> and then executing those experiments, adjusting as I went
> >> along.]
> >> 
> >> I've run my own tests in a context that has never had the
> >> zpool upgrade and that jumped from before the openzfs import to
> >> after the existing commits for trying to fix openzfs on
> >> FreeBSD. I report on the sequence of activities getting to
> >> the point of testing as well.
> >> 
> >> By personal policy I keep my (non-temporary) pools compatible
> >> with what the most recent ??.?-RELEASE supports, using
> >> openzfs-2.1-freebsd for now. The pools involved below have
> >> never had a zpool upgrade from where they started. (I've no
> >> pools that have ever had a zpool upgrade.)
> >> 
> >> (Temporary pools are rare for me, such as in this investigation.
> >> But I'm not testing block_cloning or anything new this time.)
> >> 
> >> I'll note that I use zfs for bectl, not for redundancy. So
> >> my evidence is more limited in that respect.
> >> 
> >> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> >> The system has and supports ECC RAM; 64 GiBytes of RAM are
> >> present.
> >> 
> >> I started by duplicating my normal zfs environment to an
> >> external USB3 NVMe drive and adjusting the host name and such
> >> to produce the below. (Non-debug, although I do not strip
> >> symbols.):
> >> 
> >> # uname -apKU
> >> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
> >> main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
> >> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082
> >> 
> >> I then did: git fetch, stash push ., merge --ff-only, stash apply .:
> >> my normal procedure. I then also applied the patch from:
> >> 
> >> https://github.com/openzfs/zfs/pull/14739/files
> >> 
> >> Then I did: buildworld, buildkernel, installed them, and rebooted.
> >> 
> >> The result was:
> >> 
> >> # uname -apKU
> >> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91
> >> main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023
> >> root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086
> >> 
> >> The later poudriere-devel based build of packages from ports is
> >> based on:
> >> 
> >> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> >> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
> >> Author: John Baldwin <j...@freebsd.org>
> >> Commit: John Baldwin <j...@freebsd.org>
> >> CommitDate: 2023-03-25 00:06:40 +0000
> >> branch: main
> >> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> >> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> >> n613214 (--first-parent --count for merge-base)
> >> 
> >> poudriere attempted to build 476 packages, starting
> >> with pkg (in order to build the 56 that I explicitly
> >> indicate that I want). It is my normal set of ports.
> >> The form of building is biased to allowing a high
> >> load average compared to the number of hardware
> >> threads (same as cores here): each builder is allowed
> >> to use the full count of hardware threads. The build
> >> used USE_TMPFS="data" instead of the USE_TMPFS=all I
> >> normally use on the build machine involved.
> >> 
> >> And it produced some random errors during the attempted
> >> builds. A type of example that is easy to interpret
> >> without further exploration is:
> >> 
> >> pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse
> >> error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
> >> 
> >> A fair number of errors are of the form: the build
> >> installs a previously built package for use in the
> >> builder, but later the builder cannot find some file
> >> from the package's installation.
> >> 
> >> Another error reported was:
> >> 
> >> ld: error: /usr/local/lib/libblkid.a: unknown file type
> >> 
> >> For reference:
> >> 
> >> [main-CA72-bulk_a-default] [2023-04-12_20h45m32s] [committing:] Queued: 476 Built: 2… Tobuild: 0 Time: 00:37:52
> >> 
> >> I started another build that tried to build 224 packages:
> >> the 11 that failed and the 213 that were skipped.
> >> 
> >> Just 1 package built that failed before:
> >> 
> >> [00:04:58] [09] [00:04:15] Finished databases/sqlite3@default | sqlite3-3.41.0_1,1: Success
> >> 
> >> It seems to be the only one where the original failure was not
> >> an example of complaining about the missing/corrupted content
> >> of a package install used for building. So it is an example
> >> of randomly varying behavior.
> >> 
> >> That, in turn, allowed:
> >> 
> >> [00:04:58] [01] [00:00:00] Building security/nss | nss-3.89
> >> 
> >> to build, but everything else failed or was skipped.
> >> 
> >> The sqlite3 vs. other failure difference suggests that writes
> >> have random problems but later reads reliably see the problem
> >> that resulted (before the content is deleted).
> >> 
> >> 
> >> After the above:
> >> 
> >> # zpool status
> >>   pool: zroot
> >>  state: ONLINE
> >> config:
> >> 
> >> 	NAME        STATE     READ WRITE CKSUM
> >> 	zroot       ONLINE       0     0     0
> >> 	  da0p8     ONLINE       0     0     0
> >> 
> >> errors: No known data errors
> >> 
> >> # zpool scrub zroot
> >> # zpool status
> >>   pool: zroot
> >>  state: ONLINE
> >>   scan: scrub repaired 0B in 00:16:25 with 0 errors on Wed Apr 12 22:15:39 2023
> >> config:
> >> 
> >> 	NAME        STATE     READ WRITE CKSUM
> >> 	zroot       ONLINE       0     0     0
> >> 	  da0p8     ONLINE       0     0     0
> >> 
> >> errors: No known data errors
> >> 
> >> 
> >> ===
> >> Mark Millard
> >> marklmi at yahoo.com
> > 
> > 
> > Let's try this again. Claws-mail didn't include the list address in the
> > header. Trying to reply, again, using exmh instead.
> > 
> > 
> > Did your pools suffer the EXDEV problem? The EXDEV also corrupted files.
> 
> As I reported, this was a jump from before the import
> to as things are tonight (here). So: NO, unless the
> existing code as of tonight still has the EXDEV problem!
> 
> Prior to this experiment I'd not progressed any media
> beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
> 
> > I think, without sufficient investigation, we risk jumping to
> > conclusions. I've taken an extremely cautious approach, rolling back
> > snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
> > corruption was encountered.
> 
> Again: nothing between main-n261544-cee09bda03c8-dirty and
> main-n262122-2ef2c26f3f13-dirty was involved at any stage.
> 
> > 
> > I did not roll back any snapshots in my MH mail directory. Rolling back
> > snapshots of my MH maildir would result in loss of email. I have to
> > live with that corruption. Corrupted files in my outgoing sent email
> > directory remain:
> > 
> > slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> > 53
> > slippy$ 
> > 
> > There are 53 corrupted files in my note log of 9913 emails. Those files
> > will never be fixed. They were corrupted by the EXDEV bug. No new ZFS
> > code or ZFS patches can retroactively remove the corruption from those
> > files.
> > 
> > But my poudriere files, because the snapshots were rolled back, were
> > "repaired" by the rolled-back snapshots.
> > 
> > I'm not convinced that there is presently active corruption, since
> > the problem has been fixed. I am convinced that whatever corruption
> > was written at the time will remain forever, or until those files
> > are deleted or replaced -- just like my email files written to disk at
> > the time.
> 
> My test results and procedure just do not fit your conclusion
> that things are okay now if block_cloning is completely avoided.
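As a cross-check on counts like the `ugrep -cPa '\x00' ... | grep -c :1` pipeline above, here is a rough Python sketch of the same idea: count the files in a directory that contain at least one NUL byte, the corruption signature seen in both the mail files and the poudriere failures. All paths and file contents below are throwaway stand-ins, not the actual maildir.

```python
import os
import tempfile

def count_corrupted(maildir):
    """Count files that contain at least one NUL byte."""
    corrupted = 0
    for name in sorted(os.listdir(maildir)):
        path = os.path.join(maildir, name)
        with open(path, "rb") as f:
            if b"\x00" in f.read():
                corrupted += 1
    return corrupted

# Demo against throwaway files standing in for a real maildir
maildir = tempfile.mkdtemp()
with open(os.path.join(maildir, "1"), "wb") as f:
    f.write(b"Subject: ok\n\nbody\n")
with open(os.path.join(maildir, "2"), "wb") as f:
    f.write(b"Subject: damaged\n\n" + b"\x00" * 8)
print(count_corrupted(maildir))  # -> 1
```

Unlike the grep pipeline, this counts a file once no matter how many NUL runs it contains, which is what the "53 corrupted files" figure is after.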
Admitting I'm wrong: sending copies of my last reply to you back to
myself, again and again, three times, I've managed to reproduce the
corruption you are talking about. From my previous email to you:

header. Trying to reply
^^^^^^^^^

Here it is, nine additional bytes of garbage. In another instance
about 500 bytes were removed.

I can reproduce the corruption at will now. The EXDEV patch is
applied. Block_cloning is disabled.

> 
> ===
> Mark Millard
> marklmi at yahoo.com


-- 
Cheers,
Cy Schubert <cy.schub...@cschubert.com>
FreeBSD UNIX:  <c...@freebsd.org>   Web:  https://FreeBSD.org
NTP:           <c...@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0
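For pinpointing exactly where a resent copy diverges from the original (the duplicated nine-byte run shown above, or a removed ~500-byte run), a small sketch using Python's difflib can report each inserted or deleted byte run. The strings here are invented stand-ins for the actual mail text, chosen only to mimic the duplicated-prefix symptom.

```python
import difflib

def divergences(original, resent):
    """Return (op, bytes) for each run where resent differs from original."""
    sm = difflib.SequenceMatcher(None, original, resent, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":
            out.append((op, resent[j1:j2]))      # bytes added in the copy
        elif op == "delete":
            out.append((op, original[i1:i2]))    # bytes lost from the copy
        elif op == "replace":
            out.append((op, resent[j1:j2]))      # bytes rewritten in the copy
    return out

original = b"header. Trying to reply, again, using exmh instead."
mangled = b"header. Theader. Trying to reply, again, using exmh instead."
print(divergences(original, mangled))  # -> [('insert', b'header. T')]
```

Run against the real before/after files, this would show whether each corruption event is an insertion, a deletion, or an overwrite, which may help characterize the bug.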