On 2023-07-17 09:53, Bruno Haible wrote:

On NetBSD, I apparently did not locate the right source code of the mbsinit
function, due to the complexity of the citrus code. And did not want to debug
it, because debugging in libc code without debugging information is often
a waste of time.

I looked into it. I think it's due to the funky business with struct _RuneStatePriv. There's a pointer's worth of data before the data that the converters see, and until recently there was another character that ended up perhaps needing another pointer's worth of padding. Two pointers (8 bytes each), plus the 12 bytes we already saw, would explain the 28 bytes you observed.

I fixed mbcel.h for that by installing the attached patch into diffutils. Thanks for pointing it out.


This macro is not used anywhere. How about adding a comment explaining
why it's defined but not used? Or if it's not needed we can remove it.

It's needed, namely as lower bound for _GL_MBSTATE_ZERO_SIZE:
0 < _GL_MBSTATE_INIT_SIZE <= _GL_MBSTATE_ZERO_SIZE <= sizeof (mbstate_t).

That's merely documentation, right? That is, only comments use the lower bound.

And I won't write
   sizeof (((mbstate_t) {0}).__count)
outside of comments, since the less identifiers with double-underscore
we use, the better.

On the other hand, the sizeof is more likely to notify builders if glibc changes internals in an incompatible way. (No big deal either way of course.)

One thing I noticed on NetBSD 9.3 x86-64 (it has GCC 7), is that it didn't optimize memset calls away. So in memcel.h I stuck with initializing by hand rather than using memset.
From c640bec4f3df010b47dae930ea13d997f0026f9b Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Tue, 18 Jul 2023 18:35:12 -0700
Subject: [PATCH] diff: fix mbcel bug on NetBSD

* lib/mbcel.h (mbcel_t):
Fix bug on NetBSD as I read its code incorrectly earlier.
Problem reported by Bruno Haible in:
https://lists.gnu.org/r/bug-gnulib/2023-07/msg00085.html
Mostly for documentation, use _GL_ATTRIBUTE_MAY_ALIAS to remind
compiler not to rely on strict C semantics for unions.
---
 lib/mbcel.h | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/lib/mbcel.h b/lib/mbcel.h
index 031cb49..69f4d5a 100644
--- a/lib/mbcel.h
+++ b/lib/mbcel.h
@@ -46,7 +46,8 @@
 #ifndef _MBCEL_H
 #define _MBCEL_H 1
 
-/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE.  */
+/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE,
+   _GL_ATTRIBUTE_MAY_ALIAS.  */
 #if !_GL_CONFIG_H_INCLUDED
  #error "Please include config.h first."
 #endif
@@ -100,18 +101,32 @@ mbcel_scan (char const *p, char const *lim)
   if (0 <= *p && *p <= 0x7f)
     return (mbcel_t) { .ch = *p, .len = 1 };
 
-  /* An initial mbstate_t; initialization optimized for some platforms.  */
+  /* An initial mbstate_t; initialization optimized for some platforms.
+     For details about these and other platforms, see wchar.in.h.  */
 #if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+  /* Although only a trivial optimization, it's worth it for GNU.  */
   mbstate_t mbs; mbs.__count = 0;
 #elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
        || (defined __APPLE__ && defined __MACH__))
-  /* Initialize for all encodings: UTF-8, EUC, etc.  */
-  union { mbstate_t m; struct { uchar_t ch; int utf8_want, euc_want; } s; } u;
+  /* These platforms have 128-byte mbstate_t.  What were they thinking?
+     Initialize just for supported encodings (UTF-8, EUC, etc.).
+     Avoid memset because some compilers generate function call code.  */
+  struct mbhidden { char32_t ch; int utf8_want, euc_want; }
+    _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
   u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
 # define mbs u.m
 #elif defined __NetBSD__
-  union { mbstate_t m; struct _RuneLocale *s; } u;
-  u.s = nullptr;
+  /* Like FreeBSD, but mbstate_t starts with a pointer and
+     (before joerg commit dated 2020-06-02 01:30:31 +0000)
+     up to another pointer's worth of padding.  */
+  struct mbhidden {
+    struct _RuneLocale *p[2];
+    char32_t ch; int utf8_want, euc_want;
+  } _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
+  u.s.p[0] = u.s.p[1] = nullptr;
+  u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
 # define mbs u.m
 #else
   /* mbstate_t has unknown structure or is not worth optimizing.  */
-- 
2.39.2

Reply via email to