On 1/29/20 7:34 AM, Bruno Haible wrote:
> I would say that it's not worth the effort - except for the person(s)
> who care a lot about Vax/VMS.
Normally I'd agree, but if Arnold cares about VAX/VMS and if we want
Gnulib dfa.c to match Gawk dfa.c, then in this particular case it makes
some sense to support 32-bit-only platforms, as it's easy to revert the
recent patch that made dfa.c assume 64-bit. So I installed the attached.
However, I see some other points of departure for Gawk dfa.c:
* Gawk dfa.c/dfa.h does not use flexible array members or the
portable-to-7th-edition-Unix substitute provided by Gnulib, so I suggest
that Gawk import Gnulib lib/flexmember.h, and either "#define
FLEXIBLE_ARRAY_MEMBER 1" in config.h or (better) import Gnulib
m4/flexmember.m4.
* Gawk dfa.c doesn't use isblank, but instead defines its own is_blank
that is hard-coded to the C locale. Isn't [[:blank:]] supposed to be
locale-dependent? Or are you assuming that space and tab are the only
blank characters in all single-byte locales?
* Gawk dfa.c includes mbsupport.h if __DJGPP__ is defined. I suggest
moving this to Gawk config.h so that dfa.c need not worry about it.
* Gawk dfa.c replaces "#include <stdint.h>" with:
   #ifndef VMS
   #include <stdint.h>
   #else
   #define SIZE_MAX __INT32_MAX
   #define PTRDIFF_MAX __INT32_MAX
   #endif
I suppose we could add something like this to Gnulib dfa.c but it's a
bit ugly; is there a cleaner way to do it? Perhaps Gawk could supply its
own little substitute stdint.h on VMS. (Gnulib does this too but I
assume Gnulib's stdint.h is too heavyweight for Gawk.)
From 335bfddb5ea0e6138a026ae723ea1e0ee2a2cd90 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Wed, 29 Jan 2020 10:58:26 -0800
Subject: [PATCH] dfa: do not assume 64-bit int
Problem reported for VAX/VMS C (!) by Arnold Robbins in:
https://lists.gnu.org/r/bug-gnulib/2020-01/msg00173.html
* lib/dfa.c (CHARCLASS_PAIR): Bring back this macro.
(CHARCLASS_WORD_BITS, charclass_word) [!UINT_LEAST64_MAX]:
Fall back to 32-bit words.
(CHARCLASS_INIT): Go back to having 8 32-bit args instead
of 4 64-bit args. All uses changed.
---
ChangeLog | 11 +++++++++++
lib/dfa.c | 40 +++++++++++++++++++++++++++++-----------
2 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index a861f4996..2e64116c1 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2020-01-29 Paul Eggert <egg...@cs.ucla.edu>
+
+ dfa: do not assume 64-bit int
+ Problem reported for VAX/VMS C (!) by Arnold Robbins in:
+ https://lists.gnu.org/r/bug-gnulib/2020-01/msg00173.html
+ * lib/dfa.c (CHARCLASS_PAIR): Bring back this macro.
+ (CHARCLASS_WORD_BITS, charclass_word) [!UINT_LEAST64_MAX]:
+ Fall back to 32-bit words.
+ (CHARCLASS_INIT): Go back to having 8 32-bit args instead
+ of 4 64-bit args. All uses changed.
+
2020-01-27 Paul Eggert <egg...@cs.ucla.edu>
regex: remove limits-h dependency
diff --git a/lib/dfa.c b/lib/dfa.c
index 96ae560b1..4e9478394 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -84,6 +84,8 @@ isasciidigit (char c)
/* First integer value that is greater than any character code. */
enum { NOTCHAR = 1 << CHAR_BIT };
+#ifdef UINT_LEAST64_MAX
+
/* Number of bits used in a charclass word. */
enum { CHARCLASS_WORD_BITS = 64 };
@@ -91,8 +93,24 @@ enum { CHARCLASS_WORD_BITS = 64 };
at least CHARCLASS_WORD_BITS wide. Any excess bits are zero. */
typedef uint_least64_t charclass_word;
-/* An initializer for a charclass whose 64-bit words are A through D. */
-#define CHARCLASS_INIT(a, b, c, d) {{a, b, c, d}}
+/* Part of a charclass initializer that represents 64 bits' worth of a
+ charclass, where LO and HI are the low and high-order 32 bits of
+ the 64-bit quantity. */
+# define CHARCLASS_PAIR(lo, hi) (((charclass_word) (hi) << 32) + (lo))
+
+#else
+/* Fallbacks for pre-C99 hosts that lack 64-bit integers. */
+enum { CHARCLASS_WORD_BITS = 32 };
+typedef unsigned long charclass_word;
+# define CHARCLASS_PAIR(lo, hi) lo, hi
+#endif
+
+/* An initializer for a charclass whose 32-bit words are A through H. */
+#define CHARCLASS_INIT(a, b, c, d, e, f, g, h) \
+ {{ \
+ CHARCLASS_PAIR (a, b), CHARCLASS_PAIR (c, d), \
+ CHARCLASS_PAIR (e, f), CHARCLASS_PAIR (g, h) \
+ }}
/* The maximum useful value of a charclass_word; all used bits are 1. */
static charclass_word const CHARCLASS_WORD_MASK
@@ -1699,39 +1717,39 @@ add_utf8_anychar (struct dfa *dfa)
static charclass const utf8_classes[] = {
/* A. 00-7f: 1-byte sequence. */
- CHARCLASS_INIT (0xffffffffffffffff, 0xffffffffffffffff, 0, 0),
+ CHARCLASS_INIT (0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0, 0, 0, 0),
/* B. c2-df: 1st byte of a 2-byte sequence. */
- CHARCLASS_INIT (0, 0, 0, 0x00000000fffffffc),
+ CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0xfffffffc, 0),
/* C. 80-bf: non-leading bytes. */
- CHARCLASS_INIT (0, 0, 0xffffffffffffffff, 0),
+ CHARCLASS_INIT (0, 0, 0, 0, 0xffffffff, 0xffffffff, 0, 0),
/* D. e0 (just a token). */
/* E. a0-bf: 2nd byte of a "DEC" sequence. */
- CHARCLASS_INIT (0, 0, 0xffffffff00000000, 0),
+ CHARCLASS_INIT (0, 0, 0, 0, 0, 0xffffffff, 0, 0),
/* F. e1-ec + ee-ef: 1st byte of an "FCC" sequence. */
- CHARCLASS_INIT (0, 0, 0, 0x0000dffe00000000),
+ CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xdffe),
/* G. ed (just a token). */
/* H. 80-9f: 2nd byte of a "GHC" sequence. */
- CHARCLASS_INIT (0, 0, 0x000000000000ffff, 0),
+ CHARCLASS_INIT (0, 0, 0, 0, 0xffff, 0, 0, 0),
/* I. f0 (just a token). */
/* J. 90-bf: 2nd byte of an "IJCC" sequence. */
- CHARCLASS_INIT (0, 0, 0xffffffffffff0000, 0),
+ CHARCLASS_INIT (0, 0, 0, 0, 0xffff0000, 0xffffffff, 0, 0),
/* K. f1-f3: 1st byte of a "KCCC" sequence. */
- CHARCLASS_INIT (0, 0, 0, 0x000e000000000000),
+ CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xe0000),
/* L. f4 (just a token). */
/* M. 80-8f: 2nd byte of a "LMCC" sequence. */
- CHARCLASS_INIT (0, 0, 0x00000000000000ff, 0),
+ CHARCLASS_INIT (0, 0, 0, 0, 0xff, 0, 0, 0),
};
/* Define the character classes that are needed below. */
--
2.24.1