On Mon, 13 Jul 2020 13:02:44 +0200, Jan Stary wrote:
> This is current/amd64.
>
> On UTF input, awk segfaults when using a multi-character RS:
>
> $ cat /tmp/in
> č
>
> $ hexdump -C /tmp/in
> 00000000 c4 8d 0a |...|
> 00000003
>
> $ cat /tmp/in | awk '{print$1}'
> č
>
> $ cat /tmp/in | awk -v RS=x '{print$1}'
> č
>
> $ cat /tmp/in | awk -v RS=xy '{print$1}'
> Segmentation fault (core dumped)
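
Nice catch. The actual bug is caused by using a plain char, which is
signed on amd64, as an index into an array. A byte with the high bit
set, such as the 0xc4 in your input, becomes a negative value and so a
negative index. Once debugged, the fix is simple.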
- todd
diff --git a/b.c b/b.c
index c167b50..f7fbc0e 100644
--- a/b.c
+++ b/b.c
@@ -684,7 +684,7 @@ bool fnematch(fa *pfa, FILE *f, char **pbuf, int *pbufsize, int quantum)
 				FATAL("stream '%.30s...' too long", buf);
 			buf[k++] = (c = getc(f)) != EOF ? c : 0;
 		}
-		c = buf[j];
+		c = (unsigned char)buf[j];
 		/* assert(c < NCHARS); */
 		if ((ns = pfa->gototab[s][c]) != 0)
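
For anyone who wants to see the effect in isolation, here is a minimal
sketch of the signed-char index problem described above. It is not code
from awk: the two helper names are hypothetical, and the NCHARS value
used here (one slot per possible byte value) is an assumption made for
the illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed table width for this sketch: one slot per byte value. */
#define NCHARS (UCHAR_MAX + 1)

/*
 * Hypothetical helper showing the buggy pattern: the byte is read
 * through plain char.  Where char is signed, a byte >= 0x80 comes
 * back negative and is unusable as an array index.
 */
static int
index_via_plain_char(const char *buf, size_t j)
{
	int c = buf[j];

	return c;
}

/*
 * Hypothetical helper showing the fixed pattern: the cast to
 * unsigned char guarantees a value in the range 0..UCHAR_MAX.
 */
static int
index_via_unsigned_char(const char *buf, size_t j)
{
	int c = (unsigned char)buf[j];

	assert(c >= 0 && c < NCHARS);
	return c;
}
```

With the bytes from the report (c4 8d 0a), the first helper returns -60
for the first byte on a platform where plain char is signed, while the
second returns 196 (0xc4) regardless of the signedness of char.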