On Wed, Feb 04, 2026 at 04:47:03PM +0100, Vincent Lefevre wrote:
> In the upstream bug, Eric Blake said: "Several distros have add-on
> patches that add wide char support, but to date, no one has yet
> submitted a patch upstream that is both easy to maintain (doesn't
> needlessly duplicate big blocks of code over char vs. wchar_t) and
> which doesn't penalize speed on single-byte locales."

FTR, in voreutils cut
(0BSD: <http://ro.ws.co.ls/cut.1>,
       <https://git.sr.ht/~nabijaczleweli/voreutils/tree/trunk/item/cmd/cut.cpp>),
this is implemented with the -d argument being a byte span
("field_sep"), so delimiter search reduces to memmem()/memchr()
("l.find(*field_sep)"), which means
  -d: -dя -d$'\377' -dupa
are all equivalent; this seemed like an obvious generalisation to me,
so cut(1), STANDARDS, just notes that
> Allowing -d longer than one character is an extension, compatible
> with the illumos gate ‒ some nonconformant implementations only allow
> a single byte (the GNU system) or only use the first byte of the
> delim (NetBSD, OpenBSD). Using NUL for an empty delim is likewise
> an extension, compatible with the illumos gate, the GNU system,
> NetBSD, and OpenBSD.

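To make the byte-span model concrete, here is a minimal sketch of field
selection by plain byte search. This is not the voreutils source:
nth_field() is a hypothetical helper written for illustration, and it
leaves out field lists, -s, and cut's pass-through of lines that
contain no delimiter. The point is only that the delimiter is never
decoded, so one byte, one multibyte character, and several characters
all take the same path.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not from voreutils): return the 1-based field
// `field` of `line`, splitting on the byte span `delim`.
// Yields std::nullopt when the line has fewer than `field` fields.
std::optional<std::string_view> nth_field(std::string_view line,
                                          std::string_view delim,
                                          std::size_t field) {
    assert(field >= 1 && !delim.empty());
    for (;;) {
        // Plain byte search; the input is never parsed as characters.
        std::size_t pos = line.find(delim);
        if (field == 1)
            return line.substr(0, pos);  // npos here means "rest of line"
        if (pos == std::string_view::npos)
            return std::nullopt;         // ran out of delimiters
        line.remove_prefix(pos + delim.size());
        --field;
    }
}
```

Called as nth_field("QWEaQWEabQWE", "ab", 2), this selects the same
field as the first command in the transcript that follows.
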
$ echo QWEaQWEabQWE | cut -d'ab' -f2
QWE
$ echo QWEaQWEabQWE | /bin/cut -d'ab' -f2
/bin/cut: the delimiter must be a single character
Try '/bin/cut --help' for more information.

I believe you get the same result as the first command on the illumos
gate (I tested this on tribblix, if memory serves).

Parsing the input as characters only happens in -nb and -c modes,
and only for mbrlen(), which is the minimum required.
So duplication is not necessary.

Of course, one can conceive of an encoding where you could encode я
into bytes two different ways, and you'd want cut -dя to match both.
Whether that is real, whether you consider that to be real, and
whether that would be a useful behaviour vs byte span matching will
inform whether that implementation model is viable for coreutils.

Best,

