On 04/05/2025 at 15:27, Marc Haber wrote:
It looks like the \p{L} and other Unicode character classes don't match
anything if libperl is not installed.
According to my tests, they match at least ASCII letters, digits,
regular ASCII space and non-breakable space.
So we just extend the regexp to
match explicitly what would be in ISO-8859-x, yielding the kind of
uncomfortable
commentre => qr/[-"_\.+!\$%&()\]\[;\/'’ A-Za-z0-9\x{a1}-\x{ac}\x{ae}-\x{ff}\p{L}\p{Nd}\p{Zs}]*/,
So this allows the safe special characters below 0x40, a regular space,
the latin letters in both cases, digits, the high order characters that
are different in any ISO-8859 charset (explicitly excluding the non-
breaking space and soft hyphen), followed by the Unicode Letters,
Unicode Digits and Unicode Whitespace.
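For illustration, the class can be assembled from the parts described
above and tried on a few strings. This is only a sketch, not adduser
code: it assumes a full perl, input that is already decoded to
characters (hence "use utf8"), and \A...\z anchoring around the class.
The names $class and comment_ok are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                   # literals below are characters
binmode STDOUT, ':encoding(UTF-8)';

# Pieces of the character class, in the order described in the mail.
my $class = join '',
    q{-"_\.+!\$%&()\]\[;\/'’},              # safe punctuation, '-' kept first
    q{ A-Za-z0-9},                          # space, ASCII letters, digits
    q{\x{a1}-\x{ac}\x{ae}-\x{ff}},          # 0xa1-0xff without 0xad (soft hyphen)
    q{\p{L}\p{Nd}\p{Zs}};                   # Unicode letters, digits, spaces

# Assumption: the whole field must consist of allowed characters.
my $commentre = qr/\A[$class]*\z/;

# Hypothetical helper: 1 if the string is accepted, 0 otherwise.
sub comment_ok {
    my ($s) = @_;
    return $s =~ $commentre ? 1 : 0;
}

my @samples = (
    [ 'accented name' => 'Émile Œuvré-Müller' ],
    [ 'colon'         => 'a:b' ],
    [ 'soft hyphen'   => "a\x{ad}b" ],
);
for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-14s %s\n", $label, comment_ok($string) ? 'accepted' : 'rejected';
}
```

On a full perl I would expect the first sample to be accepted and the
other two rejected, since neither ':' nor the soft hyphen is in any of
the listed parts.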
My test results with àœæßéÀÔùñ:
* with libperl5.40 and perl & perl-modules-5.40
  * with LANG=fr_FR.UTF-8 or C.UTF-8
      \p{L}\p{Nd}\p{Zs}: OK
      \x{a1}-\x{ac}\x{ae}-\x{ff}: OK except œŒ
  * with LANG=C
      \p{L}\p{Nd}\p{Zs}: non-ASCII KO
      \x{a1}-\x{ac}\x{ae}-\x{ff}: non-ASCII KO
  Note: with LANG=C and either the original or new regexes, adduser
  indefinitely hangs with high CPU load if the gecos field contains more
  than 5 non-ASCII characters. It does not happen without libperl5.40.
  This currently affects the installer.
* without libperl5.40 and perl, with or without perl-modules-5.40
  * LANG=fr_FR.UTF-8 or C.UTF-8 or C
      \p{L}\p{Nd}\p{Zs}: non-ASCII KO except à
      \x{a1}-\x{ac}\x{ae}-\x{ff}: àœæÆß and uppercase accented letters KO
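A per-character check along these lines shows which letters each class
misses. It is a rough sketch, not the procedure used for the results
above: it only covers the case where the string is already decoded to
characters on a full perl, and it does not model how adduser reads its
input under the different LANG settings. The names %classes and
unmatched are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                   # test string is decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# The two sub-classes compared in the results, each matching one character.
my %classes = (
    'unicode properties' => qr/\A[\p{L}\p{Nd}\p{Zs}]\z/,
    'explicit range'     => qr/\A[\x{a1}-\x{ac}\x{ae}-\x{ff}]\z/,
);

# Hypothetical helper: return the characters of $string not matched by $re.
sub unmatched {
    my ($re, $string) = @_;
    return join '', grep { $_ !~ $re } split //, $string;
}

for my $name (sort keys %classes) {
    my $ko = unmatched($classes{$name}, 'àœæßéÀÔùñ');
    printf "%-20s %s\n", $name, $ko eq '' ? 'OK' : "KO for $ko";
}
```

Under those assumptions the explicit range should only miss œ, because
U+0153 lies above 0xff, which is consistent with the "OK except œŒ"
line for the full perl case.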
So, on a system without full perl (and probably with a non-UTF-8
locale), this will match most languages that have an ISO-8859 charset.
In a full system, we have full Unicode support.
d-i always installs C.UTF-8, so there is at least one UTF-8 locale.
Would this help the installer?
It looks like a step forward, but when libperl is not installed the new
regex still fails to match some letters, as well as uppercase accented
letters.