On Wednesday 25 March 2026 07:15:53 (+01:00), LamentXU wrote:
> I think there are sound opinions in both side so I will still let the
> vote begin and see what the majority thinks. To be short,
>
>
> Reasons for supporting
> - semantically NUL is not whitespaces
> - the majority of other popular languages don't trim NUL
> Reasons for not supporting
> - Java do trim NUL
> - Security issues in existing code base
> - Already has mb_trim() and the second parameter instead to prevent
> trimming NUL if people want
> - Unnecessary changes in the life-cycle
>
>
> This is a quite minor change (and thats why people don't talk about this
> before, since little people run into the case of trimming NUL).

This change is not minor and, most of all, removing NUL from the default
cutset of PHP's trim() is a security issue.
In your first RFC you concluded that trim() is about trimming whitespace
and that, since isspace('\f') returns true, form feed is whitespace and
should be added to the default cutset (the value of the optional second
parameter of trim()).
You underlined that with a comparison across different programming
languages to support the impression that trim() is about whitespace;
especially for casual use, this is a matter of usability and of the
function behaving as expected.
While it is technically correct that isspace('\f') returns true, that \f is
commonly understood to be in the space character class, and that other
scripting languages like Python include it in their default trimmers, this
does not change what trim() in PHP actually is, whatever we may want it to
be. Most importantly, it does not automatically make such a change small,
straightforward or safe. It may appear that way, but unfortunately that
view lacks precision.
What remains correct IMHO is the case of casually using trim() as a
whitespace trimmer. Used that way, the trim() function in PHP requires some
extra work: looking up the default value of the second parameter to find
out whether it fits the use, or whether the second parameter needs to be
provided with a value of its own.
As the trim() function has two invocations, the user has to pick the right
one for the job. That may be perceived as extra work by those who are not
aware that a function can have multiple invocations, e.g. new users or
users new to programming. This _is_ a real point.
You have suggested that if the default value is composed entirely of
characters of the C space character class, then the function is easier to
use as a whitespace trimmer. Under this premise (whitespace trimmer), I
think this remains correct.
Now for the parts, if you allow me, where this falls apart:
The first misconception, as I understand it, is the classification of the
trim() function as a whitespace trimmer. This is wrong; the correct
classification of the trim() function in PHP is a string trimmer. This
distinction is important because the trim() function is binary safe and
strings in PHP are arrays of bytes.
If we look more closely, we can see that the default value, both in stable
and unstable (master) PHP, is composed of *both* space and control
characters. When we use the isspace() function to classify the characters
in the default set, we get a high number: 5 out of 6 (stable) or 6 out of 7
(unstable).
However, we can't pick just one character classifier function. If we use
the same technique with iscntrl() for the counter-check, we get equally
high, in fact exactly the same, numbers: 5 out of 6 (stable) or 6 out of 7
(unstable) characters in the default cutset are control characters.
This confirms that while trim() without the second parameter can be used as
a whitespace trimmer, it is *equally* usable as a control character
trimmer. Hence the classification as a whitespace trimmer remains correct,
but limited: it is not exclusively a whitespace trimmer.
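
The two counts can be reproduced with a small sketch. It is written in
Python rather than C, so is_space() and is_cntrl() are hand-written
stand-ins for isspace() and iscntrl() in the C locale, and the two cutsets
are assumed from the description in this mail (stable: space, \n, \r, \t,
\v, NUL; master: the same plus \f), not read from php-src.

```python
# Stand-ins for C's isspace()/iscntrl() in the "C" locale (ASCII only).
def is_space(byte: int) -> bool:
    return byte in b" \t\n\v\f\r"


def is_cntrl(byte: int) -> bool:
    return byte < 0x20 or byte == 0x7F


# Default cutsets as described in this thread (assumed, not read from php-src).
STABLE = b" \n\r\t\v\x00"
MASTER = STABLE + b"\f"


def classify(cutset: bytes) -> tuple[int, int, int]:
    """Return (space count, control count, total) for a cutset."""
    spaces = sum(1 for b in cutset if is_space(b))
    controls = sum(1 for b in cutset if is_cntrl(b))
    return spaces, controls, len(cutset)


for name, cutset in (("stable", STABLE), ("master", MASTER)):
    spaces, controls, total = classify(cutset)
    print(f"{name}: isspace {spaces}/{total}, iscntrl {controls}/{total}")
```

The only character in either set that is a space but not a control
character is the space itself, and the only control character that is not a
space is NUL.
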
The earlier finding that, for a space trimming function, \f was missing and
NUL was superfluous in stable could also have been resolved by correcting
the understanding: trim() is not an exclusive whitespace trimming function
at all. An analysis of the character classes done with due diligence would
have shown that. It was not done, or those who did it have not shared the
outline of their solution here on the list (unless my mail client has eaten
some of the messages again).
The second misconception, so far in both RFCs, lies in the comparison with
other programming languages. While this suffices as a first exploratory
test for comparison purposes, it was not done thoroughly either. It looked
exclusively at default values, without taking into account whether
different values can be provided, whether the function itself is an
exclusive whitespace trimmer or an ordinary string trimmer, and whether
binary safety applies to the function or not.
If we take Python as an example with its strip family of functions, you
have correctly analyzed that the default value is composed entirely of the
characters of the space character class in the C default locale. However,
it is only the default value. Nothing prevents you from taking the default
value and adding the NUL character to it to have NUL in the cutset as well.
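
As a minimal sketch of those two invocations in Python, using bytes so that
the default really is ASCII whitespace only (the sample input is made up
for illustration):

```python
import string

# ASCII whitespace as Python defines it: b" \t\n\r\x0b\x0c"
ASCII_WHITESPACE = string.whitespace.encode("ascii")

data = b"\x00 \tpayload\n\x00"

# First invocation: default cutset. NUL is not whitespace and sits at both
# ends of the input, so nothing is cut at all.
default_result = data.strip()

# Second invocation: the default characters plus NUL, passed explicitly.
extended_result = data.strip(ASCII_WHITESPACE + b"\x00")

print(default_result)
print(extended_result)
```

The first call returns the input unchanged, the second returns b"payload".
The default is a convenience; the function itself trims whatever bytes it
is given.
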
That Python and PHP have a default value for what might be perceived as the
same family of functions, despite the different names, could also have led
to the conclusion that different programming languages use a) different
names, b) different defaults and c) different implementations, resulting in
d) overall different behaviour. This is why a programming language
documents its standard library functions, so that users can pick the right
function invocation for the job. This is normally taken as a given.
However, as the argument is and was to change a default value, not
understanding how the function fundamentally works (different invocations)
and which checks are required (not optional) when changing the language
itself, e.g. by changing a default value, is a shortcoming in both RFC
texts.
Now Python is not the only other programming language, just the one I used
here to illustrate the problem of arguing with defaults: we have already
shown that the function (the object under discussion) is prone to
misclassification during the discussion, and now the invocations the
function has are being misclassified as well.
Obviously it is easy to fall for that. This is certainly the reason why
programming languages try to have as little ambiguity as possible in their
standard library functions, so that everything can stay as it is or, in
case a correction is needed, be resolved with clarity.
I'd like to illustrate that with another programming language, Go:
The string trimmer (strings.Trim) and the space trimmer (strings.TrimSpace)
are two different functions. This is a good resolution of the problem you
brought up, because now we can reason with clarity about whether the one or
the other has a bug. What we can immediately see in the Go standard library
is that the optionality of the second parameter is gone: the string trimmer
requires the cutset to be passed next to the string, while the whitespace
trimmer has one argument only.
The ambiguity of the PHP trim() function, being only a string trimmer (like
in Python) whose second argument is optional only because it came later
(this was a design decision: the function has two different invocations, of
which the second came later, and this is important to understand),
disappears completely with two functions whose parameters are all
mandatory, as in Go.
When we do the cross-language comparison and, despite the limitations such
comparisons always have, work actively with those limitations, we can also
resolve the request to ease the casual use of the trim() function as a
whitespace trimming function by finding out that a function is missing in
PHP and should be added:
trim_space()
However, this has not been mentioned so far in the discussion. IMHO this is
a shortcoming of the discussions, especially if any of those who voted yes
on the earlier RFC actually bought into one of the two key arguments:
whitespace trimming -or- language comparison.
With all that found out, let's explain in this new light the security issue
we face with the proposal to remove NUL.
As illustrated, while it is not entirely wrong that trim() is a space
trimming function, it is equally technically correct that trim() is a
control character trimming function.
While so far the argument has been used to add a control character to the
default cutset (form feed, at code point 12, a C0 control character), the
nature of that change suggested that there was practically not much need to
discuss it. Hence the understanding across the whole group is likely very
different, without that having caused enough disturbance to endanger the
vote. We can also see that in the vote of 100% yes by 25 individuals.
Removing a character from the default cutset has, by its nature, far more
severe consequences. As the trim() function is undoubtedly a string trimmer
(taking it as a whitespace trimmer is, as we have shown, a mistake or an
imprecision at best), the use of PHP's trim() is heavily undermined, if not
sabotaged: leading/trailing NUL characters remain within the string, while
code using trim() in good faith relies on them being removed.
This is not just an annoyance, like removing more control characters than
previously expected due to adding form feed to the default cutset, which
already violated the rule of preserving the historical observable behaviour
of the function and, in the context of a programming language, undermined
the good faith placed in that language. It has severe and dangerous
consequences, caused only by misunderstanding the nature of a function due
to the incomplete comparison during the character class analysis and the
very incomplete language comparison that has been done so far.
While it is not too late to prevent bringing this RFC to a vote or, if it
is brought to a vote, to reject it, this should also show a problem with
the earlier change that is currently in unstable PHP:
Users of the programming language still face an issue. While it is not as
grave as a security issue like NUL byte injection, when switching to master
they remain unable to find out about the changes to the default cutset if
they intentionally use it. Obviously they use the PHP language's default
cutset; however, that has changed, and the language is silent about it.
This is highly unexpected: because the second parameter is optional, a
useful other default would be provided as a string, and not by leaving the
parameter out.
I therefore suggest to at least provide a new global string constant with
the value of the original default characters, so that when preparing
scripts for the unstable and later the next release version of PHP, a more
or less simple search-and-replace operation can replace uses of the trim
family of functions in the first invocation with the second invocation
using this new constant.
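
A minimal sketch of that migration, modelled in Python rather than PHP. The
constant name TRIM_DEFAULT_LEGACY is hypothetical, php_trim() is only a
rough model of trim() on byte strings (the ".." range syntax is not
handled), and PROPOSED_DEFAULT assumes the cutset that results from adding
\f and removing NUL as discussed in this thread:

```python
# Hypothetical constant holding the original (stable) default cutset.
TRIM_DEFAULT_LEGACY = b" \n\r\t\v\x00"

# Default cutset with \f added and NUL removed (assumed from this thread).
PROPOSED_DEFAULT = b" \n\r\t\v\f"


def php_trim(data: bytes, characters: bytes = PROPOSED_DEFAULT) -> bytes:
    """Rough model of PHP's trim(); the ".." range syntax is not handled."""
    return data.strip(characters)


user_input = b"\x00admin\x00"

# First invocation: the result silently depends on the language default.
implicit = php_trim(user_input)

# Second invocation with the constant: behaviour is pinned to the old cutset.
explicit = php_trim(user_input, TRIM_DEFAULT_LEGACY)

print(implicit)
print(explicit)
```

Under these assumptions the first call leaves both NUL bytes in place,
while the second still returns b"admin", whatever the default happens to be
in the running version.
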
Additionally, I'd suggest that this new global constant is backported to
PHP 8.5 so that current stable code can be immunized against the change
that has been voted for and which, at the time of writing, we must assume
will come in the next PHP.
Because of the issues you as a new contributor have raised, specifically in
regard to the casual use of the trim family of functions, I'd suggest
introducing the "trim_space()" function with a single string argument: a
dedicated whitespace trimmer capable of cutting UTF-8 encoded whitespace
characters, so that, with the year of the horse, we can trim whitespace
universally and not only limited to the C locale.
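
To make the idea concrete, here is a minimal sketch of what such a function
could do, again modelled in Python. It assumes that "whitespace" means the
code points carrying the Unicode White_Space property and that the input is
valid UTF-8; both are my assumptions for illustration, not a worked-out
proposal:

```python
# Code points with the Unicode White_Space property.
UNICODE_WHITE_SPACE = (
    "\u0009\u000a\u000b\u000c\u000d\u0020\u0085\u00a0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)


def trim_space(data: bytes) -> bytes:
    """Hypothetical trim_space(): one argument, no configurable cutset."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return text.strip(UNICODE_WHITE_SPACE).encode("utf-8")


print(trim_space("\u3000 hello\u00a0".encode("utf-8")))
print(trim_space(b"\x00 hi \x00"))
```

The first call returns b"hello", the second returns its input unchanged,
because NUL is not whitespace and is left alone. How invalid UTF-8 should
be handled is left open here.
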
My 2 cents,
-- hakre