Edit report at https://bugs.php.net/bug.php?id=54089&edit=1

 ID:                 54089
 User updated by:    nicolas dot grekas+php at gmail dot com
 Reported by:        nicolas dot grekas+php at gmail dot com
 Summary:            token_get_all with regards to __halt_compiler is not
                     binary safe
-Status:             Closed
+Status:             Assigned
 Type:               Bug
 Package:            Unknown/Other Function
 Operating System:   Any
 PHP Version:        5.3.5
 Assigned To:        iliaa
 Block user comment: N
 Private report:     N

 New Comment:

Excerpt from internals:

Nikita Popov    Fri, Sep 9, 2011 at 09:15

[...]
The problem with the patch is, that there are some tokens after
T_HALT_COMPILER that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code after T_HALT_COMPILER (that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).

[...]
I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)

--

Ferenc Kovacs   Fri, Sep 9, 2011 at 10:01

I think that it wouldn't be too hard.
>From a quick glance on the code, we should introduce a new local
variable, set that to true where we break now (
http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155
) and don't break there but for the next ';'. another maybe less
confusing solution would be to explicitly add '(', ')' and ';' to the
result in the T_HALT_COMPILER condition before breking out of the
loop.
I will create a patch for this afternoon.

or could there be other important tokens after the __halt_compiler()
which should be present in the token_get_all() result?

--

Nicolas Grekas  Fri, Sep 9, 2011 at 10:46

> don't break there but for the next ';'.

You can also just count the number of semantic token after T_HALT_COMPILER (ie 
excluding whitespace and comments) and once you hit 3, halt.

> less confusing solution would be to explicitly add '(', ')' and ';' to the
> result in the T_HALT_COMPILER condition before breking out of the loop.

If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively 
following T_HALT_COMPILER, I think that's part of the syntax analyser's job, 
not tokenizer's.
If you're ok with this argument, then just couting 3 tokens is really the most 
basic "syntax analysis" we have to do to fix the pb, don't you think?

> could there be other important tokens after the __halt_compiler()
> which should be present in the token_get_all() result?

Maybe the binary data itself, as a big T_INLINE_HTML for example ?

Also, if token_get_all you be made binary safe, that would be very cool ! (no 
more eating of \x00-\x1F inside regular code) :)

--

Nikita Popov    Fri, Sep 9, 2011 at 19:39

> You can also just count the number of semantic token after T_HALT_COMPILER
> (ie excluding whitespace and comments) and once you hit 3, halt.
> [...]
> Maybe the binary data itself, as a big T_INLINE_HTML for example ?
In favor of both proposals!
Returning the next 3 tokens should be quite easy [1].
Returning the rest as an T_INLINE_HTML makes sense too, as extracting
the data is probably what you want. Though I have no idea how to
implement that ^^

[1]: 
https://github.com/nikic/php-src/commit/2d4cfa05947f04de447635ca5748b3b58defbfaf
(Not tested, only guessing)

--

Nikita Popov    Tue, Sep 13, 2011 at 09:16

I just set up an PHP environment and wrote a proper patch (including
test changes) to make it collect the next three tokens. It's a git
patch and I'm not sure whether it's compatible with SVN patches. I
would love it if this would go into 5.4 before beta. I didn't know how
one could fetch the rest into T_INLINE_HTML, so I'm hoping on help
here from someone who actually knowns C there :)


Previous Comments:
------------------------------------------------------------------------
[2011-03-17 23:51:25] nicolas dot grekas+php at gmail dot com

Latest 5.3.6 has been released with the patch...
So now token_get_all() stops on T_HALT_COMPILER for ever :)

------------------------------------------------------------------------
[2011-03-10 08:26:30] nicolas dot grekas+php at gmail dot com

Really, the actual patch is a step backward, I can't do things that were easy 
before (getting the halt_compiler_offset with token_get_all)...
Please consider reverting it!

------------------------------------------------------------------------
[2011-03-03 15:44:33] nicolas dot grekas+php at gmail dot com

Sorry to reopen. As 5.3.6 is in RC, I just want to be sure my previous comment 
has been read. What about reverting the patch ?

------------------------------------------------------------------------
[2011-03-01 10:15:47] nicolas dot grekas+php at gmail dot com

Thanks for the patch. After reading it, I'm not sure it really helps, 
considering that the stop on T_HALT_COMPILER was already easily feasible in 
plain PHP. In fact, it may be worse, because now if I want to access data after 
T_HALT_COMPILER in PHP using tokenizer, I have to write more code, as the data 
is missing from the token array.

As a corner case also, __halt_compiler is always followed by 3 valid tokens: 
"(", ")" then ";" or T_CLOSE_TAG, with any number of 
T_WHITESPACE/T_COMMENT/T_DOC_COMMENT between.

My view is that this "bug" can be fixed by introducing a new 
T_UNEXPECTED_CHARACTER token type, matching those "Unexpected character in 
input" warnings: this would fix token_get_all binary unsafeness. Is it a good 
idea? I don't know if it's difficult to implement, nor if it would introduce 
any BC break, so maybe a "Won't fix" on this bug is enough?

Could the patch be reverted? I'm afraid it's the best for tokenizer users...
Here is what I was using before the patch to work around this binary 
incompatibility:

<?php

// New token matching an "Unexpected character in input"
define('T_UNEXPECTED_CHARACTER', -1); 

$src_tokens = @token_get_all($code);
$bin_tokens = array();
$offset =  0;
$i      = -1;

while (isset($src_tokens[++$i]))
{
        $t = isset($src_tokens[$i][1]) ? $src_tokens[$i][1] : $src_tokens[$i];

        while ($t[0] !== $code[$offset])
                $bin_tokens[] = array(T_UNEXPECTED_CHARACTER, $code[$offset++]);

        $offset += strlen($t);
        $bin_tokens[] = $src_tokens[$i];
        unset($src_tokens[$i]);
}

// Here, $bin_tokens contains binary safe tokens

?>

------------------------------------------------------------------------
[2011-02-28 16:18:35] il...@php.net

This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.



------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=54089


-- 
Edit this bug report at https://bugs.php.net/bug.php?id=54089&edit=1

Reply via email to