Edit report at https://bugs.php.net/bug.php?id=54089&edit=1
ID: 54089 User updated by: nicolas dot grekas+php at gmail dot com Reported by: nicolas dot grekas+php at gmail dot com Summary: token_get_all with regards to __halt_compiler is not binary safe -Status: Closed +Status: Assigned Type: Bug Package: Unknown/Other Function Operating System: Any PHP Version: 5.3.5 Assigned To: iliaa Block user comment: N Private report: N New Comment: Excerpt from internals: Nikita Popov Fri, Sep 9, 2011 at 09:15 [...] The problem with the patch is, that there are some tokens after T_HALT_COMPILER that are of interest, namely the '(' ')' ';'. After the patch it is impossible to get those tokens, without either relexing the code after T_HALT_COMPILER (that way you get the binary data problem back, just with much more complex code) or writing a regular expression to match it (which is really hard, as there may be any token dropped by the PHP parser in there, i.e. whitespace, comments, PHP tags). [...] I would like this patch to be reverted on the 5.4 and trunk branches. I assume it's too late to revert it on 5.3, as it has been there for some time already. It is just counterproductive. (Alternatively one could fix token_get_all to return the (); tokens after __halt_compiler, too, but that would be hard, probably.) -- Ferenc Kovacs Fri, Sep 9, 2011 at 10:01 I think that it wouldn't be too hard. >From a quick glance on the code, we should introduce a new local variable, set that to true where we break now ( http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155 ) and don't break there but for the next ';'. another maybe less confusing solution would be to explicitly add '(', ')' and ';' to the result in the T_HALT_COMPILER condition before breking out of the loop. I will create a patch for this afternoon. or could there be other important tokens after the __halt_compiler() which should be present in the token_get_all() result? -- Nicolas Grekas Fri, Sep 9, 2011 at 10:46 > don't break there but for the next ';'. You can also just count the number of semantic token after T_HALT_COMPILER (ie excluding whitespace and comments) and once you hit 3, halt. > less confusing solution would be to explicitly add '(', ')' and ';' to the > result in the T_HALT_COMPILER condition before breking out of the loop. If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively following T_HALT_COMPILER, I think that's part of the syntax analyser's job, not tokenizer's. If you're ok with this argument, then just couting 3 tokens is really the most basic "syntax analysis" we have to do to fix the pb, don't you think? > could there be other important tokens after the __halt_compiler() > which should be present in the token_get_all() result? Maybe the binary data itself, as a big T_INLINE_HTML for example ? Also, if token_get_all you be made binary safe, that would be very cool ! (no more eating of \x00-\x1F inside regular code) :) -- Nikita Popov Fri, Sep 9, 2011 at 19:39 > You can also just count the number of semantic token after T_HALT_COMPILER > (ie excluding whitespace and comments) and once you hit 3, halt. > [...] > Maybe the binary data itself, as a big T_INLINE_HTML for example ? In favor of both proposals! Returning the next 3 tokens should be quite easy [1]. Returning the rest as an T_INLINE_HTML makes sense too, as extracting the data is probably what you want. Though I have no idea how to implement that ^^ [1]: https://github.com/nikic/php-src/commit/2d4cfa05947f04de447635ca5748b3b58defbfaf (Not tested, only guessing) -- Nikita Popov Tue, Sep 13, 2011 at 09:16 I just set up an PHP environment and wrote a proper patch (including test changes) to make it collect the next three tokens. It's a git patch and I'm not sure whether it's compatible with SVN patches. I would love it if this would go into 5.4 before beta. I didn't know how one could fetch the rest into T_INLINE_HTML, so I'm hoping on help here from someone who actually knowns C there :) Previous Comments: ------------------------------------------------------------------------ [2011-03-17 23:51:25] nicolas dot grekas+php at gmail dot com Latest 5.3.6 has been released with the patch... So now token_get_all() stops on T_HALT_COMPILER for ever :) ------------------------------------------------------------------------ [2011-03-10 08:26:30] nicolas dot grekas+php at gmail dot com Really, the actual patch is a step backward, I can't do things that were easy before (getting the halt_compiler_offset with token_get_all)... Please consider reverting it! ------------------------------------------------------------------------ [2011-03-03 15:44:33] nicolas dot grekas+php at gmail dot com Sorry to reopen. As 5.3.6 is in RC, I just want to be sure my previous comment has been read. What about reverting the patch ? ------------------------------------------------------------------------ [2011-03-01 10:15:47] nicolas dot grekas+php at gmail dot com Thanks for the patch. After reading it, I'm not sure it really helps, considering that the stop on T_HALT_COMPILER was already easily feasible in plain PHP. In fact, it may be worse, because now if I want to access data after T_HALT_COMPILER in PHP using tokenizer, I have to write more code, as the data is missing from the token array. As a corner case also, __halt_compiler is always followed by 3 valid tokens: "(", ")" then ";" or T_CLOSE_TAG, with any number of T_WHITESPACE/T_COMMENT/T_DOC_COMMENT between. My view is that this "bug" can be fixed by introducing a new T_UNEXPECTED_CHARACTER token type, matching those "Unexpected character in input" warnings: this would fix token_get_all binary unsafeness. Is it a good idea? I don't know if it's difficult to implement, nor if it would introduce any BC break, so maybe a "Won't fix" on this bug is enough? Could the patch be reverted? I'm afraid it's the best for tokenizer users... Here is what I was using before the patch to work around this binary incompatibility: <?php // New token matching an "Unexpected character in input" define('T_UNEXPECTED_CHARACTER', -1); $src_tokens = @token_get_all($code); $bin_tokens = array(); $offset = 0; $i = -1; while (isset($src_tokens[++$i])) { $t = isset($src_tokens[$i][1]) ? $src_tokens[$i][1] : $src_tokens[$i]; while ($t[0] !== $code[$offset]) $bin_tokens[] = array(T_UNEXPECTED_CHARACTER, $code[$offset++]); $offset += strlen($t); $bin_tokens[] = $src_tokens[$i]; unset($src_tokens[$i]); } // Here, $bin_tokens contains binary safe tokens ?> ------------------------------------------------------------------------ [2011-02-28 16:18:35] il...@php.net This bug has been fixed in SVN. Snapshots of the sources are packaged every three hours; this change will be in the next snapshot. You can grab the snapshot at http://snaps.php.net/. Thank you for the report, and for helping us make PHP better. ------------------------------------------------------------------------ The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at https://bugs.php.net/bug.php?id=54089 -- Edit this bug report at https://bugs.php.net/bug.php?id=54089&edit=1