Edit report at http://bugs.php.net/bug.php?id=54028&edit=1

 ID:                 54028
 Comment by:         carsten_sttgt at gmx dot de
 Reported by:        schmale at froglogic dot com
 Summary:            Directory::read() cannot handle non-unicode chars
                     properly
 Status:             Bogus
 Type:               Bug
 Package:            Directory function related
 Operating System:   Windows 7
 PHP Version:        5.3.5
 Block user comment: N
 Private report:     N

 New Comment:

> Windows supports UCS-2 internally via the wide char APIs.

I know... I'm just wondering why:

"mb_detect_encoding($content)" is returning 'UTF-8'
and
"mb_check_encoding($content, 'UTF-8')" is returning FALSE?


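A possible explanation, sketched under assumptions: according to the mbstring documentation, mb_detect_encoding() without the strict flag returns the closest candidate from its detection order even when the string is not valid in it, whereas mb_check_encoding() actually validates the bytes. The byte string below and the ISO-8859-1 fallback are assumptions for illustration only; the real bytes depend on the ANSI code page of the system.

```php
<?php
// "Startmenü" as single-byte ANSI text: ü is the lone byte 0xFC
// (assumed here; true for Windows-1252 and ISO-8859-1).
$name = "Startmen\xFC";

// A lone 0xFC byte is not a valid UTF-8 sequence, so validation fails.
$isUtf8 = mb_check_encoding($name, 'UTF-8');

// Strict detection rejects every candidate the bytes are invalid in.
$strictUtf8Only = mb_detect_encoding($name, 'UTF-8', true);
$strictWithFallback = mb_detect_encoding($name, array('UTF-8', 'ISO-8859-1'), true);

var_dump($isUtf8);             // bool(false)
var_dump($strictUtf8Only);     // bool(false)
var_dump($strictWithFallback); // string(10) "ISO-8859-1"
```

If this reading is right, passing true as the third argument in the script from the earlier comment should make the two functions agree.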



Also I think there is another problem:

| C:\Users\Carsten Wiedmann>php -r "echo realpath('.');"
| C:\Users\Carsten Wiedmann
| C:\Users\Carsten Wiedmann>cd Startmenü
|
| C:\Users\Carsten Wiedmann\Startmenü>php -r "echo realpath('.');"
|
| C:\Users\Carsten Wiedmann\Startmenü>
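The empty output would be consistent with realpath() returning false, because echo prints false as an empty string. That is an assumption, not something the transcript proves. A minimal sketch of how to tell the two cases apart, using a made-up path that is assumed not to exist:

```php
<?php
// Hypothetical path, assumed not to exist, standing in for the failing directory.
$result = realpath('/no/such/dir/for-bug-54028');

// echo converts false to an empty string, so this prints nothing.
echo $result;

// var_dump() makes the failure visible.
var_dump($result); // bool(false)
```

Running php -r "var_dump(realpath('.'));" inside the Startmenü directory would show whether false is really what comes back.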



Regards,

Carsten


Previous Comments:
------------------------------------------------------------------------
[2011-02-25 13:32:49] paj...@php.net

There is no UTF-8 support in Windows APIs or in PHP for the file system
APIs.

Windows supports UCS-2 internally via the wide char APIs. PHP relies on
the ANSI APIs and the encoding is then the runtime encoding (whatever is
set for the running process or system-wide).

The feature request I was referring to is about making PHP use the wide
char API and accept UTF-8 as input (and output).
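Until such a feature exists, a script that needs UTF-8 could convert the names itself. The following is only a sketch: ansi_name_to_utf8() is a hypothetical helper, and Windows-1252 is an assumed code page (typical for Western European systems); the actual runtime encoding has to be known or configured.

```php
<?php
// Hypothetical helper: convert a file name from the ANSI code page to UTF-8.
// The default code page is an assumption and must match the running system.
function ansi_name_to_utf8($name, $codepage = 'Windows-1252')
{
    // Names that are already valid UTF-8 (including plain ASCII) pass through.
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }
    return mb_convert_encoding($name, 'UTF-8', $codepage);
}

// "Startmenü" with ü as the single byte 0xFC becomes the two bytes 0xC3 0xBC.
$utf8 = ansi_name_to_utf8("Startmen\xFC");
var_dump($utf8 === "Startmen\xC3\xBC"); // bool(true)
```

The pass-through check is a heuristic: a few ANSI byte sequences happen to be valid UTF-8 as well, so converting unconditionally may be the safer choice when the source encoding is known for certain.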

------------------------------------------------------------------------
[2011-02-25 13:29:15] carsten_sttgt at gmx dot de

| and the problem does only occur with Windows/CLI.



I see no difference between CGI and CLI (both executed from the shell).

However, something is curious:

<?php
// List the user profile directory and report every entry whose name is
// not valid UTF-8, together with the encoding that mbstring detects.
$directory = dir(getenv('USERPROFILE'));
while (false !== ($content = $directory->read())) {
    if (mb_check_encoding($content, 'UTF-8') === false) {
        printf('Returned non-utf-8 (%s)', $content);
        printf(" Encoding: %s\r\n", mb_detect_encoding($content));
    }
}
?>

And the output is:

Returned non-utf-8 (Startmenü) Encoding: UTF-8





Regards,

Carsten

------------------------------------------------------------------------
[2011-02-15 17:10:43] schmale at froglogic dot com

Well, I don't know what encoding Windows uses, but I do know that it
works properly with the Windows CGI version. The point is, a directory
called 'Startmenü' is returned as 'Startmenü' with Linux/CGI, Linux/CLI,
and Windows/CGI, but NOT with Windows/CLI, which returns 'Startmenñæ'
(or something similar). In other words: the behaviour with Windows/CLI
is broken, whereas the other versions return the exact name of the
directory, as expected.

So I think it has little or nothing to do with unicode filesystem
support or the encoding of Windows, but with differences between CGI and
CLI.
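One way to check whether CGI and CLI really return different bytes, or whether only the display differs, is to compare hex dumps of the names. This is a sketch with assumed sample bytes, not output captured from either SAPI:

```php
<?php
// Assumed sample values: the same name as single-byte ANSI text and as UTF-8.
$ansi = "Startmen\xFC";
$utf8 = "Startmen\xC3\xBC";

// bin2hex() shows the raw bytes, independent of how a console or
// browser renders them.
echo bin2hex($ansi), "\n"; // 53746172746d656efc
echo bin2hex($utf8), "\n"; // 53746172746d656ec3bc
```

Printing bin2hex($content) inside the read() loop under both Windows/CGI and Windows/CLI would show directly whether the returned strings differ.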

------------------------------------------------------------------------
[2011-02-15 16:54:17] paj...@php.net

There is already a feature request for unicode filesystem support.



Btw, Windows does not use UTF-8 for its encoding.

------------------------------------------------------------------------
[2011-02-15 16:51:20] schmale at froglogic dot com

Description:
------------
Note: This problem ONLY affects the CLI interpreter, NOT the CGI.

Using dir('path/to/dir'), the read() method does not return UTF-8 if
the directory contains e.g. umlauts (ä, ö, ü). I tested this on Linux
and Windows, both CGI and CLI, and the problem only occurs with
Windows/CLI.

Test script:
---------------
$path = 'path/to/directory/which/contains/umlauts';

// Report every directory entry whose name is not valid UTF-8.
$directory = dir($path);
while (false !== ($content = $directory->read())) {
    if (mb_check_encoding($content, 'UTF-8') === false) {
        fprintf(STDERR, 'Returned non-utf-8 (%s)', $content);
    }
}



Expected result:
----------------
The expected result, of course, was that the return value of read() is
always encoded in UTF-8, i.e. no messages are printed when we run the
script.

Actual result:
--------------
If a subdirectory name contains umlauts (or, I guess, any non-ASCII
character), a message is printed, i.e. the return value is not encoded
in UTF-8.


------------------------------------------------------------------------


