Edit report at https://bugs.php.net/bug.php?id=47096&edit=1

 ID:                 47096
 Comment by:         nicolas dot grekas+php at gmail dot com
 Reported by:        nuabaranda at web dot de
 Summary:            move_uploaded_file not OS encoding aware
 Status:             Open
 Type:               Bug
 Package:            Filesystem function related
 Operating System:   win32 only - Windows XP
 PHP Version:        5.2.8
 Block user comment: N
 Private report:     N

 New Comment:

Well, if you really need it, there may be one possibility using a COM object:

$fs = new \COM('Scripting.FileSystemObject', null, CP_UTF8);


Previous Comments:
------------------------------------------------------------------------
[2012-04-03 15:12:07] salsi at icosaedro dot it

Just to complete my little survey of the file names encoding issue:

1. Under Windows Vista, in the "Regional and Language Settings" control panel, 
the "Formats" panel must also be set to match the language selected in the 
"Advanced" panel in order to set the LC_CTYPE property; the "Advanced" panel 
only selects the translation mapping between Unicode and the multi-byte 
encoding and does not set the locale properties.
For example, in a Western country LC_CTYPE="english_United States.1252" while 
in Japan it might be LC_CTYPE="Japanese_Japan.932".

2. Windows applies the "best fit" conversion table 
(http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/) when 
translating from Unicode file names to multi-byte file names 
(http://msdn.microsoft.com/en-us/library/windows/desktop/dd374047%28v=vs.85%29.aspx);
 characters that have no best fit are replaced by a question mark "?".
So, for example, when the Japanese locale is set (code page 932) the Latin 
capital letter A with dieresis ("Ä") might map to the plain capital letter "A" 
and accented vowels like "àèìòù" might be translated to the plain ASCII 
letters "aeiou".
This means that, from inside PHP, file names retrieved from the file system via 
dir() or getcwd() are only APPROXIMATIONS of the real path and there is no way 
to detect whether they really match the actual name.
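
One narrow case can still be caught. Because "?" is not a legal character in a 
Windows file name, a name that comes back from the file system containing "?" 
has certainly lost characters. The sketch below flags only that case; the 
function name is hypothetical, and a name changed by a best-fit substitution 
(such as "Ä" becoming "A") passes through unnoticed, as described above.

```php
<?php
/**
 * Return the entries of a directory listing that contain "?", i.e. names
 * in which Windows replaced at least one character it could not map.
 * Hypothetical helper: it cannot detect best-fit substitutions.
 *
 * @param string[] $names single path components, e.g. from scandir()
 * @return string[]
 */
function findVisiblyMangled(array $names): array
{
    $mangled = array_filter($names, function (string $name): bool {
        return strpos($name, '?') !== false;
    });
    // Re-index so the result is a plain list.
    return array_values($mangled);
}
```

A typical call would be findVisiblyMangled(scandir($dir)); an empty result does 
not prove that the remaining names are exact.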


Conclusions
===========

Under Unix and Linux with a properly set locale, PHP programs can access and 
retrieve any file name that matches the current locale; UTF-8 is the best 
choice here.

Under Windows, PHP programs can generate and access any file or file path 
that contains only characters included in the current code page table; however, 
PHP programs cannot trust file names retrieved from the file system because 
these might be arbitrarily mangled and there is no way to detect such artifacts.

------------------------------------------------------------------------
[2012-03-17 18:19:24] salsi at icosaedro dot it

As PHP operates under Windows as a "non-Unicode aware program", file names are 
bare arrays of bytes represented in PHP as "string"; these strings are 
converted back and forth to Unicode by Windows according to the currently 
selected "code page table" (see "Control Panel", "Regional and Language 
Options", "Administrative" tab panel, "Language for non-Unicode programs"). 
Unfortunately, UTF-8 encoding is not available there, so whatever locale you 
choose, some Unicode file names may still remain inaccessible to PHP.

For example, if your system locale uses any Western European encoding (code 
page 1252), there is no way to refer to a file whose name is "日本語"; only on 
a Windows system with the Japanese locale set (code page 932) can you access 
such a name, provided that the "string" that represents that name is properly 
encoded as required by code page 932, that is "\x93\xfa\x96\x7b\x8c\xea".

So, if you have a generic name of a file (along with its path) as a Unicode 
string $u (for example UTF-8 encoded) and you want to try to save it with that 
name under Windows, you must first check the current locale by calling 
setlocale(LC_CTYPE, 0) to retrieve the current code page, then convert $u to 
an array of bytes according to that code page; if one or more code points 
have no counterpart in the current code page, the file cannot be saved with 
that name from PHP. Period.
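
A minimal sketch of that procedure follows. The function names are 
hypothetical, and it assumes the local iconv build recognises the code page 
under the name "CP<number>", which is not guaranteed for every Windows code 
page. The round trip back to UTF-8 is an extra safeguard against iconv 
builds that substitute characters silently instead of failing.

```php
<?php
/**
 * Extract the code page number from a Windows-style locale string such as
 * "English_United States.1252". Returns null when no number is present
 * (for example "C" or "en_US.UTF-8").
 */
function codePageFromLocale(string $locale): ?int
{
    if (preg_match('/\.(\d+)$/', $locale, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Convert a UTF-8 file name to the given code page. Returns null when some
 * character has no exact counterpart, or when iconv does not know the
 * code page under the assumed "CP<number>" name.
 */
function encodeFileName(string $utf8Name, int $codePage): ?string
{
    $charset = 'CP' . $codePage;
    $bytes = @iconv('UTF-8', $charset, $utf8Name);
    if ($bytes === false) {
        return null;
    }
    // Round trip: accept the result only if it decodes to the original.
    $back = @iconv($charset, 'UTF-8', $bytes);
    return $back === $utf8Name ? $bytes : null;
}

// Usage: "0" asks setlocale() to report the current setting unchanged.
$codePage = codePageFromLocale((string) setlocale(LC_CTYPE, '0'));
$encoded = $codePage === null ? null : encodeFileName("report.txt", $codePage);
```

When $encoded is null, either the locale string carries no code page or the 
name cannot be represented, and the file cannot be saved under that name.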

To complicate the implementation of such an algorithm, neither mbstring nor 
iconv is aware of all the Windows code pages, so you must write these 
conversion routines yourself. This is just what I have done experimentally 
in PHP, and it appears to work nicely 
(http://www.icosaedro.it/phplint/libraries.cgi?lib=stdlib/it/icosaedro/io/FileName.html).
 Hopefully some day something similar will be available in the PHP core 
library, or some other abstraction layer of classes may provide full access to 
the Unicode realm.

References:

http://en.wikipedia.org/wiki/Windows_code_page

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/

------------------------------------------------------------------------
[2011-09-23 03:02:09] xd-yang at qq dot com

Since basename() is locale aware, why not move_uploaded_file()?
A common workaround is to use iconv() to explicitly convert the destination 
file name, usually from UTF-8 to an ANSI encoding (like GB2312). But this 
becomes complicated and impractical in a multilingual CMS such as WordPress. 
Can this issue be solved in the future?
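
For reference, the iconv() workaround might look like the sketch below. The 
function name and the upload field name are hypothetical, and the target 
charset must be one the local iconv build supports. The client-supplied name 
is split by hand rather than with basename(), since basename() depends on the 
locale.

```php
<?php
/**
 * Build a destination path for an upload, converting the UTF-8 file name
 * to the charset used by the file system. Returns null when the name is
 * unusable or cannot be converted.
 */
function uploadDestination(string $dir, string $utf8Name, string $fsCharset): ?string
{
    // Keep only the last path component, whichever separator was used.
    $parts = preg_split('#[/\\\\]#', $utf8Name);
    $name = end($parts);
    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }
    $converted = @iconv('UTF-8', $fsCharset, $name);
    if ($converted === false || $converted === '') {
        return null;
    }
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $converted;
}

// Intended use inside an upload handler (field name "file" is assumed):
// $dest = uploadDestination('C:\\uploads', $_FILES['file']['name'], 'GB2312');
// if ($dest !== null) {
//     move_uploaded_file($_FILES['file']['tmp_name'], $dest);
// }
```

This only helps when every uploaded name fits the single target charset, 
which is exactly the limitation described above for multilingual sites.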

------------------------------------------------------------------------
[2009-02-26 09:46:51] mm107137 at spamcorptastic dot com

I have the same problem on a Debian host (hosted by OVH).
File names with French accents passed to move_uploaded_file() are mangled.
There are no problems if the file name is not passed as UTF-8.

Very annoying

------------------------------------------------------------------------
[2009-02-06 20:21:49] mindfreakthemon at gmail dot com

The bug also exists on Windows 7 and Vista under Apache 2.2.

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=47096

