On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
Mark Liam Brown via Cygwin wrote:
On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
<cygwin@cygwin.com> wrote:
Christian Franke via Cygwin wrote:
Thomas Wolff via Cygwin wrote:
On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
If a file name contains an invalid (truncated) UTF-8 sequence, open()
does not refuse to create the file. Later readdir() returns a
different name which could not be used to access the file.
Testcase with U+1F321 (Thermometer):
$ uname -r
3.5.4-1.x86_64
$ printf $'\U0001F321' | od -A none -t x1
f0 9f 8c a1
$ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
$ touch 'file2-'$'\xf0\x9f\x8c''.ext'
$ touch 'file3-'$'\xf0\x9f\x8c'
$ ls -1
ls: cannot access 'file2-.?ext': No such file or directory
ls: cannot access 'file3-': No such file or directory
'file1-'$'\360\237\214\241''.ext'
file2-.?ext
file3-
I don't reproduce this.
Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
which needs to call stat(). Plain 'ls' does not, so the errors do not
occur then.
While the file name gets mangled, all resulting file names are valid and
listed:
In file2 the sequence is turned into U+17B3 but exchanged with the dot.
In file3 the same sequence is just dropped.
$ ls -1|cat
file1-🌡.ext
file2-.ឳext
file3-
However, ls file2* fails, as does ls *.
On the other hand, ls file3- fails too, so some mapping error occurs
internally.
Also, the files cannot be deleted from cygwin (need to use cmd).
'rm' using the original names works for file2-..., but not for file3-...
$ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
removed 'file2-'$'\360\237\214''.ext'
$ rm -v 'file3-'$'\xf0\x9f\x8c'
rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
Further tests suggest that the problem only occurs with:
- incomplete 4-byte UTF-8 sequences (code points above 16 bits)
- complete but invalid 3-byte UTF-8 sequences which encode the UTF-16
'high surrogate' range (0xD800..0xDBFF).
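The two classes above can be detected with a simple byte scan. The following is a minimal Python sketch, not Cygwin code: the function name `find_problem_sequences` is hypothetical, and the scan is deliberately simplified (it only checks lead bytes and the number of continuation bytes, not every UTF-8 validity rule).

```python
def find_problem_sequences(name: bytes):
    """Return (offset, kind) for each sequence in the two classes above.

    Simplified on purpose: only the lead byte and the continuation-byte
    count are inspected, not all UTF-8 well-formedness rules.
    """
    hits = []
    i = 0
    n = len(name)
    while i < n:
        b0 = name[i]
        if 0xF0 <= b0 <= 0xF4:
            # Lead byte of a 4-byte sequence: count the bytes actually present.
            j = i + 1
            while j < n and j < i + 4 and 0x80 <= name[j] <= 0xBF:
                j += 1
            if j - i < 4:
                hits.append((i, "incomplete 4-byte sequence"))
            i = j
        elif (b0 == 0xED and i + 2 < n
              and 0xA0 <= name[i + 1] <= 0xAF
              and 0x80 <= name[i + 2] <= 0xBF):
            # ED A0 80 .. ED AF BF encodes U+D800..U+DBFF (high surrogates).
            hits.append((i, "3-byte high surrogate"))
            i += 3
        else:
            i += 1
    return hits


print(find_problem_sequences(b"file1-\xf0\x9f\x8c\xa1.ext"))
print(find_problem_sequences(b"file2-\xf0\x9f\x8c.ext"))
print(find_problem_sequences(b"file-\xed\xa0\x80.ext"))
```

With the names from the test case, only file2 and file3 (truncated 4-byte sequence) and the complete high-surrogate name are flagged; the complete U+1F321 name is not.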
Makes perfect sense: the Windows kernel uses UTF-16 internally.
Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
mappings. This makes no sense:
$ touch 'file-'$'\xed\xa0\x80''.ext' # creates L"file-\xD800.ext" on NTFS
$ strace ls -F
...
... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
...
... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
...
ls: cannot access 'file-?.ext': No such file or directory
file-?.ext
$ rm -v 'file-'$'\xed\xa0\x80''.ext'
removed 'file-'$'\355\240\200''.ext'
The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
Rightwards Arrow).
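The mismatch can be checked by hand-decoding both 3-byte sequences. This is a small Python sketch for illustration only; `decode_3byte` is a hypothetical helper that applies the plain 3-byte UTF-8 bit layout without rejecting surrogates.

```python
def decode_3byte(seq: bytes) -> int:
    """Decode a 1110xxxx 10xxxxxx 10xxxxxx pattern to its code point value.

    Surrogates are deliberately not rejected, so the value that was
    passed to touch can be inspected.
    """
    if (len(seq) != 3 or (seq[0] & 0xF0) != 0xE0
            or any((b & 0xC0) != 0x80 for b in seq[1:])):
        raise ValueError("not a 3-byte UTF-8 pattern")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


created = decode_3byte(b"\xed\xa0\x80")   # bytes passed to touch
returned = decode_3byte(b"\xe2\x9e\xb3")  # bytes reported by readdir()
print(f"created U+{created:04X}, returned U+{returned:04X}")
```

The name goes in as the high surrogate 0xD800 and comes back as U+27B3, so the round trip through readdir() does not return the bytes that were used to create the file.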
This could be fixed by handling UTF-8 of the surrogate range similar to other
invalid sequences: map each invalid byte to the Unicode range U+F080 to U+F0FF. This
works as expected if the above UTF-8 sequence is truncated:
$ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
$ ls -F
'file-'$'\355\240''.ext'
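The byte-wise mapping seen in the truncated example (bytes 0xED and 0xA0 stored as 0xF0ED and 0xF0A0) can be sketched in Python as a reversible transform. This is an illustration under assumptions, not Cygwin's implementation: the handler name `pua_f0xx` and the helpers `decode_name` and `encode_name` are hypothetical, and Python's strict UTF-8 decoder stands in for the validity check.

```python
import codecs


def _to_private_use(exc):
    """Error handler: store each undecodable byte b as U+F000 + b."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 + b) for b in bad), exc.end


codecs.register_error("pua_f0xx", _to_private_use)


def decode_name(raw: bytes) -> str:
    """Forward mapping: valid UTF-8 decodes normally, invalid bytes go to U+F0xx."""
    return raw.decode("utf-8", errors="pua_f0xx")


def encode_name(name: str) -> bytes:
    """Reverse mapping: U+F080..U+F0FF become the original raw bytes again."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


for raw in (b"file-\xed\xa0.ext", b"file-\xed\xa0\x80.ext"):
    name = decode_name(raw)
    print([f"U+{ord(c):04X}" for c in name[5:-4]], encode_name(name) == raw)
```

Because Python's strict decoder also rejects the complete surrogate sequence ED A0 80, all three bytes are mapped individually and the name round-trips, which is the behaviour proposed above. One caveat of this sketch: a file name that legitimately contains a character in U+F080..U+F0FF would be turned into a raw byte on the way back.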
Surrogate halves are invalid in UTF-8 encoding; they should first be
encoded as a valid UTF-16 code point.
The encoder should just fail if it encounters any invalid sequence!
Handling surrogates or other invalid values as anything other than invalid turns
the encoding into what has been called WTF-8 where W may be for Windows! ;^>
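The difference between the two positions can be seen with Python's codecs, used here purely as a stand-in. The strict `utf-8` codec fails on a lone surrogate, while the `surrogatepass` error handler emits the WTF-8-style bytes; the helper names `encode_strict` and `encode_lenient` are hypothetical.

```python
def encode_strict(name: str) -> bytes:
    """Strict UTF-8: raises UnicodeEncodeError on any lone surrogate."""
    return name.encode("utf-8")


def encode_lenient(name: str) -> bytes:
    """WTF-8-style: a lone surrogate is encoded as a 3-byte sequence."""
    return name.encode("utf-8", errors="surrogatepass")


name = "file-\ud800.ext"  # lone high surrogate, as in L"file-\xD800.ext"

try:
    encode_strict(name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed, encode_lenient(name))
```

The lenient result is exactly the ED A0 80 sequence from the touch example, which is what a consistent reverse mapping would have to return if the encoder does not fail.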
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry