On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
Corinna Vinschen via Cygwin wrote:
On Jun 25 16:59, Christian Franke via Cygwin wrote:
On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
If a file name contains an invalid (truncated) UTF-8 sequence, open()
does not refuse to create the file. Later readdir() returns a different
name which could not be used to access the file.

Testcase with U+1F321 (Thermometer):

$ uname -r
3.5.4-1.x86_64

$ printf $'\U0001F321' | od -A none -t x1
   f0 9f 8c a1

$ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'

$ touch 'file2-'$'\xf0\x9f\x8c''.ext'

$ touch 'file3-'$'\xf0\x9f\x8c'

$ ls -1
ls: cannot access 'file2-.?ext': No such file or directory
ls: cannot access 'file3-': No such file or directory
'file1-'$'\360\237\214\241''.ext'
file2-.?ext
file3-
[...]
I don't know exactly where this happens, but the input of the
conversion is invalid UTF-8 because it's missing the 4th byte.
There's no way to represent these filenames on Windows
filesystems storing filenames as UTF-16 values.
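To make the claim about invalid input concrete, here is a minimal sketch that checks the three names from the testcase with Python's strict UTF-8 decoder. The helper name `is_valid_utf8` is made up for illustration, and this says nothing about how Cygwin itself validates paths.

```python
# Byte strings matching the three file names from the testcase above.
NAMES = [
    b"file1-\xf0\x9f\x8c\xa1.ext",  # complete 4-byte sequence for U+1F321
    b"file2-\xf0\x9f\x8c.ext",      # 4th byte missing, '.' follows instead
    b"file3-\xf0\x9f\x8c",          # 4th byte missing at the end of the name
]


def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


for name in NAMES:
    print(name, is_valid_utf8(name))
```

Only the first name passes; the other two are rejected because the byte after `\x8c` is not a continuation byte (or is missing altogether).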

So the problem here is that the conversion somehow misses that
the 4th byte is invalid and just plods forward and converts the
leading three bytes into the matching high surrogate value and
then stumbles over the conversion for the low surrogate.
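The arithmetic behind this can be sketched as follows: the first three bytes of a 4-byte sequence already fix the high surrogate, while the low surrogate needs the low six bits of byte 4. The function names are made up for illustration; this is a sketch of the standard UTF-8 and UTF-16 bit layout, not the newlib code.

```python
def high_surrogate(b1: int, b2: int, b3: int) -> int:
    """High surrogate from the first three bytes of a 4-byte UTF-8 sequence."""
    # Code point bits 20..6 are known at this point; bits 5..0 are still missing,
    # but they do not influence the high surrogate (which uses bits 20..10).
    partial = ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6)
    return 0xD800 + ((partial - 0x10000) >> 10)


def low_surrogate(b3: int, b4: int) -> int:
    """Low surrogate; needs code point bits 9..0, so byte 4 is required."""
    return 0xDC00 | ((b3 & 0x0F) << 6) | (b4 & 0x3F)


def is_continuation(b: int) -> bool:
    """True if `b` is a valid UTF-8 continuation byte (10xxxxxx)."""
    return 0x80 <= b <= 0xBF


# U+1F321 is f0 9f 8c a1 in UTF-8.
print(hex(high_surrogate(0xF0, 0x9F, 0x8C)))  # 0xd83c
print(hex(low_surrogate(0x8C, 0xA1)))         # 0xdf21
# In file2 the byte after 0x8c is '.', in file3 it is the terminating NUL.
print(is_continuation(ord(".")), is_continuation(0x00))
```

For file2 and file3 the byte in the fourth position fails `is_continuation`, so no valid low surrogate exists for the high surrogate that the first three bytes produce.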

It would be really helpful to have an STC for this problem.
With some trial and error I found a testcase for this more serious problem
reported yesterday but not quoted above:

In cases like file3-... above, the converted Windows path ends with
0xF000. This suggests that this is an accidental conversion of the
terminating null to the 0xF0xx range.

In some cases, the created Windows file name has random garbage
behind the 0xF000. Then even Cygwin is not able to access or unlink
the file after creation.
Testcase (attached):
Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In the case of 4-byte UTF-8 sequences, the code created the
high surrogate already after reading byte 3, without checking whether
byte 4 of the UTF-8 sequence is a valid byte. Hilarity ensues.
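A conversion that sees the whole buffer can check every byte of a sequence before it emits anything. The following is a minimal sketch of that idea, assuming a complete path string is available up front; `utf8_to_utf16_units` is a hypothetical name, and this is not the fix that went into newlib.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Convert UTF-8 bytes to UTF-16 code units; raise ValueError on bad input."""
    units = []
    i = 0
    while i < len(data):
        b1 = data[i]
        # Number of continuation bytes, payload bits of the lead byte,
        # and the smallest code point allowed for this length.
        if b1 < 0x80:
            need, cp, lowest = 0, b1, 0
        elif 0xC2 <= b1 <= 0xDF:
            need, cp, lowest = 1, b1 & 0x1F, 0x80
        elif 0xE0 <= b1 <= 0xEF:
            need, cp, lowest = 2, b1 & 0x0F, 0x800
        elif 0xF0 <= b1 <= 0xF4:
            need, cp, lowest = 3, b1 & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        # Validate ALL continuation bytes before emitting any code unit.
        tail = data[i + 1:i + 1 + need]
        if len(tail) != need or any(not 0x80 <= b <= 0xBF for b in tail):
            raise ValueError(f"truncated or invalid sequence at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong or out-of-range sequence at offset {i}")
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
        i += 1 + need
    return units


print([hex(u) for u in utf8_to_utf16_units(b"file1-\xf0\x9f\x8c\xa1.ext")])
```

With this approach the file2 and file3 names from the testcase raise `ValueError` instead of yielding an unpaired surrogate.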
I'm afraid the fix may have broken mbrtowc, as I just reported to the
list with a test case, thus also breaking mintty. The high surrogate
MUST be created after byte 3 because otherwise the low surrogate cannot
be delivered after byte 4 as it needs to be. I think this is a drawback
of UTF-16 that has to be accepted, even if some incorrect sequences slip
through.

Thomas
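The constraint on a byte-at-a-time interface can be illustrated with a toy model. In UTF-16 the high surrogate comes first, so a converter that is fed single bytes hands it out once byte 3 has arrived and delivers the low surrogate after byte 4. The class below is a hypothetical sketch that handles only ASCII and 4-byte sequences; it is not mbrtowc and does not reproduce its return-value conventions.

```python
class StreamingDecoder:
    """Toy byte-at-a-time UTF-8 to UTF-16 decoder (ASCII and 4-byte sequences only)."""

    def __init__(self):
        self.pending = []  # bytes of an unfinished 4-byte sequence

    def feed(self, b: int):
        """Consume one byte; return a UTF-16 code unit, or None if more input is needed."""
        if not self.pending:
            if b < 0x80:
                return b
            if 0xF0 <= b <= 0xF4:
                self.pending = [b]
                return None
            raise ValueError("lead byte not handled by this sketch")
        if not 0x80 <= b <= 0xBF:
            self.pending = []
            raise ValueError("invalid continuation byte")
        self.pending.append(b)
        if len(self.pending) == 2:
            b1 = self.pending[0]
            # Reject overlong forms and code points above U+10FFFF early.
            if (b1 == 0xF0 and b < 0x90) or (b1 == 0xF4 and b > 0x8F):
                self.pending = []
                raise ValueError("overlong or out-of-range sequence")
            return None
        if len(self.pending) == 3:
            b1, b2, b3 = self.pending
            partial = ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6)
            # The high surrogate leaves the decoder before byte 4 has been seen.
            return 0xD800 + ((partial - 0x10000) >> 10)
        # Fourth byte: finish the sequence and deliver the low surrogate.
        b3, b4 = self.pending[2], self.pending[3]
        self.pending = []
        return 0xDC00 | ((b3 & 0x0F) << 6) | (b4 & 0x3F)


decoder = StreamingDecoder()
print([decoder.feed(b) for b in b"\xf0\x9f\x8c\xa1"])
```

If the fourth byte turns out to be invalid, `feed` raises, but the caller has already received the high surrogate. That is the case in which an incorrect sequence can slip through on the caller's side.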

Fortunately this bug has only been introduced very recently, to wit, on
2009-03-24, a mere 16 years ago.  And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.


Thanks,
Corinna



--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
