On Thu, 19 Sept 2024 at 16:46, Brian Inglis via Cygwin <cygwin@cygwin.com> wrote:
>
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> > Mark Liam Brown via Cygwin wrote:
> >> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
> >> <cygwin@cygwin.com> wrote:
> >>> Christian Franke via Cygwin wrote:
> >>>> Thomas Wolff via Cygwin wrote:
> >>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
> >>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
> >>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>>>>> does not refuse to create the file. Later readdir() returns a
> >>>>>>> different name which could not be used to access the file.
> >>>>>>>
> >>>>>>> Testcase with U+1F321 (Thermometer):
> >>>>>>>
> >>>>>>> $ uname -r
> >>>>>>> 3.5.4-1.x86_64
> >>>>>>>
> >>>>>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>>>> f0 9f 8c a1
> >>>>>>>
> >>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>>>>
> >>>>>>> $ ls -1
> >>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>>>>> ls: cannot access 'file3-': No such file or directory
> >>>>>>> 'file1-'$'\360\237\214\241''.ext'
> >>>>>>> file2-.?ext
> >>>>>>> file3-
> >>>>>> I don't reproduce this.
> >>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> >>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
> >>>> occur then.
> >>>>
> >>>>
> >>>>>> While the file name gets mangled, all resulting file names are valid
> >>>>>> and listed:
> >>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>>>>> In file3 the same sequence is just dropped.
> >>>>>> $ ls -1|cat
> >>>>>> file1-🌡.ext
> >>>>>> file2-.ឳext
> >>>>>> file3-
> >>>>>>
> >>>>>> However, ls file2* fails, as does ls *.
> >>>>> On the other hand, ls file3- fails too, so some mapping error occurs
> >>>>> internally.
> >>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
> >>>> 'rm' using the original names works for file2-..., but not for file3-...
> >>>>
> >>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>> removed 'file2-'$'\360\237\214''.ext'
> >>>>
> >>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> >>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >>>>
> >>> Further tests suggest that the problem only occurs with:
> >>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> >>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> >>> 'high surrogate' range (0xD800..0xDBFF).
> >> Makes perfect sense, the Windows kernel uses UTF16 internally.
> >
> >
> > Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
> > mappings. This makes no sense:
> >
> > $ touch 'file-'$'\xed\xa0\x80''.ext' # creates L"file-\xD800.ext" on NTFS
> >
> > $ strace ls -F
> > ...
> > ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
> > ...
> > ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> > ...
> > ls: cannot access 'file-?.ext': No such file or directory
> > file-?.ext
> >
> > $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> > removed 'file-'$'\355\240\200''.ext'
> >
> > The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
> > Rightwards Arrow).
> >
> >
> > This could be fixed by handling UTF-8 of the surrogate range similar to other
> > invalid sequences: Map each invalid byte to unicode range U+F080 to U+F0FF. This
> > works as expected if the above UTF-8 sequence is truncated:
> >
> > $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> >
> > $ ls -F
> > 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid turns
> the encoding into what has been called WTF-8 where W may be for Windows! ;^>
>

Nope, the WTF-8 means "What the F*ck-8"!
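
To make the two options in the thread concrete (fail on invalid input, or map each invalid byte to a private-use code point so the name round-trips), here is a rough Python sketch. It is only an illustration and not Cygwin's code: the handler name "pua_escape" and the helpers from_bytes() and to_bytes() are made up for this example, and the 0xF000 base is inferred from the L"file-\xF0ED\xF0A0.ext" result quoted above. Python's strict UTF-8 decoder already rejects both truncated sequences and encoded surrogates, so every such byte ends up in the handler.

```python
import codecs


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to U+F000 | byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Invalid UTF-8 bytes are always >= 0x80, so this lands in U+F080..U+F0FF.
    return "".join(chr(0xF000 | b) for b in bad), exc.end


# "pua_escape" is a hypothetical handler name used only in this sketch.
codecs.register_error("pua_escape", _pua_escape)


def from_bytes(raw: bytes) -> str:
    """Forward mapping: bytes from the application to a Unicode file name."""
    return raw.decode("utf-8", "pua_escape")


def to_bytes(name: str) -> bytes:
    """Reverse mapping: turn U+F080..U+F0FF back into the original bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    samples = (
        b"file1-\xf0\x9f\x8c\xa1.ext",  # valid U+1F321
        b"file2-\xf0\x9f\x8c.ext",      # truncated 4 byte sequence
        b"file3-\xf0\x9f\x8c",          # truncated at end of name
        b"file-\xed\xa0\x80.ext",       # encoded high surrogate U+D800
    )
    for raw in samples:
        name = from_bytes(raw)
        print(raw, [hex(ord(c)) for c in name], to_bytes(name) == raw)
```

With a mapping like this, the name readdir() hands back always converts to the same bytes that were passed to open(). The catch is that a name which really contains a character in U+F080..U+F0FF would no longer round-trip, so this is a trade-off and not a free fix.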

Ced
-- 
Cedric Blancher <cedric.blanc...@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur