On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin <cygwin@cygwin.com> wrote:
>
> Christian Franke via Cygwin wrote:
> > Thomas Wolff via Cygwin wrote:
> >>
> >> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
> >>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
> >>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>> does not refuse to create the file. Later readdir() returns a
> >>>> different name which could not be used to access the file.
> >>>>
> >>>> Testcase with U+1F321 (Thermometer):
> >>>>
> >>>> $ uname -r
> >>>> 3.5.4-1.x86_64
> >>>>
> >>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>  f0 9f 8c a1
> >>>>
> >>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>
> >>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>
> >>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>
> >>>> $ ls -1
> >>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>> ls: cannot access 'file3-': No such file or directory
> >>>> 'file1-'$'\360\237\214\241''.ext'
> >>>> file2-.?ext
> >>>> file3-
> >>> I don't reproduce this.
> >
> > Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> > which needs to call stat(). Plain 'ls' does not, so the errors do not
> > occur then.
> >
> >
> >>>
> >>> While the file name gets mangled, all resulting file names are valid
> >>> and listed:
> >>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>> In file3 the same sequence is just dropped.
> >>> $ ls -1|cat
> >>> file1-🌡.ext
> >>> file2-.ឳext
> >>> file3-
> >>>
> >>> However, ls file2* fails, as does ls *.
> >> On the other hand, ls file3- fails too, so some mapping error occurs
> >> internally.
> >> Also, the files cannot be deleted from cygwin (need to use cmd).
> >
> > 'rm' using the original names works for file2-..., but not for file3-...
> >
> > $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> > removed 'file2-'$'\360\237\214''.ext'
> >
> > $ rm -v 'file3-'$'\xf0\x9f\x8c'
> > rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >
>
> Further tests suggest that the problem only occurs with:
> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>   'high surrogate' range (0xD800..0xDBFF).

Makes perfect sense: the Windows kernel uses UTF-16 internally, and code points above U+FFFF are stored there as surrogate pairs.

Mark

--
IT Infrastructure Consultant
Windows, Linux
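
To make the UTF-16 connection concrete, here is a minimal Python sketch. It is not Cygwin code, and both function names are made up for illustration. It shows that the first three bytes of a 4-byte UTF-8 sequence already determine the UTF-16 high surrogate, and it checks whether a 3-byte sequence encodes a value in 0xD800..0xDBFF. Whether Cygwin's conversion actually emits a lone high surrogate at that point is an assumption, not something confirmed in this thread.

```python
def high_surrogate_from_prefix(prefix: bytes) -> int:
    """Return the UTF-16 high surrogate implied by the first three bytes
    of a 4-byte UTF-8 sequence (hypothetical helper, illustration only)."""
    if len(prefix) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = prefix
    if (b0 & 0xF8) != 0xF0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not the start of a 4-byte UTF-8 sequence")
    # The three bytes carry bits 20..6 of the code point.
    top = ((b0 & 0x07) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if not 0x400 <= top <= 0x43FF:
        raise ValueError("outside U+10000..U+10FFFF")
    # Subtract 0x10000 (0x400 after the shift by 6), keep bits 19..10.
    return 0xD800 | ((top - 0x400) >> 4)


def encodes_high_surrogate(seq: bytes) -> bool:
    """True if seq is a 3-byte UTF-8-style sequence whose decoded value
    falls in the high surrogate range 0xD800..0xDBFF (hypothetical helper)."""
    if len(seq) != 3:
        return False
    b0, b1, b2 = seq
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        return False
    value = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    return 0xD800 <= value <= 0xDBFF


if __name__ == "__main__":
    # Truncated sequence from the test case: f0 9f 8c (a1 missing).
    print(hex(high_surrogate_from_prefix(b"\xf0\x9f\x8c")))
    # The same surrogate value, written as an invalid 3-byte sequence.
    print(encodes_high_surrogate(b"\xed\xa0\xbc"))
```

For the truncated test-case bytes f0 9f 8c, the helper yields the high surrogate of U+1F321, so both cases in the list above come down to a high surrogate with no low surrogate following it.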