On 8/15/23 16:00, Tomas Kalibera wrote:
On 8/15/23 09:04, Ivan Krylov wrote:
В Tue, 15 Aug 2023 08:38:11 +0200
Tomas Kalibera <tomas.kalib...@gmail.com> пишет:
As this was reported to be regression in 4.3, it is entirely possible
this change came with a regression (though a bit surprising we didn't
catch it earlier by testing), so it would be a great help if I could
have the example and debug it.
Sorry, let me try to be more clear.
The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.
This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.
Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)
stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()
Does this work? I don't have access to a UTF-8 Windows machine right
now.
Thanks, yes, I can reproduce the problem. Some Windows functions
impose 260 wide characters limit, but other 260 bytes limit, so one
can create a file with a name too long to be found by FindNextFileA.
In R 4.2, we used readdir() from mingw-w64, which itself used
findnext, which however had the same problem, it used a buffer of size
260 bytes and from the code of mingw-w64 and the Windows
documentation, it should have behaved the same, it should have stopped
the search on such a long file name. However, in my use case, R 4.2.3
crashed inside findnext due to stack overrun, R 4.1.3 worked, but
clearly it would require a different use case to overrun this buffer
as it didn't use UTF-8. This suggests that findnext didn't have a
check for this and hence caused memory corruption, which can lead to a
crash or work by coincidence. Which could have been the case for the
user reporting this as a regression compared to R 4.2. But it is not a
regression, the problem existed for long.
So, yes, we'd probably have to use wide variants of
FindNext/FindFirst. I'll fix.
Fixed in R-devel (84960). Please let me know if you see any problem with
the fix.
Thanks,
Tomas
Thanks for debugging this,
Tomas
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel