Re: [Rd] R-4.3 version list.files function could not work correctly in chinese

Tomas Kalibera Wed, 16 Aug 2023 00:42:44 -0700


On 8/15/23 16:00, Tomas Kalibera wrote:

On 8/15/23 09:04, Ivan Krylov wrote:
On Tue, 15 Aug 2023 08:38:11 +0200
Tomas Kalibera <[email protected]> wrote:
As this was reported to be a regression in 4.3, it is entirely possible
this change came with a regression (though it is a bit surprising we
didn't catch it earlier by testing), so it would be a great help if I
could have the example and debug it.
Sorry, let me try to be more clear.

The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.

This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.
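
The byte arithmetic can be checked directly in R. This is only a sketch:
buf_size and utf8_bytes are names made up for illustration, and the
terminating NUL byte (which, as far as I know, also has to fit in the
buffer) is ignored.

```r
# Illustrative sketch: compare UTF-8 byte lengths of file names with the
# 260-byte file name buffer in WIN32_FIND_DATAA.
# buf_size and utf8_bytes are hypothetical names, not part of R or Windows.
buf_size <- 260L

# Number of bytes a name occupies once encoded as UTF-8.
utf8_bytes <- function(name) nchar(enc2utf8(name), type = "bytes")

o_slash <- strrep("\u00f8", 140)  # 140 characters, 2 bytes each
cjk     <- strrep("\u4e2d", 87)   # 87 characters, 3 bytes each

utf8_bytes(o_slash)             # 280
utf8_bytes(cjk)                 # 261
utf8_bytes(o_slash) > buf_size  # TRUE
utf8_bytes(cjk) > buf_size      # TRUE
```

So 87 Chinese characters are already enough to exceed the buffer, while
a two-byte character needs more than 130 repetitions.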

Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)

stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()

Does this work? I don't have access to a UTF-8 Windows machine right
now.
Thanks, yes, I can reproduce the problem. Some Windows functions impose a limit of 260 wide characters, but others a limit of 260 bytes, so one can create a file with a name too long to be found by FindNextFileA.
In R 4.2, we used readdir() from mingw-w64, which itself used findnext. That function had the same problem, however: it used a buffer of 260 bytes, and judging from the mingw-w64 code and the Windows documentation, it should have behaved the same way and stopped the search on such a long file name. However, in my use case, R 4.2.3 crashed inside findnext due to a stack overrun. R 4.1.3 worked, but clearly it would require a different use case to overrun this buffer, as it didn't use UTF-8. This suggests that findnext didn't have a check for this and hence caused memory corruption, which can lead to a crash or work by coincidence. That could have been the case for the user reporting this as a regression compared to R 4.2. But it is not a regression; the problem has existed for a long time.
So, yes, we'd probably have to use the wide variants of FindNext/FindFirst. I'll fix.

Fixed in R-devel (84960). Please let me know if you see any problem with the fix.
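
One way to check the behaviour on a given build might look like the
sketch below. It is not part of the fix or of R's test suite.
check_long_name_listing is a hypothetical helper name; it works in a
temporary directory and returns NA where the check does not apply (not
Windows, not a UTF-8 locale, or the files could not be created).

```r
# Hypothetical helper: TRUE if list.files() returns every file in a
# directory that contains one name longer than 260 UTF-8 bytes,
# FALSE if some are missing, NA if the check does not apply here.
check_long_name_listing <- function() {
  if (.Platform$OS.type != "windows" || !isTRUE(l10n_info()$`UTF-8`)) {
    return(NA)  # the problem is specific to Windows with a UTF-8 locale
  }
  dir <- tempfile("longname")
  dir.create(dir)
  on.exit(unlink(dir, recursive = TRUE), add = TRUE)

  long_name <- strrep("\u00f8", 140)  # 140 characters, 280 bytes in UTF-8
  expected <- c(long_name, "after.txt", "before.txt")
  created <- file.create(file.path(dir, expected))
  if (!all(created)) return(NA)

  # The listing should contain the long name and both short ones.
  found <- list.files(dir)
  setequal(enc2utf8(found), enc2utf8(expected))
}

check_long_name_listing()
```

A result of FALSE would suggest that the listing stopped early or
skipped the long name.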


Thanks,
Tomas


Thanks for debugging this,
Tomas


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
