Re: [Rd] R-4.3 version list.files function could not work correctly in chinese

Tomas Kalibera Mon, 14 Aug 2023 23:38:34 -0700


On 8/13/23 13:16, Ivan Krylov wrote:

Found it! Looks like a buffer length problem. This isn't limited to
Chinese, just more likely to happen when a character takes three bytes
to represent in UTF-8. (Any filename containing characters which take
more than one byte to represent in UTF-8 may fail.)

If a directory contains a file with a sufficiently long name,
FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
R_readdir() return NULL, stopping list_files() prematurely:

# everything seems to work fine...

list.files("测试文件")
# [1] "测试中文-non-utf8-ЪЪЪЪЪ
测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文.txt"
# [2] "测试中文-non-utf8-ЪЪЪЪЪ.txt"
# [3] "测试中文-utf-8.txt"

# now create a file with an even longer name

list.files("测试文件")
# [1] "测试中文-non-utf8-ЪЪЪЪЪ
测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文.txt"

# the files are still there, but not visible to list.files():

Thanks, Ivan, could you please turn this into a complete minimal
reproducible example, ideally with only ASCII characters (if enough to
trigger)? Or any reproducible example would do. I would have a look
later today.


system("cmd /c dir /s *.txt")
#  Volume in drive C has no label.
#  Volume Serial Number is A85A-AA74
#
#  Directory of C:\R\R-4.3.1\bin\x64\????
#
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????
????????????????????????????????????????????????????.txt
# 08/12/2023 07:57 AM                22 ????-non-utf8-?????
????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
# 08/12/2023  07:56 AM                18 ????-utf-8.txt
# 4 File(s)             84 bytes
#
#       Total Files Listed:
#                4 File(s)             84 bytes
#                0 Dir(s)  29,281,538,048 bytes free
# [1] 0

Increasing the path length limits [*] doesn't help, since it's the
filename length limit that we're bumping against. While both
WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
valid filename may take more than MAX_PATH bytes to represent in UTF-8
while still being under the limit of MAX_PATH wide characters. This may
mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
for Windows. As a workaround, we may use the short filename (which
sometimes may not exist, alas) when FindNextFile() fails with
ERROR_MORE_DATA.
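
The size mismatch described above can be checked with plain arithmetic
in R. This is only a sketch: the file name below is hypothetical (it is
not one of the files from the report), nothing is written to disk, and
260 is the value of MAX_PATH from the Windows headers. It shows a name
that is short when counted in characters but long when counted in UTF-8
bytes. Whether this is what actually makes FindNextFile() fail inside R
is the hypothesis under discussion, not something the snippet
demonstrates.

```r
# Hypothetical file name: 100 CJK characters plus ".txt".
name <- enc2utf8(paste0(strrep("\u6d4b", 100), ".txt"))

max_path <- 260L  # MAX_PATH: size of cFileName in WIN32_FIND_DATAA/W

n_chars <- nchar(name, type = "chars")  # number of characters
n_bytes <- nchar(name, type = "bytes")  # bytes once encoded as UTF-8

# Every character here is in the Basic Multilingual Plane, so it takes
# one UTF-16 code unit; the wide-character length equals n_chars.
# The buffer also has to hold the terminating NUL, hence "<".
fits_wide <- n_chars < max_path
fits_utf8 <- n_bytes < max_path

cat("characters:", n_chars, " fits in wide buffer:", fits_wide, "\n")
cat("UTF-8 bytes:", n_bytes, " fits in byte buffer:", fits_utf8, "\n")
```

Each of these characters takes three bytes in UTF-8, so the 104-character
name needs 304 bytes: it fits in a MAX_PATH-sized wide-character buffer
but not in a MAX_PATH-sized byte buffer.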

I admit I didn't get your analysis. However, I've rewritten this code
for R 4.3 to support long paths (when enabled in the system), more in
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html.
As this was reported to be a regression in 4.3, it is entirely possible
this change came with a regression (though it is a bit surprising we
didn't catch it earlier by testing), so it would be a great help if I
could have the example and debug it.


Thanks,
Tomas

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
