On Mon, Dec 4, 2017 at 5:00 PM, Gregory Szorc <g...@mozilla.com> wrote:
> On Wed, Nov 29, 2017 at 5:09 PM, Karl Tomlinson <mozn...@karlt.net> wrote:
>
>> I've always found this confusing, and so I'll write down the
>> understanding I've reached, in the hope that either it will help
>> others, or others can help me by correcting if these are
>> misunderstandings.
>>
>> On Unix systems:
>>
>>   `nativePath`
>>
>>     contains the bytes corresponding to the native filename used
>>     by native system calls.
>>
>>   `path`
>>
>>     is a UTF-16 encoding of an attempt to provide a human
>>     readable version of the native filename.  This involves
>>     interpreting native bytes according to the character encoding
>>     specified by the current locale of the application as
>>     indicated by nl_langinfo(CODESET).
>>
>>     For different locales, the same file can have a different
>>     `path`.
>>
>>     The native bytes may not be valid UTF-8, and so if the
>>     character encoding is UTF-8, then there may not be a valid
>>     `path` that can be encoded to produce the same `nativePath`.
>>
>>   It is best to use `nativePath` for working with filenames,
>>   including conversion to URI, but use `path` when displaying
>>   names in the UI.
>>
>> On WINNT systems:
>>
>>   `path`
>>
>>     contains wide characters corresponding to the native filename
>>     used by native wide character system APIs.  For at least most
>>     configurations, I assume wide characters are UTF-16, in which
>>     case this is also human readable.
>>
>>   `nativePath`
>>
>>     is an attempt to represent the native filename in the native
>>     multibyte character encoding specified by the current locale
>>     of the application.
>>
>>     For different locales, I assume the same file can have a
>>     different `nativePath`.
>>
>>     I assume there is not necessarily a valid multibyte character
>>     encoding, and so there may not be a valid `nativePath` that
>>     can be decoded to produce the same `path`.
>>
>>   It is best to use `path` for working with filenames.
>>   Conversion to URI involves assuming `path` is UTF-16 and
>>   converting to UTF-8.
>>
>> The parameters mean very different things on different systems,
>> and so it is not generally possible to write XP code with either
>> of these, but Gecko attempts to do so anyway.
>>
>> The numbers of applications not using UTF-8 and filenames not
>> valid UTF-8 are much smaller on Unix systems than the numbers of
>> applications not using UTF-8 and non-ASCII filenames on WINNT
>> systems, and so choosing to work with `path` provides more
>> compatibility than working with `nativePath`.
>>
>
> My understanding from contributing to Mercurial (version control tools are
> essentially virtual filesystems that need to store paths and realize them
> on multiple operating systems and filesystems) is as follows.
>
> On Linux/POSIX, a path is any sequence of bytes not containing a NUL
> byte. As such, a path can be modeled by char*. Any tool that needs to
> preserve paths from and to I/O functions should store them internally as
> NUL-terminated byte strings. For display, etc., you can attempt to decode
> those bytes using an encoding. But there's no guarantee the byte sequences
> from the filesystem/OS will be e.g. proper UTF-8. And if you normalize all
> paths to e.g. Unicode internally, this can lead to round tripping errors.
>
> On Windows, there are multiple APIs for I/O. Under the hood, NTFS is
> purportedly using UTF-16 to store filenames, although I can't recall if it
> is actual UTF-16 or just byte pairs. This means you should be using the
> *W() functions for all I/O, e.g. CreateFileW(). (Note: if you use
> CreateFileW(), you can also use the "\\?\" prefix on the filename to avoid
> MAX_PATH (260 character) limitations.) If you want to preserve filenames on
> Windows, you should be using these functions. If you use the *A() functions
> (e.g. CreateFileA()) or use the POSIX libc compatibility layer (e.g.
> open()), it is very difficult to preserve exact byte sequences. Further
> clouding matters is that values from environment variables, command line
> arguments, etc. may be in unexpected/different encodings. I can't recall
> specifics here. But there are definitely cases where the bytes being
> exposed may not match exactly what is stored in NTFS.
>
> In addition to that, there are various normalizations that the operating
> system or filesystem may apply to filenames. For example, Windows has
> various reserved filenames that the Windows API will disallow (but NTFS can
> store if you use the NTFS APIs directly) and macOS or its filesystem will
> silently munge certain Unicode code points (this is a fun one because of
> security implications).
>
> In all cases, if a filename originates from something other than the
> filesystem, it is probably a good idea to normalize it to Unicode
> internally and then spit out UTF-8 (or whatever the most-native API
> expects).
>
> Many programming languages paper over these subtle differences, leading to
> badness. For example, the preferred path APIs for Python and Rust assume
> paths are Unicode (they have their own logic to perform encoding/decoding
> behind the scenes and depending on settings, run-time failures or data loss
> can occur). In both cases, there are OS-specific/native path primitives
> that give you access to the raw bytes. If you want to be resilient around
> preserving data and not munging byte sequences, these primitives should be
> used. But it can be difficult because tons of software normalizes paths to
> Unicode and having to deal with a platform-specific data type in all
> consumers can be very annoying.
>
> I hope this info is useful!
>

Quick follow-up: reading https://simonsapin.github.io/wtf-8/ would be a good
deep dive.

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
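
To make the round-tripping point for POSIX-style byte paths concrete, here
is a minimal Python sketch. It is not taken from Mercurial or Gecko:
`RAW_NAME` and the helper names are made up for illustration, and it assumes
names are meant to be UTF-8. It uses Python's `surrogateescape` error
handler (PEP 383) as one possible way to keep undecodable bytes intact, in
contrast with a lossy decode that is only suitable for display.

```python
# Illustrative sketch; RAW_NAME and all helper names are hypothetical.
# Latin-1 bytes for "café.txt". The 0xE9 byte makes this invalid UTF-8.
RAW_NAME = b"caf\xe9.txt"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_display(raw: bytes) -> str:
    """Lossy decode for UI: each undecodable byte becomes U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def to_internal(raw: bytes) -> str:
    """Lossless decode: undecodable bytes map to lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_native(text: str) -> bytes:
    """Inverse of to_internal(): recovers the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    print(is_valid_utf8(RAW_NAME))
    # The display form cannot be turned back into the original bytes.
    print(to_display(RAW_NAME).encode("utf-8") == RAW_NAME)
    # The surrogateescape form can.
    print(to_native(to_internal(RAW_NAME)) == RAW_NAME)
```

On POSIX, `os.fsencode()` and `os.fsdecode()` apply the same error handler
using the filesystem encoding, so they are probably the better choice in
real code. The explicit "utf-8" above is only there to keep the sketch
independent of the current locale.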