On Mon, Dec 4, 2017 at 5:00 PM, Gregory Szorc <g...@mozilla.com> wrote:

> On Wed, Nov 29, 2017 at 5:09 PM, Karl Tomlinson <mozn...@karlt.net> wrote:
>
>> I've always found this confusing, and so I'll write down the
>> understanding I've reached, in the hope that either it will help
>> others, or others can help me by correcting if these are
>> misunderstandings.
>>
>> On Unix systems:
>>
>>   `nativePath`
>>
>>      contains the bytes corresponding to the native filename used
>>      by native system calls.
>>
>>   `path`
>>
>>      is a UTF-16 encoding of an attempt to provide a human
>>      readable version of the native filename.  This involves
>>      interpreting native bytes according to the character encoding
>>      specified by the current locale of the application as
>>      indicated by nl_langinfo(CODESET).
>>
>>      For different locales, the same file can have a different
>>      `path`.
>>
>>      The native bytes may not be valid UTF-8, and so if the
>>      character encoding is UTF-8, then there may not be a valid
>>      `path` that can be encoded to produce the same `nativePath`.
>>
>>   It is best to use `nativePath` for working with filenames,
>>   including conversion to URI, but use `path` when displaying
>>   names in the UI.
>>
>> On WINNT systems:
>>
>>   `path`
>>
>>      contains wide characters corresponding to the native filename
>>      used by native wide character system APIs.  For at least most
>>      configurations, I assume wide characters are UTF-16, in which
>>      case this is also human readable.
>>
>>   `nativePath`
>>
>>      is an attempt to represent the native filename in the native
>>      multibyte character encoding specified by the current locale
>>      of the application.
>>
>>      For different locales, I assume the same file can have a
>>      different `nativePath`.
>>
>>      I assume there is not necessarily a valid multibyte character
>>      encoding, and so there may not be a valid `nativePath` that
>>      can be decoded to produce the same `path`.
>>
>>   It is best to use `path` for working with filenames.
>>   Conversion to URI involves assuming `path` is UTF-16 and
>>   converting to UTF-8.
>>
>> The parameters mean very different things on different systems,
>> and so it is not generally possible to write XP code with either
>> of these, but Gecko attempts to do so anyway.
>>
>> The numbers of applications not using UTF-8 and filenames not
>> valid UTF-8 are much smaller on Unix systems than the numbers of
>> applications not using UTF-8 and non-ASCII filenames on WINNT
>> systems, and so choosing to work with `path` provides more
>> compatibility than working with `nativePath`.
>>
>
> My understanding from contributing to Mercurial (version control tools are
> essentially virtual filesystems that need to store paths and realize them
> on multiple operating systems and filesystems) is as follows.
>
> On Linux/POSIX, a path is any sequence of bytes not containing a NUL
> byte. As such, a path can be modeled by char*. Any tool that needs to
> preserve paths from and to I/O functions should be using NUL-terminated
> byte strings internally. For display, etc., you can attempt to decode those
> bytes using an encoding. But there's no guarantee the byte sequences from
> the filesystem/OS will be e.g. proper UTF-8. And if you normalize all paths
> to e.g. Unicode internally, this can lead to round-tripping errors.
>
> On Windows, there are multiple APIs for I/O. Under the hood, NTFS is
> purportedly using UTF-16 to store filenames, although I can't recall if it
> is actual UTF-16 or just byte pairs. This means you should be using the
> *W() functions for all I/O, e.g. CreateFileW(). (Note: if you use
> CreateFileW(), you can also use the "\\?\" prefix on the filename to avoid
> MAX_PATH (260 character) limitations.) If you want to preserve filenames on
> Windows, you should be using these functions. If you use the *A() functions
> (e.g. CreateFileA()) or use the POSIX libc compatibility layer (e.g.
> open()), it is very difficult to preserve exact byte sequences. Further
> clouding matters is that values from environment variables, command line
> arguments, etc may be in unexpected/different encodings. I can't recall
> specifics here. But there are definitely cases where the bytes being
> exposed may not match exactly what is stored in NTFS.
>
> In addition to that, there are various normalizations that the operating
> system or filesystem may apply to filenames. For example, Windows has
> various reserved filenames that the Windows API will disallow (but NTFS can
> store if you use the NTFS APIs directly) and macOS or its filesystem will
> silently munge certain Unicode code points (this is a fun one because of
> security implications).
>
> In all cases, if a filename originates from something other than the
> filesystem, it is probably a good idea to normalize it to Unicode
> internally and then spit out UTF-8 (or whatever the most-native API
> expects).
>
> Many programming languages paper over these subtle differences, leading to
> badness. For example, the preferred path APIs for Python and Rust assume
> paths are Unicode (they have their own logic to perform encoding/decoding
> behind the scenes and depending on settings, run-time failures or data loss
> can occur). In both cases, there are OS-specific/native path primitives
> that give you access to the raw bytes. If you want to be resilient around
> preserving data and not munging byte sequences, these primitives should be
> used. But it can be difficult because tons of software normalizes paths to
> Unicode and having to deal with a platform-specific data type in all
> consumers can be very annoying.
>
> I hope this info is useful!
>

Quick follow-up: reading https://simonsapin.github.io/wtf-8/ would be a
good deep dive.
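
To make the round-tripping concern in the quoted text concrete, here is a
minimal Python sketch. It is only an illustration and was not posted in the
thread; the filename and all variable names are made up. It assumes a
POSIX-style filename containing one byte that is not valid UTF-8, and
compares a lossy decode with Python's `surrogateescape` error handler,
which is one way to carry undecodable bytes through a text string:

```python
# Hypothetical POSIX filename: on Linux a name is just bytes, and the
# byte 0xFF can never appear in well-formed UTF-8.
raw_name = b"report-\xff.txt"


def decodes_as_utf8(name: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Lossy form, suitable only for display: the bad byte becomes U+FFFD,
# so encoding again does not give back the original bytes.
display_name = raw_name.decode("utf-8", errors="replace")
lossy_round_trip = display_name.encode("utf-8")

# Lossless form: surrogateescape maps each undecodable byte to a lone
# surrogate (0xFF -> U+DCFF), which encodes back to the same byte.
escaped_name = raw_name.decode("utf-8", errors="surrogateescape")
lossless_round_trip = escaped_name.encode("utf-8", errors="surrogateescape")

print("valid UTF-8:        ", decodes_as_utf8(raw_name))
print("lossy round trip:   ", lossy_round_trip == raw_name)
print("lossless round trip:", lossless_round_trip == raw_name)
```

Note that `escaped_name` contains a lone surrogate, so it is not valid
Unicode text and cannot be encoded as strict UTF-8. Keeping the original
bytes (or an escape scheme like this one) as the internal form, and
decoding only for display, avoids the data loss shown by the lossy path.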
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
