On Wed, Nov 29, 2017 at 5:09 PM, Karl Tomlinson <mozn...@karlt.net> wrote:

> I've always found this confusing, and so I'll write down the
> understanding I've reached, in the hope that either it will help
> others, or others can help me by correcting if these are
> misunderstandings.
>
> On Unix systems:
>
>   `nativePath`
>
>      contains the bytes corresponding to the native filename used
>      by native system calls.
>
>   `path`
>
>      is a UTF-16 encoding of an attempt to provide a human
>      readable version of the native filename.  This involves
>      interpreting native bytes according to the character encoding
>      specified by the current locale of the application as
>      indicated by nl_langinfo(CODESET).
>
>      For different locales, the same file can have a different
>      `path`.
>
>      The native bytes may not be valid UTF-8, and so if the
>      character encoding is UTF-8, then there may not be a valid
>      `path` that can be encoded to produce the same `nativePath`.
>
>   It is best to use `nativePath` for working with filenames,
>   including conversion to URI, but use `path` when displaying
>   names in the UI.
>
> On WINNT systems:
>
>   `path`
>
>      contains wide characters corresponding to the native filename
>      used by native wide character system APIs.  For at least most
>      configurations, I assume wide characters are UTF-16, in which
>      case this is also human readable.
>
>   `nativePath`
>
>      is an attempt to represent the native filename in the native
>      multibyte character encoding specified by the current locale
>      of the application.
>
>      For different locales, I assume the same file can have a
>      different `nativePath`.
>
>      I assume there is not necessarily a valid multibyte character
>      encoding, and so there may not be a valid `nativePath` that
>      can be decoded to produce the same `path`.
>
>   It is best to use `path` for working with filenames.
>   Conversion to URI involves assuming `path` is UTF-16 and
>   converting to UTF-8.
>
> The parameters mean very different things on different systems,
> and so it is not generally possible to write XP code with either
> of these, but Gecko attempts to do so anyway.
>
> The numbers of applications not using UTF-8 and filenames not
> valid UTF-8 are much smaller on Unix systems than the numbers of
> applications not using UTF-8 and non-ASCII filenames on WINNT
> systems, and so choosing to work with `path` provides more
> compatibility than working with `nativePath`.
>

My understanding from contributing to Mercurial (version control tools are
essentially virtual filesystems that need to store paths and realize them
on multiple operating systems and filesystems) is as follows.

On Linux/POSIX, a path is any sequence of bytes not containing a NUL byte.
As such, a path can be modeled by char*. Any tool that needs to preserve
paths from and to I/O functions should be using NUL-terminated byte strings
internally. For display, etc., you can attempt to decode those bytes using
an encoding. But there's no guarantee the byte sequences from the
filesystem/OS will be e.g. proper UTF-8. And if you normalize all paths to
e.g. Unicode internally, this can lead to round-tripping errors.
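
To make the round-tripping problem concrete, here is a minimal Python sketch. The file name and the two helper functions are made up for illustration; the only assumption is that the name on disk was written as Latin-1, so it is not valid UTF-8.

```python
# A hypothetical file name as raw bytes: "café.txt" encoded as Latin-1.
# The byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9.txt"


def lossy_display(name: bytes) -> str:
    """Decode for display only; undecodable bytes become U+FFFD."""
    return name.decode("utf-8", errors="replace")


def roundtrip_safe(name: bytes) -> str:
    """Decode so that undecodable bytes survive as lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


shown = lossy_display(raw)
kept = roundtrip_safe(raw)

# Re-encoding the display form does NOT give back the original bytes.
lost = shown.encode("utf-8") != raw

# Re-encoding the surrogateescape form does.
restored = kept.encode("utf-8", errors="surrogateescape")
```

The display string is fine for a UI, but only the original bytes (or a reversible form such as the surrogateescape one) are safe to hand back to an I/O call.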

On Windows, there are multiple APIs for I/O. Under the hood, NTFS
purportedly stores filenames as UTF-16, although I can't recall whether it
is validated UTF-16 or just arbitrary 16-bit units. Either way, if you want
to preserve filenames on Windows, you should be using the *W() functions
for all I/O, e.g. CreateFileW(). (Note: if you use CreateFileW(), you can
also use the "\\?\" prefix on the filename to avoid MAX_PATH (260
character) limitations.) If you use the *A() functions (e.g. CreateFileA())
or the POSIX libc compatibility layer (e.g. open()), it is very difficult
to preserve exact byte sequences. Further clouding matters, values from
environment variables, command line arguments, etc. may be in
unexpected/different encodings. I can't recall specifics here. But there
are definitely cases where the bytes being exposed may not match exactly
what is stored in NTFS.
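
As an illustration of the "\\?\" note, here is a rough Python sketch of adding that prefix to a path. The helper name `extended_length` is hypothetical. It uses only string handling from the standard `ntpath` module, so it does not touch the filesystem, and it assumes the input is an absolute drive or UNC path.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"  # the four characters \\?\


def extended_length(path: str) -> str:
    """Return `path` with the extended-length prefix (hypothetical helper).

    Prefixed paths skip the usual Win32 normalization, so the path is
    normalized here first (forward slashes, "." and ".." components).
    """
    if path.startswith(EXTENDED_PREFIX):
        return path  # already prefixed, leave untouched
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute")
    normalized = ntpath.normpath(path)
    if normalized.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_PREFIX + "UNC\\" + normalized[2:]
    return EXTENDED_PREFIX + normalized


local = extended_length("C:/Users/me/../data/file.txt")
share = extended_length(r"\\server\share\file.txt")
```

The result is only meaningful when passed to the wide-character APIs; in Python that means passing it as a `str` rather than as bytes.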

In addition to that, there are various normalizations that the operating
system or filesystem may apply to filenames. For example, Windows has
various reserved filenames that the Windows API will disallow (but that
NTFS can store if you use the NTFS APIs directly), and macOS or its
filesystem will silently munge certain Unicode code points (this is a fun
one because of security implications).
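
Both kinds of normalization can be sketched in a few lines of Python. This is an illustration, not a complete check: the reserved-name list is the commonly documented set of DOS device names, and NFD is used as a stand-in for what the macOS filesystem does (my understanding is that HFS+ uses a variant of it). The helper names are hypothetical.

```python
import unicodedata

# Two spellings of the same visible name: precomposed vs. decomposed "é".
composed = "caf\u00e9.txt"
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301


def same_name(a: str, b: str) -> bool:
    """Compare names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Commonly documented reserved device names on Windows.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_reserved_windows_name(name: str) -> bool:
    """True if the part before the first dot is a reserved device name."""
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED
```

A tool that compares names byte-for-byte will treat `composed` and `decomposed` as different files even though a filesystem may hand back one when given the other, which is where the security implications come from.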

In all cases, if a filename originates from something other than the
filesystem, it is probably a good idea to normalize it to Unicode
internally and then spit out UTF-8 (or whatever the most-native API
expects).
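
A small sketch of that advice in Python, assuming the external source (a network protocol, an archive header, etc.) declares its encoding. Both function names are hypothetical.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode an externally supplied name; fail loudly on bad input."""
    return raw.decode(declared_encoding, errors="strict")


def to_posix_bytes(name: str) -> bytes:
    """Encode a Unicode name as UTF-8 for a byte-oriented (POSIX) API."""
    return name.encode("utf-8", errors="strict")


# Example: a name that arrived as Latin-1 from somewhere other than the
# local filesystem.
incoming = "r\u00e9sum\u00e9.txt".encode("latin-1")
name = to_unicode(incoming, "latin-1")
on_disk = to_posix_bytes(name)
```

On Windows the Unicode string itself would be passed to the *W() functions instead of UTF-8 bytes.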

Many programming languages paper over these subtle differences, leading to
badness. For example, the preferred path APIs for Python and Rust assume
paths are Unicode (they have their own logic to perform encoding/decoding
behind the scenes and, depending on settings, run-time failures or data
loss can occur). In both cases, there are OS-specific/native path
primitives that give you access to the raw bytes. If you want to be
resilient around preserving data and not munging byte sequences, these
primitives should be used. But it can be difficult because tons of software
normalizes paths to Unicode, and having to deal with a platform-specific
data type in all consumers can be very annoying.
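
In Python, for instance, the `os` functions return bytes when given a bytes path, and `os.fsencode()` / `os.fsdecode()` convert between the two forms using the interpreter's filesystem encoding. A minimal sketch follows; the helper name `list_raw` is made up, and the temporary directory is there only so the example is self-contained.

```python
import os
import tempfile


def list_raw(directory: str) -> list[bytes]:
    """List directory entries as bytes instead of decoded strings."""
    return os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so there is something to list.
    open(os.path.join(tmp, "a.txt"), "w").close()
    names = list_raw(tmp)
    # Converting to str for display and back again is reversible.
    roundtrip = [os.fsencode(os.fsdecode(n)) for n in names]
```

The annoying part is exactly what is described above: every consumer downstream now has to cope with `bytes` names rather than `str`.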

I hope this info is useful!
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
