Joel Rees writes:
> 2014/12/03 22:23 "Dmitrij D. Czarkoff" <[email protected]>:
> >
> > First of all, I really don't believe that preservation of non-canonical
> > form should be a consideration for any software.
>
> There is no particular canonical form for some kinds of software.
>
> Unix, in particular, happens to have file name limitations that are
> compatible with all versions of Unicode past 2.0, at least, in UTF-8, but
> it has no native encoding.

To me, the current state of affairs--where filenames can contain anything
and the same filename can and does get interpreted differently by
different programs--feels extremely dangerous. Moving to a single,
well-defined encoding for filenames would make things simpler and safer.

Well, it might. That's why we're discussing this carefully, to figure out
whether something like this is actually workable.

There are two kinds of features being discussed:

1) Unicode normalization. This is analogous to case insensitivity:
multiple filenames map to the same (normalized) filename.

2) Disallowing particular characters. Bytes 1-31 and invalid UTF-8
sequences are popular examples.

Maybe one is workable. Maybe both are, or neither.

Say I have a hypothetical machine with the above two features (normalizing
to NFC, disallowing 1-31/invalid UTF-8). Now I log into a typical Unix
"anything but \0 or /" machine, via SFTP or whatever. What are the
failure modes?

The first kind is that I could type "get x" followed by "get y", where x
and y are canonically equivalent in Unicode but represented by different
byte sequences because they're not normalized on the remote host. I would
expect this to work smoothly: first I download x to NFC(x), and then y
overwrites it.
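To make that concrete, here's a minimal sketch, assuming the third-party
utf8proc library from ports (nothing in base does this), with NULL checks
omitted; "café" is my invented example of such a pair:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <utf8proc.h>

int
main(void)
{
    /*
     * Two byte sequences for the same abstract name "café":
     * x uses precomposed U+00E9, y uses "e" + combining U+0301.
     * A byte-oriented filesystem treats these as distinct names.
     */
    const utf8proc_uint8_t x[] = "caf\xc3\xa9";
    const utf8proc_uint8_t y[] = "cafe\xcc\x81";
    utf8proc_uint8_t *nx = utf8proc_NFC(x);
    utf8proc_uint8_t *ny = utf8proc_NFC(y);

    printf("raw bytes equal: %s\n",
        strcmp((const char *)x, (const char *)y) == 0 ? "yes" : "no");
    printf("NFC forms equal: %s\n",
        strcmp((const char *)nx, (const char *)ny) == 0 ? "yes" : "no");
    free(nx);
    free(ny);
    return 0;
}

The raw comparison says "no" and the NFC comparison says "yes", which is
exactly why the second download lands on top of the first.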
The second kind is that I could type "get z", where z contains an invalid
character. How should my system handle this? Error as if I had asked for
a filename that's too long? Come up with a new errno? I don't know, but
on this hypothetical machine it should fail somehow.
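The check itself needn't be complicated. Here's a sketch of that policy
(simplified: a strict validator would also reject overlong forms and
UTF-16 surrogates, which this one lets through):

#include <stdio.h>

/*
 * Hypothetical policy: reject bytes 1-31 and anything that isn't
 * well-formed UTF-8.  Returns 1 if the name is acceptable, 0 if
 * not.  The kernel would still have to pick an errno for the
 * failure--EILSEQ, perhaps.
 */
static int
filename_ok(const unsigned char *s)
{
    while (*s != '\0') {
        unsigned char c = *s++;
        int cont;

        if (c < 0x20)                    /* control bytes 1-31 */
            return 0;
        if (c < 0x80)                    /* plain ASCII */
            continue;
        if (c >= 0xc2 && c <= 0xdf)      /* 2-byte sequence */
            cont = 1;
        else if (c >= 0xe0 && c <= 0xef) /* 3-byte sequence */
            cont = 2;
        else if (c >= 0xf0 && c <= 0xf4) /* 4-byte sequence */
            cont = 3;
        else                             /* 0x80-0xc1, 0xf5-0xff */
            return 0;
        while (cont-- > 0) {
            if ((*s & 0xc0) != 0x80)     /* missing continuation */
                return 0;
            s++;
        }
    }
    return 1;
}

int
main(void)
{
    printf("%d\n", filename_ok((const unsigned char *)"caf\xc3\xa9")); /* 1 */
    printf("%d\n", filename_ok((const unsigned char *)"bad\xff"));     /* 0 */
    printf("%d\n", filename_ok((const unsigned char *)"tab\tname"));   /* 0 */
    return 0;
}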
But creating new files is only part of the problem. If we still allow
such names in existing files, we lose all the security/robustness
benefits and just annoy ourselves by adding pointless restrictions. So
say I mount a filesystem containing the same files x, y, and z. What
happens?

- Fail to mount? (Simultaneously simplest, safest, and least useful)
- Hide the files? (Seems potentially unsafe)
- Try to escape the filenames? (Seems crazy)

Is it currently possible to take a hex editor and add "/" to a filename
(as opposed to a pathname) inside a disk image? If that's possible, how
do systems currently deal with it? Because it's the same problem.
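Whatever a filesystem decides to do about a "/" smuggled in with a hex
editor, it first has to notice it. Hypothetically (this is invented, not
taken from any real kernel), the directory-reading path would need
something like:

#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical check on a directory-entry name read straight off
 * the disk.  The syscall interface never lets '/' or NUL into a
 * name, but nothing stops a hex editor from putting them in the
 * image, so a paranoid filesystem would have to look.
 */
static int
ondisk_name_ok(const char *name, size_t len)
{
    size_t i;

    if (len == 0)
        return 0;
    for (i = 0; i < len; i++)
        if (name[i] == '/' || name[i] == '\0')
            return 0;
    return 1;
}

int
main(void)
{
    printf("%d\n", ondisk_name_ok("normal", 6));    /* 1 */
    printf("%d\n", ondisk_name_ok("evil/name", 9)); /* 0 */
    return 0;
}

And once that check fails, we're back to the same three options: refuse,
hide, or escape.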
FAT32 has both case insensitivity and disallowed characters. How well
does OpenBSD handle those restrictions? If not optimally, then how can
the handling be improved? If it already handles them with aplomb, then
is the same approach applicable to the scenarios above?

--
Anthony J. Bentley