On 27Apr2021 10:17, Kevin J. McCarthy <[email protected]> wrote:
>Ticket 351 on gitlab (https://gitlab.com/muttmua/mutt/-/issues/351)
>noted that an attachment 中文名称.txt, when launched via a mailcap
>viewer, created a tempfile "____________.txt".
Ouch.
>This is because of the sanitize_filename() functions, which have an
>allow-list of
>"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%/"
>(with the '/' disabled for filenames).
>
>I'd be reluctant to change sanitization for the %{<parameter>} or %t
>expandos, but this does seem to be a bit strict for the filename.
>Oswald notes in the ticket that 8-bit characters are harmless at the
>system level (Oswald, feel free to reply/clarify - I'm not trying to
>put words in your mouth).
First remark:
I think we should make clear that this only makes sense when you're
encoding filenames as UTF-8, where every byte of a multibyte sequence
has the high bit set. This isn't necessarily the case with other
encodings.
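To make that concrete, here is a small Python sketch of my own (not Mutt
code; the helper name and the sample strings are just for illustration).
It lists any bytes below 0x80 that turn up inside the encoding of a
non-ASCII character. UTF-8 never produces one, while Shift-JIS can use
0x5C (backslash) as the second byte of a character:

```python
def low_bytes_in_non_ascii(text, encoding):
    """Return the bytes below 0x80 found inside encoded non-ASCII characters."""
    found = []
    for ch in text:
        if ord(ch) < 0x80:
            continue  # plain ASCII characters are not of interest here
        found.extend(b for b in ch.encode(encoding) if b < 0x80)
    return found


# UTF-8: every byte of every multibyte sequence has the high bit set.
print(low_bytes_in_non_ascii("中文名称.txt", "utf-8"))

# Shift-JIS: this character encodes as 0x95 0x5C, and 0x5C is "\".
print(low_bytes_in_non_ascii("表", "shift_jis"))
```

So a rule of "pass anything with the 8th bit set" covers whole characters
in UTF-8, but only parts of some characters in an encoding like Shift-JIS.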
Second remark:
As one who has long been less than enthused by sanitising filenames,
what exactly are we trying to accomplish when we sanitise a filename?
- avoid trickiness like whitespace and quote characters, which cause a
little pain for users of the files in scripting settings?
- avoid $ and ` et al, which cause hazards for the very careless
script author? (but only if injected blindly)
- avoid other shell punctuation like redirections? same issue
- avoid escape paths such as absolute paths (/etc/passwd, oh root-run
mutt user?) or ../blah to get out of the scratch area?
Without qualifying these objectives, "sanitisation" means little (or too
much, depending where you stand).
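As a rough illustration of keeping those objectives separate, here is a
Python sketch (mine, purely hypothetical; the function name, the category
labels and the exact character sets are assumptions, not anything Mutt
does) which reports which of the concerns above a given name would
trigger:

```python
def hazards(name):
    """Return the set of concerns (labels are made up) that a name triggers."""
    found = set()
    if any(c in " \t\n'\"" for c in name):
        found.add("whitespace-or-quotes")
    if any(c in "$`" for c in name):
        found.add("shell-expansion")
    if any(c in "<>|&;" for c in name):
        found.add("shell-punctuation")
    # An absolute path, or any ".." component, can leave the scratch area.
    if name.startswith("/") or ".." in name.split("/"):
        found.add("path-escape")
    return found


for candidate in ("中文名称.txt", "my notes.txt", "../blah", "/etc/passwd"):
    print(candidate, sorted(hazards(candidate)))
```

Note that a name made of high-bit bytes triggers none of these, which is
the point: which objective are we actually serving by mangling it?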
>On the one hand these are temp files, but Mutt already tries to
>preserve the filename to make for a nicer user interaction. It seems
>if we can preserve unicode filenames better we ought to do that too.
"Unicode filenames" isn't a meaningful term in UNIX, as the API is C
strings - byte sequences with NUL terminators. I suspect you mean "UTF-8
encoded names", which is the common modern default.
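Python's os module shows the bytes-versus-characters split fairly
plainly. A small sketch, assuming a POSIX system (the sample name is made
up, and no file is actually created): a name that is not valid UTF-8
still survives a trip through str and back unchanged, because the
filesystem layer only cares about the bytes.

```python
import os

# Latin-1 bytes for "café.txt"; the lone 0xE9 is not valid UTF-8.
raw = b"caf\xe9.txt"

# os.fsdecode() maps the bytes to a str using the filesystem encoding and
# its error handler; os.fsencode() reverses the mapping.
as_str = os.fsdecode(raw)
round_tripped = os.fsencode(as_str)

print(repr(as_str))
print(round_tripped == raw)
```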
>What if we added an allow_8bit parameter to the function, that also
>passed through bytes with the 8th bit set? I'd keep this set off in
>all other invocations except the mailcap invocations.
Of course, the trickiness is that header things like filenames are,
IIRC, "bytes". Without a charset, do we inherently know anything about
them _as characters_?
I'm +1 for allow_8bit if we make it clear in the docs (and implement
it correctly in the code) that this refers to the in-filesystem byte
encoding of the filename. _Not_ hypothetical "Unicode". One person's
"Unicode" is another's Shift-JIS :-)
https://en.wikipedia.org/wiki/Mojibake
One has but to shift one's shell locale to see this play out.
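For what it's worth, here is how I read the allow_8bit proposal, as a
Python sketch working purely on bytes. This is my own illustration, not
Mutt's C implementation: the allow-list is the one quoted above with '/'
left out, and the '_' replacement is taken from the tempfile name in the
ticket.

```python
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+@{}._-:%"
)


def sanitize_filename(raw, allow_8bit=False):
    """Replace each byte outside the allow-list with '_'.

    With allow_8bit set, bytes >= 0x80 are passed through untouched.
    No charset is consulted: this operates on bytes, not characters.
    """
    return bytes(
        b if b in SAFE or (allow_8bit and b >= 0x80) else ord("_")
        for b in raw
    )


raw = "中文名称.txt".encode("utf-8")
print(sanitize_filename(raw))                   # every high-bit byte replaced
print(sanitize_filename(raw, allow_8bit=True))  # bytes preserved as-is

# The same bytes read under a different charset give a different "name".
print(raw.decode("utf-8"))
print(raw.decode("latin-1"))
```

The last two lines are the mojibake effect in miniature: the function
preserves the bytes, and what the user sees depends entirely on which
charset their locale applies to them.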
Cheers,
Cameron Simpson <[email protected]>