Re: International characters and serving files

2024-02-10 Thread Maxim Dounin
Hello!

On Sat, Feb 10, 2024 at 03:14:02PM +1000, David Connors wrote:

> Hi All,
> 
> I moved off IIS/Windows onto nginx on Ubuntu a while back. Since doing
> so I receive 404s for files with international characters in their names.
> I've added the charset utf-8 directive to the nginx config. Looking at the
> request:
> 
> https://www.davidconnors.com/wp-content/uploads/2022/08/Aliinale-Für-Alina.pdf
> 
> I have confirmed that the file exists on the filesystem:
> 
> -rwx--  1 www-data www-data 10560787 Aug 21  2022 Aliinale-Für-Alina.pdf
> 
> If I copy the file to a.pdf and request that instead, it is served fine.
> 
> Access log shows the character with the diacritic mark is escaped:
> 172.68.210.38 - - [10/Feb/2024:05:11:27 +] "GET
> /wp-content/uploads/2022/08/Aliinale-F%C3%BCr-Alina.pdf HTTP/1.1" 404 27524
> "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15
> (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15"
> 
> What configuration directive am I missing?

File names on Unix systems are typically stored as bytes, and it 
is the user's responsibility to interpret them according to a 
particular character set.

Since nginx returns 404, this suggests that you don't have a 
file whose name contains the UTF-8 bytes C3 BC: instead, there is 
something different.  My best guess is that you are using Latin1 
as a charset for your terminal, and there is an FC byte instead.  To 
see what is actually there, consider looking at the raw bytes in the 
file name with something like "ls | hd".
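
Where "hd" is not available, the same check can be sketched in Python. 
This is only an illustration: the helper names classify_name and 
dump_names are made up for this example, and the sketch assumes a 
system where os.listdir() returns raw bytes when given a bytes path.

```python
import os


def classify_name(raw: bytes):
    """Return (hex dump, True if the bytes are valid UTF-8) for one file name."""
    try:
        raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return raw.hex(" "), valid


def dump_names(directory: bytes = b"."):
    """Print the raw bytes of every entry in a directory (hypothetical helper)."""
    # Passing a bytes path makes os.listdir() return undecoded bytes names.
    for raw in sorted(os.listdir(directory)):
        hexed, valid = classify_name(raw)
        print(f"{'utf-8    ' if valid else 'NOT utf-8'}  {hexed}  {raw!r}")
```

Calling dump_names(b"/path/to/uploads") (path hypothetical) would show 
"c3 bc" for a UTF-8 "ü" and a lone "fc" for a Latin1 one.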

Also, you can use nginx autoindex module - it will generate a page 
with properly escaped links, so it will be possible to access 
files regardless of the charset used in the file names.
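
To illustrate what "properly escaped" means here: a link has to 
percent-encode the bytes that are actually stored on disk, whatever 
charset they came from. The sketch below is not how autoindex is 
implemented; it only uses Python's urllib.parse to show that the same 
visible name gives different links under the two encodings, and that 
the request from the access log decodes to the UTF-8 bytes.

```python
from urllib.parse import quote, unquote_to_bytes

name = "F\u00fcr"  # "Für", written with an escape to keep this source ASCII

# Percent-encode the raw bytes of the name under each charset.
utf8_link = quote(name.encode("utf-8"))      # 'F%C3%BCr'
latin1_link = quote(name.encode("latin-1"))  # 'F%FCr'

# The request path from the access log decodes to the UTF-8 byte sequence,
# so it only matches a file whose name is stored as those exact bytes.
requested = unquote_to_bytes("Aliinale-F%C3%BCr-Alina.pdf")

print(utf8_link, latin1_link, requested)
```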

-- 
Maxim Dounin
http://mdounin.ru/
___
nginx mailing list
nginx@nginx.org
https://mailman.nginx.org/mailman/listinfo/nginx


Re: International characters and serving files

2024-02-10 Thread David Connors
On Sun, 11 Feb 2024 at 00:24, Maxim Dounin wrote:

> File names on Unix systems are typically stored as bytes, and it
> is the user's responsibility to interpret them according to a
> particular character set.
>
> Since nginx returns 404, this suggests that you don't have a
> file whose name contains the UTF-8 bytes C3 BC: instead, there is
> something different.  My best guess is that you are using Latin1
> as a charset for your terminal, and there is an FC byte instead.  To
> see what is actually there, consider looking at the raw bytes in the
> file name with something like "ls | hd".
>
> Also, you can use nginx autoindex module - it will generate a page
> with properly escaped links, so it will be possible to access
> files regardless of the charset used in the file names.
>

You were spot on, Maxim. Thank you so much. I fixed it with
"mv Aliinale-Für-Alina.pdf Aliinale-Für-Alina.pdf", where the first name
came from shell autocompletion and the second was the UTF-8 name pasted
from WordPress.
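
For more than a handful of files, the same rename could be scripted. 
The following is a rough sketch under stated assumptions, not what was 
actually run: the thread does not show the original bytes, so the code 
assumes a name is either Latin1 (per Maxim's guess) or UTF-8 in a 
decomposed form, and it normalizes both to precomposed (NFC) UTF-8. The 
function names to_nfc_utf8 and rename_to_utf8 are hypothetical.

```python
import os
import unicodedata


def to_nfc_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return the name re-encoded as NFC-normalized UTF-8 bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: names that are not valid UTF-8 are in the fallback charset.
        text = raw.decode(fallback)
    return unicodedata.normalize("NFC", text).encode("utf-8")


def rename_to_utf8(directory: bytes, raw: bytes) -> bytes:
    """Rename one directory entry if its bytes change; return the new name."""
    fixed = to_nfc_utf8(raw)
    if fixed != raw:
        os.rename(os.path.join(directory, raw), os.path.join(directory, fixed))
    return fixed
```

It is worth running this against a copy of the directory first, since 
the Latin1 fallback silently mis-converts names that are really in some 
other 8-bit charset.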