On Sat, Jan 15, 2005 at 11:51:18PM +1000, Alexander Zangerl <[EMAIL PROTECTED]> 
wrote:
> >The author would be well-advised to actually read the rfc that is being
> >referenced, as the very same rfc explains how to encode unsafe
> >characters.
> 
> your snooty report offends. your holier-than-thou affection *might* be
> acceptable if you had provided a patch for the problem. i see no patch here.

This is silly. What was snooty was the message from igal, telling me to
read an RFC, when it is obvious that the author of that message didn't
read it carefully.

In any case, it's a bug; whether a patch is attached or not does not turn
it into a non-bug.

Also, you seem to misunderstand the problem, so writing a patch that you
wouldn't apply anyway would have made no sense, right?

> but if your url was iso-8859-1 and outside ascii (eg. äöüß etc), 
> then you're wrong: there is a fundamental limitation making all 

No, I am not. Why do you assume the filename was iso-8859-1? As a matter
of fact, filenames are encoded using octets in the same way as urls. There
is a 1-1 mapping between filenames and urls.

What you assume is that webservers map octets 0-127 differently than
octets 128-255 when accessing files, but that is not the case.

Although webservers *could* map these octets differently, the same is true
for ascii, too, so your assumption that it magically works for octets in
the ascii range but not for octets with the high bit set is flawed.

Summary:

1. URLs are well-defined as encoding octets; there is no magical limitation
   to ascii (as you know).

2. filenames are well-defined as encoding octets (igal is written in perl,
   and perl has this property).

3. There is a 1-1 mapping between those two that usually works. When it
   doesn't work, then mapping ascii doesn't work, either.
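The 1-1 mapping from the summary above can be sketched in a few lines (a
Python illustration, not igal's actual perl code; the function name is
mine): percent-encode every unsafe octet, without ever decoding the name
into characters.

```python
from urllib.parse import quote, unquote_to_bytes

def filename_to_url(name_octets):
    # Map raw filename octets 1-1 to a URL path segment.  No
    # character set is involved: each unsafe octet becomes %XX.
    return quote(name_octets, safe="")

# Works identically for ascii and for high-bit octets:
assert filename_to_url(b"blodian.jpg") == "blodian.jpg"
assert filename_to_url(b"bl\xf6dian.jpg") == "bl%F6dian.jpg"
assert filename_to_url(b"bl\xc3\xb6dian.jpg") == "bl%C3%B6dian.jpg"

# And the mapping is reversible, hence 1-1:
assert unquote_to_bytes("bl%F6dian.jpg") == b"bl\xf6dian.jpg"
```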

> http://www.w3.org/TR/html40/appendix/notes.html#h-B.2.1 *suggests*
> conversion to utf-8 and then a % encoding for urls, and mentions the older 
> practice of using just iso-8859-1 and its %-encoding. 

Yes, but it doesn't suggest that webservers do that, and in practice they
don't.

It's easy to convert my image urls into utf-8 (which they already were,
btw., not iso-8859-1 as you might have assumed - another reason why igal
shouldn't interpret filenames as characters): I just recode my filenames
to utf-8 and update the urls.

> not all webservers distinguish properly between these two cases and the
> mechanism also depends on the web server filesystem (whether it wants to
> see iso-8859-1 filenames or whether unicode is expected).

No, this is fundamentally wrong. The webserver in general has no problem with
that at all. The webserver does not need to know any character encoding, just
as your kernel does not need to know any character encoding to access files.

Assuming that a webserver does a non-1-1 mapping for octets outside 0-127
when accessing files is not realistic on the posix systems igal runs on.

> >It is true that IMG SRC cannot contain spaces (For example), but this
> >does in no way mean that image filenames were at fault.  
> 
> this is silly. the image filename is unrepresentable -> the filename poses
> the problem. igal confronts you with a problem report.

First you claim that the filename is unrepresentable, then you claim it
can be represented twice?

This is not only contradictory, it's both wrong:

> "blödian.jpg" can be represented as "bl%F6ian.jpg" or "bl%C3%B6dian.jpg".

blödian does not need any representation, the filename is either
bl\xf6dian.jpg or bl\xc3\xb6dian.jpg. It doesn't have two filenames at the
same time.

If that file were part of the images I ran igal on, then bl%C3%B6dian.jpg
is the only correct representation.

igal has no need for artificially re-interpreting the filename octets as
characters.

The only time a webserver needs to care about the encoding is when it
generates a directory listing. Only then does it need the encoding, and
under posix systems it's unlikely that the webserver knows it.

> both are legal, both are possible, both have been or are in use out there,
> either of them will work or fail in a specific situation. for example,
> apache on my debian box likes the first and doesn't grok the latter.

Then your apache is broken. Apache easily groks both urls, but the fact
is that you probably only had one of the files. If you had provided both
files, apache would happily serve them.

(Same as with any other webserver...)

Filenames, under posix, have no encoding attached. They are just
octets. It might be different under windows, when e.g. perl gives you
latin1 and the non-posix webserver would require utf-8-encoded urls (the
underlying ucs-2 representation is not encodable into urls).

What igal does is assume that there is some hidden encoding. That is,
however, not the case.

> so, which of two evils do you want igal to choose? 

igal has no choice to make. it already has the filename in the correct
encoding. it doesn't matter whether the filename is encoded in utf-8 or
latin1.

igal would need to choose if it had to re-encode the filename in some way.
however, why should igal do that?

> i think that suggesting the safe course (ie. to avoid the charset trouble)
> to the user is actually a reasonable approach. 

I think there is no issue, as just converting the filename to a uri without
making the problem artificially complex is the right thing to do in about
100% of the cases. If you are concerned about some weird, unknown,
misconfigured webserver, then a warning might be more appropriate.

At the very least, igal should offer to convert the filenames 1-1, which
works everywhere, instead of either renaming files or simply not working,
which are both poor choices.

> having said that, i'll think about it a bit more and maybe add 
> both common encodings to the list of choices.

That is fundamentally flawed. You don't *know* the filename encoding. It's
just a bunch of octets. You cannot offer "common" encodings for uris, as
you cannot re-encode filenames to that encoding.

You would first have to know the encoding the filenames are in
(nl_langinfo gives a reasonable starting point, but it's very common that
the locale encoding and the filename encoding differ).
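Querying that starting point looks like this (a Python sketch; in perl
the equivalent lives in I18N::Langinfo - and again, the result is only
the *locale* encoding, not necessarily the filename encoding):

```python
import locale

# Ask the C library which codeset the current locale uses.  This is
# only a hint: the names on disk may well use a different encoding.
locale.setlocale(locale.LC_CTYPE, "")
codeset = locale.nl_langinfo(locale.CODESET)
```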

If you then know how to interpret the filenames, you could then re-encode
them to another charset.

However, the only effect this will have is that the webserver will not
find the files unless the two character sets are identical. Even then,
due to normalization, this might not work. The only safe way is not to
(mis-)interpret the filenames in any way.

It would make sense to allow one to specify a filename encoding, namely
for setting the encoding of the generated webpages (or, better, using
utf-8 and converting all filenames).

I am not asking for that, that is just a wishlist. The message, however,
is a bug (and the options it offers lead to data loss, as these are the
only options that igal provides).

You are well-advised to read the rfc reference provided by igal and the
reference you provided in this mail very carefully.

Then you will understand that uris have no attached encoding, but they still
work fine without that information. Only when you try to interpret uris do
you need encoding information. The same is true for filenames.

So all you want to do is complicate matters by artificially forcing
uris and filenames into some encoding, which is unnatural and broken.

_Be good, be well_

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      [EMAIL PROTECTED]
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE

