On Sat, Jan 15, 2005 at 11:51:18PM +1000, Alexander Zangerl <[EMAIL PROTECTED]> wrote:

> >The author would be well-advised to actually read the rfc that is being
> >referenced, as the very same rfc explains how to encode unsafe
> >characters.
>
> your snooty report offends. your holier-than-thou affection *might* be
> acceptable if you had provided a patch for the problem. i see no patch here.
This is silly. What was snooty was the message from igal, telling me to read
an RFC when it is obvious that the author of that message didn't carefully
read it himself. In any case, it's a bug; whether a patch is attached or not
does not turn it into a non-bug. Also, you seem to misunderstand the problem,
so writing a patch that you wouldn't apply anyway would have made no sense,
right?

> but if your url was iso-8859-1 and outside ascii (eg. äöüß etc),
> then you're wrong: there is a fundamental limitation making all

No, I am not. Why do you assume the filename was iso-8859-1? As a matter of
fact, filenames are encoded using octets in the same way as urls. There is a
1-1 mapping between filenames and urls. What you assume is that webservers
map octets 0-127 differently than octets 128-255 when accessing files, but
that is not the case. Although webservers *could* map these octets
differently, the same is true for ascii, too, so your assumption that it
magically works for octets in the ascii range but not for octets with the
high bit set is flawed.

Summary:

1. URLs are well-defined as encoding octets; there is no magical limitation
   to ascii (as you know).
2. Filenames are well-defined as encoding octets (igal is written in perl,
   and perl has this property).
3. There is a 1-1 mapping between those two that usually works. When it
   doesn't work, then mapping ascii doesn't work, either.

> http://www.w3.org/TR/html40/appendix/notes.html#h-B.2.1 *suggests*
> conversion to utf-8 and then a % encoding for urls, and mentions the older
> practice of using just iso-8859-1 and its %-encoding.

Yes, but it doesn't suggest that the webserver does that, and in practice
webservers don't. It's easy to convert my image urls into utf-8 (which they
already were, btw., not iso-8859-1 as you might have assumed - another
reason why igal shouldn't care about interpreting filenames as characters):
I just recode my filenames to utf-8 and update the urls.

> not all webservers distinguish properly between these two cases and the
> mechanism also depends on the web server filesystem (whether it wants to
> see iso-8859-1 filenames or whether unicode is expected).

No, this is fundamentally wrong. The webserver in general has no problem
with that at all. The webserver does not need to know any character
encoding, just as your kernel does not need to know any character encoding
to access files. Assuming that a webserver does a non-1-1 mapping for octets
outside 0-127 when accessing files is not realistic on the posix systems
igal runs on.

> >It is true that IMG SRC cannot contain spaces (For example), but this
> >does in no way mean that image filenames were at fault.
>
> this is silly. the image filename is unrepresentable -> the filename poses
> the problem. igal confronts you with a problem report.

First you claim that the filename is unrepresentable, then you claim it can
be represented twice? This is not only contradictory, both claims are also
wrong:

> "blödian.jpg" can be represented as "bl%F6ian.jpg" or "bl%C3%B6dian.jpg".

blödian.jpg does not need any representation; the filename is either
bl\xf6dian.jpg or bl\xc3\xb6dian.jpg. It doesn't have two filenames at the
same time. If that file were part of the images I ran igal on, then
bl%C3%B6dian.jpg would be the only correct representation. igal has no need
to artificially re-interpret the filename octets as characters. The only
time a webserver needs to care about the encoding is when it generates a
directory listing.
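To make the point concrete, here is a minimal sketch (not igal's actual
code) of the 1-1 mapping I am talking about: take the filename octets
exactly as they are on disk and percent-encode everything outside the
unreserved URI characters, without ever deciding what charset they are "in".
The filename_to_uri name and the example filenames are made up for
illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Percent-encode the filename octets one-to-one.  Whatever octets the
    # file has on disk are exactly what ends up in the URL; no charset is
    # assumed or converted.  "/" is left alone so path separators survive.
    sub filename_to_uri {
        my ($name) = @_;
        $name =~ s{([^A-Za-z0-9._~/-])}{sprintf "%%%02X", ord $1}ge;
        return $name;
    }

    print filename_to_uri("bl\xf6dian.jpg"),     "\n";  # bl%F6dian.jpg
    print filename_to_uri("bl\xc3\xb6dian.jpg"), "\n";  # bl%C3%B6dian.jpg

URI::Escape's uri_escape() would do the same octet-wise escaping on a byte
string; the point is only that no decoding step is needed anywhere.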
Only then does it need the encoding, and under posix systems it's unlikely
that the webserver knows it.

> both are legal, both are possible, both have been or are in use out there,
> either of them will work or fail in a specific situation. for example,
> apache on my debian box likes the first and doesn't grok the latter.

Then your apache is broken. Apache easily groks both urls, but the fact is
that you probably only had one of the files. If you had provided both files,
apache would happily serve them. (Same as with any other webserver...)

Filenames, under posix, have no encoding attached. They are just octets. It
might be different under windows, where e.g. perl gives you latin1 and the
non-posix webserver would require utf-8-encoded urls (the underlying ucs-2
representation is not encodable into urls). What igal does is assume that
there is some hidden encoding. That is, however, not the case.

> so, which of two evils do you want igal to choose?

igal has no choice to make. It already has the filename in the correct
encoding. It doesn't matter whether the filename is encoded in utf-8 or
latin1. igal would only need to choose if it had to re-encode the filename
in some way. However, why should igal do that?

> i think that suggesting the safe course (ie. to avoid the charset trouble)
> to the user is actually a reasonable approach.

I think there is no issue, as just converting the filename to a uri without
making the problem artificially complex is the right thing to do in about
100% of the cases. If you are concerned about some weird, unknown,
misconfigured webserver, then a warning might be more appropriate. At the
very least, igal should offer to convert the filenames 1-1, which works
everywhere, instead of either renaming files or simply not working, which
are both poor choices.

> having said that, i'll think about it a bit more and maybe add
> both common encodings to the list of choices.

That is fundamentally flawed. You don't *know* the filename encoding. It's
just a bunch of octets. You cannot offer "common" encodings for uris, as you
cannot re-encode filenames into that encoding. You would first have to know
the encoding the filenames are in (nl_langinfo gives a reasonable starting
point, but it's very common that the locale encoding and the filename
encoding differ). If you then knew how to interpret the filenames, you could
re-encode them to another charset. However, the only effect this will have
is that the webserver will not find the files, unless the two character sets
are identical. Even then, due to normalization, this might not work. The
only safe way is not (mis-)interpreting the filenames in any way.

It would make sense to allow the user to specify a filename encoding, namely
for setting the encoding of the generated webpages (or, better, using utf-8
and converting all filenames). I am not asking for that; that is just a
wishlist item. The message, however, is a bug (and it leads to data loss, as
renaming or not working are the only options that igal provides).

You would be well-advised to read the rfc reference provided by igal and the
reference you provided in this mail very carefully. Then you will understand
that uris have no attached encoding, but they still work fine without that
information. Only when you try to interpret uris as characters do you need
encoding information. The same is the case with filenames. So all that you
propose would do is complicate matters by artificially forcing uris and
filenames into some encoding, which is unnatural and broken.
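Coming back to the wishlist item above, here is a rough sketch (again not
igal code; the display_name helper and the iso-8859-1 charset are just
assumptions for the example) of how a user-supplied filename charset could
be used only for readable link text and utf-8 page output, while the href
keeps the raw octets, percent-encoded 1-1 as shown earlier.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Decode the filename octets only for human-readable page text, and only
    # if the user explicitly stated the charset -- no guessing by default.
    sub display_name {
        my ($filename, $charset) = @_;
        return $filename unless defined $charset;
        return decode($charset, $filename);
    }

    my $raw   = "bl\xf6dian.jpg";                  # octets as found on disk
    my $label = display_name($raw, "iso-8859-1");  # characters, for the page

    binmode STDOUT, ":encoding(UTF-8)";            # pages written as utf-8
    print qq{<a href="bl%F6dian.jpg">$label</a>\n};

The href is still the untouched octets, so the webserver keeps finding the
file; only the text the reader sees is affected by the charset option.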
_Be good, be well_

      The choice of a GNU generation
      Marc Lehmann
      [EMAIL PROTECTED]
      http://schmorp.de/
      XX11-RIPE