Hi,

Monday, November 4, 2002, 2:58:22 PM, you wrote:
ahsb> After considerable investigation into the form input of non-Latin 1 
ahsb> characters to be processed by PHP on a Linux box, I've been able to 
ahsb> distill the issue down considerably, though a solution (and one oddity) 
ahsb> remains confusing.

ahsb> I found a very helpful web page entitled "On the use of some MS Windows 
ahsb> characters in HTML" that explains my problem rather well at 
ahsb> http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. Recommended 
ahsb> reading for anyone displaying text that may have been entered by 
ahsb> Windows users, especially text pasted in from word-processing apps.

ahsb> Basically, the problem is this: on a Windows machine using Windows 1252 
ahsb> ("Windows Latin 1"), a pair of smart quotes is stored as byte values 147 
ahsb> and 148. There are a number of other "special" characters that Windows 
ahsb> maps onto bytes 128-159, like em dashes and trademark symbols.

ahsb> Unfortunately, _true_ Latin 1 (iso-8859-1) reserves chars 128-159 for 
ahsb> control characters. So, while you may type ALT-0147 to type a smart 
ahsb> quote into your word processing app (or allow Word to create them 
ahsb> automagically when you type a quote), when that very same character is 
ahsb> pasted into a web page form set to accept iso-8859-1 or UTF-8 encoding, 
ahsb> it DOES NOT MAP to chr(147) when processed by PHP on a Linux box.

ahsb> Strangely, pasting a Word-created smart quote character into a web 
ahsb> form and processing it with PHP produces VERY ODD results. Take the 
ahsb> string

ahsb> ="=

ahsb> where the quotation mark is a curly-style quote. Tell PHP to step 
ahsb> through the characters and print their ASCII value. The two equal signs 
ahsb> are fine (char 61), but the curly quote comes across as THREE 
ahsb> characters: (226)(128)(156). Where this comes from, I do not understand.
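
Those three bytes look like UTF-8 to me: 226 128 156 is 0xE2 0x80 0x9C,
which is how UTF-8 encodes U+201C (the left curly double quote). So the
browser is probably submitting the form as UTF-8, and PHP's string
functions are counting bytes, not characters. A minimal sketch to
reproduce it, assuming the form data really did arrive as UTF-8 ($input
and $bytes are just names made up for this example):

```php
<?php
// Assumption: the form was submitted as UTF-8, so the curly quote
// arrives as three bytes rather than one.
$input = "=\xE2\x80\x9C=";   // '=' + U+201C LEFT DOUBLE QUOTATION MARK + '='

// Step through the string byte by byte, as in the quoted experiment.
$bytes = array();
for ($i = 0; $i < strlen($input); $i++) {
    $bytes[] = ord($input[$i]);
}

echo implode(' ', $bytes), "\n";
```

That should print 61 226 128 156 61, which matches what you saw.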

ahsb> I'm inclined to think that if I _don't_ try to specify the 
ahsb> accept-charset parameter on the form, and _don't_ try to convert em 
ahsb> dashes, curly quotes, etc that I'll probably end up with cleaner text 
ahsb> than I do now.

ahsb> Still, if anyone has any really helpful input on this topic, please 
ahsb> write me and let me know. We're getting into the ugly guts of page 
ahsb> charset vs. form accept-charset vs. browser input charset vs. latin 1 
ahsb> vs. Windows latin 1 vs. MacRoman here, but I'm surprised that no one 
ahsb> has chimed in on this. Does anyone else ever run into this problem, or 
ahsb> do everyone else's forms just handle all of this magically without 
ahsb> any intervention?

ahsb> spud.

ahsb> -------------------------------------------------------------------
ahsb> a.h.s. boy
ahsb> spud(at)nothingness.org            "as yes is to if,love is to yes"
ahsb> http://www.nothingness.org/
ahsb> -------------------------------------------------------------------

I ran into that problem when inserting Word tables into XML. I ran the
text through iconv() to clean it up:

$title = iconv("ISO-8859-1","UTF-8",$title);

Not sure if that was the right way to do it but it worked :)
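
One caveat on the iconv() call above: ISO-8859-1 as the source charset
only fits if the text really is Latin 1. If the bytes are Windows-1252
(the 147/148 smart quotes you describe), iconv would turn byte 147 into
the control character U+0093 instead of a curly quote. A sketch of the
alternative, assuming the local iconv knows the charset name "CP1252"
(that name, the sample $title, and the $utf8 variable are assumptions
for illustration):

```php
<?php
// Hypothetical input: a Windows-1252 string wrapped in smart quotes
// (bytes 147 and 148, written here as \x93 and \x94).
$title = "\x93quoted\x94";

// "CP1252" is assumed to be a charset name the local iconv supports.
// iconv() returns false when the conversion fails.
$utf8 = iconv("CP1252", "UTF-8", $title);
if ($utf8 === false) {
    die("iconv could not convert the string\n");
}

// Show the resulting bytes in hex so the UTF-8 sequences are visible.
echo bin2hex($utf8), "\n";
```

If the conversion works, the quotes should come out as the three-byte
UTF-8 sequences E2 80 9C and E2 80 9D.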

-- 
regards,
Tom


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
