On Thu, Dec 03, 2009 at 08:50:08PM +0100, Renat Golubchyk wrote:
> I'd suggest you use a unicode library. BTW, what about cyrillic
> letters or other alphabets? Those may have nothing to do with ASCII. Or
> is your project restricted to latin letters?

The data is already in normalized Unicode. My problem is eliminating errors from near misses :-(

Cyrillic doesn't look like the same problem -- no accents that I can see. Chinese, Japanese, etc., are the same as far as I know. Arabic has lots of tricks for combining letters and leaving out vowels, so it is probably an entirely different problem.

One thing I did not make clear is that this is for place names only: cities, and whatever the local equivalent of a US state or Canadian province is. Take Büsingen, for example. Do people type it in different ways depending on how they feel? Do some people always leave off the umlaut, and do some always use it?

My biggest annoyance is that a lot of the Google results come from Americans full of theory about languages they only know from the W3C recommendations. Maybe email or real documents follow proper usage much more closely than addresses typed into a web form, but I don't care about those. Maybe web forms in Germany that ask for a district do what many English-language web sites do and offer a menu of possible districts, in which case no one types in umlauts anyway :-)

-- 
            ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
     Felix Finch: scarecrow repairman & rocket surgeon / fe...@crowfix.com
  GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o