Frank,

About the question "Do we need to convert to UCS-16 to do parsing or can we safely assume that special characters like '/', '.', '\' and ':' never occur as part of UTF-8 multi-byte sequences?": I was not sure exactly what you meant, but here are my findings/beliefs:

* In a UTF-8 string, any (unsigned) byte whose value is <= 127 is guaranteed to be a single-byte character, namely the one defined in the ASCII set (every byte of a multi-byte character has a value >= 128). So these special characters can be searched for byte by byte in UTF-8 without risk.

* For a UTF-16 byte stream (the "UCS-16" of the question), I had a hard time finding the answer. Basically, any Unicode character >= 0x0000 && <= 0xFFFF (surrogates aside) is directly "converted" into a single 16-bit code unit in UTF-16. Take the point character: its ASCII value is 0x2E, so in Unicode/UTF-16 it is 0x002E. But since 0x2E2E is also a valid Unicode character ('⸮', the reversed question mark), we cannot trust a 0x2E byte found in a UTF-16 byte stream to always be a point character.
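To make the two points above concrete, here is a minimal Python sketch. It is not GDAL code: the helper names special_offsets and multibyte_bytes_are_high and the sample filenames are made up for illustration. It checks that no byte of a multi-byte UTF-8 sequence falls in the ASCII range, and shows that a naive byte scan of UTF-16 data reports '.' bytes for a string that contains no point at all.

```python
# Bytes a path parser looks for: '/', '.', '\' and ':'.
SPECIAL = b"/.\\:"


def special_offsets(data):
    """Return the offsets of bytes in `data` that equal one of SPECIAL."""
    return [i for i, b in enumerate(data) if b in SPECIAL]


def multibyte_bytes_are_high(text):
    """True if every byte of every multi-byte UTF-8 character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# UTF-8: 'é' takes two bytes (0xC3 0xA9), neither of which looks like ASCII,
# so the only hit is the real '.' at byte offset 2.
print(special_offsets("é.tif".encode("utf-8")))        # [2]
print(multibyte_bytes_are_high("données/é⸮.tif"))      # True

# UTF-16: U+2E2E is stored as the two bytes 0x2E 0x2E, so a byte scan
# reports two '.' characters although the string contains none.
print(special_offsets("⸮".encode("utf-16-le")))        # [0, 1]
```

This only samples a few characters, so it illustrates the property rather than proving it; the guarantee itself comes from the way UTF-8 is defined.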
About command line applications, I was a bit concerned about backward compatibility, because there are existing working applications that pass ANSI non-ASCII filenames to the GDAL command line utilities. I've tested your preliminary implementation with a filename containing an 'é' (e-acute) character on a Windows platform that I think uses CP1252 (~ ISO-LATIN-1) as its code page. I expected a failure in the UTF-8 -> UTF-16 translation, since the filename passed wasn't UTF-8. It turns out that it actually works: the UTF-8 -> wide char conversion routine in cpl_recode_stub.cpp has a special case. When it fails to decode an (apparently) multi-byte UTF-8 character, it assumes the bytes are CP1252 and converts them to UTF-16 correctly. So this is good news for people having CP1252 as their current code page!

But apparently, there is a way for Windows command line applications to get their arguments as UTF-16 strings. These could then be reliably converted to UTF-8 right away and fed into GDAL. Here's what I found in MSDN:

* GetCommandLineW : http://msdn.microsoft.com/en-us/library/ms683156%28VS.85%29.aspx
* CommandLineToArgvW : http://msdn.microsoft.com/en-us/library/bb776391%28VS.85%29.aspx

The issue is that, if we go down this route, drivers that still use the old VSI API (the POSIX one) won't work anymore with non-ASCII filenames... So I'm not sure it's worth the pain: it would only be worthwhile for people who currently use the command line utilities successfully with a non-CP1252 code page.

About the Java bindings, nothing to change. Java strings are encoded in Unicode (UTF-16), and the typemaps we use already convert automatically to/from UTF-8 on the C side (with GetStringUTFChars() from Java to C, and NewStringUTF() from C to Java).

Best regards,

Even

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/gdal-dev