Thanks for the quick reply.

Let's back up a bit. Just take the following two methods of the Feature class 
in the OGR wrapper:
public string GetFieldAsString(int id);
public void SetField(int id, string value);

For these two the wrapper also uses no special encoding, which means that ANSI 
is used when strings are marshaled from/to unmanaged code.

As I understand it, this has been incorrect for some time now. As stated in 
http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode:
"It is declared that OGR string attribute values will be in UTF-8. This means 
that OGR drivers are responsible for translating format specific 
representations to UTF-8 when reading, and back to the format specific 
representation when writing."

I created a workaround for these two methods in the wrapper for myself many 
months ago, which is why I forgot to mention them in my previous post.

Only if we agree that the wrapper handles these two essential methods 
incorrectly does it make sense to discuss the other methods.
Take the attribute filter, for instance: assume you have a feature with 
attribute Name='München'. Because GetFieldAsString uses the wrong encoding, you 
will get 'MÃ¼nchen'. Using this in an attribute filter "Name = 'MÃ¼nchen'" will 
work, because SetAttributeFilter also uses ANSI encoding, so the two errors 
cancel each other out. But if GetFieldAsString correctly used UTF-8 encoding, 
SetAttributeFilter would have to use UTF-8 encoding as well; otherwise 
"Name = 'München'" would obviously not work.
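
To make the cancelling-out effect concrete, here is a minimal sketch that reproduces it without GDAL at all. It is only an illustration: the class and method names are made up, and ISO-8859-1 is used as a stand-in for the ANSI code page (for the bytes of 'ü' it behaves the same as Windows-1252).

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the OGR C# wrapper.
public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 stands in for the system ANSI code page.
    private static readonly Encoding Ansi = Encoding.GetEncoding("iso-8859-1");

    // What the reading side does today: UTF-8 bytes decoded as ANSI.
    public static string MisreadUtf8AsAnsi(string original)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        return Ansi.GetString(utf8);
    }

    // What the writing side does today: the string is encoded as ANSI,
    // and the native side then treats those bytes as UTF-8.
    public static string AnsiBytesReadAsUtf8(string managed)
    {
        byte[] ansi = Ansi.GetBytes(managed);
        return Encoding.UTF8.GetString(ansi);
    }
}
```

Feeding the garbled value back through the second method restores the original bytes, which is why the filter built from the garbled string still matches.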

I somehow hoped or assumed that the 'redesign' mentioned in rfc23 had 
progressed since then, so that a consistent encoding is also used for the other 
methods.

And it seems to have, apart from the C# wrapper.

I might be wrong, but even if I am, I still think it would be greatly 
beneficial if a consistent encoding were also used for field names, attribute 
filters, SQL statements, etc.

Regarding marshaling strings: Marshal.Copy and explicit pinning are not 
needed, because managed arrays are pinned automatically for the duration of the 
call when they are marshaled.
So simply converting strings to UTF-8 encoded managed byte arrays (byte[]) and 
passing the array instead of the string works fine.
One just needs to make sure that the byte array is zero-terminated.
This is what I use:
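        // Requires: using System.Text;
        // Converts a managed string into a zero-terminated UTF-8 byte array
        // that can be passed to native code in place of the string.
        public static byte[] StringToUtf8Bytes(string str)
        {
            if (str == null)
            {
                return null;
            }

            Encoding encoder = Encoding.UTF8;
            int strLen = str.Length;
            // Upper bound on the encoded size; the array may be larger than needed.
            int nativeLength = encoder.GetMaxByteCount(strLen);
            // New arrays are zero-filled, so the bytes after the encoded text
            // (including the extra one) act as the terminator.
            byte[] bytes = new byte[nativeLength + 1]; // zero terminated
            encoder.GetBytes(str, 0, strLen, bytes, 0);
            return bytes;
        }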
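
For the opposite direction (for example a method like GetFieldAsString that receives a char* from native code), something along the following lines should work. This is only a sketch and not the workaround mentioned above: the class and method names are hypothetical, and it assumes the native string is zero-terminated UTF-8 that the caller does not need to free.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper, not part of the OGR C# wrapper.
public static class Utf8Marshal
{
    // Reads a zero-terminated UTF-8 string from unmanaged memory.
    public static string Utf8PtrToString(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
        {
            return null;
        }

        // Find the terminating zero byte to get the length in bytes.
        int len = 0;
        while (Marshal.ReadByte(ptr, len) != 0)
        {
            len++;
        }

        // Copy the raw bytes into a managed array and decode them as UTF-8.
        byte[] bytes = new byte[len];
        Marshal.Copy(ptr, bytes, 0, len);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Here Marshal.Copy is needed because the memory is owned by the native side; it is only the managed-to-native direction where passing a byte[] avoids it.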
                
As far as examples go: any Shapefile should do. Although the DBF might be in 
different encodings, OGR internally always converts the field attribute strings 
to UTF-8, but the C# wrapper then interprets the strings as being in ANSI.
(Funnily enough, when you have an ANSI-encoded DBF and set the config option 
SHAPE_ENCODING to UTF8, the C# wrapper will 'correct' the mistake made by the 
config option, which forces the wrong encoding in this case. Before I understood 
that the problem was actually the C# wrapper, this was driving me crazy.)
                
Best regards,
Dennis