On 15/04/2011 09:37, Max Bolingbroke wrote:
On 14 April 2011 13:40, Simon Marlow<marlo...@gmail.com> wrote:
Suffice to say, this conversation is now over my head :-) So I defer to you
guys; I'm happy with whatever solution you come up with.
I was hoping to get your input on the general structure of the code
changes - Mark's proposal only relates to exactly how we encode
surrogates, and can be implemented on top of my existing patches just
by small local changes to the GHC.IO.Encoding.* modules.
Here are some particular issues that come to mind:
1. On Windows, my code ignores the argc and argv arguments to
hs_init, preferring to use GetCommandLineW
2. I decided to bake the CodingFailureMode into the TextEncoding.
Another option would be to supply it as an argument to "encode"
itself. The nice thing about my current solution is that in the future
we could potentially e.g. use the //IGNORE feature of GNU iconv to
implement the Ignore CodingFailureMode. This would be tricker with the
alternative design.
I'm not sure about the motivation for the factoring here, You've added
an extra member to BufferCodec:
+ recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to),
but this always seems to be instantiated by either recoverDecode or
recoverEncode. Furthermore the Handle implementation always calls
streamEncode, which uses encode and recover in a particular way.
In general I like the factoring of encodings into functions that return
a CodingProgress and a wrapper that decides what to do on failure. But
I wonder whether that needs to be visible in the BufferCodec, or whether
you can compose them together earlier and just have a single method for
encoding in BufferCodec.
So instead of adding the recover member, the encode member would be
wrapped appropriately (e.g. as streamEncode). Then peekCString would
have to be coded a bit differently, but it looks like it could be done.
3. Naming and location of the new exports: is GHC.Foreign OK?
Fine.
4. Some parts of the RTS appear to be encoding-unsafe, but I haven't
tried to fix them yet. For example, HPC stores a module name -- I
presume that GHC dumps this out in UTF-8, but the RTS interprets it in
the locale encoding. There are similar problems with e.g. rtsopts
baked in via -with-rtsopts. Furthermore, the RTS command line options
parser is only safe for ASCII. All of these are pretty minor issues,
and quite annoying to fix, so I'm tempted just to leave them there.
Agreed.
If you are generally OK with the changes then I will merge them into
master (after making the modifications required for Mark's proposal).
The notable exceptions being console and file output, which are probably
also the most visible aspects. For console output at least it would be nice
if we used the wide console API (indeed, it would be really nice if we
didn't go through the horrible msvcrt/mingw layers for I/O at all...)
True, but I'm not keen on tackling that problem just yet :-)
Cheers,
Max
One more thing: in peekEncodedCString you're using
let cHUNK_SIZE = 4096 -- Decode buffer chunk size in characters
yet we're given the size of the input in bytes, so we can do much
better. Using a whole new 16k chunk each time will be bad for small
strings.
Cheers,
Simon
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc