Re: [PATCH] Better encoding/decoding for GHC

2011-05-25 Thread Ian Lynagh
On Tue, May 24, 2011 at 05:52:23PM +0100, Max Bolingbroke wrote: > On 24 May 2011 02:16, Ian Lynagh wrote: > > On Wed, May 18, 2011 at 11:14:08PM +0100, Max Bolingbroke wrote: > >> On 18 May 2011 22:54, Mark Lentczner wrote: > >> > The range is U+EF80 through U+EFFF, called "Reserved for encoding

Re: [PATCH] Better encoding/decoding for GHC

2011-05-24 Thread Max Bolingbroke
On 24 May 2011 02:16, Ian Lynagh wrote: > On Wed, May 18, 2011 at 11:14:08PM +0100, Max Bolingbroke wrote: >> On 18 May 2011 22:54, Mark Lentczner wrote: >> > The range is U+EF80 through U+EFFF, called "Reserved for encoding hacks". >> >> OK, I've applied another patch so we match this. > > So ho

Re: [PATCH] Better encoding/decoding for GHC

2011-05-23 Thread Ian Lynagh
On Wed, May 18, 2011 at 11:14:08PM +0100, Max Bolingbroke wrote: > On 18 May 2011 22:54, Mark Lentczner wrote: > > The range is U+EF80 through U+EFFF, called "Reserved for encoding hacks". > > OK, I've applied another patch so we match this. So how do I fix the "rm a*" program below now? (if the

Re: [PATCH] Better encoding/decoding for GHC

2011-05-18 Thread Max Bolingbroke
On 18 May 2011 22:54, Mark Lentczner wrote: > The range is U+EF80 through U+EFFF, called "Reserved for encoding hacks". OK, I've applied another patch so we match this. Here's hoping that has finally put this issue to rest :-) > On a related note, If we want to be able to round trip file names t
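For readers skimming the archive, here is a minimal sketch of what escaping into the U+EF80 through U+EFFF range could look like. The range is taken from the message above; the exact mapping (byte b becomes code point 0xEF00 + b) and the names escapeByte and unescapeChar are assumptions made for illustration, not the code in the applied patch.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Map an undecodable byte (0x80..0xFF) to a private-use escape
-- character in U+EF80..U+EFFF. Assumed mapping: 0xEF00 + byte.
-- Bytes below 0x80 are plain ASCII and get no escape here.
escapeByte :: Word8 -> Maybe Char
escapeByte b
  | b >= 0x80 = Just (chr (0xEF00 + fromIntegral b))
  | otherwise = Nothing

-- | Recover the original byte from an escape character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xEF80 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise = Nothing
  where
    n = ord c
```

Under this assumed mapping the 128 high bytes fill the 128 code points of the range exactly, so every escaped byte can be recovered with unescapeChar.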

Re: [PATCH] Better encoding/decoding for GHC

2011-05-18 Thread Mark Lentczner
On Wed, May 18, 2011 at 2:28 AM, Max Bolingbroke wrote: > > U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!" > > > > We can (I'll be happy to) register this with the unofficial registry(2). > I've prepared a draft for the registry and submitted it…. Only to have it pointed out t

Re: [PATCH] Better encoding/decoding for GHC

2011-05-18 Thread Bryan O'Sullivan
On Wed, May 18, 2011 at 1:36 AM, Max Bolingbroke wrote: > > Aha! You go out of your way to detect and replace them. Interesting! > Yes, and necessary, as otherwise text is open to data corruption.

Re: [PATCH] Better encoding/decoding for GHC

2011-05-18 Thread Max Bolingbroke
On 15 May 2011 18:08, Mark Lentczner wrote: > We can increase the unlikeliness of colliding with them by using > a known unused range. I've looked at relevant on-line sources(1,2) and > suggest: > > U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!" > > We can (I'll be happy to) reg

Re: [PATCH] Better encoding/decoding for GHC

2011-05-18 Thread Max Bolingbroke
On 17 May 2011 19:50, Bryan O'Sullivan wrote: > Any attempt to pack a String into a Text will replace UTF-16 surrogates with > U+FFFD: > https://github.com/bos/text/blob/master/Data/Text/Internal.hs#L87 > https://github.com/bos/text/blob/master/Data/Text.hs#L363 Aha! You go out of your way to det

Re: [PATCH] Better encoding/decoding for GHC

2011-05-17 Thread Bryan O'Sullivan
On Mon, May 16, 2011 at 9:22 AM, Max Bolingbroke wrote: > This is a key point - I wonder whether you have in mind a particular > bit of code using the "text" package that will fail if we use lone > surrogates as escapes? > Any attempt to pack a String into a Text will replace UTF-16 surrogates w
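The replacement behaviour described in this message can be modelled with base alone. The sketch below is not the text package's implementation (see the links in the message for that); isSurrogateCP and sanitize are hypothetical names used only to show the effect on a String.

```haskell
import Data.Char (ord)

-- | True for code points in the UTF-16 surrogate range U+D800..U+DFFF.
isSurrogateCP :: Char -> Bool
isSurrogateCP c = n >= 0xD800 && n <= 0xDFFF
  where
    n = ord c

-- | Replace every surrogate with U+FFFD and leave all other
-- characters, including private-use ones, untouched.
sanitize :: String -> String
sanitize = map (\c -> if isSurrogateCP c then '\xFFFD' else c)
```

With a step like this on the way into Text, an escape stored as a lone surrogate would be turned into U+FFFD and the original byte could no longer be recovered, whereas a private-use escape would pass through unchanged.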

Re: [PATCH] Better encoding/decoding for GHC

2011-05-16 Thread Max Bolingbroke
On 15 May 2011 18:08, Mark Lentczner wrote: > other hand, Haskell software generally does presume valid Unicode, and the > broken surrogates will break things, for example the Text package. PUA > characters will work with all Haskell software. This is a key point - I wonder whether you have in mi

Re: [PATCH] Better encoding/decoding for GHC

2011-05-15 Thread Mark Lentczner
I'll push back... and apologize for perhaps making this all seem more complicated than it probably is :-) I think, all things given, the use of private use area (PUA) characters is far preferable. With the exception of small ranges used by Apple & Microsoft, PUA characters exchanging in the wi

Re: [PATCH] Better encoding/decoding for GHC

2011-05-14 Thread Max Bolingbroke
On 11 May 2011 13:36, Max Bolingbroke wrote: > I thought you were arguing against choice 1 and in favour of 2 in your > initial message? I've pushed my implementation pretty much as it was at the beginning of this thread to master so it can go into 7.2. Please let me know of any problems you enco

Re: [PATCH] Better encoding/decoding for GHC

2011-05-11 Thread Max Bolingbroke
On 11 May 2011 00:40, Mark Lentczner wrote: > That is why the Python approach hides these beasts in a non-legal part of > the code space. Naturally. The choice is clear. The escapes should use either: 1. The surrogate code points, in which case we can roundtrip any string but we might confuse U

Re: [PATCH] Better encoding/decoding for GHC

2011-05-10 Thread Mark Lentczner
> File paths that don't decode. > File paths with a small range of private use characters. > > It was always my intention to allow roundtripping of arbitrary > bytestrings through String. I don't think that the middle ground > (where you can *read in* a filename without error but not write it out

Re: [PATCH] Better encoding/decoding for GHC

2011-05-10 Thread Simon Marlow
On 10/05/2011 15:29, Max Bolingbroke wrote: On 18 April 2011 12:48, Simon Marlow wrote: I'm not sure about the motivation for the factoring here, You've added an extra member to BufferCodec: + recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to), but this always seems to be

Re: [PATCH] Better encoding/decoding for GHC

2011-05-10 Thread Max Bolingbroke
On 18 April 2011 12:48, Simon Marlow wrote: > I'm not sure about the motivation for the factoring here,  You've added an > extra member to BufferCodec: > > +  recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to), > > but this always seems to be instantiated by either recoverDecode or

Re: [PATCH] Better encoding/decoding for GHC

2011-05-09 Thread Max Bolingbroke
On 7 May 2011 17:38, Mark Lentczner wrote: > We have a choice. The current proposal maps the two following classes of > file paths onto the same string, and so when encoding back to the system we > must choose which it is -- the other class getting the short-end of the > stick: > > File paths that
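The collision Mark describes can be shown with a toy decoder. Everything here is an assumption made for illustration: decodeToy handles only ASCII and three-byte UTF-8 sequences (with no overlong or surrogate checks), and it escapes any other byte b as code point 0xEF00 + b. It is not GHC's decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Toy decoder: understands ASCII and three-byte UTF-8 sequences only.
-- Any other byte is escaped into U+EF80..U+EFFF (assumed scheme).
decodeToy :: [Word8] -> String
decodeToy (b0 : b1 : b2 : rest)
  | b0 .&. 0xF0 == 0xE0, isCont b1, isCont b2 =
      chr (((fromIntegral b0 .&. 0x0F) `shiftL` 12)
           .|. ((fromIntegral b1 .&. 0x3F) `shiftL` 6)
           .|. (fromIntegral b2 .&. 0x3F))
        : decodeToy rest
  where
    isCont b = b .&. 0xC0 == 0x80
decodeToy (b : rest)
  | b < 0x80 = chr (fromIntegral b) : decodeToy rest
  | otherwise = chr (0xEF00 + fromIntegral b) : decodeToy rest
decodeToy [] = []

-- Two different byte strings standing in for file paths.
undecodable, privateUse :: [Word8]
undecodable = [0x61, 0x80]             -- 'a' then a stray continuation byte
privateUse  = [0x61, 0xEE, 0xBE, 0x80] -- 'a' then valid UTF-8 for U+EF80
```

Both inputs decode to the same two-character String ('a' followed by U+EF80), so an encoder going back to bytes has to pick one interpretation, and the other class of path cannot round-trip.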

[PATCH] Better encoding/decoding for GHC

2011-05-07 Thread Mark Lentczner
(Crud - Simon just pointed out that I accidentally sent my reply to just him, not the list. D'oh! -- Sorry for the tardy reply-all, all!) On Wed, Apr 20, 2011 at 2:59 AM, Simon Marlow wrote: > So that means filenames that are not legal in the current encoding won't > round-trip? But wasn't that

Re: [PATCH] Better encoding/decoding for GHC

2011-04-27 Thread Ian Lynagh
On Tue, Apr 12, 2011 at 01:05:41PM +0100, Max Bolingbroke wrote: > > As you may know, I've been working on improving GHC's support for > Unicode. I think we're happy to have someone who knows what they're doing working on this, but I'm not sure what the status is. Are you waiting for us to do som

Re: [PATCH] Better encoding/decoding for GHC

2011-04-20 Thread Simon Marlow
On 18/04/2011 21:46, Mark Lentczner wrote: (A minor point: I think your definition D10, rather than D76, is closest to what GHC implements as Char, since you can for example evaluate (length "\xD800") with no complaints Yikes - I thought earlier versions of GHC wouldn't evaluate "\xD

Re: [PATCH] Better encoding/decoding for GHC

2011-04-18 Thread Mark Lentczner
> > (A minor point: I think your definition D10, rather than D76, is closest to > what GHC implements as Char, since you can for example evaluate (length > "\xD800") with no complaints Yikes - I thought earlier versions of GHC wouldn't evaluate "\xD800". So you are right - GHC seems to be D10, but

Re: [PATCH] Better encoding/decoding for GHC

2011-04-18 Thread Simon Marlow
On 15/04/2011 09:37, Max Bolingbroke wrote: On 14 April 2011 13:40, Simon Marlow wrote: Suffice to say, this conversation is now over my head :-) So I defer to you guys; I'm happy with whatever solution you come up with. I was hoping to get your input on the general structure of the code chan

Re: [PATCH] Better encoding/decoding for GHC

2011-04-15 Thread Max Bolingbroke
On 14 April 2011 13:40, Simon Marlow wrote: > Suffice to say, this conversation is now over my head :-) So I defer to you > guys; I'm happy with whatever solution you come up with. I was hoping to get your input on the general structure of the code changes - Mark's proposal only relates to exactl

Re: [PATCH] Better encoding/decoding for GHC

2011-04-14 Thread Simon Marlow
On 12/04/2011 22:04, Max Bolingbroke wrote: Hi Mark, Thanks for your detailed response. (A minor point: I think your definition D10, rather than D76, is closest to what GHC implements as Char, since you can for example evaluate (length "\xD800") with no complaints - this comes back to Bryan's e

Re: [PATCH] Better encoding/decoding for GHC

2011-04-12 Thread Max Bolingbroke
Hi Mark, Thanks for your detailed response. (A minor point: I think your definition D10, rather than D76, is closest to what GHC implements as Char, since you can for example evaluate (length "\xD800") with no complaints - this comes back to Bryan's earlier reply to this thread. Of course, you ca
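The point about (length "\xD800") is easy to check with base. This is a small sketch; loneSurrogate and holdsSurrogate are names introduced here for illustration.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)

-- | A String containing a single lone surrogate code point.
loneSurrogate :: String
loneSurrogate = "\xD800"

-- | True when a Char holds a surrogate code point (category Cs).
holdsSurrogate :: Char -> Bool
holdsSurrogate c = generalCategory c == Surrogate
```

Here length loneSurrogate is 1 and holdsSurrogate is True for its only element, which matches the observation that Char accepts any code point rather than only Unicode scalar values.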

Re: [PATCH] Better encoding/decoding for GHC

2011-04-12 Thread Mark Lentczner
Indeed, POSIX has made a mess of things, hasn't it? That said, I don't think applying PEP-383 here would make things better for Haskell. Please bear with this background: *Background* Haskell 98 and Haskell 2010 both define the type Char this way: > The character type Char is an enumeration whos

Re: [PATCH] Better encoding/decoding for GHC

2011-04-12 Thread Bryan O'Sullivan
On Tue, Apr 12, 2011 at 5:05 AM, Max Bolingbroke wrote: > a) When decoding a byte sequence to a String (which in GHC is > typically a sequence of 16-bit values representing a UTF-16 encoded > Unicode string), any bytes in the input which are undecodable are > represented in the String as a unico

[PATCH] Better encoding/decoding for GHC

2011-04-12 Thread Max Bolingbroke
Hi, As you may know, I've been working on improving GHC's support for Unicode. In particular, I have been trying to achieve the following: 1. Use the locale encoding to decode command line arguments, environment variables and file names from e.g. the System.Directory functions 2. Implement F