Re: [Python-Dev] Bytes path related questions for Guido

2014-08-28 Thread MRAB

On 2014-08-28 05:56, Glenn Linderman wrote:

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

Glenn Linderman writes:
  > On 8/26/2014 4:31 AM, MRAB wrote:
  > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
  > >> Nick Coghlan writes:

  > > How about:
  > >
  > > replace_surrogate_escapes(s, replacement='\uFFFD')
  > >
  > > If you want them removed, just pass an empty string as the
  > > replacement.

That seems better to me (I had too much C for breakfast, I think).

  > And further, replacement could be a vector of 128 characters, to do
  > immediate transcoding,

Using what encoding?


The vector would contain the transcoding. Each lone surrogate would map
to a character in the vector.
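
A minimal sketch of the 128-entry vector idea, assuming the lone surrogates were produced by the surrogateescape handler (so byte 0x80+i shows up as U+DC80+i). The name `transcode_escapes` and the choice of the upper half of KOI8-R as the vector are illustrative assumptions, not part of the proposal.

```python
def transcode_escapes(s, vector):
    """Map each lone surrogate U+DC80..U+DCFF to vector[byte - 0x80]."""
    if len(vector) != 128:
        raise ValueError("vector must contain exactly 128 characters")
    # str.translate takes a mapping from code point to replacement string.
    table = {0xDC80 + i: ch for i, ch in enumerate(vector)}
    return s.translate(table)


# Hypothetical vector: the upper half of KOI8-R, which defines all 128 slots.
koi8_upper = bytes(range(0x80, 0x100)).decode("koi8_r")

# Bytes that ASCII cannot decode become lone surrogates...
escaped = "Привет".encode("koi8_r").decode("ascii", "surrogateescape")
# ...and the vector maps them straight to the intended characters.
result = transcode_escapes(escaped, koi8_upper)
```

As the thread goes on to note, this only repairs the bytes that failed to decode; anything that decoded "successfully" under the wrong codec is left untouched.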


If you knew that much, why didn't you use
(write, if necessary) an appropriate codec?  I can't envision this
being useful.


If the data format describes its encoding, possibly containing data from
several encodings in various spots, then perhaps it is best read as
binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then
when the encoding is determined, to transcode the data. Previously, a
proposal was made to reverse the surrogate escapes to the original
bytes, and then apply the (now known) appropriate codec. There are not
appropriate codecs that can convert directly from surrogate escapes to
the desired end result. This technique could be used instead, for
single-byte, non-escaped encodings. On the other hand, writing specialty
codecs for the purpose would be more general.
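
The earlier "reverse the escapes, then apply the known codec" proposal can be sketched as below. The helper name `redecode` and the sample data are hypothetical; the sketch assumes the text was originally decoded as ASCII with surrogateescape.

```python
def redecode(s, encoding, original="ascii"):
    """Recover the original bytes from a surrogate-escaped string,
    then decode them with the encoding that is now known."""
    raw = s.encode(original, "surrogateescape")
    return raw.decode(encoding)


data = "name: Привет".encode("koi8_r")
text = data.decode("ascii", "surrogateescape")   # encoding not yet known
fixed = redecode(text, "koi8_r")                 # encoding discovered later
```

The round trip through bytes is lossless only if `original` is the same codec that was used for the first decode.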


There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.
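
A small illustration of this point, using sample data chosen for the example: UTF-8 bytes decoded as Latin-1 never fail, so no surrogate escapes are produced, yet the result is still wrong.

```python
raw = "café".encode("utf-8")                       # b'caf\xc3\xa9'
wrong = raw.decode("latin-1", "surrogateescape")   # decodes without any error

# No escape was needed, so there is nothing for a "replace escapes" step to find.
has_escapes = any(0xDC80 <= ord(c) <= 0xDCFF for c in wrong)
```

Here `wrong` is the mojibake string 'cafÃ©' and `has_escapes` is False.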
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

2014-08-28 Thread Nick Coghlan
On 26 Aug 2014 21:34, "MRAB"  wrote:
>
> On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>
>> Nick Coghlan writes:
>>
>>   > "purge_surrogate_escapes" was the other term that occurred to me.
>>
>> "purge" suggests removal, not replacement.  That may be useful too.
>>
>> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
>>
> How about:
>
> replace_surrogate_escapes(s, replacement='\uFFFD')
>
> If you want them removed, just pass an empty string as the replacement.

The current proposal on the issue tracker is to instead take advantage of
the existing error handlers:

def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)
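
A short usage sketch of the function as proposed above, with made-up sample bytes. It shows the two behaviours discussed in the thread: replacement (the default) and removal (via the existing 'ignore' handler).

```python
def convert_surrogateescape(data, errors="replace"):
    # Turn the escapes back into the original bytes, then decode again
    # with a handler that does not produce surrogates.
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)


escaped = b"abc\xffdef".decode("utf-8", "surrogateescape")  # 'abc\udcffdef'

replaced = convert_surrogateescape(escaped)            # U+FFFD replaces the escape
removed = convert_surrogateescape(escaped, "ignore")   # the escape is dropped
```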

I also filed a separate RFE to make backslashreplace usable on input, since
that allows the option of separating the replacement operation from the
encoding operation.
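
For readers on later releases: Python 3.5 and newer accept backslashreplace when decoding, so the idea can be tried directly. The sample bytes below are invented for illustration.

```python
raw = b"abc\xffdef"

# Decoding with backslashreplace keeps the undecodable byte visible as \xNN.
shown = raw.decode("utf-8", "backslashreplace")

# The same handler can clean up a string that already holds surrogate escapes.
escaped = raw.decode("utf-8", "surrogateescape")
cleaned = escaped.encode("utf-8", "surrogateescape").decode("utf-8", "backslashreplace")
```

Both `shown` and `cleaned` contain the literal four characters `\xff` where the bad byte was, and both are pure ASCII.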

Cheers,
Nick.


Re: [Python-Dev] Windows Unicode console support [Was: Bytes path support]

2014-08-28 Thread Paul Moore
On 27 August 2014 10:46, Paul Moore  wrote:
> If I come up with anything worth commenting on, I will do so (I assume
> that comments of the form "+1 me too!" are not needed ;-))

Nevertheless, here's a "Me, too". I've just been writing some PyPI
interrogation scripts, and it's absolutely awful having to deal with
random encoding errors in the output. Being able to just print
*anything* is a HUGE benefit. This is how sys.stdout should behave -
presumably the Unix guys are now all rolling their eyes and saying
"but it does - just use a proper OS" :-)

Enlightened-ly y'rs,
Paul


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-28 Thread Glenn Linderman

On 8/28/2014 12:30 AM, MRAB wrote:

On 2014-08-28 05:56, Glenn Linderman wrote:

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

Glenn Linderman writes:
  > On 8/26/2014 4:31 AM, MRAB wrote:
  > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
  > >> Nick Coghlan writes:

  > > How about:
  > >
  > > replace_surrogate_escapes(s, replacement='\uFFFD')
  > >
  > > If you want them removed, just pass an empty string as the
  > > replacement.

That seems better to me (I had too much C for breakfast, I think).

  > And further, replacement could be a vector of 128 characters, to do
  > immediate transcoding,

Using what encoding?


The vector would contain the transcoding. Each lone surrogate would map
to a character in the vector.


If you knew that much, why didn't you use
(write, if necessary) an appropriate codec?  I can't envision this
being useful.


If the data format describes its encoding, possibly containing data from
several encodings in various spots, then perhaps it is best read as
binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then
when the encoding is determined, to transcode the data. Previously, a
proposal was made to reverse the surrogate escapes to the original
bytes, and then apply the (now known) appropriate codec. There are not
appropriate codecs that can convert directly from surrogate escapes to
the desired end result. This technique could be used instead, for
single-byte, non-escaped encodings. On the other hand, writing specialty
codecs for the purpose would be more general.


There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.


Aha! Thanks for pointing out the flaw in my reasoning. But that means it 
is also pretty useless to "replace_surrogate_escapes" at all, because it 
only cleans out the non-decodable characters, not the incorrectly 
decoded characters.


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-28 Thread R. David Murray
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > On 2014-08-28 05:56, Glenn Linderman wrote:
> >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> >>> Glenn Linderman writes:
> >>>   > On 8/26/2014 4:31 AM, MRAB wrote:
> >>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >>>   > >> Nick Coghlan writes:
> >>>
> >>>   > > How about:
> >>>   > >
> >>>   > > replace_surrogate_escapes(s, replacement='\uFFFD')
> >>>   > >
> >>>   > > If you want them removed, just pass an empty string as the
> >>>   > > replacement.
> >>>
> >>> That seems better to me (I had too much C for breakfast, I think).
> >>>
> >>>   > And further, replacement could be a vector of 128 characters, to do
> >>>   > immediate transcoding,
> >>>
> >>> Using what encoding?
> >>
> >> The vector would contain the transcoding. Each lone surrogate would map
> >> to a character in the vector.
> >>
> >>> If you knew that much, why didn't you use
> >>> (write, if necessary) an appropriate codec?  I can't envision this
> >>> being useful.
> >>
> >> If the data format describes its encoding, possibly containing data from
> >> several encodings in various spots, then perhaps it is best read as
> >> binary, and processed as binary until those definitions are found.
> >>
> >> But an alternative would be to read with surrogate escapes, and then
> >> when the encoding is determined, to transcode the data. Previously, a
> >> proposal was made to reverse the surrogate escapes to the original
> >> bytes, and then apply the (now known) appropriate codec. There are not
> >> appropriate codecs that can convert directly from surrogate escapes to
> >> the desired end result. This technique could be used instead, for
> >> single-byte, non-escaped encodings. On the other hand, writing specialty
> >> codecs for the purpose would be more general.
> >>
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
> 
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it 
> is also pretty useless to "replace_surrogate_escapes" at all, because it 
> only cleans out the non-decodable characters, not the incorrectly 
> decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.  Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes.  Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.

--David


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-28 Thread Glenn Linderman

On 8/28/2014 10:41 AM, R. David Murray wrote:

On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:

On 8/28/2014 12:30 AM, MRAB wrote:

On 2014-08-28 05:56, Glenn Linderman wrote:

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

Glenn Linderman writes:
   > On 8/26/2014 4:31 AM, MRAB wrote:
   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
   > >> Nick Coghlan writes:

   > > How about:
   > >
   > > replace_surrogate_escapes(s, replacement='\uFFFD')
   > >
   > > If you want them removed, just pass an empty string as the
   > > replacement.

That seems better to me (I had too much C for breakfast, I think).

   > And further, replacement could be a vector of 128 characters, to do
   > immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map
to a character in the vector.


If you knew that much, why didn't you use
(write, if necessary) an appropriate codec?  I can't envision this
being useful.

If the data format describes its encoding, possibly containing data from
several encodings in various spots, then perhaps it is best read as
binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then
when the encoding is determined, to transcode the data. Previously, a
proposal was made to reverse the surrogate escapes to the original
bytes, and then apply the (now known) appropriate codec. There are not
appropriate codecs that can convert directly from surrogate escapes to
the desired end result. This technique could be used instead, for
single-byte, non-escaped encodings. On the other hand, writing specialty
codecs for the purpose would be more general.


There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it
is also pretty useless to "replace_surrogate_escapes" at all, because it
only cleans out the non-decodable characters, not the incorrectly
decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.


How?


Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes.  Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.

Sure, people can learn to recognize mojibake for what it is, and maybe 
even learn to recognize it for what it was intended to be, in limited 
domains. But suppressing/replacing the surrogates doesn't help with 
that... would it not be better to replace the surrogates with an escape 
sequence that shows the original, undecodable, byte value?  Like  \xNN ?


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-28 Thread R. David Murray
On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman wrote:
> On 8/28/2014 10:41 AM, R. David Murray wrote:
> > On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
> >> On 8/28/2014 12:30 AM, MRAB wrote:
> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just
> >>> because a byte could be decoded, it doesn't mean that it's correct.
> >>>
> >>> If you picked the wrong encoding, the other codepoints could be wrong
> >>> too.
> >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> >> is also pretty useless to "replace_surrogate_escapes" at all, because it
> >> only cleans out the non-decodable characters, not the incorrectly
> >> decoded characters.
> > Well, replace would still be useful for ASCII+surrogateescape.
> 
> How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part,
so all undecodable bytes turning into 'unrecognized character' glyphs
is useful. "can't" is in quotes because of course if you decode random
binary data as ASCII+surrogate escape you could get a mess just like any
other encoding, so this is really a "more *likely* to be useful" version
of my second point, because "real" ASCII with some junk bytes mixed in
is much more likely to be encountered in the wild than, say, utf-8 with
some junk bytes mixed in (although that is probably changing as use of utf-8
becomes more widespread, so this point applies to utf-8 as well).
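
A sketch of the ASCII+surrogateescape case, with invented sample bytes: the ASCII part is taken at face value, and only the stray high bytes end up as 'unrecognized character' glyphs.

```python
raw = b"Subject: hello\x80\xfe world"   # "real" ASCII with two junk bytes

text = raw.decode("ascii", "surrogateescape")   # junk bytes become U+DC80, U+DCFE

# Round-trip through bytes and decode again, letting 'replace' mark the junk.
cleaned = text.encode("ascii", "surrogateescape").decode("ascii", "replace")
```

Each junk byte becomes one U+FFFD in `cleaned`; the surrounding ASCII text is unchanged.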

> > Also for
> > cases where the data stream is *supposed* to be in a given encoding, but
> > contains undecodable bytes.  Showing the stuff that incorrectly decodes
> > as whatever it decodes to is generally what you want in that case.
>
> Sure, people can learn to recognize mojibake for what it is, and maybe 
> even learn to recognize it for what it was intended to be, in limited 
> domains. But suppressing/replacing the surrogates doesn't help with 

Well, it does if the alternative is not being able to display the string
to the user at all.  And yeah, people being able to recognize mojibake
in specific problem domains is what I'm talking about...not perhaps a
great use case, but it is a use case.

> that... would it not be better to replace the surrogates with an escape 
> sequence that shows the original, undecodable, byte value?  Like  \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be
more useful than the 'unknown character' glyph.  I've also seen fonts
that display the hex code inside a box character when the code point is
unknown, which would be cool...but that can hardly be part of unicode,
can it? :)
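
A minimal sketch of the \xNN idea for Python versions where backslashreplace cannot be used when decoding. The helper name `show_escaped_bytes` is hypothetical; it assumes the surrogates came from surrogateescape, so each one is U+DC00 plus the original byte value.

```python
import re

# Lone surrogates produced by surrogateescape lie in U+DC80..U+DCFF.
_ESCAPES = re.compile("[\udc80-\udcff]")


def show_escaped_bytes(s):
    """Replace each surrogate escape with a visible \\xNN sequence."""
    return _ESCAPES.sub(lambda m: "\\x%02x" % (ord(m.group()) - 0xDC00), s)


escaped = b"abc\xffdef".decode("ascii", "surrogateescape")
visible = show_escaped_bytes(escaped)
```

The result is plain ASCII, so it can be displayed anywhere, and the original byte value stays readable.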

--David


[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

2014-08-28 Thread Stephen J. Turnbull
Nick Coghlan writes:

 > The current proposal on the issue tracker is to instead take advantage of
 > the existing error handlers:
 > 
 > def convert_surrogateescape(data, errors='replace'):
 >     return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
 > 
 > That code is short, but semantically dense

And it doesn't implement your original suggestion of replacement with
'?' (and another possibility for history buffs is 0x1A, ASCII SUB).  At
least, AFAICT from the docs there's no way to specify the replacement
character; decoding always uses U+FFFD.  (If I knew how to do that, I
would have suggested this.)
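
One possible route, offered as a sketch rather than as part of the proposal: the built-in 'replace' handler has a fixed replacement, but `codecs.register_error` allows a custom handler to be registered. The handler name 'qmark' and the function name are hypothetical.

```python
import codecs


def question_mark_handler(exc):
    """Substitute '?' for each undecodable byte when decoding."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position to resume decoding at.
        return ("?" * (exc.end - exc.start), exc.end)
    raise exc


codecs.register_error("qmark", question_mark_handler)

decoded = b"ab\xffcd".decode("ascii", "qmark")
```

The same handler name could then be passed as `errors` to a function like convert_surrogateescape.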

 > (Added bonus: once you're alerted to the possibility, it's trivial
 > to write your own version for existing Python 3 versions.

I'm not sure that's true.  At least, to me that code was obvious -- I
got the exact definition (except for the function name) on the first
try -- but I ruled it out because it didn't implement your suggestion
of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the
misconception that en/decoding streams is expensive, or the
misconception that Python's internal representation of text as an
array of code points (rather than an array of "characters" or
"grapheme clusters") is somehow insufficient for text processing.

Steve


[Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...]

2014-08-28 Thread Stephen J. Turnbull
In the process of booking up for my other post in this thread, I
noticed the 'surrogatepass' handler.

Is there a real use case for the 'surrogatepass' error handler?  It
seems like a horrible break in the abstraction.  IMHO, if there's a
need, the application should handle this.  Python shouldn't provide
it on encoding as the resulting streams are not Unicode conformant,
nor on decoding UTF-16, as conversion of surrogate pairs is a
requirement of all Unicode versions since about 1995.
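
For context, a short sketch of what the handler does with the UTF-8 codec. The sample code point is arbitrary.

```python
lone = "\ud800"   # a lone high surrogate, not valid in well-formed Unicode text

# The strict UTF-8 encoder refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass the surrogate is encoded as if it were an ordinary code point.
passed = lone.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass")
```

The bytes in `passed` are exactly the non-conformant output the message objects to: a strict UTF-8 decoder rejects them.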

Steve



Re: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

2014-08-28 Thread Nick Coghlan
On 29 August 2014 10:32, Stephen J. Turnbull  wrote:
> Nick Coghlan writes:
>
>  > The current proposal on the issue tracker is to instead take advantage of
>  > the existing error handlers:
>  >
>  > def convert_surrogateescape(data, errors='replace'):
>  >     return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
>  >
>  > That code is short, but semantically dense
>
> And it doesn't implement your original suggestion of replacement with
> '?' (and another possibility for history buffs is 0x1A, ASCII SUB).  At
> least, AFAICT from the docs there's no way to specify the replacement
> character; decoding always uses U+FFFD.  (If I knew how to do that, I
> would have suggested this.)

If that actually matters in a given context, I can do an ordinary
string replacement later. I couldn't think of a case where it actually
mattered though - if "must be ASCII" was a requirement, then
backslashreplace was a suitable alternative that lost less information
(hence the RFE to make that also usable on input).
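
The "ordinary string replacement later" step might look like this; the sample bytes are invented, and the function is repeated so the snippet stands alone.

```python
def convert_surrogateescape(data, errors="replace"):
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)


escaped = b"abc\xffdef".decode("utf-8", "surrogateescape")

# First turn the escapes into U+FFFD, then swap in whatever character is wanted.
with_question_marks = convert_surrogateescape(escaped).replace("\ufffd", "?")
```

One caveat with this approach: any genuine U+FFFD already present in the text is replaced as well.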

>  > (Added bonus: once you're alerted to the possibility, it's trivial
>  > to write your own version for existing Python 3 versions.
>
> I'm not sure that's true.  At least, to me that code was obvious -- I
> got the exact definition (except for the function name) on the first
> try -- but I ruled it out because it didn't implement your suggestion
> of replacement with '?', even as an option.

Yeah, part of the tracker discussion involved me realising that part
wasn't a necessary requirement - the key is being able to get rid of
the surrogates, or replace them with something readily identifiable,
and less about being able to control exactly what they get replaced
by.

> OTOH, I think a lot of the resistance to codec-based solutions is the
> misconception that en/decoding streams is expensive, or the
> misconception that Python's internal representation of text as an
> array of code points (rather than an array of "characters" or
> "grapheme clusters") is somehow insufficient for text processing.

We don't actually have any technical deep dives into how Python 3's
text handling works readily available online, so there's a lot of
speculation and misinformation floating around. My recent article
gives the high level context, but it really needs to be paired up with
a piece (or pieces) that go deep into the details of codec
optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE
Windows APIs, how the internal storage structure is determined at
allocation time, how it maintains compatibility with the legacy C
extension APIs, etc. The only current widely distributed articles on
those topics are written from a perspective that assumes we don't know
anything about Unicode, and are just making things unnecessarily
complicated (rather than solving hard cross platform compatibility and
text processing performance problems). That perspective is incorrect,
but "trust me, they're wrong" doesn't work very well with people that
are already angry.

Text manipulation is one of the most sophisticated subsystems in the
interpreter, though, so it's hard to know where to start on such a
series (and easy to get intimidated by the sheer magnitude of the work
involved in doing it right).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia