Re: [Python-Dev] Bytes path related questions for Guido
On 2014-08-28 05:56, Glenn Linderman wrote:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:
>>  > On 8/26/2014 4:31 AM, MRAB wrote:
>>  > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>  > >> Nick Coghlan writes:
>>
>>  > > How about:
>>  > >
>>  > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>  > >
>>  > > If you want them removed, just pass an empty string as the
>>  > > replacement.
>>
>> That seems better to me (I had too much C for breakfast, I think).
>>
>>  > And further, replacement could be a vector of 128 characters, to do
>>  > immediate transcoding,
>>
>> Using what encoding?
>
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
>
>> If you knew that much, why didn't you use (write, if necessary) an
>> appropriate codec? I can't envision this being useful.
>
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
>
> But an alternative would be to read with surrogate escapes, and then
> when the encoding is determined, to transcode the data. Previously, a
> proposal was made to reverse the surrogate escapes to the original
> bytes, and then apply the (now known) appropriate codec. There are not
> appropriate codecs that can convert directly from surrogate escapes to
> the desired end result. This technique could be used instead, for
> single-byte, non-escaped encodings. On the other hand, writing specialty
> codecs for the purpose would be more general.

There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong
too.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
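For readers trying to picture the "vector of 128 characters" idea quoted above, here is a rough sketch using str.translate. The function name transcode_surrogate_escapes and the use of Latin-1 to build the table are assumptions made for illustration; nothing like this was actually proposed as an API. It relies on surrogateescape mapping an undecodable byte 0xNN to the code point U+DC00 + 0xNN, so bytes 0x80..0xFF land on U+DC80..U+DCFF.

```python
def transcode_surrogate_escapes(text, vector):
    """Map each surrogate escape U+DC80+i to vector[i] (hypothetical helper)."""
    if len(vector) != 128:
        raise ValueError("vector must contain exactly 128 characters")
    table = {0xDC80 + i: ch for i, ch in enumerate(vector)}
    return text.translate(table)


# Assumed example table: the upper half of Latin-1 (U+0080..U+00FF).
latin1_upper = bytes(range(0x80, 0x100)).decode('latin-1')

raw = b'caf\xe9'
escaped = raw.decode('ascii', 'surrogateescape')    # 'caf\udce9'
transcoded = transcode_surrogate_escapes(escaped, latin1_upper)

# The previously proposed alternative: reverse the escapes to the
# original bytes, then apply the (now known) codec.
roundtrip = escaped.encode('ascii', 'surrogateescape').decode('latin-1')
```

Both routes give the same string here. As the reply above notes, either one only repairs bytes that failed to decode; bytes that decoded "successfully" under the wrong codec are left as they were.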
[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)
On 26 Aug 2014 21:34, "MRAB" wrote:
>
> On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>
>> Nick Coghlan writes:
>>
>>  > "purge_surrogate_escapes" was the other term that occurred to me.
>>
>> "purge" suggests removal, not replacement. That may be useful too.
>>
>> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
>>
> How about:
>
>     replace_surrogate_escapes(s, replacement='\uFFFD')
>
> If you want them removed, just pass an empty string as the replacement.

The current proposal on the issue tracker is to instead take advantage of
the existing error handlers:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work.)

I also filed a separate RFE to make backslashreplace usable on input,
since that allows the option of separating the replacement operation from
the encoding operation.

Cheers,
Nick.
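A short demonstration of the one-liner above may help, since it is, as Nick says, semantically dense. The function body is taken from the message; the sample bytes and variable names are made up for illustration.

```python
def convert_surrogateescape(data, errors='replace'):
    # Step 1: encode with 'surrogateescape' to turn each escape back
    #         into the original undecodable byte.
    # Step 2: decode again, letting the chosen error handler deal with
    #         those bytes.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)


raw = b'caf\xe9.txt'                              # Latin-1 bytes, invalid as UTF-8
name = raw.decode('utf-8', 'surrogateescape')     # 'caf\udce9.txt'

replaced = convert_surrogateescape(name)             # escapes become U+FFFD
removed = convert_surrogateescape(name, 'ignore')    # escapes dropped
untouched = convert_surrogateescape('caf\xe9.txt')   # clean text is unchanged
```

One consequence of going through the codec: if a string was decoded with a codec other than UTF-8 (say ASCII), adjacent escapes that happen to form a valid UTF-8 sequence will come back as a real character rather than being replaced.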
Re: [Python-Dev] Windows Unicode console support [Was: Bytes path support]
On 27 August 2014 10:46, Paul Moore wrote:
> If I come up with anything worth commenting on, I will do so (I assume
> that comments of the form "+1 me too!" are not needed ;-))

Nevertheless, here's a "Me, too".

I've just been writing some PyPI interrogation scripts, and it's
absolutely awful having to deal with random encoding errors in the
output. Being able to just print *anything* is a HUGE benefit. This is
how sys.stdout should behave - presumably the Unix guys are now all
rolling their eyes and saying "but it does - just use a proper OS" :-)

Enlightened-ly y'rs,
Paul
Re: [Python-Dev] Bytes path related questions for Guido
On 8/28/2014 12:30 AM, MRAB wrote:
> There'll be a surrogate escape if a byte couldn't be decoded, but just
> because a byte could be decoded, it doesn't mean that it's correct.
>
> If you picked the wrong encoding, the other codepoints could be wrong
> too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it
is also pretty useless to "replace_surrogate_escapes" at all, because it
only cleans out the non-decodable characters, not the incorrectly
decoded characters.
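The distinction between "non-decodable" and "incorrectly decoded" can be shown in a few lines. This is only an illustrative sketch; the helper name has_surrogate_escapes is invented here.

```python
def has_surrogate_escapes(text):
    """True if text contains any code point in U+DC80..U+DCFF (hypothetical helper)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


raw = 'caf\xe9'.encode('utf-8')                    # b'caf\xc3\xa9'

# ASCII cannot decode the two high bytes, so they become escapes.
as_ascii = raw.decode('ascii', 'surrogateescape')  # 'caf\udcc3\udca9'

# Latin-1 decodes every byte, so the wrong guess leaves no escapes,
# only mojibake ('caf' followed by U+00C3 U+00A9).
as_latin1 = raw.decode('latin-1')
```

Cleaning the escapes out of as_ascii does something visible; there is nothing to clean in as_latin1, even though it is just as wrong.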
Re: [Python-Dev] Bytes path related questions for Guido
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
>
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> is also pretty useless to "replace_surrogate_escapes" at all, because it
> only cleans out the non-decodable characters, not the incorrectly
> decoded characters.

Well, replace would still be useful for ASCII+surrogateescape. Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes. Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.

--David
Re: [Python-Dev] Bytes path related questions for Guido
On 8/28/2014 10:41 AM, R. David Murray wrote:
> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
>> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
>> is also pretty useless to "replace_surrogate_escapes" at all, because it
>> only cleans out the non-decodable characters, not the incorrectly
>> decoded characters.
>
> Well, replace would still be useful for ASCII+surrogateescape.

How?

> Also for cases where the data stream is *supposed* to be in a given
> encoding, but contains undecodable bytes. Showing the stuff that
> incorrectly decodes as whatever it decodes to is generally what you
> want in that case.

Sure, people can learn to recognize mojibake for what it is, and maybe
even learn to recognize it for what it was intended to be, in limited
domains. But suppressing/replacing the surrogates doesn't help with
that... would it not be better to replace the surrogates with an escape
sequence that shows the original, undecodable, byte value? Like \xNN ?
Re: [Python-Dev] Bytes path related questions for Guido
On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman wrote:
> On 8/28/2014 10:41 AM, R. David Murray wrote:
> > On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:
> >> On 8/28/2014 12:30 AM, MRAB wrote:
> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just
> >>> because a byte could be decoded, it doesn't mean that it's correct.
> >>>
> >>> If you picked the wrong encoding, the other codepoints could be wrong
> >>> too.
> >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> >> is also pretty useless to "replace_surrogate_escapes" at all, because it
> >> only cleans out the non-decodable characters, not the incorrectly
> >> decoded characters.
> > Well, replace would still be useful for ASCII+surrogateescape.
>
> How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part,
so all undecodable bytes turning into 'unrecognized character' glyphs is
useful. "can't" is in quotes because of course if you decode random
binary data as ASCII+surrogate escape you could get a mess just like any
other encoding, so this is really a "more *likely* to be useful" version
of my second point, because "real" ASCII with some junk bytes mixed in is
much more likely to be encountered in the wild than, say, utf-8 with some
junk bytes mixed in (although that is probably changing as use of utf-8
becomes more widespread, so this point applies to utf-8 as well).

> > Also for
> > cases where the data stream is *supposed* to be in a given encoding, but
> > contains undecodable bytes. Showing the stuff that incorrectly decodes
> > as whatever it decodes to is generally what you want in that case.
>
> Sure, people can learn to recognize mojibake for what it is, and maybe
> even learn to recognize it for what it was intended to be, in limited
> domains. But suppressing/replacing the surrogates doesn't help with

Well, it does if the alternative is not being able to display the string
to the user at all. And yeah, people being able to recognize mojibake in
specific problem domains is what I'm talking about...not perhaps a great
use case, but it is a use case.

> that... would it not be better to replace the surrogates with an escape
> sequence that shows the original, undecodable, byte value? Like \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be
more useful than the 'unknown character' glyph. I've also seen fonts
that display the hex code inside a box character when the code point is
unknown, which would be cool...but that can hardly be part of unicode,
can it? :)

--David
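The \xNN idea discussed above could look something like the following. It is a sketch under assumptions, not an existing API: the name escape_surrogate_bytes is invented, and the output format simply mirrors Python's own bytes-literal notation.

```python
def escape_surrogate_bytes(text):
    """Show each surrogate escape as \\xNN, where NN is the original byte.

    Hypothetical helper: surrogateescape stores byte 0xNN as U+DC00 + 0xNN,
    so subtracting 0xDC00 recovers the byte value.
    """
    return ''.join(
        '\\x%02x' % (ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


line = b'From: Ren\xe9 <rene@example.invalid>'
decoded = line.decode('ascii', 'surrogateescape')
shown = escape_surrogate_bytes(decoded)
```

The result is displayable anywhere ASCII is, and the reader can still see which byte was there. The obvious cost is ambiguity: a literal backslash followed by "xe9" in the original text would look identical.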
[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)
Nick Coghlan writes:

 > The current proposal on the issue tracker is to instead take advantage of
 > the existing error handlers:
 >
 >     def convert_surrogateescape(data, errors='replace'):
 >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
 >
 > That code is short, but semantically dense

And it doesn't implement your original suggestion of replacement with
'?' (and another possibility for history buffs is 0x1A, ASCII SUB). At
least, AFAICT from the docs there's no way to specify the replacement
character; decoding always uses U+FFFD. (If I knew how to do that, I
would have suggested this.)

 > (Added bonus: once you're alerted to the possibility, it's trivial
 > to write your own version for existing Python 3 versions.

I'm not sure that's true. At least, to me that code was obvious -- I
got the exact definition (except for the function name) on the first
try -- but I ruled it out because it didn't implement your suggestion
of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the
misconception that en/decoding streams is expensive, or the
misconception that Python's internal representation of text as an
array of code points (rather than an array of "characters" or
"grapheme clusters") is somehow insufficient for text processing.

Steve
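For comparison, a version that does let the caller pick the replacement, in the spirit of MRAB's earlier replace_surrogate_escapes(s, replacement='\uFFFD') signature, can be sketched without going through a codec at all. The implementation below is an assumption for illustration; only the name and signature come from the thread.

```python
# Code points that surrogateescape produces for bytes 0x80..0xFF.
SURROGATE_ESCAPES = range(0xDC80, 0xDD00)


def replace_surrogate_escapes(s, replacement='\uFFFD'):
    """Replace every surrogate escape in s with the given string."""
    return s.translate(dict.fromkeys(SURROGATE_ESCAPES, replacement))


s = b'a\xffb'.decode('ascii', 'surrogateescape')      # 'a\udcffb'

default = replace_surrogate_escapes(s)                # U+FFFD
question = replace_surrogate_escapes(s, '?')          # '?'
ascii_sub = replace_surrogate_escapes(s, '\x1a')      # 0x1A, ASCII SUB
dropped = replace_surrogate_escapes(s, '')            # removed entirely
```

Unlike the encode/decode round trip, this treats every escape independently and never reinterprets adjacent escapes as a multi-byte sequence. Whether that is the better behaviour probably depends on the use case.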
[Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...]
In the process of booking up for my other post in this thread, I noticed
the 'surrogatepass' handler.

Is there a real use case for the 'surrogatepass' error handler? It
seems like a horrible break in the abstraction. IMHO, if there's a
need, the application should handle this. Python shouldn't provide it
on encoding as the resulting streams are not Unicode conformant, nor on
decoding UTF-16, as conversion of surrogate pairs is a requirement of
all Unicode versions since about 1995.

Steve
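For anyone who has not met the handler being questioned here, this is roughly what it does with the UTF-8 codec. The snippet is only a demonstration of existing behaviour; the variable names are arbitrary.

```python
lone = '\ud800'          # a lone high surrogate, not valid in conformant UTF-8

# The default 'strict' handler refuses to encode it.
try:
    lone.encode('utf-8')
    strict_encodes = True
except UnicodeEncodeError:
    strict_encodes = False

# 'surrogatepass' writes the surrogate out as a three-byte sequence ...
passed = lone.encode('utf-8', 'surrogatepass')

# ... which a strict UTF-8 decoder then rejects ...
try:
    passed.decode('utf-8')
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# ... and only 'surrogatepass' will read back.
restored = passed.decode('utf-8', 'surrogatepass')
```

This is the non-conformance Steve is pointing at: the bytes produced are only readable by a decoder that has been told to tolerate them.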
Re: [Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)
On 29 August 2014 10:32, Stephen J. Turnbull wrote:
> Nick Coghlan writes:
>
>  > The current proposal on the issue tracker is to instead take advantage of
>  > the existing error handlers:
>  >
>  >     def convert_surrogateescape(data, errors='replace'):
>  >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
>  >
>  > That code is short, but semantically dense
>
> And it doesn't implement your original suggestion of replacement with
> '?' (and another possibility for history buffs is 0x1A, ASCII SUB). At
> least, AFAICT from the docs there's no way to specify the replacement
> character; decoding always uses U+FFFD. (If I knew how to do that, I
> would have suggested this.)

If that actually matters in a given context, I can do an ordinary
string replacement later. I couldn't think of a case where it actually
mattered though - if "must be ASCII" was a requirement, then
backslashreplace was a suitable alternative that lost less information
(hence the RFE to make that also usable on input).

>  > (Added bonus: once you're alerted to the possibility, it's trivial
>  > to write your own version for existing Python 3 versions.
>
> I'm not sure that's true. At least, to me that code was obvious -- I
> got the exact definition (except for the function name) on the first
> try -- but I ruled it out because it didn't implement your suggestion
> of replacement with '?', even as an option.

Yeah, part of the tracker discussion involved me realising that part
wasn't a necessary requirement - the key is being able to get rid of
the surrogates, or replace them with something readily identifiable,
and less about being able to control exactly what they get replaced by.

> OTOH, I think a lot of the resistance to codec-based solutions is the
> misconception that en/decoding streams is expensive, or the
> misconception that Python's internal representation of text as an
> array of code points (rather than an array of "characters" or
> "grapheme clusters") is somehow insufficient for text processing.

We don't actually have any technical deep dives into how Python 3's
text handling works readily available online, so there's a lot of
speculation and misinformation floating around. My recent article gives
the high level context, but it really needs to be paired up with a
piece (or pieces) that go deep into the details of codec optimisation,
the UTF-8 caching, how it integrates with the UTF-16-LE Windows APIs,
how the internal storage structure is determined at allocation time,
how it maintains compatibility with the legacy C extension APIs, etc.

The only current widely distributed articles on those topics are
written from a perspective that assumes we don't know anything about
Unicode, and are just making things unnecessarily complicated (rather
than solving hard cross platform compatibility and text processing
performance problems). That perspective is incorrect, but "trust me,
they're wrong" doesn't work very well with people that are already
angry.

Text manipulation is one of the most sophisticated subsystems in the
interpreter, though, so it's hard to know where to start on such a
series (and easy to get intimidated by the sheer magnitude of the work
involved in doing it right).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia