On Thu, Oct 11, 2012 at 5:04 AM, Dave Angel <d...@davea.name> wrote:
>
> Actually, the upper limit for a decoded utf-8 character is at least 6
> bytes. I think it's 6, but it's no less than 6.

Yes, but what would be the point? Unicode only has 17 planes, up to code
point 0x10ffff. It's limited by UTF-16, so valid UTF-8 never needs more
than 4 bytes per code point.

> 2) There are many more byte formats, most of them predating Unicode
> entirely. Many of these are specific to a particular language or
> national environment, and contain just those extensions to ASCII that
> the particular language deems useful. Python provides encoders and
> decoders to many of these as well.

I mentioned 3 common formats that can completely represent Unicode since
this thread is mostly about Python 3 strings and repr -- at least it
started that way.

> 3) There are many things read and written in byte format that have no
> relationship to characters. The notion of using text formats for all
> data (eg. xml) is a fairly recent one. Binary files are quite common,
> and many devices require binary transfers to work at all. So byte
> strings are not necessarily strings at all.

Sure, other than encoded strings, there are also more obvious examples
of data represented as bytes -- at least I hope they're obvious -- such
as multimedia audio/video/images, sensor data, spreadsheets, and so on.
In main memory these exist as data structures/objects (bytes, but not
generally in a form suitable for transmission or storage). Before being
saved to files or network streams, the data is serialized and packed as
a byte stream (e.g. with the struct module, or with pickle, which
defaults to a binary protocol in Python 3), possibly compressed to a
smaller size with a checksum added for error detection (e.g. the gzip
module, which stores a CRC-32), and possibly encrypted for security
(e.g. PyCrypto).
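
Here is a minimal sketch of both points, assuming Python 3.2 or later
(for gzip.compress). The function names and the record layout (a 4-byte
count followed by little-endian doubles) are made up for illustration.
They are not part of any standard format. The first function measures
how many bytes one character takes in three encodings that cover all of
Unicode. The other two pack a list of floats with struct, compress it
with gzip, and reverse the process.

```python
import gzip
import struct

# Highest Unicode code point (the end of plane 16).
MAX_CODE_POINT = 0x10FFFF


def encoded_lengths(ch):
    """Return the byte length of ch in three encodings that cover Unicode."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


def pack_readings(readings):
    """Serialize floats as a 4-byte count plus doubles, then gzip the result.

    The layout is a hypothetical one chosen for this example.
    """
    payload = struct.pack("<I%dd" % len(readings), len(readings), *readings)
    return gzip.compress(payload)


def unpack_readings(blob):
    """Reverse pack_readings and return the list of floats."""
    # gzip.decompress raises an error if the stored CRC-32 does not match.
    payload = gzip.decompress(blob)
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from("<%dd" % count, payload, 4))


if __name__ == "__main__":
    print(encoded_lengths(chr(MAX_CODE_POINT)))
    readings = [20.5, 21.0, 19.75]
    blob = pack_readings(readings)
    print(type(blob), unpack_readings(blob) == readings)
```

Even the highest code point takes only 4 bytes in UTF-8, and the packed
result is a plain bytes object with no relationship to characters.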