Re: [Tutor] encode

Python Tue, 18 Apr 2006 20:42:34 -0700

On Wed, 2006-04-19 at 10:10 +0700, kakada wrote:
> Hi again folks,
> 
> I wonder if we can check the encoding of text in one text file.
> user is free to encode the file whether Latin1, utf-8, ANSI...


> Any ideas?

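# Codec names as understood by decode().
ASCII, UTF8, LATIN = 'ascii', 'utf-8', 'latin-1'

def decode_file(filepath):
    '''Return (text, charset) for the first codec that decodes the file.

    Order of codecs is important.
    ASCII is most restrictive to decode - no byte values > 127.
    UTF8 is next most restrictive.  There are illegal byte values and
    illegal sequences.
    LATIN will accept anything since all 256 byte values are OK.
    The final decision still depends on human inspection.
    '''
    # Read raw bytes; the with block closes the file promptly.
    with open(filepath, 'rb') as f:
        buff = f.read()
    for charset in (ASCII, UTF8, LATIN):
        try:
            unistr = buff.decode(charset, 'strict')
        except UnicodeDecodeError:
            pass
        else:
            break
    else:
        # Not reached while LATIN is last; a guard in case the list changes.
        unistr, charset = u'', None
    return unistr, charset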
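
To see why the order matters, here is a minimal sketch of the same cascade applied to in-memory bytes instead of a file. The name guess_decode and the sample byte strings are made up for illustration, and it assumes Python 3.

```python
def guess_decode(data, charsets=("ascii", "utf-8", "latin-1")):
    """Try each codec in order; return (text, charset) for the first that works."""
    for charset in charsets:
        try:
            return data.decode(charset, "strict"), charset
        except UnicodeDecodeError:
            continue
    # Only reached if no codec in the list accepts the bytes.
    return "", None

samples = [
    b"plain text",                # only byte values below 128
    "caf\u00e9".encode("utf-8"),  # e-acute stored as two UTF-8 bytes
    b"caf\xe9",                   # e-acute as one Latin-1 byte, invalid as UTF-8
]
results = [guess_decode(s)[1] for s in samples]
print(results)
```

Each sample stops at the first codec that does not raise, so the most restrictive match wins. Putting latin-1 first would swallow all three inputs.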

Also note that the unicode character
        u'\ufffd'
(REPLACEMENT CHARACTER) is an error placeholder.  It decodes cleanly from
UTF8 input, so finding it in the result means an earlier processing step
already hit bytes it could not decode.
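
A short sketch of how that placeholder ends up in a file that still passes a strict UTF8 decode. The byte string and variable names are invented for the example, and Python 3 is assumed.

```python
REPLACEMENT = "\ufffd"

# A lone 0xE9 byte is not valid UTF-8; errors="replace" substitutes U+FFFD.
damaged = b"caf\xe9".decode("utf-8", errors="replace")

# Suppose some earlier tool then saved the damaged text as UTF-8.
stored = damaged.encode("utf-8")

# The stored bytes decode without error, yet the marker is still there.
text = stored.decode("utf-8", "strict")
has_marker = REPLACEMENT in text
print(repr(text), has_marker)
```

So after a successful decode it is still worth checking for u'\ufffd' in the text before trusting it.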


DO NOT USE THIS CODE BLINDLY.  It simply offers a reasonable first cut
when those three are the likely encodings.  It is impossible to distinguish
the various LATINx encodings by simply looking at the bytes.  All 8 bit
byte values are valid, but their meanings change based on the encoding used.
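
To illustrate, here is a small sketch decoding the same two bytes with three of the ISO 8859 codecs that ship with Python. The sample bytes are chosen for the example, and Python 3 is assumed.

```python
raw = b"\xa4\xb1"

# Every codec accepts the bytes; only the resulting characters differ.
decoded = {
    codec: raw.decode(codec, "strict")
    for codec in ("iso8859-1", "iso8859-2", "iso8859-15")
}

for codec, text in decoded.items():
    # Show the code points so the difference is visible on any terminal.
    print(codec, [hex(ord(ch)) for ch in text])
```

None of the three raises an error, so only knowledge of where the file came from can tell you which reading is right.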

> 
> Thx
> 
> da
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
-- 
Lloyd Kvam
Venix Corp

