On Wed, 2006-04-19 at 10:10 +0700, kakada wrote:
> Hi again folks,
>
> I wonder if we can check the encoding of text in one text file.
> user is free to encode the file whether Latin1, utf-8, ANSI...
> Any ideas?

# Codec names for the constants used below (assumed values; the
# original post did not show their definitions).
ASCII, UTF8, LATIN = 'ascii', 'utf-8', 'latin-1'

def decode_file(filepath):
    '''Order of codecs is important.  ASCII is most restrictive to
    decode - no byte values > 127.  UTF8 is next most restrictive.
    There are illegal byte values and illegal sequences.  LATIN will
    accept anything since all 256 byte values are OK.  The final
    decision still depends on human inspection.
    '''
    with open(filepath, 'rb') as f:
        buff = f.read()
    for charset in (ASCII, UTF8, LATIN):
        try:
            unistr = buff.decode(charset, 'strict')
        except UnicodeDecodeError:
            pass    # not this codec; try the next, less restrictive one
        else:
            break   # first codec that decodes cleanly wins
    else:
        # Only reached if every codec fails, which cannot happen
        # while LATIN is in the list.
        unistr, charset = u'', None
    return unistr, charset

Also note that the Unicode character u'\ufffd' (REPLACEMENT CHARACTER)
is an error placeholder.  It can be decoded from valid UTF8 input, and
when it shows up it reflects earlier processing problems.

DO NOT USE THIS CODE BLINDLY.  It simply offers a reasonable first cut
where those are the likely encodings.  It is impossible to distinguish
the various LATINx encodings by simply looking at the bytes.  All 8-bit
byte values are valid in each of them, but their meanings change based
on the encoding used.

>
> Thx
>
> da
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
-- 
Lloyd Kvam
Venix Corp
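
As a follow-up to the points above about the LATINx encodings and
u'\ufffd', here is a small sketch in Python 3 syntax. It is not part of
the original reply. The sample bytes and variable names are made up for
illustration; the codec names 'latin-1' and 'iso-8859-15' are standard
Python codecs. It shows that the same byte decodes without error under
two LATINx codecs but means something different in each, and that the
placeholder character passes a strict UTF8 decode.

```python
# Sketch only: the sample bytes below are hypothetical.
raw = b'\xa4 100'

# Both decodes succeed, so a try/except loop cannot tell them apart.
as_latin1 = raw.decode('latin-1')       # 0xA4 is CURRENCY SIGN here
as_latin9 = raw.decode('iso-8859-15')   # 0xA4 is EURO SIGN here

print(repr(as_latin1), repr(as_latin9))

# U+FFFD encodes to a legal UTF-8 sequence, so strict decoding accepts
# it.  Finding it in decoded text hints at an earlier lossy conversion.
placeholder = '\ufffd'.encode('utf-8')
decoded = placeholder.decode('utf-8', 'strict')
has_placeholder = '\ufffd' in decoded
```

A check like has_placeholder can flag a file for the human inspection
mentioned in the docstring; it cannot recover the original bytes.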