On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> Hello,
> I have an extremely simple piece of code which reads a .csv file, which has
> 1000 lines of fixed fields, one line at a time, and tries to print some
> values.
>
>   1 #!/usr/bin/python3
>   2 #
>   3 import sys, time, re, os
>   4
>   5 if __name__=="__main__":
>   6
>   7     ifd = open("infile.csv", 'r')

By default, Python 3 decodes text files using your platform's preferred
encoding, which on your system is evidently UTF-8. As the error below
shows, your file actually isn't UTF-8. What are you using to generate the
CSV file? Consult the documentation for that program and see what encoding
it is using. If it has an option to save using UTF-8, use that. See below
for more discussion.

>   8
>   9     linenum = 0
>  10     for line in ifd:
>  11         line1 = re.split(",", line)
>  12         total = 0
>  13         if linenum == 0:
>  14             linenum = linenum + 1
>  15             continue

[snip many more lines of code]

All of this manual effort is unnecessary, as Python comes standard with a
library to read CSV files. It is much better to use that:

http://docs.python.org/3/library/csv.html

>  31     ifd.close

This line is buggy. To close the file, you need to *call* the close method
by using parentheses, that is, you must write:

    ifd.close()

Without the parentheses, you just get a reference to the close method but
don't do anything with it.

> It works fine till it parses the 1st 126 lines in the input file. For the
> 127th line (irrespective of the contents of the actual line), it prints the
> following error:
> Traceback (most recent call last):
>   File "p1.py", line 10, in <module>
>     for line in ifd:
>   File "/usr/lib/python3.2/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:
> invalid continuation byte
> $
>
> I am not able to figure out the cause of this error. Any clues as to why I
> am seeing this error, are appreciated.

As mentioned earlier, the error is that the CSV file is not encoded using
UTF-8. The best solution is to go back to the source where the file comes
from and pick the option to always save using UTF-8.

The second best solution is to identify what codec is actually being used.
If you tell us what program generates the CSV file in the first place, and
the operating system you are using (Windows? Mac? Linux?), we might be able
to identify the codec being used.

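
To make the advice above concrete, here is a minimal sketch (not the
original poster's code) that reads a CSV file with the csv module, opens it
in a "with" block so the file is closed automatically, and passes the
encoding explicitly. The helper names read_rows and make_sample, the sample
data, and the choice of Latin-1 for the demonstration file are all
assumptions made for illustration; the real file may well use a different
codec.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Return the data rows of a CSV file, skipping the header line."""
    # The "with" block closes the file on exit, so no ifd.close() is needed.
    # newline="" is what the csv documentation recommends for csv.reader.
    with open(path, "r", encoding=encoding, newline="") as ifd:
        reader = csv.reader(ifd)
        next(reader, None)  # skip the header row, if there is one
        return list(reader)


def make_sample(encoding):
    """Write a tiny CSV file with made-up data and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", encoding=encoding, newline="") as out:
        # "\xe9" is e-acute; in Latin-1 it is stored as the single byte 0xe9.
        out.write("name,qty\r\ncaf\xe9,3\r\n")
    return path


if __name__ == "__main__":
    path = make_sample("latin-1")  # assumption: pretend the source is Latin-1
    try:
        try:
            read_rows(path, "utf-8")
        except UnicodeDecodeError as err:
            # Same kind of failure as in the traceback: 0xe9 is not valid here.
            print("utf-8 failed:", err)
        print(read_rows(path, "latin-1"))
    finally:
        os.remove(path)
```

Passing the encoding as an argument keeps the decision about the codec in
one visible place, so it is easy to change once the real codec is known.
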
If you can't identify the codec, you can guess. Guessing is bad, for two
reasons:

- you can waste a lot of time with bad guesses;

- worse, some bad guesses won't give you an error, but will just give
  you bad data.

Nevertheless, you can try using a different encoding when you open the
file. Try this:

    ifd = open("infile.csv", 'r', encoding='latin-1')

"Latin 1" is an encoding which never fails to decode, because every
possible byte value maps to some character, but it might give back rubbish
data. Such rubbish data is often called "mojibake":

en.wikipedia.org/wiki/Mojibake

Another option is to cover up the errors by passing an error handler:

    ifd = open("infile.csv", 'r', errors='replace')

which will replace any undecodable bytes in the file with a "missing
character" marker (U+FFFD, the Unicode replacement character).


-- 
Steven