On Fri, Jan 17, 2014 at 10:58:25AM -0500, josef.p...@gmail.com wrote:
> On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
> > On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> >
> > You don't show how you created the file. I think that in your case the
> > content of 'filenames.txt' is correctly encoded latin-1.
>
> I had created it with os.listdir but deleted some lines

You used os.listdir to generate the unicode strings that you write to the
file. The underlying Win32 API returns filenames encoded as utf-16 but Python
takes care of decoding them under the hood so you just get abstract unicode
strings here in Python 3.

It is the write method of the file object that encodes the unicode strings and
hence determines the byte content of 'filenames5.txt'. You can check the
fout.encoding attribute to see what encoding it uses by default.

> Running the full script again I still get the same correct answer for fn
> ------------
> import os
> if 1:
>     with open('filenames5.txt', 'w') as fout:
>         fout.writelines([f + '\n' for f in os.listdir('.')])
>     with open('filenames.txt') as fin:
>         print(fin.read())
>
> import numpy
>
> #filenames = numpy.loadtxt('filenames.txt')
> filenames = numpy.loadtxt('filenames5.txt', dtype='S')
> fn = open(filenames[-1])

The question is what do you get when you do:

In [1]: with open('tmp.txt', 'w') as fout:
   ...:     print(fout.encoding)
   ...:
UTF-8

I get utf-8 by default if no encoding is specified. This means that when I
write to the file like so

In [2]: with open('tmp.txt', 'w') as fout:
   ...:     fout.write('Õscar')
   ...:

If I read it back in binary I get different bytes from you:

In [3]: with open('tmp.txt', 'rb') as fin:
   ...:     print(fin.read())
   ...:
b'\xc3\x95scar'

Numpy.loadtxt will correctly decode those bytes as utf-8:

In [5]: b'\xc3\x95scar'.decode('utf-8')
Out[5]: 'Õscar'

But then it reencodes them with latin-1 before storing them in the array:

In [6]: b'\xc3\x95scar'.decode('utf-8').encode('latin-1')
Out[6]: b'\xd5scar'

This byte string will not be recognised by my Linux OS (POSIX uses bytes for
filenames and an exact match is needed). So if I pass that to open() it will
fail.

<snip>

> > I get similar problems when I use a file that someone else has
> > written, however I haven't seen much problems if I do everything on
> > Windows.

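For reference, the byte-level steps in the session above can be replayed in
one self-contained script. This is only a sketch: the temporary directory and
the explicit encoding='utf-8' argument are assumptions added here so that the
result does not depend on the platform default, and it does not call
numpy.loadtxt at all. It just repeats the decode and re-encode steps shown
above.

```python
import os
import tempfile

name = 'Õscar'

# Assumption: write into a throwaway directory rather than the current one.
path = os.path.join(tempfile.mkdtemp(), 'tmp.txt')

# Pass the encoding explicitly instead of relying on fout.encoding's default.
with open(path, 'w', encoding='utf-8') as fout:
    fout.write(name)

# Read the raw bytes back: these are the utf-8 encoded bytes.
with open(path, 'rb') as fin:
    raw = fin.read()

# Decode as utf-8, then re-encode as latin-1, as described above.
latin1_bytes = raw.decode('utf-8').encode('latin-1')

print(raw)                  # b'\xc3\x95scar'
print(latin1_bytes)         # b'\xd5scar'
print(raw == latin1_bytes)  # False
```

The final comparison is the point: the two byte strings differ, so a lookup
that needs an exact byte match for the name will not find the file.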
If you use a proper explicit encoding then you can savetxt from any system and
loadtxt on any other without corruption.

> The main problems I get and where I don't know how it's supposed to
> work in the best way is when we get "foreign" data.

Text data needs to have metadata specifying the encoding. This is something
that people who pass data around need to think about.

Oscar
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion