On Fri, Jan 17, 2014 at 2:18 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> On 17.01.2014 15:12, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
>> <oscar.j.benja...@gmail.com> wrote:
>>
>>     On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>>     > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>>     > <oscar.j.benja...@gmail.com> wrote:
>>     >
>>     > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>>     > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>>     > > > [clip]
>>     > > > > For backward compatibility we *cannot* change S.
>>     > >
>>     > > Do you mean to say that loadtxt cannot be changed from decoding using
>>     > > system default, splitting on newlines and whitespace and then encoding
>>     > > the substrings as latin-1?
>>     >
>>     > unicode dtypes have nothing to do with the loadtxt issue. They are not
>>     > related.
>>
>>     I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
>>     if the file is not encoded as ascii or latin-1 then the byte strings are
>>     corrupted (see below).
>>
>>     This is because loadtxt opens the file with the default system encoding (by
>>     not explicitly specifying an encoding):
>>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>>
>>     It then processes each line with asbytes() which encodes them as latin-1:
>>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
>>     https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>>
>> wow this is just horrible, it might be the source of the bug.
>>
>>     Being an English speaker I don't normally use non-ascii characters in
>>     filenames but my system (Ubuntu Linux) still uses utf-8 rather than
>>     latin-1 or (and rightly so!).
>>
>>     > > An obvious improvement would be along the lines of what Chris Barker
>>     > > suggested: decode as latin-1, do the processing and then reencode as
>>     > > latin-1.
>>     >
>>     > no, the right solution is to add an encoding argument.
>>     > It's a 4 line patch for python2 and a 2 line patch for python3 and the
>>     > issue is solved, I'll file a PR later.
>>
>>     What is the encoding argument for? Is it to be used to decode, process the
>>     text and then re-encode it for an array with dtype='S'?
>>
>> it is only used to decode the file into text, nothing more.
>> loadtxt is supposed to load text files, it should never have to deal
>> with bytes ever.
>> But I haven't looked into the function deeply yet, there might be ugly
>> surprises.
>>
>> The output of the array is determined by the dtype argument and not by
>> the encoding argument.
>>
>> Let's please let the loadtxt issue go to rest.
>> We know the issue, we know it can be fixed without adding anything
>> complicated to numpy.
>> We just have to use what python already provides us.
>> The technical details of the fix can be discussed in the github issue.
>> (Plan to have a look this weekend, but if someone else wants to do it
>> let me know).
>>
>
> Work in progress PR:
> https://github.com/numpy/numpy/pull/4208
>
> I also seem to have fixed the original bug, which wasn't even my
> intention with that PR :)
> apparently it was indeed one of the broken asbytes calls.
>
> if you have applications using loadtxt please give it a try, but
> genfromtxt is still completely broken (and a much larger fix, asbytes
> everywhere)
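
For anyone following along, here is a minimal sketch of the two approaches argued over above, using only the standard library. It is not numpy's actual loadtxt code and not the code in the PR; read_fields and RAW are made-up names for illustration, and the sample data is an assumption modelled on the Õscar example in this thread. It decodes the file bytes with an explicit encoding, shows that re-encoding the resulting text as latin-1 yields bytes that differ from what is on disk, and shows that a pure latin-1 decode/encode round trip leaves the bytes untouched.

```python
import io

# Hypothetical file contents as stored on disk, encoded as UTF-8.
# "\u00d5" is the character Õ.
RAW = "1,2,3,hello\n5,6,7,\u00d5scar\n".encode("utf-8")


def read_fields(raw, encoding, delimiter=","):
    """Decode raw bytes to text with an explicit encoding, then split each
    line into fields. Illustrative helper, not part of numpy."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return [line.rstrip("\n").split(delimiter) for line in text]


# Decoding with the file's real encoding gives proper text fields.
rows = read_fields(RAW, encoding="utf-8")
name = rows[1][3]

# Re-encoding that text as latin-1 (what asbytes() is described as doing)
# produces bytes that no longer match the file contents.
as_latin1 = name.encode("latin-1")   # b'\xd5scar'
as_utf8 = name.encode("utf-8")       # b'\xc3\x95scar', same as on disk

# The latin-1 round trip suggested earlier in the thread: every byte value
# maps to exactly one code point, so the bytes survive unchanged.
roundtrip = RAW.decode("latin-1").encode("latin-1")

print(as_latin1, as_utf8, roundtrip == RAW)
```

The point of the sketch is only that the encoding used to decode the file and the encoding used to build an 'S' array are separate choices; which one loadtxt should make by default is the question being discussed.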

does this still work?

>>> numpy.loadtxt(open('Õscar_3.txt',"rb"), 'S')
array([b'1,2,3,hello', b'5,6,7,\xc3\x95scarscar', b'15,2,3,hello',
       b'20,2,3,\xc3\x95scar'],
      dtype='|S16')

to compare

>>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
Traceback (most recent call last):
  File "<pyshell#251>", line 1, in <module>
    numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1828, in recfromtxt
    output = genfromtxt(fname, **kwargs)
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1351, in genfromtxt
    first_values = split_line(first_line)
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", line 207, in _delimited_splitter
    line = line.split(self.comments)[0]
TypeError: Can't convert 'bytes' object to str implicitly

>>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
       (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])

Josef

> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion