On 17.01.2014 15:12, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>
>     On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>     > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>     > <oscar.j.benja...@gmail.com> wrote:
>     >
>     > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>     > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>     > > > [clip]
>     > > > > For backward compatibility we *cannot* change S.
>     > >
>     > > Do you mean to say that loadtxt cannot be changed from decoding
>     > > using system default, splitting on newlines and whitespace and
>     > > then encoding the substrings as latin-1?
>     >
>     > unicode dtypes have nothing to do with the loadtxt issue. They
>     > are not related.
>
>     I'm talking about what loadtxt does with the 'S' dtype. As I showed
>     earlier, if the file is not encoded as ascii or latin-1 then the
>     byte strings are corrupted (see below).
>
>     This is because loadtxt opens the file with the default system
>     encoding (by not explicitly specifying an encoding):
>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
>     It then processes each line with asbytes() which encodes them as
>     latin-1:
>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
>     https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> wow this is just horrible, it might be the source of the bug.
>
>     Being an English speaker I don't normally use non-ascii characters
>     in filenames but my system (Ubuntu Linux) still uses utf-8 rather
>     than latin-1 or (and rightly so!).
>
>     > > An obvious improvement would be along the lines of what Chris
>     > > Barker suggested: decode as latin-1, do the processing and then
>     > > reencode as latin-1.
>     >
>     > no, the right solution is to add an encoding argument.
>     > Its a 4 line patch for python2 and a 2 line patch for python3 and
>     > the issue is solved, I'll file a PR later.
>
>     What is the encoding argument for? Is it to be used to decode,
>     process the text and then re-encode it for an array with dtype='S'?
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal
> with bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by
> the encoding argument.
>
> Lets please let the loadtxt issue go to rest.
> We know the issue, we know it can be fixed without adding anything
> complicated to numpy.
> We just have to use what python already provides us.
> The technical details of the fix can be discussed in the github issue.
> (Plan to have a look this weekend, but if someone else wants to do it
> let me know).
>

Work in progress PR: https://github.com/numpy/numpy/pull/4208

I also seem to have fixed the original bug, which wasn't even my
intention with that PR :) Apparently it was indeed one of the broken
asbytes calls.

If you have applications using loadtxt, please give it a try. Note that
genfromtxt is still completely broken (and a much larger fix, since
asbytes is used everywhere).
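
For anyone who wants to see the asbytes problem without touching numpy, here is a minimal sketch using only the standard library. It is not the code from the PR. The helper names (lossy_roundtrip, latin1_roundtrip, decode_only) are made up for illustration, and the sample string and the utf-8 "system default" are assumptions. It compares the three behaviours discussed in this thread: decoding with the system default and then encoding as latin-1, the latin-1 round trip, and using an encoding only to decode.

```python
# Sketch only: reproduces the decode/encode mismatch without numpy.
# All helper names below are hypothetical and not part of numpy.

def lossy_roundtrip(raw, system_encoding):
    """Decode with the system default, then encode as latin-1 (as asbytes does)."""
    return raw.decode(system_encoding).encode("latin-1")


def latin1_roundtrip(raw):
    """Decode and re-encode as latin-1; every byte value maps back to itself."""
    return raw.decode("latin-1").encode("latin-1")


def decode_only(raw, encoding):
    """Use the encoding solely to turn the file's bytes into text."""
    return raw.decode(encoding)


# Bytes as they would be stored in a utf-8 encoded file (assumed sample data).
raw = u"caf\u00e9".encode("utf-8")

corrupted = lossy_roundtrip(raw, "utf-8")  # differs from the bytes in the file
preserved = latin1_roundtrip(raw)          # identical to the bytes in the file
text = decode_only(raw, "utf-8")           # proper text, dtype decides the rest

# Characters outside latin-1 cannot be encoded at all on the lossy path.
try:
    lossy_roundtrip(u"\u0142".encode("utf-8"), "utf-8")
    failed = False
except UnicodeEncodeError:
    failed = True

print(repr(raw), repr(corrupted), repr(preserved), repr(text), failed)
```

The first path silently changes the stored bytes, or fails outright for characters that latin-1 cannot represent. That is why the encoding should only be used to get text out of the file, with the dtype argument deciding what ends up in the array.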