On Fri, Jan 17, 2014 at 3:17 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
>
>> >>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
>>
>> Traceback (most recent call last):
>>   File "<pyshell#251>", line 1, in <module>
>>     numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1828, in recfromtxt
>>     output = genfromtxt(fname, **kwargs)
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1351, in genfromtxt
>>     first_values = split_line(first_line)
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", line 207, in _delimited_splitter
>>     line = line.split(self.comments)[0]
>> TypeError: Can't convert 'bytes' object to str implicitly
>
>
> That's pretty broken -- if you know the encoding, you should certainly be
> able to get a proper unicode string out of it..
>
>>
>> >>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
>> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
>>        (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
>>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])
>
>
> So the problem here is that recfromtxt is making all "text" bytes objects.
> ('S' ?) -- which is probably not what you want particularly if you specify
> an encoding. Though I can't figure out at the moment why the previous one
> failed -- where did the bytes object come from when the encoding was
> specified?

Yes, it's a utf-8 file with non-ascii text.

I don't know what I **should** want.

For now I do want bytes, because that's how I changed statsmodels in the
py3 conversion. This was just based on the fact that recfromtxt doesn't
work with strings on python 3, so I switched to using bytes, following the
lead of numpy.

I'm mainly worried about backwards compatibility, since we have been using
this for 2 or 3 years. It would be easy to change in statsmodels when
gen/recfromtxt is fixed, but I assume there is lots of other code using a
similar interpretation of S/bytes in numpy.

Josef

>
> By the way -- this is apparently a utf-8 file with some non-ascii text in it.
> By my proposal, without an encoding specified, it should default to latin-1:
>
> In that case, you might get unicode string objects that are incorrectly
> decoded. But:
>
> it would not raise an exception
>
> you could recover the proper text with:
>
> the_text.encode('latin-1').decode('utf-8')
>
> On the other hand, if this was an ascii-compatible non-utf8 encoding file,
> and we tried to read it as utf-8, it could barf on the non-ascii text
> altogether, and if it didn't, the non-ascii text would be corrupted and
> impossible to recover.
>
> I think the issue is that I'm not really proposing latin-1 -- I'm proposing
> "an ascii-compatible encoding that will do the right thing with ascii bytes,
> and pass through any other bytes untouched" - latin-1, at least as
> implemented by Python, satisfies that criterion.
>
> -Chris
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
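
For reference, the latin-1 round trip described in the quoted proposal can be
sketched in plain Python. The sample string and all variable names below are
illustrative assumptions, not taken from the file discussed in the thread, and
this is not what genfromtxt/recfromtxt do internally. It only shows why
decoding unknown bytes as latin-1 is lossless, while decoding as utf-8 may
fail or lose information.

```python
# Illustrative sketch only: sample text and names are assumptions.
original = "Õscar"

# utf-8 bytes, matching the b'\xc3\x95scar' seen in the rec.array output.
raw = original.encode("utf-8")

# latin-1 maps every byte 0x00-0xFF to one code point, so this never raises.
mojibake = raw.decode("latin-1")

# The text is mis-decoded, but no information was lost: re-encoding as
# latin-1 gives back the exact bytes, which can then be decoded correctly.
recovered = mojibake.encode("latin-1").decode("utf-8")

# The pass-through property holds for every possible byte value.
all_bytes = bytes(range(256))
roundtrip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Reverse case: latin-1 encoded bytes read as utf-8.
latin1_raw = original.encode("latin-1")  # b'\xd5scar'
try:
    latin1_raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # 0xd5 starts a two-byte utf-8 sequence, but 's' is not a continuation byte.
    strict_ok = False

# Suppressing the error replaces the bad byte, so the original is unrecoverable.
lossy = latin1_raw.decode("utf-8", errors="replace")

print(repr(mojibake), repr(recovered), roundtrip_ok, strict_ok, repr(lossy))
```

The strict utf-8 decode fails outright, and the `errors="replace"` variant
returns text from which the original byte can no longer be reconstructed,
which is the asymmetry the proposal relies on.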