On Thu, Jan 23, 2014 at 11:58 AM, <josef.p...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.p...@gmail.com wrote:
>>>
>>> another curious example, encode utf-8 to latin-1 bytes
>>>
>>> >>> b
>>> array(['Õsc', 'zxc'],
>>>       dtype='<U3')
>>> >>> b[0].encode('utf8')
>>> b'\xc3\x95sc'
>>> >>> b[0].encode('latin1')
>>> b'\xd5sc'
>>> >>> b.astype('S')
>>> Traceback (most recent call last):
>>>   File "<pyshell#40>", line 1, in <module>
>>>     b.astype('S')
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>> position 0: ordinal not in range(128)
>>> >>> c = b.view('S4').astype('S1').view('S3')
>>> >>> c
>>> array([b'\xd5sc', b'zxc'],
>>>       dtype='|S3')
>>> >>> c[0].decode('latin1')
>>> 'Õsc'
>>
>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
>> ascii:
>>
>> >>> np.array(['Õsc']).astype('S4')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position
>> 0: ordinal not in range(128)
>> >>> np.array(['Õsc']).view('S4')
>> array([b'\xd5', b's', b'c'],
>>       dtype='|S4')
>
>
> No, a view doesn't change the memory, it just changes the
> interpretation and there shouldn't be any conversion involved.
> astype does type conversion, but it goes through ascii encoding which fails.
>
> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
> >>> b.tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
> >>> b.view('S12')
> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>       dtype='|S12')
>
> The conversion happens somewhere in the array creation, but I have no
> idea about the memory encoding for uc2 and the low level layouts.
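
For what it's worth, the bytes that tostring() prints above look like
UTF-32 little-endian, i.e. 4 bytes per character. That would explain why
viewing as 'S4' and truncating to 'S1' happens to give the latin-1 bytes:
for code points below 256, the first byte of each 4-byte unit is the
latin-1 byte. A minimal sketch in plain Python, without numpy, assuming
that memory layout (the variable names are only for illustration):

```python
# Assumption: a '<U3' array stores each character as a 4-byte
# little-endian code point (UTF-32-LE), as the tostring() output suggests.
s = 'Õsc'

raw = s.encode('utf-32-le')
print(raw)        # b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

# Keep only the first byte of every 4-byte unit. As far as I can tell,
# this is what .view('S4').astype('S1').view('S3') ends up doing.
low_bytes = raw[::4]
print(low_bytes)  # b'\xd5sc'

# For code points below 256 this equals the latin-1 encoding ...
same_as_latin1 = (low_bytes == s.encode('latin1'))
print(same_as_latin1)  # True

# ... but it is a truncation, not an encoding: a character above U+00FF
# silently loses its high bytes instead of raising an error.
euro_low = '€'.encode('utf-32-le')[::4]
print(euro_low)   # b'\xac'  (U+20AC keeps only 0xAC)
```

So the view trick is not really "using latin-1"; it only coincides with
latin-1 as long as every character fits in one byte.
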
utf8 encoded bytes

>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>> a
array([b'\xc3\x95sc', b'zxc'],
      dtype='|S4')
>>> a.tostring()
b'\xc3\x95sczxc\x00'
>>> a.view('S8')
array([b'\xc3\x95sczxc'],
      dtype='|S8')
>>> a[0].decode('latin1')
'Ã\x95sc'
>>> a[0].decode('utf8')
'Õsc'

Josef

>
> Josef
>
>>
>>> --------
>>> The original numpy py3 conversion used latin-1 as default
>>> (It's still used in statsmodels, and I haven't looked at the structure
>>> under the common py2-3 codebase)
>>>
>>> if sys.version_info[0] >= 3:
>>>     import io
>>>     bytes = bytes
>>>     unicode = str
>>>     asunicode = str
>>
>> These two functions are an abomination:
>>
>>>     def asbytes(s):
>>>         if isinstance(s, bytes):
>>>             return s
>>>         return s.encode('latin1')
>>>     def asstr(s):
>>>         if isinstance(s, str):
>>>             return s
>>>         return s.decode('latin1')
>>
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion