On Thu, Sep 15, 2011 at 8:50 AM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> In reviewing memory usage, I found potential for saving more memory for
> ASCII-only strings. Both Victor and Guido commented that something like
> this should be done; Antoine had asked whether there was anything that
> could be done. Here is the idea:
>
> In an ASCII-only string, the UTF-8 representation is shared with the
> canonical one-byte representation. This would allow dropping the
> UTF-8 pointer and the UTF-8 length field; instead, a flag in the state
> would indicate that these fields are not there.
>
> Likewise, the wchar_t/Py_UNICODE length can be shared (even though the
> data cannot), since the ASCII-only string won't contain any surrogate
> pairs.
>
> To comply with the C aliasing rules, the structures would look like this:
>
> typedef struct {
>     PyObject_HEAD
>     Py_ssize_t length;
>     union {
>         void *any;
>         Py_UCS1 *latin1;
>         Py_UCS2 *ucs2;
>         Py_UCS4 *ucs4;
>     } data;
>     Py_hash_t hash;
>     int state;     /* may include SSTATE_SHORT_ASCII flag */
>     wchar_t *wstr;
> } PyASCIIObject;
>
>
> typedef struct {
>     PyASCIIObject _base;
>     Py_ssize_t utf8_length;
>     char *utf8;
>     Py_ssize_t wstr_length;
> } PyUnicodeObject;
>
> Code that directly accesses the structures would become more
> complex; code that uses the accessor macros wouldn't notice.
>
> As a result, ASCII-only strings would lose three pointers,
> and shrink to their 3.2 structure size. Since they also save
> in the individual characters, strings with more than
> 3 characters (16-bit Py_UNICODE) or more than one character
> (32-bit Py_UNICODE) would see a total size reduction compared
> to 3.2.
>
> Objects created through the legacy API (PyUnicode_FromUnicode)
> that are only later found to be ASCII-only (in PyUnicode_Ready)
> would still have the UTF-8 pointer shared with the data pointer,
> but keep including separate fields for pointer & size.
>
> What do you think?
>
> Regards,
> Martin
>
> P.S. There are similar reductions that could be applied
> to the wstr_length in general: on 32-bit wchar_t systems,
> it could be always dropped, on a 16-bit wchar_t system,
> it could be dropped for UCS-2 strings. However, I'm not
> proposing these, as I think the increase in complexity
> is not worth the savings.

This sounds like a good plan.

--
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
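
For illustration only, here is a minimal sketch of how accessor code might
branch on the proposed flag. It is not CPython's implementation: the
Sketch* types are stand-ins so the code compiles without Python.h, the two
header fields merely approximate PyObject_HEAD, and the flag value
SKETCH_SHORT_ASCII and all helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-in for Py_ssize_t (assumption: same width as ptrdiff_t). */
typedef ptrdiff_t sketch_ssize_t;

/* Short layout from the proposal; refcnt/type approximate PyObject_HEAD. */
typedef struct {
    sketch_ssize_t refcnt;
    void *type;
    sketch_ssize_t length;
    union {
        void *any;
        unsigned char *latin1;
        unsigned short *ucs2;
        unsigned int *ucs4;
    } data;
    sketch_ssize_t hash;
    int state;              /* may include SKETCH_SHORT_ASCII */
    wchar_t *wstr;
} SketchASCIIObject;

/* Full layout: the short layout as first member, plus three extra fields. */
typedef struct {
    SketchASCIIObject _base;
    sketch_ssize_t utf8_length;
    char *utf8;
    sketch_ssize_t wstr_length;
} SketchUnicodeObject;

/* Hypothetical bit value; the proposal only names the flag. */
#define SKETCH_SHORT_ASCII 0x1

/* Set up a short ASCII-only object that borrows the caller's buffer. */
static void sketch_init_ascii(SketchASCIIObject *op, char *text)
{
    memset(op, 0, sizeof *op);
    op->length = (sketch_ssize_t)strlen(text);
    op->data.latin1 = (unsigned char *)text;
    op->hash = -1;
    op->state = SKETCH_SHORT_ASCII;
}

/* Set up a full object as the legacy path might leave it: the UTF-8
   pointer is shared with the data pointer, but the fields still exist. */
static void sketch_init_legacy(SketchUnicodeObject *op, char *text)
{
    memset(op, 0, sizeof *op);
    op->_base.length = (sketch_ssize_t)strlen(text);
    op->_base.data.latin1 = (unsigned char *)text;
    op->_base.hash = -1;
    op->_base.state = 0;    /* flag clear: extra fields are present */
    op->utf8 = text;
    op->utf8_length = op->_base.length;
}

/* Accessor: only touch the extra fields when the flag says they exist.
   The cast is valid because _base is the first member of the full struct. */
static const char *sketch_utf8(const SketchASCIIObject *op)
{
    assert(op != NULL);
    if (op->state & SKETCH_SHORT_ASCII)
        return (const char *)op->data.latin1;
    return ((const SketchUnicodeObject *)op)->utf8;
}

static sketch_ssize_t sketch_utf8_length(const SketchASCIIObject *op)
{
    assert(op != NULL);
    if (op->state & SKETCH_SHORT_ASCII)
        return op->length;
    return ((const SketchUnicodeObject *)op)->utf8_length;
}
```

Callers would pass a pointer to the short struct (for a full object,
&obj._base) and never read utf8 or utf8_length directly, which is the
sense in which code using the accessor macros "wouldn't notice" the change.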