[Python-Dev] PEP 393 Summer of Code Project
Hello all,

I have implemented an initial version of PEP 393 -- "Flexible String Representation" as part of my Google Summer of Code project. My patch is hosted as a repository on bitbucket [1] and I created a related issue on the bug tracker [2]. I posted documentation for the current state of the development in the wiki [3].

Current tests show a potential reduction of memory by about 20% and CPU by 50% for a join micro benchmark. Starting a new interpreter still causes 3244 calls to create compatibility Py_UNICODE representations; 263 strings are created using the old API while 62719 are created using the new API. More measurements are on the wiki page [3].

If there is interest, I would like to continue working on the patch with the goal of getting it into Python 3.3. Any and all feedback is welcome.

Regards,
Torsten

[1]: http://www.python.org/dev/peps/pep-0393
[2]: http://bugs.python.org/issue12819
[3]: http://wiki.python.org/moin/SummerOfCode/2011/PEP393

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Mon, Aug 22, 2011 at 18:14, Antoine Pitrou wrote:
> - You could trim the debug results from the benchmark results, this may
>   make them more readable.

Good point, I removed them from the wiki page.

On Tue, Aug 23, 2011 at 18:38, Victor Stinner wrote:
> On Tuesday, 23 August 2011 00:14:40, Antoine Pitrou wrote:
>> - You could try to run stringbench, which can be found at
>>   http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>>   and there's iobench (the text mode benchmarks) in the Tools/iobench
>>   directory.
>
> Some raw numbers.
> [...]

Thank you Victor for running stringbench, I did not get to it in time.

Regards,
Torsten
Re: [Python-Dev] PEP 393 Summer of Code Project
On Tue, Aug 23, 2011 at 08:15, Antoine Pitrou wrote:
> So why would you need three separate implementation of the unrolled
> loop? You already have a macro named WRITE_FLEXIBLE_OR_WSTR.

The WRITE_FLEXIBLE_OR_WSTR macro checks the kind and then writes. Using this macro for the fast path would be inefficient. To have a real fast path, you would need an outer if to check the kind, and then in each branch the matching access to the string (1, 2, or 4 bytes), with each branch also writing 4 or 8 times (guarded by #ifdef, depending on platform).

As all these cases bloated the C code, we went for the simple solution, with the goal of profiling the code again afterwards to see where the new performance bottlenecks would be.

> Even without taking into account the unrolled loop, I wonder how much
> slower UTF-8 decoding becomes with that approach, by the way. Instead of
> testing the "kind" variable at each loop iteration, using a
> stringlib-like approach may be a better deal IMO.

To me, this feels like it would complicate the C source code and decrease readability. For each function you would need a wrapper that does the kind-checking logic and then, in a separate file, the implementation of the function, which then gets included three times, once for each character width.

Regards,
Torsten
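The trade-off discussed in this message can be sketched in a few lines of C. This is a minimal illustration, not code from the patch: the kind tags and the functions `write_char`, `copy_ascii_slow`, and `copy_ascii_fast` are hypothetical names, and the platform-dependent 4x/8x unrolling is left out. The first copy routine tests the kind on every character, in the spirit of WRITE_FLEXIBLE_OR_WSTR; the second hoists the test out of the loop, which requires one loop body per character width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind tags for this sketch; the patch's constants may differ. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Per-character dispatch: the kind is tested on every single write. */
static void write_char(enum sketch_kind kind, void *buf, size_t i, uint32_t ch)
{
    switch (kind) {
    case SKETCH_1BYTE: ((uint8_t *)buf)[i] = (uint8_t)ch; break;
    case SKETCH_2BYTE: ((uint16_t *)buf)[i] = (uint16_t)ch; break;
    case SKETCH_4BYTE: ((uint32_t *)buf)[i] = ch; break;
    }
}

/* Simple version: one loop, kind check inside it. */
static void copy_ascii_slow(enum sketch_kind kind, void *buf,
                            const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(src[i] < 0x80);  /* this sketch only handles ASCII runs */
        write_char(kind, buf, i, src[i]);
    }
}

/* One tight loop for a given output character type T. */
#define SKETCH_COPY_LOOP(T)                \
    do {                                   \
        T *out = (T *)buf;                 \
        for (size_t i = 0; i < n; i++)     \
            out[i] = (T)src[i];            \
    } while (0)

/* Fast-path version: the kind is tested once, outside the loop,
   at the cost of three copies of the loop body. */
static void copy_ascii_fast(enum sketch_kind kind, void *buf,
                            const unsigned char *src, size_t n)
{
    switch (kind) {
    case SKETCH_1BYTE: SKETCH_COPY_LOOP(uint8_t); break;
    case SKETCH_2BYTE: SKETCH_COPY_LOOP(uint16_t); break;
    case SKETCH_4BYTE: SKETCH_COPY_LOOP(uint32_t); break;
    }
}
```

Even with a helper macro, the fast version carries three instantiations of the loop, and adding unrolling would multiply that again. That is the code growth the message refers to.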
Re: [Python-Dev] PEP 393 Summer of Code Project
On Tue, Aug 23, 2011 at 18:27, Victor Stinner wrote:
> I posted a patch to re-add it:
> http://bugs.python.org/issue12819#msg142867

Thank you for the patch! Note that this patch adds the fast path only to the helper function which determines the length of the string and the maximum character. The decoding part is still without a fast path for ASCII runs.

Regards,
Torsten
Re: [Python-Dev] PEP 393 Summer of Code Project
On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote:
> Macros are useful to shield the abstraction from the implementation. If
> you access the members directly, and the unicode object is represented
> differently in some future version of Python (say e.g. with tagged
> pointers), your code doesn't compile anymore.

I agree with Antoine. From the experience of porting C code from 3.2 to the PEP 393 unicode API, the additional encapsulation by macros made it much easier to change the implementation of what is a field, what a field's actual name is, and what needs to be calculated through a function.

So, I would like to keep primary access as a macro, but I see the point that a union would make the struct clearer to access, and I would not mind changing the struct to use one. But then most access currently goes through macros, so I am not sure how much benefit the union would bring, as it mostly complicates the struct definition. Also, common, now simple, checks for "unicode->str == NULL" would look more ambiguous with a union ("unicode->str.latin1 == NULL").

Regards,
Torsten
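The point about macros shielding callers from the layout can be illustrated with a small sketch. Everything here is hypothetical (the `sketch_str` type, the `SKETCH_*` macros, and the `SKETCH_LAYOUT_V2` switch are invented for illustration and are not part of the patch). Two different struct layouts sit behind the same macro names, so the calling code at the bottom compiles unchanged with either one.

```c
#include <assert.h>
#include <stddef.h>

#ifdef SKETCH_LAYOUT_V2
/* Alternative layout: the length is not stored, it is computed. */
typedef struct { const char *begin; const char *end; } sketch_str;
#define SKETCH_LENGTH(op)     ((size_t)((op)->end - (op)->begin))
#define SKETCH_INIT(op, s, n) ((op)->begin = (s), (op)->end = (s) + (n))
#else
/* Default layout: the length is a plain field. */
typedef struct { size_t length; const char *str; } sketch_str;
#define SKETCH_LENGTH(op)     ((op)->length)
#define SKETCH_INIT(op, s, n) ((op)->str = (s), (op)->length = (n))
#endif

/* Caller code uses only the macros, so it does not depend on
   whether the length is a field or a computed value. */
static int sketch_is_empty(const sketch_str *s)
{
    return SKETCH_LENGTH(s) == 0;
}

static size_t sketch_last_index(const sketch_str *s)
{
    assert(!sketch_is_empty(s));
    return SKETCH_LENGTH(s) - 1;
}
```

Code that wrote `s->length` directly would stop compiling under the second layout, which is the failure mode Antoine describes.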
Re: [Python-Dev] PEP 393 Summer of Code Project
On Tue, Aug 23, 2011 at 18:56, Victor Stinner wrote:
>> kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
>> necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().
>
> If it can be removed, it would be nice to have kind in [0; 2] instead of kind
> in [1; 2], to be able to have a list (of 3 items) => callback or label.

It is also used in PyUnicode_DecodeUTF8Stateful(), and there might be some cases where I missed converting checks for 0 when I introduced the macro. The question was more whether this should be written as 0 or as a named constant. I preferred the named constant for readability.

An alternative would be to have the kind values be the same as the number of bytes for the string representation, so it would be 0 (wstr), 1 (1-byte), 2 (2-byte), or 4 (4-byte).

I think the value for wstr/uninitialized/reserved should not be removed. The wstr representation is still used in the error case in the utf8 decoder because these strings can be resized. Also, having one designated value for "uninitialized" limits comparisons in the affected functions to the kind value; otherwise they would need to check the str field for NULL to determine in which buffer to write a character.

> I suppose that compilers prefer a switch with all cases defined, 0 a first
> item and contiguous values. We may need an enum.

During the Summer of Code, Martin and I did an experiment with GCC, and it did not seem to produce a jump table as an optimization for three cases but generated comparison instructions anyway. I am not sure how much we should optimize for potential compiler optimizations here.

Regards,
Torsten
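The "list => callback" idea and the switch alternative can be compared in a short sketch. The constants and function names below are hypothetical and only mimic the scheme discussed: contiguous kind values with 0 kept for the wstr/uninitialized case. With contiguous values the kind can index a table of function pointers directly; the switch version does the same work through explicit cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical contiguous kind values; 0 stays reserved for
   wstr/uninitialized, as argued in the message above. */
enum sketch_kind {
    SKETCH_WCHAR_KIND = 0,
    SKETCH_1BYTE_KIND = 1,
    SKETCH_2BYTE_KIND = 2,
    SKETCH_4BYTE_KIND = 3
};

typedef uint32_t (*sketch_reader)(const void *buf, size_t i);

static uint32_t read_1byte(const void *buf, size_t i)
{
    return ((const uint8_t *)buf)[i];
}

static uint32_t read_2byte(const void *buf, size_t i)
{
    return ((const uint16_t *)buf)[i];
}

static uint32_t read_4byte(const void *buf, size_t i)
{
    return ((const uint32_t *)buf)[i];
}

/* Contiguous kinds index straight into the table; slot 0 has no reader. */
static const sketch_reader sketch_readers[] = {
    NULL, read_1byte, read_2byte, read_4byte
};

static uint32_t sketch_read_table(enum sketch_kind kind,
                                  const void *buf, size_t i)
{
    assert(kind != SKETCH_WCHAR_KIND);
    return sketch_readers[kind](buf, i);
}

/* The same dispatch written as a switch. */
static uint32_t sketch_read_switch(enum sketch_kind kind,
                                   const void *buf, size_t i)
{
    switch (kind) {
    case SKETCH_1BYTE_KIND: return read_1byte(buf, i);
    case SKETCH_2BYTE_KIND: return read_2byte(buf, i);
    case SKETCH_4BYTE_KIND: return read_4byte(buf, i);
    default:
        assert(!"wstr/uninitialized kind has no reader");
        return 0;
    }
}
```

If the kind values were the byte widths instead (0, 1, 2, 4), the table would need an unused slot at index 3 or a mapping step before indexing. Whether either form is faster in practice depends on the compiler, as the GCC experiment mentioned above suggests.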
Re: [Python-Dev] PEP 393 Summer of Code Project
Okay, I am convinced. :)

If Martin does not object, I would change the "void *str" field to

    union {
        void *any;
        unsigned char *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;

Regards,
Torsten

On Wed, Aug 24, 2011 at 02:57, Stefan Behnel wrote:
> Torsten Becker, 24.08.2011 04:41:
>>
>> Also, common, now simple, checks for "unicode->str == NULL" would look
>> more ambiguous with a union ("unicode->str.latin1 == NULL").
>
> You could just add yet another field "any", i.e.
>
>     union {
>         unsigned char* latin1;
>         Py_UCS2* ucs2;
>         Py_UCS4* ucs4;
>         void* any;
>     } str;
>
> That way, the above test becomes
>
>     if (!unicode->str.any)
>
> or
>
>     if (unicode->str.any == NULL)
>
> Or maybe even call it "initialised" to match the intended purpose:
>
>     if (!unicode->str.initialised)
>
> That being said, I don't mind "unicode->str.latin1 == NULL" either, given
> that it will (as mentioned by others) be hidden behind a macro most of the
> time anyway.
>
> Stefan
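For illustration, here is how the proposed union might look in use. This is a sketch under stated assumptions, not the patch's actual struct: `sketch_unicode`, `SKETCH_IS_INITIALIZED`, and `sketch_first_ucs2` are hypothetical, the `Py_UCS2` and `Py_UCS4` typedefs are local stand-ins so the example compiles without Python.h, and the NULL test through `any` assumes that all object pointer types share one representation, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the real CPython typedefs, assumed here so the
   sketch is self-contained. */
typedef uint16_t Py_UCS2;
typedef uint32_t Py_UCS4;

/* Hypothetical object holding only the fields needed for the sketch. */
typedef struct {
    size_t length;
    union {
        void *any;
        unsigned char *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} sketch_unicode;

/* The "is there a buffer at all?" test goes through the untyped member,
   so it reads the same regardless of the character width. */
#define SKETCH_IS_INITIALIZED(op) ((op)->data.any != NULL)

/* Typed access goes through the member that matches the width. */
static Py_UCS2 sketch_first_ucs2(const sketch_unicode *u)
{
    assert(SKETCH_IS_INITIALIZED(u) && u->length > 0);
    return u->data.ucs2[0];
}
```

With the check wrapped in a macro, callers never spell out `data.any` or `data.latin1` themselves, which matches Stefan's remark that the member name will mostly be hidden anyway.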