[Python-Dev] New public PyUnicodeBuilder C API

2022-05-16 Thread Victor Stinner
Hi,

I propose adding a new C API to "build a Unicode string". What do you
think? Would it be efficient with any possible Unicode string storage
and any Python implementation?

PyPy has a UnicodeBuilder type in Python, but here I only propose a C
API. Later, if needed, it would be easy to add a Python API for it.
PyPy has UnicodeBuilder to replace the "str += str" pattern, which is
inefficient in PyPy; CPython has a micro-optimization (in ceval.c) to
keep the performance of this pattern reasonable. Adding a Python API was
discussed in 2020, see the LWN article:
https://lwn.net/Articles/816415/

Example without error handling, naive implementation which doesn't use
known length of key and value strings (calling Preallocate may be more
efficient):
---
// Format "key=value"
PyObject *format_with_builder(PyObject *key, PyObject *value)
{
    assert(PyUnicode_Check(key));
    assert(PyUnicode_Check(value));

    // Allocated on the stack
    PyUnicodeBuilder builder;
    PyUnicodeBuilder_Init(&builder);

    // Overallocation is more efficient if the final length is unknown
    PyUnicodeBuilder_EnableOverallocation(&builder);
    PyUnicodeBuilder_WriteStr(&builder, key);
    PyUnicodeBuilder_WriteChar(&builder, '=');

    // Disable overallocation before the last write
    PyUnicodeBuilder_DisableOverallocation(&builder);
    PyUnicodeBuilder_WriteStr(&builder, value);

    PyObject *str = PyUnicodeBuilder_Finish(&builder);
    // ... use str ...
    return str;

error:
    PyUnicodeBuilder_Clear(&builder);
    ...
}
---

Proposed API (11 functions, 1 type):
---
typedef struct { ... } PyUnicodeBuilder;

void PyUnicodeBuilder_Init(PyUnicodeBuilder *builder);

int PyUnicodeBuilder_Preallocate(PyUnicodeBuilder *builder,
    Py_ssize_t length, uint32_t maxchar);

void PyUnicodeBuilder_EnableOverallocation(PyUnicodeBuilder *builder);
void PyUnicodeBuilder_DisableOverallocation(PyUnicodeBuilder *builder);

int PyUnicodeBuilder_WriteChar(PyUnicodeBuilder *builder, uint32_t ch);
int PyUnicodeBuilder_WriteStr(PyUnicodeBuilder *builder, PyObject *str);
int PyUnicodeBuilder_WriteSubstr(PyUnicodeBuilder *builder,
    PyObject *str, Py_ssize_t start, Py_ssize_t end);

int PyUnicodeBuilder_WriteASCIIStr(PyUnicodeBuilder *builder,
    const char *str, Py_ssize_t len);
int PyUnicodeBuilder_WriteLatin1Str(PyUnicodeBuilder *builder,
    const char *str, Py_ssize_t len);

PyObject* PyUnicodeBuilder_Finish(PyUnicodeBuilder *builder);
void PyUnicodeBuilder_Clear(PyUnicodeBuilder *builder);
---

The proposed API is based on the private _PyUnicodeWriter C API that I
added in Python 3.3 to optimize the PEP 393 implementation.

PyUnicodeBuilder_WriteASCIIStr() is an optimization: in release mode,
the function doesn't have to check if the string contains non-ASCII
characters. In debug mode, it must fail if the string contains any. If
you consider that this API is too likely to introduce bugs in release
mode, it can be removed.

PyUnicodeBuilder_Preallocate() maxchar can be zero, but for the
current Python implementation (PEP 393 compact string), it's more
efficient if maxchar matches the expected Unicode storage: 127 for
ASCII, 255 for Latin1, 0xffff for UCS2, or 0x10ffff for UCS4. The
value doesn't have to be exact: for example, it can be any value in
[128; 255] for Latin1. The problem is that computing maxchar (reading
characters, decoding a byte string with a codec, etc.) can be
expensive, and a PyUnicodeBuilder_Preallocate() implementation can
ignore maxchar depending on the chosen Unicode string storage.
PyUnicode_MAX_CHAR_VALUE(str) can be used to compute maxchar, but this
function is specific to the current CPython implementation.
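
To make the maxchar discussion concrete, here is a minimal sketch in plain C
of how a maxchar value maps to a storage width, and why computing it costs a
full pass over the input. The helper names find_maxchar() and
bytes_per_char() are hypothetical; they are not part of the proposed API or
of CPython.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan the code points and return the largest one
   (0 for an empty input). This is the O(n) cost mentioned above. */
uint32_t
find_maxchar(const uint32_t *chars, size_t len)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > maxchar) {
            maxchar = chars[i];
        }
    }
    return maxchar;
}

/* Hypothetical helper: bytes per character that a "kind + data" storage
   would need for a given maxchar. */
int
bytes_per_char(uint32_t maxchar)
{
    assert(maxchar <= 0x10ffff);   /* largest valid code point */
    if (maxchar <= 0xff) {
        return 1;   /* ASCII (<= 127) and Latin1 (<= 255) */
    }
    if (maxchar <= 0xffff) {
        return 2;   /* UCS2 */
    }
    return 4;       /* UCS4 */
}
```

Any value in [128; 255] selects the same 1-byte storage, which is why the
value passed to Preallocate does not have to be exact.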

Maybe a second "preallocate" function without maxchar should be added
(more convenient, but less efficient). I don't know.

Rationale for adding a new public C API.

Currently, the Python C API is designed to allocate a Unicode string
on the heap memory with uninitialized characters, and then basically
gives direct access to these characters. Since Python 3.3, creating
a Unicode string in a C extension became more complicated: the caller
must know in advance what will be the optimal storage for characters:
ASCII, Latin1 UCS1 [U+0000; U+00FF], BMP UCS2 [U+0000; U+FFFF], or
full Unicode character set UCS4 [U+0000; U+10FFFF]. When writing a
codec decoder (like decoding UTF-8), the maximum code point is not
known in advance and so the decoder may need to change the buffer
format while decoding (start with UCS1, switch to UCS2, maybe switch a
second time to UCS4).
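
The format switching described above can be sketched in plain C. This is only
an illustration under assumptions: the widening_buf type and the buf_*
functions are hypothetical names, the capacity is fixed, and nothing here
reflects the actual CPython decoder code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical decoder output buffer (illustrative names only). */
typedef struct {
    void *data;   /* character storage */
    size_t len;   /* characters written so far */
    size_t cap;   /* capacity in characters (fixed in this sketch) */
    int kind;     /* bytes per character: 1 (UCS1), 2 (UCS2) or 4 (UCS4) */
} widening_buf;

uint32_t
buf_get(const widening_buf *b, size_t i)
{
    assert(i < b->len);
    switch (b->kind) {
    case 1: return ((const uint8_t *)b->data)[i];
    case 2: return ((const uint16_t *)b->data)[i];
    default: return ((const uint32_t *)b->data)[i];
    }
}

void
buf_set(widening_buf *b, size_t i, uint32_t ch)
{
    switch (b->kind) {
    case 1: ((uint8_t *)b->data)[i] = (uint8_t)ch; break;
    case 2: ((uint16_t *)b->data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)b->data)[i] = ch; break;
    }
}

/* Start with the narrowest storage. Returns 0 on success, -1 on error. */
int
buf_init(widening_buf *b, size_t cap)
{
    b->data = malloc(cap ? cap : 1);
    b->len = 0;
    b->cap = cap;
    b->kind = 1;
    return b->data ? 0 : -1;
}

/* Append one code point, widening the storage first if it does not fit.
   Returns 0 on success, -1 if the buffer is full or memory runs out. */
int
buf_write_char(widening_buf *b, uint32_t ch)
{
    int need = ch <= 0xff ? 1 : (ch <= 0xffff ? 2 : 4);
    if (b->len >= b->cap) {
        return -1;   /* growing the capacity is out of scope here */
    }
    if (need > b->kind) {
        /* Switch format: allocate wider storage and copy every character. */
        widening_buf wide;
        wide.data = malloc(b->cap * (size_t)need);
        if (wide.data == NULL) {
            return -1;
        }
        wide.len = b->len;
        wide.cap = b->cap;
        wide.kind = need;
        for (size_t i = 0; i < b->len; i++) {
            buf_set(&wide, i, buf_get(b, i));
        }
        free(b->data);
        *b = wide;
    }
    buf_set(b, b->len, ch);
    b->len++;
    return 0;
}
```

Each widening copies everything written so far, which is the cost a caller
avoids when the final maxchar is known before the first write.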

The current C API has multiple drawbacks:

* It is designed to target the exact format "PEP 393 compact strings"
("kind + data").
* It is inefficient for PyPy which uses UTF-8 internally. So it would
also be inefficient if Python is modified to also use UTF-8
internally.
* It leaks too many implementat

[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread dw-git
Cython used the private _PyUnicodeWriter API (and stopped using it on Py3.11
when it was hidden more thoroughly) so would probably make use of a public API
to do the same thing. It's an optimization rather than an essential, of course.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/2PWRAACTQVGDYEYBQXERBHOMSAOKIKF3/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread Victor Stinner
On Mon, May 16, 2022 at 11:40 AM  wrote:
> Cython used the private _PyUnicodeWriter API (and stopped using it on Py3.11 
> when it was hidden more thoroughly)

I'm not aware of any change in the private _PyUnicodeWriter API in
Python 3.11. Is it just that Cython no longer wants to use private
APIs?

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/4HIQ2O3BNABK5XA5DW5HSLJELMO3XARP/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread dw-git
Victor Stinner wrote:
> On Mon, May 16, 2022 at 11:40 AM dw-...@d-woods.co.uk wrote:
> > Cython used the private _PyUnicodeWriter API (and stopped using it on 
> > Py3.11 when it was hidden more thoroughly)
> I'm not aware of any change in the private _PyUnicodeWriter API in
> Python 3.11.

I think it was _PyFloat_FormatAdvancedWriter and _PyLong_FormatAdvancedWriter
that got moved into the internal API, where Cython couldn't easily get at them
(https://github.com/python/cpython/commit/0a883a76cda8205023c52211968bcf87bd47fd6e
 and
https://github.com/python/cpython/commit/5f09bb021a2862ba89c3ecb53e7e6e95a9e07e1d).
It would of course be possible to include the internal headers and re-enable
the optimization; turning it off was just the quickest way to get things
working at the time.

> Is it just that Cython no longer wants to use private
> APIs?

No such luck I'm afraid! The current policy is something like: if possible we 
should have a back-up option to avoid the private API, ideally controlled by a 
C define. I think that's a fairly good compromise - it lets Cython benefit from 
the internal APIs but provides an easy fix if they change. Obviously that 
policy isn't applied perfectly yet...
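
As an illustration of the "back-up option controlled by a C define" policy,
here is a minimal sketch in plain C. The macro EXAMPLE_USE_PRIVATE_API and
both functions are hypothetical names invented for this sketch; Cython's
real macros and code paths are not shown in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical switch: build with -DEXAMPLE_USE_PRIVATE_API=0 to avoid the
   private code path when it changes or disappears. */
#ifndef EXAMPLE_USE_PRIVATE_API
#  define EXAMPLE_USE_PRIVATE_API 1
#endif

#if EXAMPLE_USE_PRIVATE_API
/* Stand-in for a fast private helper that a new release may hide. */
int
example_private_format_long(char *buf, size_t size, long value)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%ld", value);
}
#  define example_format_long example_private_format_long
#else
/* Back-up option that relies on public functionality only. */
int
example_format_long(char *buf, size_t size, long value)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%ld", value);
}
#endif
```

Callers only ever use example_format_long(), so flipping the define is the
whole fix when the private path breaks.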
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/STZOK7UZ64NMZNYZFOQZ25HNGVVURIE7/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread Victor Stinner
On Mon, May 16, 2022 at 12:51 PM  wrote:
>
> Victor Stinner wrote:
> > On Mon, May 16, 2022 at 11:40 AM dw-...@d-woods.co.uk wrote:
> > > Cython used the private _PyUnicodeWriter API (and stopped using it on 
> > > Py3.11 when it was hidden more thoroughly)
> > I'm not aware of any change in the private _PyUnicodeWriter API in
> > Python 3.11.
>
> It was _PyFloat_FormatAdvancedWriter and _PyLong_FormatAdvancedWriter that 
> got moved internally to somewhere Cython couldn't easily get them I think. 
> (https://github.com/python/cpython/commit/0a883a76cda8205023c52211968bcf87bd47fd6e
>  and 
> https://github.com/python/cpython/commit/5f09bb021a2862ba89c3ecb53e7e6e95a9e07e1d).
>  Obviously it would be possible to include the internal headers and re-enable 
> it though - just turning it off was the quickest way to get it working at the 
> time though

I moved these "advanced formatter" functions to the internal C API in
a batch of changes which moved most private functions to the internal C
API.

If you consider that they are useful outside Python, please open an
issue to request exposing them as public functions. Right now, the
problem is that they use the _PyUnicodeWriter API which is also
private. If a public API is added to "build a string", maybe it would
make sense to add these "advanced formatter" functions to the public C
API?

My proposed API targets Python 3.12, it's too late for Python 3.11.
Maybe for Python 3.11, it's ok to add back private
_PyFloat_FormatAdvancedWriter and _PyLong_FormatAdvancedWriter
functions to the public C API to restore Cython performance.

Sadly, Cython still has to be changed at each Python release because
it still uses many private functions and private functions change
often. We have to go through this process to think about these APIs
and decide which ones should become public C functions, and which ones
are fine to be fully internal.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/OASHJSB5HI2VN3RBCV5T3CYFTP4TZYOC/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread Antoine Pitrou
On Mon, 16 May 2022 11:13:56 +0200
Victor Stinner  wrote:
> Hi,
> 
> I propose adding a new C API to "build an Unicode string". What do you
> think? Would it be efficient with any possible Unicode string storage
> and any Python implementation?
> 
> PyPy has an UnicodeBuilder type in Python, but here I only propose C
> API. Later, if needed, it would be easy to add a Python API for it.
> PyPy has UnicodeBuilder to replace "str += str" pattern which is
> inefficient in PyPy: CPython has a micro-optimization (in ceval.c) to
> keep this pattern performance interesting. Adding a Python API was
> discussed in 2020, see the LWN article:
> https://lwn.net/Articles/816415/
> 
> Example without error handling, naive implementation which doesn't use
> known length of key and value strings (calling Preallocate may be more
> efficient):
> ---
> // Format "key=value"
> PyObject *format_with_builder(PyObject *key, PyObject *value)
> {
> assert(PyUnicode_Check(key));
> assert(PyUnicode_Check(value));
> 
> // Allocated on the stack
> PyUnicodeBuilder builder;
> PyUnicodeBuilder_Init(&builder);
> 
> //  Overallocation is more efficient if the final length is unknown
> PyUnicodeBuilder_EnableOverallocation(&builder);
> PyUnicodeBuilder_WriteStr(&builder, key);
> PyUnicodeBuilder_WriteChar(&builder, '=');
> 
> // Disable overallocation before the last write
> PyUnicodeBuilder_DisableOverallocation(&builder);

Having to manually enable or disable overallocation doesn't sound right.
Overallocation should be done *before* writing, not after. If there are
N bytes remaining and you write N bytes, then no reallocation should
occur.

Regards

Antoine.


Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/XOBUBUUCUS252CHFZA7I2HXEDUQ2G45P/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread Victor Stinner
On Mon, May 16, 2022 at 2:11 PM Antoine Pitrou  wrote:
> > PyUnicodeBuilder_Init(&builder);
> >
> > //  Overallocation is more efficient if the final length is unknown
> > PyUnicodeBuilder_EnableOverallocation(&builder);
> > PyUnicodeBuilder_WriteStr(&builder, key);
> > PyUnicodeBuilder_WriteChar(&builder, '=');
> >
> > // Disable overallocation before the last write
> > PyUnicodeBuilder_DisableOverallocation(&builder);
>
> Having to manually enable or disable overallocation doesn't sound right.
> Overallocation should be done *before* writing, not after. If there are
> N bytes remaining and you write N bytes, then no reallocation should
> occur.

Calling these functions has no immediate effect on the current buffer.
EnableOverallocation() doesn't enlarge the buffer. Even if the buffer
is currently "over allocated", DisableOverallocation() leaves the
buffer unchanged. Only the next writes will use a different strategy
depending on the current setting. Only the Finish() function shrinks
the buffer.

Currently, it's the _PyUnicodeWriter.overallocate member. If possible,
I would prefer to not expose the structure members in the public C
API.

Overallocation should be enabled before writing and disabled before
the last write. It's disabled by default. For some use cases, it's
more efficient to not enable overallocation (default).

Always enabling overallocation makes the code less efficient. For
example, a single write of 10 MB allocates 15 MB on Windows and then
shrinks the final string to 10 MB.

Note: The current _PyUnicodeWriter implementation also has an
optimization when there is exactly one single WriteStr(obj) operation,
Finish() returns the input string object unchanged, even if
overallocation is enabled.
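
The behaviour described above (the flag only affects later writes, and only
Finish() shrinks) can be sketched with a plain byte buffer in C. This is an
assumption-laden illustration: the byte_builder type, the builder_* names
and the fixed 50% growth are hypothetical, chosen to match the 10 MB to
15 MB figure; it is not the _PyUnicodeWriter code, and size overflow checks
are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte builder (illustrative names only). */
typedef struct {
    char *buf;
    size_t len;         /* bytes written */
    size_t cap;         /* bytes allocated */
    int overallocate;   /* disabled (0) by default */
} byte_builder;

void
builder_init(byte_builder *b)
{
    b->buf = NULL;
    b->len = 0;
    b->cap = 0;
    b->overallocate = 0;
}

/* Append n bytes. The overallocate flag is only consulted when the buffer
   must grow, so toggling it never touches the current allocation. */
int
builder_write(byte_builder *b, const char *data, size_t n)
{
    assert(b != NULL);
    size_t needed = b->len + n;
    if (needed > b->cap) {
        size_t newcap = needed;
        if (b->overallocate) {
            newcap += needed / 2;   /* reserve 50% extra for later writes */
        }
        char *p = realloc(b->buf, newcap);
        if (p == NULL) {
            return -1;
        }
        b->buf = p;
        b->cap = newcap;
    }
    if (n > 0) {
        memcpy(b->buf + b->len, data, n);
        b->len += n;
    }
    return 0;
}

/* Shrink to the exact size and hand the buffer to the caller, who frees it.
   The result is not NUL-terminated; use the length recorded before Finish. */
char *
builder_finish(byte_builder *b)
{
    if (b->cap > b->len && b->len > 0) {
        char *p = realloc(b->buf, b->len);
        if (p != NULL) {
            b->buf = p;
        }
    }
    char *result = b->buf;
    builder_init(b);
    return result;
}
```

With the flag disabled before the last write, the final growth allocates
exactly what is needed and Finish has nothing to shrink.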

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/POMWWUW6DH2Y3OZOGAHPIFX4JDFYQ2SK/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread Antoine Pitrou
On Mon, 16 May 2022 14:22:44 +0200
Victor Stinner  wrote:
> On Mon, May 16, 2022 at 2:11 PM Antoine Pitrou  wrote:
> > > PyUnicodeBuilder_Init(&builder);
> > >
> > > //  Overallocation is more efficient if the final length is 
> > > unknown
> > > PyUnicodeBuilder_EnableOverallocation(&builder);
> > > PyUnicodeBuilder_WriteStr(&builder, key);
> > > PyUnicodeBuilder_WriteChar(&builder, '=');
> > >
> > > // Disable overallocation before the last write
> > > PyUnicodeBuilder_DisableOverallocation(&builder);  
> >
> > Having to manually enable or disable overallocation doesn't sound right.
> > Overallocation should be done *before* writing, not after. If there are
> > N bytes remaining and you write N bytes, then no reallocation should
> > occur.  
> 
> Calling these functions has no immediate effect on the current buffer.
> EnableOverallocation() doesn't enlarge the buffer. Even if the buffer
> is currently "over allocated", DisableOverallocation() leaves the
> buffer unchanged. Only the next writes will use a different strategy
> depending on the current setting. Only the Finish() function shrinks
> the buffer.

Hmm, it appears I had misread the example. Sorry for the noise.

Regards

Antoine.


Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/DJU5TQN5CPKYITWU7R5AJOT7MF7A5V3X/


[Python-Dev] Re: Problems building 3.7.13 on Windows

2022-05-16 Thread Joseph L. Casale
> I commented out line 182, and that allowed .\Tools\msibuild.bat to complete,
> but that appears to leave the bundle named as "python-3.7.13.7804-amd64.exe"\
> and it did not include the msi files (the final size was 1269 Kb).

After reading more documentation, I realized I needed to run buildrelease.bat,
and after downgrading jinja2, I was able to generate an installer.

Thanks,
jlc 
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/P47H7TMWCQVM2AYBBXDOJF5AUMR6NB4T/


[Python-Dev] Re: Issue: 92359 - Python 3.10 IDLE 64-bit doesn't open any files names code (code.py, code.pyw) - found a partial fix but looking for input

2022-05-16 Thread Steve Dower

On 5/14/2022 8:37 PM, Terry Reedy wrote:
> On 5/14/2022 12:40 AM, Guido van Rossum wrote:
> > Probably "Edit with IDLE" should be changed. I have no idea where that
> > is defined.
>
> I presume somewhere in PCBuild.  Steve Dower knows and is in charge of
> the Windows installer.

FTR, the behaviour for the traditional installer lives in 
Tools/msi/tcltk/tcltk_reg.wxs, and the behaviour for the Store install 
is in PC/python_uwp.cpp.


Cheers,
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/DUTVAITI3PKENGS5HQVQB2JDXQNC6WRB/


[Python-Dev] Re: New public PyUnicodeBuilder C API

2022-05-16 Thread dw-git
Victor Stinner wrote:
> My proposed API targets Python 3.12, it's too late for Python 3.11.
> Maybe for Python 3.11, it's ok to add back private
> _PyFloat_FormatAdvancedWriter and _PyLong_FormatAdvancedWriter
> functions to the public C API to restore Cython performance.

I think at this stage they should be left where they are. I can see why they
were made private.

> If a public API is added to "build a string", maybe it would
> make sense to add these "advanced formatter" functions to the public C
> API?

I think that Cython is most likely to use a public string builder API for
string formatting (for example f-strings). To me that suggests it'd be useful
for a public API to include number formatting.

Uses like concatenating strings in a loop are a little harder to optimize, just
because it'd be hard to identify when to switch a variable from string builder
to string seamlessly and invisibly to a user. So we'd probably only use it for
single expressions (of which formatting is the most obvious).
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/HCEUV7BKAG4RBWFEX4FOWJKL3OKHKDT2/