[Python-Dev] PyCapsule_Import semantics, relative imports, module names etc.
While porting several existing CPython extension modules that form a package to be 2.7 and 3.x compatible, the existing PyObject_* API was replaced with PyCapsule_*. This surfaced some issues the existing CPython docs are silent on. I'd like clarification on a few points and wish to raise some questions.

1. Should an extension module name, as provided to PyModule_Create (Py3) or Py_InitModule3 (Py2), be fully package qualified, or just the bare module name? I believe it's just the module name (see item 5 below). Yes/no?

2. PyCapsule_Import does not adhere to the general import semantics: the module name must be fully qualified, and relative imports are not supported.

3. PyCapsule_Import requires the package (e.g. __init__.py) to import *all* of its submodules which utilize the PyCapsule mechanism, preventing lazy on-demand loading. This is because PyCapsule_Import only imports the top-level module (i.e. the package). From there it iterates over each of the module names in the module path. However, the parent module's namespace will not contain an attribute for a submodule unless that submodule has already been loaded. If the submodule has not been loaded into the parent, PyCapsule_Import raises an error instead of trying to load the submodule. The only apparent solution is for the package to load every possible submodule, whether required or not, just to avoid a loading error. The inability to load modules on demand seems like a design flaw and a change in semantics from the prior use of PyImport_ImportModule in combination with PyObject. (One of the nice features of normal import loading is that it binds the submodule name as an attribute on the parent; the fact that this step is omitted is what causes PyCapsule_Import to fail unless all submodules are unconditionally loaded.) Shouldn't PyCapsule_Import utilize PyImport_ImportModule?

4. Relative imports seem much more useful for cooperating submodules in a package than fully qualified package names. Being able to import a C_API from the current package (the package I'm a member of) seems much more elegant and robust for cooperating modules, but this semantic isn't supported (in fact a leading dot completely confuses PyCapsule_Import; the docs should clarify this).

5. The requirement that a module specify its name unqualified when it is initializing, but then use a fully qualified package name for PyCapsule_New, both of which occur inside the same initialization function, seems like an odd inconsistency (documentation clarification would help here). Also, depending on your point of view, package names could be considered a deployment/packaging decision: a module obtains its fully qualified name by virtue of its position in the filesystem, something the module cannot know at compile time, which is another reason why relative imports make sense. Note the identical comment regarding _Py_PackageContext in modsupport.c (Py2) and moduleobject.c (Py3) about how a module obtains its fully qualified package name (see item 1).

Thanks!

-- John

___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
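[Archive note: the attribute-walk failure described in item 3 of the message above can be modeled in pure Python. This is a simplified sketch, not the actual C implementation; the names "pkg" and "pkg.sub" are hypothetical, synthetic modules created in memory purely for illustration.]

```python
import sys
import types

# Build a tiny synthetic package in memory so the example is
# self-contained and deterministic. "pkg.sub" is registered in
# sys.modules (i.e. it is loaded) but is deliberately NOT bound as an
# attribute on its parent -- the situation PyCapsule_Import trips over.
pkg = types.ModuleType("pkg")
pkg.__path__ = []                      # mark it as a package
sys.modules["pkg"] = pkg

sub = types.ModuleType("pkg.sub")
sub.c_api = "capsule-stand-in"         # stand-in for the capsule object
sys.modules["pkg.sub"] = sub

def capsule_style_lookup(name):
    """Simplified model of PyCapsule_Import's traversal: import only the
    top-level package, then walk the remaining path with plain attribute
    lookups, never importing the submodules themselves."""
    parts = name.split(".")
    module = __import__(parts[0])      # imports just the top-level package
    for attr in parts[1:]:
        module = getattr(module, attr) # AttributeError if not bound on parent
    return module

# The walk fails even though pkg.sub is fully loaded, because nothing
# ever bound it on the parent:
try:
    capsule_style_lookup("pkg.sub")
    outcome = "resolved"
except AttributeError:
    outcome = "missing"

# A real import would perform this binding step on the parent package;
# once it is done, the same attribute walk succeeds:
pkg.sub = sub
resolved = capsule_style_lookup("pkg.sub")
```

The sketch shows why unconditionally importing every submodule from __init__.py "fixes" the error: it is the import machinery's parent-attribute binding, not the loading itself, that the capsule-style walk depends on.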
[Python-Dev] Unicode <--> UTF-8 in CPython extension modules
I've uncovered what seems to me to be a problem with Python Unicode string objects passed to extension modules. Or perhaps it's revealing a misunderstanding on my part :-) So I would like to get some clarification.

Extension modules written in C receive strings from Python via the PyArg_ParseTuple family. Most extension modules use the 's' or 's#' format parameter. Many C libraries in Linux use the UTF-8 encoding. The 's' format, when passed a Unicode object, will encode the string according to the default encoding, which is immutably set to 'ascii' in site.py. Thus a C library expecting UTF-8 which uses the 's' format in PyArg_ParseTuple will get an encoding error when passed a Unicode string containing any code points outside the ASCII range. Now my questions:

* Is the use of the 's' or 's#' format parameter in an extension binding expecting UTF-8 fundamentally broken and not expected to work? Should the binding instead be using a format conversion which specifies the desired encoding, e.g. 'es' or 'es#'?

* Extension modules could successfully use the 's' or 's#' format conversion in a UTF-8 environment if the default encoding were UTF-8. Changing the default encoding to UTF-8 would in one easy stroke "fix" most extension modules, right? Why is the default encoding 'ascii' in UTF-8 environments, and why is the default encoding prohibited from being changed from ascii?

* Did Python 2.5 introduce anything which now makes this issue visible, whereas before it was masked by some other behavior?

Summary: Python programs which use Unicode string objects for their i18n, and which "link" to C libraries expecting UTF-8 through a CPython binding that only uses the 's' or 's#' formats, seem to often fail with encoding errors. However, I have yet to see a CPython binding which explicitly defines its encoding requirements. 
This suggests to me that I either do not understand the issue in its entirety, or many CPython bindings in Linux UTF-8 environments are broken with respect to their i18n handling and the problem is currently not addressed. -- John Dennis <[EMAIL PROTECTED]>
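[Archive note: the failure mode described above can be illustrated with explicit encode calls. This is a rough Python-level analogue of what the 's' format did under the default 'ascii' encoding versus what an explicit 'es'/"utf-8" conversion does; it is a sketch, not the actual C-level code path.]

```python
# A Unicode string containing a code point outside the ASCII range
# (U+00EF, LATIN SMALL LETTER I WITH DIAERESIS).
text = "naïve"

# Analogue of the 's' format under the immutable 'ascii' default
# encoding: any non-ASCII code point raises an encoding error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Analogue of an explicit 'es' conversion with "utf-8": the same string
# encodes without error, producing the bytes a UTF-8 C library expects.
utf8_data = text.encode("utf-8")
```

The single code point U+00EF becomes the two-byte UTF-8 sequence 0xC3 0xAF, which is exactly the representation a UTF-8-expecting C library would want handed across the binding.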
Re: [Python-Dev] Unicode <--> UTF-8 in CPython extension modules
Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis <[EMAIL PROTECTED]> wrote:
>
>> Python programs which use Unicode string objects for their i18n and
>> which "link" to C libraries expecting UTF-8 but which have a CPython
>> binding which only uses 's' or 's#' formats seem to often fail with
>> encoding errors.
>
> One thing to be aware of is that PyGTK+ actually sets the Python
> Unicode object encoding to UTF-8.
>
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux. So currently if you "import gtk", then libraries
> which are using UTF-8 (as you say, the vast majority) will work with
> Python unicode objects unmodified.

Thank you Colin, your input was very helpful. The fact that PyGTK's i18n handling worked was the counterexample which made me doubt my analysis was correct, but I can see from the Gnome bug report and Martin's subsequent comment that the analysis was sound. It had perplexed me enormously why i18n handling worked in some circumstances but failed in others. Apparently it was a side effect of importing gtk, a problem exacerbated when either the sequence of imports or the complete set of imports was not taken into account. I am aware of other Python bindings (libxml2 is one example) which share the same mistake of not using the 'es' family of format conversions when the underlying library is UTF-8. At least I now understand why incorrectly coded bindings in some circumstances produced correct results when logic dictated they shouldn't. -- John Dennis <[EMAIL PROTECTED]>