[issue37751] In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores.

2021-06-12 Thread Bodo Graumann

Bodo Graumann  added the comment:

Unfortunately this is not quite finished yet.

First of all, the change is bigger than what is documented: “Changed in version 
3.9: Hyphens and spaces are converted to underscore.”

In reality, the implementation now states:
| Normalization works as follows: all non-alphanumeric
| characters except the dot used for Python package names are
| collapsed and replaced with a single underscore, e.g. '  -;#'
| becomes '_'. Leading and trailing underscores are removed.
Cf. 
[encodings/__init__.py](https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Lib/encodings/__init__.py#L47-L50)
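
To make the quoted rule concrete, here is a minimal sketch of it in plain Python. It is not the CPython implementation; the function name `sketch_normalize` is made up for illustration, it only handles ASCII names, and the lower-casing step is included because codec search functions are handed lower-case names.

```python
import re


def sketch_normalize(name):
    """Illustrative only: approximate the normalization rule quoted above."""
    # Collapse every run of characters that are neither ASCII alphanumerics
    # nor dots into a single underscore.
    collapsed = re.sub(r"[^0-9A-Za-z.]+", "_", name)
    # Remove leading and trailing underscores, then lower-case the result.
    return collapsed.strip("_").lower()


print(sketch_normalize("ASCII//TRANSLIT"))
print(sketch_normalize("ISO-8859-1"))
print(sketch_normalize("a  -;#b"))
```

Under this rule any punctuation, not just hyphens and spaces, ends up as an underscore, which is the gap between behaviour and documentation described above.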

Secondly, this change breaks lots of iconv codecs with the python-iconv 
binding. E.g. `ASCII//TRANSLIT` is now normalized to `ascii_translit`, which 
iconv does not understand. Codec names which use hyphens also break and, if I 
am not mistaken, not all of them have hyphen-free aliases in iconv.
Cf. [python-iconv #4](https://github.com/bodograumann/python-iconv/issues/4)

The codecs API feels extremely well suited to integrating iconv in Python, and 
any alternative I can think of seems unsatisfactory.
Please advise.

--
nosy: +bodograumann

___
Python tracker 
<https://bugs.python.org/issue37751>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue44723] Codec name normalization breaks custom codecs

2021-07-23 Thread Bodo Graumann

New submission from Bodo Graumann :

This is a follow up on https://bugs.python.org/issue37751 concerning 
normalization of codec names.

First of all, the changes made therein are not documented correctly.
The implementation states:
| Normalization works as follows: all non-alphanumeric
| characters except the dot used for Python package names are
| collapsed and replaced with a single underscore, e.g. '  -;#'
| becomes '_'. Leading and trailing underscores are removed.
Cf. 
[encodings/__init__.py](https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Lib/encodings/__init__.py#L47-L50)

The documentation, however, only states that:
| Search functions are expected to take one argument, being the encoding name 
| in all lower case letters with hyphens and spaces converted to underscores
Cf. https://docs.python.org/3/library/codecs.html#codecs.register

Secondly, this change breaks lots of iconv codecs with the python-iconv 
binding. E.g. `ASCII//TRANSLIT` is now normalized to `ascii_translit`, which 
iconv does not understand. Codec names which use hyphens also break and, if I 
am not mistaken, not all of them have hyphen-free aliases in iconv.
Cf. [python-iconv #4](https://github.com/bodograumann/python-iconv/issues/4)
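
One way to observe the problem without python-iconv installed is to register a search function that only records the name it is handed. This is a rough probe, not part of any library: `recording_search` and `seen` are names invented here, and the exact string recorded depends on the interpreter version (the report above implies `ascii_translit` on 3.9).

```python
import codecs

seen = []  # names passed to our search function by codecs.lookup()


def recording_search(name):
    """Hypothetical search function: remember the name, offer no codec."""
    seen.append(name)
    return None


codecs.register(recording_search)
try:
    codecs.lookup("ASCII//TRANSLIT")
except LookupError:
    pass  # expected here: no registered search function knows this name
finally:
    # codecs.unregister() only exists on newer Python versions.
    if hasattr(codecs, "unregister"):
        codecs.unregister(recording_search)

print(seen)
```

A binding like python-iconv sits in the position of `recording_search`: by the time it is called, the original spelling of the name is no longer available to it.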

How about first looking up the given name and only then, if the given name 
could not be found, looking for the codec by its normalized name?
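
The proposed lookup order could be sketched roughly as follows. This is a toy model, not a patch against CPython: `lookup_with_fallback`, `normalize`, and the two stand-in search functions are hypothetical, and the search functions return plain strings instead of `CodecInfo` objects to keep the example short.

```python
import re


def normalize(name):
    """Illustrative approximation of the normalization described above."""
    return re.sub(r"[^0-9A-Za-z.]+", "_", name).strip("_").lower()


def lookup_with_fallback(name, search_functions):
    """Try the name exactly as given first, then its normalized form."""
    candidates = [name]
    normalized = normalize(name)
    if normalized != name:
        candidates.append(normalized)
    for candidate in candidates:
        for search in search_functions:
            result = search(candidate)
            if result is not None:
                return result
    raise LookupError("unknown encoding: " + name)


def builtin_like(name):
    """Stand-in for a search function that expects normalized names."""
    return "builtin:" + name if name == "utf_8" else None


def iconv_like(name):
    """Stand-in for an iconv binding that needs the original spelling."""
    return "iconv:" + name if name == "ASCII//TRANSLIT" else None


searchers = [builtin_like, iconv_like]
print(lookup_with_fallback("ASCII//TRANSLIT", searchers))
print(lookup_with_fallback("UTF-8", searchers))
```

In this model existing search functions keep working because the normalized name is still tried, while a binding that depends on the raw spelling gets a chance to match it first. The cost is that each search function may be called twice per lookup.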

--
components: Unicode
messages: 398042
nosy: bodograumann, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Codec name normalization breaks custom codecs
type: behavior
versions: Python 3.9

___
Python tracker 
<https://bugs.python.org/issue44723>
___