[issue40845] idna encoding fails for Cherokee symbols

2020-06-02 Thread Roman Akopov

New submission from Roman Akopov:

For a specific Cherokee string of three symbols (b'\\u13e3\\u13b3\\u13a9' in 
unicode_escape form), generating the Punycode representation fails.

What steps will reproduce the problem?

Execute 'ꮳꮃꭹ'.encode('idna')
or, even more reliably,
Execute '\u13e3\u13b3\u13a9'.encode('idna')

What is the expected result?

'xn--f9dt7l'

What happens instead?

'xn--tz9ata7l'
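
For reference, the two results can be told apart without the idna codec at all. The following is only a minimal sketch using the standard punycode codec; the variable names are illustrative and not part of the report. It suggests that the expected label is the Punycode form of the U+13xx code points, while the label returned by the idna codec matches the Punycode form of the U+ABxx (small letter) code points.

```python
# Minimal sketch: raw Punycode for the two Cherokee spellings.
# All names here are illustrative, not taken from the report.
capital = '\u13e3\u13b3\u13a9'   # U+13E3 U+13B3 U+13A9
small = '\uabb3\uab83\uab79'     # U+ABB3 U+AB83 U+AB79

# Prepend the ACE prefix by hand; no nameprep or validation is applied.
expected = b'xn--' + capital.encode('punycode')
observed = b'xn--' + small.encode('punycode')

print(expected)  # b'xn--f9dt7l'
print(observed)  # b'xn--tz9ata7l'
```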

Version affected.

Tested on Python 3.8.3 (Windows) and Python 3.6.8 (CentOS).

Other information.

I was testing whether our product supports internationalized domain names, so 
I wrote a Python script that generates a DNS zone file with Punycode-encoded 
names and a JavaScript file for a browser to send requests to URLs containing 
internationalized domain names. The strings were taken from the Common Locale 
Data Repository: 193 URLs, one per language.
 
When executed in Google Chrome, Mozilla Firefox and Microsoft Edge, the domain name 
'ꮳꮃꭹ.myhost.local' is converted to 'xn--f9dt7l.myhost.local', but we have 
'xn--tz9ata7l.myhost.local' in the DNS zone file, and this is how I found the 
bug. For the 192 other languages I tested, everything works just fine. These are 
Afrikaans, Aghem, Akan, Amharic, Arabic, Assamese, Asu, Asturian, Azerbaijani, 
Basaa, Belarusian, Bemba, Bena, Bulgarian, Bambara, Bangla, Tibetan, Breton, 
Bodo, Bosnian, Catalan, Chakma, Chechen, Cebuano, Chiga, Czech, Church Slavic, 
Welsh, Danish, Taita, German, Zarma, Lower Sorbian, Duala, Jola-Fonyi, 
Dzongkha, Embu, Ewe, Greek, English, Esperanto, Spanish, Estonian, Basque, 
Ewondo, Persian, Fulah, Finnish, Filipino, Faroese, French, Friulian, Western 
Frisian, Irish, Scottish Gaelic, Galician, Swiss German, Gujarati, Gusii, Manx, 
Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, 
Interlingua, Indonesian, Sichuan Yi, Icelandic, Italian, Japanese, Ngomba, 
Machame, Javanese, Georgian, Kabyle, Kamba, Makonde, Kabuverdianu, Kikuyu, 
Kako, Kalaallisut, Kalenjin, Khmer, Kannada, Korean, Konkani, Kashmiri, 
Shambala, Bafia, Colognian, Kurdish, Cornish, Kyrgyz, Langi, Luxembourgish, 
Ganda, Lakota, Lingala, Lao, Lithuanian, Luba-Katanga, Luo, Luyia, Latvian, 
Maithili, Masai, Meru, Malagasy, Makhuwa-Meetto, Metaʼ, Maori, Macedonian, 
Malayalam, Mongolian, Manipuri, Marathi, Malay, Maltese, Mundang, Burmese, 
Mazanderani, Nama, North Ndebele, Low German, Nepali, Dutch, Kwasio, Norwegian 
Nynorsk, Nyankole, Oromo, Odia, Ossetic, Punjabi, Polish, Prussian, Pashto, 
Portuguese, Quechua, Romansh, Rundi, Romanian, Rombo, Russian, Kinyarwanda, 
Rwa, Samburu, Santali, Sangu, Sindhi, Northern Sami, Sena, Sango, Tachelhit, 
Sinhala, Slovak, Slovenian, Inari Sami, Shona, Somali, Albanian, Serbian, 
Swedish, Swahili, Tamil, Telugu, Teso, Tajik, Thai, Tigrinya, Turkish, Tatar, 
Uyghur, Ukrainian, Urdu, Uzbek, Vai, Volapük, Vunjo, Walser, Wolof, Xhosa, 
Soga, Yangben, Yiddish, Cantonese, Standard Moroccan Tamazight, Chinese, 
Traditional Chinese, Zulu.
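
The zone-file side of the test described above can be sketched roughly as follows. This is not the actual script: the helper name zone_record, the record layout and the 192.0.2.1 documentation address are all assumptions made for illustration. Only the call to the standard idna codec reflects what the report describes.

```python
def zone_record(name: str, address: str = '192.0.2.1') -> str:
    """Hypothetical helper: build one A record for an internationalized name.

    The idna codec encodes each dot-separated label on its own.
    """
    ascii_name = name.encode('idna').decode('ascii')
    # Trailing dot marks the name as fully qualified in zone-file syntax.
    return f'{ascii_name}. IN A {address}'


print(zone_record('b\u00fccher.myhost.local'))
# xn--bcher-kva.myhost.local. IN A 192.0.2.1
```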

Somehow specifically Cherokee code points trigger the bug.

On top of that, https://www.punycoder.com/ converts 'ꮳꮃꭹ' into 'xn--f9dt7l' and 
back. However, 'xn--tz9ata7l' is reported as invalid Punycode.
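
Both labels can also be decoded locally with the standard punycode codec. The sketch below uses a hypothetical helper, decode_label, that strips the ACE prefix by hand and skips every IDNA validity check. It suggests that 'xn--tz9ata7l' is well-formed at the Punycode level and simply decodes to different code points; why the website rejects it is not established here.

```python
def decode_label(label: str) -> str:
    """Hypothetical helper: strip 'xn--' and decode the rest as raw Punycode."""
    prefix = 'xn--'
    if not label.startswith(prefix):
        return label
    return label[len(prefix):].encode('ascii').decode('punycode')


for label in ('xn--f9dt7l', 'xn--tz9ata7l'):
    text = decode_label(label)
    print(label, [f'U+{ord(ch):04X}' for ch in text])
# xn--f9dt7l ['U+13E3', 'U+13B3', 'U+13A9']
# xn--tz9ata7l ['U+ABB3', 'U+AB83', 'U+AB79']
```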

--
components: Unicode
messages: 370615
nosy: Roman Akopov, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: idna encoding fails for Cherokee symbols
type: behavior
versions: Python 3.6, Python 3.7, Python 3.8

___
Python tracker 
<https://bugs.python.org/issue40845>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40845] idna encoding fails for Cherokee symbols

2020-06-02 Thread Roman Akopov

Roman Akopov added the comment:

This is how I extract data from the Common Locale Data Repository v37.
The script assumes common\main as the working directory.

from os import walk
from xml.etree import ElementTree

en_root = ElementTree.parse('en.xml')

for (dirpath, dirnames, filenames) in walk('.'):
    for filename in filenames:
        if filename.endswith('.xml'):
            code = filename[:-4]
            xx_root = ElementTree.parse(filename)
            # Look up the language name in its own locale and in English.
            path = 'localeDisplayNames/languages/language[@type=\'' + code + '\']'
            xx_lang = xx_root.find(path)
            en_lang = en_root.find(path)

            # find() returns None when en.xml has no entry for this code.
            if en_lang is not None and en_lang.text == 'Cherokee':
                print(en_lang.text)
                print(xx_lang.text)
                print(xx_lang.text.encode("unicode_escape"))
                print(xx_lang.text.encode('idna'))
                print(ord(xx_lang.text[0]))
                print(ord(xx_lang.text[1]))
                print(ord(xx_lang.text[2]))

The script outputs

Cherokee
ᏣᎳᎩ
b'\\u13e3\\u13b3\\u13a9'
b'xn--tz9ata7l'
5091
5043
5033

If I change the text to lower case

print(en_lang.text.lower())
print(xx_lang.text.lower())
print(xx_lang.text.lower().encode("unicode_escape"))
print(xx_lang.text.lower().encode('idna'))
print(ord(xx_lang.text.lower()[0]))
print(ord(xx_lang.text.lower()[1]))
print(ord(xx_lang.text.lower()[2]))

then the script outputs

cherokee
ꮳꮃꭹ
b'\\uabb3\\uab83\\uab79'
b'xn--tz9ata7l'
43955
43907
43897

I am not sure where you get the '\u13e3\u13b3\u13a9' string from. 
'\u13e3\u13b3\u13a9'.lower().encode('unicode_escape') gives 
b'\\uabb3\\uab83\\uab79'
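
The relationship between the two spellings can be checked directly. The sketch below also tries str.casefold(), which on Python builds carrying Unicode 8.0 or later data folds the small Cherokee letters back to the U+13xx forms. The helper to_ascii_casefold is a hypothetical name, and it is only a rough approximation of the result the browsers give, not an implementation of UTS #46 or of the idna codec.

```python
def to_ascii_casefold(label: str) -> bytes:
    """Hypothetical helper: case-fold a single label, then apply raw Punycode.

    Skips normalization and every IDNA validity check.
    """
    folded = label.casefold()
    if folded.isascii():
        return folded.encode('ascii')
    return b'xn--' + folded.encode('punycode')


capital = '\u13e3\u13b3\u13a9'
small = '\uabb3\uab83\uab79'

print(capital.lower() == small)      # True
print(small.casefold() == capital)   # True
print(to_ascii_casefold(small))      # b'xn--f9dt7l'
print(to_ascii_casefold(capital))    # b'xn--f9dt7l'
```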
