[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-03-27 Thread Jean Christophe André

New submission from Jean Christophe André:

In Python 2.x and in at least 3.2 there is no support for the Vietnamese 
encoding TCVN 5712:1993.

This encoding is still widely used in Vietnam and I think it would be useful 
to add it to the Python core encodings.

I already wrote some codec code, based on the codecs already available, which 
I have used successfully in real-life situations.

I would like to give it as a contribution to Python.

--
components: Unicode
files: vntime_tcvn.py
messages: 215012
nosy: ezio.melotti, haypo, progfou
priority: normal
severity: normal
status: open
title: missing vietnamese codec TCVN 5712:1993 in Python
type: enhancement
versions: Python 2.7, Python 3.2
Added file: http://bugs.python.org/file34644/vntime_tcvn.py

___
Python tracker 
<http://bugs.python.org/issue21081>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-03-28 Thread Jean Christophe André

Jean Christophe André added the comment:

> * Please provide some background information how widely the encoding is used. 
> I get less than 1000 hits in Google when looking for "TCVN 5712:1993".

Here is the background for the need for this encoding.

The recent laws[0] in Vietnam have set TCVN 6909:2001 (Unicode based) as the 
standard encoding everybody should use. Still, there were more than 30 older 
Vietnamese encodings in use for decades before that, and some of them are 
still in use (it takes time for people to accept the change and for 
technicians to do what is required to change technology). Among them, TCVN 
5712:1993 was (and is) mostly used in the North of Vietnam, and VNI (a private 
company's encoding) in the South of Vietnam.

Worse than that, these old encodings use the C0 control range (bytes 
0x00-0x1F) to store some Vietnamese letters (especially 'ư', one of the most 
used in this language), which has the very unpleasant consequence that some 
software (like OpenOffice/LibreOffice) is unable to render the texts 
correctly, even when using the correct fonts. Since this was a showstopper for 
Free Software adoption in Vietnam, I decided at that time to create a 
tool[1][2] to help convert from these old encodings to Unicode. The project 
was then endorsed by the Ministry of Science and Technology of Vietnam, which 
asked me to make further developments[3].
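
To illustrate the C0 problem described above, here is a minimal sketch that 
lists the bytes in the C0 range which a decoding table maps to letters. The 
table below is hypothetical: it is an identity mapping with one slot changed, 
and the byte value 0x01 is a placeholder chosen for the example, not a value 
taken from the TCVN standard.

```python
import unicodedata


def c0_letters(decoding_table):
    """Return {byte: char} for C0 bytes (0x00-0x1F) that decode to a letter."""
    return {
        byte: char
        for byte, char in enumerate(decoding_table[:0x20])
        if unicodedata.category(char).startswith("L")
    }


# Hypothetical table: identity mapping, with one C0 slot reused for a letter.
# The position 0x01 is illustrative only, not the real TCVN code point.
table = [chr(i) for i in range(256)]
table[0x01] = "\u01b0"  # 'ư'
hits = c0_letters("".join(table))
```

Software that treats every byte below 0x20 as a control code would drop or 
mangle exactly the bytes reported in `hits`.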

Even if these old encodings are, hopefully, no longer the most widely used in 
Vietnam, there are still plenty of old documents (sorry, I can't be more 
precise about the volume of administrative or private documents) that need to 
be read/modified or, better, converted to Unicode; this is where the encodings 
are needed. Currently, every time Vietnamese people (and Laotian people; I'll 
come back to this in another bug report) want to use OpenOffice/LibreOffice 
and still be able to open their old documents, they have to install this 
Python extension.

I foresee that there will be not only plain documents to convert but also 
databases and other kinds of data storage. This is where Python has a great 
opportunity to become the tool of choice.

[0] 
http://thuvienphapluat.vn/archive/Quyet-dinh-72-2002-QD-TTg-thong-nhat-dung-bo-ma-ky-tu-chu-Viet-TCVN-6909-2001-trao-doi-thong-tin-dien-tu-giua-to-chuc-dang-nha-nuoc-vb49528.aspx
[1] http://wiki.hanoilug.org/projects:ovniconv
[2] http://extensions.services.openoffice.org/project/ovniconv
[3] http://extensions.services.openoffice.org/en/project/b2uconverter


> Now, the encoding was a standard in Vietnam, but it has been updated in 1999 
> to TCVN 5712:1999.

I have to admit I missed this one. It may explain the differences I saw when I 
reverse-engineered the TCVN encoding by studying the documents that Vietnamese 
users provided to me. I will check this one and come back with more details.

> There's also an encoding called VSCII.

VSCII is the same as TCVN 5712:1993.

This page contains interesting information about these encodings: 
http://www.informatik.uni-leipzig.de/~duc/software/misc/tcvn.txt


> * In the file you write "kind of TCVN 5712:1993 VN3 with CP1252 additions". 
> This won't work, since we can only accept codecs which are based on set 
> standards.

I can understand that and I'll do my best to check whether it's really based 
on one of the TCVN standards, be it 5712:1993 or 5712:1999. Still, after years 
of usage, I know for certain that it's exactly the encoding we need (for the 
northern part of Vietnam at least).


> It would be better to provide a link to an official Unicode character set 
> mapping table and then use the gencodec.py script on this table.

I saw a reference to this processing tool in the encodings shipped with Python 
and tried to find a Unicode mapping table on the Unicode website, but have not 
succeeded so far. I'll try harder.
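
For reference, here is a minimal sketch of what a mapping table in the usual 
Unicode format ("0xXX 0xYYYY # name" per line) turns into on the Python side. 
It is not the gencodec.py script itself, only an illustration of the same 
idea. The two sample lines are hypothetical (the byte 0x80 is a placeholder, 
not the real TCVN position of 'ư'), and `parse_mapping` is a name made up for 
this example.

```python
import codecs

# Hypothetical sample in the Unicode mapping-table format; the byte values
# are placeholders, not taken from TCVN 5712:1993.
SAMPLE = """\
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x80\t0x01B0\t# LATIN SMALL LETTER U WITH HORN
"""


def parse_mapping(text):
    """Build a 256-character decoding table from mapping-table lines."""
    table = ["\ufffe"] * 256  # U+FFFE marks an undefined byte in charmap tables
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) < 2:
            continue  # blank line, comment-only line, or unmapped byte
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return "".join(table)


decoding_table = parse_mapping(SAMPLE)
encoding_table = codecs.charmap_build(decoding_table)

# Decode bytes to text and encode them back with the generated tables.
text, _ = codecs.charmap_decode(b"A\x80", "strict", decoding_table)
data, _ = codecs.charmap_encode(text, "strict", encoding_table)
```

A real codec module would wrap these two calls in `codecs.Codec` subclasses, 
as the generated modules in the `encodings` package do.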


> * For Vietnamese, Python already provides cp1258 - how much is this encoding 
> used in comparison to e.g. TCVN 5712:1993 ?

To type Vietnamese efficiently, you need keyboard input software (Vietkey and 
Unikey being the most used). Microsoft tried to create a dedicated Vietnamese 
encoding (cp1258) and keyboard, but I never saw or heard about its adoption 
anywhere. Knowing the way Vietnamese users use their computers, I would say it 
has probably never been in real use.


> * Vietnamese encodings: 
> http://www.panl10n.net/english/outputs/Survey/Vietnamese.pdf

In this sentence you can see the most used old encodings in Vietnam: “On the 
Linux platform, fonts based on Unicode [6], TCVN, VNI and VPS [7] encodings can 
be adequately used to input Vietnamese text.”

These are the most used not only on Linux (in fact, on Linux we have to use 
Unicode, mostly because of the problem I explained before) but also on Windows. 
I don't know the situation on Mac OS or other operating systems, though.

My goal is to add these encodings i

[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-03-28 Thread Jean Christophe André

Jean Christophe André added the comment:

I will prepare the official encoding map(s) based on the standard(s).

I'll also have to check which encoding corresponds to my current encoding map, 
since this is the one that is useful in real life.

> Please also provide a patch for the documentation

I currently have no idea how to do this. Could you point me to a documentation 
sample or template please?

> and sign the Python contrib form:
> https://www.python.org/psf/contrib/contrib-form/

I did it yesterday. The form says it can take days to be processed, but I did 
receive the signed document as a confirmation.

Thanks for your concern, J.C.




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-20 Thread Jean Christophe André

Jean Christophe André added the comment:

A note to report on my progress. (I had a long period without any free time.)

While looking (again) for official documents on the topic, I mainly found 
non-official ones, but some are well known enough to be used as references.

I am now in the process of creating the requested patch and am currently 
studying the proper way to do it. I expect to have it ready this weekend, in 
the hope of having it accepted for Python 3.5.




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Changes by Jean Christophe André :


Added file: http://bugs.python.org/file37054/TCVN5712-1.TXT




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Changes by Jean Christophe André :


Added file: http://bugs.python.org/file37055/TCVN5712-2.TXT




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Changes by Jean Christophe André :


Added file: http://bugs.python.org/file37056/TCVN5712-3.TXT




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Jean Christophe André added the comment:

I failed to find anything about TCVN 5712:1999 except the official announcement 
on TCVN's website that it supersedes TCVN 5712:1993. I was also unable to find 
any material using TCVN 5712:1999. My guess is that, since TCVN 6909:2001 was 
released only 2 years later, TCVN 5712:1999 probably had no time to come into 
real use.

Anyway, TCVN 5712:1993 is the real one, the one that has been in use for almost 
2 decades. This is why I provided codec tables for this one.

There are 3 flavors of it. The most used one for documents is the third one 
(TCVN 5712:1993 VN3). It is used with the so-called “ABC fonts”, which are 
common knowledge in Vietnam. But the first one may be of use in databases. I 
never got access to real (large) Vietnamese databases, so I can't confirm 
this. I still provided the 3 flavors, just in case.

Still, since VN3 is a subset of VN2, which itself is a subset of VN1, you may 
choose to include only the first one, TCVN 5712:1993 VN1; I leave this up to 
you. FYI, GNU Recode and Glibc Iconv currently implement "tcvn" as VN1 (but 
the Epson printer company implements VN3…).
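
The subset relation between the flavors can be checked mechanically once the 
tables are loaded. Below is a minimal sketch, assuming each flavor is 
represented as a dict from byte value to character (undefined bytes simply 
absent). The three small tables are hypothetical toy data, not the contents of 
the attached TCVN5712-*.TXT files, and `is_subset` is a name made up for this 
example.

```python
def is_subset(smaller, larger):
    """True if every byte -> character pair of `smaller` also is in `larger`."""
    return all(larger.get(byte) == char for byte, char in smaller.items())


# Hypothetical toy tables; byte values are placeholders, not the real ones.
vn1 = {0x41: "A", 0x01: "\u01b0", 0x80: "\u00c0"}
vn2 = {0x41: "A", 0x80: "\u00c0"}
vn3 = {0x41: "A"}

chain_holds = is_subset(vn3, vn2) and is_subset(vn2, vn1)
```

If the check holds for the real tables, text encoded with VN3 or VN2 decodes 
identically with a VN1 codec, which is the argument for shipping VN1 only.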




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Jean Christophe André added the comment:

Marc-Andre, about “Please also provide a patch for the documentation”, could 
you please guide me on this?

I can write some documentation, but I simply don't know in what form you expect 
it. Could you point me to some examples please?




[issue21081] missing vietnamese codec TCVN 5712:1993 in Python

2014-10-28 Thread Jean Christophe André

Changes by Jean Christophe André :


Removed file: http://bugs.python.org/file34644/vntime_tcvn.py




[issue22679] Add encodings of supported in glibc locales

2014-10-28 Thread Jean Christophe André

Changes by Jean Christophe André :


--
nosy: +progfou

___
Python tracker 
<http://bugs.python.org/issue22679>
___