Package: asciidoc
Version: 8.6.7-1

1. Diagnostics

Asciidoc converts inline markup in text.
For example, the text

---
[line-through]*some erased words*
---

is rendered struck through in the HTML/XHTML output
(with the default asciidoc.css).

However, for a Japanese UTF-8 character whose last byte is 0xA0,
asciidoc fails to do so.
For example, the Japanese character with the UTF-8 encoding E3 83 A0
(the KATAKANA character "mu", ム) is not converted.
That is, the text
---
[line-through]*<utf-8 char of the code E383A0>*
---
remains unchanged in the output.
This is not an isolated case: it seems to happen for every UTF-8
character whose last byte is 0xA0.
Chinese and Korean characters can trigger the same problem.

The reason is that asciidoc does not convert text such as
---
[line-through]*This is some text.\s*
---
where "\s" stands for any whitespace character, i.e. any character
classified as space in the Unicode character properties database.
The byte 0xA0 is such a character: it is the code of the
non-breaking space.
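This byte-level misclassification can be reproduced in Python 3 (a
sketch, not asciidoc's actual code). Python 2's asciidoc matched byte
strings with the (?u) flag, so each byte was classified by the Unicode
database; decoding the bytes as latin-1 maps each byte to the code
point of the same ordinal, which makes the trailing 0xA0 byte look
like a non-breaking space:

```python
import re

# UTF-8 encoding of KATAKANA "mu" (U+30E0); the last byte is 0xA0.
mu_utf8 = 'ム'.encode('utf-8')
print(mu_utf8)                        # b'\xe3\x83\xa0'

# Map each byte to the code point of the same ordinal, mimicking how
# Python 2 classified bytes under the (?u) flag.  Byte 0xA0 then looks
# like U+00A0 NO-BREAK SPACE, which \s treats as whitespace.
byte_view = mu_utf8.decode('latin-1')

print(re.match(r'\S\S\S', byte_view))   # None: the 0xA0 "byte" counts as space
print(bool(re.match(r'\S', 'ム')))      # True: the real character is non-space
```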

2. A Patch

The following patch seems to work around the problem.

----
*** /usr/bin/asciidoc   2012-03-31 16:45:59.000000000 +0900
--- /home/myhome/bin/asciidoc   2015-02-19 14:37:39.150689826 +0900
***************
*** 594,600 ****
              # enveloping quotes and punctuation e.g. a='x', ('x'), 'x', ['x'].
              reo = re.compile(r'(?msu)(^|[^\w;:}])(\[(?P<attrlist>[^[\]]+?)\])?' \
                  + r'(?:' + re.escape(lq) + r')' \
!                 + r'(?P<content>\S|\S.*?\S)(?:'+re.escape(rq)+r')(?=\W|$)')
          pos = 0
          while True:
              mo = reo.search(text,pos)
--- 594,600 ----
              # enveloping quotes and punctuation e.g. a='x', ('x'), 'x', ['x'].
              reo = re.compile(r'(?msu)(^|[^\w;:}])(\[(?P<attrlist>[^[\]]+?)\])?' \
                  + r'(?:' + re.escape(lq) + r')' \
!                 + r'(?P<content>\S|\S.*?\S)(?:'+re.escape(rq)+r')(?=\W|$)', re.LOCALE)
          pos = 0
          while True:
              mo = reo.search(text,pos)
----
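For reference, a reduced stand-alone sketch of the quote pattern in
the hunk above (with lq = rq = '*' assumed here for illustration)
works as expected once it is applied to a real Unicode string. The
content group must begin and end with a non-space character
(\S|\S.*?\S), which is exactly why a byte misclassified as a space
breaks the match:

```python
import re

# Reduced sketch of asciidoc's strong-quote pattern; lq = rq = '*' is
# an assumption for this example, not the library's full quote table.
lq = rq = '*'
reo = re.compile(r'(?msu)(^|[^\w;:}])(\[(?P<attrlist>[^[\]]+?)\])?'
                 + r'(?:' + re.escape(lq) + r')'
                 + r'(?P<content>\S|\S.*?\S)'
                 + r'(?:' + re.escape(rq) + r')(?=\W|$)')

# On a Unicode string the pattern matches the failing example fine:
mo = reo.search('[line-through]*ム*')
print(mo.group('attrlist'))   # line-through
print(mo.group('content'))    # ム
```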

I do not know whether the above is the best fix, though.
The problem is closely tied to how whitespace characters are
interpreted inside the regular expressions.
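An arguably cleaner alternative (a sketch under my own assumptions,
not a tested patch against asciidoc) would be to ensure the text is
decoded to a Unicode string before the pattern is applied, so that \s
and \S classify whole code points rather than raw bytes:

```python
import re

# Sketch of the alternative: decode the UTF-8 bytes first, so that \S
# sees one KATAKANA character instead of three separate bytes.
raw = b'[line-through]*\xe3\x83\xa0*'   # the failing input, as raw UTF-8
text = raw.decode('utf-8')

# A simplified stand-in for the quote pattern, for illustration only:
mo = re.search(r'\*(?P<content>\S|\S.*?\S)\*', text)
print(mo.group('content'))              # ム
```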

3. Environments

----
# uname -a
Linux yaya 3.2.0-4-686-pae #1 SMP Debian 3.2.65-1+deb7u1 i686 GNU/Linux
# python --version
Python 2.7.3
# asciidoc --version
asciidoc 8.6.7
----

koya
[line-through]*ム*
[line-through]*加*
