"Emad Nawfal (عماد نوفل)" <emadnaw...@gmail.com> wrote in message news:652641e90907250514m1566287aq75f675fd63360...@mail.gmail.com...
On 7/25/09, Dave Angel <da...@ieee.org> wrote:
Emad Nawfal (عماد نوفل) wrote:
Hi Tutors,
I have a bunch of text files that contain many occurrences like the following,
which I believe, given the context, are numbers:

&#1633;&#1640;&#1639;&#1634;

&#1637;&#1639;

 &#1634;&#1632;&#1632;&#1640;

etc.

So, can somebody please explain what kind of numbers these are, and how I
can get the original numbers back. The files are in Arabic and were
downloaded from an Arabic website.
I'm running Python 2.6 on Ubuntu 9.04.

Those are standard HTML encodings (numeric character references) for some Unicode characters. [snip]

You might find re.sub() useful to process your text files. It can replace each HTML encoding with the actual Unicode character it stands for:

>>> import re
>>> data = u"&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"
>>> s = re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), data)
>>> s
u'\u0661\u0668\u0667\u0662\u0665\u0667\u0662\u0660\u0660\u0668'
>>> print s
١٨٧٢٥٧٢٠٠٨
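
If what you want in the end is an ordinary integer rather than a string of Arabic-Indic digits, something along these lines might work. This is only a sketch and it assumes Python 3 (chr() instead of unichr()), not the 2.6 from the original question. The helper name decode_numeric_entities is made up for illustration.

```python
import re

def decode_numeric_entities(text):
    """Replace decimal references such as &#1633; with the character they name."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

# First sample from the question, typed as plain ASCII so nothing gets mangled.
data = "&#1633;&#1640;&#1639;&#1634;"
decoded = decode_numeric_entities(data)

# int() accepts any Unicode decimal digit, so no manual digit mapping is needed.
value = int(decoded)
print(value)  # 1872
```

On Python 3.4 and later, html.unescape() from the standard library also decodes these references, along with named ones such as &amp;.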

And this can be helpful for identifying Unicode characters:

>>> import unicodedata
>>> for c in s:
...     print unicodedata.name(c)
...
ARABIC-INDIC DIGIT ONE
ARABIC-INDIC DIGIT EIGHT
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT FIVE
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT EIGHT
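
The same module can also give you the numeric value of each digit, which is handy when the numbers are mixed in with ordinary Arabic text and you only want the digits converted. A rough sketch, again assuming Python 3; to_ascii_digits is a hypothetical helper name, not something from the standard library.

```python
import unicodedata

def to_ascii_digits(text):
    """Map every Unicode decimal digit to its ASCII form; leave other characters alone."""
    out = []
    for ch in text:
        # unicodedata.decimal() returns the default (None here) for non-digits.
        d = unicodedata.decimal(ch, None)
        out.append(str(d) if d is not None else ch)
    return "".join(out)

# Arabic-Indic digits five and seven followed by some Arabic text.
sample = "\u0665\u0667 \u0643\u062a\u0627\u0628"
print(to_ascii_digits(sample))
```

Characters that are not decimal digits pass through unchanged, so this can be run over a whole decoded file.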

-Mark


_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
