"Emad Nawfal (عماد نوفل)" <emadnaw...@gmail.com> wrote in message
news:652641e90907250514m1566287aq75f675fd63360...@mail.gmail.com...
On 7/25/09, Dave Angel <da...@ieee.org> wrote:
Emad Nawfal (عماد نوفل) wrote:
Hi Tutors,
I have a bunch of text files that have many occurrences like the following,
which I believe, given the context, are numbers:
&#1633;&#1640;&#1639;&#1634;
&#1637;&#1639;
&#1634;&#1632;&#1632;&#1640;
etc.
So, can somebody please explain what kind of numbers these are, and how I
can get the original numbers back. The files are in Arabic and were
downloaded from an Arabic website.
I'm running Python 2.6 on Ubuntu 9.04.
Those are standard HTML encodings for some Unicode characters. [snip]
You might find re.sub() useful for processing your text files. It can replace
the HTML numeric character references (&#NNNN;) with the actual Unicode
characters:
>>> import re
>>> data = u"&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"
>>> s = re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), data)
>>> s
u'\u0661\u0668\u0667\u0662\u0665\u0667\u0662\u0660\u0660\u0668'
>>> print s
١٨٧٢٥٧٢٠٠٨
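For anyone reading this on Python 3, where unichr() and the print statement are gone, the same idea might look roughly like the sketch below. The helper name decode_numeric_refs and the sample string are made up for illustration; html.unescape() is the standard-library shortcut and also handles named entities such as &amp;.

```python
import html
import re

# Sample input: the three numbers from the question, as numeric character references.
data = "&#1633;&#1640;&#1639;&#1634; &#1637;&#1639; &#1634;&#1632;&#1632;&#1640;"

def decode_numeric_refs(text):
    """Replace decimal references like &#1633; with the character they name."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

decoded = decode_numeric_refs(data)

# The standard library gives the same result for this input.
assert decoded == html.unescape(data)

print(decoded)
```

The regular expression only covers decimal references; hexadecimal ones (&#x661;) and named entities would need html.unescape() or a wider pattern.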
And this can be helpful for identifying Unicode characters:
>>> import unicodedata
>>> for c in s:
...     print unicodedata.name(c)
...
ARABIC-INDIC DIGIT ONE
ARABIC-INDIC DIGIT EIGHT
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT FIVE
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT EIGHT
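To get plain numbers back from the decoded text, one possible approach (a sketch in Python 3 syntax, not the only way) is to map each Unicode decimal digit to its ASCII equivalent with unicodedata.decimal(). The function name to_ascii_digits is hypothetical. int() also accepts Unicode decimal digits directly, so a string made up only of digits can be converted in one step.

```python
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII digit."""
    out = []
    for ch in text:
        # decimal() returns the digit's value, or the default (None) for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)

s = "\u0661\u0668\u0667\u0662"   # ARABIC-INDIC DIGITS ONE, EIGHT, SEVEN, TWO

print(to_ascii_digits(s))        # 1872
print(int(s))                    # 1872
```

to_ascii_digits() leaves non-digit characters alone, so it can be run over a whole line of mixed Arabic text and numbers.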
-Mark
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor