Re: [Tutor] What kind of number is this

Mark Tolonen Sat, 25 Jul 2009 09:10:37 -0700

"Emad Nawfal (عماد نوفل)" <emadnaw...@gmail.com> wrote in message news:652641e90907250514m1566287aq75f675fd63360...@mail.gmail.com...

On 7/25/09, Dave Angel <da...@ieee.org> wrote:

Emad Nawfal (عماد نوفل) wrote:
Hi Tutors,
I have a bunch of text files that have many occurrences like the following,
which I believe, given the context, are numbers:
&#1633;&#1640;&#1639;&#1634;

&#1637;&#1639;

 &#1634;&#1632;&#1632;&#1640;

etc.
So, can somebody please explain what kind of numbers these are, and how I
can get the original numbers back. The files are in Arabic and were
downloaded from an Arabic website.
I'm running python2.6 on Ubuntu 9.04

Those are standard HTML encodings for some Unicode characters. [snip]

You might find re.sub() useful to process your text files. It will replace the HTML encodings with the actual Unicode characters.

>>> import re
>>> data = u"&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"
>>> s = re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), data)
>>> s
u'\u0661\u0668\u0667\u0662\u0665\u0667\u0662\u0660\u0660\u0668'
>>> print s
١٨٧٢٥٧٢٠٠٨
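
For anyone reading this on Python 3, where unichr() no longer exists, here is a rough sketch of the same substitution, plus one way to get an ordinary integer back. The sample string is the one used above; the variable names are just for illustration. It assumes the text contains only decimal references of the form &#NNNN;.

```python
import html
import re

data = "&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"

# Same substitution as the re.sub() call above, with Python 3's chr()
# standing in for Python 2's unichr().
decoded = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), data)

# html.unescape() from the standard library decodes decimal, hex and
# named character references in one call, so it gives the same result here.
assert decoded == html.unescape(data)

# int() accepts any Unicode decimal digits, not just ASCII 0-9,
# so the Arabic-Indic digits convert directly to a number.
number = int(decoded)
print(number)
```

The int() step only works when the string holds nothing but digits (and optional whitespace), so a number embedded in running text would need to be picked out first.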

And this can be helpful for identifying Unicode characters:

>>> import unicodedata
>>> for c in s:
...     print unicodedata.name(c)
...
ARABIC-INDIC DIGIT ONE
ARABIC-INDIC DIGIT EIGHT
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT FIVE
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT EIGHT
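
The same module can also report the numeric value of each digit, which may help when the numbers sit inside ordinary text. Below is a minimal Python 3 sketch; to_ascii_digits is a made-up helper name, not something from the standard library, and the sample string is invented for the example.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII digit."""
    out = []
    for ch in text:
        # unicodedata.decimal() returns the digit's value, or the default
        # (None here) when the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


# Hypothetical sample: Arabic-Indic digits mixed with other characters.
sample = "\u0661\u0668\u0667\u0662 abc \u0665\u0667"
print(to_ascii_digits(sample))
```

Characters that are not decimal digits pass through unchanged, so the rest of the Arabic text in the files should be left alone.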

-Mark


