Re: [Tutor] Re: How to read unicode strings from a binary file and display them as plain ascii?

R. Alan Monroe Tue, 01 Mar 2005 03:43:06 -0800

> R. Alan Monroe wrote:
>> I started writing a program to parse the headers of truetype fonts to
>> examine their family info. But I can't manage to print out the strings
>> without the zero bytes in between each character (they display as a
>> black block labeled 'NUL' in Scite's output pane)
>> 
>> I tried:
>>      stuff = f.read(nlength)
>>      stuff = unicode(stuff, 'utf-8')


>    If there are embedded 0's in the string, it won't be utf8, it could be
> utf16 or 32.
>    Try:
>         unicode(stuff, 'utf-16')
> or
>         stuff.decode('utf-16')

>>      print type(stuff), 'stuff', stuff.encode()
>> This prints:
>> 
>>      <type 'unicode'> stuff [NUL]C[NUL]o[NUL]p[NUL]y[NUL]r[NUL]i[NUL]g[NUL]

>    I don't understand what you tried to accomplish here.

That's evidence of what I failed to accomplish. My expected result
was to print the word "Copyright" and whatever other strings are
present in the font, with no intervening NUL characters.
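
For reference, here is why the utf-8 attempt leaves the NULs in place: a zero byte is itself a valid UTF-8 character (U+0000), so every byte becomes its own character. A small sketch, assuming the bytes look like the made-up `raw` sample below (it is not read from a real font):

```python
# Hypothetical sample: the first four characters as two-byte units.
raw = b'\x00C\x00o\x00p\x00y'

# Decoding as UTF-8 succeeds, but each \x00 byte becomes a NUL character.
as_utf8 = raw.decode('utf-8')

print(repr(as_utf8))
print(len(as_utf8))  # one character per byte, NULs included
```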


>    Try the other encodings. It probably is utf-16.

Aha, after some trial and error I see that I'm running into an endian
problem. It's "\x00C" in the file, which needs to be swapped to
"C\x00". I cheated temporarily by just adding 1 to the file pointer
:^)
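
A cleaner route than shifting the file pointer is probably to name the byte order when decoding. A minimal sketch, assuming the string bytes follow the "\x00C" pattern described above (the `raw` value is typed in by hand for illustration, not read from the font):

```python
# Bytes in file order: high byte first, i.e. big-endian UTF-16.
raw = b'\x00C\x00o\x00p\x00y\x00r\x00i\x00g\x00h\x00t'

# Stating the byte order explicitly means no pointer adjustment is needed.
text = raw.decode('utf-16-be')
print(text)

# Reading the same bytes as little-endian pairs gives unrelated characters.
wrong = raw.decode('utf-16-le')

# For plain ASCII output, encode and replace anything ASCII cannot hold.
ascii_bytes = text.encode('ascii', 'replace')
```

Plain 'utf-16' falls back to the machine's native byte order when the data has no byte order mark, which would explain why it did not work as-is here.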

Alan

#~ 11/30/1998  03:45 PM            38,308 FUTURAB.TTF
#~ 11/30/1998  03:45 PM            38,772 FUTURABI.TTF
#~ 12/10/1998  06:24 PM            32,968 FUTURAK.TTF
#~ 12/30/1998  05:15 AM            36,992 FUTURAL.TTF
#~ 12/15/1998  11:39 PM            37,712 FUTURALI.TTF
#~ 01/05/1999  03:59 AM            38,860 FUTURAXK.TTF

#~ The OpenType font starts with the Offset Table. If the font file contains
#~ only one font, the Offset Table will begin at byte 0 of the file. If the
#~ font file is a TrueType collection, the beginning point of the Offset Table
#~ for each font is indicated in the TTCHeader.

#~ Offset Table Type    Name    Description
#~ Fixed        sfnt version    0x00010000 for version 1.0.
#~ USHORT       numTables       Number of tables.
#~ USHORT       searchRange     (Maximum power of 2 <= numTables) x 16.
#~ USHORT       entrySelector   Log2(maximum power of 2 <= numTables).
#~ USHORT       rangeShift      NumTables x 16-searchRange.

import struct

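def grabushort():
    # Read a big-endian unsigned 16-bit value (USHORT) from the open file f.
    data = f.read(2)
    return int(struct.unpack('>H',data)[0])

def grabulong():
    # Read a big-endian unsigned 32-bit value (ULONG) from the open file f.
    data = f.read(4)
    return int(struct.unpack('>L',data)[0])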

f=open('c:/windows/fonts/futurak.ttf', 'rb')

version=f.read(4)

numtables = grabushort()
print numtables

f.read(6) #skip searchrange, entryselector, rangeshift

#~ Table Directory Type         Name    Description
#~ ULONG        tag     4 -byte identifier.
#~ ULONG        checkSum        CheckSum for this table.
#~ ULONG        offset  Offset from beginning of TrueType font file.
#~ ULONG        length  Length of this table.

#for x in range(numtables):
for x in range(numtables):
    tag=f.read(4)
    checksum =grabulong()
    offset = grabulong()
    tlength = grabulong()
    print 'tag', tag,  'offset', offset, 'tlength', tlength
    if tag=='name':
        nameoffset = offset
        namelength = tlength

print 'nameoffset', nameoffset, 'namelength', namelength


#The Naming Table is organized as follows:
#~ Type         Name    Description
#~ USHORT       format  Format selector (=0).
#~ USHORT       count   Number of name records.
#~ USHORT       stringOffset    Offset to start of string storage (from start
#~                              of table).
#~ NameRecord   nameRecord[count]       The name records where count is the
#~                              number of records.
#~ (Variable)           Storage for the actual string data.


#~ Each NameRecord looks like this:
#~ Type         Name    Description
#~ USHORT       platformID      Platform ID.
#~ USHORT       encodingID      Platform-specific encoding ID.
#~ USHORT       languageID      Language ID.
#~ USHORT       nameID  Name ID.
#~ USHORT       length  String length (in bytes).
#~ USHORT       offset  String offset from start of storage area (in bytes).
print


f.seek(nameoffset)
format = grabushort()
count = grabushort()
stringoffset = grabushort()
print 'format', format, 'count', count, 'stringoffset', stringoffset

for x in range(count):
    platformid = grabushort()
    encodingid = grabushort()
    languageid = grabushort()
    nameid = grabushort()
    nlength = grabushort()
    noffset = grabushort()
    print 'platformid', platformid, 'encodingid', encodingid, \
          'languageid', languageid, 'nameid', nameid, \
          'nlength', nlength, 'noffset', noffset
    if platformid==3:# microsoft
        bookmark = f.tell()
        print 'bookmark', bookmark
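        # The +1 is the temporary endian workaround: it skips the leading
        # \x00 so the bytes pair up as "C\x00". Decoding with 'utf-16-be'
        # and no +1 should make it unnecessary.
        f.seek(nameoffset+stringoffset+noffset+1)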
        stuff = f.read(nlength)
        #stuff = unicode(stuff, 'utf-16')
        stuff = stuff.decode( 'utf-16')
        print type(stuff), 'stuff', stuff
        f.seek(bookmark)


f.close()
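
The same name-table walk can also be written against an in-memory byte string with struct.unpack_from, decoding the Microsoft-platform strings as 'utf-16-be' so that no pointer shift is needed. This is a rough sketch only: the `table` bytes are fabricated so the example runs without a font file, and `read_names` is a hypothetical helper, not part of the script above.

```python
import struct

# Fabricated name table holding one record, for illustration only.
string_data = u'Copyright'.encode('utf-16-be')
header = struct.pack('>3H', 0, 1, 6 + 12)   # format, count, stringOffset
record = struct.pack('>6H', 3, 1, 0x0409, 0, len(string_data), 0)
table = header + record + string_data

def read_names(table):
    """Return (nameID, text) pairs for platform ID 3 (Microsoft) records."""
    fmt, count, string_offset = struct.unpack_from('>3H', table, 0)
    names = []
    for i in range(count):
        # Each NameRecord is six USHORTs (12 bytes) after the 6-byte header.
        (platform_id, encoding_id, language_id,
         name_id, length, offset) = struct.unpack_from('>6H', table, 6 + 12 * i)
        if platform_id == 3:
            start = string_offset + offset
            text = table[start:start + length].decode('utf-16-be')
            names.append((name_id, text))
    return names

print(read_names(table))
```

With a real font, `table` would be the `namelength` bytes read after `f.seek(nameoffset)`.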

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
