On Tue, 26 Oct 2004, Jim wrote:
> I have a binary file that I have been tasked to discover the format of
> and somehow convert the records to readable text. Is there any way I
> can find out what binary format the file is in, so I can create an
> a template for unpack() to convert the binary to text?
The best place to start is with the `file` command, and the magic
numbers behind it, which not nearly enough people know about these days.
On Unix systems (or Cygwin on Windows), `file` uses a database of magic
numbers -- fingerprints for different file types -- to identify files,
regardless of how the file is named (i.e. the file extension doesn't
matter here). For example, consider this output:
% file ~/Movies/*
61980main_PIA06410-movie.mov: Apple QuickTime movie file (moov)
CoLC_fog.mov: Apple QuickTime movie file (mdat)
Don_Quijote_animation.avi: RIFF (little-endian) data, AVI, 320 x 240, 25.00 fps, video: DivX 5, audio: (mono, 8000 Hz)
Jon_Stewart_Crossfire.rm: RealMedia file
Mahnamahna.mpeg: MPEG system stream data
Movies: symbolic link to `/Volumes/d2/Movies'
Tenacious D - Tribute.mpeg: MPEG system stream data
The Incredibles - trailer.mov: Apple QuickTime movie file (moov)
crossfire-20041015.wmv: Microsoft ASF
crossfire-20041015001.mp4: Apple QuickTime movie file (ftyp)
crossfire-20041015001.mp4.html: XML document text
goingupriver.dmg: Apple Partition data block size: 512, first type: Apple_partition_map, name: Apple, number of blocks: 63, second type: Apple_HFS, name: disk image, number of blocks: 1325920,
goingupriver.mov: Apple QuickTime movie file (moov)
%
Note that this isn't looking at file extensions: there are multiple files
with the ".mov" extension, but the command reports what it actually finds
inside each one (note the differing "moov" and "mdat" markers). It works
via the magic (ahem) of the magic database, which describes the expected
markers for many file types.
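To make the idea concrete, here is a rough sketch of a signature lookup in Perl. The table is a tiny hand-picked sample, and guess_type() is a name I made up for illustration; the real magic database is far larger and its entries are much richer than a plain prefix match.

```perl
use strict;
use warnings;

# A tiny, hand-picked signature table: leading bytes => description.
# Illustrative only; the real magic database has far more entries.
my @SIGNATURES = (
    [ "GIF8",              'GIF image data'   ],
    [ "\x89PNG\r\n\x1a\n", 'PNG image data'   ],
    [ "%PDF-",             'PDF document'     ],
    [ "PK\x03\x04",        'Zip archive data' ],
);

# guess_type() is a hypothetical helper: compare the start of $data
# against each known signature and return the first match.
sub guess_type {
    my ($data) = @_;
    for my $sig (@SIGNATURES) {
        my ($magic, $name) = @$sig;
        return $name if substr($data, 0, length $magic) eq $magic;
    }
    return 'data';    # fallback label when no signature matches
}

print guess_type("GIF89a\x01\x00\x01\x00"), "\n";   # prints: GIF image data
print guess_type("%PDF-1.4\n"), "\n";               # prints: PDF document
print guess_type("\x00\x01\x02"), "\n";             # prints: data
```

Nothing here looks at a file name; only the leading bytes matter.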
To illustrate, consider the GIF format. Each GIF file begins with:
* a signature, the three character string "GIF"
* a version string, either "87a" or "89a"
* image width & height, two bytes each (little-endian)
* a packed flags byte describing the color table, one byte
* a background color index, one byte
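Since the original question was about unpack(), here is a minimal sketch of how the fields listed above could map onto an unpack() template. The helper name parse_gif_header() and the sample bytes are my own inventions for the example, not part of any module.

```perl
use strict;
use warnings;

# parse_gif_header() is a hypothetical helper, not a library function.
sub parse_gif_header {
    my ($bytes) = @_;
    return unless length($bytes) >= 12;

    # a3 = 3-byte string, v = 16-bit little-endian unsigned, C = unsigned byte
    my ($sig, $ver, $width, $height, $flags, $bg)
        = unpack 'a3 a3 v v C C', $bytes;
    return unless $sig eq 'GIF';

    return {
        version    => $ver,
        width      => $width,
        height     => $height,
        flags      => $flags,    # the color table byte from the list above
        background => $bg,
    };
}

# Hand-built 12-byte header for a 320 x 240 GIF89a image (made-up sample).
my $sample = pack 'a3 a3 v v C C', 'GIF', '89a', 320, 240, 0xF7, 0;
my $h = parse_gif_header($sample);
printf "GIF %s, %d x %d\n", @{$h}{qw(version width height)};
# prints: GIF 89a, 320 x 240
```

For a real file you would open it, call binmode() on the handle, and read() the first dozen bytes into $bytes before unpacking.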
Here's what the magic database entry for GIF looks like:
# GIF
0 string GIF8 GIF image data
>4 string 7a \b, version 8%s,
>4 string 9a \b, version 8%s,
>6 leshort >0 %hd x
>8 leshort >0 %hd
You can puzzle out for yourself how this notation works, but it should
be plain to see that the GIF fingerprint is being represented here.
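If you'd rather not puzzle it out, here is my reading of that entry, written out by hand in Perl. It is only a sketch: describe_gif() is a made-up name, and this is an approximation of what `file` does with the entry, not its actual implementation.

```perl
use strict;
use warnings;

# describe_gif() is a hypothetical helper that follows the GIF entry by hand.
sub describe_gif {
    my ($data) = @_;

    # "0 string GIF8": at offset 0 the bytes must equal "GIF8".
    return unless substr($data, 0, 4) eq 'GIF8';
    my $desc = 'GIF image data';

    # ">4 string 7a" / ">4 string 9a": follow-up tests at offset 4.
    # The "\b" in the message drops the space before the appended text.
    my $ver = substr($data, 4, 2);
    $desc .= ", version 8$ver," if $ver eq '7a' || $ver eq '9a';

    # ">6 leshort >0" and ">8 leshort >0": little-endian 16-bit values at
    # offsets 6 and 8, reported only when greater than zero. (This sketch
    # reads them as unsigned, which is a simplification.)
    return $desc unless length($data) >= 10;
    my ($w, $h) = unpack 'x6 v v', $data;
    $desc .= " $w x" if $w > 0;
    $desc .= " $h"   if $h > 0;
    return $desc;
}

print describe_gif("GIF89a" . pack('v v', 320, 240)), "\n";
# prints: GIF image data, version 89a, 320 x 240
```

Each line of the entry is an offset, a data type, a value to test against, and the text to print when the test passes; the ">" prefix marks tests that only run once the first line has matched.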
So, long preamble aside, you want to do this in Perl, right?
It looks like the module you want is File::Type or File::MMagic:
use File::Type;
my $ft = File::Type->new();
# read in data from file to $data, then
my $type_from_data = $ft->checktype_contents($data);
# alternatively, check file from disk
my $type_from_file = $ft->checktype_filename($file);
# convenient method for checking either a file or data
my $type_1 = $ft->mime_type($file);
my $type_2 = $ft->mime_type($data);
-- or --
use File::MMagic;
use FileHandle;
my $mm = File::MMagic->new; # use internal magic file
# my $mm = File::MMagic->new('/etc/magic'); # use external magic file
my $res = $mm->checktype_filename("/somewhere/unknown/file");
my $fh = FileHandle->new("< /somewhere/unknown/file2");
$res = $mm->checktype_filehandle($fh);
my $data;
$fh->read($data, 0x8564);
$res = $mm->checktype_contents($data);
See <http://search.cpan.org/~pmison/File-Type/lib/File/Type.pm> or
<http://search.cpan.org/~knok/File-MMagic-1.22/MMagic.pm> for details.
The File::Type page includes a brief overview of the different modules
available, with the author's critiques of why the others didn't quite do
the job (you may or may not agree with them, and that's okay).
Take a look over these modules, then try writing some code (or cheat and
just look it up with the `file` command) and let us know how it goes.
--
Chris Devers