Alexandre Enkerli wrote:
> Hello all,
> This one is probably very easy for most of you and it would help me a
> great deal if someone could tell me how to do it. I know there's a
> bunch of tutorials, perldocs and manuals out there, but I'm getting
> confused.
>
> I receive files with the following line format:
> <tr>
> <td> ENKA10577207
> </td>
> <td> p1234567
> </td>
> <td> Enkerli-Smith Tremblay, Alexandra Jean-Sébastien
> </td>
> <td> alexandre.jean-Sé[EMAIL PROTECTED]
> </td>
> </tr>
>
> These are:
> Permanent code
> Access code
> Names, First names
> Email address (usually of the form [EMAIL PROTECTED], but not always)
>
> The "Permanent code" is made up of:
> Last name's first three letters plus first name initial
> day (01-31)
> month+sex (month (01-12) for males, month+50 for females)
> two-digit year (00-99)
> extra digits (I don't know what they mean)
>
> What I'd like to get is a tab-delimited file with the following fields:
> Permanent code
> Access code
> Names
> First names
> Sex
> Age (or, at least, formatted birthdate)
> Email
>
> And then do calculations by age and sex.
> Now, I've been doing this semi-manually, but I'm sure this is trivial
> to do in Perl and it looks like an ideal learning opportunity. What
> I've tried so far (with F[n]s, unpack, regexp...) doesn't really work.
> A complete script (likely a one-liner) would be wonderful.
>
> Thanks in advance for your help.
>
> Alexandre Enkerli
> Ph.D. Candidate
> Department of Folklore and Ethnomusicology
> Indiana University
Given how regularly your HTML is formatted, you can try the following:
use strict;
use warnings;
my @html;
open(HTML,'<','html.file') || die "can't open html.file: $!";
while(<HTML>){
    #-- found a row
    if(/<tr>/){
        #-- read the next 7 lines. bad...bad... see below for reason
        push(@html,scalar <HTML>) foreach(1..7);
        chomp(@html);
        #-- get day/month/year from permanent code
        #-- assume the first 6 digits will be it
        my($day,$month,$year) = $html[0] =~ /(\d{2})(\d{2})(\d{2})/;
        #-- month is 01-12 for males, 51-62 for females
        my $sex = $month > 50 ? 'female' : 'male';
        $month -= 50 if $month > 50;
        $year += 1900; #-- assumes everyone was born in 19xx
        #-- leave only the cell text: drop <td>, &nbsp; and stray spaces
        foreach(@html[0,2,4,6]){
            s/^\s*<td>(?:&nbsp;|\s)*//;
            s/\s+$//;
        }
        #-- name and first name
        my($name,$fname) = $html[4] =~ /^(.+),\s*(.+)$/;
        #-- print them as one tab-delimited line
        printf "%s\t%s\t%s\t%s\t%s\t%02d/%02d/%d\t%s\n",
            $html[0],$html[2],$name,$fname,$sex,$month,$day,$year,$html[6];
        #-- get ready for next round
        @html = ();
    }
}
close(HTML);
__END__
The above is totally untested. Try it and see what happens. It's not very
reliable. Consider what happens if you all of a sudden have:
<tr>
<td>
something
</td>
<!-- more <td> ... </td> -->
</tr>
Notice that the <td> ... </td> pair spans 3 lines instead of 2, so the above
will not work. Any empty lines between <tr> and the first <td> will also
break the above code. Both problems are pretty easy to cope with: simply
change the code to read each <td> ... </td> pair instead of blindly
assuming that the next 7 lines hold everything. For example:
if(/<td>(.*)/){
    #-- keep any text after <td>; assumes </td> is on a later line
    push(@td,$1) if length $1;
    while(<HTML>){
        last if /<\/td>/;
        push(@td,$_);
    }
}
Or something similar. But even if you code that into your script, it still
fails if you have something like:
<tr>
<!-- nested table -->
<td>
<table><tr><td></td></tr></table>
</td>
</tr>
For those, you will want an HTML parser. Go to CPAN and you can find some.
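As a rough illustration of the parser route, here is a minimal sketch using HTML::Parser (a real CPAN module, not part of core Perl). The helper name parse_rows and the idea of counting <td> depth to skip nested cells are assumptions made for this example, not something from the original data or question. It has only been checked against small samples like the one shown.

```perl
use strict;
use warnings;
use HTML::Parser;   # from CPAN, not a core module

# Hypothetical helper: return one array ref of cell texts per outer <tr>.
sub parse_rows {
    my ($html) = @_;
    my (@rows, @cells, $cell);
    my $depth = 0;   # how many <td> elements we are currently inside

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'td';
            $cell = '' if ++$depth == 1;       # a new outer cell begins
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $cell .= $text if $depth == 1;     # skip text of nested cells
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'td' && $depth > 0) {
                if (--$depth == 0) {
                    # trim whitespace and decoded &nbsp; (\xA0)
                    $cell =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
                    push @cells, $cell;
                }
            }
            elsif ($tag eq 'tr' && $depth == 0 && @cells) {
                push @rows, [@cells];          # an outer row is complete
                @cells = ();
            }
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @rows;
}

my $sample = <<'END_SAMPLE';
<tr>
<td>&nbsp;ENKA10577207
</td>
<td>
<table><tr><td>nested</td></tr></table>
p1234567
</td>
</tr>
END_SAMPLE

print join("\t", @$_), "\n" for parse_rows($sample);
```

Because the parser reports tags as events, line breaks inside a cell and nested tables no longer matter. The depth counter is only one simple way to ignore nested cells.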
Hope this gets you started.
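Since the question also asked for age, here is a small sketch of how the permanent code could be decoded into sex, birthdate and age. The helper name decode_code is made up for this example, and it assumes every birth year is 19xx, because the two-digit year in the code cannot say otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post: decode the date part of
# a permanent code. "Today" is passed in so that results are repeatable.
sub decode_code {
    my ($code, $ty, $tm, $td) = @_;
    my ($day, $month, $yy) = $code =~ /(\d{2})(\d{2})(\d{2})/
        or return;                       # no six-digit run: give up
    my $sex = $month > 50 ? 'female' : 'male';
    $month -= 50 if $month > 50;         # females are stored as month+50
    my $year = 1900 + $yy;               # assumption: everyone born in 19xx
    my $age = $ty - $year;
    # birthday not reached yet this year: one year younger
    $age-- if $tm < $month || ($tm == $month && $td < $day);
    return ($sex, sprintf('%04d-%02d-%02d', $year, $month, $day), $age);
}

# Use the current date for real runs.
my ($d, $m, $y) = (localtime)[3, 4, 5];
my ($sex, $born, $age) = decode_code('ENKA10577207', $y + 1900, $m + 1, $d);
print join("\t", $sex, $born, $age), "\n";
```

For the sample code ENKA10577207 this gives female and 1972-07-10, and the age depends on the date passed in. Counts by sex or by age can then be kept in a hash, for example $count{$sex}++ inside the row loop.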
david