The editing dept does a Save As... *.html on all the MS Word files we
publish. In the process, every line in the resulting HTML file ends up
terminated by a paragraph mark. I am trying to write a script that deletes
HTML tags that span newlines (which I got to work) and also tags that span
these paragraph marks. What I have so far is below. The two substitutions
marked with *** are examples of tags that span multiple lines and, in doing
so, span the paragraph marks. I know the script does not actually *do*
anything useful yet; I am still in the testing phase, which is why all the
COLOR constants are there to highlight what each pattern matched.
____________________
#! /usr/bin/perl
use warnings;
use strict;
use Term::ANSIColor qw(:constants);
$Term::ANSIColor::AUTORESET = 1;
# $/ = ""; ### I tried it with this uncommented: the whole file becomes one
# big "paragraph", and nothing matches.
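while (<>) {
my $i = $.; # $i was never declared (fatal under strict); assuming it is meant to be the input line number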
#remove weird paragraph marks
s/<\/?o:p>//msgi && print "$i: $`", ON_MAGENTA "|$&|", RESET "$'\n";
#remove unnecessary closing tags
s/<\/b>//msgi && print "$i: $`", YELLOW "|$&|", RESET "$'\n";
s/<\/span>//msgi && print "$i: $`", ON_GREEN "|$&|", RESET "$'\n";
#remove mso-spaceruns (***this is one tag that spans multiple lines)
s/<span\s*(\S+\s*\S+)\">/ /msgi && print "$i: $`", ON_RED "|$&|", RESET "$'\n";
#remove mso image data (***this is one tag that spans multiple lines)
#.*? is non-greedy so that two blocks on one line are not merged into one match
s/<!--\[if gte vml 1\]>.*?<!\[endif\]-->//msgi && print "$i: $`", GREEN "|$&|", RESET "$'\n";
s/(v:shapes\S+\s)//msgi && print "$i: $`", ON_BLUE "|$&|", RESET "$'\n";
}
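
For reference, here is a minimal sketch of the slurp-mode direction, not a tested fix for our files. It rests on two assumptions that may not hold: that the "paragraph marks" are carriage returns (\r) left over from Windows line endings, and that it is acceptable to replace every opening <span ...> tag with a space (a simplification of the span pattern above). The helper name strip_word_markup is made up for illustration. Note that $/ = "" selects paragraph mode (records split on blank lines), whereas an undefined $/ reads each file in one go.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): takes the whole
# file as one string and returns it with the Word markup removed.
sub strip_word_markup {
    my ($html) = @_;

    # ASSUMPTION: the "paragraph marks" are carriage returns (\r) from
    # Windows line endings; normalise them to plain newlines first.
    $html =~ s/\r\n?/\n/g;

    # /s lets "." match newlines, so one match can cover several lines;
    # ".*?" is non-greedy so each conditional block is removed separately.
    $html =~ s/<!--\[if gte vml 1\]>.*?<!\[endif\]-->//gis;

    # [^>]* matches newlines too, so an opening <span ...> tag that wraps
    # across lines is matched as a whole (simplified pattern, see above).
    $html =~ s/<span\b[^>]*>/ /gi;

    # <o:p> tags and the closing tags the original script removes.
    $html =~ s/<\/?o:p>//gi;
    $html =~ s/<\/(?:b|span)>//gi;

    return $html;
}

# Only runs when file names are given on the command line.
if (@ARGV) {
    local $/;    # undefined $/ = slurp mode: each read returns a whole file
    while (my $html = <>) {
        print strip_word_markup($html);
    }
}
```

Saved as, say, strip.pl (name is arbitrary), it would be run as: perl strip.pl input.html > output.html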