http://www.amk.ca/python/howto/regex/
Of course, Jamie Zawinski famously said, "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
You can do a lot of cleanup with a few simple string substitutions:
test = ''' <app=
let code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet code <ap=
plet '''

test2 = test.replace('=\n', '')
test2 = test2.replace('=3D"', '="')
print(test2)
prints =>
<applet code="fphover.class" height="24" width="138"><param name="color"<applet code <applet
This is probably a good first step even if you want to use regular expressions to parse out the rest of the data from the applet tag.
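The stray '=' at the end of a line and the '=3D' sequences look like quoted-printable encoding, where '=3D' stands for '=' and a trailing '=' marks a soft line break. If that assumption holds for your file, the standard library's quopri module can undo both in one step. Here is a minimal sketch; the sample string is made up for illustration, and an unencoded '=' followed by two hex digits elsewhere in the file would be decoded too, so check the result:

```python
import quopri

# Hypothetical sample, assuming the damage is quoted-printable encoding:
# '=\n' is a soft line break and '=3D' is an encoded '='.
raw = b'<app=\nlet code=3D"fphover.class" height=3D"24" width=3D"138"><ap=\nplet'

# decodestring works on bytes and returns bytes
decoded = quopri.decodestring(raw).decode('ascii')
print(decoded)
```

This does the same job as the two replace() calls above, but it also handles any other encoded characters that might be in the file.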
OK, here is a brute-force regex that will find the text 'applet' with '=\n' perhaps between any pair of characters:
appRe = r'(=\n)?'.join(list('applet'))
print(appRe)
=> a(=\n)?p(=\n)?p(=\n)?l(=\n)?e(=\n)?t
The (=\n)? between each pair of letters means, optionally match =\n here.
You can use re.finditer to show all the matches:
import re
for match in re.finditer(appRe, test):
    print()
    print(match.group(0))
=>
app=
let

applet

ap=
plet
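The same idea can be wrapped in a small helper so it works for any search text, and re.sub can then rewrite every broken variant to the clean form. This is just a sketch; soft_break_pattern is a name I made up, and the sample text is invented:

```python
import re

def soft_break_pattern(word):
    """Build a regex matching word with an optional '=\\n' between any two characters."""
    # re.escape guards against characters that are special in a regex;
    # (?:...) groups without capturing.
    return r'(?:=\n)?'.join(re.escape(ch) for ch in word)

text = '<app=\nlet code <applet code <ap=\nplet'
pattern = re.compile(soft_break_pattern('<applet'))

# Every variant that was found, broken or not
matches = [m.group(0) for m in pattern.finditer(text)]

# Replace each variant with the clean tag start
cleaned = pattern.sub('<applet', text)
print(cleaned)
```

Note that this only repairs the text you search for; any '=\n' elsewhere in the file is left alone.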
A couple other options:
elementtidy reads HTML, cleans it up and creates a tree model of the source. You can easily modify the tree model and write it out again. This has the bonus of giving you well-formed XHTML at the end of the process. It is based on HTML Tidy and Fredrik Lundh's elementtree package which is very easy to use.
http://www.effbot.org/zone/element-tidylib.htm
Beautiful Soup is an HTML parser that is designed to read bad HTML and give access to the tags. I'm not sure if it gives you any help for rewriting, though.
http://www.crummy.com/software/BeautifulSoup/
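I haven't shown either of those packages here. As a stand-in, this sketch uses the standard library's html.parser to pull the parameters out of each applet and build a plain link. It assumes the file has already been cleaned up as above, and it assumes each applet carries param tags named "text" and "url"; those names and the sample HTML are my guesses, so check them against your real data:

```python
from html.parser import HTMLParser

class AppletParams(HTMLParser):
    """Collect the <param> name/value pairs of each <applet> element."""

    def __init__(self):
        super().__init__()
        self.applets = []      # one dict of params per finished applet
        self._current = None   # params of the applet currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'applet':
            self._current = {}
        elif tag == 'param' and self._current is not None:
            self._current[attrs.get('name')] = attrs.get('value')

    def handle_endtag(self, tag):
        if tag == 'applet' and self._current is not None:
            self.applets.append(self._current)
            self._current = None

# Hypothetical cleaned-up input; the param names are assumptions
html = ('<applet code="fphover.class" height="24" width="138">'
        '<param name="text" value="Home">'
        '<param name="url" value="index.html">'
        '</applet>')

parser = AppletParams()
parser.feed(html)
parser.close()

links = ['<a href="%s">%s</a>' % (p['url'], p['text']) for p in parser.applets]
print(links)
```

This only collects the data for the links. Writing them back into the page in place of the applets is a separate step.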
HTH Kent
Liam Clarke wrote:
Hi all,
I have a large amount of HTML that a previous person has liberally sprinkled a huge amount of applets through, instead of html links, which kills my browser to open.
So, want to go through and replace all applets with nice simple links, and want to use Python to find the applet, extract a name and an URL, and create the link.
My problem is, somewhere in my copying and pasting into the text file that the HTML currently resides in, it got all messed up it would seem, and there's a bunch of strange '=' all through it. (Someone said that the code had been generated in Frontpage. Is that a good thing or bad thing?)
So, I want to search for <applet code=, but it may be in the file as
<app=
let code
or <applet code
or <ap=
plet
etc. etc. (Full example of yuck here http://www.rafb.net/paste/results/WcKPCy64.html)
So, I want to write a search that will match <applet code and <app=\nlet code (etc. etc.) without having to strip the file of '=' and '\n'.
I was thinking the re module is for this sort of stuff? Truth is, I wouldn't know where to begin with it, it seems somewhat powerful.
Or, there's a much easier way, which I'm missing totally. If there is, I'd be very grateful for pointers.
Thanks for any help you can offer.
Liam Clarke
Tutor maillist - http://mail.python.org/mailman/listinfo/tutor