Re: [Tutor] String matching?

Kent Johnson Tue, 07 Dec 2004 05:00:19 -0800

Regular expressions are a bit tricky to understand but well worth the trouble - they are a powerful tool. The Regex HOW-TO is one place to start: http://www.amk.ca/python/howto/regex/

Of course, Jamie Zawinski famously said, "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

You can do a lot of cleanup with a few simple string substitutions:

test = ''' <app=
let
 code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet
        code
<ap=
plet '''

test2 = test.replace('=\n', '')
test2 = test2.replace('=3D"', '="')
print test2

prints =>

 <applet
 code="fphover.class" height="24" width="138"><param name="color"<applet
        code
<applet

This is probably a good first step even if you want to use regular expressions to parse out the rest of the data from the applet tag.
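The '=' at the end of a line and the '=3D' in place of '=' look like quoted-printable encoding, which is an assumption on my part about how the file got damaged. If that is what happened, the standard library's quopri module can undo both in one step. A minimal sketch in Python 3 syntax, with a sample string modeled on the one above:

```python
import quopri

# Sample in the same damaged form as the example above. Assumption: the
# damage is quoted-printable encoding rather than random corruption.
damaged = '<app=\nlet\n code=3D"fphover.class" height=3D"24" width=3D"138">'

# quopri works on bytes, so encode first and decode the result afterwards.
cleaned = quopri.decodestring(damaged.encode('ascii')).decode('ascii')
print(cleaned)
```

Unlike the two replace() calls, this would also decode any other =XX escapes in the file, so it is only appropriate if the whole file really is quoted-printable.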

OK, here is a brute-force regex that will find the text 'applet' with an optional '=\n' between any pair of adjacent characters:

appRe = r'(=\n)?'.join(list('applet'))
print appRe

=> a(=\n)?p(=\n)?p(=\n)?l(=\n)?e(=\n)?t

The (=\n)? between each pair of letters means, optionally match =\n here.
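The same idea can be wrapped in a small helper so it works for other words too. This is only a sketch in Python 3 syntax: the function name broken_word_pattern is made up for illustration, and it uses a non-capturing group (?:...) since the pieces between the letters never need to be retrieved.

```python
import re

def broken_word_pattern(word, filler=r'(?:=\n)?'):
    """Return a regex matching `word` with an optional '=\\n' between letters."""
    # re.escape guards against characters that are regex metacharacters.
    return filler.join(re.escape(ch) for ch in word)

pattern = broken_word_pattern('applet')
print(pattern)

# The pattern accepts the intact word and the broken one, but not a
# stray '=' that is not followed by a newline.
print(bool(re.fullmatch(pattern, 'applet')))
print(bool(re.fullmatch(pattern, 'app=\nlet')))
print(bool(re.fullmatch(pattern, 'app=let')))
```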

You can use re.finditer to show all the matches:

import re

for match in re.finditer(appRe, test):
    print
    print match.group(0)

=>
app=
let

applet

ap=
plet
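Once the pattern finds every variant, the same regex can also rewrite them. The sketch below (Python 3 syntax, with a shortened sample string that is my own) uses subn to replace each match with the plain word and to report how many replacements were made:

```python
import re

# Shortened sample with two broken spellings and one intact one.
sample = '''<app=
let
 code <applet
        code
<ap=
plet '''

app_re = re.compile(r'(=\n)?'.join('applet'))

# Replace every match, broken or not, with the plain word.
fixed, count = app_re.subn('applet', sample)
print(count)
print(fixed)
```

This only repairs the word 'applet' itself; any '=\n' elsewhere in the file is left alone, which is why the plain string substitutions are still the simpler first step.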

A couple of other options: elementtidy reads HTML, cleans it up and creates a tree model of the source. You can easily modify the tree model and write it out again. This has the bonus of giving you well-formed XHTML at the end of the process. It is based on HTML Tidy and Fredrik Lundh's elementtree package, which is very easy to use. http://www.effbot.org/zone/element-tidylib.htm

Beautiful Soup is an HTML parser that is designed to read bad HTML and give access to the tags. I'm not sure if it gives you any help for rewriting, though. http://www.crummy.com/software/BeautifulSoup/
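To give a feel for the parser approach without installing anything, here is a rough sketch that uses the standard library's html.parser as a stand-in; it is not Beautiful Soup's API. It is Python 3 syntax, the class name AppletCollector is made up, and the param names 'text' and 'url' are pure assumptions about what the applets contain, so check them against the real file.

```python
from html.parser import HTMLParser

class AppletCollector(HTMLParser):
    """Collect the <param> name/value pairs found inside each <applet>."""

    def __init__(self):
        super().__init__()
        self.applets = []      # one dict of params per applet
        self._current = None   # params of the applet being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'applet':
            self._current = {}
            self.applets.append(self._current)
        elif tag == 'param' and self._current is not None:
            attrs = dict(attrs)
            self._current[attrs.get('name')] = attrs.get('value')

    def handle_endtag(self, tag):
        if tag == 'applet':
            self._current = None

# Hypothetical, already cleaned-up input; the param names are assumptions.
html = ('<applet code="fphover.class" height="24" width="138">'
        '<param name="text" value="Home">'
        '<param name="url" value="index.html">'
        '</applet>')

collector = AppletCollector()
collector.feed(html)
collector.close()

# Build one plain link per applet from the collected params.
links = ['<a href="%s">%s</a>' % (p.get('url'), p.get('text'))
         for p in collector.applets]
print(links)
```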

HTH
Kent

Liam Clarke wrote:

Hi all,

I have a large amount of HTML through which a previous person has
liberally sprinkled a huge number of applets instead of HTML links,
which kills my browser when I open it.

So, I want to go through and replace all applets with nice simple links,
and want to use Python to find the applet, extract a name and a URL,
and create the link.

My problem is that somewhere in my copying and pasting into the text file
that the HTML currently resides in, it seems to have got all messed up,
and there's a bunch of strange '=' all through it. (Someone said
that the code had been generated in FrontPage. Is that a good thing or
a bad thing?)

So, I want to search for <applet code=, but it may be in the file as

<app=
let
 code

or <applet
        code

or <ap= plet

etc. etc. (Full example of yuck here
http://www.rafb.net/paste/results/WcKPCy64.html)

So, I want to write a search that will match <applet code and
<app=\nlet code (etc. etc.) without having to strip the file of '='
and '\n'.

I was thinking the re module is for this sort of stuff? Truth is, I
wouldn't know where to begin with it, it seems somewhat powerful.

Or, there's a much easier way, which I'm missing totally. If there is,
I'd be very grateful for pointers.

Thanks for any help you can offer.

Liam Clarke

