On Mon, Jan 29, 2007 at 08:19:25AM -0700, Warren Michelsen wrote:
> I'm trying to use a grep find to eliminate the last column in a table. I'm
> not able to get grep to find the last instance of <td.*</td> before a </tr>.
>
> <td.*
> will find the opening tag
> and
> .*</tr> will find just the row closing tag but
>
> <td.*</td>
> will not find the entire column.
>
> I'd thought I could find something like:
> <td.*</td>.*</tr>
> and replace it with </tr>

I've ended up writing a long explanation. If you just want to see my
suggestion for a regex that should work, skip to the bottom. :)

So, the first thing you need to know is that . matches any character except
a newline. It sounds like your tags are on separate lines, so you need to
allow . to match newlines as well. Put (?s) at the start of the pattern for
this. The 's' is for single line; it treats the file as if it were a single
line, so newlines are just regular characters. That gives us this regex:

(?s)<td.*</td>.*</tr>
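
If you want to see the effect of (?s) outside BBEdit, here is a minimal sketch in Python. Using Python's re module is my assumption (BBEdit has its own grep engine), and the sample text is made up, but (?s) means the same thing in both.

```python
import re

# Hypothetical sample: one table cell, then the row's closing tag on the
# next line.
text = "<td>a3</td>\n</tr>"

# Without (?s), "." stops at the newline, so the pattern can't reach </tr>.
without_s = re.search(r"<td.*</td>.*</tr>", text)

# With (?s), "." matches the newline too, so the pattern spans both lines.
with_s = re.search(r"(?s)<td.*</td>.*</tr>", text)

print(without_s)                # None
print(repr(with_s.group(0)))    # '<td>a3</td>\n</tr>'
```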

The next thing you need to know is that a regular expression will
find the longest, left-most match. (?s)<td.*</td>.*</tr> starts at the first
<td in the file, runs to the last </td> in the file, and then on to the last
</tr> in the file. That's not right, so maybe we want non-greedy quantifiers.
Now we've got:

(?s)<td.*?</td>.*?</tr>
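
When I said that a regular expression will find the longest, left-most
match, that was a generalization. "Longest" actually means
"least-backtracking": the engine returns the first match it reaches. The
greedy quantifiers (? + * {n,m}) try the most repetitions first and give
some back only when they have to, so for them the two are the same. The
non-greedy quantifiers (?? +? *? {n,m}?) try the fewest repetitions first
and take more only when they have to, so for them it's the shortest,
left-most match instead.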
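
A small sketch of the greedy/non-greedy difference, again in Python as a stand-in for BBEdit's grep (my assumption), with a made-up one-line sample:

```python
import re

text = "<td>a1</td><td>a2</td>"

# Greedy: .* grabs as much as it can, then backs off only as far as needed.
greedy = re.search(r"<td>.*</td>", text).group(0)

# Non-greedy: .*? grabs as little as it can, then extends only as needed.
lazy = re.search(r"<td>.*?</td>", text).group(0)

print(greedy)  # <td>a1</td><td>a2</td>
print(lazy)    # <td>a1</td>
```

Both matches start at the same place; only where they end differs.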

Unfortunately, in either case it's still the leftmost match.
(?s)<td.*?</td>.*?</tr> finds the first <td in the file, then the first
</td> after that, then the first </tr> after that. So if a row has several
<td>...</td> cells, it will match from the first <td> in the row all the way
through to the </tr>. That's not what we want either.
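
Here is a rough comparison of the two attempts so far on a made-up two-row table. It is a Python sketch (my assumption; BBEdit's engine should behave the same way for these patterns), and the variable names are just for illustration.

```python
import re

# Hypothetical sample: two rows, two cells each, one tag per line.
html = (
    "<tr>\n<td>a1</td>\n<td>a2</td>\n</tr>\n"
    "<tr>\n<td>b1</td>\n<td>b2</td>\n</tr>\n"
)

greedy = re.search(r"(?s)<td.*</td>.*</tr>", html)
lazy = re.search(r"(?s)<td.*?</td>.*?</tr>", html)

first_td = html.index("<td")
first_tr_end = html.index("</tr>") + len("</tr>")
last_tr_end = html.rindex("</tr>") + len("</tr>")

# Both matches start at the first <td in the text (leftmost wins).
print(greedy.start() == first_td, lazy.start() == first_td)  # True True

# Greedy runs on to the last </tr> in the text.
print(greedy.end() == last_tr_end)  # True

# Non-greedy stops at the first </tr>, but it still swallows every cell
# of row one, not just the last cell.
print(lazy.end() == first_tr_end)   # True
print(repr(lazy.group(0)))          # '<td>a1</td>\n<td>a2</td>\n</tr>'
```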

At this point, we need to get tricky. The problem with the two .*? in
(?s)<td.*?</td>.*?</tr> is that they can match <td> and </td>. We need to
only allow them to match strings that don't contain a <td> or </td>. This
can be done with a negative look-ahead assertion:

(?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>

(?!</?td) is a negative look-ahead assertion. It says, match if the next
thing in the string is not </?td (that is, neither <td nor </td) - that's
the negative look-ahead part; but don't consume any of the string - that's
the assertion part. (^ and $ are also assertions - they match without
consuming any of the string.)

(?:(?!</?td).)* means check that the next thing in the string isn't
</?td, then match one character, then check again, and so on.
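
To make the two pieces concrete, here is a short Python sketch (Python's re is my assumption here, and the sample strings are invented). It shows that the look-ahead consumes nothing, and that the repeated group stops just before the next <td or </td.

```python
import re

# (?!</?td) succeeds only where the text does not continue with <td or
# </td, and it consumes nothing: the successful match is the empty string.
m = re.match(r"(?!</?td)", "<tr>")
print(repr(m.group(0)))                  # ''
print(re.match(r"(?!</?td)", "<td>"))    # None
print(re.match(r"(?!</?td)", "</td>"))   # None

# (?:(?!</?td).)* walks forward one character at a time and stops just
# before the next <td or </td. Other tags, like <b>, pass through.
step = re.match(r"(?s)(?:(?!</?td).)*", "a3 <b>x</b>\n</td><td>")
print(repr(step.group(0)))               # 'a3 <b>x</b>\n'
```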

This regex should do what you want:

(?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>
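
If you'd like to try the search and replace before running it on the real file, here is a minimal Python sketch. The table is made up and Python's re module is my assumption as a stand-in for BBEdit's grep; the pattern and the </tr> replacement are the ones from this thread.

```python
import re

# Hypothetical sample table: two rows, three cells each, one tag per line.
html = (
    "<table>\n"
    "<tr>\n<td>a1</td>\n<td>a2</td>\n<td>a3</td>\n</tr>\n"
    "<tr>\n<td>b1</td>\n<td>b2</td>\n<td>b3</td>\n</tr>\n"
    "</table>\n"
)

pattern = re.compile(r"(?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>")

# Each match is the last cell of a row plus everything up to and including
# that row's </tr>; replacing it with </tr> drops the last column.
result = pattern.sub("</tr>", html)
print(result)
```

With this sample, the a3 and b3 cells are removed and the other cells are left alone.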

HTH,
Ronald