Re: Delete Table Column via grep

Ronald J Kimball Mon, 29 Jan 2007 08:26:34 -0800

On Mon, Jan 29, 2007 at 08:19:25AM -0700, Warren Michelsen wrote:
> I'm trying to use a grep find to eliminate the last column in a table. I'm 
> not able to get grep to find the last instance of <td.*</td> before a </tr>. 
> 
> <td.* 
> will find the opening tag 
> and 
> .*</tr>  will find just the row closing tag but
> 
> <td.*</td> 
> will not find the entire column. 
> 
> I'd thought I could find something like:
> <td.*</td>.*</tr>
> and replace it with </tr>


I've ended up writing a long explanation.  If you just want to see my
suggestion for a regex that should work, skip to the bottom.  :)


So, the first thing you need to know is that . matches any character except
a newline.  It sounds like your tags are on separate lines, so you need to
allow . to match newlines as well.  Use (?s) for this.  The 's' is
for single line; it treats the file like it's a single line, so newlines
are just regular characters.  That gives us this regex:
  (?s)<td.*</td>.*</tr>
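
If you'd like to see the effect of (?s) outside of BBEdit, here is a small sketch in Python.  It assumes Python's re module treats . and (?s) the same way BBEdit's grep does, which I believe holds for these constructs; the sample cell is made up for illustration.

```python
import re

# Hypothetical sample: one table cell whose tags sit on separate lines.
cell = "<td>\nlast\n</td>"

# Without (?s), . stops at the first newline, so </td> is never reached.
without_s = re.search(r"<td.*</td>", cell)

# With (?s), . also matches newlines, so the whole cell is found.
with_s = re.search(r"(?s)<td.*</td>", cell)

print(without_s)                 # None
print(with_s.group(0) == cell)   # True
```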

The next thing you need to know is that a regular expression will
find the longest, left-most match.  (?s)<td.*</td>.*</tr> finds the first
<td in the file, then the last </td> in the file, then the last </tr> in
the file.  That's not right, so maybe we want non-greedy quantifiers.  Now
we've got:
  (?s)<td.*?</td>.*?</tr>
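
To make the greedy behavior concrete, here is a rough Python sketch run against a made-up two-row table.  Python's re module stands in for BBEdit's grep here, on the assumption that greedy quantifiers behave the same in both.

```python
import re

# Hypothetical two-row table, one tag per line.
table = """<table>
<tr>
<td>a1</td>
<td>a2</td>
</tr>
<tr>
<td>b1</td>
<td>b2</td>
</tr>
</table>"""

# Greedy version: both .* grab as much as they can.
greedy = re.search(r"(?s)<td.*</td>.*</tr>", table)
matched = greedy.group(0)

# The match starts at the first <td and runs through the last </tr>,
# swallowing both rows.
print(matched.startswith("<td>a1</td>"))        # True
print(matched.endswith("<td>b2</td>\n</tr>"))   # True
print(matched.count("</tr>"))                   # 2
```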

When I said that a regular expression will find the longest, left-most
match, that was a generalization.  "Longest" actually means
"least-backtracking".  For the greedy quantifiers (? + * {n,m}) those are
the same.  For the non-greedy quantifiers (?? +? *? {n,m}?) it's the
shortest, left-most match instead.

Unfortunately, in either case it's still the left-most match.
(?s)<td.*?</td>.*?</tr> finds the first <td in the file, then the first
</td> after that, then the first </tr> after that.  So if a row has
multiple <td>...</td> cells, it will match from the row's first <td all the
way through to the </tr>.  That's not what we want either.
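
The same thing can be checked with a short Python sketch.  The three-cell row is a hypothetical sample, and Python's re module is assumed to handle non-greedy quantifiers the way BBEdit's grep does.

```python
import re

# Hypothetical row with three cells.
row = "<tr>\n<td>a1</td>\n<td>a2</td>\n<td>a3</td>\n</tr>"

lazy = re.search(r"(?s)<td.*?</td>.*?</tr>", row)

# The match still begins at the first <td; the second .*? simply
# stretches across the remaining cells until it reaches </tr>.
print(lazy.group(0) == "<td>a1</td>\n<td>a2</td>\n<td>a3</td>\n</tr>")  # True
```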

At this point, we need to get tricky.  The problem with the two .*? in
(?s)<td.*?</td>.*?</tr> is that each of them can match right across a <td
or </td tag.  We need to only allow them to match strings that don't
contain <td or </td.  This can be done with a negative look-ahead assertion:
  (?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>

(?!</?td) is a negative look-ahead assertion.  It says, match if the next
thing in the string is not </?td - that's the negative look-ahead part; but
don't consume any of the string - that's the assertion part.  (^ and $ are
also assertions - they match without consuming any of the string.)

(?:(?!</?td).)* means check that the next thing in the string isn't
</?td, then match one character, then check again, and so on.
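
Here is a small Python sketch of that check-then-consume loop on its own, using made-up input strings.  It assumes Python's re module handles negative look-ahead the same way BBEdit's grep does.

```python
import re

# (?:(?!</?td).)* consumes characters one at a time, but stops as soon
# as the upcoming text starts with <td or </td.
stepper = re.compile(r"(?s)(?:(?!</?td).)*")

# Stops just before the closing </td>.
inside_cell = stepper.match("abc</td>rest").group(0)

# Other tags such as </tr> and <tr> are fine; only <td or </td stops it.
between_rows = stepper.match("\n</tr>\n<tr>\n<td>x").group(0)

# Starting right on a <td, it matches nothing at all.
at_tag = stepper.match("<td>x</td>").group(0)

print(repr(inside_cell))    # 'abc'
print(repr(between_rows))   # '\n</tr>\n<tr>\n'
print(repr(at_tag))         # ''
```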


This regex should do what you want:

(?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>
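
For anyone who wants to try the whole find-and-replace outside of BBEdit, here is a minimal Python sketch.  The table and the LAST_CELL name are hypothetical, and it assumes Python's re module treats this pattern the same way BBEdit's grep does.

```python
import re

# The suggested pattern: a <td ... </td> cell followed, with no other
# <td or </td in between, by the row's closing </tr>.
LAST_CELL = re.compile(r"(?s)<td(?:(?!</?td).)*</td>(?:(?!</?td).)*</tr>")

# Hypothetical table with three columns, one tag per line.
table = """<table>
<tr>
<td>a1</td>
<td>a2</td>
<td>a3</td>
</tr>
<tr>
<td>b1</td>
<td>b2</td>
<td>b3</td>
</tr>
</table>"""

# Replace each last cell plus its </tr> with just </tr>.
result = LAST_CELL.sub("</tr>", table)
print(result)
```

With this sample, only the a3 and b3 cells are removed; the first two columns are left alone.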


HTH,
Ronald

