Thanks for pointing out my mistake.  I oversimplified the real problem.

I'll try to post a version that comes closer. Suppose I have a string like this:

x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

If I cat() it, I see that it is really markdown source:

  ```html
  blah blah
  ```

  ```r
  blah blah
  ```

I want to find the part that includes the html block, but not the r block. So I want to match "```html", followed by a minimal number of characters, then "```". This pattern works:

  pattern <- "\n```html\n.*?\n```\n"

and we get the right answer:

  cat(regmatches(x, regexpr(pattern, x)))

  ```html
  blah blah
  ```
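
To make the effect of the ? visible, here is a small sketch that compares the lazy pattern with a greedy variant on the same string. The name pattern_greedy is introduced only for illustration, and the sketch assumes R's default regex engine, where "." also matches a newline.

```r
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

# Lazy: .*? stops at the first closing fence it can reach.
pattern <- "\n```html\n.*?\n```\n"
lazy <- regmatches(x, regexpr(pattern, x))

# Greedy variant (hypothetical name): .* runs on to the last closing fence.
pattern_greedy <- "\n```html\n.*\n```\n"
greedy <- regmatches(x, regexpr(pattern_greedy, x))

cat(lazy)    # only the html block
cat(greedy)  # both blocks, i.e. all of x
```

The only difference between the two patterns is the ? after the .* quantifier.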

Okay, but this flavour of markdown says there can be more backticks, not just 3. So the block might look like

  ````html
  blah blah
  ````

I need to have the same number of backticks in the opening and closing marker. So I make the pattern more complicated, and it doesn't work:

  pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

This matches all of x:

  > pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
  > cat(regmatches(x, regexpr(pattern2, x)))

  ```html
  blah blah
  ```

  ```r
  blah blah
  ```
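
One thing that might be worth trying is the PCRE engine via perl = TRUE. This is only a sketch, not a diagnosis of the behaviour above: pattern3 and x4 are names made up for illustration, and the (?s) prefix is there on the assumption that with perl = TRUE the dot does not match a newline unless asked to.

```r
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

# (?s) lets "." match newlines under PCRE; \\1 must repeat the opening fence.
pattern3 <- "(?s)\n([`]{3,})html\n.*?\n\\1\n"

res <- regmatches(x, regexpr(pattern3, x, perl = TRUE))
cat(res)   # the html block only

# Hypothetical input with a four-backtick html fence.
x4 <- "\n````html\nblah blah \n````\n\n```r\nblah blah\n```\n"
res4 <- regmatches(x4, regexpr(pattern3, x4, perl = TRUE))
cat(res4)  # the four-backtick html block only
```

Under PCRE the back-reference and the lazy quantifier combine as intended here, so the match ends at the first closing marker with the same number of backticks as the opening one.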


Is that a bug, or am I making a silly mistake again?

Duncan Murdoch



On 25/01/2023 7:34 p.m., Andrew Simmons wrote:
grep(value = TRUE) just returns the strings which match the pattern. You have to use regexpr() or gregexpr() if you want to know where the matches are:

```
x <- "abaca"

# extract only the first match with regexpr()
m <- regexpr("a.*?a", x)
regmatches(x, m)

# or

# extract every match with gregexpr()
m <- gregexpr("a.*?a", x)
regmatches(x, m)
```

You could also use sub() to remove the rest of the string, keeping only the match within the parentheses: `sub("^.*(a.*?a).*$", "\\1", x)`
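
To see where a match is, rather than just its text, the value returned by regexpr() can be inspected directly. A minimal sketch, assuming the default regex engine; the names m_lazy and m_greedy are only for illustration:

```r
x <- "abaca"

m_lazy   <- regexpr("a.*?a", x)  # minimal repetition
m_greedy <- regexpr("a.*a", x)   # greedy repetition

# Start position and length of each match
as.integer(m_lazy);   attr(m_lazy, "match.length")
as.integer(m_greedy); attr(m_greedy, "match.length")

# The matched substrings themselves
regmatches(x, m_lazy)
regmatches(x, m_greedy)
```

Both matches start at the first "a"; only the match length differs between the lazy and the greedy quantifier.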


On Wed, Jan 25, 2023, 19:19 Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

    The docs for ?regexp say this:  "By default repetition is greedy, so
    the maximal possible number of repeats is used. This can be changed to
    ‘minimal’ by appending ? to the quantifier. (There are further
    quantifiers that allow approximate matching: see the TRE
    documentation.)"

    I want the minimal match, but I don't seem to be getting it.  For
    example,

    x <- "abaca"
    grep("a.*?a", x, value = TRUE)
    #> [1] "abaca"

    Shouldn't I have gotten "aba", which is the first match to "a.*a"?  If
    not, what would be the regexp that would give me the first match to
    "a.*a", without greedy expansion of the .*?

    Duncan Murdoch

    ______________________________________________
    R-help@r-project.org <mailto:R-help@r-project.org> mailing list --
    To UNSUBSCRIBE and more, see
    https://stat.ethz.ch/mailman/listinfo/r-help
    <https://stat.ethz.ch/mailman/listinfo/r-help>
    PLEASE do read the posting guide
    http://www.R-project.org/posting-guide.html
    <http://www.R-project.org/posting-guide.html>
    and provide commented, minimal, self-contained, reproducible code.

