A very common use case for regmatches is to extract regex matches into a new 
column in a data.frame (or data.table, etc.), or otherwise to use the extracted 
strings alongside the input. However, the default behavior is to drop elements 
with no match, so the result can be shorter than the input, and assigning it 
back without subsetting fails with a length mismatch.
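
To illustrate, here is a minimal sketch of the mismatch using regexpr. The 
data frame df and its column names are made up for this example.

```r
# Hypothetical example data; the names are illustrative only.
df <- data.frame(id = c("a1", "b", "c22"), stringsAsFactors = FALSE)

m <- regexpr("[0-9]+", df$id)
digits <- regmatches(df$id, m)

# "b" has no match, so digits has 2 elements while df has 3 rows.
length(digits)
nrow(df)

# Assigning digits to a new column of df directly would therefore fail.
# The usual workaround is to subset to the matching rows first.
df_matched <- df[m != -1L, , drop = FALSE]
df_matched$digits <- digits
```

The subsetting step works, but it discards the rows without a match, which 
is often not what is wanted.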

For consistency with other R functions and compatibility with this use case, it 
would be nice if regmatches did not automatically drop non-matches and could 
instead insert an NA_character_ value (similar to stringr::str_extract). This 
alternative behavior could be provided through an optional drop argument or 
a new function, or shown as an example in the documentation (a la resample in 
?sample).
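
As a rough sketch of what the "new function" variant might look like for 
regexpr-style match data: the helper name regmatches_na is hypothetical and 
not part of base R, and this is only one possible way to get the NA-filling 
behavior.

```r
# Hypothetical helper, not part of base R. It behaves like regmatches() for
# match data from regexpr(), but returns one element per input and uses
# NA_character_ where there is no match.
regmatches_na <- function(x, m) {
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1L   # positions that actually matched
  out[hit] <- regmatches(x, m)
  out
}

x <- c("a1", "b", "c22")
res <- regmatches_na(x, regexpr("[0-9]+", x))
length(res) == length(x)   # TRUE, so res can be assigned as a column
```

Because the result always has the same length as the input, it can be 
assigned to a data.frame column without subsetting.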

Alternatively, utils currently contains a non-exported function strextract, 
which is very similar to stringr::str_extract. It would be great if this 
function, once exported, included a drop argument that allows positions with 
no match to be kept rather than dropped.

An example solution (last option):

strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
  result <- regmatches(x, m)

  if (isTRUE(drop)) {
    # Current behavior: inputs with no match vanish from the output.
    unlist(result)
  } else if (isFALSE(drop)) {
    # Fill non-matches with NA so that unlist() does not drop them.
    result[lengths(result) == 0] <- NA_character_
    unlist(result)
  } else {
    stop("Invalid argument for `drop`")
  }
}

Based on Ricardo Saporta's response to "How to prevent regmatches drop non 
matches?"

--CG

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
