It is also not catching the cases where the number of capture expressions does not match the number of entries in proto. I think all of the following should give an error about the mismatch.
> strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"), proto=list(A="",B="",C=""))
   A  B  C
1  a  b cd
2  d fg  f
3 ij  i  j
4  l  m ab
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [15] is not a sub-multiple or multiple of the number of rows [4]
> strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"), proto=list(A="",B=""))
    A   B
1   a   b
2 def   d
3   f ghi
4   h   i
5   j   k
6 mno   m
7   o abc
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [20] is not a sub-multiple or multiple of the number of rows [7]
> strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
  A
1 a
2 c
3 d
4 f

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

> Hi Bill,
>
> This is a bug in regexec() and I will commit a fix.
>
> Thanks for the report,
> Michael
>
> On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdun...@tibco.com> wrote:
> > I noticed a problem in the strcapture from R-devel (2016-09-27 r71386), when
> > the text contains a missing value and perl=TRUE.
> >
> > {
> >    # NA in text input should map to row of NA's in output, without warning
> >    r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)", c("One 1", NA,
> >                      "Fifty 50"), data.frame(Initial=factor(), Number=numeric()))
> >    e9p <- structure(list(Initial = structure(c(2L, NA, 1L), .Label =
> >                              c("F", "O"), class = "factor"),
> >                          Number = c(1, NA, 50)),
> >                     row.names = c(NA, -3L),
> >                     class = "data.frame")
> >    all.equal(e9p, r9p)
> > }
> > #Error in if (any(ind)) { : missing value where TRUE/FALSE needed
> >
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
> >
> > On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
> > <lawrence.mich...@gene.com> wrote:
> >>
> >> The new behavior is that it yields NAs when the pattern does not match
> >> (like strptime) and for empty captures in a matching pattern it yields
> >> the empty string, which is consistent with regmatches().
> >>
> >> Michael
> >>
> >> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdun...@tibco.com> wrote:
> >> > If there are any matches then strcapture can see if the pattern has the
> >> > same number of capture expressions as the prototype has columns and give
> >> > an error if not. That seems appropriate.
> >> >
> >> > If there are no matches, then there is no easy way to see if the
> >> > prototype is compatible with the pattern, so should strcapture just
> >> > assume the best and fill in the prototype with NA's?
> >> >
> >> > Should there be warnings? This is kind of like strptime(), which
> >> > silently gives NA's when the format does not match the text input.
> >> >
> >> >
> >> > Bill Dunlap
> >> > TIBCO Software
> >> > wdunlap tibco.com
> >> >
> >> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
> >> > <lawrence.mich...@gene.com> wrote:
> >> >>
> >> >> Hi Bill,
> >> >>
> >> >> Thanks, another good suggestion. strcapture() now returns NAs for
> >> >> non-matches. It's nice to have someone kicking the tires on that
> >> >> function.
> >> >>
> >> >> Michael
> >> >>
> >> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
> >> >> <r-devel@r-project.org> wrote:
> >> >> > Michael, thanks for looking at my first issue with utils::strcapture.
> >> >> >
> >> >> > Another issue is how it deals with lines that don't match the
> >> >> > pattern. Currently it gives an error
> >> >> >
> >> >> > > strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), proto=list(Name="", Number=0))
> >> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
> >> >> >   number of matches does not always match ncol(proto)
> >> >> >
> >> >> > First, isn't the 'number of matches' the number of parenthesized
> >> >> > subpatterns in the regular expression? I thought that if the entire
> >> >> > pattern matches then the subpatterns without matches would be
> >> >> > shown as matches at position 0 with length 0. Hence either the
> >> >> > pattern is compatible with the prototype or it isn't; it does not
> >> >> > depend on the text input. E.g.,
> >> >> >
> >> >> > > regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
> >> >> > [[1]]
> >> >> > [1] 1 1 1 0
> >> >> > attr(,"match.length")
> >> >> > [1] 6 6 6 0
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
> >> >> >
> >> >> > [[2]]
> >> >> > [1] 1 1 0 1
> >> >> > attr(,"match.length")
> >> >> > [1] 2 2 0 2
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
> >> >> >
> >> >> > [[3]]
> >> >> > [1] -1
> >> >> > attr(,"match.length")
> >> >> > [1] -1
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
> >> >> >
> >> >> > Second, an error message like 'some lines were bad' is not very
> >> >> > helpful. Should it put NA's in all the columns of the current output
> >> >> > row if the input line didn't match the pattern and perhaps warn the
> >> >> > user that there were problems?
> >> >> > The user could then look for rows of NA's to see where the
> >> >> > problems were.
> >> >> >
> >> >> > Bill Dunlap
> >> >> > TIBCO Software
> >> >> > wdunlap tibco.com
> >> >> >
> >> >> >         [[alternative HTML version deleted]]
> >> >> >
> >> >> > ______________________________________________
> >> >> > R-devel@r-project.org mailing list
> >> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
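
For reference, here is a minimal sketch of the kind of mismatch check asked for at the top of this message. It is not how utils::strcapture is implemented; check_capture_count is a hypothetical helper name, and the approach (counting the substrings that regmatches() returns for each matching input) is only one way it might be done. As noted earlier in the thread, it can only detect a mismatch when at least one input matches the pattern.

```r
# Hypothetical helper, not part of utils: signal an error when the number of
# capture expressions seen in the matches differs from the number of entries
# in 'proto'.
check_capture_count <- function(pattern, x, proto, perl = FALSE) {
    x <- x[!is.na(x)]                      # skip missing inputs
    m <- regmatches(x, regexec(pattern, x, perl = perl))
    n <- lengths(m)                        # 0 for non-matches, groups + 1 for matches
    ngroups <- unique(n[n > 0L] - 1L)      # drop the whole-match element
    if (length(ngroups) && any(ngroups != length(proto)))
        stop(sprintf("pattern has %d capture expression(s) but 'proto' has %d entries",
                     ngroups[1L], length(proto)))
    invisible(TRUE)
}

# Two capture expressions, two prototype entries: passes silently.
check_capture_count("(.)(.)", c("ab", "cde"), list(A = "", B = ""))

# Two capture expressions, three prototype entries: signals an error.
try(check_capture_count("(.)(.)", c("ab", "cde"), list(A = "", B = "", C = "")))
```

If no input matches at all, the sketch returns silently, which corresponds to the "assume the best" option discussed above.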