Once again, nice catch. I've committed a check for this.

Michael

On Tue, Oct 4, 2016 at 2:37 PM, William Dunlap <[email protected]> wrote:
> It is also not catching the cases where the number of capture expressions
> does not match the number of entries in proto. I think all of the following
> should give an error about the mismatch.
>
>> strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"),
>> proto=list(A="",B="",C=""))
> A B C
> 1 a b cd
> 2 d fg f
> 3 ij i j
> 4 l m ab
> Warning message:
> In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
> data length [15] is not a sub-multiple or multiple of the number of rows
> [4]
>> strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"),
>> proto=list(A="",B=""))
> A B
> 1 a b
> 2 def d
> 3 f ghi
> 4 h i
> 5 j k
> 6 mno m
> 7 o abc
> Warning message:
> In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
> data length [20] is not a sub-multiple or multiple of the number of rows
> [7]
>> strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
> A
> 1 a
> 2 c
> 3 d
> 4 f
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence <[email protected]>
> wrote:
>>
>> Hi Bill,
>>
>> This is a bug in regexec() and I will commit a fix.
>>
>> Thanks for the report,
>> Michael
>>
>> On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <[email protected]> wrote:
>> > I noticed a problem in the strcapture from R-devel (2016-09-27 r71386),
>> > when the text contains a missing value and perl=TRUE.
>> >
>> > {
>> > # NA in text input should map to row of NA's in output, without warning
>> > r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)", c("One 1", NA,
>> > "Fifty 50"), data.frame(Initial=factor(), Number=numeric()))
>> > e9p <- structure(list(Initial = structure(c(2L, NA, 1L), .Label =
>> > c("F", "O"), class = "factor"),
>> > Number = c(1, NA, 50)),
>> > row.names = c(NA, -3L),
>> > class = "data.frame")
>> > all.equal(e9p, r9p)
>> > }
>> > #Error in if (any(ind)) { : missing value where TRUE/FALSE needed
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
>> > <[email protected]> wrote:
>> >>
>> >> The new behavior is that it yields NAs when the pattern does not match
>> >> (like strptime) and for empty captures in a matching pattern it yields
>> >> the empty string, which is consistent with regmatches().
>> >>
>> >> Michael
>> >>
>> >> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <[email protected]>
>> >> wrote:
>> >> > If there are any matches then strcapture can see if the pattern has
>> >> > the same number of capture expressions as the prototype has columns
>> >> > and give an error if not. That seems appropriate.
>> >> >
>> >> > If there are no matches, then there is no easy way to see if the
>> >> > prototype is compatible with the pattern, so should strcapture just
>> >> > assume the best and fill in the prototype with NA's?
>> >> >
>> >> > Should there be warnings? This is kind of like strptime(), which
>> >> > silently gives NA's when the format does not match the text input.
>> >> >
>> >> >
>> >> > Bill Dunlap
>> >> > TIBCO Software
>> >> > wdunlap tibco.com
>> >> >
>> >> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Hi Bill,
>> >> >>
>> >> >> Thanks, another good suggestion. strcapture() now returns NAs for
>> >> >> non-matches.
>> >> >> It's nice to have someone kicking the tires on that function.
>> >> >>
>> >> >> Michael
>> >> >>
>> >> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
>> >> >> <[email protected]> wrote:
>> >> >> > Michael, thanks for looking at my first issue with
>> >> >> > utils::strcapture.
>> >> >> >
>> >> >> > Another issue is how it deals with lines that don't match the
>> >> >> > pattern. Currently it gives an error
>> >> >> >
>> >> >> >> strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
>> >> >> > proto=list(Name="", Number=0))
>> >> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), :
>> >> >> > number of matches does not always match ncol(proto)
>> >> >> >
>> >> >> > First, isn't the 'number of matches' the number of parenthesized
>> >> >> > subpatterns in the regular expression? I thought that if the
>> >> >> > entire pattern matches then the subpatterns without matches would
>> >> >> > be shown as matches at position 0 with length 0. Hence either the
>> >> >> > pattern is compatible with the prototype or it isn't, it does not
>> >> >> > depend on the text input. E.g.,
>> >> >> >
>> >> >> >> regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12",
>> >> >> >> "Z280"))
>> >> >> > [[1]]
>> >> >> > [1] 1 1 1 0
>> >> >> > attr(,"match.length")
>> >> >> > [1] 6 6 6 0
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > [[2]]
>> >> >> > [1] 1 1 0 1
>> >> >> > attr(,"match.length")
>> >> >> > [1] 2 2 0 2
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > [[3]]
>> >> >> > [1] -1
>> >> >> > attr(,"match.length")
>> >> >> > [1] -1
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > Second, an error message like 'some lines were bad' is not very
>> >> >> > helpful.
>> >> >> > Should it put NA's in all the columns of the current output row
>> >> >> > if the input line didn't match the pattern and perhaps warn the
>> >> >> > user that there were problems? The user could then look for rows
>> >> >> > of NA's to see where the problems were.
>> >> >> >
>> >> >> > Bill Dunlap
>> >> >> > TIBCO Software
>> >> >> > wdunlap tibco.com
>> >> >> >
>> >> >> > [[alternative HTML version deleted]]
>> >> >> >
>> >> >> > ______________________________________________
>> >> >> > [email protected] mailing list
>> >> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >> >
>> >> >
>> >
>> >
>
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
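
The mismatch check discussed in this thread can be sketched roughly as follows. This is not the code that was committed to R-devel; the helper name check_capture_count is hypothetical, and the sketch assumes only regexec() and lengths() from base R. It follows the suggestion quoted above: when at least one input matches, compare the number of capture expressions with the number of entries in proto and signal an error if they differ.

```r
# Hypothetical helper (not part of utils): compare the number of capture
# groups that regexec() reports with the number of entries in proto.
check_capture_count <- function(pattern, x, proto, perl = FALSE) {
  m <- regexec(pattern, x, perl = perl)
  # An element counts as matched if its first position is neither
  # -1 (no match) nor NA (missing input).
  matched <- vapply(m, function(mi) isTRUE(mi[1L] != -1L), logical(1))
  if (!any(matched)) {
    # Nothing matched, so the group count cannot be inferred from the data.
    return(invisible(TRUE))
  }
  # Each matched element holds the full match plus one entry per group.
  ngroups <- unique(lengths(m[matched])) - 1L
  if (any(ngroups != length(proto))) {
    stop(sprintf("pattern has %d capture group(s) but proto has %d entries",
                 ngroups[1L], length(proto)))
  }
  invisible(TRUE)
}

# Two groups, two prototype entries: passes silently.
check_capture_count("(.)(.)", c("ab", NA, "x"), list(A = "", B = ""))
# Two groups, three prototype entries: signals an error.
try(check_capture_count("(.)(.)", c("ab", "cde"), list(A = "", B = "", C = "")))
```

When no input matches, the sketch returns without complaint, which mirrors the point made in the thread that compatibility is hard to verify from the data alone in that case.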
