On 11/14/2016 12:44 PM, Bert Gunter wrote:
(Sheepishly)... Yes, thank you Hervé. It would have been nice if I had given correct solutions. fixed = TRUE could not, of course, have worked with the "[^a]" character class!

Here's what I found with a 10-element vector, each member of which is a string of length 1e5:

    system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
       user  system elapsed
      0.013   0.000   0.013

    system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
       user  system elapsed
      0.251   0.000   0.252   ## WAYYYY slower

    system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
       user  system elapsed
      0.007   0.000   0.007   ## twice as fast

Clearly and unsurprisingly, the message is to avoid fixed = FALSE; after that, it seems mostly to be: who cares?!
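For anyone who wants to rerun the comparison above, here is a minimal sketch. The post does not show how x was built, so the construction below (10 strings of 1e5 random lowercase letters) is an assumption, and the names n_split, n_regex and n_fixed are only illustrative. It checks that the three idioms return the same counts; timings will differ by machine and R version.

```r
## Assumption: x is 10 strings of 1e5 random lowercase letters each;
## the original post does not show how x was constructed.
set.seed(1)
x <- replicate(10, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))

## 1. Split on the literal "a" after padding both ends with "X"
n_split <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L

## 2. Delete everything that is not "a" (regex), then count what is left
n_regex <- nchar(gsub("[^a]", "", x))

## 3. Delete the literal "a" and take the difference in length
n_fixed <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))

## All three approaches should agree before their timings are compared
stopifnot(identical(n_split, n_regex), identical(n_regex, n_fixed))

## Wrap any of the three expressions in system.time() to time it, e.g.
system.time(nchar(x) - nchar(gsub("a", "", x, fixed = TRUE)))
```

Note the 1L rather than 1 in the first idiom: it keeps the result an integer so that identical() can compare it with the nchar()-based counts.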
Another message is to pay attention to the "cost" of generating big intermediate objects like the list returned by strsplit(). On a big character vector made of 5000 strings of about 1e5 random letters each, the strsplit-based solution uses more than 2Gb of RAM on my Ubuntu system. The gsub( , fixed=TRUE) solution uses less than 1Gb.

Cheers,
H.
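A rough way to see where the extra memory goes is to compare the intermediate objects directly. This is only a sketch under assumptions: the test vector is a small stand-in (20 strings instead of 5000), the variable names are made up, and object.size() reports an estimate of the final objects only, not the peak RAM of the process, so the numbers are not comparable to the 2Gb and 1Gb figures above.

```r
## Assumption: a small stand-in for the 5000-string vector described above
set.seed(1)
x <- replicate(20, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))

## Intermediate object built by the strsplit-based count:
## a list holding one character vector of pieces per input string
pieces <- strsplit(paste0("X", x, "X"), "a", fixed = TRUE)

## Intermediate object built by the gsub-based count:
## one (slightly shorter) string per input string
stripped <- gsub("a", "", x, fixed = TRUE)

size_split <- object.size(pieces)
size_gsub  <- object.size(stripped)

print(size_split, units = "Kb")
print(size_gsub, units = "Kb")
```

The list of pieces carries a separate string object for almost every fragment, which is why it tends to be the larger of the two.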
Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE) or strsplit( , fixed=TRUE):

    set.seed(1)
    Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")

    system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
    #    user  system elapsed
    #   0.585   0.000   0.586

    system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
    #    user  system elapsed
    #   0.061   0.000   0.061

    system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
    #    user  system elapsed
    #   0.039   0.000   0.039

    identical(res1, res2)
    # [1] TRUE
    identical(res1, res3)
    # [1] TRUE

The gsub( , fixed=TRUE) solution also uses slightly less memory than the strsplit( , fixed=TRUE) solution.

Cheers,
H.

On 11/14/2016 11:55 AM, Charles C. Berry wrote:
On Mon, 14 Nov 2016, Marc Schwartz wrote:
On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
On Mon, 14 Nov 2016, Bert Gunter wrote:

[stuff deleted]

Hi,

Both gsub() and strsplit() are using regex based pattern matching internally. That being said, they are ultimately calling .Internal code, so both are pretty fast.
For comparison:

    ## Create a 1,000,000 character vector
    set.seed(1)
    Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

    nchar(Vec)
    [1] 1000000

    ## Split the vector into single characters and tabulate
    table(strsplit(Vec, split = "")[[1]])

        a     b     c     d     e     f     g     h     i     j     k     l
    38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
        m     n     o     p     q     r     s     t     u     v     w     x
    38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
        y     z
    38265 38299

    ## Get just the count of "a"
    table(strsplit(Vec, split = "")[[1]])["a"]
        a
    38664

    nchar(gsub("[^a]", "", Vec))
    [1] 38664

    ## Check performance
    system.time(table(strsplit(Vec, split = "")[[1]])["a"])
       user  system elapsed
      0.100   0.007   0.107

    system.time(nchar(gsub("[^a]", "", Vec)))
       user  system elapsed
      0.270   0.001   0.272

So, the above would suggest that using strsplit() is somewhat faster than using gsub(). However, as Chuck notes, in the absence of more exhaustive benchmarking, the difference may or may not be more generalizable.

Whether splitting on fixed strings rather than treating them as regex'es (i.e. `fixed=TRUE') makes a big difference seems to depend on what you split:

First repeating what Marc did...

    system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
       user  system elapsed
      0.132   0.010   0.139

    system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
       user  system elapsed
      0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...

    system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
       user  system elapsed
      0.017   0.000   0.018

    system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
       user  system elapsed
      0.104   0.000   0.104

... is 5 times faster with fixed=TRUE for this case.
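A side note on the paste0("X", Vec, "X") padding in the idiom above, since the thread does not spell out why it is there: strsplit() drops the empty piece after a match at the very end of a string, so without padding a trailing "a" goes uncounted. The sketch below uses two hypothetical helper names (count_unpadded, count_padded) that are not from the thread, and it assumes the padding character differs from the character being counted.

```r
## Hypothetical helpers, for illustration only (not from the thread)
count_unpadded <- function(s, ch) {
  lengths(strsplit(s, ch, fixed = TRUE)) - 1L
}
count_padded <- function(s, ch) {
  ## Assumption: the pad "X" is not the character being counted
  lengths(strsplit(paste0("X", s, "X"), ch, fixed = TRUE)) - 1L
}

s <- c("banana", "abc")

count_unpadded(s, "a")   # 2 1 : the trailing "a" in "banana" is missed
count_padded(s, "a")     # 3 1 : padding turns it into an interior match
```

A match at the start of a string does produce a leading empty piece, so only the trailing end strictly needs the pad; padding both ends just keeps the idiom symmetric.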
This result matches Marc's count:

    sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
    [1] 38664

Chuck

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319