On Thu, Jul 1, 2010 at 11:08 PM, Ralf B <ralf.bie...@gmail.com> wrote:
> Are there packages that allow improved String and URL processing?
> E.g. extract parts of a URLs such as sub-domains, top-level domain,
> protocols (e.g. https, http, ftp), file type based on endings, check
> if a URL is valid or not, etc...

You are asking to match and extract by content rather than by delimiter, and you can do that with the strapply function in the gsubfn package. Here is an example. You can likely improve on the regular expression, but this gives the idea.

In the first call to strapply we just return the back references (the portions in parentheses), and in the second we display the various parts labelled with their names. To remember the arguments, note that just as apply is object/modifier/function, so is strapply; however, the modifier for strapply is a pattern rather than an array margin. In the first call the function is just c, and in the second we use a formula notation, which strapply converts to a function. In this case it constructs the function

  function(...) cat("protocol:", ..1, "server:", ..3, "host:", ..4, "domain:", ..5, "path:", ..7, "\n")

which we could have used in place of the formula.

  > library(gsubfn)
  > myurl <- "http://abc.com/main/def.html"
  > pat <- "^(\\w+)://((\\w+)[.])?(\\w+)[.](\\w+)(/(.*))$"
  >
  > strapply(myurl, pat, c, simplify = unlist)
  [1] "http"           ""               ""               "abc"
  [5] "com"            "/main/def.html" "main/def.html"
  >
  > junk <- strapply(myurl, pat, ~ cat("protocol:", ..1, "server:", ..3, "host:",
  +    ..4, "domain:", ..5, "path:", ..7, "\n"))
  protocol: http server:  host: abc domain: com path: main/def.html

gsubfn and strapply in the gsubfn package support ordinary regular expressions and Perl regular expressions, as in R, and also support Tcl regular expressions.

> I am currently only using split and paste. Are there better and more
> efficient ways to handle strings e.g. finding sub-strings or to do
> pattern matching?

Read the help pages of these commands:

  help.search(keyword = "character", package = "base")

> What packages do you use if you have to do a lot of String processing
> and you don't have the option to go to another language such as Perl
> or Python?

See the gsubfn home page at: http://gsubfn.googlecode.com
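
If installing a package is not an option, the same extraction can probably be done in base R alone. The following is only a sketch, assuming a version of R that provides regexec and regmatches; parse_url is a hypothetical helper name made up for this illustration (it is not part of gsubfn or base R), and the pattern is the same one used in the strapply example.

```r
# Same pattern as in the strapply example
pat <- "^(\\w+)://((\\w+)[.])?(\\w+)[.](\\w+)(/(.*))$"

# Hypothetical helper (not from gsubfn or base R): returns a named
# character vector of URL parts, or NULL if the URL does not match.
parse_url <- function(url) {
  # regexec returns the positions of the full match and of each back
  # reference; regmatches turns those positions into substrings
  m <- regmatches(url, regexec(pat, url))[[1]]
  if (length(m) == 0) return(NULL)  # no match at all
  # m[1] is the full match, so back reference k sits at m[k + 1]
  c(protocol = m[2], server = m[4], host = m[5], domain = m[6], path = m[8])
}

parts <- parse_url("http://abc.com/main/def.html")
print(parts)

# File type based on the ending, via the tools package shipped with R
print(tools::file_ext("main/def.html"))
```

A NULL result from parse_url can double as a rough "does this look like a URL" check, but only with respect to this particular pattern, which is far from a full URL validator.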