Let's say that I have the following character vector with a series of URL strings. I'm interested in extracting some information from each string.
url = c("http://www.mdd.com/food/pizza/index.html",
        "http://www.mdd.com/build-your-own/index.html",
        "http://www.mdd.com/special-deals.html",
        "http://www.genius.com/find-a-location.html",
        "http://www.google.com/hello.html")

- First, I want to extract the domain name followed by .com. After struggling with this for a while, reading some regular expression tutorials, and reading through Stack Overflow, I came up with the following solution, which works:

> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
> parser(url)
[1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"

- Second, I want to extract everything after .com in the original URL. Unfortunately, I don't know the proper regular expression to get the desired result. Can anyone help?

For the first three URLs, the output should be:

/food/pizza/index.html
/build-your-own/index.html
/special-deals.html

If anyone has a solution using the stringr package, that would be of interest also. Thanks.

--
Abraham Mathew
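For the second step, one possible approach is to strip the scheme and host with a single sub() call and keep whatever follows. This is only a sketch: it assumes every string starts with http:// or https:// (possibly after stray whitespace) and that the host ends at the first "/". The names urls and path_after_host are hypothetical, and the stringr variant runs only if the stringr package is installed.

```r
# Example data copied from the question.
urls <- c("http://www.mdd.com/food/pizza/index.html",
          "http://www.mdd.com/build-your-own/index.html",
          "http://www.mdd.com/special-deals.html",
          "http://www.genius.com/find-a-location.html",
          "http://www.google.com/hello.html")

# Hypothetical helper: drop optional leading whitespace, the scheme, and the
# host (everything up to the first "/" after "://"); the path is what remains.
path_after_host <- function(x) sub("^[[:space:]]*https?://[^/]+", "", x)

print(path_after_host(urls))

# Same idea with stringr, skipped if the package is not available.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_replace(urls, "^\\s*https?://[^/]+", ""))
}
```

Because the pattern matches the host generically rather than the literal ".com", it should also work for other top-level domains, which may or may not be what is wanted here.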