On Jul 8, 2010, at 10:33 PM, Erik Iverson wrote:
I have a data frame:
  id url                                                      urlType
1  1 www.yahoo.com <http://www.yahoo.com>                     1
2  2 www.google.com/?search= <http://www.google.com/?search=> 2
3  3 www.google.com <http://www.google.com>                   1
4  4 www.yahoo.com/?query= <http://www.yahoo.com/?query=>     2
5  5 www.gmail.com <http://www.gmail.com>                     1
This is not output from ?dput, which means more work to read it in.
Yeah, it was kind of a pain, but ...
dta <- read.table(textConnection(' id url urlType
1 1 "www.yahoo.com <http://www.yahoo.com>" 1
2 2 "www.google.com/?search= <http://www.google.com/?search=>" 2
3 3 "www.google.com <http://www.google.com>" 1
4 4 "www.yahoo.com/?query= <http://www.yahoo.com/?query=>" 2
5 5 "www.gmail.com <http://www.gmail.com>" 1'))
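For comparison, here is a rough sketch of what posting the data via dput() could look like. The object names dta2 and dta3 are made up for illustration, and the URLs are typed in without the mail-client link text, so this is an assumption about what the original data frame held rather than a copy of it.

```r
# Hypothetical rebuild of the posted data (names dta2/dta3 are illustrative only)
dta2 <- data.frame(
  id = 1:5,
  url = c("www.yahoo.com", "www.google.com/?search=", "www.google.com",
          "www.yahoo.com/?query=", "www.gmail.com"),
  urlType = c(1, 2, 1, 2, 1),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that recreates the object exactly;
# pasting that expression into a message lets readers rebuild the data
dput(dta2)

# Round trip: capture the printed expression and evaluate it again
txt <- paste(capture.output(dput(dta2)), collapse = "\n")
dta3 <- eval(parse(text = txt))
```

Anyone receiving the dput() text can assign it straight to a name, with no guessing about column breaks or quoting.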
Here is the definition for WHITELIST:-
WHITELIST = "[?]query=, [?]search="
WHITELIST <- unlist(trim(strsplit(trim(WHITELIST), ",")))
What is the 'trim' function? I do not have that defined.
Perhaps David's answer will work for you...
Seems to ... after I fixed my incorrect cmd-V paste of the function
name and guessed that trim was the one in gdata:
> require(gdata)
> checkBaseLine <- function(s){
+ for (listItem in WHITELIST){
+ if(regexpr(as.character(listItem), s)[1] > -1){
+ return(TRUE)
+ }
+ }
+ return(FALSE)
+ }
>
> #Here is the definition for WHITELIST:-
>
> WHITELIST = "[?]query=, [?]search="
> WHITELIST <- unlist(trim(strsplit(trim(WHITELIST), ",")))
> vcheck <- Vectorize(checkBaseLine)
>
> dta[ dta$urlType != 1 & vcheck(dta$url) , "url" ]
[1] www.google.com/?search= <http://www.google.com/?search=> www.yahoo.com/?query= <http://www.yahoo.com/?query=>
5 Levels: www.gmail.com <http://www.gmail.com> www.google.com <http://www.google.com> ... www.yahoo.com/?query= <http://www.yahoo.com/?query=>
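A possible variation, offered only as a sketch: the same filter can probably be written without gdata or Vectorize, because grepl() is already vectorized over its input. This assumes an R version that has trimws() in base (3.2.0 or later), and the names patterns, combined, urls, and hits are invented for the example. The URLs are typed in without the mail-client link text.

```r
# Same whitelist string as in the thread
WHITELIST <- "[?]query=, [?]search="

# Split on commas and strip surrounding whitespace with base trimws()
patterns <- trimws(strsplit(WHITELIST, ",")[[1]])

# Join the pieces into a single regular expression with alternation
combined <- paste(patterns, collapse = "|")

# Illustrative copies of the two columns used in the subset
urls    <- c("www.yahoo.com", "www.google.com/?search=", "www.google.com",
             "www.yahoo.com/?query=", "www.gmail.com")
urlType <- c(1, 2, 1, 2, 1)

# grepl() returns one TRUE/FALSE per element, so no explicit loop is needed
hits <- urlType != 1 & grepl(combined, urls)
urls[hits]
```

Collapsing the whitelist into one pattern means each URL is scanned once instead of once per whitelist entry, which may matter on a long column.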
--
David.