On Jul 8, 2010, at 10:33 PM, Erik Iverson wrote:


I have a data frame:
  id url                                                      urlType
1  1 www.yahoo.com <http://www.yahoo.com>                     1
2  2 www.google.com/?search= <http://www.google.com/?search=> 2
3  3 www.google.com <http://www.google.com>                   1
4  4 www.yahoo.com/?query= <http://www.yahoo.com/?query=>     2
5  5 www.gmail.com <http://www.gmail.com>                     1

This is not output from ?dput, which means more work to read it in.
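
For reference, a minimal sketch of what posting with dput() could look like. The data frame here is retyped by hand from the table above with the mail client's autolink text left out, so its values are an assumption and not the poster's actual object.

```r
# Hypothetical reconstruction of the poster's data, typed by hand.
dta <- data.frame(
  id      = 1:5,
  url     = c("www.yahoo.com", "www.google.com/?search=", "www.google.com",
              "www.yahoo.com/?query=", "www.gmail.com"),
  urlType = c(1, 2, 1, 2, 1),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that recreates the object;
# that printed text is what goes into the message.
dput(dta)

# A reader can turn the same text back into an object.
dta2 <- eval(parse(text = paste(deparse(dta), collapse = "\n")))
all.equal(dta, dta2)
```

With that in the post, nobody has to guess at column types or re-quote strings that contain spaces.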


Yeah, it was kind of a pain, but ...

dta <- read.table(textConnection(' id url urlType
1     1      "www.yahoo.com <http://www.yahoo.com>"      1
2     2      "www.google.com/?search= <http://www.google.com/?search=>" 2
3     3      "www.google.com <http://www.google.com>" 1
4     4      "www.yahoo.com/?query= <http://www.yahoo.com/?query=>"   2
5     5      "www.gmail.com <http://www.gmail.com>" 1') )



Here is the definition for WHITELIST:-
WHITELIST = "[?]query=, [?]search="
WHITELIST <- unlist(trim(strsplit(trim(WHITELIST), ",")))

What is the 'trim' function?  I do not have that defined.
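
If trim is only meant to strip leading and trailing whitespace (an assumption, since the post does not say where it comes from), a base-R stand-in might look like the sketch below. The helper defined here is hypothetical and is not the function the original poster used.

```r
# Hypothetical stand-in for trim(): strip leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

WHITELIST <- "[?]query=, [?]search="

# strsplit() returns a list, so unlist() first, then trim each piece.
WHITELIST <- trim(unlist(strsplit(WHITELIST, ",")))
WHITELIST
```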

Perhaps David's answer will work for you...

Seems to ... after I fixed my incorrect cmd-V paste of the function name and guessed that trim was the one in gdata:

> require(gdata)
> checkBaseLine <- function(s){
+ for (listItem in WHITELIST){
+ if(regexpr(as.character(listItem), s)[1] > -1){
+ return(TRUE)
+ }
+ }
+ return(FALSE)
+ }
>
> #Here is the definition for WHITELIST:-
>
> WHITELIST = "[?]query=, [?]search="
> WHITELIST <- unlist(trim(strsplit(trim(WHITELIST), ",")))
> vcheck <- Vectorize(checkBaseLine)
>
> dta[ dta$urlType != 1 & vcheck(dta$url) , "url" ]
[1] www.google.com/?search= <http://www.google.com/?search=> www.yahoo.com/?query= <http://www.yahoo.com/?query=>
5 Levels: www.gmail.com <http://www.gmail.com> www.google.com <http://www.google.com> ... www.yahoo.com/?query= <http://www.yahoo.com/?query=>
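
A possible alternative, sketched under a few assumptions: the loop and Vectorize() can be replaced by a single grepl() call if the whitelist entries are joined into one regular expression. The data frame is a hand-typed stand-in for dta without the autolink text, and the whitelist is written out already split and trimmed.

```r
# Whitelist entries, already split and trimmed (assumed values from the post).
WHITELIST <- c("[?]query=", "[?]search=")

# Hand-typed stand-in for dta; not the poster's actual object.
dta <- data.frame(
  id      = 1:5,
  url     = c("www.yahoo.com", "www.google.com/?search=", "www.google.com",
              "www.yahoo.com/?query=", "www.gmail.com"),
  urlType = c(1, 2, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Join the patterns with "|" so one regex matches any whitelist entry.
pattern <- paste(WHITELIST, collapse = "|")

# grepl() is vectorized over the strings being searched,
# so no explicit loop or Vectorize() wrapper is needed.
hits <- dta$urlType != 1 & grepl(pattern, dta$url)
dta[hits, "url"]
```

This selects the same two URLs (rows 2 and 4) and does not depend on gdata.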

--
David.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
