I'm trying to read some web tables directly into R. The two pages list genome sequencing projects (eukaryotes and metagenomes) at NCBI and look very similar; however, only the first one works.
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

I added ?dump=selected to the end of the url string to get a tab-delimited file (which is what happens if you click the Save button on either page).

> options(internet.info=0)

## this one works
> x1<-url("http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?dump=selected")
> read.delim(x1, skip=1, nrows=5)[,1:3]
  X...Columns.                     ProjectID Organism.Name
1        20303 Acanthamoeba castellanii Neff      Protists
2        13657      Acyrthosiphon pisum LSR1       Animals
3        12434       Aedes aegypti Liverpool       Animals
4        12635 Ajellomyces capsulatus G186AR         Fungi
5        12653  Ajellomyces capsulatus G217B         Fungi
Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/leuks.cgi?dump=selected HTTP/1.0
   Host: www.ncbi.nlm.nih.gov
   Pragma: no-cache
   in: open.connection(file, "r")
3: <- HTTP/1.1 200 OK in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:03:29 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Disposition: attachment; filename="untitle.txt" in: open.connection(file, "r")
7: <- Content-Type: application/force-download in: open.connection(file, "r")
8: <- Vary: Accept-Encoding in: open.connection(file, "r")
9: <- Connection: close in: open.connection(file, "r")
10: Code 200, content-type 'application/force-download' in: open.connection(file, "r")

## this one fails to open a connection
> x2<-url("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected")
> read.delim(x2, skip=1, nrows=5)[,1:3]
Error in open.connection(file, "r") : unable to open connection
In addition: Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/lenvs.cgi?dump=selected HTTP/1.0
   Host: www.ncbi.nlm.nih.gov
   Pragma: no-cache
   in: open.connection(file, "r")
3: <- HTTP/1.1 500 Internal Server Error in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:04:26 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Type: text/html; charset=ISO-8859-1 in: open.connection(file, "r")
7: <- Vary: Accept-Encoding in: open.connection(file, "r")
8: <- Connection: close in: open.connection(file, "r")
9: Code 500, content-type 'text/html; charset=ISO-8859-1' in: open.connection(file, "r")
10: cannot open: HTTP status was '500 Internal Server Error' in: open.connection(file, "r")

Also, I can't even read lines from the main page.

> readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10)
Error in file(con, "r") : unable to open connection
...

## now I'm just guessing...
> readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10, encoding="ISO-8859-1")
Error in file(con, "r") : unable to open connection
...

download.file works fine, but I would like to avoid this if possible.

> capabilities()[5]
http/ftp
    TRUE
> download.file("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected", "lenvs.tab")
> read.delim("lenvs.tab", skip=1, nrows=5)[,1:3]
  X...Columns. Parent.ProjectID                                         ProjectID
1        19733            13694       Global Ocean Sampling Expedition Metagenome
2        20823            13696  5-Way (CG) Acid Mine Drainage Biofilm Metagenome
3            -            13699                Waseca County Farm Soil Metagenome
4            -            13702 Methane-Oxidizing Archaea from Deep-Sea Sediments
5            -            13729                     Pacific Beach Sand Metagenome

Thanks for your help. Hopefully this is something simple that I missed in the documentation/help.
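
For reference, the download.file workaround above can be wrapped in a small helper so that no file is left behind in the working directory. This is only a sketch: read_delim_via_download is a made-up name, not an existing R function, and it still downloads the whole table to disk rather than streaming it through a url() connection, so it does not explain or fix the 500 error itself.

```r
# Hypothetical helper (name invented for this sketch): download a URL to a
# temporary file, read it as tab-delimited text, then delete the file.
read_delim_via_download <- function(u, ...) {
  tmp <- tempfile(fileext = ".tab")
  on.exit(unlink(tmp), add = TRUE)   # remove the temporary file on exit
  download.file(u, destfile = tmp, quiet = TRUE)
  read.delim(tmp, ...)               # extra arguments go to read.delim
}

# Usage, assuming the URL from this post is still being served:
# x2 <- read_delim_via_download(
#   "http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
#   skip = 1, nrows = 5)
# x2[, 1:3]
```

The call then looks almost the same as read.delim on a connection, with skip and nrows passed through unchanged.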
Chris

--
-------------------
Chris Stubben
Los Alamos National Lab
BioScience Division
MS M888
Los Alamos, NM 87545