If the question is how to download an Excel file at a known location and read it into an R data frame, then try this:

library(gdata)
URL <- "http://mirecords.umn.edu/miRecords/download_data.php?v=1"
DF <- read.xls(URL)

See ?read.xls for more info.

On Mon, Jul 6, 2009 at 2:27 AM, <mau...@alice.it> wrote:
> It helps. But it is overly sophisticated.
> I have already downloaded and used the Excel file containing the validated
> stuff.
>
> Since there are R commands to download gzip as well as FASTA files, I wonder
> whether it is possible to automatically download the Excel file from
> http://mirecords.umn.edu/miRecords/download.php
> Actually the latter may not be the actual file URL because it is necessary to
> click on the word "here" to download the file.
>
> Thank you,
> Maura
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmor...@fhcrc.org]
> Sent: Sun 05/07/2009 21.42
> To: mau...@alice.it
> Cc: r-h...@stat.math.ethz.ch
> Subject: Re: R: [R] Is there a way to extract some fields data from HTML
> pages through any R function ?
>
> mau...@alice.it wrote:
>> I tried to apply the scheme you suggested to open the web page on
>> "http://mirecords.umn.edu/miRecords/index.php" and got the following:
>>
>>> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
>> +     searchType="miRNA", species="Homo sapiens",
>> +     searchBox="hsa-let-7a", submitButton="Search")
>
> What we are doing here is sometimes called 'screen scraping' -- figuring
> out how to extract information from a web page when the information is
> not presented in an alternative, more reliable, form. I offered this
> route as a response to your specific question, how to extract some
> fields from an HTML page, but maybe there is a better way that is
> specific to the resources and information you are trying to extract. For
> instance, I see on the web page above that there is a link 'Download
> validated targets' that leads to an Excel-style spreadsheet. Maybe that
> is a better route for this resource? I don't know.
>
> In terms of the problem you are encountering above, the fields
> searchType, species, searchBox, and submitButton were all defined on the
> web page of the resource you mentioned in a previous email; here you
> must look at the 'source' (e.g., right-click 'View Page Source' in
> Firefox) of the web page you are trying to scrape, and figure out the
> appropriate fields. This requires some familiarity with html and html
> forms, so that you can recognize what you are looking for. I think on
> this particular page you are likely to run into additional
> difficulties, because selection of a 'species' populates the 'mirna_acc'
> field with allowable values that combine the miRNA name with the number
> of validated targets that will be returned -- you almost need to know
> the answer before you can programmatically extract the data.
>
>>> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
>> Unexpected end tag : a
>> error parsing attribute name
>> Opening and ending tag mismatch: strong and font
>> htmlParseStartTag: invalid element name
>> Unexpected end tag : a
>
> htmlTreeParse is very forgiving of malformed html, and it is telling
> you that it has parsed the document, even though it was formatted
> incorrectly.
>
>>> html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)
>
> There are too many parameters involved to try changing them arbitrarily;
> you must take it upon yourself to understand the functions and the
> correct way to use them.
>
> Hoping this helps,
>
> Martin
>
>> Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) :
>>   File <html><!-- InstanceBegin template="/Templates/admin.dwt"
>> codeOutsideHTMLIsLocked="false" -->
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
>> <link href="style/link.css" rel="stylesheet" type="text/css">
>> <!-- InstanceParam name="nav_1" type="boolean" value="true" -->
>> <title>miRecords</title>
>> </head>
>> <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0"
>> marginheight="0">
>> <table width="80" border="0" cellspacing="0" cellpadding="0">
>> <tr>
>> <td colspan="3"><img src="images/title.jpg" alt="" width=900
>> height=79 border="0"></a></td>
>> </tr>
>> <tr>
>> <td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>
>> <td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a
>> href="redirect.php?s=l" class="menu">Validated Targets </a> | <a
>> href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a
>> href="download.php" class="menu">Download Validated Targets </a> | <a
>> href="submit.php" class="m
>>>
>>
>> I am lost about how to proceed from the above.
>> My goal is always to get the VALIDATED miRNA identified and string
>> followed by its target gene's 3'utr sequence.
>>
>> Thank you in advance,
>> Maura
>>
>> P.S. BioMart has been working fine since yesterday
>>
>> -----Original Message-----
>> From: Martin Morgan [mailto:mtmor...@fhcrc.org]
>> Sent: Wed 01/07/2009 17.51
>> To: mau...@alice.it
>> Cc: r-h...@stat.math.ethz.ch
>> Subject: Re: [R] Is there a way to extract some fields data from HTML
>> pages through any R function ?
>>
>> Hi Maura --
>>
>> mau...@alice.it wrote:
>>> I deal with a huge amount of Biology data stored in different databases.
>>> The databases belonging to the Bioconductor organization can be accessed
>>> through Bioconductor packages.
>>> Unluckily some useful data is stored in databases like, for instance,
>>> miRDB, miRecords, etc ... which offer just an
>>> interactive HTML interface. See for instance
>>> http://mirdb.org/cgi-bin/search.cgi,
>>> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>>>
>>> Downloading data manually from the web pages is a painstaking,
>>> time-consuming and error-prone activity.
>>> I came across a Python script that downloads (dumps) whole web pages
>>> into a text file that is then parsed.
>>> This is possible because Python has a library to access web pages.
>>> But I have no experience with Python programming, nor do I like such a
>>> programming language whose syntax is indentation-sensitive.
>>>
>>> I am *hoping* that there exists some sort of web pages, HTML
>>> connection from R ... is there ??
>>
>> Tools in R for this are the RCurl package and the XML package.
>>
>> library(RCurl)
>> library(XML)
>>
>> Typically this involves manual exploration of the web form. Then you
>> might query the web form
>>
>> result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
>>     searchType="miRNA", species="Human",
>>     searchBox="hsa-let-7a", submitButton="Go")
>>
>> and parse the results into a convenient structure
>>
>> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
>>
>> you can then use XPath (http://www.w3.org/TR/xpath, especially section
>> 2.5) to explore and extract information, e.g.,
>>
>> ## second table, first row
>> getNodeSet(html, "//table[2]/tr[1]")
>> ## second table, makes subsequent paths shorter
>> tbl <- getNodeSet(html, "//table[2]")[[1]]
>> xget <- function(xml, path) # a helper function
>>     unlist(xpathApply(xml, path, xmlValue))[-1]
>> df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
>>                  TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
>>                  miRNAName=xget(tbl, "./tr/td[4]"),
>>                  GeneSymbol=xget(tbl, "./tr/td[5]"),
>>                  GeneDescription=xget(tbl, "./tr/td[6]"))
>>
>> There are many ways through this latter part, probably some much cleaner
>> than presented above. There are fairly extensive examples on each of the
>> relevant help pages, e.g., ?postForm.
>>
>> Martin
>>
>>> Thank you very much for any suggestion.
>>> Maura
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
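
A note on the asText argument that comes up in the quoted exchange, offered as a minimal sketch rather than anything definitive. With asText=TRUE the XML package treats the character string as the HTML itself; with asText=FALSE it treats it as a file name or URL, which appears to be why the second htmlTreeParse call failed with "File <html>...". The page below is made up so that the example runs without network access: the table layout, the GENE_A/GENE_B values and the helper cell_text are assumptions for illustration, not miRecords or miRDB output.

```r
library(XML)

# Made-up page standing in for a search result; real pages will differ.
page <- paste0(
  "<html><body>",
  "<table><tr><td>navigation</td></tr></table>",
  "<table>",
  "<tr><th>Rank</th><th>Score</th><th>Gene</th></tr>",
  "<tr><td>1</td><td>98.5</td><td>GENE_A</td></tr>",
  "<tr><td>2</td><td>91.0</td><td>GENE_B</td></tr>",
  "</table>",
  "</body></html>")

# asText = TRUE: 'page' holds the HTML itself, not a file name or URL.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Second table on the page; the relative paths below start from this node.
tbl <- getNodeSet(doc, "//table[2]")[[1]]

# Hypothetical helper: text of every node matching 'path' below 'node'.
cell_text <- function(node, path) xpathSApply(node, path, xmlValue)

# The header row uses <th>, so selecting <td> cells picks up data rows only.
targets <- data.frame(
  Rank  = as.numeric(cell_text(tbl, ".//tr/td[1]")),
  Score = as.numeric(cell_text(tbl, ".//tr/td[2]")),
  Gene  = cell_text(tbl, ".//tr/td[3]"),
  stringsAsFactors = FALSE)
print(targets)
```

For a real page the XPath expressions have to be worked out from the page source, as described in the thread; the column positions used here only fit the made-up table.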