It helps, but it is more sophisticated than I need. I have already downloaded and used the Excel file containing the validated targets.
Since there are R commands to download gzip as well as FASTA files, I
wonder whether it is possible to automatically download the Excel file
from
http://mirecords.umn.edu/miRecords/download.php
Actually, that may not be the actual file URL, because it is necessary
to click on the word "here" to download the file.

Thank you,
Maura

-----Original Message-----
From: Martin Morgan [mailto:mtmor...@fhcrc.org]
Sent: Sun 05/07/2009 21:42
To: mau...@alice.it
Cc: r-h...@stat.math.ethz.ch
Subject: Re: R: [R] Is there a way to extract some fields data from HTML pages through any R function ?

mau...@alice.it wrote:
> I tried to apply the scheme you suggested to open the web page on
> "http://mirecords.umn.edu/miRecords/index.php" and got the following:
>
>> result <- postForm("http://mirecords.umn.edu/miRecords/index.php",
> +     searchType="miRNA", species="Homo sapiens",
> +     searchBox="hsa-let-7a", submitButton="Search")

What we are doing here is sometimes called 'screen scraping' --
figuring out how to extract information from a web page when the
information is not presented in an alternative, more reliable, form. I
offered this route as a response to your specific question, how to
extract some fields from an HTML page, but maybe there is a better way
that is specific to the resources and information you are trying to
extract. For instance, I see on the web page above that there is a link
'Download validated targets' that leads to an Excel-style spreadsheet.
Maybe that is a better route for this resource? I don't know.

In terms of the problem you are encountering above, the fields
searchType, species, searchBox, and submitButton were all defined on
the web page of the resource you mentioned in a previous email; here
you must look at the 'source' (e.g., right-click 'View Page Source' in
Firefox) of the web page you are trying to scrape, and figure out the
appropriate fields.
This requires some familiarity with HTML and HTML forms, so that you
can recognize what you are looking for. I think on this particular page
you are likely to run into additional difficulties, because selection
of a 'species' populates the 'mirna_acc' field with allowable values
that combine the miRNA name with the number of validated targets that
will be returned -- you almost need to know the answer before you can
programmatically extract the data.

>> html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
> Unexpected end tag : a
> error parsing attribute name
> Opening and ending tag mismatch: strong and font
> htmlParseStartTag: invalid element name
> Unexpected end tag : a

htmlTreeParse is very forgiving of malformed HTML, and it is telling
you that it has parsed the document, even though it was formatted
incorrectly.

>> html <- htmlTreeParse(result, asText=FALSE, useInternalNodes=TRUE)

There are too many parameters involved to try changing them
arbitrarily; you must take it upon yourself to understand the functions
and the correct way to use them.
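
To illustrate the two htmlTreeParse calls discussed above, here is a
minimal sketch. With asText=TRUE the string is taken to be the document
itself; with asText=FALSE it is taken to be a file name or URL, which
is why that call fails with a "File <html>..." error. The page below is
a hypothetical, deliberately malformed stand-in (not the real miRecords
page), and the variable names are only illustrative. The last step
lists the names of the form controls, which is the kind of information
otherwise read off 'View Page Source'.

```r
library(XML)

# Hypothetical, deliberately malformed page (note the stray </a>);
# it stands in for the text returned by postForm() or getURL().
page <- paste0(
  "<html><body>",
  "<form action='search.cgi' method='post'>",
  "<select name='species'><option>Human</option></select>",
  "<input type='text' name='searchBox'>",
  "<input type='submit' name='submitButton' value='Go'></a>",
  "</form></body></html>"
)

# asText = TRUE: 'page' is the HTML itself. The parser may report the
# malformed markup, but it still returns a usable document.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Names of all form controls that carry a 'name' attribute,
# in document order.
fields <- unname(unlist(
  xpathApply(doc, "//form//*[@name]", xmlGetAttr, "name")
))
print(fields)

# asText = FALSE: 'page' is treated as a file name, so this call is
# expected to fail; the error message is captured rather than thrown.
failed <- tryCatch(
  htmlTreeParse(page, asText = FALSE, useInternalNodes = TRUE),
  error = function(e) conditionMessage(e)
)
```

The names found this way are what would be passed as arguments to
postForm for the corresponding form.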
Hoping this helps,

Martin

> Error in htmlTreeParse(result, asText = FALSE, useInternalNodes = TRUE) :
>   File <html><!-- InstanceBegin template="/Templates/admin.dwt" codeOutsideHTMLIsLocked="false" -->
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> <link href="style/link.css" rel="stylesheet" type="text/css">
> <!-- InstanceParam name="nav_1" type="boolean" value="true" -->
> <title>miRecords</title>
> </head>
> <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
> <table width="80" border="0" cellspacing="0" cellpadding="0">
> <tr>
> <td colspan="3"><img src="images/title.jpg" alt="" width=900 height=79 border="0"></a></td>
> </tr>
> <tr>
> <td width="131" valign="bottom" bgcolor="#CCCCCC"menu""></td>
> <td width="769" align="right" valign="middle" bgcolor="#CCCCCC"><a href="redirect.php?s=l" class="menu">Validated Targets </a> | <a href="redirect.php?s=p" class="menu">Predicted Targets </a> | <a href="download.php" class="menu">Download Validated Targets </a> | <a href="submit.php" class="m
>
> I am lost about how to proceed from the above.
> My goal is always to get the VALIDATED miRNA identifier and string,
> followed by its target gene's 3'UTR sequence.
>
> Thank you in advance,
> Maura
>
> P.S. BioMart has been working fine since yesterday.
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmor...@fhcrc.org]
> Sent: Wed 01/07/2009 17:51
> To: mau...@alice.it
> Cc: r-h...@stat.math.ethz.ch
> Subject: Re: [R] Is there a way to extract some fields data from HTML pages through any R function ?
>
> Hi Maura --
>
> mau...@alice.it wrote:
>> I deal with a huge amount of biology data stored in different
>> databases. The databases belonging to the Bioconductor organization
>> can be accessed through Bioconductor packages.
>> Unfortunately, some useful data is stored in databases such as miRDB,
>> miRecords, etc., which offer just an interactive HTML interface. See
>> for instance
>> http://mirdb.org/cgi-bin/search.cgi
>> http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search
>>
>> Downloading data manually from the web pages is a painstaking,
>> time-consuming, and error-prone activity.
>> I came across a Python script that downloads (dumps) whole web pages
>> into a text file that is then parsed.
>> This is possible because Python has a library to access web pages.
>> But I have no experience with Python programming, nor do I like a
>> programming language whose syntax is indentation-sensitive.
>>
>> I am *hoping* that there exists some sort of web page / HTML
>> connection from R ... is there?
>
> Tools in R for this are the RCurl package and the XML package.
>
>   library(RCurl)
>   library(XML)
>
> Typically this involves manual exploration of the web form. Then you
> might query the web form
>
>   result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
>                      searchType="miRNA", species="Human",
>                      searchBox="hsa-let-7a", submitButton="Go")
>
> and parse the results into a convenient structure
>
>   html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)
>
> you can then use XPath (http://www.w3.org/TR/xpath, especially section
> 2.5) to explore and extract information, e.g.,
>
>   ## second table, first row
>   getNodeSet(html, "//table[2]/tr[1]")
>   ## second table, makes subsequent paths shorter
>   tbl <- getNodeSet(html, "//table[2]")[[1]]
>   xget <- function(xml, path) # a helper function
>       unlist(xpathApply(xml, path, xmlValue))[-1]
>   df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
>                    TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
>                    miRNAName=xget(tbl, "./tr/td[4]"),
>                    GeneSymbol=xget(tbl, "./tr/td[5]"),
>                    GeneDescription=xget(tbl, "./tr/td[6]"))
>
> There are many ways through this latter part, probably some much
> cleaner than presented above. There are fairly extensive examples on
> each of the relevant help pages, e.g., ?postForm.
>
> Martin
>
>> Thank you very much for any suggestion.
>> Maura
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
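
On the question of fetching the Excel file automatically, one possible
approach is to read the download page, find the link behind the word
"here", and pass its URL to download.file. The sketch below is only
that: it assumes the page contains an ordinary <a> link whose text is
exactly "here", which has not been checked against the real miRecords
page. The helper name find_link, the sample page, and the file names
example.xls and validated_targets.xls are all hypothetical.

```r
library(XML)

# Return the absolute URL behind the first link whose text is
# 'link_text'. 'html_text' is the page source as a single string and
# 'base_url' is the address the page was fetched from.
find_link <- function(html_text, base_url, link_text = "here") {
  doc <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)
  xpath <- sprintf("//a[normalize-space(.) = '%s']", link_text)
  href <- unlist(xpathApply(doc, xpath, xmlGetAttr, "href"))
  if (length(href) == 0) {
    stop("no link with text '", link_text, "' found")
  }
  href <- href[[1]]
  if (grepl("^https?://", href)) {
    href                                        # already absolute
  } else if (startsWith(href, "/")) {
    # relative to the server root
    paste0(sub("^(https?://[^/]+).*$", "\\1", base_url), href)
  } else {
    # relative to the directory of the page
    paste0(sub("[^/]*$", "", base_url), href)
  }
}

# Hypothetical stand-in for the download page.
page <- paste0(
  "<html><body><p>Click <a href='download/example.xls'>here</a> ",
  "to download the file.</p></body></html>"
)
base <- "http://mirecords.umn.edu/miRecords/download.php"

file_url <- find_link(page, base)
print(file_url)

# Against the live site (not run here; the link text is an assumption):
# library(RCurl)
# page <- getURL(base)
# download.file(find_link(page, base),
#               destfile = "validated_targets.xls", mode = "wb")
```

mode = "wb" matters for binary files such as spreadsheets, particularly
on Windows. If the link turns out to be produced by a script or a form
rather than a plain <a> element, this approach will not find it.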