[R] How to apply XPath query on XML nodes separately?

Asis Hallab Thu, 27 Dec 2012 19:28:05 -0800

Dear R experts,

I am trying to extract certain child nodes from an XML document and build a
table in which the parent node names are the columns and each cell holds the
id values of that parent's child nodes, joined into a list.


If I first apply an XPath query to extract all of these parent nodes, then
iterate over them and apply a second XPath query to each one to select its
child nodes, I get *ALL* matching child nodes of the whole document, *not*
just those of the parent currently being queried.
I know this is because I prefix my XPath query with // and apparently any
given XMLNode "knows" about its whole document,
but I have not been able to find a proper solution.

So, my question is:
How do I restrict a call to getNodeSet to a single XMLNode rather than the
whole document it was retrieved from?
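
To make the behaviour concrete, here is a minimal sketch on a toy document.
The XML string and the names toy.xml, abs.ids and rel.ids are made up for
illustration and are not UniProt data. It contrasts an absolute path (//item)
with a relative one (.//item) when getNodeSet is called on a single node; as
far as I understand XPath, only the relative form is evaluated starting from
the node that is passed in.

```r
library(XML)

# Toy document (hypothetical): two parents with distinct children.
toy.xml <- '<root>
  <entry name="A"><item id="a1"/><item id="a2"/></entry>
  <entry name="B"><item id="b1"/></entry>
</root>'

doc <- xmlInternalTreeParse( toy.xml, asText=TRUE )
entries <- getNodeSet( doc, "//entry" )

# Absolute path: starts at the document root, whichever node is passed in,
# so every entry yields all three items.
abs.ids <- lapply( entries, function( e ) {
  vapply( getNodeSet( e, "//item" ), xmlGetAttr, character( 1 ), 'id' )
})

# Relative path: the leading '.' anchors the query at the node passed in,
# so every entry yields only its own items.
rel.ids <- lapply( entries, function( e ) {
  vapply( getNodeSet( e, ".//item" ), xmlGetAttr, character( 1 ), 'id' )
})
```

With this toy input, abs.ids repeats the same three ids for both entries,
whereas rel.ids keeps a1 and a2 under the first entry and b1 under the second.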

I use the XML and RCurl packages. The document I speak of is downloaded
from uniprot.org, a protein knowledge server well known to biologists.

The lamentably somewhat lengthy code follows:

library(XML)
library(RCurl)

getEntries <- function( uniprot.xml, uniprot.error.msg.regex='^ERROR' ) {
  # Uniprot's dbfetch can be asked to return several entry tags in the same
  # XML document. This function uses XPath queries to extract all complete
  # uniprot tags.
  #
  # Args:
  #  uniprot.xml             : The result of a web fetch to Uniprot, e.g.
  #                            using getURL.
  #  uniprot.error.msg.regex : A regular expression to avoid parsing an
  #                            error returned from Uniprot.
  #
  # Returns: A list of extracted uniprot-entry-tags as returned by function
  # 'getNodeSet'.
  #
  if ( ! is.null( uniprot.xml ) && '' != uniprot.xml &&
    ! grepl( uniprot.error.msg.regex, uniprot.xml )
  ) {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    getNodeSet(
      xmlInternalTreeParse( uniprot.xml ),
      "//xmlns:entry", namespaces=ns
    )
  } else {
    NULL
  }
}

extractExperimentallyVerifiedGoAnnos <- function( doc ) {
  # Uses XPath to extract those GO annotations that are experimentally
  # verified. Note that warnings generated by calls to the XML library are
  # suppressed so as not to confuse the user when no experimentally verified
  # GO annotations can be found.
  #
  # Args:
  #  doc : An XML tag of type entry as returned e.g. by function 'getEntries'
  #
  # Returns: A character vector of the extracted experimentally verified GO
  # annotations, or NULL, if none can be found.
  #
  block <- function() {
    ns <- c( xmlns="http://uniprot.org/uniprot" )
    ndst <- suppressWarnings(
      getNodeSet( doc,
        paste0(
          "//xmlns:dbReference[@type='GO']//xmlns:property[@type='evidence' ",
          "and ( contains(@value, 'EXP') or contains(@value, 'IDA') or ",
          "contains(@value, 'IPI') or contains(@value, 'IMP') or ",
          "contains(@value, 'IGI') or contains(@value, 'IEP') ) ]/.."
        ),
        namespaces=ns
      )
    )
    if ( ! is.null( ndst ) && length( ndst ) > 0 )
      vapply( ndst, xmlGetAttr, vector( mode='character', length=1 ), 'id' )
    else
      NULL
  }
  tryCatch( block(), error=function( err ) {
    warning( err, " caused by document ", doc )
  })
}

uniprotkb.url <- function( accession, frmt='xml' ) {
  # Returns a valid URL to access Uniprot's RESTful Web-Service to download
  # data about the Protein as referenced by the argument 'accession'.
  # Note that the accession is URL encoded before being pasted into the
  # Uniprot URL template.
  #
  # Args:
  #  accession : The Protein's Uniprot accession.
  #  frmt      : The format of the downloaded Uniprot Entry. Default is
  #              'xml'.
  #
  # Returns: The Uniprot URL for the argument accession.
  #
  paste(
    'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb/',
    URLencode( accession ),
    '/', frmt, sep=''
  )
}

retrieveExperimentallyVerifiedGOAnnotations <- function( uniprot.accessions ) {
  # Downloads and parses XML documents from Uniprot for each accession in
  # argument. Extracts all experimentally verified GO annotations.
  #
  # Args:
  #  uniprot.accessions : A character vector of Uniprot accessions.
  #
  # Returns: A matrix with row 'GO' and one column for each Uniprot
  # accession. Each cell is either NULL or a character vector holding all
  # experimentally verified GO annotations. NULL annotations are excluded,
  # so the returned matrix can be of zero columns and a single row.
  #
  fetch.url <- uniprotkb.url(
    paste( uniprot.accessions, collapse=",", sep="" )
  )
  uniprot.entries <- getEntries( getURL( fetch.url ) )
  if ( ! is.null(uniprot.entries) && length( uniprot.entries ) > 0 ) {
    annos <- do.call( 'cbind',
      lapply( uniprot.entries , function( d ) {
        list( 'GO'=extractExperimentallyVerifiedGoAnnos( d ) )
      })
    )
    colnames( annos ) <- uniprot.accessions
    # Exclude NULL columns:
    annos[ , as.character( annos[ 'GO', ] ) != 'NULL' , drop=FALSE ]
  }
}

as.data.frame(
  retrieveExperimentallyVerifiedGOAnnotations( c( "A0AEI7", "Q9ZZX1" ) )
)

Returns:
                   A0AEI7                 Q9ZZX1
GO GO:0004519, GO:0006316 GO:0004519, GO:0006316

But the result should have only a single column, because A0AEI7 does not
have any experimentally verified Gene Ontology annotations.
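
For comparison, here is a rough sketch of the per-entry result I am after,
run on a namespaced toy document rather than a real download. The accessions
X1 and X2, the GO ids and the evidence values are placeholders, the prefix
'up' is an arbitrary choice of mine, and only the 'IDA' evidence code is
checked to keep it short. It assumes that a query starting with './/' is
evaluated relative to the entry node it is applied to.

```r
library(XML)

# Toy stand-in for a UniProt response; all values are placeholders.
toy.xml <- '<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X1</accession>
    <dbReference type="GO" id="GO:0000001">
      <property type="evidence" value="IEA:InterPro"/>
    </dbReference>
  </entry>
  <entry>
    <accession>X2</accession>
    <dbReference type="GO" id="GO:0000002">
      <property type="evidence" value="IDA:UniProtKB"/>
    </dbReference>
  </entry>
</uniprot>'

ns <- c( up="http://uniprot.org/uniprot" )
doc <- xmlInternalTreeParse( toy.xml, asText=TRUE )
entries <- getNodeSet( doc, "//up:entry", namespaces=ns )

# './/' keeps each query inside the entry node it is applied to.
go.ids <- lapply( entries, function( e ) {
  hits <- getNodeSet( e,
    paste0(
      ".//up:dbReference[@type='GO']",
      "[up:property[@type='evidence' and contains(@value, 'IDA')]]"
    ),
    namespaces=ns
  )
  vapply( hits, xmlGetAttr, character( 1 ), 'id' )
})

# Label each result with the accession found inside the same entry.
names( go.ids ) <- vapply( entries, function( e ) {
  xmlValue( getNodeSet( e, "./up:accession", namespaces=ns )[[ 1 ]] )
}, character( 1 ) )

# Keep only entries with at least one matching annotation.
go.ids <- go.ids[ vapply( go.ids, length, integer( 1 ) ) > 0 ]
```

On this toy input only X2 remains, which is the single-column shape I would
expect for the real accessions above.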

Thank you very much in advance for your kind help!

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.