Sorry to answer my own question, but here is one way to read this table. Other suggestions are still welcome.

Chris

------

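library(XML)

# asText = TRUE makes it explicit that the string is HTML content, not a file name
x <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)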

# split by rows
z <- getNodeSet(x, "//tr")

# create empty data.frame - probably not the best solution...
# (the number of columns is still hard-coded for this example)
t1 <- data.frame(matrix(NA, nrow = length(z), ncol = 2))

for (i in seq_along(z)) {
  # rowspan defaults to 1 when the attribute is missing
  rowspan <- as.numeric(xpathSApply(z[[i]], ".//td", xmlGetAttr, "rowspan", 1))
  val <- xpathSApply(z[[i]], ".//td", xmlValue)

  # fill values into the cells of row i that are still empty
  n <- which(is.na(t1[i, ]))
  t1[i, n] <- val

  # repeat each value with rowspan > 1 down its column
  for (j in which(rowspan > 1)) {
    t1[(i + 1):(i + rowspan[j] - 1), n[j]] <- val[j]
  }
}


t1
  X1 X2
1 ab  X
2 ab YZ
3  c YZ
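
The same idea could be wrapped in a function that works out the table size itself. This is only a sketch: expand_rowspan is a made-up name (it is not part of the XML package), and it assumes a single table with no colspan attributes, no nested tables, and only td cells.

```r
library(XML)

# Hypothetical helper, not part of the XML package.
# Assumes one table, no colspan attributes, no nested tables, td cells only.
expand_rowspan <- function(doc) {
  rows <- getNodeSet(doc, "//tr")
  # the first row has no cells carried down from above, so it sets the width
  ncols <- length(getNodeSet(rows[[1]], ".//td"))
  out <- matrix(NA_character_, nrow = length(rows), ncol = ncols)

  for (i in seq_along(rows)) {
    val  <- unlist(xpathSApply(rows[[i]], ".//td", xmlValue))
    span <- as.numeric(unlist(xpathSApply(rows[[i]], ".//td", xmlGetAttr, "rowspan", 1)))
    free <- which(is.na(out[i, ]))  # columns not already filled from a row above

    for (j in seq_along(val)) {
      last <- min(i + span[j] - 1, nrow(out))  # clip spans that overrun the table
      out[i:last, free[j]] <- val[j]
    }
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

doc <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)

res <- expand_rowspan(doc)
res
```

Using a character matrix instead of a data.frame while filling avoids the hard-coded dimensions, and clipping the span at the last row keeps a too-large rowspan from adding rows.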


If you are interested, I used this code in the pmcTable function at https://github.com/cstubben/pubmed. To get Table 1, this now works:

doc<-pmc("PMC3544749")  # downloads XML from OAI service
t1 <- pmcTable(doc,1) # parse table... also saves caption and footnotes to attributes
t1[1:4,1:4]
  Category                         Gen Name Rv number Description
1 Lipids and Fatty Acid Metabolism kasB     Rv2246    3-oxoacyl-[acyl-carrier protein] synthase 2 kasb
2 Mycolic acid synthesis           mmaA4    Rv0642c   Methoxy mycolic acid synthase 4
3 Mycolic acid synthesis           pcaA     Rv0470c   Mycolic acid synthase (cyclopropane synthase)
4 Mycolic acid synthesis           pcaA     Rv0470c   Mycolic acid synthase (cyclopropane synthase)




--

Chris Stubben

Los Alamos National Lab
Bioscience Division
MS M888
Los Alamos, NM 87545

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.