On 10/6/2010 8:52 AM, Simon Kiss wrote:
Dear Colleagues,
I used this code to scrape data from the URL contained within. This code
should be reproducible.
library(XML)
theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
class(tables)
test<-data.frame(tables, stringsAsFactors=FALSE)
test[16,c(2:5)]
as.numeric(test[16,c(2:5)])
quartz()
plot(c(1:4), test[15, c(2:5)])
Calling the values from the row of interest with test[16, c(2:5)] brings
them up as displayed on the screen, but plotting them or coercing them
to numeric changes the values in a way that doesn't make sense to me.
My intuition is that there is something going on with the way the
characters are coded or classed when they're scraped into R. I've looked
around the help files for converting from character to numeric but can't
find a solution.
I also tried this:
as.numeric(as.character(test[16,c(2:5)]))
and that also changed the values from what they originally were.
I'm grateful for any suggestions.
Yours, Simon Kiss
str() gives you an indication of how things are stored and can help in
these situations.
> str(test)
'data.frame': 45 obs. of 10 variables:
$ NULL.V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1 1 35 1 1 1 23 18 2 32 ...
$ NULL.V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1 32 3 ...
$ NULL.V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA NA 30 1 ...
$ NULL.V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA NA 30 NA ...
$ NULL.V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1 NA NA 29 NA ...
$ NULL.V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
$ NULL.V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
So columns 2-5 are factors, despite the stringsAsFactors=FALSE in the
data.frame call. That is because they were already factors in tables:
> str(tables)
List of 1
$ NULL:'data.frame': 45 obs. of 10 variables:
..$ V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1 1 35 1 1 1 23 18 2 32 ...
..$ V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1 32 3 ...
..$ V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA NA 30 1 ...
..$ V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA NA 30 NA ...
..$ V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1 NA NA 29 NA ...
..$ V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
..$ V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
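To see that point in isolation, here is a minimal sketch on a made-up
data frame (scraped, rebuilt, and the values in it are hypothetical
stand-ins, not the real table). It only illustrates that
stringsAsFactors=FALSE leaves an existing factor column alone:

```r
# Toy stand-in for the scraped table: the column is already a factor.
scraped <- data.frame(V2 = factor(c("7.2", "9.1", "-")))

# stringsAsFactors = FALSE only governs character vectors being added;
# it does not convert a column that is already a factor.
rebuilt <- data.frame(scraped, stringsAsFactors = FALSE)

is.factor(rebuilt$V2)
```

So the conversion has to be done explicitly on the columns themselves.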
So your idea that the "numbers" you see are really character
representations and not actually numbers is right. And you are almost
there with the as.numeric(as.character()) construct. That would work
for a single factor, but doesn't work for a data.frame.
> test[16,c(2:5)]
NULL.V2 NULL.V3 NULL.V4 NULL.V5
16 7.2 9.1 7.7 15.2
> as.character(test[16,c(2:5)])
[1] "25" "27" "26" "14"
You get a string representation of each factor's underlying integer
codes, not the level labels. If you do this column-by-column, it does
work. Since data.frames are special types of lists, you can use lapply:
> test[16,c(2:5)]
NULL.V2 NULL.V3 NULL.V4 NULL.V5
16 7.2 9.1 7.7 15.2
> lapply(test[16,c(2:5)], as.character)
$NULL.V2
[1] "7.2"
$NULL.V3
[1] "9.1"
$NULL.V4
[1] "7.7"
$NULL.V5
[1] "15.2"
> as.numeric(lapply(test[16,c(2:5)], as.character))
[1] 7.2 9.1 7.7 15.2
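The same distinction between codes and labels can be reproduced without
the web page. This is only a sketch on invented values (f, df, one_row
and fixed are names I made up for illustration):

```r
# A factor whose labels look like numbers (hypothetical values).
f <- factor(c("7.2", "15.2", "9.1"))

levels(f)                      # the labels, sorted as strings
as.integer(f)                  # the internal integer codes
as.numeric(as.character(f))    # the labels converted to numbers

# A small data frame of factors, and one row taken from it.
df <- data.frame(a = factor(c("7.2", "15.2", "9.1")),
                 b = factor(c("1.0", "2.5", "0.3")))
one_row <- df[2, ]

# As in the thread, this yields strings of the codes, not the labels.
as.character(one_row)

# Converting column by column recovers the displayed values.
fixed <- sapply(one_row, function(x) as.numeric(as.character(x)))
fixed
```

Using sapply here instead of as.numeric(lapply(...)) is just a
convenience; it returns a named numeric vector directly.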
That said, I'd extract the responses part of the data out, clean it all,
and then do whatever you planned with it:
responses <- test[11:42,1:5]
responses[,1] <- factor(responses[,1])
responses[,2:5] <- lapply(responses[,2:5], function(x)
{as.numeric(as.character(x))})
names(responses) <- c("Response", "Q1", "Q2", "Q3", "Q4")
> str(responses)
'data.frame': 32 obs. of 5 variables:
$ Response: Factor w/ 32 levels "Afghanistan/Military",..: 5 6 4 8 9 10 11 12 14 15 ...
$ Q1 : num 2.4 2.1 NA 5.6 2.3 7.2 1 1.8 28.4 0.6 ...
$ Q2 : num 3.3 1.6 NA 5.6 1.8 9.1 0.4 2.4 19.4 2.1 ...
$ Q3 : num 3.4 1.3 0.3 5.3 2.6 7.7 0.3 1.3 21 1.7 ...
$ Q4 : num 2.7 1.5 0.6 5.1 1.3 15.2 0.2 0.7 16.7 2 ...
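If you want the cleaning step to be reusable, something along these
lines might do. It is a sketch only: factor_to_numeric is a hypothetical
helper name, the toy values are made up, and I am assuming that entries
such as "" and "-" are placeholders that should become NA:

```r
# Hypothetical helper: convert a factor of numeric-looking labels to numbers.
# Assumption: "" and "-" are placeholders and should be treated as missing.
factor_to_numeric <- function(x) {
  labels <- as.character(x)
  labels[labels %in% c("", "-")] <- NA   # mark placeholders as missing
  as.numeric(labels)
}

# Toy data standing in for a slice of the scraped table (made-up values).
toy <- data.frame(
  V1 = factor(c("Economy", "Health Care", "Environment")),
  V2 = factor(c("7.2", "-", "28.4")),
  V3 = factor(c("9.1", "0.4", ""))
)

# Convert the value columns one at a time, then give them readable names.
toy[, 2:3] <- lapply(toy[, 2:3], factor_to_numeric)
names(toy) <- c("Response", "Q1", "Q2")
str(toy)
```

Marking the placeholders as NA before calling as.numeric avoids the
"NAs introduced by coercion" warning for those entries; any other
unparseable label would still trigger it, which is probably what you
want.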
*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada
N3T 2C9
Cell: +1 519 761 7606
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University