Re: [R] reading data from web data sources

David Winsemius Sat, 27 Feb 2010 17:22:35 -0800


On Feb 27, 2010, at 6:17 PM, Phil Spector wrote:

Tim -
I don't understand what you mean about interleaving rows. I'm guessing that you want a single large data frame with all the data, and not a list with each year separately. If that's the case:
x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
               header=FALSE,fill=TRUE,skip=13)
cts = apply(x,1,function(x)sum(is.na(x)))
wh = which(cts == 12)
start = wh+1
end = c(wh[-1] - 1,nrow(x))
ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
names(ans) = x[wh,1]
alldat = do.call(rbind,ans)
alldat$year = rep(names(ans),sapply(ans,nrow))
names(alldat) = c('day',month.name,'year')
On the other hand, if you want a long data frame with month, day, year and value:
longdat = reshape(alldat,idvar=c('day','year'),
varying=list(month.name),direction='long',times=month.name)
names(longdat)[c(3,4)] = c('Month','value')

Next, if you want to create a Date variable:
longdat = transform(longdat,date=as.Date(paste(Month,day,year),'%B %d %Y'))
longdat = na.omit(longdat)
longdat = longdat[order(longdat$date),]

and finally:

library(zoo)  # provides zoo()
zoodat = zoo(longdat$value,longdat$date)

which should be suitable for time series analysis.
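
As a rough illustration of the reshape-then-date step described above, here is a minimal sketch on a made-up three-row table. The values, the two-month layout, and the names wide and long are assumptions for the example, not taken from the real file. It builds each date from a month number rather than a month name, which avoids depending on how %B behaves in the current locale.

```r
# Hypothetical wide table: one row per (year, day), one column per month.
# Only two months are used to keep the example short.
wide <- data.frame(
  day      = c(28, 29, 30),
  January  = c(6.4, 6.5, 6.3),
  February = c(5.1, NA, NA),   # 1910 was not a leap year
  year     = 1910
)

# Stack the month columns into a single 'value' column.
long <- reshape(wide,
                idvar     = c("day", "year"),
                varying   = list(c("January", "February")),
                v.names   = "value",
                timevar   = "Month",
                times     = c("January", "February"),
                direction = "long")

# Build ISO dates; impossible dates such as 30 February become NA.
month_number <- match(long$Month, month.name)
long$date <- as.Date(sprintf("%d-%02d-%02d", long$year, month_number, long$day),
                     format = "%Y-%m-%d")

# Drop missing readings and impossible dates, then sort chronologically.
long <- long[!is.na(long$value) & !is.na(long$date), ]
long <- long[order(long$date), ]
```

The resulting value and date columns could then be passed to zoo() as in the message above.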


OK, I think I get it:

(From Gabor's DF)

> dta <- data.matrix(DF[, -c(1,14)])
> dtafrm <-data.frame(rdta=dta[!is.na(dta)],
                      d.o.m= DF[row(dta)[!is.na(dta)], 1],
                      month= col(dta)[!is.na(dta)],
                      year=DF[row(dta)[!is.na(dta)], 14])

> library(zoo)

> zoodat2 <- with(dtafrm, zoo(rdta, as.Date(paste(month, d.o.m,year), "%m %d %Y")))

> str(zoodat2)
‘zoo’ series from 1910-01-01 to 1919-12-31
  Data: num [1:3652] 6.4 6.5 6.3 6.7 6.7 6.8 7 7.1 7.1 7.2 ...

Index: Class 'Date' num [1:3652] -21915 -21914 -21913 -21912 -21911 ...
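
To make the row()/col() indexing above easier to follow, here is a small sketch on an invented two-day, two-month data frame. The column names and values are assumptions; the real DF has twelve month columns. Because a matrix is stored column by column, m[keep] walks down January first and then February, and row(m) and col(m) report which day row and which month column each kept value came from.

```r
# Hypothetical stand-in for DF: day, two month columns, year.
DF <- data.frame(day = c(1, 2), Jan = c(6.4, 6.5), Feb = c(5.1, NA),
                 year = c(1910, 1910))

m    <- data.matrix(DF[, c("Jan", "Feb")])
keep <- !is.na(m)                       # positions holding a real reading

flat <- data.frame(value = m[keep],
                   day   = DF$day[row(m)[keep]],   # row index -> day of month
                   month = col(m)[keep],           # column index -> month number
                   year  = DF$year[row(m)[keep]])

flat$date <- as.Date(sprintf("%d-%02d-%02d", flat$year, flat$month, flat$day),
                     format = "%Y-%m-%d")
```

With more than one year the flattened rows come out month by month rather than in date order, so they need sorting by date (zoo() does this on its index).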

Hope this helps.
                                                   - Phil

On Sat, 27 Feb 2010, Tim Coote wrote:
Thanks, Gabor. My takeaway from this and Phil's post is that I'm going to have to construct some code to do the parsing, rather than use a standard function. I'm afraid that neither approach works, yet:
Gabor's has an off-by-one error (days start on the 2nd, not the first), and the years get messed up around the 29th day. I think that na.omit(DF) line is throwing out the baby with the bathwater. It's interesting that this approach is based on read.table, I'd assumed that I'd need read.ftable, which I couldn't understand the documentation for. What is it that's removing the -999 and -888 values in this code? They seem to be gone, but I cannot see why.
Phil's reads in the data, but interleaves rows with just a year and all other values as NA.
Tim
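
One likely explanation for the vanishing -999 and -888 values is the line filter itself: the pattern "[^ 0-9.]" matches any character that is not a space, digit or dot, and that includes the minus sign, so every line containing a negative value is dropped whole. The sketch below shows this on invented lines (the sample text and the names raw, strict and relaxed are assumptions). It also shows one possible alternative: let the minus sign through and turn the special values into NA afterwards.

```r
# Made-up lines imitating the layout: a year, two day rows, a header row.
raw <- c("1910",
         "  28   6.4   5.1",
         "  29   6.5  -999",
         "  Day  Jan  Feb")

strict  <- !grepl("[^ 0-9.]",  raw)   # filter from the thread: '-' rejects the line
relaxed <- !grepl("[^ 0-9.-]", raw)   # variant that also tolerates a minus sign

# Parse the lines kept by the relaxed filter, then blank out the special values.
DF <- read.table(text = raw[relaxed], fill = TRUE)
DF[] <- lapply(DF, function(col) replace(col, col %in% c(-999, -888), NA))
```

Here strict keeps only the first two lines, so the day-29 row disappears, while relaxed keeps it and the -999 becomes NA.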
On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
Mark Leeds pointed out to me that the code wrapped around in the post
so it may not be obvious that the regular expression in the grep is
(i.e. it contains a space):
"[^ 0-9.]"
On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:
Try this. First we read the raw lines into R using grep to remove any lines containing a character that is not a number or space. Then we look for the year lines and repeat them down V1 using cumsum. Finally we omit the year lines.
myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
raw.lines <- readLines(myURL)
DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]), fill = TRUE)
DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
DF <- na.omit(DF)
head(DF)
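
A small sketch of the cumsum idea on invented data may help show what "repeat the year down the rows" means. The data frame below and the names is_year and year are assumptions. Note one difference from the posted line: there, the cumsum result indexes the rows of DF directly (rows 1, 2, ...), which appears to pick up the right year only for the first block. The variant here indexes into the vector of year values instead, and keeps the day in V1 by storing the year in its own column.

```r
# Hypothetical parsed lines: a year row has NA in every reading column.
DF <- data.frame(V1 = c(1910, 1, 2, 1911, 1, 2),
                 V2 = c(NA, 6.4, 6.5, NA, 7.0, 7.1))

is_year <- is.na(DF$V2)                  # TRUE on the rows holding only a year
years   <- DF$V1[is_year]                # c(1910, 1911)

# cumsum(is_year) counts year rows seen so far: 1 1 1 2 2 2,
# so it selects the year that applies to each row.
DF$year <- years[cumsum(is_year)]

DF <- DF[!is_year, ]                     # drop the year-only rows
```

This assumes the first parsed row is a year row; otherwise cumsum starts at 0 and the lengths no longer match.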
On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project....@coote.org> wrote:
Hullo
I'm trying to read some time series data of meteorological records that are
available on the web (eg
http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd like to be able to read in the digital data directly into R. However, I cannot work out the right function and set of parameters to use. It could be that the only practical route is to write a parser, possibly in some other language, reformat the files and then read these into R. As far as I
can tell, the informal grammar of the file is:
<comments terminated by a blank line>
[<year number on a line on its own>
<daily readings lines> ]+
and the <daily readings> are of the form:
<whitespace> <day number> [<whitespace> <reading on day of month>] 12
Readings for days in months where a day does not exist have special values.
Missing values have a different special value.
And then I've got the problem of iterating over all relevantfiles to get a
whole timeseries.
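
For the iteration over files, one possible shape is a small per-file reader applied with lapply() and combined with rbind(). This is only a sketch: read_decade and paths are hypothetical names, the temporary files stand in for the real data, and the addresses of the other decades are not given here, so none are guessed. With the real data each element of paths would be one file's URL and skip would be set to the number of comment lines.

```r
# Hypothetical helper: parse one file laid out as "year line, then day rows".
read_decade <- function(path, skip = 0) {
  x <- read.table(path, header = FALSE, fill = TRUE, skip = skip)
  # A year row has no readings after the first column.
  is_year <- rowSums(!is.na(x[, -1, drop = FALSE])) == 0
  # Assumes the first row read is a year row.
  x$year <- x[[1]][is_year][cumsum(is_year)]
  x[!is_year, ]
}

# Stand-in files with made-up contents, two months wide for brevity.
paths <- c(tempfile(), tempfile())
writeLines(c("1910", " 1 6.4 5.1", " 2 6.5 5.2"), paths[1])
writeLines(c("1920", " 1 7.0 5.5", " 2 7.1 5.6"), paths[2])

# Read every file and stack the results into one data frame.
all_years <- do.call(rbind, lapply(paths, read_decade))
```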
Is there a way to read in this type of file into R? I've read all of the examples that I can find, but cannot work out how to do it. I don't think that read.table can handle the separate sections of data representing each year. read.ftable maybe can be coerced to parse the data, but I cannot see how after reading the documentation and experimenting with the parameters.
I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
Any help/suggestions would be greatly appreciated. I can see that this type of issue is likely to grow in importance, and I'd also like to give the data owners suggestions on how to reformat their data so that it is easier to
consume by machines, while being easy to read for humans.
The early records are a serious machine parsing challenge as they are tiff
images of old notebooks ;-)
tia
Tim
Tim Coote
t...@coote.org
vincit veritas
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



David Winsemius, MD
Heritage Laboratories
West Hartford, CT

