On Dec 7, 2009, at 12:37 PM, Marshall Feldman wrote:
I totally agree with Barry, although it's sometimes convenient to
include data with analysis code for debugging and/or documentation
purposes.
However, the example actually applies equally to separate data files. In
fact, the example is from the U.S. Bureau of Labor Statistics at
ftp://ftp.bls.gov/pub/time.series/sm/, which contains nothing but data
and documentation files. At issue is not where the data come from, but
rather how to parse relatively complex data organized inconsistently.
SAS has the built-in ability to parse five different organizations of
data: list (delimited), modified list, column, formatted, and mixed (see
http://www.masil.org/sas/input.html). It seems R can parse such data,
but only with considerable work by the user. It would be great to have a
function/package that implements something as easy (hah!) and
flexible as SAS.
Marsh
Barry Rowlingson wrote:
On Mon, Dec 7, 2009 at 3:53 PM, Marshall Feldman <ma...@uri.edu>
wrote:
Regarding the various methods people have suggested, what if a typical
tab-delimited data line looks like:
SMS11000000000000001 1990 M01 688.0
and the SAS INPUT statement is
INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10 supersector $ 11-12
@13 industry $8. datatype $ 21-22 year period $ value footnote $ ;
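For comparison, here is a rough sketch of how the same "mixed" layout could be handled in base R in two passes: split on tabs first, then cut the first field at the column positions given in the INPUT statement. The object names (line, raw, parsed) and the assumption that the tab-separated fields are series_id, year, period, value and an optional footnote are mine, not something stated in the thread.

```r
# Sample line from the post, with the tabs written out explicitly.
# Assumed field order: series_id, year, period, value, (optional) footnote.
line <- "SMS11000000000000001\t1990\tM01\t688.0"

# Pass 1, the "list" part: split on tabs and keep everything as text for now.
# fill = TRUE (the read.delim default) pads the missing footnote field.
raw <- read.delim(text = line, header = FALSE, colClasses = "character",
                  col.names = c("series_id", "year", "period",
                                "value", "footnote"))

# Pass 2, the "column" part: positions copied from the INPUT statement.
# If the real series_id layout differs, only these two vectors need to change.
starts <- c(survey = 1, seasonal = 3, state = 4, area = 6,
            supersector = 11, industry = 13, datatype = 21)
ends   <- c(2, 3, 5, 10, 12, 20, 22)
parts  <- mapply(function(s, e) substring(raw$series_id, s, e),
                 starts, ends, SIMPLIFY = FALSE)

# Combine the pieces; character columns become factors, numbers become numeric.
parsed <- data.frame(parts,
                     year     = as.integer(raw$year),
                     period   = raw$period,
                     value    = as.numeric(raw$value),
                     footnote = raw$footnote,
                     stringsAsFactors = TRUE)
str(parsed)
```

This is more typing than the SAS statement, but the positions live in two small vectors, so they are easy to check against the documentation files.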
I was thinking of passing a FWF "chopped" input to scan to handle the
tabs but discovered that read.fwf will parse trailing tab-separated
fields.
First a bit of experimentation:
> testdat <- "45678\t567\t45\t6"
> read.fwf(textConnection(testdat), c(5,100))
V1 V2 V3 V4 V5
1 45678 NA 567 45 6
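As far as I can tell, the reason this works is that read.fwf cuts each line at the given widths, pastes the pieces back together with its sep argument (a tab by default), and hands the result to read.table, so any tabs already sitting inside the last piece simply act as more separators. The following sketch rebuilds that intermediate line by hand and then shows the effect of choosing a different sep. The temporary file and the names rebuilt, with_tab and with_bar are only for illustration.

```r
testdat <- "45678\t567\t45\t6"
widths  <- c(5, 100)

# Rebuild by hand what read.fwf appears to pass on to read.table:
# cut at the widths, then glue the pieces together with a tab.
ends    <- cumsum(widths)
starts  <- ends - widths + 1
rebuilt <- paste(substring(testdat, starts, ends), collapse = "\t")
n_fields <- length(strsplit(rebuilt, "\t", fixed = TRUE)[[1]])

# The same input through read.fwf, once with the default sep = "\t" and
# once with a separator that does not occur in the data.
tf <- tempfile(fileext = ".txt")
writeLines(testdat, tf)
with_tab <- read.fwf(tf, widths = widths)             # tabs split the 2nd chunk
with_bar <- read.fwf(tf, widths = widths, sep = "|")  # 2nd chunk stays whole
unlink(tf)

n_fields
ncol(with_tab)
ncol(with_bar)
```

The empty second column (the NA in V2 above) comes from the tab that begins the second chunk landing right after the separator that read.fwf itself inserts.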
Then the test on your data source:
> testin <- read.fwf(url("ftp://ftp.bls.gov/pub/time.series/sm/sm.data.1.Alabama",
+ open="r"), c(2,1,2,5,2,8,2,100), header=FALSE, n=100, skip=1)
# The header has to be thrown away (skip=1), since its fields no longer match
# after parsing what you said were the divisions within the "series_id" field.
> str(testin)
'data.frame': 100 obs. of 12 variables:
$ V1 : Factor w/ 1 level "SM": 1 1 1 1 1 1 1 1 1 1 ...
$ V2 : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
$ V3 : int 1 1 1 1 1 1 1 1 1 1 ...
$ V4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V6 : int 1 1 1 1 1 1 1 1 1 1 ...
$ V7 : logi NA NA NA NA NA NA ...
$ V8 : logi NA NA NA NA NA NA ...
$ V9 : int 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
$ V10: Factor w/ 12 levels "M01","M02","M03",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V11: num 1625 1625 1624 1635 1639 ...
$ V12: logi NA NA NA NA NA NA ...
Note that the leading 7 fwf fields were parsed, followed by the
trailing tab-separated fields, and the floating point field is also
complete:
> testin$V11
[1] 1625.0 1625.1 1623.7 1634.8 1639.1 1643.5 1641.0 1639.4 1641.2 1636.9 1639.8 1639.3 1636.2
[14] 1633.8 1637.0 1635.5 1638.3 1639.5 1643.5 1645.3 1647.0 1648.2 1649.3 1650.5 1657.9 1660.4
snipped
--
David
Note that most data lines have no footnote item, as in the sample.
Here (I think) we'd want all the character variables to be read as factors,
possibly "year" as a date, and "value" as numeric.
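A small sketch of those conversions, once the fields are in a data frame. The three rows reuse the year, period and value entries from the Alabama output shown earlier in this thread; the data frame name obs is made up, and I am assuming that period codes of the form M01 to M12 stand for calendar months, so that year and period together can be turned into a Date pinned to the first of the month.

```r
# Three observations typed in as text, the way they arrive from the file.
obs <- data.frame(year   = c(1990L, 1990L, 1990L),
                  period = c("M01", "M02", "M03"),
                  value  = c("1625.0", "1625.1", "1623.7"),
                  stringsAsFactors = FALSE)

obs$period <- factor(obs$period)      # character variable -> factor
obs$value  <- as.numeric(obs$value)   # text -> numeric

# Assumption: "M01".."M12" are months; build a Date for the 1st of each month.
month    <- sub("^M", "", obs$period)
obs$date <- as.Date(sprintf("%d-%s-01", obs$year, month))

str(obs)
```

Whether a real date is more useful than keeping year and period separate probably depends on the analysis; the conversion itself is a one-liner either way.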
Actually I'm surprised that nobody has yet said what a clearly
bonkers thing it is to mix up your data and your analysis code in a
single file. Now suppose you have another set of data you want to
analyse with the same code? Are you going to create a new file and
paste the new data in? You've now got two copies of your analysis code -
good luck keeping corrections to that code synchronised.
This just seems like horrendously bad practice, which is one reason
it's kludgy in R. If it was good practice, someone would surely have
written a way to do it neatly.
Keep your data in data files, and your functions in .R function
files. You'll thank me later.
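To make the point concrete, here is a minimal sketch of that layout: one function that would normally live in its own .R file and be pulled in with source(), applied unchanged to two separate data files. The function name mean_value, the temporary files standing in for real data files, and the three-column layout are all invented for the example; the numbers are borrowed from the Alabama output earlier in the thread.

```r
# Would normally sit in its own file (say, a functions file loaded with source()).
# Hypothetical helper: read one tab-delimited data file and summarise it.
mean_value <- function(path) {
  dat <- read.delim(path, header = TRUE, stringsAsFactors = TRUE)
  mean(dat$value)
}

# Two stand-in data files; in practice these already exist on disk.
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("year\tperiod\tvalue",
             "1990\tM01\t1625.0",
             "1990\tM02\t1625.1"), f1)
writeLines(c("year\tperiod\tvalue",
             "1990\tM03\t1623.7",
             "1990\tM04\t1634.8"), f2)

# One copy of the analysis code, any number of data sets.
results <- sapply(c(first = f1, second = f2), mean_value)
unlink(c(f1, f2))
results
```

A correction to mean_value is made once and every data set picks it up on the next run.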
Barry
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT