Re: [R] SAS "datalines" or "cards" statement equivalent in R?

David Winsemius Mon, 07 Dec 2009 13:46:54 -0800


On Dec 7, 2009, at 12:37 PM, Marshall Feldman wrote:

I totally agree with Barry, although it's sometimes convenient to
include data with analysis code for debugging and/or documentation purposes.
However, the example actually applies equally to separate data files. In
fact, the example is from the U.S. Bureau of Labor Statistics at
ftp://ftp.bls.gov/pub/time.series/sm/, which contains nothing but data
and documentation files. At issue is not where the data come from, but
rather how to parse relatively complex data organized inconsistently.
SAS has built-in the ability to parse five different organizations of
data: list (delimited), modified list, column, formatted, and mixed (see
http://www.masil.org/sas/input.html). It seems R can parse such data,
but only with considerable work by the user. It would be great to have a
function/package that implements something as easy (hah!) and
flexible as SAS.

   Marsh

Barry Rowlingson wrote:
On Mon, Dec 7, 2009 at 3:53 PM, Marshall Feldman <ma...@uri.edu> wrote:
Regarding the various methods people have suggested, what if a typical
tab-delimited data line looks like:

   SMS11000000000000001 1990 M01 688.0

and the SAS INPUT statement is
INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10 supersector $ 11-12 @13 industry $8. datatype $ 21-22 year period $ value footnote $ ;

I was thinking of passing a FWF "chopped" input to scan to handle the tabs but discovered that read.fwf will parse trailing tab-separated fields.
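For comparison, the two-step route (split on the tabs first, then chop the fixed-width identifier) might look like the sketch below. It assumes the sample line really is tab-delimited and takes the column positions from the INPUT statement quoted above; the names line, fields, starts, ends and id_parts are only illustrative.

```r
# Sample record from the post, with tabs ASSUMED between the four fields
line <- "SMS11000000000000001\t1990\tM01\t688.0"

# Step 1: split the record on tabs
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]

# Step 2: cut the first field at fixed column positions taken from the
# INPUT statement (survey 1-2, seasonal 3, state 4-5, area 6-10,
# supersector 11-12)
starts <- c(1, 3, 4, 6, 11)
ends   <- c(2, 3, 5, 10, 12)
id_parts <- substring(fields[1], starts, ends)
names(id_parts) <- c("survey", "seasonal", "state", "area", "supersector")

# The last tab-separated field is the numeric value
value <- as.numeric(fields[4])
```

This keeps the tab handling and the column chopping separate, at the cost of a few more lines than a single read.fwf call.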


First a bit of experimentation:
> testdat <- "45678\t567\t45\t6"
> read.fwf(textConnection(testdat), c(5,100))
     V1 V2  V3 V4 V5
1 45678 NA 567 45  6
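
The NA in V2 appears to be a side effect of how read.fwf works: it cuts each line at the given widths, writes the pieces to a temporary file joined by sep (a tab by default), and re-reads that file with read.table. The tab that begins the second piece therefore produces an empty field. The documentation asks that sep be a character that does not occur in the data, so this behaviour is probably best treated as a convenient accident. A minimal sketch that reproduces the result and drops the all-NA column (res is just an illustrative name):

```r
# Same toy record as above: a 5-character fixed-width field followed by
# tab-separated fields
testdat <- "45678\t567\t45\t6"

con <- textConnection(testdat)
res <- read.fwf(con, widths = c(5, 100))

# Keep only the columns that contain at least one non-missing value
res <- res[, colSums(!is.na(res)) > 0, drop = FALSE]
```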

Then the test on your data source:

> testin <- read.fwf(url("ftp://ftp.bls.gov/pub/time.series/sm/sm.data.1.Alabama", open="r"), c(2,1,2,5,2,8,2,100 ), header=F, n=100, skip=1)

# Need to throw away the header, since its fields no longer match after parsing what you communicated were the divisions within the "series_id" field.


> str(testin)
'data.frame':   100 obs. of  12 variables:
 $ V1 : Factor w/ 1 level "SM": 1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
 $ V3 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V5 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V6 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V7 : logi  NA NA NA NA NA NA ...
 $ V8 : logi  NA NA NA NA NA NA ...
 $ V9 : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ V10: Factor w/ 12 levels "M01","M02","M03",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V11: num  1625 1625 1624 1635 1639 ...
 $ V12: logi  NA NA NA NA NA NA ...

Note that the leading 7 fwf fields were parsed, followed by the trailing tab-separated fields, and the floating point field is also complete:


> testin$V11
 [1] 1625.0 1625.1 1623.7 1634.8 1639.1 1643.5 1641.0 1639.4 1641.2 1636.9 1639.8 1639.3 1636.2
[14] 1633.8 1637.0 1635.5 1638.3 1639.5 1643.5 1645.3 1647.0 1648.2 1649.3 1650.5 1657.9 1660.4

snipped
--
David

Note that most data lines have no footnote item, as in the sample.
Here (I think) we'd want all the character variables to be read as factors,
possibly "year" as a date, and "value" as numeric.
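
A minimal sketch of that kind of conversion, applied after the fields have been read as plain text. The data frame obs and its column names are hypothetical; the three values are the first three shown in the testin$V11 output; and it assumes only monthly period codes M01 to M12, so any other code would need separate handling.

```r
# Hypothetical parsed records, everything except year still character
obs <- data.frame(
  year     = c(1990L, 1990L, 1990L),
  period   = c("M01", "M02", "M03"),
  value    = c("1625.0", "1625.1", "1623.7"),
  footnote = c(NA_character_, NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Build a Date for the first day of each month from year and period
month <- sub("^M", "", obs$period)
obs$date <- as.Date(paste(obs$year, month, "01", sep = "-"),
                    format = "%Y-%m-%d")

# Character variables become factors, the value becomes numeric
obs$period   <- factor(obs$period)
obs$footnote <- factor(obs$footnote)
obs$value    <- as.numeric(obs$value)
```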


Actually I'm surprised that nobody has yet said what a clearly
bonkers thing it is to mix up your data and your analysis code in a
single file. Now suppose you have another set of data you want to
analyse with the same code? Are you going to create a new file and
paste the new data in? You've now got two copies of your analysis code
- good luck keeping corrections to that code synchronised.

This just seems like horrendously bad practice, which is one reason
it's kludgy in R. If it was good practice, someone would surely have
written a way to do it neatly.

Keep your data in data files, and your functions in .R function
files. You'll thank me later.

Barry




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
Heritage Laboratories
West Hartford, CT

