On Dec 7, 2009, at 12:37 PM, Marshall Feldman wrote:

I totally agree with Barry, although it's sometimes convenient to
include data with analysis code for debugging and/or documentation purposes.

However, the example actually applies equally to separate data files. In
fact, the example is from the U.S. Bureau of Labor Statistics at
ftp://ftp.bls.gov/pub/time.series/sm/, which contains nothing but data
and documentation files. At issue is not where the data come from, but
rather how to parse relatively complex data organized inconsistently.
SAS has the built-in ability to parse five different organizations of
data: list (delimited), modified list, column, formatted, and mixed (see
http://www.masil.org/sas/input.html). It seems R can parse such data,
but only with considerable work by the user. It would be great to have a
function/package that implements something as easy (hah!) and
flexible as SAS.

   Marsh

Barry Rowlingson wrote:
On Mon, Dec 7, 2009 at 3:53 PM, Marshall Feldman <ma...@uri.edu> wrote:

Regarding the various methods people have suggested, what if a typical
tab-delimited data line looks like:

   SMS11000000000000001 1990 M01 688.0

and the SAS INPUT statement is

INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10 supersector $ 11-12 @13 industry $8. datatype $ 21-22 year period $ value footnote $ ;

I was thinking of passing a FWF "chopped" input to scan to handle the tabs but discovered that read.fwf will parse trailing tab-separated fields.

First a bit of experimentation:
> testdat <- "45678\t567\t45\t6"
> read.fwf(textConnection(testdat), c(5,100))
     V1 V2  V3 V4 V5
1 45678 NA 567 45  6
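
One reading of why this works, based on the read.fwf documentation rather than anything stated in the thread: read.fwf joins the chopped fields with an internal separator (its sep argument, a tab by default) before handing them to read.table, so tabs already present in the wide last field act as further delimiters. The empty field at the boundary shows up as an all-NA column, which can be dropped afterwards. A minimal sketch; the helper name drop_empty_cols is hypothetical:

```r
# Hypothetical helper (not from the thread): remove columns holding only NA.
drop_empty_cols <- function(df) {
  keep <- !vapply(df, function(col) all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Same test string as above: one 5-character field, then tab-separated fields.
testdat <- "45678\t567\t45\t6"
parsed  <- read.fwf(textConnection(testdat), widths = c(5, 100))

# Drop the all-NA column created at the fixed-width/tab boundary.
cleaned <- drop_empty_cols(parsed)
```

The remaining columns keep their original V names, so they would still need renaming.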

Then the test on your data source:

> testin <- read.fwf(url("ftp://ftp.bls.gov/pub/time.series/sm/sm.data.1.Alabama", open="r"), c(2,1,2,5,2,8,2,100), header=FALSE, n=100, skip=1)

# The header has to be skipped, since its fields no longer match once the "series_id" field is split into the divisions you described.

> str(testin)
'data.frame':   100 obs. of  12 variables:
 $ V1 : Factor w/ 1 level "SM": 1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
 $ V3 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V5 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V6 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V7 : logi  NA NA NA NA NA NA ...
 $ V8 : logi  NA NA NA NA NA NA ...
 $ V9 : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ V10: Factor w/ 12 levels "M01","M02","M03",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V11: num  1625 1625 1624 1635 1639 ...
 $ V12: logi  NA NA NA NA NA NA ...

Note that the leading 7 fixed-width fields were parsed, followed by the trailing tab-separated fields, and that the floating point field is also complete:

> testin$V11
 [1] 1625.0 1625.1 1623.7 1634.8 1639.1 1643.5 1641.0 1639.4 1641.2 1636.9 1639.8 1639.3 1636.2
[14] 1633.8 1637.0 1635.5 1638.3 1639.5 1643.5 1645.3 1647.0 1648.2 1649.3 1650.5 1657.9 1660.4
snipped
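
For anyone who wants to try the idea without the FTP connection, the same two-step split (tabs first, then fixed columns within the series ID) can be sketched on the single sample line quoted in this thread. This is an illustration only, not the approach used above: the tabs are written out explicitly on the assumption that the raw file is tab-delimited, and the column positions and names are taken from the SAS INPUT statement.

```r
# Sample line from the thread, with the tabs written explicitly (assumed).
line <- "SMS11000000000000001\t1990\tM01\t688.0"

# Step 1 (list-style input): split the line on tabs.
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]

# Step 2 (column-style input): cut the series ID at the SAS column positions.
# substring() returns "" for positions past the end of the string.
starts <- c(1, 3, 4, 6, 11, 13, 21)
ends   <- c(2, 3, 5, 10, 12, 20, 22)
id_parts <- substring(fields[1], starts, ends)
names(id_parts) <- c("survey", "seasonal", "state", "area",
                     "supersector", "industry", "datatype")

# Combine both steps into one record; the footnote is left out because
# the sample line has none.
record <- c(as.list(id_parts),
            list(year   = as.integer(fields[2]),
                 period = fields[3],
                 value  = as.numeric(fields[4])))
```

Both substring() and strsplit() are vectorised, so the same steps should carry over to a whole character vector from readLines(), with some extra bookkeeping to bind the pieces into a data frame.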
--
David

Note that most data lines have no footnote item, as in the sample.

Here (I think) we'd want all the character variables to be read as factors,
possibly "year" as a date, and "value" as numeric.
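
A rough sketch of that conversion, assuming the data have already been read into a data frame with the names from the INPUT statement. The toy data frame is made up for illustration (only the first row mirrors the sample line), and the date handling assumes monthly periods M01 to M12 only:

```r
# Toy data standing in for the parsed file; the second row is invented.
dat <- data.frame(
  survey = c("SM", "SM"),
  year   = c(1990L, 1990L),
  period = c("M01", "M02"),
  value  = c("688.0", "688.5"),
  stringsAsFactors = FALSE
)

# Convert every character column except "value" to a factor.
chr <- vapply(dat, is.character, logical(1))
chr["value"] <- FALSE
dat[chr] <- lapply(dat[chr], factor)

# Read "value" as numeric.
dat$value <- as.numeric(dat$value)

# Build a Date (first day of the month) from year and period.
month <- sub("^M", "", as.character(dat$period))
dat$date <- as.Date(sprintf("%d-%s-01", dat$year, month))
```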


Actually I'm surprised that nobody has yet said what a clearly
bonkers thing it is to mix up your data and your analysis code in a
single file. Now suppose you have another set of data you want to
analyse with the same code? Are you going to create a new file and
paste the new data in? You've now got two copies of your analysis code
- good luck keeping corrections to that code synchronised.

This just seems like horrendously bad practice, which is one reason
it's kludgy in R. If it was good practice, someone would surely have
written a way to do it neatly.

Keep your data in data files, and your functions in .R function
files. You'll thank me later.
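
A minimal sketch of that separation, with hypothetical names throughout (the function summarise_values and the toy file contents are made up): the function lives in its own .R file and takes the data file path as an argument, so a second data set needs no second copy of the code.

```r
# Would live in a functions file, e.g. analysis.R (hypothetical name),
# and be loaded with source("analysis.R"). It contains no data.
summarise_values <- function(path) {
  dat <- read.delim(path, header = TRUE, stringsAsFactors = TRUE)
  mean(dat$value)
}

# Stand-in for a separate data file: a throwaway tab-delimited file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("year\tperiod\tvalue",
             "1990\tM01\t688.0",
             "1990\tM02\t690.0"), tmp)

# The same function works for any file with the same layout.
result <- summarise_values(tmp)
```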

Barry




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

