I also use the approach Philipp describes below. I use Python and shell
scripts for processing thousands of input files and getting all the data
into one tidy csv table. From that point onwards it's R all the way
(often with the reshape package).
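
To illustrate the consolidation step, here is a minimal Python sketch. It is not my actual script: the input layout (one whitespace-separated "name value" pair per line, "#" for comments), the "*.txt" pattern and the function name collect_to_csv are all assumptions made up for this example.

```python
import csv
from pathlib import Path


def collect_to_csv(input_dir, output_path, pattern="*.txt"):
    """Merge many small 'name value' files into one long-format CSV.

    The input layout is an assumption for this sketch, not a real format.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["source_file", "variable", "value"])
        # Sort so the row order is reproducible across runs.
        for path in sorted(Path(input_dir).glob(pattern)):
            with open(path) as handle:
                for line in handle:
                    line = line.strip()
                    if not line or line.startswith("#"):
                        continue  # skip blank lines and comments
                    parts = line.split(None, 1)
                    if len(parts) != 2:
                        raise ValueError(f"{path.name}: cannot parse {line!r}")
                    writer.writerow([path.name, parts[0], parts[1]])
                    rows_written += 1
    return rows_written
```

Keeping the source file name as its own column gives a long-format table, which should be straightforward to read with read.csv() and reshape on the R side.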
Paul
Philipp Pagel wrote:
On Wed, May 06, 2009 at 12:22:45AM -0400, Farrel Buchinsky wrote:
Is R an appropriate tool for data manipulation and data reshaping and data
organizing? I think so but someone who recently joined our group thinks not.
The new recruit believes that python or another language is a far better
tool for developing data manipulation scripts that can be then used by
several members of our research group.
I happily use both approaches depending on the original format the
data come in:
For data that are not in a "well behaved" format and require actual
parsing, I tend to use Python scripts for transmogrifying the data
into nice and tidy tables (and maybe some very basic filtering). For
everything after that I prefer R. I also use Python if the relevant
data need to be harvested and assembled from many different sources
(e.g. data files + web + databases).
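
As a rough sketch of what such a parsing script can look like, assume a hypothetical input in which each record is a block of "key: value" lines and records are separated by blank lines. The format and the names parse_records and to_tsv are invented for illustration only.

```python
import csv
import io
import re

# Hypothetical record format: "key: value" lines, blank line between records.
FIELD_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*:\s*(.*?)\s*$")


def parse_records(text):
    """Return one dict per blank-line-separated record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        current[match.group(1)] = match.group(2)
    if current:
        records.append(current)
    return records


def to_tsv(records, columns):
    """Render records as a tab-separated table; absent fields become 'NA'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            restval="NA", extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Writing absent fields as NA means read.table should treat them as missing values with its default settings.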
Once the data files are easy to read (csv, tab separated, database,
...) and the task is to reshape, filter and clean the data, I usually
do it in R. R has true advantages here:
- After reading a table into a data frame I can immediately tell whether all
measurements are what they are supposed to be (integer, numeric,
factor, boolean), and functions like read.table even do quite a bit of
error checking for me (equal number of columns etc.)
- Finding out if factors have the right (or plausible) number of levels is easy
- Filtering by logical indexing
- Powerful and reliable reshaping (reshape package)
- Very convenient diagnostics: str(), dim(), table(), summary(),
plotting the data in various ways, ...
cu
Philipp