Re: [R] Documenting data

G . Maubach Thu, 30 Jun 2016 10:12:22 -0700

Hi Pito,
Dear Readers,

as other have already mentioned, there are good practices for documenting code 
and data. I would like to summarize them and add a few not mentioned earlier:


1. You should have always two things: your raw data and your R script/s. The 
raw data is immutable whereas the R script/s produce the results.

2. You might want to distinguish between documentating your CODE and 
documenting your DATA. Documenting code is similar to what you already know 
from your programmng experiences. Documenting data is somewhat different cause 
you store information about the meaning of you data directly in your data.

Example
You have a variable with codes ranging from 1 to 5. But what do they mean? 
Perhaps it could be

1 = Strongly agree
2 = Agree
3 = Neither agree/nor disagree
4 = Disagree
5 = Strongly Disagree

But it could also be the other way round:

1 = Strongly Disagree
2 = Disagree
3 = Nether agree/nor disagree
4 = Agree
5 = Strongly Agree

What the codes in your variable means depends on the systems oder processes you 
derived your data from.

Within R there are some limitations for storing the informtation about what a 
variable or a value within a variable means. Possibilities to store this 
information is in other software packages like SAS or SPSS much broader 
implemented. In R you can work with meaningful variable names and the data 
type/class factor which can store mappings between values and value 
descriptions.

Example
-- cut --
var1 <- c(rep(1:5, 3))
ds_example <- data.frame(var1)

var1_labels <- c("1 = Strongly Agree",
                "2 = Agree",
                "3 = Neither agree/nor disagree",
                "4 = Disagree",
                "5 = Strongly disagree")

ds_example[["var1"]] <- factor(ds_example[["var1"]],
                               levels = c(1, 2, 3, 4, 5),
                               labels = var1_labels)

summary(ds_example["var1"])
-- cut --

In addition you find methods to work with variable labels and value labels in 
the pacakges Hmisc and memisc. They can also produce a thing called codebook 
which contains all variable names, variable labels, values, value labels and 
summaries of the distribution of values within the variables.

3. In addition to this you could structure your script in a modular way 
according to the analysis process, e. g. 
importing, cleaning, preparation for analysis, analysis, reporting. Other 
structure may be more sufficient in your case. These modules could have a 
number in the file name indicating in which sequence the scripts should be run.

4. I find it valuable to use a software repository like Github, Sourceforge or 
others to keep the revisions save and seucre in case you would like to go back 
to a version with code you deleted before and figure out that you need it now 
again. The R Studio IDE has an interface to git if you like to go with that. 
Good commit message can help you track what has changed. Commits also help you 
to prepare precise steps when developing your scripts.

5. I have no experience with Sweave or knitr but you could also compile a 
simple documentation through copying comments to an Excel sheet using R-2-Excel 
libraries like excel.link or others.

Example
install.packages("excel.link")
library(excel.link)
xlc["A1"] <- "Project Documentation"
xlc["A2"] <- "Step XY"
xlc["A3"] <- "Some explanation about step xy"

This way you have the documentation in your code and in an external source.

Which approach you chose depends on your experience with R and its libraries as 
well as the size of your project and the need for documentation.

6. It can be helpful to store interim results in a format that can be read by 
non-R-users, e. g. Excel.

7. Documenting code can be done using roxygen2.

If there are different opinions to my suggestions please say so.

Kind regards

Georg


> Gesendet: Donnerstag, 30. Juni 2016 um 16:51 Uhr
> Von: "Pito Salas" <pitosa...@brandeis.edu>
> An: r-help@r-project.org
> Betreff: [R] Documenting data
>
> I am studying statistics and using R in doing it. I come from software 
> development where we document everything we do.
> 
> As I “massage” my data, adding columns to a frame, computing on other data, 
> perhaps cleaning, I feel the need to document in detail what the meaning, or 
> background, or calculations, or whatever of the data is. After all it is now 
> derived from my raw data (which may have been well documented) but it is 
> “new.” 
> 
> Is this a real problem? Is there a “best practice” to address this?
> 
> Thanks!
> 
> Pito Salas
> Brandeis Computer Science
> Feldberg 131
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Documenting data

Reply via email to