Re: [R] script to data clear

Jeff Newmiller Tue, 12 Aug 2014 08:10:57 -0700

Without a representative sample of data, it is very hard to understand your 
question or to be specific about suggestions. See [1] for some ideas about how 
to communicate questions online.

Note that "clearing" data would usually mean deleting it, as in rm(data). From 
context I assume you mean "cleaning", where invalid characters need to be 
removed.

Also assuming that you have a data frame with some columns that are categorical 
data:

1) If the values are contaminated or incomplete (don't have rows representing 
every possible category) then it is almost always better to delay converting to 
factor until after data are cleaned. The read.table family of functions includes 
a "stringsAsFactors=FALSE" option that will prevent automatic conversion of 
columns with unknown types into factors. This is also useful for contaminated 
numeric columns. Only after the vector of character data is clean and as 
complete as it can be should you convert to factor.
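
As a minimal sketch of that order of operations (the column names and values 
below are made up for illustration, and trimws() assumes a reasonably recent 
version of R):

```r
# Hypothetical raw input: stray spaces, mixed case, a contaminated number.
raw <- "size,weight
Small ,1.5
 LARGE,2.0
small,n/a"

# Keep text columns as character instead of converting to factor on import.
dat <- read.csv(text = raw, stringsAsFactors = FALSE)

# Clean the character data first ...
dat$size <- tolower(trimws(dat$size))

# ... and handle the contaminated numeric column ("n/a" becomes NA).
dat$weight <- suppressWarnings(as.numeric(dat$weight))

# Only now convert to factor.
dat$size <- factor(dat$size)
levels(dat$size)
```

Had the column been converted on import, "Small " and "small" would have 
ended up as two different levels.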

Note that most data sets have a variety of column types, and even after 
resolving issues discussed here your function is not necessarily going to work 
with every input data file that you encounter. Specifically, not every column 
of data should be converted to factor. With this in mind, it can be helpful to 
look for ways to confirm that the data you are processing are what you expect 
them to be. Often this is implemented by confirming that specific columns have 
specific kinds of data in them. That is, using a loop may be TOO flexible... 
apply this cleaning loop cautiously.
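
One possible way to do such a check, sketched with a made-up data frame (the 
column names and the "categorical" vector are assumptions, not something from 
your file):

```r
# Hypothetical data frame; adjust names and types to your own file.
dat <- data.frame(
  id   = 1:3,
  size = c("small", "large", "small"),
  note = c("ok", "re-check", "ok"),
  stringsAsFactors = FALSE
)

# Columns that are supposed to be categorical (an assumption for this sketch).
categorical <- c("size")

# Fail early if the file does not look the way you expect.
stopifnot(
  all(categorical %in% names(dat)),
  is.numeric(dat$id),
  all(vapply(dat[categorical], is.character, logical(1)))
)

# Convert only the named columns, not every column in the data frame.
dat[categorical] <- lapply(dat[categorical], factor)
```

Here the free-text "note" column is deliberately left as character.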

2) Most functions in R can process whole vectors of data at once, so your inner 
loop should not be necessary. Specifically, the line

data[[i]] <- gsub( " +", " ", data[[i]] )

would replace all sequences of one or more spaces in every element of the 
vector with a single space.

(Your j loop also runs too many times... str_replace_all(data[[i]], "  ", " ") 
already affects the whole column, but you repeat it unnecessarily.)
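
A small illustration with a made-up vector, using only base R (gsub with 
fixed = TRUE stands in here for the two-space replacement in your loop):

```r
x <- c("a    b", "c  d", "e f")   # four spaces, two spaces, one space

# One vectorized call handles every element and runs of any length.
gsub(" +", " ", x)

# Replacing the fixed two-space string is vectorized as well, but a single
# pass only reduces a run of four spaces to two, so it would have to be
# repeated; the regular expression with "+" avoids that.
gsub("  ", " ", x, fixed = TRUE)
```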

3) I don't know what a "depurate" value is.

4) You should be able to convert your cleaned character column to factor with 
the "factor" function... like

data[[i]] <- factor( data[[i]] )

Note that if you know certain levels should be possible but not all of them are 
actually present (e.g. "Small", "Medium", and "Large" but no data with "Small" 
are present) then you will need to specify the levels as a parameter to the 
factor function. See the help file ?factor.
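
For instance, with a made-up vector in which "small" never occurs:

```r
size <- c("medium", "large", "large")   # no "small" in this sample

# Without 'levels', only the values that are present become levels.
levels(factor(size))

# With 'levels', absent categories are kept and the order is yours to choose.
f <- factor(size, levels = c("small", "medium", "large"))
levels(f)
table(f)   # "small" is listed with a count of zero
```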

5) You have several lines of code at the end that appear to execute regardless 
of whether the column is a factor or not. They should be within the braces of 
the if statement.
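
Putting points 2, 4 and 5 together, the loop might look something like the 
following. This is only a sketch on made-up data, it uses base R functions in 
place of stringr, and it assumes you really do want every factor column 
trimmed and lower-cased:

```r
# Hypothetical example data; the factor column has untidy labels.
data <- data.frame(
  size = factor(c("Small", " small ", "LARGE   size")),
  n    = c(1, 2, 3)
)

for (i in seq_along(data)) {
  if (is.factor(data[[i]])) {
    x <- as.character(data[[i]])   # work on the labels as text
    x <- gsub(" +", " ", x)        # collapse runs of spaces
    x <- trimws(x)                 # trim both ends
    x <- tolower(x)
    data[[i]] <- factor(x)         # rebuild the factor from the clean text
  }
}

str(data)
```

The numeric column is left alone because all of the cleaning steps are inside 
the braces of the if statement.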

6) Please read the Posting Guide mentioned at the end of this and every post on 
this list, specifically regarding posting in plain text. Your code was 
partially damaged by the HTML email format.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On August 12, 2014 5:42:13 AM PDT, "Maicel Monzón Pérez" 
<mai...@infomed.sld.cu> wrote:
>Hello List,
>
>I did this script to clear data after import (I don't know is ok ). After
>its execution levels and label values got lost. Could some explain me to
>reassign levels again in the script (new depurate value)?
>
>Best regard
>
>Maicel Monzon MD, PHD
>
>Center of Cybernetic Apply to Medicine
>
># data cleaning  script
>
>library(stringr)
>
>for(i in 1:length(data)) {
>  if (is.factor(data[[i]])==T)
>  {for(j in 1:sum(str_detect(data[,i], "  ")))
>  {data[[i]]<-str_replace_all(data[[i]], "  ", " ")}}
>  data[[i]]<-str_trim (data[[i]],side = "both")
>  data[[i]]<-tolower(data[[i]])
>}
>
>Note: "  " is 2 blank space  and " "  only one
>
>       [[alternative HTML version deleted]]
>
>
>
>------------------------------------------------------------------------
>
>______________________________________________
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
