On Tue, Jan 24, 2012 at 11:54 AM, Paul Miller <pjmiller...@yahoo.com> wrote:
> Hello Everyone,
>
> Still new to R. Wrote some code that finds and prints invalid dates (see 
> below). This code works but I suspect it's not very good. If someone could 
> show me a better way, I'd greatly appreciate it.
>
> Here is some information about what I'm trying to accomplish. My sense is 
> that the R date functions are best at identifying invalid dates when fed 
> character data in their default format. So my code converts the input dates 
> to character, breaks them apart using strsplit, and then reformats them. It 
> then identifies which dates are "missing" in the sense that the month or year 
> is unknown and prints out any remaining invalid date values.
>
> As I see it, the code has at least 4 shortcomings.
>
> 1. It's too long. My understanding is that skilled programmers can usually or 
> often complete tasks like this in a few lines.
>
> 2. It's not vectorized. I started out trying to do something that was 
> vectorized but ran into problems with the strsplit function. I looked at the 
> help file and it appears this function will only accept a single character 
> vector.
>
> 3. It prints out the incorrect dates but doesn't indicate which date variable 
> they belong to. I tried various things with paste but never came up with 
> anything that worked. Ideally, I'd like to get something that looks roughly 
> like:
>
> Error: Invalid date values in birthDT
>
> "21931-11-23"
> "1933-06-31"
>
> Error: Invalid date values in diagnosisDT
>
> "2010-02-30"
>
> 4. There's no way to specify names for input and output data. I imagine it 
> would be fairly easy to specify this in the arguments to a function but am 
> not sure how to incorporate it into a for loop.
>
> Thanks,
>
> Paul
>
> ##########################################
> #### Code for detecting invalid dates ####
> ##########################################
>
> #### Test Data ####
>
> connection <- textConnection("
> 1 11/23/21931 05/23/2009 un/17/2011
> 2 06/20/1940  02/30/2010 03/17/2011
> 3 06/17/1935  12/20/2008 07/un/2011
> 4 05/31/1937  01/18/2007 04/30/2011
> 5 06/31/1933  05/16/2009 11/20/un
> ")
>
> TestDates <- data.frame(scan(connection,
>                 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
>
> close(connection)
>
> TestDates
>
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)
>
> #### List of Date Variables ####
>
> DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
>
> #### Read Dates ####
>
> for (i in seq(TestDates[DateNames])){
> TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
> TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]],"/")
> TestDates$Month <- sapply(TestDates$ParsedDT,function(x)x[1])
> TestDates$Day <- sapply(TestDates$ParsedDT,function(x)x[2])
> TestDates$Year <- sapply(TestDates$ParsedDT,function(x)x[3])
> TestDates$Day[TestDates$Day=="un"] <- "15"
> TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = 
> "-"))
> is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- T
> is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- T
> TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
> TestDates$Invalid <- ifelse(is.na(TestDates$Date) & 
> !is.na(TestDates[DateNames][[i]]), 1, 0)
> if( sum(TestDates$Invalid)==0 )
>        { TestDates[DateNames][[i]] <- TestDates$Date } else
>        { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
> TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, 
> Invalid))
> }
>
> TestDates
>
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)

If s is a character vector of dates in month/day/year form, the code
below produces two results. bad is a logical vector that is TRUE for
the bad dates and FALSE for the good ones (adjust the range check if a
different date range is valid), so s[bad] gives the bad inputs. d is a
"Date" vector with NA in place of the bad ones:

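        # replace every "un" with 15, so an unknown day becomes the 15th
        x <- gsub("un", "15", s)
        # impossible dates such as 02/30/2010 become NA
        d <- as.Date(x, "%m/%d/%Y")
        # also flag dates outside the plausible range
        bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
        d[bad] <- NA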
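
To get output labelled by variable, as asked for in point 3 of the question, the same idea can be wrapped in a function and applied to each date column. What follows is only a sketch under some assumptions: check_dates and its min_date/max_date arguments are hypothetical names, not something from base R, and the test data is retyped from the question rather than read with scan. Unlike the snippet above, it treats an unknown month or year as "missing" rather than "bad", which is how the question describes them.

```r
# Hypothetical helper (the name and arguments are illustrative only).
# s is a character vector of dates in month/day/year form.
check_dates <- function(s,
                        min_date = as.Date("1900-01-01"),
                        max_date = Sys.Date()) {
  missing <- grepl("^un/|/un$", s)           # unknown month or unknown year
  x <- sub("/un/", "/15/", s, fixed = TRUE)  # unknown day becomes the 15th
  d <- as.Date(x, "%m/%d/%Y")                # impossible dates become NA
  bad <- !missing & (is.na(d) | d < min_date | d > max_date)
  d[missing | bad] <- NA
  list(date = d, bad = bad, missing = missing)
}

# Test data retyped from the question
TestDates <- data.frame(
  Patient      = 1:5,
  birthDT      = c("11/23/21931", "06/20/1940", "06/17/1935",
                   "05/31/1937", "06/31/1933"),
  diagnosisDT  = c("05/23/2009", "02/30/2010", "12/20/2008",
                   "01/18/2007", "05/16/2009"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011",
                   "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE
)
DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

# One result list per date column, named after the column
checked <- lapply(TestDates[DateNames], check_dates)

for (nm in DateNames) {
  res <- checked[[nm]]
  if (any(res$bad)) {
    # report the offending inputs and leave the column as character
    cat("Invalid date values in", nm, "\n")
    print(TestDates[[nm]][res$bad])
  } else {
    # no invalid values, so store the column as Date
    TestDates[[nm]] <- res$date
  }
}
```

The bad values are printed in their original month/day/year form rather than the year-month-day form shown in the question. The names of the columns to check are passed in through DateNames, so the loop needs no changes when the set of date variables changes.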

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
