On Tue, Jan 24, 2012 at 11:54 AM, Paul Miller <pjmiller...@yahoo.com> wrote:
> Hello Everyone,
>
> Still new to R. Wrote some code that finds and prints invalid dates (see
> below). This code works but I suspect it's not very good. If someone could
> show me a better way, I'd greatly appreciate it.
>
> Here is some information about what I'm trying to accomplish. My sense is
> that the R date functions are best at identifying invalid dates when fed
> character data in their default format. So my code converts the input dates
> to character, breaks them apart using strsplit, and then reformats them. It
> then identifies which dates are "missing" in the sense that the month or year
> are unknown and prints out any remaining invalid date values.
>
> As I see it, the code has at least 4 shortcomings.
>
> 1. It's too long. My understanding is that skilled programmers can usually or
> often complete tasks like this in a few lines.
>
> 2. It's not vectorized. I started out trying to do something that was
> vectorized but ran into problems with the strsplit function. I looked at the
> help file and it appears this function will only accept a single character
> vector.
>
> 3. It prints out the incorrect dates but doesn't indicate which date variable
> they belong to. I tried various things with paste but never came up with
> anything that worked. Ideally, I'd like to get something that looks roughly
> like:
>
> Error: Invalid date values in birthDT
>
> "21931-11-23"
> "1933-06-31"
>
> Error: Invalid date values in diagnosisDT
>
> "2010-02-30"
>
> 4. There's no way to specify names for input and output data. I imagine this
> would be fairly easy to specify this in the arguments to a function but am
> not sure how to incorporate it into a for loop.
>
> Thanks,
>
> Paul
>
> ##########################################
> #### Code for detecting invalid dates ####
> ##########################################
>
> #### Test Data ####
>
> connection <- textConnection("
> 1 11/23/21931 05/23/2009 un/17/2011
> 2 06/20/1940 02/30/2010 03/17/2011
> 3 06/17/1935 12/20/2008 07/un/2011
> 4 05/31/1937 01/18/2007 04/30/2011
> 5 06/31/1933 05/16/2009 11/20/un
> ")
>
> TestDates <- data.frame(scan(connection,
>     list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
>
> close(connection)
>
> TestDates
>
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)
>
> #### List of Date Variables ####
>
> DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")
>
> #### Read Dates ####
>
> for (i in seq(TestDates[DateNames])){
>   TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
>   TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]],"/")
>   TestDates$Month <- sapply(TestDates$ParsedDT,function(x)x[1])
>   TestDates$Day <- sapply(TestDates$ParsedDT,function(x)x[2])
>   TestDates$Year <- sapply(TestDates$ParsedDT,function(x)x[3])
>   TestDates$Day[TestDates$Day=="un"] <- "15"
>   TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
>   is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- T
>   is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- T
>   TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
>   TestDates$Invalid <- ifelse(is.na(TestDates$Date) &
>     !is.na(TestDates[DateNames][[i]]), 1, 0)
>   if( sum(TestDates$Invalid)==0 )
>     { TestDates[DateNames][[i]] <- TestDates$Date } else
>     { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
>   TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date,
>     Invalid))
> }
>
> TestDates
>
> class(TestDates$birthDT)
> class(TestDates$diagnosisDT)
> class(TestDates$metastaticDT)

If s is a character vector of date strings, the code below produces bad, a
logical vector that is TRUE for the bad entries and FALSE for the good ones
(adjust the bounds if a different date range is valid). s[bad] then gives the
bad inputs, and the output d is a "Date" vector with NA in place of the bad
ones:

x <- gsub("un", 15, s)
d <- as.Date(x, "%m/%d/%Y")
bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
d[bad] <- NA

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
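
For points 3 and 4 of the original question (labelling the output by variable and passing names as arguments), the vectorized check can be wrapped in a function and applied to each date column in turn. What follows is only a sketch, not code from the thread: check_dates and report_bad_dates are hypothetical names, and the rule that an unknown month or year counts as "missing" rather than invalid is taken from the original post, not from the reply. It assumes dates are in m/d/Y form with "un" marking unknown parts. The range check matters because as.Date ignores trailing characters, so a year such as 21931 still parses and is only caught by the upper bound.

```r
# Hypothetical helper: classify one character vector of m/d/Y dates.
check_dates <- function(s, min_date = as.Date("1900-01-01"),
                        max_date = Sys.Date()) {
  s <- as.character(s)
  # Unknown month or year (or NA input): treated as missing, not invalid.
  unknown <- is.na(s) | grepl("^un/", s) | grepl("/un$", s)
  # Unknown day only: impute the 15th, as in the original post.
  x <- sub("^([^/]+)/un/", "\\1/15/", s)
  d <- as.Date(x, format = "%m/%d/%Y")
  # Invalid: not missing, and either unparseable or outside the allowed range.
  invalid <- !unknown & (is.na(d) | d < min_date | d > max_date)
  d[unknown | invalid] <- NA
  list(date = d, invalid = invalid)
}

# Hypothetical helper: check several columns, print bad values per column,
# and return the data frame with those columns converted to class "Date".
report_bad_dates <- function(df, cols) {
  for (col in cols) {
    res <- check_dates(df[[col]])
    if (any(res$invalid)) {
      cat("Invalid date values in ", col, ":\n", sep = "")
      cat(paste0("  ", as.character(df[[col]])[res$invalid], "\n"), sep = "")
    }
    df[[col]] <- res$date
  }
  df
}

# Test data from the original post, entered directly as a data frame.
TestDates <- data.frame(
  Patient = 1:5,
  birthDT = c("11/23/21931", "06/20/1940", "06/17/1935",
              "05/31/1937", "06/31/1933"),
  diagnosisDT = c("05/23/2009", "02/30/2010", "12/20/2008",
                  "01/18/2007", "05/16/2009"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011",
                   "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE
)

Cleaned <- report_bad_dates(TestDates,
                            c("birthDT", "diagnosisDT", "metastaticDT"))
```

Unlike the loop in the original post, this sketch always converts the column and leaves NA where a value was missing or invalid; if a column should stay as character whenever it contains invalid values, the assignment to df[[col]] can be made conditional on !any(res$invalid).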