I have one more pattern to take care of. When a string like "10 minutes and 30 seconds" is parsed, the function matches twice, once for "10 minutes" and once for "30 seconds", so that element of the result list holds two values. When I then call unlist() and try to merge the result with the original dataset from which the input vector was extracted, I get a row mismatch.
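One way to avoid the row mismatch is to collapse every match from a string into a single number before merging, so the result has exactly one element per input row. The sketch below is a base R alternative, not the gsubfn-based function from the thread. The helper name `to_seconds`, the unit spellings in the pattern, and the example data frame `dat` are assumptions made for illustration. It sums every number-unit pair it finds and does not try to handle ranges such as "2-3 minutes".

```r
# Hypothetical helper (not from the thread): total seconds for ONE string.
# All "<number> <unit>" pairs in the string are summed into a single value.
to_seconds <- function(s) {
  # Assumed unit spellings; extend the alternation for other variants.
  pattern <- "[0-9]+(\\.[0-9]+)?\\s*(hours?|hrs?|minutes?|mins?|seconds?|secs?)\\b"
  hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE, ignore.case = TRUE))[[1]]
  if (length(hits) == 0) return(NA_real_)   # nothing recognisable: keep the row as NA

  # Split each hit into its numeric part and the first letter of its unit.
  value <- as.numeric(sub("^([0-9.]+).*$", "\\1", hits))
  unit  <- tolower(substr(sub("^[0-9.]+[[:space:]]*", "", hits), 1, 1))
  mult  <- c(h = 3600, m = 60, s = 1)[unit]
  sum(value * mult)
}

xx <- c("10 minutes and 30 seconds", "2 hrs", "Flyby approx 20 sec.", "no duration here")

# vapply() guarantees exactly one numeric value per input string.
secs <- vapply(xx, to_seconds, numeric(1), USE.NAMES = FALSE)
secs
# 630 7200 20 NA

# Because length(secs) == length(xx), the result can be attached as a column.
dat <- data.frame(Duration = xx, stringsAsFactors = FALSE)
dat$seconds <- secs
```

Returning `NA` for unparsed strings, rather than dropping them, is what keeps the output aligned with the original rows.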
I think I have to keep parsing until I have 10,000 data points. Any help in this regard is appreciated.

Thanks,
Susanta

On Wed, Oct 27, 2010 at 3:43 PM, Susanta Mohapatra <mohapatra.susa...@gmail.com> wrote:

> Thanks Gabor,
>
> It is a very tricky task and your comment helped. I modified the function
> to handle the average of two numbers when the input is like "2-3 minutes".
> I also improved the regex to parse decimal parts. Right now I can
> parse 100% of one sample.
>
> Thanks
> Susanta
>
> On Wed, Oct 27, 2010 at 5:11 AM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote:
>
>> On Tue, Oct 26, 2010 at 7:17 PM, Gabor Grothendieck
>> <ggrothendi...@gmail.com> wrote:
>> > On Tue, Oct 26, 2010 at 3:28 PM, Susanta Mohapatra
>> > <mohapatra.susa...@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> I have been working with a dataset for some time and I need some help
>> >> in parsing some data.
>> >>
>> >> There is a column called "Duration" which has data like the following:
>> >>
>> >> 2 minutes => 120
>> >> 2 min => 120
>> >> 10 seconds => 10
>> >> 2 hrs => 7200
>> >> 2-3 minutes => 150 or 120
>> >> 5 minutes (when i arrived => 300
>> >> Flyby approx 20 sec. => 20
>> >> felt like 10 mins but tim => 600
>> >>
>> >> I need to convert them to numerics as given. Any help in this regard
>> >> will be highly appreciated.
>> >
>> > Assuming that "convert to numerics as given" means creating a list of
>> > numeric vectors, one per row.
>> >
>>
>> or if => was supposed to mean that that is the desired result then try
>> this:
>>
>> # n1: first number; n2: optional "-<number>" (hyphen included, hence the
>> # negation below); units: the word following the number(s)
>> f <- function(n1, n2, units) {
>>   if (n2 == "" && substr(units, 1, 3) == "sec") n1
>>   else if (n2 == "" && substr(units, 1, 3) == "min")
>>     paste(60 * as.numeric(n1))
>>   else if (n2 == "" && substr(units, 1, 3) == "hrs")
>>     paste(3600 * as.numeric(n1))
>>   else if (n2 != "" && substr(units, 1, 3) == "sec")
>>     paste(n1, "or", -as.numeric(n2))
>>   else if (n2 != "" && substr(units, 1, 3) == "min")
>>     paste(60 * as.numeric(n1), "or", -60 * as.numeric(n2))
>>   else if (n2 != "" && substr(units, 1, 3) == "hrs")
>>     paste(3600 * as.numeric(n1), "or", -3600 * as.numeric(n2))
>>   else NA
>> }
>>
>> xx <- c("2 minutes ", "2 min ", "10 seconds ", "2 hrs ", " 2-3 minutes ",
>>   "5 minutes (when i arrived ", "Flyby approx 20 sec. ",
>>   "felt like 10 mins but tim ")
>>
>> library(gsubfn)
>> out2 <- strapply(xx, "(\\d+)(-\\d+)? (\\S+)", f)
>>
>> The output looks like this:
>>
>> > str(out2)
>> List of 8
>>  $ : chr "120"
>>  $ : chr "120"
>>  $ : chr "10"
>>  $ : chr "7200"
>>  $ : chr "120 or 180"
>>  $ : chr "300"
>>  $ : chr "20"
>>  $ : chr "600"
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
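The reply in this thread mentions changing the function so that a range such as "2-3 minutes" yields the average of the two numbers and so that decimals are accepted, but that modified code was not posted. The following is only a guess at what such a variant could look like, written in base R rather than with gsubfn. The name `parse_duration` and the unit spellings in the pattern are assumptions. It returns a number instead of a character string and uses only the first number-unit pair in each string.

```r
# Hypothetical numeric variant (not the code from the thread):
# "2-3 minutes" -> mean(2, 3) * 60; decimals such as "1.5 hrs" are accepted.
parse_duration <- function(s) {
  num     <- "[0-9]+(\\.[0-9]+)?"
  units   <- "(hours?|hrs?|minutes?|mins?|seconds?|secs?)"   # assumed spellings
  pattern <- paste0(num, "(-", num, ")?\\s*", units)

  # First match only; regmatches() returns character(0) when nothing matches.
  hit <- regmatches(s, regexpr(pattern, s, perl = TRUE, ignore.case = TRUE))
  if (length(hit) == 0) return(NA_real_)

  # Strip the unit word, then split "2-3" into its endpoints.
  nums <- as.numeric(strsplit(sub("[[:space:]]*[A-Za-z]+$", "", hit), "-", fixed = TRUE)[[1]])
  # Strip the numbers, then keep the first letter of the unit.
  unit <- tolower(substr(sub("^[0-9.-]+[[:space:]]*", "", hit), 1, 1))

  mean(nums) * c(h = 3600, m = 60, s = 1)[[unit]]
}

parse_duration(" 2-3 minutes ")          # 150
parse_duration("1.5 hrs")                # 5400
parse_duration("Flyby approx 20 sec. ")  # 20
```

Applied with `vapply(xx, parse_duration, numeric(1))`, this gives a plain numeric vector of the same length as the input, which avoids the need for `unlist()`.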