I have one more pattern to take care of.

If a string like "10 minutes and 30 seconds" comes in for parsing, the
function generates two values, one for 10 minutes and one for 30 seconds, so
the result list has two elements for that single input. When I then use
unlist() and try to merge the result with the original dataset from which the
input vector was extracted, I get a row mismatch.
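
One possible way around the mismatch, as a rough sketch only: reduce every
input string to exactly one number, so the result has the same length as the
input vector and can be assigned back as a column without unlist() or merge().
The helper name to_seconds is hypothetical, and the rules it applies are
assumptions: multiple matches in one string are summed, a range like "2-3" is
averaged, and strings with no recognisable duration become NA. It uses base R
only.

```r
# Hypothetical helper: convert one free-text duration to seconds.
# Assumptions: several matches in one string are summed, a range such as
# "2-3 minutes" is replaced by its average, and no match gives NA.
to_seconds <- function(s) {
  if (is.na(s)) return(NA_real_)
  # number, optional "-number", optional spaces, then a unit prefix
  pat <- "([0-9]+\\.?[0-9]*) *(- *([0-9]+\\.?[0-9]*))? *(sec|min|hr|hour)"
  pieces <- regmatches(s, gregexpr(pat, s, ignore.case = TRUE))[[1]]
  if (length(pieces) == 0) return(NA_real_)
  mult <- c(sec = 1, min = 60, hr = 3600, hour = 3600)
  total <- 0
  for (p in pieces) {
    # g[2] = first number, g[4] = second number of a range, g[5] = unit
    g <- regmatches(p, regexec(pat, p, ignore.case = TRUE))[[1]]
    lo <- as.numeric(g[2])
    hi <- if (nzchar(g[4])) as.numeric(g[4]) else lo
    total <- total + mean(c(lo, hi)) * mult[[tolower(g[5])]]
  }
  total
}

xx <- c("2 minutes", "2-3 minutes", "10 minutes and 30 seconds",
        "Flyby approx 20 sec.", "no idea")
# vapply guarantees one numeric value per element of xx
secs <- vapply(xx, to_seconds, numeric(1), USE.NAMES = FALSE)
```

Because secs has one entry per element of xx, something like
dataset$DurationSec <- secs should line up row for row, provided xx was taken
from the dataset in its original order.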

I think I have to keep parsing until I have 10,000 values. Any help in this
regard is appreciated.

Thanks
Susanta

On Wed, Oct 27, 2010 at 3:43 PM, Susanta Mohapatra <
mohapatra.susa...@gmail.com> wrote:

> Thanks Gabor,
>
> It is a very tricky task and your comment helped. I modified the function
> to take the average of the two numbers for inputs like 2-3 minutes. I also
> improved the regex to parse decimal parts as well. Right now I can
> parse 100% of one sample.
>
> Thanks
> Susanta
>
>
> On Wed, Oct 27, 2010 at 5:11 AM, Gabor Grothendieck <
> ggrothendi...@gmail.com> wrote:
>
>> On Tue, Oct 26, 2010 at 7:17 PM, Gabor Grothendieck
>> <ggrothendi...@gmail.com> wrote:
>> > On Tue, Oct 26, 2010 at 3:28 PM, Susanta Mohapatra
>> > <mohapatra.susa...@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> I have been working with a dataset for some time and I need some help
>> >> in parsing some data.
>> >>
>> >> There is a column called "Duration" which has data like following:
>> >>
>> >> 2 minutes => 120
>> >> 2 min => 120
>> >> 10 seconds =>10
>> >> 2 hrs =>7200
>> >>  2-3 minutes => 150 or 120
>> >> 5 minutes (when i arrived => 300
>> >> Flyby approx 20 sec. => 20
>> >> felt like 10 mins but tim => 600
>> >>
>> >> I need to convert them to numerics as given. Any help in this regard
>> >> will be highly appreciated.
>> >
>> > Assuming that "convert to numerics as given" means creating a list of
>> > numeric vectors, one per row.
>> >
>>
>> or if => was supposed to mean that that is the desired result then try
>> this:
>>
>>
>> # n1: first number; n2: "" or "-<second number>"; units: word after it
>> f <- function(n1, n2, units) {
>>        u <- substr(units, 1, 3)
>>        if (n2 == "" && u == "sec") n1
>>        else if (n2 == "" && u == "min") paste(60 * as.numeric(n1))
>>        else if (n2 == "" && u == "hrs") paste(3600 * as.numeric(n1))
>>        # n2 carries its leading "-", so negate it to get the upper bound
>>        else if (n2 != "" && u == "sec") paste(n1, "or", -as.numeric(n2))
>>        else if (n2 != "" && u == "min")
>>               paste(60 * as.numeric(n1), "or", -60 * as.numeric(n2))
>>        else if (n2 != "" && u == "hrs")
>>               paste(3600 * as.numeric(n1), "or", -3600 * as.numeric(n2))
>>        else NA
>> }
>>
>>
>> xx <- c("2 minutes ", "2 min ", "10 seconds ", "2 hrs ", " 2-3 minutes ",
>> "5 minutes (when i arrived ", "Flyby approx 20 sec. ",
>> "felt like 10 mins but tim ")
>>
>> library(gsubfn)
>> out2 <- strapply(xx, "(\\d+)(-\\d+)? (\\S+)", f)
>>
>> The output looks like this:
>>
>> > str(out2)
>> List of 8
>>  $ : chr "120"
>>  $ : chr "120"
>>  $ : chr "10"
>>  $ : chr "7200"
>>  $ : chr "120 or 180"
>>  $ : chr "300"
>>  $ : chr "20"
>>  $ : chr "600"
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>
>
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
