Re: [R] Function to "lump" factors together?

David Winsemius Mon, 17 Oct 2011 20:37:04 -0700


On Oct 17, 2011, at 9:45 PM, David Wolfskill wrote:

Sorry about the odd terminology, but I suspect that my intent might be
completely missed had I used "aggregate" or "classify" (each of which
appears to have some rather special meanings in statistical analysis and
modeling).

I have some data about software builds; one of the characteristics of
each is the name of the branch.

A colleague has generated some fairly interesting graphs from the data,
but he's treating each unique branch as if it were a separate factor.

Last I checked, I had 276 unique branches, but these could be
aggregated, classified, or "lumped" into about 8 - 10 categories; I
believe it would be useful and helpful for me to be able to do precisely
that.

A facility that could work for this purpose (and that we use in our
"continuous build" driver) is the Bourne shell "case" statement. Such a
construct might look like:

        case "$branch" in
        trunk)    factor="trunk"; continue;;
        IB*)      factor="IB"; continue;;
        DEV*)     factor="DEV"; continue;;
        PVT*)     factor="PVT"; continue;;
        RELEASE*) factor="RELEASE"; continue;;
        *)        factor="UNK"; continue;;
        esac

Which would assign one of 6 values to "factor" depending on the value of
"branch" -- using "UNK" as a default if nothing else matched.

Mind, the patterns there are "Shell Patterns" ("globs"), not regular
expressions.

I've looked at R functions match(), pmatch(), charmatch(), and switch();
while each looks as if it might be coercible to get the result I want,
it also looks to require iteration over the thousands of entries I have
-- as well as using the functions in question in a fairly "unnatural"
way.

I could also write my own function that iterates over the entries,
generating factors from the branch names -- but I can't help but think
that what I'm trying to do can't be so uncommon that someone hasn't
already written a function to do what I'm trying to do. And I'd really
rather avoid "re-inventing the wheel," here.

Here's a loopless lumping of random letters with an "other" value. There are better ways, but my efforts with match and switch came to naught. "pmatch" returns a numeric vector of positions, which is then used to index a vector of group labels.


> x <- sample(letters[1:10], 50, replace = TRUE)
> c("abc","abc","abc","def","def","def","ghi","ghi","ghi", "j")[pmatch(x, letters[1:10], duplicates.ok=TRUE, nomatch=10)]
 [1] "ghi" "ghi" "ghi" "ghi" "ghi" "def" "def" "ghi" "def" "abc" "abc" "j"   "def" "def" "ghi"
[16] "abc" "j"   "def" "ghi" "abc" "ghi" "abc" "abc" "abc" "abc" "abc" "abc" "ghi" "def" "abc"
[31] "ghi" "def" "ghi" "def" "abc" "ghi" "ghi" "j"   "abc" "def" "abc" "ghi" "abc" "def" "def"
[46] "def" "j"   "ghi" "def" "def"

Classifying 5 million letters in about a second:

> x <- sample(letters[1:10], 5000000, replace = TRUE)
> system.time( v <- c("abc","abc","abc","def","def","def","ghi","ghi","ghi", "j")[pmatch(x, letters[1:10], duplicates.ok=TRUE, nomatch=10)] )

   user  system elapsed
  0.858   0.208   1.062
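
Applied to the branch names in the question, the indexing idea might look roughly like the sketch below. The branch names are made up for illustration (the real data were not posted), and the patterns are simply the globs from the shell "case" statement. glob2rx() from the utils package converts each glob to a regular expression. The loop runs over the handful of patterns, not over the thousands of entries.

```r
# Hypothetical branch names; the real data were not shown in the thread.
branch <- c("trunk", "IB_1.2", "DEV_foo", "PVT_bar",
            "RELEASE_3.0", "scratch", "IB_9")

# Globs in the same order as the shell "case" statement.
globs  <- c("trunk", "IB*", "DEV*", "PVT*", "RELEASE*")
# One label per glob, plus a final "UNK" used when nothing matches.
labels <- c("trunk", "IB", "DEV", "PVT", "RELEASE", "UNK")

# Convert shell globs to anchored regular expressions.
rx <- glob2rx(globs)

# Start every entry at the position of the default label, then overwrite
# with the position of each matching pattern. Going through the patterns
# in reverse means the earliest pattern wins, as in "case".
idx <- rep(length(labels), length(branch))
for (i in rev(seq_along(rx))) {
  idx[grepl(rx[i], branch)] <- i
}

# Index the labels to get the group, as in the pmatch() example.
group <- factor(labels[idx], levels = labels)
print(table(group))
```

Each grepl() call is vectorised over all branch names, so the cost grows with the number of patterns times one pass over the data.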

The same strategy (indexing to return a set membership) can be used with findInterval.
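
A minimal sketch of the findInterval version, assuming a numeric variable instead of branch names. The durations, break points and labels here are invented for illustration only.

```r
# Hypothetical numeric data, e.g. build durations in minutes.
duration <- c(2, 14, 45, 61, 7, 30)

# Lower bounds of the intervals; findInterval() needs them sorted.
breaks <- c(0, 10, 30, 60)
labels <- c("short", "medium", "long", "very long")

# For each value, findInterval() returns the position of the interval
# [breaks[i], breaks[i + 1]) that contains it.
idx <- findInterval(duration, breaks)

# Index the labels with those positions to get the group membership.
# Caveat: a value below breaks[1] gives position 0, which would drop
# that element here, so the first break should sit at or below the minimum.
size <- labels[idx]
print(size)
```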


--

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

