Re: [R] fast way to find most common value across columns dataframe

Rui Barradas Sat, 31 Oct 2020 11:15:05 -0700

Hello,

Here is a comparative test of 3 options: prettyR::Mode, cumstats::Mode and a function from StackOverflow.


cumstats::Mode returns a list with two members,

Values: all the modes.
Frequency: their frequency

The value of the mode must be extracted afterwards. cumstats::Mode is by far the slowest, but it returns more information.

The function below is from this StackOverflow post [1]. It's the fastest but only returns one mode, the first found.



set.seed(2020)
V <- LETTERS
df <- replicate(100, sample(V, 1000, replace = TRUE))
df <- as.data.frame(t(df))

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

res1 <- apply(df, 1, prettyR::Mode)
res2 <- apply(df, 1, cumstats::Mode)
res3 <- apply(df, 1, Mode)

head(res1)
res2vals <- lapply(res2, '[[', 1)
head(res2vals)
head(res3)

library(microbenchmark)

mb <- microbenchmark(
  pre = apply(df, 1, prettyR::Mode),
  cum = apply(df, 1, cumstats::Mode),  # was cumstats::Mode(x); x is never defined
  so = apply(df, 1, Mode),             # was apply(x, 1, Mode)
  times = 10
)
print(mb, unit = "relative", order = "median")
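
If every tied mode is wanted while keeping the unique/match/tabulate idea, the function can be extended along these lines. This is only a sketch: `all_modes` is a hypothetical helper name, and the `Values`/`Frequency` element names merely mirror the list shape described above for cumstats::Mode; they are not taken from that package's code.

```r
# Hypothetical helper (not from any package): return every tied mode.
all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # frequency of each unique value
  top <- max(counts)
  list(Values = ux[counts == top], Frequency = top)
}

set.seed(2020)
# 100 rows x 1000 columns, same layout as the test data above
m <- t(replicate(100, sample(LETTERS, 1000, replace = TRUE)))

res <- apply(m, 1, all_modes)        # a list with one list per row

# Extract the first mode of each row as a plain character vector
first_mode <- vapply(res, function(r) r$Values[1], character(1))

# Number of tied modes per row (1 means the mode is unique)
n_modes <- lengths(lapply(res, `[[`, "Values"))
```

Because the function returns a list, apply() also returns a list, so the values still have to be pulled out in a second step, as with cumstats::Mode.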




[1] https://stackoverflow.com/a/8189441/8245406


Hope this helps,

Rui Barradas

At 17:12 on 31/10/20, Luigi Marongiu wrote:

Thank you. The problem was not finding the mode but applying it the R
way (I have a tendency to loop over each row of the data frame,
which I believe is NOT the R way).
I'll try them.
Best regards
Luigi

On Sat, Oct 31, 2020 at 5:40 PM Bert Gunter <bgunter.4...@gmail.com> wrote:


As usual, a web search ("find statistical mode in R") brought up something that 
is possibly useful -- Did you try this before posting? If not, please do so in future and 
let us know what your results were if you subsequently post here.

Here's what SO suggested:

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# ergo:
apply(as.matrix(df),1,Mode)

Note that all the functionality in Mode is via .Internal functions.  So you can 
determine whether this is faster than Jim's code for your use case, but I'm 
pretty sure it will be faster than yours. However, note that this gives only 
the value of the *first* mode if there is more than one, while Jim's code 
alerts you to multiple modes.
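
A small illustration of the tie behaviour just described, as a sketch on a made-up vector (the vector `x` is an assumption for demonstration only). It also shows that the table()-based approach from the original post breaks ties differently.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c("B", "A", "A", "B")        # "A" and "B" both occur twice

Mode(x)                           # first value seen among the ties: "B"
names(which.max(table(x)))        # table() sorts its names, so: "A"

# Counting how many values share the top frequency flags a tie.
counts <- tabulate(match(x, unique(x)))
sum(counts == max(counts)) > 1    # TRUE
```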

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking 
things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Oct 31, 2020 at 2:29 AM Jim Lemon <drjimle...@gmail.com> wrote:


Hi Luigi,
If I understand your request:

library(prettyR)
apply(as.matrix(df),1,Mode)
[1] "C"       "B"       "D"       ">1 mode" ">1 mode" ">1 mode" "D"
[8] "C"       "B"       ">1 mode"

Jim

On Sat, Oct 31, 2020 at 7:56 PM Luigi Marongiu <marongiu.lu...@gmail.com>
wrote:

Hello,
I have a large dataframe (1 000 000 rows, 1000 columns) where the
columns contain a character. I would like to determine the most common
character for each row.
In the example below, I can parse one row at a time and find the
most common character (apart from ties...). But I think this will be
very slow and memory-consuming.
Is there a way to run it more efficiently?
Thank you

```
V = c("A", "B", "C", "D")
df = data.frame(n = 1:10,
        col_01 = sample(V, 10, replace = TRUE, prob = NULL),
        col_02 = sample(V, 10, replace = TRUE, prob = NULL),
        col_03 = sample(V, 10, replace = TRUE, prob = NULL),
        col_04 = sample(V, 10, replace = TRUE, prob = NULL),
        col_05 = sample(V, 10, replace = TRUE, prob = NULL),
        stringsAsFactors = FALSE)

q = vector()
for(i in 1:nrow(df)) {
   x = as.vector(t(df[i,2:ncol(df)]))
   q[i] =    names(which.max(table(x)))
}
df$most = q
```
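
One way to avoid the row loop altogether, when the set of possible values is small and known in advance, is to count each value across all rows at once and then pick the column with the largest count. This is a sketch under that assumption, not something proposed in the thread; the seed is arbitrary. Ties go to the first value in `V`, which matches names(which.max(table(x))) as long as `V` is sorted.

```r
set.seed(1)
V <- c("A", "B", "C", "D")
df <- data.frame(n = 1:10,
                 replicate(5, sample(V, 10, replace = TRUE)),
                 stringsAsFactors = FALSE)

m <- as.matrix(df[-1])                             # character matrix, id column dropped

# One vectorised pass per candidate value: a rows x length(V) matrix of counts
counts <- sapply(V, function(v) rowSums(m == v))

# Column index of the largest count in each row; ties go to the first column
df$most <- V[max.col(counts, ties.method = "first")]
```

Each comparison `m == v` allocates a logical matrix as large as `m`, so on the full 1 000 000 x 1000 data it may have to be applied to blocks of rows rather than to the whole matrix.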

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

