Hi R people,

I recently had to work with some old code that uses data.matrix and came across behaviour that I found unintuitive. We are converting a data frame containing numeric values stored as character strings to a matrix using the data.matrix function.
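To make the conversion concrete on a single vector, here is a minimal sketch. The names x, direct and codes are only for illustration, and the level order assumes the usual string collation, in which "10" sorts before "2".

```r
# Numeric values stored as character strings, as in the data frame columns.
x <- c("1", "2", "7", "10")

# Direct coercion keeps the numbers themselves.
direct <- as.integer(x)

# Going through a factor returns the position of each value among the
# sorted unique strings (the factor levels), not the number itself.
codes <- as.integer(factor(x))

print(direct)
print(codes)
print(levels(factor(x)))
```

The factor codes depend on how the strings sort, so the same number can map to different codes in different columns.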
Unfortunately, data.matrix does not return a numeric matrix containing the same numbers as the original data frame; see the example below:

    > df <- data.frame(a=c("1","2","7","10"), b=c("1","7","10","19"), c=c("a","b","c","a"), d=c("1","7","a","b"))
    > df
       a  b c d
    1  1  1 a 1
    2  2  7 b 7
    3  7 10 c a
    4 10 19 a b
    > data.matrix(df)
         a b c d
    [1,] 1 1 1 1
    [2,] 3 4 2 2
    [3,] 4 2 3 3
    [4,] 2 3 1 4

The current implementation of data.matrix iterates over the columns of the data frame and uses the following code to convert a character column to integers:

    if (is.character(xi)) {
        frame[[i]] <- as.integer(as.factor(xi))
        next
    }

I partly understand the reasoning here: going through a factor avoids NAs when the strings are non-numeric. However, it returns a (to me) unintuitive result when the data frame contains numeric strings, because each value is replaced by its factor level code. The codes are assigned separately for each column, so the values in any two columns of the data.matrix output are very difficult to compare and are not easily traceable to the original data. Was this really the original intent behind the function?

I would like to propose a change: data.matrix first checks whether a character column can be converted to integers directly, without the as.factor intermediary, and otherwise falls back to the current implementation:

    if (is.character(xi)) {
        frame[[i]] <- tryCatch({
            as.integer(xi)
        }, warning = function(war) {
            # as.integer() warned (NAs introduced by coercion),
            # so fall back to the factor level codes
            f <- as.integer(factor(xi))
            return(f)
        })
        next
    }

With this change, the data.matrix function gives the following output (using the earlier df):

    > data.matrix_new(df)
          a  b c d
    [1,]  1  1 1 1
    [2,]  2  7 2 2
    [3,]  7 10 3 3
    [4,] 10 19 1 4

Thanks for considering this!

Best,
Adam Marstrand

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel