Hi R gurus,

We do a lot of work with biological -omics datasets (genomics, proteomics etc). 
 The text file inputs to R typically contain a mixture of (mostly) character 
data and numeric data.  The number of columns (both character and numeric data) 
in the file varies with the number of samples measured (which makes use of 
colClasses awkward), so a typical approach might be

1) read in the whole file as character data

#simulated result of read.table (with stringsAsFactors=FALSE)
raw <- data.frame(
  Accession = c('P04637', 'P01375', 'P00761'),
  Description = c('Cellular tumor antigen p53', 'Tumor necrosis factor', 'Trypsin'),
  Species = c('Homo sapiens', 'Homo sapiens', 'Sus scrofa'),
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  Intensity.SampleC = c('1232780', '1481040', '62548'),
  stringsAsFactors = FALSE)

2) use grepl to identify numeric columns based on column names and split the raw 
data frame

QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
META_COLS <- !QUANT_COLS
quant.df.char <- raw[,QUANT_COLS]
meta.df <- raw[, META_COLS]

3) convert the quantitation data frame to a numeric matrix

Prior to R version 4, my standard method for step 3 was to use data.matrix(). 
After recently updating from v3.6.3, I've found that all my workflows using 
this function were giving wildly incorrect results. I figured out that 
data.matrix now yields a matrix of integer factor codes rather than the actual 
numeric values:

> quant.df.char
  Intensity.SampleA Intensity.SampleB Intensity.SampleC
1            919948           1625540           1232780
2           1346170            710272           1481040
3             15870             83624             62548

> data.matrix(quant.df.char)
     Intensity.SampleA Intensity.SampleB Intensity.SampleC
[1,]                 3                 1                 1
[2,]                 1                 2                 2
[3,]                 2                 3                 3

The change in behaviour of this function is documented in the R v4.0.0 
changelog, so it is clearly intentional:

"data.matrix() now converts character columns to factors and from this to 
integers."
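
For anyone wanting to see where the numbers in the matrix above come from, here 
is a minimal sketch reproducing the quoted behaviour by hand on the first 
column of my example (the names x and f are just for illustration, and I am 
assuming data.matrix does the equivalent of this internally, as the changelog 
wording suggests):

```r
# one column of the example data, held as character strings
x <- c('919948', '1346170', '15870')

# the levels are the sorted unique strings: "1346170" "15870" "919948"
f <- factor(x)

# the integer codes index into those levels; they are not the magnitudes
codes <- as.integer(f)
codes   # 3 1 2, the same as the Intensity.SampleA column above

# going back via the level labels recovers the original values
values <- as.numeric(levels(f))[f]
```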

Now, I know there are other ways to achieve the same conversion, e.g. 
sapply(quant.df.char, as.numeric). They aren't quite as straightforward to read 
in the code as data.matrix (with sapply/lapply in particular I have to think 
through whether there will be a need to transpose the result!), but the fact that this 
base function has been changed (without a way to replicate the previous 
behaviour) leads me to suspect that I have probably not previously been using 
data.matrix in the intended manner - and I may therefore be making similar 
mistakes elsewhere! I've certainly distributed/handed out R scripting examples 
in the past that will now give incorrect results when run on v4+ R.
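
To make the alternatives concrete, here is a rough sketch of two conversions 
that I believe should not depend on the R version, since the columns are 
already numeric by the time any matrix is built. The object names 
quant.mat.lapply and quant.mat.vapply are made up for this example, and I am 
not claiming either is the intended replacement:

```r
# the character-valued quantitation columns from the example above
quant.df.char <- data.frame(
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  Intensity.SampleC = c('1232780', '1481040', '62548'),
  stringsAsFactors = FALSE)

# option 1: convert each column to numeric first, then build the matrix;
# data.matrix only ever sees numeric columns, so no factor step is involved
quant.mat.lapply <- data.matrix(as.data.frame(lapply(quant.df.char, as.numeric)))

# option 2: vapply declares the expected result for each column (a numeric
# vector with one entry per row) and binds the results as columns,
# so samples stay in columns and no transpose is needed
quant.mat.vapply <- vapply(quant.df.char, as.numeric,
                           numeric(nrow(quant.df.char)))
```

Both give a 3 x 3 numeric matrix with the sample names as column names. Note 
that as.numeric will turn any non-numeric string into NA (with a warning), 
which for our data is probably the behaviour we want anyway.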

What's even more confusing to me (but possibly related as regards an answer) is 
that R v4 broke with long-standing convention by changing 
default.stringsAsFactors() to FALSE. So on one hand the update took away what 
was (at least, from our perspective, with our data - I am sure some here may 
disagree!) a perennial source of confusion/bugs for R learners, by not 
introducing string factorisation during data import, and then on the other hand 
changed a base function to explicitly introduce string factorisation.  I can't 
see when converting a character dataset straight to numeric factor levels, 
rather than to factors, might be that useful (but of course that doesn't mean 
it isn't!).

I've had a look through r-help and r-devel archives and couldn't spot any 
discussion of this, so apologies if this has been asked before. I'm also pretty 
sure my misunderstanding is with the intended use-case of data.matrix and R 
ethos around strings/factors rather than the rationale for the change, which is 
why I'm asking here.

Best wishes,

Phil

Philip Charles
Target Discovery Institute, Nuffield Department Of Medicine
University of Oxford



