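For both (factor and character columns):

cidx <- !(sapply(df, is.numeric))
## go through as.character(): as.numeric() on a factor returns the level codes
df[cidx] <- lapply(df[cidx], function(x) as.numeric(as.character(x)))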
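For the clinical data case, where some character columns (gender, tumor stage) have to stay as they are, something along these lines might work. It is only a sketch: coerce_numeric_like is a made-up helper name, not part of cgdsr or base R, and clin is invented example data. It assumes a column counts as numeric when coercing it introduces no new NA.

```r
## Hypothetical helper (not from cgdsr or base R): coerce factor and character
## columns to numeric, but only when every non-missing value parses as a number.
coerce_numeric_like <- function(df) {
    to_num <- function(x) {
        ## as.character() first, so factors are converted by label, not by code
        suppressWarnings(as.numeric(as.character(x)))
    }
    is_numeric_like <- function(x) {
        if (!(is.factor(x) || is.character(x)))
            return(FALSE)
        ## numeric-like means coercion does not turn a non-missing value into NA
        !any(is.na(to_num(x)) & !is.na(x))
    }
    cidx <- vapply(df, is_numeric_like, logical(1))
    ## replace only the selected columns; row and column names are kept
    if (any(cidx))
        df[cidx] <- lapply(df[cidx], to_num)
    df
}

## Invented example data: a factor of numbers, a real character column,
## and a character column of numbers with one missing value
clin <- data.frame(age = factor(c("61", "47", "70")),
                   stage = c("II", "III", "I"),
                   score = c("1.5", NA, "2.25"),
                   row.names = c("p1", "p2", "p3"),
                   stringsAsFactors = FALSE)
str(coerce_numeric_like(clin))
```

With this input, age and score should come back as numeric while stage stays character. Strings such as "1e5" or "Inf" also parse as numbers under this rule, so it may be too permissive for some clinical fields.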

Ô__
 c/ /'_;~~~~kmezhoud
(*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/

On Wed, Dec 31, 2014 at 5:51 PM, Karim Mezhoud <kmezh...@gmail.com> wrote:

> Yes, the last one is the best. But I need to test whether the returned
> data.frame contains factor or character columns:
>
> cidx <- sapply(df, is.factor)
>
> or
>
> cidx <- sapply(df, is.character)
>
> Thanks
>
> On Wed, Dec 31, 2014 at 5:24 PM, Karim Mezhoud <kmezh...@gmail.com> wrote:
>
>> Concretely, I query cBioPortal through the cgdsr package.
>> Depending on the Cases and Genetic profiles, I generally receive a
>> data.frame with a heterogeneous structure. The bad case is when the
>> returned data.frame is composed of numeric and character columns; in this
>> case the numeric columns are treated as factors. This happens when I
>> explore/extract information from Clinical Data (age, gender, tumor
>> stage, ...). In this case I need to convert only the numeric columns and
>> not the character ones. I am using
>> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
>>
>> But this heterogeneity appears even with a data.frame that is supposed to
>> be all numeric (gene expression). Here is an example:
>>
>> library(cgdsr)
>> GeneList <- c("DDR2", "HPGDS", "MS4A2", "SSUH2", "MLH1", "MSH2", "ATM",
>>               "ATR", "MDC1", "PARP1")
>> cgds <- CGDS("http://www.cbioportal.org/public-portal/")
>>
>> str(getProfileData(cgds, GeneList,
>>     "stad_tcga_methylation_hm27", "stad_tcga_methylation_hm27"))
>>
>> str(getProfileData(cgds, GeneList,
>>     "stad_tcga_methylation_hm450", "stad_tcga_methylation_hm450"))
>>
>> On my computer I did not find the same structure (numeric vs factor).
>>
>> I also need to preserve row and column names ;)
>> So I am working to resolve these details depending on the data from
>> cbioportal...
>>
>> Thank you
>>
>> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <kmezh...@gmail.com> wrote:
>>
>>> Many, many, many thanks!
>>> It is an instructive lesson. I need time to test all the examples :)
>>> Thank you for your time and support.
>>> Happy and Healthy New Year
>>>
>>> On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <mtmor...@fredhutch.org> wrote:
>>>
>>>> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
>>>>
>>>>> Thanks,
>>>>> It seems the for loop takes less time ;)
>>>>>
>>>>> with
>>>>> dim(DataFrame)
>>>>> [1] 338  70
>>>>>
>>>>> The for loop has
>>>>>    user  system elapsed
>>>>>   0.012   0.000   0.012
>>>>>
>>>>> and apply has
>>>>>    user  system elapsed
>>>>>   0.020   0.000   0.021
>>>>
>>>> The timings are so short that the answer in terms of speed is 'it does
>>>> not matter'.
>>>>
>>>> Here is a selection of approaches
>>>>
>>>> f0 <- function(df) {
>>>>     for (i in seq_along(df))
>>>>         df[,i] <- as.numeric(df[,i])
>>>>     df
>>>> }
>>>>
>>>> f0a <- function(df) {
>>>>     ## data.frame is a list-of-equal-length vectors; access each
>>>>     ## column with "[["
>>>>     for (i in seq_along(df))
>>>>         df[[i]] <- as.numeric(df[[i]])
>>>>     df
>>>> }
>>>>
>>>> f0c <- compiler::cmpfun(f0)  ## loops sometimes benefit from compilation
>>>>
>>>> f1 <- function(df)
>>>>     as.data.frame(apply(df, 2, as.numeric))
>>>>
>>>> f2 <- function(df) {
>>>>     ## replace all columns of df with list-of-vectors
>>>>     df[] <- lapply(df, as.numeric)
>>>>     df
>>>> }
>>>>
>>>> f3 <- function(df) {
>>>>     ## coerce to matrix to avoid the explicit loop, use mode<- to
>>>>     ## change storage of elements
>>>>     m <- as.matrix(df)
>>>>     mode(m) <- "numeric"
>>>>     as.data.frame(m)
>>>> }
>>>>
>>>> f4 <- function(df) {
>>>>     ## if it's a matrix, why are we returning a data.frame?
>>>>     m <- as.matrix(df)
>>>>     mode(m) <- "numeric"
>>>>     m
>>>> }
>>>>
>>>> f4a <- function(df)
>>>>     ## unlist to single vector, coerce, then format as matrix
>>>>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
>>>>            dimnames=dimnames(df))
>>>>
>>>> It's important to test that different methods return the same result
>>>> (perhaps allowing for differences in attributes such as row or column
>>>> names). The microbenchmark package repeats timings across multiple trials
>>>> (default 100 times).
>>>>
>>>> library(microbenchmark)
>>>> test <- function(df) {
>>>>     stopifnot(
>>>>         identical(f0(df), f0a(df)),
>>>>         identical(f0(df), f0c(df)),
>>>>         identical(f0(df), f1(df)),
>>>>         identical(f0(df), f2(df)),
>>>>         identical(f0(df), f3(df)),
>>>>         identical(as.matrix(f0(df)), f4(df)),
>>>>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
>>>>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
>>>>                    f4(df), f4a(df))
>>>> }
>>>>
>>>> Here are some data sets
>>>>
>>>> m <- matrix(rnorm(338 * 70), 338)
>>>> df <- as.data.frame(m)
>>>> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
>>>> dff <- as.data.frame(lapply(df, as.character))
>>>>
>>>> and results
>>>>
>>>> > test(df)
>>>> Unit: microseconds
>>>>     expr      min        lq      mean    median        uq      max neval
>>>>   f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
>>>>  f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
>>>>  f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
>>>>   f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
>>>>   f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
>>>>   f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
>>>>   f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
>>>>  f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
>>>> > test(dfc)
>>>> Unit: milliseconds
>>>>     expr       min        lq      mean    median        uq       max neval
>>>>   f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
>>>>  f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
>>>>  f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
>>>>   f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
>>>>   f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
>>>>   f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
>>>>   f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
>>>>  f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
>>>> > test(dff)
>>>> Error: identical(f0(df), f1(df)) is not TRUE
>>>>
>>>> Except when dealing with factors, the use of explicit loops is the
>>>> slowest. With factors, matrix-based methods coerce the level labels to
>>>> numeric, whereas vector-based methods coerce the underlying codes (level
>>>> values) of the factor; obviously great care needs to be taken.
>>>>
>>>> > f0(dff)[1:5, 1:5]
>>>>    V1  V2  V3  V4  V5
>>>> 1 150 232 294  88  56
>>>> 2 159   8  89  59  10
>>>> 3 132 171  40 205 119
>>>> 4 214 273  26 262 216
>>>> 5 281  49 255  31 233
>>>> > f1(dff)[1:5, 1:5]
>>>>           V1          V2         V3         V4          V5
>>>> 1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
>>>> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
>>>> 3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
>>>> 4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
>>>> 5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
>>>>
>>>> In terms of 'best practice', I would represent my data in the
>>>> appropriate data structure in the first place (as a matrix of appropriate
>>>> type, rather than data.frame, so the entire coercion is irrelevant).
>>>> If faced with a data.frame with specific columns to coerce I would use the
>>>> approach
>>>>
>>>>     cidx <- sapply(df, is.character)  # index of columns to coerce
>>>>     df[cidx] <- lapply(df[cidx], as.numeric)
>>>>
>>>> which seems to be reasonably correct, expressive, compact, and speedy.
>>>>
>>>> Martin Morgan
>>>>
>>>>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <b...@xs4all.nl> wrote:
>>>>>
>>>>>> On 31-12-2014, at 08:40, Karim Mezhoud <kmezh...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> I would like to choose between these two ways of converting a data
>>>>>>> frame. Which is faster?
>>>>>>>
>>>>>>> for(i in 1:ncol(DataFrame)){
>>>>>>>     DataFrame[,i] <- as.numeric(DataFrame[,i])
>>>>>>> }
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> DataFrame <- as.data.frame(apply(DataFrame, 2, function(x)
>>>>>>>     as.numeric(x)))
>>>>>>
>>>>>> Try it and use system.time.
>>>>>>
>>>>>> Berend
>>>>>>
>>>>>>> Thanks
>>>>>>> Karim
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793