On 9/27/2021 1:06 AM, Leonard Mada wrote:
> Dear Bill,
>
> Does list.files() always sort the results?
> It seems so. The option full.names = FALSE does not have any effect:
> the results always seem to be sorted.
>
> Maybe it would be better to process the files in an unsorted order, as
> stored on the disk?
After some more investigations:

This took only a few seconds:

sapply(list.dirs(path=path, full.name=F, recursive=F),
    function(f) length(list.files(path = paste0(path, "/", f),
        full.names = FALSE, recursive = TRUE)))
# maybe with caching, but the difference is enormous

BH seems to contain *by far* the most files: 11701. But excluding it
from processing had only a linear effect: still 377 s. I had a look at
src/main/platform.c, but do not fully understand it.

Sincerely,

Leonard

> Sincerely,
>
> Leonard
>
> On 9/25/2021 8:13 PM, Bill Dunlap wrote:
>> On my Windows 10 laptop I see evidence of the operating system
>> caching information about recently accessed files. This makes it
>> hard to say how the speed might be improved. Is there a way to clear
>> this cache?
>>
>> > system.time(L1 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.48    2.81   30.42
>> > system.time(L2 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.35    1.10    1.43
>> > identical(L1,L2)
>> [1] TRUE
>> > length(L1)
>> [1] 30
>> > length(dir(R.home("library"),recursive=TRUE))
>> [1] 12949
>>
>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
>> <r-help@r-project.org> wrote:
>>
>> Dear List Members,
>>
>> I tried to compute the file sizes of each installed package and the
>> process is terribly slow.
>> It took ~10 minutes for 512 packages / 1.6 GB total size of files.
>>
>> 1.) Package Sizes
>>
>> system.time({
>>     x = size.pkg(file=NULL);
>> })
>> # elapsed time: 509 s !!!
>> # 512 packages; 1.64 GB;
>> # R 4.1.1 on MS Windows 10
>>
>> The code for the size.pkg() function is below and the latest version
>> is on GitHub:
>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>
>> Questions:
>> Is there a way to get the file size faster?
>> It takes long on Windows as well, but of the order of 10-20 s, not 10
>> minutes.
>> Am I missing something?
>>
>> 1.b.) Alternative
>>
>> It came to my mind to read all the file sizes first and then use
>> tapply or aggregate - but I do not see why that should be faster.
>> Would it be meaningful to benchmark each individual package?
>> Although I am not very inclined to wait 10 minutes for each new
>> try-out.
>>
>> 2.) Big Packages
>>
>> Just as a note: there are a few very large packages (in my list of
>> 512 packages):
>>
>> 1  123,566,287  BH
>> 2  113,578,391  sf
>> 3  112,252,652  rgdal
>> 4   81,144,868  magick
>> 5   77,791,374  openNLPmodels.en
>>
>> I suspect that sf & rgdal have a lot of duplicated data structures
>> and/or duplicate code and/or duplicated libraries - although I am
>> not an expert in the field and did not check the sources.
>>
>> Sincerely,
>>
>> Leonard
>>
>> =======
>>
>> # Package Size:
>> size.f.pkg = function(path=NULL) {
>>     if(is.null(path)) path = R.home("library");
>>     xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
>>     size.f = function(p) {
>>         p = paste0(path, "/", p);
>>         sum(file.info(list.files(path=p, pattern=".",
>>             full.names = TRUE, all.files = TRUE,
>>             recursive = TRUE))$size);
>>     }
>>     sapply(xd, size.f);
>> }
>>
>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>     x = size.f.pkg(path=path);
>>     x = as.data.frame(x);
>>     names(x) = "Size";
>>     x$Name = rownames(x);
>>     # Order
>>     if(sort) {
>>         id = order(x$Size, decreasing=TRUE);
>>         x = x[id,];
>>     }
>>     if( ! is.null(file)) {
>>         if( ! is.character(file)) {
>>             print("Error: Size NOT written to file!");
>>         } else write.csv(x, file=file, row.names=FALSE);
>>     }
>>     return(x);
>> }
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
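[Editor's note] The "read all file sizes first, then aggregate" idea from
section 1.b can be sketched as below. This is a minimal illustration, not
code from the thread; the function name size.pkg.tapply and the path
handling are my own assumptions. It does one recursive list.files() over
the whole library, one file.info() call, then tapply() to sum sizes per
top-level package directory:

```r
# Hypothetical sketch (not from the original post): aggregate file
# sizes per package with a single recursive scan plus tapply().
size.pkg.tapply = function(path = NULL) {
    if(is.null(path)) path = R.home("library");
    # one recursive scan of the whole library tree
    files = list.files(path = path, full.names = TRUE,
        all.files = TRUE, recursive = TRUE);
    sizes = file.info(files)$size;
    # package name = first path component after the library path
    pkg = sub("/.*$", "", substring(files, nchar(path) + 2));
    # sum sizes per package, largest first
    sort(tapply(sizes, pkg, sum), decreasing = TRUE);
}
```

Whether this is actually faster than the per-package loop depends on how
much of the cost is the directory traversal itself versus the repeated
list.files() calls; it is at least easy to benchmark against size.f.pkg()
with system.time().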