Hello,

R 4.1.1 on Ubuntu 20.04; sessionInfo() output at the end.

I'm arriving a bit late to this thread, but here are the timings I'm getting on a 10+ year old PC.

1. I am not getting anything even close to 5 or 10 minute running times.
2. Like Bill said, there seems to be a caching effect: the first runs are consistently slower. And this is Ubuntu, not Windows, so different OSes show the same behavior. That is not unexpected; disk accesses are slow operations and operating systems have been caching them for a long time now.
3. I am not at all sure whether this is relevant, but as for how to clear the Windows File Explorer cache: open a File Explorer window and click

View > Options > (Privacy section) Clear

4. Now for my timings. The cache effect is large: from 23 s down to 2.5 s.
But even on an old PC, nowhere near 300 s or 500 s.
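On Linux, the cold-cache timings can be reproduced at will by dropping the page cache (a sketch, Linux-specific and requiring root; the commands can be run from R via system() or directly in a shell):

```r
# Linux-only, needs root. Flush dirty pages to disk first, then drop
# the page cache, dentries and inodes; the next file-system scan runs
# cold again, like the first run above.
system("sync")
system("echo 3 | sudo tee /proc/sys/vm/drop_caches")
```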

rui@rui:~$ R -q -f rhelp.R
#
# functions size.pkg and size.f.pkg omitted
#
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
>
> cat("\nLeonard Mada's code:\n\n")

Leonard Mada's code:

> system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
   user  system elapsed
  1.700   0.988  23.339
> system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
   user  system elapsed
  1.578   0.921   2.540
> system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
   user  system elapsed
  1.542   0.949   2.523
>
> cat("\nBill Dunlap's code:\n\n")

Bill Dunlap's code:

> system.time(L1 <- size.f.pkg(R_LIBS_USER))
   user  system elapsed
  1.608   0.887   2.538
> system.time(L2 <- size.f.pkg(R_LIBS_USER))
   user  system elapsed
  1.515   0.982   2.510
> identical(L1,L2)
[1] TRUE
> length(L1)
[1] 1773
> length(dir(R_LIBS_USER,recursive=TRUE))
[1] 85204
>
> cat("\n\nsessionInfo return value:\n\n")


sessionInfo return value:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=pt_PT.UTF-8
 [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=pt_PT.UTF-8
 [7] LC_PAPER=pt_PT.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.1.1


And the timings for the sapply code:


rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+      function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
   user  system elapsed
  0.802   0.901  15.964
>
>
rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+      function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
   user  system elapsed
  0.730   0.528   1.264


Once again, the 2nd run took a fraction of the time of the 1st.

Leonard, if you are getting those timings, is there another process running, or one that ran previously, that has eaten up the cache?

Hope this helps,

Rui Barradas

On 26/09/21 at 23:31, Leonard Mada via R-help wrote:

On 9/27/2021 1:06 AM, Leonard Mada wrote:

Dear Bill,


Does list.files() always sort the results?

It seems so. The option full.names = FALSE has no effect on this:
the results always seem to be sorted.
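This can be checked quickly (a sketch using a throw-away temporary directory; the file names are made up):

```r
# Create two files in reverse alphabetical order ...
d <- tempfile("sorttest")
dir.create(d)
file.create(file.path(d, "b.txt"))
file.create(file.path(d, "a.txt"))

# ... list.files() still returns them alphabetically,
# with or without full.names:
print(list.files(d))                                 # "a.txt" "b.txt"
print(basename(list.files(d, full.names = TRUE)))    # "a.txt" "b.txt"

unlink(d, recursive = TRUE)
```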


Maybe it would be better to process the files in an unsorted order,
i.e. as stored on the disk?


After some more investigations:

This took only a few seconds:

sapply(list.dirs(path=path, full.name=F, recursive=F),
      function(f) length(list.files(path = paste0(path, "/", f),
full.names = FALSE, recursive = TRUE)))

# maybe with caching, but the difference is enormous


BH seems to contain *by far* the most files: 11701.

But excluding it from processing had only a linear effect: still 377 s.


I had a look at src/main/platform.c, but do not fully understand it.


Sincerely,


Leonard



Sincerely,


Leonard


On 9/25/2021 8:13 PM, Bill Dunlap wrote:
On my Windows 10 laptop I see evidence of the operating system
caching information about recently accessed files.  This makes it
hard to say how the speed might be improved.  Is there a way to clear
this cache?

system.time(L1 <- size.f.pkg(R.home("library")))
    user  system elapsed
    0.48    2.81   30.42
system.time(L2 <- size.f.pkg(R.home("library")))
    user  system elapsed
    0.35    1.10    1.43
identical(L1,L2)
[1] TRUE
length(L1)
[1] 30
length(dir(R.home("library"),recursive=TRUE))
[1] 12949

On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
<r-help@r-project.org> wrote:

     Dear List Members,


     I tried to compute the file sizes of each installed package and the
     process is terribly slow.

     It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.


     1.) Package Sizes


     system.time({
              x = size.pkg(file=NULL);
     })
     # elapsed time: 509 s !!!
     # 512 Packages; 1.64 GB;
     # R 4.1.1 on MS Windows 10


     The code for the size.pkg() function is below and the latest
     version is
     on Github:

     https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R


     Questions:
     Is there a way to get the file size faster?
     It takes long on Windows as well, but of the order of 10-20 s,
     not 10
     minutes.
     Do I miss something?


     1.b.) Alternative

     It occurred to me to read all file sizes first and then use
     tapply or aggregate - but I do not see why that should be faster.

     Would it be meaningful to benchmark each individual package?

     Though I am not very inclined to wait 10 minutes for each new attempt.
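The aggregate idea above can be sketched as a single pass: one recursive list.files() over the whole library, one file.info() call on all paths, then tapply() to sum sizes per top-level directory (a sketch; size.pkg.flat is a hypothetical name, and whether this is actually faster will, as the timings in this thread suggest, depend mostly on the disk cache):

```r
# One recursive scan, one file.info() call, then sum the sizes
# grouped by top-level (package) directory.
size.pkg.flat = function(path = R.home("library")) {
    rel   = list.files(path, recursive = TRUE, all.files = TRUE)
    pkg   = sub("/.*", "", rel)   # top-level component = package name
    sizes = file.info(file.path(path, rel))$size
    sort(tapply(sizes, pkg, sum), decreasing = TRUE)
}
```

For example, head(size.pkg.flat(Sys.getenv("R_LIBS_USER"))) would list the largest packages first.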


     2.) Big Packages

     Just as a note: there are a few very large packages (in my list
     of 512
     packages):

     1  123,566,287               BH
     2  113,578,391               sf
     3  112,252,652            rgdal
     4   81,144,868           magick
     5   77,791,374 openNLPmodels.en

     I suspect that sf & rgdal have a lot of duplicated data structures
     and/or duplicate code and/or duplicated libraries - although I am
     not an
     expert in the field and did not check the sources.


     Sincerely,


     Leonard

     =======


     # Package Size:
     size.f.pkg = function(path=NULL) {
          if(is.null(path)) path = R.home("library");
          xd = list.dirs(path = path, full.names = FALSE, recursive =
     FALSE);
          size.f = function(p) {
              p = paste0(path, "/", p);
              sum(file.info <http://file.info>(list.files(path=p,
     pattern=".",
                  full.names = TRUE, all.files = TRUE, recursive =
     TRUE))$size);
          }
          sapply(xd, size.f);
     }

     size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
          x = size.f.pkg(path=path);
          x = as.data.frame(x);
          names(x) = "Size"
          x$Name = rownames(x);
          # Order
          if(sort) {
              id = order(x$Size, decreasing=TRUE)
              x = x[id,];
          }
          if( ! is.null(file)) {
              if( ! is.character(file)) {
                  print("Error: Size NOT written to file!");
              } else write.csv(x, file=file, row.names=FALSE);
          }
          return(x);
     }

     ______________________________________________
     R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
     https://stat.ethz.ch/mailman/listinfo/r-help
     PLEASE do read the posting guide
     http://www.R-project.org/posting-guide.html
     and provide commented, minimal, self-contained, reproducible code.




