Hello,
R 4.1.1 on Ubuntu 20.04; sessionInfo() output at the end.
I'm arriving a bit late to this thread, but here are the timings I'm
getting on a 10+ year old PC.
1. I am not getting anything even close to 5 or 10 mins running times.
2. Like Bill said, there seems to be a caching effect: the first runs
are consistently slower. And this is Ubuntu, not Windows, so different
OSes show the same behavior. That is not unexpected; disk accesses are
slow operations and have been cached for a long time now.
3. I am not sure whether this is relevant, but to clear the Windows
File Explorer cache, open a File Explorer window and click
View > Options > (Privacy section) Clear.
4. Now for my timings. The cache effect is large: from 23 s down to 2.5 s.
But even on an old PC, nowhere near 300 s or 500 s.
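As an aside: on Linux, one way to get reproducible cold-cache numbers between runs is to drop the kernel page cache. This is the standard /proc sysctl technique, not something from this thread; it needs root, and its exact effect on the timings will vary by machine:

```shell
# Flush dirty pages to disk, then drop the page cache plus
# dentries/inodes (value 3) so the next scan starts cold.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
else
    echo "not root: skipping drop_caches"
fi
```

Running this between the system.time() calls should make the first-run vs. second-run difference disappear.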
rui@rui:~$ R -q -f rhelp.R
#
# functions size.pkg and size.f.pkg omitted
#
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
>
> cat("\nLeonard Mada's code:\n\n")
Leonard Mada's code:
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.700 0.988 23.339
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.578 0.921 2.540
> system.time({
+ x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
user system elapsed
1.542 0.949 2.523
>
> cat("\nBill Dunlap's code:\n\n")
Bill Dunlap's code:
> system.time(L1 <- size.f.pkg(R_LIBS_USER))
user system elapsed
1.608 0.887 2.538
> system.time(L2 <- size.f.pkg(R_LIBS_USER))
user system elapsed
1.515 0.982 2.510
> identical(L1,L2)
[1] TRUE
> length(L1)
[1] 1773
> length(dir(R_LIBS_USER,recursive=TRUE))
[1] 85204
>
> cat("\n\nsessionInfo return value:\n\n")
sessionInfo return value:
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=pt_PT.UTF-8
[5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=pt_PT.UTF-8
[7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.1
And the sapply code.
rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+ function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
user system elapsed
0.802 0.901 15.964
>
>
rui@rui:~$ R -q -f rhelp2.R
> R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
> path <- R_LIBS_USER
> system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+ function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
user system elapsed
0.730 0.528 1.264
Once again the 2nd run took a fraction of the time of the 1st.
Leonard, if you are getting those timings, is there another process
running, or one that previously ran and evicted the cache?
Hope this helps,
Rui Barradas
On 26/09/21 23:31, Leonard Mada via R-help wrote:
On 9/27/2021 1:06 AM, Leonard Mada wrote:
Dear Bill,
Does list.files() always sort the results?
It seems so. The option full.names = FALSE does not have any effect:
the results always seem to be sorted.
Maybe it would be better to process the files in an unsorted order, as
stored on the disk?
After some more investigation:
This took only a few seconds:
sapply(list.dirs(path=path, full.name=F, recursive=F),
function(f) length(list.files(path = paste0(path, "/", f),
full.names = FALSE, recursive = TRUE)))
# maybe with caching, but the difference is enormous
It seems BH contains *by far* the most files: 11701 files.
But excluding it from processing had only a linear effect: still 377 s.
I had a look at src/main/platform.c, but do not fully understand it.
Sincerely,
Leonard
On 9/25/2021 8:13 PM, Bill Dunlap wrote:
On my Windows 10 laptop I see evidence of the operating system
caching information about recently accessed files. This makes it
hard to say how the speed might be improved. Is there a way to clear
this cache?
system.time(L1 <- size.f.pkg(R.home("library")))
user system elapsed
0.48 2.81 30.42
system.time(L2 <- size.f.pkg(R.home("library")))
user system elapsed
0.35 1.10 1.43
identical(L1,L2)
[1] TRUE
length(L1)
[1] 30
length(dir(R.home("library"),recursive=TRUE))
[1] 12949
On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
<r-help@r-project.org> wrote:
Dear List Members,
I tried to compute the file sizes of each installed package and the
process is terribly slow.
It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
1.) Package Sizes
system.time({
x = size.pkg(file=NULL);
})
# elapsed time: 509 s !!!
# 512 Packages; 1.64 GB;
# R 4.1.1 on MS Windows 10
The code for the size.pkg() function is below and the latest version is
on Github:
https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
Questions:
Is there a way to get the file sizes faster?
It takes long on Windows as well, but on the order of 10-20 s, not
10 minutes.
Am I missing something?
1.b.) Alternative
It came to my mind to first read all file sizes and then use tapply or
aggregate - but I do not see why that should be faster.
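That alternative could be sketched as follows; this is a minimal, illustrative version of the single-listing idea (the helper name size.pkg2 is made up), not code from this thread:

```r
# List every file once, then aggregate the sizes by the top-level
# (package) directory component, instead of listing per package.
size.pkg2 <- function(path = R.home("library")) {
    files <- list.files(path, full.names = TRUE, recursive = TRUE,
                        all.files = TRUE)
    # package name = first path component below `path`
    pkg <- sub("/.*$", "", substring(files, nchar(path) + 2L))
    sort(tapply(file.size(files), pkg, sum), decreasing = TRUE)
}
```

Whether this is faster will depend on the OS; either way the whole tree is still walked once, so most of the time should still go to the (cold) directory scans rather than to the aggregation.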
Would it be meaningful to benchmark each individual package? Although
I am not very inclined to wait 10 minutes for each new try.
2.) Big Packages
Just as a note: there are a few very large packages (in my list of 512
packages):
1 123,566,287 BH
2 113,578,391 sf
3 112,252,652 rgdal
4 81,144,868 magick
5 77,791,374 openNLPmodels.en
I suspect that sf & rgdal contain a lot of duplicated data structures
and/or duplicated code and/or duplicated libraries - although I am not
an expert in the field and did not check the sources.
Sincerely,
Leonard
=======
# Package Size:
size.f.pkg = function(path=NULL) {
    if(is.null(path)) path = R.home("library");
    xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
    size.f = function(p) {
        p = paste0(path, "/", p);
        sum(file.info(list.files(path=p, pattern=".",
            full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
    }
    sapply(xd, size.f);
}
size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
x = size.f.pkg(path=path);
x = as.data.frame(x);
names(x) = "Size"
x$Name = rownames(x);
# Order
if(sort) {
id = order(x$Size, decreasing=TRUE)
x = x[id,];
}
if( ! is.null(file)) {
if( ! is.character(file)) {
print("Error: Size NOT written to file!");
} else write.csv(x, file=file, row.names=FALSE);
}
return(x);
}
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.