On 2/19/20 3:55 AM, Stefan Schreiber wrote:
I have posted this question on R-help where it was suggested to me
that I might get a better response on R-devel. So far I have gotten no
response. The post I am talking about is here:
https://stat.ethz.ch/pipermail/r-help/2020-February/465700.html

My apologies for cross-posting, which I am aware is impolite and I
should have posted on R-devel in the first place - but I wasn't sure.

Here is my question again:

I am currently working through Advanced R by H. Wickham and came
across the `lobstr::obj_size` function which appears to calculate the
size of an object by taking into account whether the same object has
been referenced multiple times, e.g.

x <- runif(1e6)
y <- list(x, x, x)
lobstr::obj_size(y)
# 8,000,128 B

# versus:
object.size(y)
# 24000224 bytes

Reading through `?object.size` in the "Details" it reads: [...] but
does not detect if elements of a list are shared [...].

My questions are:

(1) is the result of `obj_size()` the "correct" one when it comes to
actual size used in memory?

(2) And if yes, why wouldn't `object.size()` be updated to reflect the
more precise calculation of an object in question similar to
`obj_size()`?

Please keep in mind that "actual size used in memory" is an elusive concept, particularly in managed languages such as R. Even in native languages you have on-demand paging (not all data are in physical memory: some pages may be imputed as all zeros, some swapped out, some backed by files, e.g. code), plus internal and external fragmentation caused by the C library memory allocator and the overhead of object headers and allocator meta-data. On top of that you have the managed heap: more internal and external fragmentation, more headers.

Moreover, the memory representation may change invisibly and sometimes in surprising ways. In R this includes copy-on-write (hence the sharing you observed), but also compact objects via ALTREP (e.g. sequences). R also has the symbol table and the string cache: strings are interned, as in some other language runtimes, so the price is paid only once for each distinct string.

In principle, managed runtimes could do much more: say, compression of objects with adaptive decompression; some systems internally split the representation of large objects depending on their size, with additional overheads; some could do transparent de-duplication (not only for strings); some of these choices could adapt to memory pressure. Finally, in R, packages often maintain memory related to specific R objects, linked say via external pointers, and again there may be no meaningful way to map that usage to individual objects.
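To make the string-cache point concrete, here is a small sketch; the exact byte counts are indicative and depend on the platform and R version:

# One long string referenced 10,000 times: the string is interned, so the
# vector mostly costs one pointer per element plus the single string.
s1 <- rep(strrep("x", 1000), 1e4)
object.size(s1)

# 10,000 distinct long strings: each one is stored separately in the cache,
# so the reported (and actual) size is far larger.
s2 <- paste0(strrep("x", 1000), seq_len(1e4))
object.size(s2)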

Not only that, the size of an object tree is not easy to define. The information is, in addition, not very useful, because innocuous changes may alter it in arbitrary ways outside the user's control: there is no good intuition for how much the size will change under intended application-level modifications of the tree. Users of the system could hardly build a reliable mental model of the memory usage, because it depends on the internal design of the virtual machine, which in addition can change over time.
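As a sketch of this, reusing the example from the question (and assuming the lobstr package is installed):

x <- runif(1e6)
y <- list(x, x, x)
lobstr::obj_size(y)   # about 8 MB: the three elements share one vector

y[[1]][1] <- 0        # modify a single number ...
lobstr::obj_size(y)   # ... copy-on-write duplicates that element: about 16 MB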

As the concept is elusive, the best advice would be: don't ask for the object size; find some other solution to your problem. In some cases it makes sense to ask for the object size in some application-specific way, and then implement object-size methods for specific application classes (e.g. structures holding strings would sum up the number of characters in the strings, etc.). Such an application-specific measure may be inspired by some particular (perhaps trivial) serialization format.
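A hypothetical sketch of such a method; the class name document_store and the accounting rule are invented here purely for illustration:

app_size <- function(x, ...) UseMethod("app_size")

# For a structure holding strings, count the bytes of text and ignore
# R-level overheads such as headers and pointer vectors.
app_size.document_store <- function(x, ...) {
  sum(nchar(x$texts, type = "bytes"))
}

store <- structure(list(texts = c("alpha", "beta")), class = "document_store")
app_size(store)   # 9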

I've used object.size() myself only for profiling, to quickly tell objects that are probably very large from objects of trivial size, where these nuances did not matter; but for that I knew roughly what the objects were (e.g. that they were not hiding things in environments).
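For that kind of rough profiling, something along these lines is usually enough (a sketch, not a precise measurement):

# Rank the objects in the global environment by their reported size,
# just to spot the obviously large ones.
sizes <- sapply(ls(envir = globalenv()),
                function(nm) object.size(get(nm, envir = globalenv())))
head(sort(sizes, decreasing = TRUE))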

The choices made by object.size() in R are conservative: they provide an over-approximation that makes intuitive sense at user level, and they reduce surprises of significant size expansion caused by minimal updates. The choices and their limitations are documented. I think this is at least no worse than, say, taking sharing into account, looking at the current "size" of compact objects, etc. One could provide more options to object.size(), but I don't think that would be useful.
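For example (a sketch assuming R >= 3.5.0, where 1:n is a compact ALTREP sequence; the numbers are indicative only):

x <- 1:1e6
object.size(x)        # reported as if fully materialised, roughly 4 MB
lobstr::obj_size(x)   # reports the current compact form, a few hundred bytes

x[1] <- 2L            # a minimal update materialises the full vector
object.size(x)        # the report is unchanged, but now roughly 4 MB are
                      # really allocated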

Best,
Tomas



There are probably valid reasons for this and any insight would be
greatly appreciated.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
