Hi all, I wanted to give you an update on vctrs (<https://vctrs.r-lib.org/>) since I last brought it up here in August. The biggest change is that I now have a much clearer idea of what vctrs is! I’ll summarise that here, and point you to the documentation if you’re interested in learning more. I’m planning on submitting vctrs to CRAN in the near future, but it’s very much a 0.1.0 release and I expect it to continue to evolve as more people try it out and give me feedback. I’d love to hear your thoughts!

vctrs has three main goals:

- To define and motivate `vec_size()` and `vec_type()` as alternatives to `length()` and `class()`.
- To define type- and size-stability, useful tools for analysing function interfaces.
- To make it easier to create new S3 vector classes.

## Size and prototype

`vec_size()` was motivated by my desire to have a function that captures the number of “observations” in a vector. This is particularly important for data frames because it’s useful to have a function such that `f(data.frame(x))` equals `f(x)`. No base function has this property: `NROW()` comes closest, but because it’s defined in terms of `length()` for dimensionless objects, it always returns a number, even for types that can’t go in a data frame, e.g. `data.frame(mean)` errors even though `NROW(mean)` is `1`.

``` r
vec_size(1:10)
#> [1] 10
vec_size(as.POSIXlt(Sys.time() + 1:10))
#> [1] 10
vec_size(data.frame(x = 1:10))
#> [1] 10
vec_size(array(dim = c(10, 4, 1)))
#> [1] 10
vec_size(mean)
#> Error: `x` is a not a vector
```

`vec_size()` is paired with `vec_slice()` for subsetting, i.e. `vec_slice()` is to `vec_size()` as `[` is to `length()`; `vec_slice(data.frame(x), i)` equals `data.frame(vec_slice(x, i))` (modulo variable/row names).

(I plan to make `vec_size()` and `vec_slice()` generic in the next release, providing another point of differentiation from `NROW()`.)

Complementary to the size of a vector is its prototype, a zero-observation slice of the vector. You can compute this using `vec_type()`, but because many classes don’t have an informative print method for a zero-length vector, I also provide `vec_ptype()` which prints a brief summary.
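
For readers without vctrs installed, the two ideas in this section can be roughly approximated in base R. This is only a sketch to illustrate the concepts, not how vctrs implements them: `NROW()` stands in for `vec_size()` (with the caveat about non-vectors noted above), and a zero-observation slice taken with `[` stands in for the prototype. The variable names are made up for the example.

``` r
# Base R approximation of "size": NROW() counts observations for
# vectors, arrays, and data frames alike.
x <- 1:10
NROW(x)
#> [1] 10
NROW(data.frame(x = x))
#> [1] 10
NROW(array(dim = c(10, 4, 1)))
#> [1] 10

# ...but it also returns a number for a non-vector, which vec_size() rejects.
NROW(mean)
#> [1] 1

# Base R approximation of a "prototype": a zero-observation slice keeps
# the class and attributes such as factor levels or trailing dimensions.
f <- factor(c("a", "b", "a"))
f0 <- f[0]
length(f0)
#> [1] 0
levels(f0)
#> [1] "a" "b"

a <- array(1:40, dim = c(10, 4, 1))
dim(a[0, , , drop = FALSE])
#> [1] 0 4 1

# For a data frame, the zero-row slice keeps the column names and types.
df <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))
df0 <- df[0, , drop = FALSE]
sapply(df0, class)
```
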
As well as the class, the prototype also captures important attributes:

``` r
vec_ptype(1:10)
#> Prototype: integer
vec_ptype(array(1:40, dim = c(10, 4, 1)))
#> Prototype: integer[,4,1]
vec_ptype(Sys.time())
#> Prototype: datetime<local>
vec_ptype(data.frame(x = 1:10, y = letters[1:10]))
#> Prototype: data.frame<
#>  x: integer
#>  y: factor<5e105>
#> >
```

`vec_size()` and `vec_type()` are accompanied by functions that either find or enforce a common size (using modified recycling rules) or common type (by reducing a double-dispatching `vec_type2()` that determines the common type from a pair of types).

You can read more about `vec_size()` and `vec_type()` at <https://vctrs.r-lib.org/articles/type-size.html>.

## Stability

The definitions of size and prototype are motivated by my experiences doing code review. I find that I can often spot problems by running R code in my head. Obviously my mental R interpreter is much simpler than the real interpreter, but it seems to focus on prototypes and sizes, and I’m suspicious of code where I can’t easily predict the class of every new variable.

This leads me to two definitions. A function is **type-stable** iff:

- You can predict the output type knowing only the input types.
- The order of arguments in `...` does not affect the output type.

Similarly, a function is **size-stable** iff:

- You can predict the output size knowing only the input sizes, or there is a single numeric input that specifies the output size.

For example, `ifelse()` is type-unstable because the output type can be different even when the input types are the same:

``` r
vec_ptype(ifelse(NA, 1L, 1L))
#> Prototype: logical
vec_ptype(ifelse(FALSE, 1L, 1L))
#> Prototype: integer
```

Size-stability is generally not a useful tool for analysing base R functions because the definition is a bit too far away from base conventions.
The analogously defined length-stability is a bit better, but the definition of length for non-vectors means that complete length-stability is rare. For example, while `length(c(x, y))` usually equals `length(x) + length(y)`, it does not hold for all possible inputs:

``` r
length(globalenv())
#> [1] 0
length(mean)
#> [1] 1
length(c(mean, globalenv()))
#> [1] 2
```

(I don’t mean to pick on base here; the tidyverse also has many functions that violate these principles, but I wanted to stick to functions that all readers would be familiar with.)

Type- and size-stable functions are desirable because they make it possible to reason about code without knowing the precise values involved. Of course, not all functions should be type- or size-stable: R would be incredibly limited if you could predict the type or size of `[[` and `read.csv()` without knowing the specific inputs! But where possible, I think using type- and size-stable functions makes code easier to reason about and hence more likely to be bug free.

You can read more about size- and type-stability at <https://vctrs.r-lib.org/articles/stability.html>. This vignette includes a detailed analysis of `c()` and a type- and size-stable alternative called `vec_c()`.

## New vector types

Finally, vctrs provides `new_vctr()` and `new_rcrd()` to make it easier to define new classes, following the conventions that I’ve found helpful, including writing a constructor function that enforces the types of the underlying vector and its attributes (more details at <https://adv-r.hadley.nz/s3.html>).

vctrs also makes life easier by implementing many base generics in terms of a small set of primitives:

- At the simplest level, `print()` and `str()` are defined in terms of `format()`. `as.data.frame()` is implemented using the standard approach used for factor, POSIXct, Date etc.

- `[[` and `[` use `NextMethod()` dispatch to the underlying base function, then restore attributes with `vec_restore()`.
  I’m not sure what the base equivalent of `vec_restore()` is, but it makes subclassing easier, as described in <https://adv-r.hadley.nz/s3.html#s3-subclassing>.

- `==`, `!=`, `unique()`, `anyDuplicated()`, and `is.na()` are defined in terms of `vec_proxy_equal()`. `<`, `<=`, `>=`, `>`, `min()`, `max()`, `median()`, `quantile()`, and `xtfrm()` methods are defined in terms of `vec_proxy_compare()`. More details + examples at <https://vctrs.r-lib.org/articles/s3-vector.html#equality-and-comparison>

- `+`, `-`, `/`, `*`, `^`, `%%`, `%/%`, `!`, `&`, and `|` operators are defined using the double-dispatching `vec_arith()`. Mathematical functions including the Summary group generics, the Math group generics, and a handful of others are defined using `vec_math()`. More details at <https://vctrs.r-lib.org/articles/s3-vector.html#arithmetic>

These generics make creating a new vector more rewarding more quickly: you can easily sketch out the big picture before going back and filling in all the methods that make your class unique. More details at <https://vctrs.r-lib.org/articles/s3-vector.html>.

Hadley

--
http://hadley.nz

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel