[Rd] An update on the vctrs package

Hadley Wickham Mon, 05 Nov 2018 16:25:17 -0800

Hi all,

I wanted to give you an update on vctrs (<https://vctrs.r-lib.org/>)
since I last brought it up here in August. The biggest change is that I now
have a much clearer idea of what vctrs is! I’ll summarise that here,
and point you to the documentation if you’re interested in learning
more. I’m planning on submitting vctrs to CRAN in the near future, but
it’s very much a 0.1.0 release and I expect it to continue to evolve as
more people try it out and give me feedback. I’d love to hear your
thoughts!


vctrs has three main goals:

  - To define and motivate `vec_size()` and `vec_type()` as alternatives
    to `length()` and `class()`.

  - To define type- and size-stability, useful tools for analysing
    function interfaces.

  - To make it easier to create new S3 vector classes.

## Size and prototype

`vec_size()` was motivated by my desire to have a function that captures
the number of “observations” in a vector. This is particularly important
for data frames because it’s useful to have a function such that
`f(data.frame(x))` equals `f(x)`. No base function has this property:
`NROW()` comes closest, but because it’s defined in terms of `length()`
for dimensionless objects, it always returns a number, even for types
that can’t go in a data frame, e.g. `data.frame(mean)` errors even
though `NROW(mean)` is `1`.

``` r
vec_size(1:10)
#> [1] 10
vec_size(as.POSIXlt(Sys.time() + 1:10))
#> [1] 10
vec_size(data.frame(x = 1:10))
#> [1] 10
vec_size(array(dim = c(10, 4, 1)))
#> [1] 10
vec_size(mean)
#> Error: `x` is a not a vector
```
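
The `NROW()` behaviour described above can be checked in base R alone. This is a small sketch that does not need vctrs; `res` is just an illustrative variable name:

``` r
# NROW() matches the number of observations for vectors and data frames.
NROW(1:10)
#> [1] 10
NROW(data.frame(x = 1:10))
#> [1] 10

# For objects without dimensions it falls back to length(), so it
# returns a number even for a function.
NROW(mean)
#> [1] 1

# A function cannot be a data frame column, so construction fails.
res <- try(data.frame(mean), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```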

`vec_size()` is paired with `vec_slice()` for subsetting, i.e.
`vec_slice()` is to `vec_size()` as `[` is to `length()`;
`vec_slice(data.frame(x), i)` equals `data.frame(vec_slice(x, i))`
(modulo variable/row names).

(I plan to make `vec_size()` and `vec_slice()` generic in the next
release, providing another point of differentiation from `NROW()`.)
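
To make the slicing invariant concrete, here is a minimal base R sketch. `slice_obs()` is a hypothetical helper written for this illustration; it is not how `vec_slice()` is implemented, and it only covers atomic vectors, arrays and data frames:

``` r
# Hypothetical helper: select observations along the first dimension.
slice_obs <- function(x, i) {
  if (is.data.frame(x)) {
    return(x[i, , drop = FALSE])
  }
  d <- dim(x)
  if (is.null(d)) {
    return(x[i])
  }
  # Index the first dimension with `i`; TRUE keeps every other dimension whole.
  rest <- rep(list(TRUE), length(d) - 1)
  do.call(`[`, c(list(x, i), rest, list(drop = FALSE)))
}

x <- c(10, 20, 30, 40)
i <- c(2, 4)

lhs <- slice_obs(data.frame(x = x), i)
rhs <- data.frame(x = slice_obs(x, i))

# The columns agree; only the row names differ.
identical(lhs$x, rhs$x)
#> [1] TRUE

# Arrays are sliced along the first dimension only.
dim(slice_obs(array(1:40, dim = c(10, 4, 1)), 1:3))
#> [1] 3 4 1
```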

Complementary to the size of a vector is its prototype, a
zero-observation slice of the vector. You can compute this using
`vec_type()`, but because many classes don’t have an informative print
method for a zero-length vector, I also provide `vec_ptype()` which
prints a brief summary. As well as the class, the prototype also
captures important attributes:

``` r
vec_ptype(1:10)
#> Prototype: integer
vec_ptype(array(1:40, dim = c(10, 4, 1)))
#> Prototype: integer[,4,1]
vec_ptype(Sys.time())
#> Prototype: datetime<local>
vec_ptype(data.frame(x = 1:10, y = letters[1:10]))
#> Prototype: data.frame<
#>   x: integer
#>   y: factor<5e105>
#> >
```
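
The idea of a prototype as a zero-observation slice can be approximated in base R by subsetting with `0`. The sketch below uses only base functions and is meant to show which attributes survive such a slice, not to reproduce what `vec_type()` returns:

``` r
# A zero-length factor still carries its levels.
f <- factor(c("a", "b", "a"))
length(f[0])
#> [1] 0
levels(f[0])
#> [1] "a" "b"

# A zero-length date-time keeps its class.
now <- Sys.time()
class(now[0])
#> [1] "POSIXct" "POSIXt"

# A zero-row data frame keeps its column names and column types.
df <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))
proto <- df[0, , drop = FALSE]
nrow(proto)
#> [1] 0
names(proto)
#> [1] "x" "y"
typeof(proto$x)
#> [1] "integer"
```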

`vec_size()` and `vec_type()` are accompanied by functions that either
find or enforce a common size (using modified recycling rules) or common
type (by reducing with a double-dispatching `vec_type2()` that determines the
common type from a pair of types).
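
The "reduce a pairwise rule" pattern can be sketched in base R. All of the names below (`size2()`, `common_size()`, `type2()`, `common_type()`, `type_order`) are hypothetical. The size rule assumes that only size-1 inputs are recycled, which is my understanding of the vctrs rules, and the type rule is a simple lookup over three base types rather than the double dispatch that `vec_type2()` uses:

``` r
# Pairwise size rule (assumption: only size 1 recycles).
size2 <- function(nx, ny) {
  if (nx == ny || ny == 1L) return(nx)
  if (nx == 1L) return(ny)
  stop("Incompatible sizes: ", nx, " and ", ny, call. = FALSE)
}
common_size <- function(...) Reduce(size2, lapply(list(...), NROW))

# Pairwise type rule over a tiny hierarchy: logical < integer < double.
type_order <- c("logical", "integer", "double")
type2 <- function(tx, ty) {
  rank <- match(c(tx, ty), type_order)
  if (anyNA(rank)) {
    stop("No common type for ", tx, " and ", ty, call. = FALSE)
  }
  type_order[max(rank)]
}
common_type <- function(...) Reduce(type2, lapply(list(...), typeof))

common_size(1:3, 1, 4:6)
#> [1] 3
common_type(TRUE, 1L, 2.5)
#> [1] "double"

# Sizes 2 and 3 have no common size under this rule.
inherits(try(common_size(1:2, 1:3), silent = TRUE), "try-error")
#> [1] TRUE
```

Because each pairwise rule is applied left to right with `Reduce()`, supporting a new type or size convention only means changing the two-argument function.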

You can read more about `vec_size()` and `vec_type()` at
<https://vctrs.r-lib.org/articles/type-size.html>.

## Stability

The definitions of size and prototype are motivated by my experiences
doing code review. I find that I can often spot problems by running R
code in my head. Obviously my mental R interpreter is much simpler than
the real interpreter, but it seems to focus on prototypes and sizes, and
I’m suspicious of code where I can’t easily predict the class of every
new variable.

This leads me to two definitions. A function is **type-stable** iff:

  - You can predict the output type knowing only the input types.
  - The order of arguments in `...` does not affect the output type.

Similarly, a function is **size-stable** iff:

  - You can predict the output size knowing only the input sizes, or
    there is a single numeric input that specifies the output size.
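
As a rough illustration of the size-stability definition, the base R snippet below uses `length()` as a stand-in for size, which is only reasonable for plain atomic vectors:

``` r
# Size-stable: the output length equals the input length.
x <- c(3, 1, 3)
length(sqrt(x)) == length(x)
#> [1] TRUE

# Size-stable: a single numeric input specifies the output length.
length(seq_len(5))
#> [1] 5

# Not size-stable: two inputs of length 3 give outputs of different lengths.
length(unique(c(1, 1, 2)))
#> [1] 2
length(unique(c(1, 2, 3)))
#> [1] 3
```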

For example, `ifelse()` is type-unstable because the output type can be
different even when the input types are the same:

``` r
vec_ptype(ifelse(NA, 1L, 1L))
#> Prototype: logical
vec_ptype(ifelse(FALSE, 1L, 1L))
#> Prototype: integer
```
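
For contrast, here is a minimal sketch of what a type-stable variant could look like in base R. `if_else_stable()` is a hypothetical function written for this illustration; it is not part of vctrs or base R, and it simply insists that `yes` and `no` share a type:

``` r
if_else_stable <- function(test, yes, no) {
  stopifnot(is.logical(test), identical(typeof(yes), typeof(no)))
  n <- length(test)
  yes <- rep_len(yes, n)
  no <- rep_len(no, n)

  # Start from `no`, so the output type is fixed by the inputs.
  out <- no
  pick <- which(test)
  out[pick] <- yes[pick]
  # Assigning a logical NA does not change the type of `out`.
  out[is.na(test)] <- NA
  out
}

typeof(if_else_stable(NA, 1L, 1L))
#> [1] "integer"
typeof(if_else_stable(FALSE, 1L, 1L))
#> [1] "integer"
```

The output type now depends only on the types of `yes` and `no`, at the cost of rejecting inputs that `ifelse()` would silently coerce.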

Size-stability is generally not useful for analysing base R functions
because the definition is a bit too far away from base conventions. The
analogously defined length-stability is a bit better, but the definition
of length for non-vectors means that complete length-stability is rare.
For example, while `length(c(x, y))` usually equals `length(x) +
length(y)`, it does not hold for all possible inputs:

``` r
length(globalenv())
#> [1] 0
length(mean)
#> [1] 1
length(c(mean, globalenv()))
#> [1] 2
```

(I don’t mean to pick on base here; the tidyverse also has many
functions that violate these principles, but I wanted to stick to
functions that all readers would be familiar with.)

Type- and size-stable functions are desirable because they make it
possible to reason about code without knowing the precise values
involved. Of course, not all functions should be type- or size-stable: R
would be incredibly limited if you could predict the type or size of
`[[` and `read.csv()` without knowing the specific inputs! But where
possible, I think using type- and size-stable functions makes code
easier to reason about and hence more likely to be bug free.

You can read more about size- and type-stability at
<https://vctrs.r-lib.org/articles/stability.html>. This vignette
includes a detailed analysis of `c()` and a type- and size-stable
alternative called `vec_c()`.
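
One base R case where the order of arguments changes the output type is combining a date with a date-time. `c()` dispatches on its first argument, so the class of the result follows whichever value comes first. The variable names `d` and `dt` are only for illustration:

``` r
d <- as.Date("2018-11-05")
dt <- as.POSIXct("2018-11-05 16:25:17", tz = "UTC")

# Same two inputs, different output class depending on order.
class(c(d, dt))
#> [1] "Date"
class(c(dt, d))
#> [1] "POSIXct" "POSIXt"
```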

## New vector types

Finally, vctrs provides `new_vctr()` and `new_rcrd()` to make it easier
to define new classes, following the conventions that I’ve found
helpful, including writing a constructor function that enforces the
types of the underlying vector and its attributes (more details at
<https://adv-r.hadley.nz/s3.html>). vctrs also makes life easier by
implementing many base generics in terms of a small set of primitives:

  - At the simplest level, `print()` and `str()` are defined in terms of
    `format()`. `as.data.frame()` is implemented using the standard
    approach used for factor, POSIXct, Date etc.

  - `[[` and `[` use `NextMethod()` dispatch to the underlying base
    function, then restore attributes with `vec_restore()`. I’m not sure
    what the base equivalent of `vec_restore()` is, but it makes
    subclassing easier, as described in
    <https://adv-r.hadley.nz/s3.html#s3-subclassing>.

  - `==`, `!=`, `unique()`, `anyDuplicated()`, and `is.na()` are defined
    in terms of `vec_proxy_equal()`. `<`, `<=`, `>=`, `>`, `min()`,
    `max()`, `median()`, `quantile()`, and `xtfrm()` methods are defined
    in terms of `vec_proxy_compare()`. More details and examples at
    <https://vctrs.r-lib.org/articles/s3-vector.html#equality-and-comparison>

  - `+`, `-`, `/`, `*`, `^`, `%%`, `%/%`, `!`, `&`, and `|` operators
    are defined in terms of a double-dispatching `vec_arith()`.
    Mathematical functions including the Summary group generics, the
    Math group generics, and a handful of others are defined using
    `vec_math()`. More details at
    <https://vctrs.r-lib.org/articles/s3-vector.html#arithmetic>
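
The proxy idea behind the equality generics can be sketched in base R without vctrs. Everything here is hypothetical and written only for illustration: `proxy_equal()` stands in for the role that `vec_proxy_equal()` plays, and `ci_string` is a toy class of case-insensitive strings:

``` r
# Toy class: strings that compare without regard to case.
ci_string <- function(x = character()) {
  stopifnot(is.character(x))
  structure(x, class = "ci_string")
}

# One primitive: map a vector to plain values that can be compared.
proxy_equal <- function(x) UseMethod("proxy_equal")
proxy_equal.default <- function(x) unclass(x)
proxy_equal.ci_string <- function(x) tolower(unclass(x))

# Several methods defined purely in terms of that primitive.
`==.ci_string` <- function(e1, e2) proxy_equal(e1) == proxy_equal(e2)
unique.ci_string <- function(x, incomparables = FALSE, ...) {
  keep <- !duplicated(proxy_equal(x))
  ci_string(unclass(x)[keep])
}

x <- ci_string(c("A", "a", "b"))
x == c("a", "A", "B")
#> [1]  TRUE  TRUE  TRUE
length(unique(x))
#> [1] 2
```

A class author only has to get `proxy_equal()` right; the comparison and de-duplication methods then follow from it.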

These generics make creating a new vector more rewarding more quickly:
you can easily sketch out the big picture before going back and filling
in all the methods that make your class unique. More details at
<https://vctrs.r-lib.org/articles/s3-vector.html>.
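
To give a flavour of these conventions without depending on vctrs, here is a small base R sketch. The `percent` class and its methods are hypothetical; with vctrs, `new_vctr()` would supply most of these methods for you:

``` r
# Constructor that enforces the type of the underlying vector.
new_percent <- function(x = double()) {
  stopifnot(is.double(x))
  structure(x, class = "percent")
}

# format() is the one method the class has to supply.
format.percent <- function(x, ...) {
  paste0(format(unclass(x) * 100, ...), "%")
}

# print() is defined in terms of format().
print.percent <- function(x, ...) {
  print(format(x, ...), quote = FALSE)
  invisible(x)
}

# `[` defers to the base method, then restores the class.
`[.percent` <- function(x, i, ...) {
  new_percent(NextMethod())
}

p <- new_percent(c(0.1, 0.25, 0.5))
class(p[2])
#> [1] "percent"
format(p[2])
#> [1] "25%"
```

Re-wrapping the result of `NextMethod()` in the constructor is the hand-written counterpart of the attribute-restoring step that `vec_restore()` performs.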

Hadley

-- 
http://hadley.nz

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
