I have been asked (in private) > Hi Martin,
y <- c(1, 2, 3, NA, 4) x <- c(1, 2, 2, 1, 1) t.test(y ~ x) lm(y ~ x) > Normally, most R functions follow the principle that > "missings should never silently go missing". Do you have > any background on why these functions (and most model/test > function in stats, I presume) silently drop missing values? And I think, the issue and an answer are important enough to be public, hence this posting to R-help : First note that in some sense it is not true that lm(y ~ x) silently drops NAs: Everybody who is taught about lm() is taught to look at summary( fm ) where fm <- lm(y ~ x) and that (for the above case) "says" ' (1 observation deleted due to missingness) ' and so is not entirely silent. This goes all back to the idea of having an 'na.action' argument which may be a function (a very good idea, "functional programming" in a double sense!)... which Chambers et al introduced in The White Book (*1) and which I think to remember was quite a revolutionary idea; at least I had liked that very much once I understood the beauty of passing functions as arguments to other functions. One problem already back then has been that we already had the ---much more primitive but often sufficient--- standard of an 'na.rm = FALSE' (i.e. default FALSE) argument. This has been tought in all good classes/course about statistical modeling with S and R ever since ... I had hoped .... (but it seems I was too optimistic, .. or many students have too quickly forgotten what they were taught ..) Notably the white book itself, and the MASS (*2) book do teach this.. though possibly not loudly enough. Two more decisions about this were made back then, as well: 1) The default for na.action to be na.omit (==> "silently dropping") 2) na.action being governed by options(na.action = ..) '1)' may have been mostly "historical": I think it had been the behavior of other "main stream" statistical packages back then (and now?) *and* possibly more importantly, notably with the later (than white book = "S3") advent of na.replace, you did want to keep the missing in your data frame, for later analysis; e.g. drawing (-> "gaps" in plots) so the NA *were* carried along and would not be forgotten, something very important in most case.s. and '2)' is something I personally no longer like very much, as it is "killing" the functional paradigm. OTOH, it has to be said in favor of that "session wide" / "global" setting options(na.action = *) that indeed it depends on the specific data analysis, or even the specific *phase* of a data analysis, *what* behavior of NA treatment is desired.... and at the time it was thought smart that all methods (also functions "deep down" called by user-called functtions) would automatically use the "currently desired" NA handling. There have been recommendations (I don't know exactly where and by whom) to always set options(na.action = na.fail) in your global .First() or nowadays rather your Rprofile, and I assume that some of the CRAN packages and some of the "teaching setups" would do that (and if you do that, the lm() and t.test() calls above give an error). Of course, there's much more to say about this, quite a bit of which has been published in scientific papers and books.. Martin Maechler ETH Zurich and R Core Team ---------------- *1) "The White Book": Chambers and Hastie (1992): https://www.r-project.org/doc/bib/R-books_bib.html#R:Chambers+Hastie:1992 *2) "MASS - the book": Venables and Ripley (2002 (= 4th ed.)): https://www.r-project.org/doc/bib/R-books_bib.html#R:Venables+Ripley:2002 ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.