Hi,
As far as I am aware, the model.matrix function does not return
perfect metadata on what each column of the model matrix "means".
The columns are named (e.g. age:genderM), but encoding the metadata as
strings can result in ambiguity. For example, the dummy variable for
level 0 of a factor var0 and the dummy variable for level 00 of a
factor var are both named var00. Additionally, if a level of a factor
contains a colon, the column name could be mistaken for an interaction.
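A minimal sketch of both ambiguities, written in C and independent of R. The helpers column_name, interaction_name and names_collide are hypothetical: they only mimic pasting a variable name onto a level label and joining two names with a colon, and are not R's actual naming code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not R's code): name a dummy column by pasting the
   level label directly onto the variable name. */
static void column_name(char *out, size_t n, const char *var, const char *level)
{
    int written = snprintf(out, n, "%s%s", var, level);
    assert(written >= 0 && (size_t) written < n);  /* name must fit in out */
}

/* Hypothetical helper: name an interaction column by joining two column
   names with a colon. */
static void interaction_name(char *out, size_t n, const char *a, const char *b)
{
    int written = snprintf(out, n, "%s:%s", a, b);
    assert(written >= 0 && (size_t) written < n);  /* name must fit in out */
}

/* Returns 1 when two column names are identical strings, 0 otherwise. */
static int names_collide(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Under these assumptions, column_name with ("var0", "0") and with ("var", "00") both produce var00. Likewise, a factor x with a level labelled 1:y2 produces x1:y2, the same string as the interaction of columns x1 and y2.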
While a human can generally work out the meaning of each column by
inspection, I am interested in achieving this programmatically.
My solution is to edit the modelmatrix function in
/src/library/stats/src/model.c to additionally return the following:
intrcept
factors
contr1
contr2
count
With these available in R, it is possible to determine the precise
meaning of each column without error-prone string parsing. I have
attached my edit: see lines 753-764.
I am seeking advice on this approach. Am I missing a simpler way of
achieving this (which perhaps avoids rebuilding R)?
Since model.matrix is used in so many modeling functions this would be
very helpful for the programmatic interpretation of model output. A
search on the Internet suggests there are other R users who would
welcome such functionality.
Many thanks in advance,
Pat O'Reilly
/*
* R : A Computer Language for Statistical Data Analysis
* Copyright (C) 1995, 1996 Robert Gentleman and Ross Ihaka
* Copyright (C) 1997--2013 The R Core Team
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, a copy is available at
* http://www.r-project.org/Licenses/
*/
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif
#include <Defn.h>
#include "statsR.h"
#undef _
#ifdef ENABLE_NLS
#include <libintl.h>
#define _(String) dgettext ("stats", String)
#else
#define _(String) (String)
#endif
/* inline-able versions, used just once! */
static R_INLINE Rboolean isUnordered_int(SEXP s)
{
    return (TYPEOF(s) == INTSXP
            && inherits(s, "factor")
            && !inherits(s, "ordered"));
}

static R_INLINE Rboolean isOrdered_int(SEXP s)
{
    return (TYPEOF(s) == INTSXP
            && inherits(s, "factor")
            && inherits(s, "ordered"));
}
/*
* model.frame
*
* The argument "terms" contains the terms object generated from the
* model formula. We first evaluate the "variables" attribute of
* "terms" in the "data" environment. This gives us a list of basic
* variables to be in the model frame. We do some basic sanity
* checks on these to ensure that the resulting object makes sense.
*
* The argument "dots" gives additional things like "weights", "offsets"
* and "subset" which will also go into the model frame so that they can
* be treated in parallel.
*
* Next we subset the data frame according to "subset" and finally apply
* "na.action" to get the final data frame.
*
* Note that the "terms" argument is glued to the model frame as an
* attribute. Code downstream appears to need this.
*
*/
/* model.frame(terms, rownames, variables, varnames, */
/* dots, dotnames, subset, na.action) */
SEXP modelframe(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP terms, data, names, variables, varnames, dots, dotnames, na_action;
    SEXP ans, row_names, subset, tmp;
    char buf[256];
    int i, j, nr, nc;
    int nvars, ndots, nactualdots;
    const void *vmax = vmaxget();

    args = CDR(args);
    terms = CAR(args); args = CDR(args);
    row_names = CAR(args); args = CDR(args);
    variables = CAR(args); args = CDR(args);
    varnames = CAR(args); args = CDR(args);
    dots = CAR(args); args = CDR(args);
    dotnames = CAR(args); args = CDR(args);
    subset = CAR(args); args = CDR(args);
    na_action = CAR(args);

    /* Argument Sanity Checks */
    if (!isNewList(variables))
        error(_("invalid variables"));
    if (!isString(varnames))
        error(_("invalid variable names"));
    if ((nvars = length(variables)) != length(varnames))
        error(_("number of variables != number of variable names"));
    if (!isNewList(dots))
        error(_("invalid extra variables"));
    if ((ndots = length(dots)) != length(dotnames))
        error(_("number of variables != number of variable names"));
    if ( ndots && !isString(dotnames))
        error(_("invalid extra variable names"));

    /* check for NULL extra arguments -- moved from interpreted code */
    nactualdots = 0;
    for (i = 0; i < ndots; i++)
        if (VECTOR_ELT(dots, i) != R_NilValue) nactualdots++;
/* Assemble