categorical and overfitting

Patrick Agin Mon, 30 Oct 2000 14:45:27 -0800

Hi,

Suppose X is a matrix of dimension n x k, where n is the number of
points in the sample and k the number of independent variables. I
want to estimate the parameters b under the simple linear model:
  E(y) = Xb

I have one variable x_i that is categorical, so I transformed it into a
"coding variable" (or dummy variable): a string of 0's with a 1 at the
position corresponding to the observed category. Suppose now that we have
n=100 and k=4, with 3 x_i's that are "real" and the fourth a
categorical variable with 4 categories. I represent this as follows
(sorting on the D's):

X1    X2    X3    D1   D2   D3   D4
..    ..    ..    1    0    0    0    *
..    ..    ..    1    0    0    0    * the sub-block for which X4 belongs to the first category
..    ..    ..    1    0    0    0    *
..    ..    ..    0    1    0    0    |
..    ..    ..    0    1    0    0    |
..    ..    ..    0    1    0    0    |  the sub-block for which X4 belongs to the second category
..    ..    ..    0    1    0    0    |
..    ..    ..    0    1    0    0    |
   .
   .
   .
   .
..    ..    ..    0    0    0    1    *
..    ..    ..    0    0    0    1    * the sub-block for which X4 belongs to the fourth category
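
The coding shown in the table can be sketched in a few lines of Python. This is only an illustration: the function name dummy_code and the category labels "c1" to "c4" are hypothetical and not part of the original problem.

```python
def dummy_code(labels, categories):
    """Return one 0/1 row per label, with a single 1 at the label's category."""
    index = {cat: j for j, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return rows

# Hypothetical labels: three points in the first category, five in the second,
# matching the first two sub-blocks drawn in the table.
labels = ["c1"] * 3 + ["c2"] * 5
D = dummy_code(labels, ["c1", "c2", "c3", "c4"])

print(D[0])  # [1, 0, 0, 0]
print(D[3])  # [0, 1, 0, 0]
```

Every row contains exactly one 1, so the four D columns always sum to 1.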


As you can see, the first sub-block (the one for which the categorical
variable X4 is equal to the first category, in other words for which
D1=1) includes only 3 points. I'm wondering if I can have any
confidence in the estimated beta of D1. Don't we have a serious
problem of overfitting here? I mean, if I consider the "local model" in
which the beta of D1 is estimated (in which the "development" of the beta
occurs), it has 4 parameters and only 3 points! So having 100 points and
7 variables seems to be OK at first glance, but when some categories
include fewer points than the number of non-categorical variables plus
one (here 3+1=4), isn't that a problem?
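
One way to look at this question numerically is to fit the model on simulated data and compare the standard errors of the dummy coefficients. The sketch below uses NumPy and rests on several assumptions that are not in the original problem: the sizes of the second, third and fourth categories, the "true" coefficients, and unit-variance normal noise are all made up. Only the 3 points in the first category and n=100 come from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed group sizes: 3 points in category 1 (as in the post); the rest are made up.
sizes = [3, 5, 46, 46]
n = sum(sizes)  # 100

# Three continuous regressors plus the 4 dummy columns. There is no separate
# intercept column, because the dummies already sum to 1 in every row.
X_cont = rng.normal(size=(n, 3))
group = np.repeat(np.arange(4), sizes)
D = np.eye(4)[group]
X = np.hstack([X_cont, D])  # n x 7 design matrix

# Simulated response with arbitrary coefficients and unit-variance noise.
beta_true = np.array([1.0, -2.0, 0.5, 3.0, 1.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

# Ordinary least squares fit of E(y) = Xb.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical standard errors: sqrt(s^2 * diag((X'X)^-1)), with s^2 on n - 7 df.
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

names = ["X1", "X2", "X3", "D1", "D2", "D3", "D4"]
for name, b, e in zip(names, beta_hat, se):
    print(f"{name}: estimate {b: .3f}, standard error {e:.3f}")
```

In this sketch the slopes on X1, X2 and X3 are shared by all categories, so the only coefficient specific to the three-point block is that of D1. Its standard error comes out larger than that of D3 or D4, which may be a more direct measure of how well it is determined than a count of parameters within the block.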

Thank you for any hint!
Patrick




