Hi,
Suppose X is a matrix of dimension n x k, where n is the number of
points in the sample and k is the number of independent variables. I
want to estimate the parameters b under the simple linear model:
E(y) = Xb
One of my variables x_i is categorical, so I transformed it into a
"coding variable" (or dummy variable): a string of 0's with a 1 at the
position corresponding to the observed category. Suppose now that we have
n=100 and k=4, with 3 x_i's that are "real" and a fourth that is a
categorical variable with 4 categories. I represent this as follows
(sorting on the D's):
X1 X2 X3 D1 D2 D3 D4
.. .. ..  1  0  0  0  *
.. .. ..  1  0  0  0  *  the sub-block for which X4 belongs to the first category
.. .. ..  1  0  0  0  *
.. .. ..  0  1  0  0  |
.. .. ..  0  1  0  0  |
.. .. ..  0  1  0  0  |  the sub-block for which X4 belongs to the second category
.. .. ..  0  1  0  0  |
.. .. ..  0  1  0  0  |
.
.
.
.
.. .. ..  0  0  0  1  *
.. .. ..  0  0  0  1  *  the sub-block for which X4 belongs to the fourth category
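The coding shown above can be sketched in a few lines of Python with NumPy. This is only an illustration under assumptions: the group sizes other than the 3 points in the first category, the simulated values of X1 to X3, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical group sizes summing to n = 100; only the 3 points in the
# first category come from the example, the other sizes are made up.
group_sizes = [3, 5, 42, 50]
n = sum(group_sizes)

# Three real-valued regressors X1..X3 (simulated, purely illustrative).
X_real = rng.normal(size=(n, 3))

# Category index (0..3) for each row, sorted by category as in the table.
category = np.repeat(np.arange(4), group_sizes)

# Dummy coding: row i gets a 1 in column category[i] and 0 elsewhere.
D = np.eye(4)[category]

# Full design matrix [X1 X2 X3 D1 D2 D3 D4]. No separate intercept column
# is added, because D1..D4 already sum to 1 in every row.
X = np.hstack([X_real, D])

print(X.shape)  # (100, 7)
```

Leaving out a separate intercept column is a choice made here so that the 7 columns stay linearly independent; with an intercept, one of the four dummies would have to be dropped.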
As you can see, the first sub-block (the one for which the categorical
variable X4 is in the first category, in other words for which
D1=1) includes only 3 points. I'm wondering whether I can have any
confidence in the estimated beta of D1. Don't we have a serious
overfitting problem here? I mean, if I consider the "local model" in which
the beta of D1 is estimated (in which the "development" of the beta
occurs), it has 4 parameters and only 3 points! So having 100 points and
7 variables seems OK at first glance, but when some
categories include fewer points than the number of
(non-categorical + 1) variables (here 3+1=4), isn't that a problem?
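One way to look at this numerically is to fit the model on simulated data and compare the standard errors of the dummy coefficients. The sketch below is a hedged illustration, not a definitive answer: the group sizes (apart from the 3 points in the first category), the true coefficients, the unit noise variance, and all names are assumptions. It uses the usual OLS variance formula, sigma^2 (X'X)^{-1}, with n - 7 = 93 residual degrees of freedom, since the coefficients of X1 to X3 are shared by all 100 points under this model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical group sizes; only the 3 comes from the example above.
group_sizes = [3, 5, 42, 50]
n = sum(group_sizes)
category = np.repeat(np.arange(4), group_sizes)

# Design matrix [X1 X2 X3 D1 D2 D3 D4] with simulated real regressors.
X = np.hstack([rng.normal(size=(n, 3)), np.eye(4)[category]])

# Simulated response under assumed coefficients and unit-variance noise.
b_true = np.array([1.0, -2.0, 0.5, 3.0, 1.0, 0.0, -1.0])
y = X @ b_true + rng.normal(size=n)

# Ordinary least squares fit of E(y) = Xb.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance estimate with n - 7 degrees of freedom.
resid = y - X @ b_hat
dof = n - X.shape[1]
sigma2_hat = resid @ resid / dof

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

names = ["X1", "X2", "X3", "D1", "D2", "D3", "D4"]
for name, est, s in zip(names, b_hat, se):
    print(f"{name}: estimate {est:7.3f}, std. error {s:6.3f}")
```

In this setup the diagonal entry of (X'X)^{-1} for D1 is at least 1/3, because only 3 rows have D1=1, so its standard error is noticeably larger than that of D4 even though the fit as a whole is not saturated.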
Thank you for any hint!
Patrick