Thank you very much. This will give me something to chew on for quite some time.
Kevin

---- [EMAIL PROTECTED] wrote:
> On 07-Oct-08 22:23:22, Bert Gunter wrote:
> > But it **is** indexed in both of V&R's MASS and S Programming.
> > I have no idea whether the info there will be helpful to you,
> > of course. I would find (and have found) it so.
> > -- Bert Gunter
>
> The discussion of factors in V&R is certainly quite comprehensive,
> but it is not for beginners!
>
> A more elementary and very readable published text is Peter Dalgaard's
> "Introductory Statistics with R".
>
> An even more introductory, but still adequate, account can be found
> in various places of Julian Faraway's "Practical Regression and Anova
> using R" which is on-line on CRAN under Documentation/Contributed.
>
> However, you will need to piece together the bigger picture from
> passages found in various places. There is no index, but a search
> for "factor" in the PDF file throws up:
>   pages 11; 69-70; Chapter 15 (160-167) -- especially section 15.2;
>   Chapter 16 (168-203) -- though this deals mainly with factorial
>   experimental designs.
>
> A reference with more detail at the technical level from the R
> viewpoint (but still well spelt out) is John Maindonald's
> "Using R for Data Analysis and Graphics - Introduction, Examples
> and Commentary", especially section 2.4. This is also on-line in
> the same section of CRAN.
>
> That being said, on the grounds that an introductory outline may
> also be useful to others, here is a summary.
>
> Factors are variables which, essentially, introduce a "contingency
> table" structure into the data (and they can co-exist with variables
> which have quantitative interpretation).
>
> A factor is a variable with categorical values -- an item is an "A",
> or a "B", or a "C", ... -- used in a particular way. It may or may
> not make sense to consider A, B, C, ... as ordered: A < B < C < ... say.
> For example, a variable called Sex may have values "M" (for Male)
> or "F" (for Female).
> Whether one can consider that M < F is something
> I will not discuss (though others may have a view).
>
> Or Social Class may have categories A (highest) > B > C > D > E
> (lowest). Or, say, an ecological classification of terrain may use
> "Grassland", "Forest", "Swamp" with no implication of any ordering:
> they are all on the same footing.
>
> The category labels of factors are called "Levels". As seen in the
> data, these labels may be alphabetic, numeric, or both -- e.g. M or F
> for Sex, which people also often code as 1 or 2 (but with no
> implication that 1 < 2); Terrain may be G, F or S or 1, 2, 3; Social
> Class may be subdivided into A1, A2, B1, B2, ... (with implied ordering
> A1 > A2 > B1 > B2 > ... ).
>
> In regression analysis, the usefulness of factors is that they
> allow comparison between the outcomes for different levels of
> the factors. In simple cases the result may be as simple as
> the difference between the mean of cases with level A and the
> mean of cases with level B of a single factor.
>
> This is where the plot starts to thicken. For example, if Terrain
> were coded 1, 2, 3 you would not want to treat these as quantitative
> values (even if they represented ordered levels). Instead, a factor
> with k levels is presented to the regression in terms of k "dummy
> variables". If the regression model has an intercept, then one
> level (the "base level") of the factor will be absorbed into the
> Intercept.
>
> So, for instance, data on weight (kg) might look like
>
>   Sex   Weight
>    M     69.5
>    F     60.2
>    F     65.7
>    M     72.5
>   ....
>
> This would be transformed into
>
>   Sex.M  Sex.F  Weight
>     1      0     69.5
>     0      1     60.2
>     0      1     65.7
>     1      0     72.5
>
> where, now, the 0s and 1s will have their *quantitative* interpretation.
> So the regression model Weight ~ Sex now becomes the quantitative
> regression
>
>   Weight = a + b.M*Sex.M + b.F*Sex.F + error
>
> using the values 0 and 1 of Sex.M and Sex.F quantitatively.
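
The dummy coding described so far can be reproduced in R with model.matrix(). This is only a minimal sketch, assuming the four rows shown are the whole data set; the data frame name d is made up for illustration. Note that R orders levels alphabetically by default, so the columns come out as SexF and SexM (with F as the base level) rather than in the Sex.M, Sex.F order used in the message.

```r
# Four-row toy data copied from the message; 'd' is a hypothetical name.
d <- data.frame(
  Sex    = factor(c("M", "F", "F", "M")),
  Weight = c(69.5, 60.2, 65.7, 72.5)
)

# Full indicator coding: one 0/1 column per level, no intercept column.
X_full <- model.matrix(~ Sex - 1, data = d)
print(X_full)

# Default coding with an intercept: the base level (F here, because it
# sorts first alphabetically) is absorbed into the "(Intercept)" column.
X_int <- model.matrix(~ Sex, data = d)
print(X_int)
```

In the full coding the two indicator columns sum to 1 in every row, which is the redundancy that lets one of them be dropped once an intercept is present.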
> However, since Sex.F + Sex.M = 1 throughout, one is redundant
> in the presence of the intercept (whose "dummy" equivalent has
> value 1 throughout). Hence the results of this regression will
> usually be presented as Intercept together with the coefficient
> of (say) Sex.F. However, if you left out the Intercept, giving
> the model formula Weight ~ Sex - 1, then the above data matrix
> with both dummy variables Sex.M and Sex.F would be used in full
> in the regression, which would fit the equation
>
>   Weight = b.M*Sex.M + b.F*Sex.F + error
>
> without redundancy (and in this case the coefficients would be
> the mean of the weights of Males [b.M] and the mean of the
> weights of Females [b.F]).
>
> If there are two factors in the regression, say Sex (M/F) and
> Diet (M = meat-eater, V = vegetarian), then the possibilities
> are richer. One might then have, for the regression model
>
>   Weight ~ Sex + Diet
>
>   Sex.M  Sex.F  Diet.M  Diet.V  Weight
>     1      0      0       1      69.5
>     0      1      0       1      60.2
>     0      1      0       1      65.7
>     1      0      0       1      72.5
>     1      0      1       0      74.5
>     0      1      1       0      65.2
>     0      1      1       0      70.7
>     1      0      1       0      77.5
>
> which would fit the equation
>
>   Weight = a + b.S.F*Sex.F + b.D.V*Diet.V + error
>
> with the same absorption of a base-level of each factor into the
> Intercept (since now we have 2 redundancies: for each factor,
> the two dummy variables add up to 1). The coefficient of Sex.F
> will represent a difference between Males and Females, the
> coefficient of Diet.V will represent a difference between
> meat-eaters and vegetarians. Because of the redundancies, an
> equivalent representation of the data used in the calculations is
>
>   Sex.F  Diet.V  Weight
>     0      1      69.5
>     1      1      60.2
>     1      1      65.7
>     0      1      72.5
>     0      0      74.5
>     1      0      65.2
>     1      0      70.7
>     0      0      77.5
>
> But now we have the opportunity to ask: Is the difference
> between meat-eater and vegetarian Males the same as the
> difference between meat-eater and vegetarian Females?
> Now we need the Interaction -- the difference, between Males and
> Females, of the two differences between the two diets: one
> difference evaluated for Males, the other for Females. This
> leads to the regression model
>
>   Weight ~ Sex * Diet, equivalent to Weight ~ Sex + Diet + Sex:Diet
>
> and we now need a further dummy variable for the different
> combinations of levels of the two factors:
>
>   Sex.F  Diet.V  Sex.F:Diet.V  Weight
>     0      1          0         69.5
>     1      1          1         60.2
>     1      1          1         65.7
>     0      1          0         72.5
>     0      0          0         74.5
>     1      0          0         65.2
>     1      0          0         70.7
>     0      0          0         77.5
>
> where the variable Sex.F:Diet.V has the value 1 when Sex.F=1
> and Diet.V=1, and the value 0 otherwise.
>
> This is all very basic and straightforward (though can appear
> more complicated in richer problems). But the point about using
> a variable of "factor" type in R is beginning to emerge. When
> there is a factor with k levels, you need (k-1) dummy variables
> as quantitative variables for the regression. Interactions
> introduce further dummy variables. For all this to happen, a
> variable which is going to be used as a factor needs a special
> representation inside R, so that R knows how to set about
> constructing all that stuff. So, in R, a factor is not a simple
> list of levels (like c("M","F","F","M","M","F","F","M")), but
> a more elaborate encoding, and a more complex structure.
>
> Once past this stage, there is then the question of what
> system of *contrasts* is going to be used. For 2-level factors
> (as above) there are not many issues which arise -- the effect
> of a factor corresponds to a simple difference between the
> results corresponding to its two levels. But, say, for the
> Terrain factor (G,F,S) there are several ways in which differences
> can be formulated.
> For example:
>   G, F-G, S-G   ("treatment contrasts")
>
> Or, for Social Class (ordered, A>B>C>D>E)
>   D-E, C-D, B-C, A-B   ("successive difference contrasts")
>   E, D-E, C-(mean of D&E), B-(mean of C&D&E), A-(mean of B&C&D&E)
>   ("Helmert contrasts")
>
> and so on. What system of contrasts you use will depend on what
> aspects of the differences between categories you are interested in.
>
> And then the contrast specification also has to be part of the
> specification of a factor (since it determines how to compute
> the dummy variables which will represent it in the regression).
> See John Maindonald's on-line book.
>
> Hoping this helps!
> Ted.
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On
> > Behalf Of [EMAIL PROTECTED]
> > Sent: Tuesday, October 07, 2008 2:29 PM
> > To: r-help@r-project.org
> > Subject: [R] Factor tutorial?
> >
> > This is probably a very basic question. I want to understand factors
> > but I am not sure where to turn. Looking up factor in the Chambers
> > book doesn't even show up in the index. Maybe I am just slow but
> > ?factor doesn't help either. Would someone please point me to a very
> > basic tutorial where I can see what the usefulness of factors is (so
> > far they have just gotten in the way).
> >
> > Thank you.
> >
> > Kevin
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
> Fax-to-email: +44 (0)870 094 0861
> Date: 08-Oct-08                                       Time: 01:30:31
> ------------------------------ XFMail ------------------------------
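
For anyone who wants to try the two-factor example and the contrast schemes from the quoted summary, here is a rough R sketch. It assumes the eight rows in the message are the complete data; the names d, X, fit_add, fit_int and terrain are made up for illustration. The levels are listed with M first so that M is the base level of both factors, which should reproduce the Sex.F, Diet.V and Sex.F:Diet.V columns (R names them SexF, DietV and SexF:DietV). Only functions from base R's stats package are used; the "successive difference" scheme is, as far as I know, available as contr.sdif in the MASS package and is not run here.

```r
# Eight-row toy data from the message (Diet: M = meat-eater, V = vegetarian).
d <- data.frame(
  Sex    = c("M", "F", "F", "M", "M", "F", "F", "M"),
  Diet   = c("V", "V", "V", "V", "M", "M", "M", "M"),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5)
)

# Convert to factors, listing M first so that M is the base level of both.
d$Sex  <- factor(d$Sex,  levels = c("M", "F"))
d$Diet <- factor(d$Diet, levels = c("M", "V"))

# A factor is stored as integer codes plus a "levels" attribute,
# not as a plain character vector.
str(unclass(d$Sex))

# Design matrix for Weight ~ Sex * Diet: intercept, two main-effect
# dummies, and their product as the interaction column.
X <- model.matrix(~ Sex * Diet, data = d)
print(X)

# Additive fit and fit with interaction.
fit_add <- lm(Weight ~ Sex + Diet, data = d)
fit_int <- lm(Weight ~ Sex * Diet, data = d)
print(coef(fit_add))
print(coef(fit_int))

# Contrast matrices: rows are levels, columns are the dummy variables.
print(contr.treatment(c("G", "F", "S")))          # treatment contrasts, base level G
print(contr.helmert(c("E", "D", "C", "B", "A")))  # R's version of Helmert contrasts

# A non-default coding can be attached to a factor before fitting.
terrain <- factor(c("G", "F", "S", "G"), levels = c("G", "F", "S"))
contrasts(terrain) <- contr.helmert(3)
print(contrasts(terrain))
```

The interaction column of X is simply the elementwise product of the SexF and DietV columns, which matches the "1 when Sex.F=1 and Diet.V=1, 0 otherwise" rule in the summary.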