[Apologies -- I made an error (see at [***] near the end)]

On 24-May-09 19:07:46, Ted Harding wrote:
> [Your data and output listings removed. For comments, see at end]
>
> On 24-May-09 13:01:26, cdm wrote:
>> Fellow R Users:
>> I'm not extremely familiar with lda or R programming, but a recent
>> editorial review of a manuscript submission has prompted a crash
>> course. I am on this forum hoping I could solicit some much needed
>> advice for deriving a classification equation.
>>
>> I have used three basic measurements in lda to predict two groups:
>> male and female. I have a working model, low Wilks' lambda, graphs,
>> coefficients, eigenvalues, etc. (see below). I adjusted the sample
>> analysis for Fisher's or Anderson's Iris data provided in the MASS
>> library for my own data.
>>
>> My final step is simply to form the classification equation.
>> The classification equation simply uses standardized coefficients
>> to classify each group, in this case male or female. A more thorough
>> explanation is provided:
>>
>> "For cases with an equal sample size for each group the classification
>> function coefficient (Cj) is expressed by the following equation:
>>
>>   Cj = cj0 + cj1x1 + cj2x2 + ... + cjpxp
>>
>> where Cj is the score for the jth group, j = 1 ... k, cj0 is the
>> constant for the jth group, and x = raw scores of each predictor.
>> If W = within-group variance-covariance matrix, and M = column matrix
>> of means for group j, then the constant cj0 = (-1/2)CjMj" (Julia
>> Barfield, John Poulsen, and Aaron French
>> http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discriminant.htm).
>>
>> I am unable to navigate this last step based on the R output I have.
>> I only have the linear discriminant coefficients for each predictor
>> that would be needed to complete this equation.
>>
>> Please, if anybody is familiar or able to help please let me know.
>> There is a spot in the acknowledgments for you.
>>
>> All the best,
>> Chase Mendenhall
>
> The first thing I did was to plot your data. This indicates in the
> first place that a perfect discrimination can be obtained on the
> basis of your variables WRMA_WT and WRMA_ID alone (names abbreviated
> to WG, WT, ID, SEX):
>
>   D0 <- read.csv("horsesLDA.csv")
>   # names(D0) # "WRMA_WG" "WRMA_WT" "WRMA_ID" "WRMA_SEX"
>   WG<-D0$WRMA_WG; WT<-D0$WRMA_WT;
>   ID<-D0$WRMA_ID; SEX<-D0$WRMA_SEX
>
>   ix.M<-(SEX=="M"); ix.F<-(SEX=="F")
>
>   ## Plot WT vs ID (M & F)
>   plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
>   points(ID[ix.M],WT[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WT[ix.F],pch="+",col="red")
>   lines(ID,15.5-1.0*(ID))
>
> and that there is a lot of possible variation in the discriminating
> line WT = 15.5-1.0*(ID)
>
> Also, it is apparent that the covariance between WT and ID for Females
> is different from the covariance between WT and ID for Males. Hence
> the assumption (of common covariance matrix in the two groups) for
> standard LDA (which you have been applying) does not hold.
>
> Given that the sexes can be perfectly discriminated within the data
> on the basis of the linear discriminator (WT + ID) (and others),
> the variable WG is in effect a close approximation to noise.
>
> However, to the extent that there was a common covariance matrix
> to the two groups (in all three variables WG, WT, ID), and this
> was well estimated from the data, then inclusion of the third
> variable WG could yield a slightly improved discriminator in that
> the probability of misclassification (a rare event for such data)
> could be minimised. But it would not make much difference!
>
> However, since that assumption does not hold, this analysis would
> not be valid.
>
> If you plot WT vs WG, a common covariance is more plausible; but
> there is considerable overlap for these two variables:
>
>   plot(WG,WT)
>   points(WG[ix.M],WT[ix.M],pch="+",col="blue")
>   points(WG[ix.F],WT[ix.F],pch="+",col="red")
>
> If you plot WG vs ID, there is perhaps not much overlap, but a
> considerable difference in covariance between the two groups:
>
>   plot(ID,WG)
>   points(ID[ix.M],WG[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WG[ix.F],pch="+",col="red")
>
> This looks better on a log scale, however:
>
>   lWG <- log(WG) ; lWT <- log(WT) ; lID <- log(ID)
>   ## Plot log(WG) vs log(ID) (M & F)
>   plot(lID,lWG)
>   points(lID[ix.M],lWG[ix.M],pch="+",col="blue")
>   points(lID[ix.F],lWG[ix.F],pch="+",col="red")
>
> and common covariance still looks good for WG vs WT:
>
>   ## Plot log(WT) vs log(WG) (M & F)
>   plot(lWG,lWT)
>   points(lWG[ix.M],lWT[ix.M],pch="+",col="blue")
>   points(lWG[ix.F],lWT[ix.F],pch="+",col="red")
>
> but there is no improvement for WT vs ID:
>
>   ## Plot log(WT) vs log(ID) (M & F)
>   plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
>   points(ID[ix.M],WT[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WT[ix.F],pch="+",col="red")

[***]
The above is incorrect! Apologies. I plotted the raw WT and ID
instead of their logs. In fact, if you do plot the logs:

  ## Plot log(WT) vs log(ID) (M & F)
  plot(lID,lWT)
  points(lID[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWT[ix.F],pch="+",col="red")

you now get what looks like much closer agreement between the two
groups in the covariance cov(lID,lWT) than before.

Hence, I would now suggest that you do your linear discrimination
on the logarithms of the variables (since you also get agreement
for the other pairs on the log scale). In fact:

[Raw]:
[Male]:
  cov(cbind(WG,WT,ID)[ix.M,])
  #            WG         WT          ID
  # WG  2.2552465 0.11074710 -0.02202080
  # WT  0.1107471 0.33853450  0.06601287
  # ID -0.0220208 0.06601287  0.31979368

[Female]:
  cov(cbind(WG,WT,ID)[ix.F,])
  #           WG        WT        ID
  # WG 2.4716912 0.1577307 0.6670657
  # WT 0.1577307 0.3183928 0.2973335
  # ID 0.6670657 0.2973335 2.8326520

[log]:
[Male]:
  cov(cbind(lWG,lWT,lID)[ix.M,])
  #               lWG          lWT           lID
  # lWG  0.0006584465 0.0001813315 -0.0002133576
  # lWT  0.0001813315 0.0030368382  0.0030442356
  # lID -0.0002133576 0.0030442356  0.0693965979

[Female]:
  cov(cbind(lWG,lWT,lID)[ix.F,])
  #              lWG          lWT         lID
  # lWG 0.0007244826 0.0002171885 0.001951343
  # lWT 0.0002171885 0.0019640076 0.003305884
  # lID 0.0019513428 0.0033058841 0.068406840

> So there is no simple road to applying a routine LDA to your data.
>
> To take account of different covariances between the two groups,
> you would normally be looking at a quadratic discriminator. However,
> as indicated above, the fact that a linear discriminator using
> the variables ID & WT alone works so well would leave considerable
> imprecision in conclusions to be drawn from its results.
>
> Sorry this is not the straightforward answer you were hoping for
> (which I confess I have not sought); it is simply a reaction to
> what your data say.
>
> Ted.
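
The suggestion at [***] (linear discrimination on the logs) and the
classification-function constants asked about in the original question
can be sketched roughly as follows. This is only an illustration under
stated assumptions: the data are simulated stand-ins (the real
horsesLDA.csv is not reproduced here), equal priors are used, and the
coefficient vector is read as C_j = W^{-1} M_j, which is the usual
textbook form that the quoted formula implies but does not spell out.
The names sim_group, L, Cj, cj0, scores and manual are made up for the
example.

```r
library(MASS)  # lda() lives here; MASS ships with standard R installs

set.seed(1)
n <- 50  # cases per group; equal sizes, as the quoted formula assumes

# Simulated stand-in data (ASSUMPTION: not the original measurements)
sim_group <- function(n, med) {
  data.frame(WG = rlnorm(n, log(med[1]), 0.03),
             WT = rlnorm(n, log(med[2]), 0.05),
             ID = rlnorm(n, log(med[3]), 0.25))
}
D <- rbind(sim_group(n, c(55, 10, 3)), sim_group(n, c(53, 12, 7)))
D$SEX <- factor(rep(c("M", "F"), each = n))

# Work on the log scale
L <- data.frame(lWG = log(D$WG), lWT = log(D$WT), lID = log(D$ID),
                SEX = D$SEX)

fit  <- lda(SEX ~ lWG + lWT + lID, data = L, prior = c(0.5, 0.5))
pred <- predict(fit)
print(table(observed = L$SEX, predicted = pred$class))

# Classification functions from the pooled within-group covariance W
X   <- as.matrix(L[, c("lWG", "lWT", "lID")])
grp <- L$SEX
M   <- fit$means  # one row of means per group
W   <- Reduce(`+`, lapply(levels(grp), function(g) {
  Xg <- X[grp == g, , drop = FALSE]
  (nrow(Xg) - 1) * cov(Xg)
})) / (nrow(X) - nlevels(grp))

Cj  <- solve(W, t(M))              # column j holds W^{-1} M_j
cj0 <- -0.5 * colSums(t(M) * Cj)   # constant: -1/2 * M_j' W^{-1} M_j

# Score every case for every group; assign to the group with the larger score
scores <- sweep(X %*% Cj, 2, cj0, "+")
manual <- factor(levels(grp)[max.col(scores, ties.method = "first")],
                 levels = levels(grp))
print(rbind(constant = cj0, Cj))
```

With equal priors, the assignments in manual should agree with
predict(fit)$class, since both rest on the same pooled covariance.
If the group covariances still differ appreciably on the log scale,
qda() in MASS would be the quadratic alternative mentioned above.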

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-May-09                                       Time: 21:49:50
------------------------------ XFMail ------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.