Whoops, the final variance estimator var(f(Y)) should have N^4 in the denominator, not N^2.
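For what it's worth, here is a rough Python sketch of the corrected calculation, i.e. var(f(Y)) = [N^2 var(Y) - 2 N Y cov(Y,N) + Y^2 var(N)] / N^4, which is what the quadratic form with gradient [1/N, -Y/N^2] works out to. This is not the R code mentioned in the quoted message. The function name `ratio_mean_se` and the input layout (one list of (y, weight) pairs per cluster, with the weighted value w*y playing the role of \hat{Y}_{j(i)} and w the role of \hat{N}_{j(i)}) are assumptions made for illustration, and it covers only a one-stage cluster sample with at least two clusters.

```python
from math import sqrt


def ratio_mean_se(clusters):
    """Ratio-estimator mean and linearized SE for a one-stage cluster sample.

    clusters: one entry per cluster; each entry is a list of (y, w) pairs,
    where w is the sampling weight (hypothetical input layout).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")

    # Cluster-level weighted totals: Y_j = sum(w * y), N_j = sum(w)
    y_hat = [sum(w * y for y, w in c) for c in clusters]
    n_hat = [sum(w for _, w in c) for c in clusters]

    Y, N = sum(y_hat), sum(n_hat)
    y_bar, n_bar = Y / k, N / k
    scale = k / (k - 1)

    # Between-cluster variance and covariance of the totals
    var_y = scale * sum((yj - y_bar) ** 2 for yj in y_hat)
    var_n = scale * sum((nj - n_bar) ** 2 for nj in n_hat)
    cov_yn = scale * sum((yj - y_bar) * (nj - n_bar)
                         for yj, nj in zip(y_hat, n_hat))

    # Corrected final expression: N^4 in the denominator
    var_f = (N**2 * var_y - 2 * N * Y * cov_yn + Y**2 * var_n) / N**4

    # Guard against tiny negative values from floating-point rounding
    return Y / N, sqrt(max(var_f, 0.0))


if __name__ == "__main__":
    # Two clusters of unequal size, all weights equal to 1 (made-up data)
    data = [[(1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]
    mean, se = ratio_mean_se(data)
    print(mean, se)
```

With equal cluster sizes and equal weights, var(N) and cov(Y,N) are both zero, so the expression collapses to var(Y)/N^2.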
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Doran, Harold
> Sent: Monday, August 18, 2008 10:47 AM
> To: Stas Kolenikov
> Cc: r-help@r-project.org
> Subject: Re: [R] Design-consistent variance estimate
>
> It also turns out that in educational testing, it is rare to
> consider the sampling design and to estimate design-consistent
> standard errors. I appreciate your thoughts on this, Stas. As a
> result, I now have a clearer picture of what R's survey package
> and SAS proc surveymeans are doing. I've copied some minimal
> LaTeX code below. My R code reflecting this LaTeX replicates
> svymean() and the SAS procedures exactly under all conditions
> that I have tested so far for a one-stage cluster sample.
>
> It clearly reduces to a simpler expression when cluster sizes
> are equal.
>
> My hat is off to sampling statisticians; this has got to be a
> lot of fun for you :)
>
> ### LaTeX
>
> \documentclass[12pt]{article}
> \usepackage{bm,geometry}
> \begin{document}
>
> In this scenario, the appropriate procedure is to estimate
> design-consistent standard errors. This is accomplished by
> first defining the ratio estimator of the mean as:
>
> \begin{equation}
> f(Y) = \frac{Y}{N}
> \end{equation}
>
> \noindent where $Y$ is the total of the variable and $N$ is
> the population size.
> Treating both $Y$ and $N$ as random variables, a first-order
> Taylor series expansion of the ratio estimator $f(Y)$ can be
> used to derive the design-consistent variance estimator as:
>
> \begin{equation}
> var(f(Y)) = \left[\frac{\partial f(Y)}{\partial Y},
> \frac{\partial f(Y)}{\partial N}\right]
> \left[ \begin{array}{cc}
> var(Y) & cov(Y,N)\\
> cov(Y,N) & var(N)\\
> \end{array} \right]
> \left[\frac{\partial f(Y)}{\partial Y},
> \frac{\partial f(Y)}{\partial N}\right]^T
> \end{equation}
>
> \noindent where
>
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N}
> \end{equation}
>
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial N}\right] = -\frac{Y}{N^2}
> \end{equation}
>
> \begin{equation}
> var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
> \end{equation}
>
> \begin{equation}
> \hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)}
> \end{equation}
>
> \begin{equation}
> \hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j
> \end{equation}
>
> \begin{equation}
> var(N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{N}_j-\hat{N}_{..})^2
> \end{equation}
>
> \begin{equation}
> \hat{N}_j = \sum_{i=1}^{n_j}\hat{N}_{j(i)}
> \end{equation}
>
> \begin{equation}
> \hat{N}_{..} = k^{-1} \sum_{j=1}^k \hat{N}_j
> \end{equation}
>
> \begin{equation}
> cov(Y,N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})(\hat{N}_j-\hat{N}_{..})
> \end{equation}
>
> \noindent where $j$ indexes cluster $(1, 2, \ldots, k)$,
> $j(i)$ indexes the $i$th member of cluster $j$, and $n_j$ is
> the total number of members in cluster $j$.
>
> The estimate of the variance of $f(Y)$ is then taken as:
>
> \begin{equation}
> var(f(Y)) = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2 }{N^2}
> \end{equation}
>
> The standard error is then taken as:
>
> \begin{equation}
> se = \sqrt{var(f(Y))}
> \end{equation}
>
> \end{document}
>
> > -----Original Message-----
> > From: Stas Kolenikov [mailto:[EMAIL PROTECTED]
> > Sent: Monday, August 18, 2008 10:40 AM
> > To: Doran, Harold
> > Cc: r-help@r-project.org
> > Subject: Re: [R] Design-consistent variance estimate
> >
> > On 8/16/08, Doran, Harold <[EMAIL PROTECTED]> wrote:
> > > In terms of the "design" (which is a term used loosely) the
> > > schools were not randomly selected. They volunteered to
> > > participate in a pilot study.
> >
> > Oh, that's a next level of disaster, then! You may have to work
> > with treatment effect models, of which there are many:
> > propensity score matching, nearest neighbor matching, instrumental
> > variables, etc.
> > Those methods require asymptotics in terms of number of treatment
> > units, which would be schools -- and I would imagine those are
> > numbered in dozens rather than thousands in your study, so
> > straightforward application of those methods might be problematic...
> > At least I would augment my analysis with propensity score weights:
> > somehow estimate the (school level) probability of participating in
> > the study (I imagine you have the school characteristics at hand
> > for your complete universe of schools -- principal's education
> > level, # of computers per student, fraction free/reduced price
> > lunch, whatever... you probably know those better than I do :) ),
> > and use inverse of that probability as the probability weight. If
> > the selection was informative, you might see quite different
> > results in weighted and unweighted analysis.
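The weighting step Stas describes could look roughly like the Python sketch below. It assumes the school-level participation probabilities have already been estimated somehow (that step, e.g. a regression on school characteristics, is not shown), and the function name and inputs are hypothetical. It only contrasts the unweighted mean with the inverse-probability-weighted mean; it does not produce standard errors.

```python
def unweighted_and_ipw_mean(outcomes, p_participate):
    """Return (unweighted mean, inverse-probability-weighted mean).

    outcomes:      one outcome per participating school
    p_participate: estimated probability that each school participates
                   (assumed to be supplied by some earlier model)
    """
    if not outcomes or len(outcomes) != len(p_participate):
        raise ValueError("need one probability per outcome")
    if any(not (0.0 < p <= 1.0) for p in p_participate):
        raise ValueError("probabilities must lie in (0, 1]")

    # Probability weight = inverse of the participation probability
    weights = [1.0 / p for p in p_participate]

    unweighted = sum(outcomes) / len(outcomes)
    weighted = sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)
    return unweighted, weighted


if __name__ == "__main__":
    # Made-up numbers: the second school was less likely to volunteer,
    # so it counts for more in the weighted mean.
    print(unweighted_and_ipw_mean([10.0, 20.0], [0.5, 0.25]))
```

A large gap between the two numbers would be a hint that selection into the pilot was informative.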
> >
> > > In Wolter (1985) he shows the variance of a cluster sample
> > > with a single stratum and then extends that to the more
> > > general example. It turns out though in many educational
> > > assessment studies, the single stage cluster sample is a
> > > norm and not so rare.
> >
> > I can see why. Thanks, I'll keep educational statistics examples in
> > mind for those kinds of designs!
> >
> > --
> > Stas Kolenikov, also found at http://stas.kolenikov.name
> > Small print: I use this email account for mailing lists only.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.