gheine wrote on 10/11/2011 02:31:46 PM:

> An organization has asked me to comment on the validity of their
> recent all-employee survey. Survey responses, by geographic region,
> compared with the total number of employees in each region, were as
> follows:
>
> > ByRegion
>           All.Employees Survey.Respondents
> Region_1            735                142
> Region_2            500                 83
> Region_3            897                 78
> Region_4            717                133
> Region_5            167                 48
> Region_6            309                  0
> Region_7            806                125
> Region_8            627                122
> Region_9            858                177
> Region_10           851                160
> Region_11           336                 52
> Region_12          1823                312
> Region_13            80                  9
> Region_14           774                121
> Region_15           561                 24
> Region_16           834                134
>
> How well does the survey represent the employee population?
> Chi-square test says, not very well:
>
> > chisq.test(ByRegion)
>
>         Pearson's Chi-squared test
>
> data:  ByRegion
> X-squared = 163.6869, df = 15, p-value < 2.2e-16
>
> By striking three under-represented regions (3, 6, and 15), we get
> a more reasonable, although still not convincing, result:
>
> > chisq.test(ByRegion[setdiff(1:16,c(3,6,15)),])
>
>         Pearson's Chi-squared test
>
> data:  ByRegion[setdiff(1:16, c(3, 6, 15)), ]
> X-squared = 22.5643, df = 12, p-value = 0.03166

You can't simply eliminate the three regions with the lowest response
rates (3, 6, and 15). These are the three largest contributors to the
chi-squared statistic, precisely because fewer people in those regions
responded than would be expected from their share of employees. In
addition, more people in regions 1, 5, and 9 responded than expected.
This should be clear in a bar chart, and the chi-squared test on the
full table confirms it.

Jean

> This poses several questions:
>
> 1) Looking at a side-by-side barchart (proportion of responses vs.
> proportion of employees, per region), the pattern of survey responses
> appears, visually, to match fairly well the pattern of employees. Is
> this a case where we trust the numbers and not the picture?
>
> 2) Part of the problem, ironically, is that there were too many
> responses to the survey. If we had only one-tenth the responses, but
> in the same proportions by region, the chi-square statistic would look
> much better (though with a warning about possible inaccuracy):
>
> data:  data.frame(ByRegion$All.Employees, 0.1 * (ByRegion$Survey.Respondents))
> X-squared = 17.5912, df = 15, p-value = 0.2848
>
> Is there a way of reconciling a large response rate with an
> unrepresentative response profile? Or is the bad news that the survey
> will give very precise results about a very ill-specified
> sub-population?
>
> (Of course, I would put it in softer terms, like "you need to assess
> the degree of homogeneity across different regions".)
>
> 3) Is Chi-squared really the right measure of how representative the
> survey is?
>
> <<<<<<< >>>>>>>>>
>
> Thanks for any help you can give - hope these questions make sense -
>
> George H.
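
One way to check which regions drive the statistic is to look at the Pearson residuals that chisq.test() returns. The following is only a rough sketch: it assumes ByRegion is rebuilt as a data frame from the counts in the original post, and the names contrib, top3, resp_resid, and rate are introduced here for illustration.

```r
# Rebuild the table from the counts given in the original post
# (assumed layout: one row per region, two count columns).
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

# Same test as in the original post
ct <- chisq.test(ByRegion)

# Pearson residuals are (observed - expected) / sqrt(expected);
# their squares sum to the X-squared statistic, so the row sums
# give each region's contribution to it.
contrib <- rowSums(ct$residuals^2)
top3 <- sort(contrib, decreasing = TRUE)[1:3]
print(round(sort(contrib, decreasing = TRUE), 2))

# Sign of the respondent residual: negative means fewer responses
# than expected, positive means more.
resp_resid <- ct$residuals[, "Survey.Respondents"]
print(round(resp_resid, 2))

# Response rate per region, for comparison with the overall rate
rate <- ByRegion$Survey.Respondents / ByRegion$All.Employees
names(rate) <- rownames(ByRegion)
print(round(rate, 3))
```

Dropping the rows with the largest contributions will always shrink the statistic, so the smaller X-squared after removing regions 3, 6, and 15 says nothing about how representative the survey is.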