Gundala Viswanath wrote:
| Dear Ben,
|
| Given a set of words
| ('foo', 'bar', 'bar', 'bar', "quux" ..... "foo") this can be in 10.000 items.
| I would like to compute the significance of the word occurrence with P-Value.
|
| Is there a simple way to do it?
|
| - GV

Closer, but still not enough information. What is your
null hypothesis? Equidistribution? If so, ...

dat <- sample(c("foo","bar","quux","pridznyskie"),
              replace=TRUE, size=10000)
tab <- table(dat)
chisq.test(tab)

from ?chisq.test:

    If 'x' is a matrix with one row or column, or if 'x' is a vector
    and 'y' is not given, then a _goodness-of-fit test_ is performed
    ('x' is treated as a one-dimensional contingency table).  The
    entries of 'x' must be non-negative integers.  In this case, the
    hypothesis tested is whether the population probabilities equal
    those in 'p', or are all equal if 'p' is not given.

Note that this won't test the significance of *individual*
deviations from equiprobability, just the overall pattern.
If you wanted to test individual words you could use
binom.test -- but if you tested more than one word, or
tested words on the basis of those that appeared to have
extreme frequencies, you'd start running into multiple
comparisons/post hoc testing issues.

Do you know something about the methods that people usually
use in this area?

Ben Bolker
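
For what it's worth, the per-word binom.test idea mentioned in the message could be sketched roughly as below. This is only an illustration under stated assumptions: the null is taken to be equiprobability, the data are simulated as in the example in the message, the seed is arbitrary, and Holm's method via p.adjust is just one possible answer to the multiple-comparisons issue (it is not something the message prescribes). The names counts, p0, p_raw and p_adj are made up for the sketch.

```r
# Simulated data, as in the example in the message; the seed is arbitrary.
set.seed(1)
words <- c("foo", "bar", "quux", "pridznyskie")
dat <- sample(words, replace = TRUE, size = 10000)

# Word counts as a plain named integer vector.
tab <- table(dat)
counts <- setNames(as.vector(tab), names(tab))
n <- sum(counts)

# Assumed null hypothesis: every word is equally likely.
p0 <- 1 / length(counts)

# One exact binomial test per word: that word's count out of all n tokens.
p_raw <- vapply(
  names(counts),
  function(w) binom.test(counts[[w]], n, p = p0)$p.value,
  numeric(1)
)

# Adjust for having tested several words at once.
# Holm's method is an assumption made here for illustration.
p_adj <- p.adjust(p_raw, method = "holm")

print(data.frame(count = counts, p_raw = p_raw, p_adj = p_adj))
```

The adjustment only covers the number of words tested. It does not address the other problem raised in the message, namely picking words to test because their frequencies already looked extreme.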