[R] subset a data frame by largest frequencies of factors

Michael Friendly Thu, 05 Mar 2015 10:47:53 -0800

A consulting client has a large data set with a binary response (negative) and two factors (ctry and member) which have many levels, but many of those levels occur with very small frequencies. The data are far too sparse to fit a model like glm(negative ~ ctry + member, family=binomial).


> str(Dataset)
'data.frame':   10672 obs. of  5 variables:
 $ ctry    : Factor w/ 31 levels "Barbados","Belize",..: 21 21 5 22 18 18 18 18 26 18 ...
 $ member  : Factor w/ 163 levels "","ADHOPIA, PREETI ",..: 150 19 19 111 120 1 1 4 55 18 ...
 $ negative: int  0 1 0 1 1 1 1 0 0 0 ...
>

For analysis, we'd like to subset the data to include only those levels that occur with frequency greater than a given value, or the top 10 (say) in frequency, or the highest-frequency categories accounting for 80% (say) of the total.  I'm not sure how to do any of these in R.  Can anyone help?
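
To make the three selection rules concrete, here is a minimal base-R sketch. It is not a definitive solution: it runs on simulated data, since the real Dataset is not available here, and the names lev, p, min_n, keep_min, keep_top, keep_80 and subset_levels() are hypothetical, chosen only for illustration. The approach is to tabulate the factor, sort the counts, pick the level names to keep, and then drop the unused levels.

```r
# Simulated stand-in for the real data (an assumption, not the client's data).
set.seed(1)
n   <- 2000
lev <- paste0("m", 1:40)
p   <- (40:1)^3                      # skewed weights, so many levels are rare
Dataset <- data.frame(
  member   = factor(sample(lev, n, replace = TRUE, prob = p), levels = lev),
  negative = rbinom(n, 1, 0.4)
)

# Level frequencies, largest first
freq   <- sort(table(Dataset$member), decreasing = TRUE)
counts <- as.vector(freq)

# (1) levels occurring more than a given number of times
min_n    <- 50
keep_min <- names(freq)[counts > min_n]

# (2) the 10 most frequent levels (ties at the cutoff are broken arbitrarily)
keep_top <- names(freq)[seq_len(min(10, length(freq)))]

# (3) the fewest top levels that together cover at least 80% of the rows
cum_share <- cumsum(counts) / sum(counts)
keep_80   <- names(freq)[seq_len(which(cum_share >= 0.80)[1])]

# Keep only rows whose level is in `keep`, then drop the now-empty levels
subset_levels <- function(data, var, keep) {
  droplevels(data[data[[var]] %in% keep, , drop = FALSE])
}

d_min <- subset_levels(Dataset, "member", keep_min)
d_top <- subset_levels(Dataset, "member", keep_top)
d_80  <- subset_levels(Dataset, "member", keep_80)
```

The same steps would apply to ctry by passing "ctry" as var. Calling droplevels() matters here, because otherwise the empty levels stay in the factor and a later glm() fit would still see them.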

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
