Friends, I am working on a URL classification problem: deciding, from certain keywords in the URL, whether a page contains executive information or not. I have already gone through 50K URLs, identified the keywords, and coded each keyword as 0 or 1 (0 = the URL does not have the keyword, 1 = the URL has the keyword). Each URL is also labelled 0 (does not contain executive information) or 1 (contains executive information).
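To make the 0/1 coding concrete, here is a minimal sketch of how such keyword flags could be derived in R. The URLs and the short keyword list are made up for illustration and are not taken from the real 50K set; the matching rule (plain substring match on the lower-cased URL) is also an assumption.

```r
# Hypothetical example URLs and keywords (not from the real data set).
urls <- c("example.com",
          "example.com/about/management-team",
          "example.com/privacy-policy")
keywords <- c("management", "team", "policy", "privacy")

# One 0/1 column per keyword: 1 if the keyword occurs anywhere in the URL.
flags <- sapply(keywords, function(k) {
  as.integer(grepl(k, tolower(urls), fixed = TRUE))
})
rownames(flags) <- urls

print(flags)
```

A plain substring match will also fire on longer words that merely contain the keyword (for example "team" inside "steam"), so a real coding rule may need to split the URL into tokens first.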
A sample of the data is shown below. Each row has these identifying columns:

DomainID, Domain, LinkID, Raw_Link, Cleansed_Link, Biz_name, Address1, City, State, PostalCode

followed by 40 binary (0/1) flag columns:

Address_Page_flag, Executive_page_flag, collections, other_keywords, conditions, policy, story, history, brand, login, job, career, who, company, people, staff, Board, management, team, terms, privacy, shop, gallery, News, location, site, Sitemap, page, Content, Event, blog, categories, Services, Index, Product, Reviews, Testimonials, about, contact, Home_page

and finally:

Link_Len, LinkCount, ExecWordCount, ExecWordRatio, Category

All six sample rows share the same values for DomainID (250842730), Domain (www.aaronwomenscenterhouston.com), Biz_name (AARON WOMEN’S CENTER), Address1 (2505 North Shepherd Dr), City (Houston), State (TX), PostalCode (77008) and Category (Clinic). Raw_Link is http://www. followed by Cleansed_Link. In each row exactly one flag column is 1 and all the others, including Executive_page_flag, are 0. The remaining values are:

LinkID       Cleansed_Link                                          Flag set to 1   Link_Len  LinkCount  ExecWordCount  ExecWordRatio
250842730-1  aaronwomenscenterhouston.com                           Home_page       28        9          1              0.48%
250842730-2  aaronwomenscenterhouston.com/surgical-termination      other_keywords  49        9          1              0.65%
250842730-3  aaronwomenscenterhouston.com/non-surgical-termination  other_keywords  53        9          1              0.79%
250842730-4  aaronwomenscenterhouston.com/birth-control             other_keywords  42        9          1              0.59%
250842730-5  aaronwomenscenterhouston.com/late-term-termination     other_keywords  50        9          2              0.71%
250842730-6  aaronwomenscenterhouston.com/patient-forms             other_keywords  42        9          1              0.78%

I understand that I need to use multivariate Bernoulli classification to segregate the URLs. I am struggling to find appropriate R code for this.

Any help in providing R code for this would be greatly appreciated.

Cheers
ALN

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
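For reference, a multivariate Bernoulli naive Bayes classifier of the kind asked about can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the function names fit_bernoulli_nb and predict_bernoulli_nb are hypothetical, the toy matrix is made up rather than taken from the data above, and Laplace smoothing with alpha = 1 is one common choice, not the only one.

```r
# Fit: estimate P(class) and P(keyword = 1 | class) from a 0/1 matrix X
# (rows = URLs, columns = keyword flags) and a 0/1 label vector y.
fit_bernoulli_nb <- function(X, y, alpha = 1) {
  classes <- sort(unique(y))
  prior <- sapply(classes, function(cl) mean(y == cl))
  # theta[c, j] = smoothed estimate of P(feature j = 1 | class c)
  theta <- do.call(rbind, lapply(classes, function(cl) {
    Xc <- X[y == cl, , drop = FALSE]
    (colSums(Xc) + alpha) / (nrow(Xc) + 2 * alpha)
  }))
  list(classes = classes, prior = prior, theta = theta)
}

# Predict: pick the class with the largest log joint probability
# log P(c) + sum_j [ x_j * log(theta_cj) + (1 - x_j) * log(1 - theta_cj) ]
predict_bernoulli_nb <- function(model, X) {
  scores <- X %*% t(log(model$theta)) + (1 - X) %*% t(log(1 - model$theta))
  scores <- sweep(scores, 2, log(model$prior), "+")
  model$classes[max.col(scores, ties.method = "first")]
}

# Made-up toy data: three keyword flags, label 1 = executive page.
X <- rbind(c(1, 1, 0),
           c(1, 0, 0),
           c(0, 1, 0),
           c(0, 0, 1),
           c(0, 0, 1),
           c(0, 0, 0))
colnames(X) <- c("management", "team", "shop")
y <- c(1, 1, 1, 0, 0, 0)

model <- fit_bernoulli_nb(X, y)

new_X <- rbind(c(1, 0, 0),
               c(0, 0, 1))
colnames(new_X) <- colnames(X)
pred <- predict_bernoulli_nb(model, new_X)
print(pred)
```

Unlike a multinomial model, the Bernoulli form also uses the absence of a keyword (the 1 - x_j term) as evidence, which suits 0/1 flag columns like the ones in the sample. On real data the model would need to be checked on held-out URLs before its labels are trusted.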