On Jul 2, 2014, at 11:42 AM, Supriya Jain wrote:

> Hi David,
>
> Thanks for your mail.
>
> Here are the details of what I would like to do.
> Given a dataset, I make two sets from it (for training and testing my model,
> respectively). But before the modeling, I would like to check the
> distributions of all columns in my dataset in order to make sure that my
> split tables represent the same distributions.
>
> With the code below (using the "attenu" dataset), I can overlay histograms
> normalized to unit area from the two split datasets, for columns that are
> of type numeric.
> ---------------------
>
> head(attenu, 10)
> nrow(attenu)
> indices <- sample(1:182, 50)
> t1 <- attenu[indices, ]
> t2 <- attenu[-indices, ]
>
> # overlay column "event" from t1 and t2:
>
> hist(t2$event, col = "red", density = 0, freq = FALSE,
>      breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2))
> par(new=TRUE)
> hist(t1$event, col = "blue", density = 0, freq = FALSE,
>      breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2))
>
> #-------------
>
> However, for columns of type factor, although I can get the frequency of the
> different levels using the "summary" method for the columns separately, how
> do I plot their frequency distribution, after normalizing the frequencies by
> the total count, and overlay these distributions?
>
> summary(t1$station)
>
> #--------output--------------
>
>   135   111   113   117  1027  1028  1052  1093  1095  1102   112  1219
>     3     2     2     2     1     1     1     1     1     1     1     1
>   126   127  1291  1293   130  1308  1383  1408  1409   141  1410  1418
>     1     1     1     1     1     1     1     1     1     1     1     1
>   266   270   272   411   412  5042  5043  5054  5060  5066  5069  5160
>     1     1     1     1     1     1     1     1     1     1     1     1
>  5165   952  c168  c266  1008  1011  1013  1014  1015  1016  1030  1032
>     1     1     1     1     0     0     0     0     0     0     0     0
>  1051  1083  1096   110  1117   116   125  1250  1251   128  1292  1298
>     0     0     0     0     0     0     0     0     0     0     0     0
>  1299  1376  1377  1411  1413  1422  1438  1445  1456  1492  2001  2316
>     0     0     0     0     0     0     0     0     0     0     0     0
>   262   269  2708  2714  2715  2728  2734   280   283   286   290  3501
>     0     0     0     0     0     0     0     0     0     0     0     0
>   475  5028  5044  5045  5047  5049  5050  5051  5052  5053  5055  5056
>     0     0     0     0     0     0     0     0     0     0     0     0
>  5057  5058 (Other)    NA's
>     0     0       0       5
>
> #----------------------------
>
> summary(t2$station)
>
> #--------output--------------
>
>  1028   117   475  1030  1083   112   113   116  1299  1377   269   283
>     3     3     3     2     2     2     2     2     2     2     2     2
>   290  5028  5053  5055  5056  5057  5058  5115   942   955   958  1008
>     2     2     2     2     2     2     2     2     2     2     2     1
>  1011  1013  1014  1015  1016  1032  1051  1093  1095  1096   110  1117
>     1     1     1     1     1     1     1     1     1     1     1     1
>  1219   125  1250  1251   128  1292  1298   130  1308  1376  1383  1411
>     1     1     1     1     1     1     1     1     1     1     1     1
>  1413  1418  1422  1438  1445  1456  1492  2001  2316   262   266  2708
>     1     1     1     1     1     1     1     1     1     1     1     1
>  2714  2715   272  2728  2734   280   286  3501   412  5044  5045  5047
>     1     1     1     1     1     1     1     1     1     1     1     1
>  5049  5050  5051  5052  5054  5059  5060  5061  5062  5067  5068  5070
>     1     1     1     1     1     1     1     1     1     1     1     1
>  5072  5073  5165   655   724   885   931   952  c118  c203  c204  1027
>     1     1     1     1     1     1     1     1     1     1     1     0
>  1052  1102 (Other)    NA's
>     0     0       0      11
>
> #---------------------------
>
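One possible way to put the two sets of frequencies on a common footing is to tabulate both subsets against a single shared set of levels, divide each table by its total, and draw the bars side by side. The following is only a sketch under stated assumptions: `compare_factor` is a hypothetical helper name (not from any package), the seed is arbitrary, and NA values are simply left out of the totals.

```r
# Sketch only: relative frequencies of a factor column in two subsets.
# `compare_factor` is a hypothetical helper, not part of base R.
compare_factor <- function(x1, x2, drop_empty = TRUE) {
  # One common set of levels, so both tables have the same columns
  lv <- union(levels(as.factor(x1)), levels(as.factor(x2)))
  tab1 <- table(factor(x1, levels = lv))  # table() ignores NA by default
  tab2 <- table(factor(x2, levels = lv))
  # Normalize each table by its own total count
  props <- rbind(as.vector(tab1) / sum(tab1),
                 as.vector(tab2) / sum(tab2))
  dimnames(props) <- list(c("t1", "t2"), lv)
  if (drop_empty) {
    # Drop levels that occur in neither subset
    props <- props[, colSums(props) > 0, drop = FALSE]
  }
  props
}

set.seed(1)  # arbitrary seed, only to make the split repeatable
indices <- sample(seq_len(nrow(attenu)), 50)
t1 <- attenu[indices, ]
t2 <- attenu[-indices, ]

props <- compare_factor(t1$station, t2$station)

# Side-by-side bars: one pair per station, blue = t1, red = t2
barplot(props, beside = TRUE, col = c("blue", "red"), las = 2,
        cex.names = 0.5, xlab = "station", ylab = "proportion",
        legend.text = TRUE)
```

Each row of `props` sums to 1, so the two subsets are comparable even though they differ in size. With a factor like `station`, where most levels occur only once or twice, the plot will be crowded and may not be very informative.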
It appears there may be a natural order to those categories, but the alphabetical ordering of the factor representation is making a hash of that fact. It also appears that the factor levels are different in the two datasets. It seems unlikely that you will get satisfactory plots for comparison using barplot.

-- 
David.

>
> Thanks in advance for any help with this,
> Supriya
>
>
> On Tue, Jul 1, 2014 at 6:42 PM, David Winsemius <dwinsem...@comcast.net>
> wrote:
>
> On Jul 1, 2014, at 3:46 PM, Supriya Jain wrote:
>
> > Hello,
> >
> > Given two different datasets (having the same number and type of columns,
> > but different observations, as commonly encountered in data-mining as
> > train/test/validation datasets), is it possible to overlay plots
> > (histograms) and compare the different attributes from the separate
> > datasets, in order to check how similar the different datasets are?
> >
> > Is there a package available for such plotting together of similar columns
> > from different datasets?
>
> Possible. Assuming you just want frequency histograms (or ones using counts
> for that matter) it can be done in any of the three major plotting paradigms
> supported in R. No extra packages needed if using just base graphics.
>
> >
> > Thanks,
> > SJ
> >
> > [[alternative HTML version deleted]]
>
> Oh, you must have missed the parts of the Posting Guide where plain text was
> requested. See below.
>
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>
> And you missed that section, as well.
>
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> David Winsemius
> Alameda, CA, USA
>

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
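
The base-graphics route mentioned in the earlier reply could be sketched roughly as follows for the numeric columns. This is an illustration under assumptions rather than a definitive recipe: `overlay_hist` is a hypothetical helper name, the seed and the 2 x 2 layout are arbitrary choices, and the breaks are taken from the pooled values so both histograms share the same bins. It uses `add = TRUE` in place of `par(new=TRUE)` so the axes are drawn only once.

```r
# Sketch only: overlay unit-area histograms of one numeric column
# from two subsets. `overlay_hist` is a hypothetical helper name.
overlay_hist <- function(x1, x2, name = "", n = 10) {
  # Common breaks computed from the pooled values of both subsets
  brks <- pretty(range(c(x1, x2), na.rm = TRUE), n = n)
  h1 <- hist(x1, breaks = brks, plot = FALSE)
  h2 <- hist(x2, breaks = brks, plot = FALSE)
  # Shared y-axis limit so neither histogram is clipped
  ymax <- max(h1$density, h2$density)
  plot(h2, freq = FALSE, border = "red", col = NA,
       ylim = c(0, ymax), main = name, xlab = name)
  plot(h1, freq = FALSE, border = "blue", col = NA, add = TRUE)
  invisible(list(h1 = h1, h2 = h2))
}

set.seed(1)  # arbitrary seed, only to make the split repeatable
indices <- sample(seq_len(nrow(attenu)), 50)
t1 <- attenu[indices, ]
t2 <- attenu[-indices, ]

# Loop over every numeric column of the data frame
num_cols <- names(attenu)[sapply(attenu, is.numeric)]
op <- par(mfrow = c(2, 2))
for (nm in num_cols) {
  overlay_hist(t1[[nm]], t2[[nm]], name = nm)
}
par(op)
```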