Re: [R] Data visualization: overlay columns of train/test/validation datasets

David Winsemius Wed, 02 Jul 2014 14:49:28 -0700

On Jul 2, 2014, at 11:42 AM, Supriya Jain wrote:

> Hi David,
>  
> Thanks for your mail. 
>  
> Here are the details of what I would like to do. 
> Given a dataset, I make two sets from it (for training and testing my model, 
> respectively). But before the modeling, I would like to check the 
> distributions of all columns in my dataset in order to make sure that my 
> split tables represent the same distributions. 
>  
> With the code below (using the "attenu" dataset), I can overlay histograms 
> normalized to unit area from the two split datasets, for columns that are 
> of type numeric. 
> ---------------------
>  
> head(attenu, 10)
> nrow(attenu)
> # hold out 50 of the 182 rows as t1; the remaining rows become t2
> indices <- sample(nrow(attenu), 50)
> t1 <- attenu[indices, ]
> t2 <- attenu[-indices, ]
>  
> # overlay column "event" from t1 and t2:
>  
> # with density = 0 nothing is shaded or filled, so the colour goes on the border
> hist(t2$event, border = "red", density = 0, freq = FALSE,
>      breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2))
> # add = TRUE draws on the existing axes instead of redrawing them
> hist(t1$event, border = "blue", density = 0, freq = FALSE,
>      breaks = seq(1, 25, 2), add = TRUE)
>  
> #-------------
>  
> However, for columns of type factor, although I can get the frequency of the 
> different levels using the "summary" method for the columns separately, how 
> do I plot their frequency distribution, after normalizing the frequencies by 
> the total count, and overlay these distributions? 
>  
> summary(t1$station)
>  
> #--------output--------------
>  
>      135     111     113     117    1027    1028    1052    1093    1095    1102     112    1219
>        3       2       2       2       1       1       1       1       1       1       1       1
>      126     127    1291    1293     130    1308    1383    1408    1409     141    1410    1418
>        1       1       1       1       1       1       1       1       1       1       1       1
>      266     270     272     411     412    5042    5043    5054    5060    5066    5069    5160
>        1       1       1       1       1       1       1       1       1       1       1       1
>     5165     952    c168    c266    1008    1011    1013    1014    1015    1016    1030    1032
>        1       1       1       1       0       0       0       0       0       0       0       0
>     1051    1083    1096     110    1117     116     125    1250    1251     128    1292    1298
>        0       0       0       0       0       0       0       0       0       0       0       0
>     1299    1376    1377    1411    1413    1422    1438    1445    1456    1492    2001    2316
>        0       0       0       0       0       0       0       0       0       0       0       0
>      262     269    2708    2714    2715    2728    2734     280     283     286     290    3501
>        0       0       0       0       0       0       0       0       0       0       0       0
>      475    5028    5044    5045    5047    5049    5050    5051    5052    5053    5055    5056
>        0       0       0       0       0       0       0       0       0       0       0       0
>     5057    5058 (Other)    NA's
>        0       0       0       5
>  
> #----------------------------
>  
> summary(t2$station)
>  
> #--------output--------------
>  
>     1028     117     475    1030    1083     112     113     116    1299    1377     269     283
>        3       3       3       2       2       2       2       2       2       2       2       2
>      290    5028    5053    5055    5056    5057    5058    5115     942     955     958    1008
>        2       2       2       2       2       2       2       2       2       2       2       1
>     1011    1013    1014    1015    1016    1032    1051    1093    1095    1096     110    1117
>        1       1       1       1       1       1       1       1       1       1       1       1
>     1219     125    1250    1251     128    1292    1298     130    1308    1376    1383    1411
>        1       1       1       1       1       1       1       1       1       1       1       1
>     1413    1418    1422    1438    1445    1456    1492    2001    2316     262     266    2708
>        1       1       1       1       1       1       1       1       1       1       1       1
>     2714    2715     272    2728    2734     280     286    3501     412    5044    5045    5047
>        1       1       1       1       1       1       1       1       1       1       1       1
>     5049    5050    5051    5052    5054    5059    5060    5061    5062    5067    5068    5070
>        1       1       1       1       1       1       1       1       1       1       1       1
>     5072    5073    5165     655     724     885     931     952    c118    c203    c204    1027
>        1       1       1       1       1       1       1       1       1       1       1       0
>     1052    1102 (Other)    NA's
>        0       0       0      11
>  
> #---------------------------
>


It appears there may be a natural order to those categories, but the 
alphabetical ordering of the factor levels is making a hash of that fact. It 
also appears that the levels actually observed differ between the two 
datasets. It seems unlikely that you will get satisfactory plots for 
comparison using barplot.
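
If a barplot is attempted anyway, the following is a minimal sketch of one way to do it in base graphics, not a tested recipe. It assumes t1 and t2 are cut from attenu with `[` as in the quoted code, in which case both keep the full level set of attenu$station; as far as I can tell, summary() only lists them in different orders because it sorts by frequency once there are more levels than it will print. The names `lev_sorted` and `props`, and the seed, are made up for this illustration. Purely numeric station codes are sorted numerically and the others (such as "c118") are placed after them.

```r
# Sketch only: compare level proportions of a factor column in two subsets
# of the same data frame. `lev_sorted` and `props` are illustrative names.
set.seed(42)                      # assumption: fixed seed, for reproducibility only
indices <- sample(nrow(attenu), 50)
t1 <- attenu[indices, ]
t2 <- attenu[-indices, ]

# Subsetting with `[` keeps all factor levels, observed or not.
stopifnot(identical(levels(t1$station), levels(t2$station)))

# Sort levels: numeric codes by value first, non-numeric codes afterwards.
lev <- levels(attenu$station)
num <- suppressWarnings(as.numeric(lev))   # NA for codes like "c118"
lev_sorted <- lev[order(is.na(num), num, lev)]

# Proportion of each level among the non-missing values of x.
props <- function(x) {
  tab <- table(factor(x, levels = lev_sorted))   # table() drops NA by default
  setNames(as.vector(tab) / sum(tab), names(tab))
}
p1 <- props(t1$station)
p2 <- props(t2$station)

# One pair of bars per station, t1 next to t2.
barplot(rbind(t1 = p1, t2 = p2), beside = TRUE, col = c("blue", "red"),
        las = 2, cex.names = 0.5, xlab = "station", ylab = "proportion",
        legend.text = TRUE)
```

With 117 station levels and at most a handful of observations per level, most bars will be tiny or empty, so the picture may say little about whether the two subsets are comparable.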

-- 
David.
>  
> Thanks in advance for any help with this,
> Supriya
>  
> 
> 
> 
> 
> On Tue, Jul 1, 2014 at 6:42 PM, David Winsemius <dwinsem...@comcast.net> 
> wrote:
> 
> On Jul 1, 2014, at 3:46 PM, Supriya Jain wrote:
> 
> > Hello,
> >
> > Given two different datasets (having the same number and type of columns,
> > but different observations, as commonly encountered in data-mining as
> > train/test/validation datasets), is it possible to overlay plots
> > (histograms) and compare the different attributes from the separate
> > datasets, in order to check how similar the different datasets are?
> >
> > Is there a package available for such plotting together of similar columns
> > from different datasets?
> 
> Possible. Assuming you just want frequency histograms (or ones using counts 
> for that matter) it can be done in any of the three major plotting paradigms 
> supported in R. No extra packages needed if using just base graphics.
> 
> 
> >
> > Thanks,
> > SJ
> >
> >       [[alternative HTML version deleted]]
> 
> Oh, you must have missed the parts of the Posting Guide where plain text was 
> requested. See below.
> 
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> 
> And you missed that section, as well.
> 
> > and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
> --
> David Winsemius
> Alameda, CA, USA
> 
> 

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Data visualization: overlay columns of train/test/validation datasets
