Re: [R] help understanding hierarchical clustering

epi Wed, 01 May 2013 13:26:17 -0700

Hi David,

Thank you so much for helping me!


On 01 May 2013, at 10:16, David Carlson <dcarl...@tamu.edu> wrote:

> You need to clarify what you are trying to achieve and fix some errors in
> your code. First, thanks for giving us reproducible data. 
> 

I tried to fix the errors and uploaded a new link to the data and code [1].
Thanks for your advice!

I'll try to describe the dataset:

The csv stores information recorded by an underwater towed camera

[imagename, temp, sal, depth_m] 

plus 3 fields added later by an image analyst 

[idcode, count, subs]

so each ROW in the data is composed of

- idcode (unique identifier for a species)
- count  (how many individuals of species 'J' are found in image 'X')
- temp (temperature)
- sal (salinity)
- depth_m (depth in meters)
- subs (substrate complexity, an integer describing the seafloor texture
[hard <-> soft bottom])

The csv looks like :


  idcode  count  temp   sal     depth_m  subs
  16001   136    4.308  32.828  63.46    47
..
  10010   1      4.342  32.865  83.58    35


> Once you have read the file, you seem to be attempting to remove cases with
> missing values, but you check for missing values of "count" twice and you
> never check "depth." The whole line can be replaced with
> 
> dd <- na.omit(mat)

My mistake, sorry about that.
Fixed in the code.
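
To check my understanding of the na.omit() suggestion, here is a minimal sketch on a toy data frame. The column names match the csv, but the values (and the NAs) are made up for illustration; only the first row is copied from the example above.

```r
# Toy data frame with the same columns as the csv.
# Values are illustrative; the NAs are inserted on purpose.
mat <- data.frame(
  idcode  = c(16001, 16001, 16002, 10010),
  count   = c(136, 109, NA, 1),
  temp    = c(4.308, 4.310, 4.312, 4.342),
  sal     = c(32.828, 32.829, 32.832, 32.865),
  depth_m = c(63.46, NA, 63.28, 83.58),
  subs    = c(47, 49, 49, 35)
)

# na.omit() drops every row that has an NA in ANY column,
# so no column (such as depth_m) can be forgotten by accident.
dd <- na.omit(mat)

print(nrow(mat))   # number of rows before
print(nrow(dd))    # number of rows after
print(anyNA(dd))   # TRUE would mean some NA survived
```

The manual filter with a chain of !is.na() tests does the same thing only if every column is listed exactly once, which is where my original code went wrong.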

> 
> Now you have data with complete cases. In your next step you create a
> distance matrix that includes "idcode" as a variable! Although it is
> numeric, it is really a categorical variable. That suggests you need to read
> up on R and cluster analysis. It is very likely that you want to exclude
> this variable from the distance matrix and possibly the "count" variable as
> well. 

Big mistake here: idcode is my categorical variable,
the one I'm trying to group into classes.

Fixed in the code. I am now running it both including count [ dd1 ]
and without count [ dd2 ].

The count should express the density of each species under the
associated environmental parameters (I thought it was important, isn't it?)
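
One possible way to let count play that role (this is my own assumption, not something David suggested) is to use it as a weight. The sketch below computes a count-weighted mean of each environmental variable per idcode, clusters the species on those means, and lists the idcodes in each group. It uses base R dist() on standardised columns instead of vegdist(), and the toy values and k = 2 are arbitrary.

```r
# Toy data (made-up values) with the same columns as the csv.
dd <- data.frame(
  idcode  = c(16001, 16001, 16002, 16002, 10010, 10010),
  count   = c(136, 109, 82, 77, 1, 3),
  temp    = c(4.308, 4.310, 4.312, 4.325, 4.342, 4.350),
  sal     = c(32.828, 32.829, 32.832, 32.828, 32.865, 32.870),
  depth_m = c(63.46, 63.09, 63.28, 65.65, 83.58, 84.10),
  subs    = c(47, 49, 49, 46, 35, 33)
)
env <- c("temp", "sal", "depth_m", "subs")

# One row per species: count-weighted mean of each environmental variable.
profile <- t(sapply(split(dd, dd$idcode), function(g) {
  sapply(env, function(v) weighted.mean(g[[v]], w = g$count))
}))

# Standardise the columns so that depth (tens of metres) does not
# dominate temperature (hundredths of a degree), then cluster.
d  <- dist(scale(profile))
hc <- hclust(d, method = "average")

# Cut the tree into k groups and list the idcodes in each group.
groups <- cutree(hc, k = 2)
print(split(names(groups), groups))
```

Here idcode is only used to label the rows; it never enters the distance matrix.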


> 
> What does one row of data represent? You have 8036 complete cases
> representing data on 100 species. There are great differences in the number
> of rows for each species (idcode) ranging from 1 to 1066. 

- Trying to clean up the dataset:
  should I remove the records for the idcodes that are not well represented
  (idcodes with a low number of records),
  so as to have a subset of representative species?

- idcodelist = [id_1,  , id_N]  
  with count(id_i) >= X
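
A minimal sketch of that filter, assuming count(id_i) means the number of records (rows) for each idcode. The toy values and the threshold min_records (my name for X) are arbitrary choices for illustration.

```r
# Toy data (made-up values); only idcode and count are needed here.
dd <- data.frame(
  idcode = c(16001, 16001, 16001, 16002, 16002, 10010),
  count  = c(136, 109, 107, 82, 77, 1)
)

min_records <- 2   # the threshold X; an arbitrary value for illustration

# Number of rows per idcode.
n_records <- table(dd$idcode)

# idcodes that have at least min_records rows.
idcodelist <- names(n_records)[n_records >= min_records]

# Keep only the records that belong to those idcodes.
dd_sub <- dd[as.character(dd$idcode) %in% idcodelist, ]

print(idcodelist)
print(dd_sub)
```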

Note:
in the data each record refers to a single species identified in an image,
which means that there are multiple records for the same image (one record for
each species identified in that image).

In the database I have a unique [imagename] and position [lon lat] for each
image; should I include this information in my csv?

So that it looks like:


  idcode  count  temp   sal     depth_m  subs  lon  lat  imagename
  16001   136    4.308  32.828  63.46    47    x1   y1   image_year_day_h_m_ms_1
  18005   15     4.308  32.828  63.46    47    x1   y1   image_year_day_h_m_ms_1
..
  10010   5      4.342  31.925  82.18    35    xN   yN   image_year_day_h_m_ms_N
  13010   1      4.342  31.925  82.18    35    xN   yN   image_year_day_h_m_ms_N


and group my data by [imagename], adding a field for each representative species
in which to store the relative count?

The example above would then look like:

  count_id_1  count_id_2  count_id_5  count_id_9  count_id_N-1  count_id_N  temp   sal     depth_m  subs  lon  lat  imagename
  136         0           15          0           0             0           4.308  32.828  63.46    47    x1   y1   image_year_day_h_m_ms_1
..
  0           5           0           0           1             0           4.342  31.925  82.18    35    xN   yN   image_year_day_h_m_ms_N

where :

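count_id_1    is the count for the species with idcode 16001 in the image Xi
count_id_5    is the count for the species with idcode 16005 in the image Xi
count_id_2    is the count for the species with idcode 10010 in the image Xi
count_id_N-1  is the count for the species with idcode 13010 in the image Xi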
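
The regrouping by image described above could be sketched in base R as follows. This assumes the long csv already carries an imagename column; the image names "img_1" and "img_N" are placeholders, lon and lat are left out because I only have placeholders for them, and the new columns are named after the full idcode (count_id_16001 and so on) rather than a short index.

```r
# Long format: one row per species per image (placeholder image names).
long <- data.frame(
  imagename = c("img_1", "img_1", "img_N", "img_N"),
  idcode    = c(16001, 18005, 10010, 13010),
  count     = c(136, 15, 5, 1),
  temp      = c(4.308, 4.308, 4.342, 4.342),
  sal       = c(32.828, 32.828, 31.925, 31.925),
  depth_m   = c(63.46, 63.46, 82.18, 82.18),
  subs      = c(47, 47, 35, 35)
)

# Cross-tabulate counts: one row per image, one column per idcode.
# Species that do not occur in an image get a count of 0.
counts <- as.data.frame.matrix(xtabs(count ~ imagename + idcode, data = long))
names(counts) <- paste0("count_id_", names(counts))
counts$imagename <- rownames(counts)

# Environmental data: one row per image.
env <- unique(long[, c("imagename", "temp", "sal", "depth_m", "subs")])

# Wide format: species counts plus environment, one row per image.
wide <- merge(counts, env, by = "imagename")
print(wide)
```

If lon and lat were available they could simply be added to the columns selected for env.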


thank you for any further advice,

Massimo.

[1] http://nbviewer.ipython.org/5497996


> 
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> 
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
> Behalf Of epi
> Sent: Tuesday, April 30, 2013 8:06 PM
> To: r-help@r-project.org
> Subject: [R] help understanding hierarchical clustering
> 
> Hi All,
> 
> I have trouble understanding how to use R to generate a hierarchical
> clustering. My data are in a csv and look like:
> 
> idcode,count,temp,sal,depth_m,subs
> 16001,136,4.308,32.828,63.46,47
> 16001,109,4.31,32.829,63.09,49
> 16001,107,4.302,32.822,62.54,47
> 16001,87,4.318,32.834,62.54,48
> 16002,82,4.312,32.832,63.28,49
> 16002,77,4.325,32.828,65.65,46
> 16002,77,4.302,32.821,62.36,47
> 16002,71,4.299,32.832,65.84,37
> 16002,70,4.302,32.821,62.54,49
> 
> where idcode is a species identification number and the other fields are
> environmental parameters.
> 
> library(vegan)
> mat<-read.csv("http://epi.whoi.edu/ipython/results/mdistefano/pg_site1.csv",
> header=T)
> dd <- mat[!is.na(mat$idcode) &
>              !is.na(mat$temp) &
>              !is.na(mat$sal) &
>              !is.na(mat$count) &
>              !is.na(mat$count) &
>              !is.na(mat$subs),]
> distmat<-vegdist(dd)
> clusa<-hclust(distmat,"average")
> print(clusa)
>       Call:
>       hclust(d = distmat, method = "average")
>       
>       Cluster method   : average 
>       Distance         : bray 
>       Number of objects: 8036
> print(dend1 <- as.dendrogram(clusa))
>       'dendrogram' with 2 branches and 8036 members total, at height 0.3194225
> dend2 <- cut(dend1, h=0.07)
> 
> 
> a complete run with plots is available here :  
> 
> http://nbviewer.ipython.org/5492912
> 
> I'm trying to group together the species (idcodes) that share
> similar environmental parameters.
> 
> For example (looking at the plots), I should be able to retrieve the list of
> idcodes for each branch at "cut-level" X.
> 
> in the example :  
> 
> 
> X = 0.07 
> 
> branch1 : [idcodeA, .. .. ,idcodeJ]
> ..
> ..
> branch6 : [idcodeB, .. .. , idcodeK]
> 
> 
> 
> Many thanks for your precious help!!!
> 
> Massimo.
> 
> 
> 
>       [[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

