Dear R Users,

This is a follow-up to one of my older messages. Although I did not receive 
useful suggestions back then, I did some research on my own and found an R 
package and some references in the literature. I hope that this information is 
useful to others as well.

### Tree Balance

The problem can be addressed with indexes that measure tree balance. The 
mdendro package implements 2 such indexes: a chaining coefficient and a tree 
balance coefficient, the latter being in effect the entropy of the tree structure.
https://cran.r-project.org/web/packages/mdendro/index.html
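
A minimal base R sketch may make the entropy idea concrete. It is an illustration only, not the mdendro implementation, and mdendro may define or normalise its coefficient differently. The function names merge.sizes and balance.entropy.sketch are hypothetical; the only assumption is a standard hclust object, whose merge matrix stores leaves as negative entries and earlier merges as positive row indexes. For each merge, the sketch takes the Shannon entropy (in bits) of the two branch proportions and averages over all merges, so a value of 1 means every split is even, and lower values indicate chaining.

```r
# Number of leaves on each side of every merge of an hclust tree.
merge.sizes = function(tree) {
    m = tree$merge
    n = nrow(m)
    size  = matrix(0L, nrow = n, ncol = 2)
    total = integer(n)  # leaves below the node created by merge i
    for (i in seq_len(n)) {
        for (j in 1:2) {
            id = m[i, j]
            # negative id: single observation; positive id: earlier merge
            size[i, j] = if (id < 0) 1L else total[id]
        }
        total[i] = size[i, 1] + size[i, 2]
    }
    size
}

# Mean Shannon entropy (bits) of the branch proportions over all merges.
# Hypothetical helper; NOT the coefficient as computed by mdendro.
balance.entropy.sketch = function(tree) {
    sz = merge.sizes(tree)
    p  = sz / rowSums(sz)
    H  = - rowSums(p * log2(p))
    mean(H)
}

# Toy data: two tight pairs (balanced) vs. points that chain one by one.
tree.balanced = hclust(dist(c(0, 1, 10, 11)), method = "complete")
tree.chained  = hclust(dist(c(0, 1, 3, 7, 15)), method = "single")

balance.entropy.sketch(tree.balanced)  # 1: every merge joins equal halves
balance.entropy.sketch(tree.chained)   # below 1: leaves are added one at a time
```

The same helper works on any hclust result, so the index can be compared across linkage methods on the same distance matrix.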

The paper referenced by mdendro has since been published (it was not 
accessible a few days ago):
Fernández, A., & Gómez, S. (2025). mdendro: An R Package for Extended 
Agglomerative Hierarchical Clustering. Journal of Statistical Software, 114(2), 
1–26. https://doi.org/10.18637/jss.v114.i02


The other reference is much older (and I do not have access to it):
Williams W, Lambert J, Lance G (1966). "Multivariate Methods in Plant Ecology: 
V. Similarity Analyses and Information-Analysis." Journal of Ecology, 54(2), 
427–445. doi:10.2307/2257960.
[It may be possible to read the article online but not to download it; I have 
not fully checked.]

I have included some functions in my own R code on GitHub, like index.chaining 
and index.entropy:
https://github.com/discoleo/PeptideClassifier/blob/main/R/Helper.Tree.Analysis.R

I hope this info is useful to others as well.

Sincerely,

Leonard

Initial Message:
https://stat.ethz.ch/pipermail/r-help/2025-August/481164.html

________________________________
From: Leo Mada <[email protected]>
Sent: Monday, August 25, 2025 12:28 AM
To: Leo Mada via R-help <[email protected]>
Subject: Branch Ratios & other Indexes for Trees/Dendrograms

Dear R-Users,

I have another question regarding trees (dendrograms).

After exploring the various hierarchical clustering methods, it seems that some 
of the methods (average, single, median) sequentially add very small clusters 
(even single leaves) to an increasingly larger branch.

I would like to quantify this more rigorously. I do not think that banner plots 
fully capture this behaviour, as they are limited to the height of the node 
where a leaf binds.

I came up with 2 alternative measures:
- Ratio of the number of leaves on the larger branch of a node to the number on 
the other branch (see function branch.ratios);
- Size of the other branch at each node where a single leaf binds.

The latter resembles the banner plot and is likewise limited to nodes where a 
leaf binds.
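
For readers without the GitHub code at hand, the two measures can be sketched in base R as follows. This is an independent illustration under stated assumptions, not the code of branch.ratios or size.leafBranch from the repository, which may differ in details. The names side.sizes, branch.ratios.sketch and leaf.partner.size.sketch are hypothetical, and the built-in USArrests data set stands in for the pre-computed trees.

```r
# Number of leaves on each side of every merge of an hclust tree.
side.sizes = function(tree) {
    m = tree$merge
    n = nrow(m)
    size  = matrix(0L, nrow = n, ncol = 2)
    total = integer(n)  # leaves below the node created by merge i
    for (i in seq_len(n)) {
        for (j in 1:2) {
            id = m[i, j]
            # negative id: single observation; positive id: earlier merge
            size[i, j] = if (id < 0) 1L else total[id]
        }
        total[i] = size[i, 1] + size[i, 2]
    }
    size
}

# Measure 1 (sketch): leaves on the larger branch / leaves on the smaller one.
branch.ratios.sketch = function(tree) {
    sz = side.sizes(tree)
    pmax(sz[, 1], sz[, 2]) / pmin(sz[, 1], sz[, 2])
}

# Measure 2 (sketch): at merges where at least one side is a single leaf,
# the size of the other branch (1 if two leaves are joined).
leaf.partner.size.sketch = function(tree) {
    sz = side.sizes(tree)
    is.leaf.merge = (sz[, 1] == 1) | (sz[, 2] == 1)
    pmax(sz[is.leaf.merge, 1], sz[is.leaf.merge, 2])
}

# Stand-in data (assumption): USArrests instead of the peptide trees.
d  = dist(scale(USArrests))
t1 = hclust(d, method = "ward.D")
t2 = hclust(d, method = "average")

print(summary(branch.ratios.sketch(t1)))
print(summary(branch.ratios.sketch(t2)))
print(summary(leaf.partner.size.sketch(t2)))
```

Passing the result of branch.ratios.sketch to hist() would give histograms comparable in spirit to those described in this message.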

Can anyone point me to such indexes in the literature and/or in other R 
packages?

I am not an expert in the field. Searching for cluster indexes will likely 
return a huge number of false-positive results (i.e. indexes for choosing the 
number of clusters).

An example of this functionality is given below:

# Pre-computed Trees:
x1 = readRDS("Tree.Full.M_ward.D.rds")
x2 = readRDS("Tree.Full.M_average.rds")

br1 = branch.ratios(x1)
br2 = branch.ratios(x2)
# Alternative: size.leafBranch(x1);

par.old = par(mfrow = c(1,2))
hist(br1);
# Branch Ratio goes up to 1300!
hist(br2);
par(par.old)

# Note: Median & centroid are even more extreme!

The data sets and functions are on GitHub:
https://github.com/discoleo/PeptideClassifier/tree/main/inst/examples

Functions: branch.ratios, size.leafBranch, count.nodes;
https://github.com/discoleo/PeptideClassifier/blob/main/R/Helper.Tree.R

I have attached an image to this mail with all 8 histograms. The image is also 
available on GitHub:
https://github.com/discoleo/PeptideClassifier/blob/main/Trees.BranchRatios.png

Many thanks in advance,

Leonard



