[R] How to reduce the sparseness in a TDM to make a cluster plot readable?

Andrew Mon, 14 Sep 2020 11:54:29 -0700

Hello all

I am doing some text mining on a set of five plain text files and have 
run into a snag: when I run hclust, there are far too many leaves for 
any label to be read. The labels merge into a solid black line.


The texts have been converted into a TDM with dimensions 5,292 x 5 
(5,292 terms by 5 documents).

My code for removing sparsity is as follows:

 > tdm2 <- removeSparseTerms(tdm, sparse=0.99999)

 > inspect(tdm2)

<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 10415/16045
Sparsity           : 61%
Maximal term length: 22
Weighting          : term frequency (tf)

With tf-idf weighting, removing sparse terms at 0.99999 gives:

 > inspect(tdm.tfidf)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 7915/18545
Sparsity           : 70%
Maximal term length: 22
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

I have experimented with lowering the sparse value, and that helps a 
bit, for example:

 > tdm2 <- removeSparseTerms(tdm, sparse=0.215)
 > inspect(tdm2)
<<TermDocumentMatrix (terms: 869, documents: 5)>>
Non-/sparse entries: 3976/369
Sparsity           : 8%
Maximal term length: 14
Weighting          : term frequency (tf)
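
A note on why the two thresholds behave so differently. If the tm documentation is read correctly, removeSparseTerms keeps only terms whose sparsity (the share of documents in which the term does not occur) is below `sparse`. With five documents a term's sparsity can only be 0, 0.2, 0.4, 0.6 or 0.8, so 0.99999 removes nothing, while 0.215 keeps only terms found in at least four of the five documents. The base-R sketch below mimics that rule on a made-up matrix; `keep_terms` and all counts are hypothetical and are not part of tm.

```r
# Toy term-document matrix: 6 hypothetical terms x 5 documents (made-up counts)
m <- matrix(c(
  3, 1, 2, 4, 1,   # term1: present in 5 documents
  2, 0, 1, 1, 3,   # term2: present in 4 documents
  0, 0, 5, 1, 0,   # term3: present in 2 documents
  1, 0, 0, 0, 0,   # term4: present in 1 document
  0, 2, 2, 0, 1,   # term5: present in 3 documents
  4, 4, 0, 4, 4    # term6: present in 4 documents
), ncol = 5, byrow = TRUE,
dimnames = list(paste0("term", 1:6), paste0("doc", 1:5)))

# Sparsity of a term = fraction of documents in which it is absent
term_sparsity <- rowMeans(m == 0)

# Hypothetical helper approximating the rule described above:
# keep a term only if its sparsity is strictly below `sparse`
keep_terms <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

print(term_sparsity)
print(rownames(keep_terms(m, 0.99999)))  # no term is dropped at this threshold
print(rownames(keep_terms(m, 0.215)))    # only terms in at least 4 of 5 documents
```

On this reading, the only thresholds that change the result for five documents are those that cross one of the five possible sparsity values.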

But whatever value I use, the resulting plot is unreadable. The code for 
plotting the dendrogram is:

 > hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
 > plot(hc, yaxt = 'n', main = "Hierarchical clustering")
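
Note that dist() on a term-document matrix measures distances between rows, so each term becomes a leaf; even 869 leaves cannot be labelled legibly. A few possible ways to thin the plot are sketched below in base R. This is only an illustration under assumptions: `m` is random data standing in for as.matrix(tdm2), and `top_n = 40` and `k = 6` are arbitrary choices, not recommendations.

```r
set.seed(1)

# Stand-in for as.matrix(tdm2): 869 hypothetical terms x 5 documents
m <- matrix(rpois(869 * 5, lambda = 3), ncol = 5,
            dimnames = list(paste0("term", seq_len(869)),
                            paste0("doc", 1:5)))

# Option 1: cluster only the most frequent terms so labels stay legible
top_n <- 40
top_terms <- names(sort(rowSums(m), decreasing = TRUE))[seq_len(top_n)]
m_top <- m[top_terms, , drop = FALSE]
hc_top <- hclust(dist(m_top, method = "euclidean"), method = "complete")
plot(hc_top, cex = 0.7, main = "Hierarchical clustering (top 40 terms)")

# Option 2: cluster all terms, but report group sizes instead of
# drawing every leaf
hc_all <- hclust(dist(m, method = "euclidean"), method = "complete")
groups <- cutree(hc_all, k = 6)
print(table(groups))

# Option 3: cluster the documents rather than the terms (5 leaves)
hc_docs <- hclust(dist(t(m), method = "euclidean"), method = "complete")
plot(hc_docs, main = "Hierarchical clustering of documents")
```

Which option fits depends on whether the aim is to compare terms or to compare the five documents.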

Can someone kindly advise me what I am doing wrong and/or point me to 
some detailed information on how to fix this?

Many thanks in anticipation.

Andy



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
