Hello all,

I am doing some text mining on a set of five plain text files and have run into a snag: when I run hclust, there are far too many leaves for anything to be legible. The plot comes out as a solid black line.
The texts have been converted into a TDM with dimensions 5,292 x 5 (5,292 terms, 5 documents). My code for removing sparse terms is as follows:

> tdm2 <- removeSparseTerms(tdm, sparse=0.99999)
> inspect(tdm2)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 10415/16045
Sparsity           : 61%
Maximal term length: 22
Weighting          : term frequency (tf)

The tf-idf weighted matrix, with the same 0.99999 sparsity threshold, returns this:

> inspect(tdm.tfidf)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 7915/18545
Sparsity           : 70%
Maximal term length: 22
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

I have experimented with lowering the sparse threshold, and that helps a bit. For example:

> tdm2 <- removeSparseTerms(tdm, sparse=0.215)
> inspect(tdm2)
<<TermDocumentMatrix (terms: 869, documents: 5)>>
Non-/sparse entries: 3976/369
Sparsity           : 8%
Maximal term length: 14
Weighting          : term frequency (tf)

But no matter what I do, the resulting plot is unreadable. The code for plotting the cluster is:

> hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
> plot(hc, yaxt = 'n', main = "Hierarchical clustering")

Could someone kindly advise me what I am doing wrong and/or point me to some detailed information on how to fix this?

Many thanks in anticipation.

Andy
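
P.S. In case it helps anyone reproduce the symptom without my files, here is a minimal sketch using base R only. The matrix `m` is a made-up stand-in for the real TDM (random counts, 869 terms by 5 documents), and the names `hc_terms`, `hc_docs` and `groups` are purely illustrative. It rests on the fact that dist() computes distances between the rows of its input, so a term-by-document matrix would give one leaf per term. Transposing, or cutting the tree with cutree(), are two possible ways to get something legible, not necessarily the right fix for my data.

```r
# Made-up stand-in for the term-document matrix: 869 terms x 5 documents.
set.seed(1)
m <- matrix(rpois(869 * 5, lambda = 2), nrow = 869, ncol = 5,
            dimnames = list(paste0("term", 1:869), paste0("doc", 1:5)))

# dist() measures distances between ROWS, so this clusters the terms
# and the dendrogram has one leaf per term.
hc_terms <- hclust(dist(m, method = "euclidean"), method = "complete")
n_term_leaves <- length(hc_terms$order)

# Transposing puts documents in the rows, so the tree has one leaf per document.
hc_docs <- hclust(dist(t(m), method = "euclidean"), method = "complete")
n_doc_leaves <- length(hc_docs$order)

# If the terms are what should be clustered, cutting the tree into k groups
# (k = 10 is an arbitrary choice here) summarises it without labelling every leaf.
groups <- cutree(hc_terms, k = 10)
group_sizes <- table(groups)
```

With these objects, plot(hc_docs) would draw a five-leaf dendrogram, and group_sizes shows how many terms fall into each of the ten groups.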