Hi Andrew,

From your last email the answer to your problem may be the findFreqTerms() function. Just increase the number of times a term has to appear and check the result until you get the matrix size that you want.
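
A minimal sketch of that idea, under some assumptions: the matrix m below is made-up data standing in for as.matrix() of the real TDM, and top_n is an illustrative name, not something from the thread. It ranks terms by total frequency, keeps the top 30, and clusters only those. The tm calls in the final comment are one plausible way to do the same by threshold and are not run here.

```r
# Hypothetical stand-in for as.matrix(tdm): 200 terms (rows) x 5 documents (columns).
set.seed(1)
m <- matrix(rpois(200 * 5, lambda = 3), nrow = 200,
            dimnames = list(paste0("term", 1:200), paste0("doc", 1:5)))

# Total frequency of each term across all documents, most frequent first.
freq <- sort(rowSums(m), decreasing = TRUE)

# Keep the 30 most frequent terms (ties at the cut-off follow sort order).
top_n <- 30
m_top <- m[names(freq)[seq_len(top_n)], , drop = FALSE]

# Cluster the reduced matrix; with only 30 leaves the labels remain readable.
hc <- hclust(dist(m_top, method = "euclidean"), method = "complete")
plot(hc, main = "Hierarchical clustering of top terms")

# Threshold-based route with the tm package (assumed usage, not run):
# keep  <- findFreqTerms(tdm, lowfreq = 50)  # raise lowfreq until length(keep) is small
# m_top <- as.matrix(tdm[keep, ])
```

The same ranking should work on a tf-idf weighted matrix, although the row sums are then summed weights rather than raw counts.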
Jim

On Fri, Sep 18, 2020 at 5:32 PM Andrew <phaedr...@gmail.com> wrote:
>
> Hi Abby
>
> Many thanks for reaching out with an offer of help. Very much appreciated.
>
> (1) The packages I'm using are 'tm' for text-mining and the TDM and for
> the clustering it is 'cluster'
> (2) Not sure where the problem is happening as it doesn't show up as an
> error. Where it manifests is in the plotting, however logic would
> suggest that it concerns the removal of sparse terms, so that would be
> in the TDM process
> (3) I don't think I can provide a reproducible example. When I practice
> using data sets that packages provide, all is fine. The trouble is when
> I apply it to my own data sets which are five documents, etc., as described.
>
> I think the nub of it is really to find a way that I can subset the TDM
> to return the twenty or thirty most frequently used words, and then to
> plot those using hclust. However, when searching on-line I haven't been
> able to find any suggestions on how to do that, nor is there any mention
> of using that approach in the books and tutorials I have.
>
> If you (or someone on this list) can advise on how I can sort the terms
> in the TDM from most to least frequent, and then to subset the top
> twenty or thirty most frequently occurring terms (preferably using tf as
> well as tf-idf) and then I can plot that sub-set, then I think that that
> would do the trick, and the terms would be plotted clearly and legibly.
>
> Thanks again for your offer of help. I hope that my reply helps clarify
> rather than muddy the situation.
>
> Best wishes
> Andy
>
>
> On 17/09/2020 08:43, Abby Spurdle wrote:
> > I'm not familiar with these subjects.
> > And hopefully, someone who is, will offer some better suggestions.
> >
> > But to get things started, maybe...
> > (1) What packages are you using (re: tdm)?
> > (2) Where does the problem happen, in dist, hclust, the plot method
> > for hclust, or in the package(s) you are using?
> > (3) Do you think you could produce a small reproducible example,
> > showing what is wrong, and explaining what you would like it to do instead?
> >
> > Note that if the problem relates to hclust, or the plot method, then
> > you should be able to produce a much simpler example.
> > e.g.
> >
> > mycount.matrix <- matrix (rpois (25000, 20),, 5)
> > head (mycount.matrix, 3)
> > tail (mycount.matrix, 3)
> >
> > plot (hclust (dist (mycount.matrix) ) )
> >
> > On Tue, Sep 15, 2020 at 6:54 AM Andrew <phaedr...@gmail.com> wrote:
> >> Hello all
> >>
> >> I am doing some text mining on a set of five plain text files and have
> >> run into a snag when I run hclust in that there are just too many leaves
> >> for anything to be read. It returns a solid black line.
> >>
> >> The texts have been converted into a TDM which has a dim of 5,292 and 5
> >> (as per 5 docs).
> >>
> >> My code for removing sparsity is as follows:
> >>
> >> > tdm2 <- removeSparseTerms(tdm, sparse=0.99999)
> >>
> >> > inspect(tdm2)
> >>
> >> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
> >> Non-/sparse entries: 10415/16045
> >> Sparsity : 61%
> >> Maximal term length: 22
> >> Weighting : term frequency (tf)
> >>
> >> While the tf-idf weighting returns this when 0.99999 sparseness is removed:
> >>
> >> > inspect(tdm.tfidf)
> >> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
> >> Non-/sparse entries: 7915/18545
> >> Sparsity : 70%
> >> Maximal term length: 22
> >> Weighting : term frequency - inverse document frequency
> >> (normalized) (tf-idf)
> >>
> >> I have experimented by decreasing the value I use for decreasing
> >> sparseness, and that helps a bit, for example:
> >>
> >> > tdm2 <- removeSparseTerms(tdm, sparse=0.215)
> >> > inspect(tdm2)
> >> <<TermDocumentMatrix (terms: 869, documents: 5)>>
> >> Non-/sparse entries: 3976/369
> >> Sparsity : 8%
> >> Maximal term length: 14
> >> Weighting : term frequency (tf)
> >>
> >> But, no matter what I do, the resulting plot is unreadable.
> >> The code for plotting the cluster is:
> >>
> >> > hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
> >> > plot(hc, yaxt = 'n', main = "Hierarchical clustering")
> >>
> >> Can someone kindly either advise me what I am doing wrong and/or
> >> signpost me to some detailed info on how to fix this.
> >>
> >> Many thanks in anticipation.
> >>
> >> Andy
> >>
> >>
> >> [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.