Hello Sun,

The order of the tm transformations makes a big difference.

It isn't a shortcut, but if you identify all the names you could create your own stop-word list:

corpus <- tm_map(corpus, removeWords, c("name1", "name2"))   # substitute the names you identified
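
To make that self-contained (a rough sketch only; the toy documents and the words in myStops are placeholders for your own data):

library(tm)

docs <- c("The walls of York are Roman.", "New York is not York.")   # toy documents
corpus <- Corpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))   # usual cleanup first
corpus <- tm_map(corpus, removePunctuation)

myStops <- c("york")                                     # plus whatever other names you identified
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), myStops))

inspect(corpus)

Because removeWords runs after lowercasing here, the custom list needs to be lower case too.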

In the case of York, Key Word in Context (KWIC) searches could be used to check how certain words are used. You could identify the usages you want to remove or retain and rename the relevant instances accordingly, as in the sketch below.
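
For instance, if the problem is separating the city of York from "New York" (my assumption, adjust the patterns to your own texts), a crude way to do the renaming is to join the instances you want to keep into a single token before removing the bare word:

txt <- c("She flew to New York.", "The walls of York are Roman.")
txt <- gsub("New York", "New_York", txt, fixed = TRUE)   # retain these instances under a new token
library(tm)
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removeWords, "York")            # removes only the bare instances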

This is labour intensive, but Gries, in his Quantitative Corpus Linguistics, notes that time spent trying to refine code can sometimes be better spent on manual analysis (p. 164). That book includes a KWIC-type function (p. 127), but I haven't been able to work out how to modify it to read more than six words either side of the specified word; six should be adequate for your purpose. Jockers' book also includes a KWIC function, but I don't believe it searches the entire corpus, only a specified text.

I recently checked and tm doesn't have a KWIC function, but for the R-talented (which excludes me) it might be possible to write one. For example, Jim Holtman once wrote a KWIC function to identify word use in a CSV file.
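
As a very rough sketch of the sort of thing I mean (base R only, with a window argument so it isn't tied to six words either side; the function name and defaults are mine, not from any package):

kwic <- function(text, word, window = 6) {
  toks <- unlist(strsplit(text, "\\s+"))                      # crude whitespace tokeniser
  hits <- which(tolower(gsub("[[:punct:]]", "", toks)) == tolower(word))
  for (i in hits) {
    left  <- if (i > 1) toks[max(1, i - window):(i - 1)] else character(0)
    right <- if (i < length(toks)) toks[(i + 1):min(length(toks), i + window)] else character(0)
    cat(paste(left, collapse = " "), "[", toks[i], "]", paste(right, collapse = " "), "\n")
  }
}

# e.g. kwic("the walls of York are Roman and the minster at York is gothic", "york", window = 3)

It only reads a single character string, so you would have to paste a whole document (or the whole corpus) together first.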

Bob
