On 2021-01-07 11:34 +1100, Jim Lemon wrote:
> On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud
> <gob.alling...@gmail.com> wrote:
> >
> > Hello all,
> >
> > I have asked this question on many forums without response, and although
> > I've made progress myself, I am stuck on how to respond to a particular
> > error message.
> >
> > I have a question about text-analysis packages and code. The general
> > idea is that I am trying to perform readability analyses on a collection
> > of about 4,000 Word files. I would like to run any of a number of such
> > analyses, but the problem right now is getting R to recognize the
> > uploaded files as data ready for analysis; so far I have only been
> > getting error messages. Let me show what I have done. I have three
> > separate commands because I split the collection of 4,000 files into
> > three roughly equal folders; evidently, the collection was too
> > voluminous to be read in its entirety. The folders are called ‘WPSCASES’
> > one through three. Here is my code, with the error message for each
> > command recorded below:
> >
> > token <-
> >   tokenize("/Users/Gordon/Desktop/WPSCASESONE/", lang = "en",
> >            doc_id = "sample")
> >
> > The code is the same for the other folders; the name of the folder
> > differs, but the call is otherwise identical.
> >
> > The error message reads:
> >
> > Error in nchar(tagged.text[, "token"], type = "width") :
> >   invalid multibyte string, element 348
> >
> > The error messages are the same for the other two commands, but the
> > 'element' number differs: 925 for the second folder and 4302 for the
> > third.
> >
> > token2 <-
> >   tokenize("/Users/Gordon/Desktop/WPSCASES2/", lang = "en",
> >            doc_id = "sample")
> >
> > token3 <-
> >   tokenize("/Users/Gordon/Desktop/WPSCASES3/", lang = "en",
> >            doc_id = "sample")
> >
> > These are the other commands, if that's helpful.
> >
> > I've tried to discover whether the 'element' that the error message
> > mentions corresponds to the file at that position in the folder's
> > order, but since folder 3 does not have 4,300 files in it, that seems
> > unlikely. Please let me know if you can figure out how to fix this so
> > that I can start to use ‘koRpus’ commands like ‘readability’ and its
> > progeny.
> >
> > Thank you,
> > Gordon
>
> Hi Gordon,
> It looks to me as though you may have to extract the text from the
> Word files first: Export As Text.
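Hi!

First, a note on what the error itself says: nchar() failed on the
"token" column of the tagged text, so «element 348» most likely refers
to the 348th token, not the 348th file -- which would explain why
folder 3 can report element 4302 while holding fewer files than that.
"Invalid multibyte string" generally means some input contains byte
sequences that are not valid UTF-8; Word's curly quotes in a non-UTF-8
export are a classic cause. Once you have plain-text copies of the
files, as Jim suggests, you can hunt for the offending files from base
R. A minimal, untested sketch -- the path and the .txt pattern are my
assumptions, not your actual layout:

  ## flag files containing byte sequences that are not valid UTF-8
  files <- list.files("/Users/Gordon/Desktop/WPSCASESONE",
                      pattern = "\\.txt$", full.names = TRUE)
  bad <- Filter(function(f) {
    txt <- readLines(f, warn = FALSE)
    any(!validUTF8(txt))  # validUTF8() is in base R since 3.3.0
  }, files)
  print(bad)

If you would rather repair than hunt, iconv(txt, from = "", to =
"UTF-8", sub = "byte") replaces undecodable bytes instead of choking
on them.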
As for which tokenize() this is: quanteda::tokenize() says it needs a
character vector or «corpus» as input,

  https://www.rdocumentation.org/packages/quanteda/versions/0.99.12/topics/tokenize

... or is this tokenize() from the tokenizers package? I found
something about «doc_id» here:

  https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html

(Since you mention ‘koRpus’, it may well be koRpus::tokenize(), which
also takes lang and doc_id arguments.) Either way, these functions
want text, not Word documents. You can convert docx to markdown (or
plain text) using pandoc:

  pandoc --from docx --to markdown $inputfile

odt also works, as do many other formats. I believe pandoc is included
with RStudio, but I have never used it from there myself, so treat
that as unverified.

To read the older .doc format, I use wvHtml:

  wvHtml $inputfile - 2> /dev/null | w3m -dump -T text/html
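If you would rather not convert 4,000 files by hand, pandoc can be
driven from R with system2(). A rough, untested sketch -- it assumes
pandoc is on your PATH and reuses the folder name from Gordon's mail:

  ## convert every .docx in the folder to UTF-8 plain text
  docx <- list.files("/Users/Gordon/Desktop/WPSCASESONE",
                     pattern = "\\.docx$", full.names = TRUE)
  for (f in docx) {
    out <- sub("\\.docx$", ".txt", f)
    system2("pandoc", args = c("--from", "docx", "--to", "plain",
                               "--output", shQuote(out), shQuote(f)))
  }

After that, pointing tokenize() at the folder of .txt files should at
least get you past the file-format problem.

Rasmus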