Hi Roy (& others)
Many thanks for the advice - well taken. Thanks also to the others who
have responded so quickly - I thought I might have to wait days!! :-)
I'm on a Linux (Mint) machine. Below, I document three attempts: two
using officer and the last using textreadr.
My attempts so far using 'officer':
##################
(1) First Attempt:
# Load libraries
library(tcltk)
library(tidyverse)
library(officer)
setwd(tk_choose.dir())
doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)
files <- list.files(getwd(), ".docx")
files
length(files)
## This works up to here: I obtain a list of the 9 docx files in the
## directory 'TEST'. However, the next line
doc_in <- read_docx(files)
results in this error:
Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
  'length = 9' in coercion to 'logical(1)'
No idea how to debug that.
Even when trying Calum's suggestion with officer:
content <- officer::docx_summary("Now they want us to charge our electric cars from litter bins.docx") # A title of one of the articles
The error returned is:
Error in x$doc_obj : $ operator is invalid for atomic vectors
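On the first error, a possible reading that I have not confirmed: read_docx() seems to expect a single file path, and I handed it all 9 at once. The 'filetype' in the message also looks as though it may come from textreadr's read_docx() rather than officer's, if textreadr was still attached in my session. A minimal sketch that reads one file at a time and names officer explicitly (read_all_docx is a made-up helper name, not part of either package):

```r
library(officer)

# Hypothetical helper: read every .docx in a folder, one file at a time.
read_all_docx <- function(folder) {
  # "\\.docx$" matches only names ending in .docx; in a bare ".docx"
  # pattern the "." is a regex wildcard that matches any character
  paths <- list.files(folder, pattern = "\\.docx$", full.names = TRUE)
  # officer::read_docx() takes one path, so lapply() calls it once per file
  docs <- lapply(paths, officer::read_docx)
  names(docs) <- basename(paths)
  docs
}
```

Something like docs <- read_all_docx(getwd()) would then give a list with one element per article.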
##################
(2) Second Attempt:
# Load libraries
library(tcltk)
library(tidyverse)
library(officer)
setwd(tk_choose.dir())
doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)
files <- list.files(getwd(), ".docx")
files
length(files)
docx_summary(doc_path, preserve = FALSE)
## At this point, the error is:
## Error in x$doc_obj : $ operator is invalid for atomic vectors
So, I am not sure how I am passing an atomic vector, or whether there is
something I am supposed to set to make this something else.
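My best guess, untested on these files: docx_summary() wants the object returned by read_docx() rather than the path itself, so the character path (an atomic vector) is what triggers the `$` error. A sketch under that assumption (summarise_docx is a hypothetical helper name):

```r
library(officer)

# Hypothetical helper: one data frame of document content per file.
summarise_docx <- function(path) {
  doc <- officer::read_docx(path)    # rdocx object for a single file
  out <- officer::docx_summary(doc)  # data frame with a 'text' column
  out$file <- basename(path)         # record which article each row came from
  out
}
```

Applied over the folder, this would be summaries <- lapply(doc_path, summarise_docx).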
##################
(3) Third attempt - now trying with textreadr (Thanks for the help on
installing this, Calum):
# Load libraries
library(tcltk)
library(tidyverse)
library(textreadr)
folder <- setwd(tk_choose.dir())
files <- list.files(folder, ".docx")
files
length(files)
doc <- read_docx("Now they want us to charge our electric cars from litter bins.docx") # One of the 9 files in the folder
read_docx(doc, skip = 0, remove.empty = TRUE, trim = TRUE) # To test against one file
## The last line returns the following error:
## Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
##   'length = 38' in coercion to 'logical(1)'
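Looking at it again, a likely cause (an assumption on my part): the first call already returned the article text as a character vector, and the second call then treats those 38 lines as file names. Note also that setwd() returns the previous working directory, so 'folder' above may not be the folder I chose. Calling textreadr's read_docx() once per file might be all that is needed (read_article is a hypothetical helper name):

```r
# Needs the textreadr package to be installed.
# Hypothetical helper: read one article as a character vector of lines.
read_article <- function(path) {
  # Guard against passing several paths, or text instead of a path
  stopifnot(length(path) == 1, file.exists(path))
  textreadr::read_docx(path, skip = 0, remove.empty = TRUE, trim = TRUE)
}
```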
##################
So I am going around in circles and am not at all clear on how to make
progress.
I am sure there must be a way, but the suggestions I have found online
each lead to the errors above.
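For the keyword step itself, once the text is in R, something like the following might do. It assumes a Subject line of the form "Subject: FAST FASHION (72%); RETAIL (45%)", which is a guess at the Lexis+ layout beyond the single example in my original message; parse_subject is a hypothetical helper name.

```r
# Hypothetical helper: pull "KEYWORD (NN%)" pairs out of one Subject line
# and keep only those at or above a coverage threshold (default 50%).
parse_subject <- function(subject, threshold = 50) {
  # Drop a leading "Subject:" label if present
  subject <- sub("^\\s*Subject:\\s*", "", subject)
  # Each match looks like "FAST FASHION (72%)"
  hits <- regmatches(subject, gregexpr("[^;,()]+\\(\\d+%\\)", subject))[[1]]
  keyword <- trimws(sub("\\s*\\(\\d+%\\)$", "", hits))
  coverage <- as.numeric(sub(".*\\((\\d+)%\\)$", "\\1", hits))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}
```

An article whose result has zero rows would then be skipped, and the rest appended to the spreadsheet.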
Thanks for any further help.
Best wishes, and thanks
Andy
On 29/12/2023 18:25, Roy Mendelssohn - NOAA Federal wrote:
> Hi Andy:
>
> I don’t have an answer but I do have what I hope is some friendly advice.
> Generally the more information you can provide, the more likely you will get
> help that is useful. In your case you say that you tried several packages
> and they didn’t do what you wanted. Providing that code, as well as why
> they didn’t do what you wanted (be specific) would greatly facilitate things.
>
> Happy new year,
>
> -Roy
>
>
>> On Dec 29, 2023, at 10:14 AM, Andy <[email protected]> wrote:
>>
>> Hello
>>
>> I am trying to work through a problem, but feel like I've gone down a rabbit
>> hole. I'd very much appreciate any help.
>>
>> The task: I have several directories of multiple (some directories, up to
>> 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that I want
>> to iterate through to append to a spreadsheet only those articles that
>> satisfy a condition (i.e., a specific keyword is present for >= 50% coverage
>> of the subject matter). Lexis+ has a very specific structure and keywords
>> are given in the row "Subject".
>>
>> I'd like to be able to accomplish the following:
>>
>> (1) Append the title, the month, the author, the number of words, and page
>> number(s) to a spreadsheet
>>
>> (2) Read each article and extract keywords (in the docs, these are listed in
>> 'Subject' section as a list of keywords with a percentage showing the extent
>> to which the keyword features in the article (e.g., FAST FASHION (72%)) and
>> to append the keyword and the % coverage to the same row in the spreadsheet.
>> However, I want to ensure that the keyword coverage meets the threshold of
>> >= 50%; if not, then pass onto the next article in the directory. Rinse and
>> repeat for the entire directory.
>>
>> So far, I've tried working through some Stack Overflow-based solutions, but
>> most seem to use the textreadr package, which is now deprecated; others use
>> either the officer or the officedown packages. However, these packages don't
>> appear to do what I want the program to do, at least not in any of the
>> examples I have found, nor in the vignettes and relevant package manuals
>> I've looked at.
>>
>> The first point is, is what I am intending to do even possible using R? If
>> it is, then where do I start with this? If these docx files were converted
>> to UTF-8 plain text, would that make the task easier?
>>
>> I am not a confident coder, and am really only just getting my head around R
>> so appreciate a steep learning curve ahead, but of course, I don't know what
>> I don't know, so any pointers in the right direction would be a big help.
>>
>> Many thanks in anticipation
>>
>> Andy
>>
>> ______________________________________________
>> [email protected] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.