Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

Andy Fri, 29 Dec 2023 12:18:13 -0800

Hi Roy (& others)

Many thanks for the advice - well taken. Thanks also to the others who 
have responded so quickly - I thought I might have to wait days!! :-)


I'm on a Linux (Mint) machine. Below, I document three attempts: two 
using officer and the last using textreadr.

My attempts so far using 'officer':

##################

(1) First Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)
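
A side note on the listing step: the `pattern` argument of `list.files()` is a regular expression, so ".docx" also matches names that merely contain "docx" after any character. A minimal base-R sketch, using a throwaway folder and made-up file names, of how an anchored pattern behaves differently:

```r
# Throwaway folder with one real .docx name and one look-alike (both made up).
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("article.docx", "mydocx_notes.txt"))))

# Unanchored pattern: "." matches any character, so both names are returned.
loose <- list.files(demo_dir, pattern = ".docx")

# Escaped dot plus end-of-string anchor: only the true .docx file is returned.
strict <- list.files(demo_dir, pattern = "\\.docx$")

loose
strict
```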

## This works up to here: I obtain a list of the 9 docx files in the
## directory 'TEST'. However, the next line
doc_in <- read_docx(files)

results in this error:

Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
  'length = 9' in coercion to 'logical(1)'

No idea how to debug that.
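
One hedged reading of this error: the message quotes a check of the form `filetype %in% c("docx") && grepl(...)`, and `&&` accepts only a single TRUE/FALSE, so a vector of 9 file names cannot go in at once. The sketch below uses a made-up `read_one()` function (not part of officer or textreadr) containing the same kind of check, and shows that calling it once per path with `lapply()` sidesteps the problem. The file names are invented.

```r
# Hypothetical stand-in for a reader that expects exactly one path.
read_one <- function(file) {
  ext <- tools::file_ext(file)
  # `&&` needs a length-one condition; since R 4.3.0 a longer one is an error.
  if (ext %in% "docx" && !grepl("^([fh]ttp)", file)) {
    paste("would read", basename(file))
  } else {
    NA_character_
  }
}

paths <- c("a.docx", "b.docx", "c.docx")  # made-up file names

# Passing the whole vector at once triggers the same kind of error on
# R >= 4.3.0 (older versions warn or silently use the first element).
whole_vector <- tryCatch(read_one(paths), error = function(e) conditionMessage(e))
print(whole_vector)

# Calling the reader once per path avoids the problem.
out <- lapply(paths, read_one)
names(out) <- basename(paths)
str(out)
```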

Even when trying Calum's suggestion with officer:

# A title of one of the articles
content <- officer::docx_summary("Now they want us to charge our electric cars from litter bins.docx")

The error returned is:

Error in x$doc_obj : $ operator is invalid for atomic vectors


##################
(2) Second Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)

docx_summary(doc_path, preserve = FALSE)
## At this point, the error is:
## Error in x$doc_obj : $ operator is invalid for atomic vectors

So, not sure how I am passing an atomic vector or if there is something 
I am supposed to set to make this something else?
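
A possible explanation, offered as a sketch and not as a confirmed diagnosis: a file name is a character vector, which R calls an atomic vector, whereas `docx_summary()` works on the document object that officer's `read_docx()` returns. The example below assumes officer is installed, writes two small throwaway documents so that it is self-contained, then reads and summarises each path in turn. The folder and file names are made up.

```r
library(officer)

# Build two small throwaway .docx files so the sketch is self-contained.
tmp <- file.path(tempdir(), "docx_demo")
dir.create(tmp, showWarnings = FALSE)
for (nm in c("first.docx", "second.docx")) {
  d <- officer::read_docx()  # new document based on officer's default template
  d <- officer::body_add_par(d, paste("Body text of", nm))
  print(d, target = file.path(tmp, nm))  # writes the file to disk
}

doc_path <- list.files(tmp, pattern = "\\.docx$", full.names = TRUE)

# read_docx() takes one path at a time; docx_summary() takes the object it
# returns, not the file name itself.
summaries <- lapply(doc_path, function(p) {
  officer::docx_summary(officer::read_docx(p))
})
names(summaries) <- basename(doc_path)

# Each element is a data frame with one row per document element.
summaries[["first.docx"]]$text
```

Writing `officer::read_docx()` with the package prefix guards against another loaded package supplying a function of the same name.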

##################
(3) Third attempt - now trying with textreadr (Thanks for the help on 
installing this, Calum):

# Load libraries
library(tcltk)
library(tidyverse)
library(textreadr)

folder <- setwd(tk_choose.dir())

files <- list.files(folder, ".docx")
files
length(files)
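
One detail that may matter here: `setwd()` returns the directory that was current before the change, not the new one, so `folder` above would point at the previous location. A minimal base-R sketch with a made-up temporary folder standing in for the one chosen in the dialog:

```r
start_dir <- getwd()

# Hypothetical folder standing in for the one picked with tk_choose.dir().
target <- file.path(tempdir(), "setwd_demo")
dir.create(target, showWarnings = FALSE)
invisible(file.create(file.path(target, "example.docx")))

# setwd() hands back the directory we just left.
returned <- setwd(target)
identical(returned, start_dir)

# Listing the directory we are now in finds the file.
files <- list.files(getwd(), pattern = "\\.docx$")
files

# Restore the original working directory.
setwd(start_dir)
```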

# One of the 9 files in the folder
doc <- read_docx("Now they want us to charge our electric cars from litter bins.docx")

# To test against one file
read_docx(doc, skip = 0, remove.empty = TRUE, trim = TRUE)

## The last line returns the following error:
## Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
##   'length = 38' in coercion to 'logical(1)'
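
A tentative reading of the 'length = 38' message: the first `read_docx()` call has already returned the article text, presumably one element per paragraph, so the second call receives 38 strings where it expects a single file name. The base-R sketch below uses invented text as a stand-in for `doc` and shows that the text can be searched directly without a second read.

```r
# Stand-in for the text returned by the first read_docx() call (made-up content).
doc <- c(sprintf("Paragraph %d of the article", 1:37),
         "Subject: FAST FASHION (72%)")

# Treating every paragraph as a file name yields 38 extension values where a
# single one is expected, which matches the length in the error message.
length(tools::file_ext(doc))

# The text is already in `doc`, so it can be searched as it stands.
subject_line <- grep("^Subject", doc, value = TRUE)
subject_line
```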

##################
And so I am going around in circles and not at all clear on how I can 
make progress.

I am sure that there must be a way, but each of the suggestions I have 
found online leads to one of the above errors.

Thanks for any further help.

Best wishes, and thanks
Andy


On 29/12/2023 18:25, Roy Mendelssohn - NOAA Federal wrote:
> Hi Andy:
>
> I don’t have an answer but I do have what I hope is some friendly advice.  
> Generally the more information you can provide,  the more likely you will get 
> help that is useful.  In your case you say that you tried several packages 
> and they didn’t do what you wanted.  Providing that code,  as well as why 
> they didn’t do what you wanted (be specific)  would greatly facilitate things.
>
> Happy new year,
>
> -Roy
>
>
>> On Dec 29, 2023, at 10:14 AM, Andy<[email protected]>  wrote:
>>
>> Hello
>>
>> I am trying to work through a problem, but feel like I've gone down a rabbit 
>> hole. I'd very much appreciate any help.
>>
>> The task: I have several directories of multiple (some directories, up to 
>> 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that I want 
>> to iterate through to append to a spreadsheet only those articles that 
>> satisfy a condition (i.e., a specific keyword is present for >= 50% coverage 
>> of the subject matter). Lexis+ has a very specific structure and keywords 
>> are given in the row "Subject".
>>
>> I'd like to be able to accomplish the following:
>>
>> (1) Append the title, the month, the author, the number of words, and page 
>> number(s) to a spreadsheet
>>
>> (2) Read each article and extract keywords (in the docs, these are listed in 
>> 'Subject' section as a list of keywords with a percentage showing the extent 
>> to which the keyword features in the article (e.g., FAST FASHION (72%)) and 
>> to append the keyword and the % coverage to the same row in the spreadsheet. 
>> However, I want to ensure that the keyword coverage meets the threshold of 
>> >= 50%; if not, then pass onto the next article in the directory. Rinse and 
>> repeat for the entire directory.
>>
>> So far, I've tried working through some Stack Overflow-based solutions, but 
>> most seem to use the textreadr package, which is now deprecated; others use 
>> either the officer or the officedown packages. However, these packages don't 
>> appear to do what I want the program to do, at least not in any of the 
>> examples I have found, nor in the vignettes and relevant package manuals 
>> I've looked at.
>>
>> The first point is, is what I am intending to do even possible using R? If 
>> it is, then where do I start with this? If these docx files were converted 
>> to UTF-8 plain text, would that make the task easier?
>>
>> I am not a confident coder, and am really only just getting my head around R 
>> so appreciate a steep learning curve ahead, but of course, I don't know what 
>> I don't know, so any pointers in the right direction would be a big help.
>>
>> Many thanks in anticipation
>>
>> Andy
>>
>> ______________________________________________
>> [email protected]  mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
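
For the keyword step described in the original request above, a base-R sketch of one way to pull keywords and percentages out of a 'Subject' string and keep only those at or above 50%. It assumes the entries look like `FAST FASHION (72%)` and are separated by semicolons; the real Lexis+ layout may differ. The function name `parse_subject()` and the sample string are made up for illustration.

```r
# Hypothetical helper: split a Subject string into keyword/coverage pairs and
# keep those whose coverage meets the threshold.
parse_subject <- function(subject, threshold = 50) {
  # Each hit is a run of text followed by a bracketed percentage, e.g. "X (72%)".
  hits <- regmatches(subject, gregexpr("[^;()]+\\([0-9]+%\\)", subject))[[1]]
  keyword <- trimws(sub("\\([0-9]+%\\)$", "", hits))
  coverage <- as.integer(sub("^.*\\(([0-9]+)%\\)$", "\\1", hits))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}

# Made-up example string in the assumed format.
subject <- "FAST FASHION (72%); RETAILERS (56%); SUSTAINABILITY (40%)"
res <- parse_subject(subject)
res
```

An article whose result has zero rows would be skipped. Otherwise the rows could be appended to a data frame and written out with `write.csv()` once the whole directory has been processed.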



