I'm trying to use the tm package to extract text from a corpus of documents.
I can read in a set of PDFs and get a corpus that is filtered to include only
the documents that contain a term, for example "hot water". I can also list
the documents using the names() function, but I cannot work out how to get
the matching lines out of the corpus.
I would like to get a corpus that has just the filtered content, i.e. the
lines containing the term.
I can manage to do this for a single document using something like:
library(tm)
library(tidyverse)
library(tidytext)
library(stringr)
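# Read every PDF in ./pdfs into a corpus
cname <- file.path(".", "pdfs")
docs <- Corpus(DirSource(cname), readerControl = list(reader = readPDF))
docs <- tm_map(docs, content_transformer(tolower))
search.par <- "18"
# Keep only the documents with at least one line matching the search term
docs_filtered <- docs %>%
  tm_filter(FUN = function(x) any(grepl(search.par, content(x))))
# Lines of the first filtered document that match the search term
content(docs_filtered[[1]])[grepl(search.par, content(docs_filtered[[1]]))]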
This gives me the lines that contain the term "18" in corpus document 1. Is
there any way to do this for all the corpus documents?
What I would like is something that holds, for each corpus document, the lines
containing the search parameter, so that they can be printed, at least to screen.
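
For illustration, here is a minimal sketch of one way the single-document step might be repeated over the whole corpus. The helper name matching_lines and the toy corpus are assumptions made up for this example (neither is part of tm), and the search term is treated as a fixed string rather than a regular expression. The content is also split on newlines, in case the PDF reader returns whole pages rather than single lines.

```r
library(tm)

# Hypothetical helper (not part of tm): for every document in a corpus,
# return the lines that contain `pattern` as a fixed string.
matching_lines <- function(corpus, pattern) {
  idx <- seq_len(length(corpus))
  hits <- lapply(idx, function(i) {
    # Split on newlines in case an element holds a whole page, not one line
    lines <- unlist(strsplit(as.character(content(corpus[[i]])), "\n", fixed = TRUE))
    lines[grepl(pattern, lines, fixed = TRUE)]
  })
  # Label each result with the id of the document it came from
  names(hits) <- vapply(idx, function(i) as.character(meta(corpus[[i]], "id")),
                        character(1))
  hits
}

# Toy corpus standing in for the PDFs (assumption: three short documents)
toy <- VCorpus(VectorSource(c(
  "set point 18 c\nno match here",
  "nothing relevant",
  "page 18\nhot water at 18 bar\nend"
)))

res <- matching_lines(toy, "18")
print(res)
```

Documents without a match come back as empty character vectors; something like Filter(length, res) should drop them before printing.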
Thank you!
Shawn Way