I'm running tm 0.5 under R 2.9.2 on a MacBook Pro 17" unibody (early 2009, 2.93 GHz, 4 GB RAM). I have a directory of 1697 plain-text files on the Mac that I want to analyze with the tm package. I have read the documents into a corpus, Corpus_3compounds, as follows:

# Assign directory to a character vector
dirName <- "/Volumes/RDR Test Documents/3Compounds/TXT"

# Put the paths of the .txt files in the directory into a vector
Files_3compounds <- dir(dirName,
        full.names = TRUE,
        pattern = "_.*\\.txt",
        ignore.case = TRUE)

# Use that vector to create a DirSource object
Dir_3compounds <- DirSource(dirName,
        pattern = "_.*\\.txt",
        ignore.case = TRUE,
        encoding = "latin1")

# Read the .txt files into a volatile corpus
Corpus_3compounds <- Corpus(Dir_3compounds,
        readerControl = list(reader = readPlain,
                language = "en",
                load = TRUE))

I have the metadata for these text documents in an Excel table, which I have read into Metadata_3compounds as follows:

# Read the metadata into a data frame
Metadata_3compounds <- read.xls("/Volumes/RDR Test Documents/3Compounds/3compounds.xls",
        sheet = 3, verbose = TRUE, pattern = "Document",
        method = "tab", perl = "perl")

Since the metadata and the text documents in the corpus are not in the same order, I have to create an index between the two. Basically, the filename contains the document ID.

# Index of the metadata for a document in the corpus in Metadata_3compounds
iMyMetadata <- match(gsub("^(.*)/_(.*)\\.txt$", "\\2",
                Files_3compounds, perl = TRUE),
        Metadata_3compounds$Document.No)
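For what it's worth, I also run a quick sanity check on this index before using it (an NA from match() would mean a filename whose document ID does not appear in the metadata table):

# Sanity check: stop if any filename's document ID was not found
# in Metadata_3compounds$Document.No
stopifnot(!any(is.na(iMyMetadata)))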

The metadata data frame has the following names:

 [1] "Document.No"               ...
 [5] ...
 [9] "total"                     "SET"
     "CAT1"                      "CAT2"
[13] "Title"                     "Approved.By"
     "Author.s."                 "Center"
[17] "Comment"                   "Date.Approved"
     "Date.Submitted"            "Department"
[21] "Division"                  "Document.Class"
     "Document.Date"             "Document.No.1"
[25] "Language"                  "Pages"
     "Project.ID..Theme.Number." "Rapid.Document"
[29] "Report.No"                 "Study.Protocol.No"
     "Submitted.By"              "Substance.ID"

Now I want to assign this metadata to the local metadata of the documents in the corpus, for example as follows:

# Transfer metadata to local
meta(Corpus_3compounds, type = "local", tag = "DocId") <- Metadata_3compounds$Document.No[iMyMetadata]

I let this statement run for more than twenty minutes before deciding to stop it; I simply cannot imagine that it should take anywhere near that long. If I instead assign the same vector to the indexed metadata of the corpus, it finishes in little more than the blink of an eye. When I limit the number of documents to five, I can verify that the code is correct.

QUESTIONS: Is it normal for this operation to take so long on a corpus of 1697 documents? Is there a quicker way of accomplishing the same thing? I really do want to store the metadata with the document, i.e., as local metadata. I am uncertain about the advantages, but I would think that if I delete or filter out a document, its metadata is deleted or filtered along with it. Furthermore, when I cluster the documents or train a machine learner on them, I could imagine (though I do not know for sure) that it might be easier to use local metadata as a feature, whereas that might not be so easy with indexed metadata.
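One workaround I am considering, though I have not yet timed it on the full corpus, is to skip the whole-corpus replacement function and assign the tag one document at a time (this is only a sketch, and I do not know whether it is actually faster in tm 0.5):

# Sketch (not yet timed on the full corpus): assign the DocId tag
# to each document individually rather than via the whole-corpus
# meta()<- replacement
docIds <- Metadata_3compounds$Document.No[iMyMetadata]
for (j in seq_along(Corpus_3compounds)) {
        meta(Corpus_3compounds[[j]], tag = "DocId") <- docIds[j]
}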

Regards,
Richard Liu


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.