I'm running tm 0.5 under R 2.9.2 on a MacBook Pro 17" unibody (early 2009, 2.93 GHz, 4 GB RAM). I have a directory of 1697 plain-text files on the Mac that I want to analyze with the tm package. I have read the documents into a corpus, Corpus_3compounds, as follows:

# Assign directory to a character vector
dirName <- "/Volumes/RDR Test Documents/3Compounds/TXT"

# Put the paths of the .txt files in the directory into a vector
Files_3compounds <- dir(dirName,
        full.names = TRUE,
        pattern = "_.*\\.txt",
        ignore.case = TRUE)

# Use that vector to create a DirSource object
Dir_3compounds <- DirSource(dirName,
        pattern = "_.*\\.txt",
        ignore.case = TRUE,
        encoding = "latin1")

# Read the .txt files into a volatile corpus
Corpus_3compounds <- Corpus(Dir_3compounds,
        readerControl = list(reader = readPlain,
                language = "en",
                load = TRUE))

I have the metadata for these text documents in an Excel table, which I have read into Metadata_3compounds as follows:

# Read the metadata into a data frame
Metadata_3compounds <- read.xls("/Volumes/RDR Test Documents/3Compounds/3compounds.xls",
        sheet = 3, verbose = TRUE, pattern = "Document",
        method = "tab", perl = "perl")

Since the metadata and the text documents in the corpus are not in the same order, I have to create an index between the two. Basically, the filename contains the document ID.

# Index of the metadata for a document in the corpus in Metadata_3compounds
iMyMetadata <- match(gsub("^(.*)/_(.*)\\.txt$", "\\2",
                Files_3compounds, perl = TRUE),
        Metadata_3compounds$Document.No)
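For what it's worth, I also run a quick sanity check on this index before using it (an NA from match() would mean a filename whose document ID does not appear in the metadata table):

# Sanity check: stop if any filename's document ID was not found
# in Metadata_3compounds$Document.No
stopifnot(!any(is.na(iMyMetadata)))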

The metadata data frame has the following names:

 [1] "Document.No"               ...
 [5] ...
 [9] "total"                     "SET"
     "CAT1"                      "CAT2"
[13] "Title"                     "Approved.By"
     "Author.s."                 "Center"
[17] "Comment"                   "Date.Approved"
     "Date.Submitted"            "Department"
[21] "Division"                  "Document.Class"
     "Document.Date"             "Document.No.1"
[25] "Language"                  "Pages"
     "Project.ID..Theme.Number." "Rapid.Document"
[29] "Report.No"                 "Study.Protocol.No"
     "Submitted.By"              "Substance.ID"

Now I want to assign this metadata to the local metadata of the documents in the corpus, for example as follows:

# Transfer metadata to local
meta(Corpus_3compounds, type = "local", tag = "DocId") <- Metadata_3compounds$Document.No[iMyMetadata]

I let this statement run for more than twenty minutes before deciding to stop it; I simply cannot imagine that it should take anywhere near that long. If I instead assign the same vector to the indexed metadata of the corpus, it finishes in little more than the blink of an eye. When I limit the number of documents to five, I can verify that the code is correct.

QUESTIONS: Is it normal for this operation to take so long on a corpus of 1697 documents? Is there a quicker way of accomplishing the same thing? I really do want to store the metadata with the document, i.e., as local metadata. I am uncertain about the advantages, but I would think that if I delete or filter out a document, its metadata is deleted or filtered along with it. Furthermore, when I cluster the documents or train a machine learner on them, I could imagine (though I do not know for sure) that it might be easier to use local metadata as a feature, whereas that might not be so easy with indexed metadata.
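One workaround I am considering, though I have not yet timed it on the full corpus, is to skip the whole-corpus replacement function and assign the tag one document at a time (this is only a sketch, and I do not know whether it is actually faster in tm 0.5):

# Sketch (not yet timed on the full corpus): assign the DocId tag
# to each document individually rather than via the whole-corpus
# meta()<- replacement
docIds <- Metadata_3compounds$Document.No[iMyMetadata]
for (j in seq_along(Corpus_3compounds)) {
        meta(Corpus_3compounds[[j]], tag = "DocId") <- docIds[j]
}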

Regards,
Richard Liu


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.