Hello,

I tried to use CSVSource with the TextDocCol function in the tm package, but (a) data from several columns is concatenated into one entry and (b) data in a large text column is broken into several entries. I had hoped it would be possible to assign columns as metadata to each entry, with one specific column holding the original text to analyze.
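To make the intent concrete, here is a minimal sketch in base R (not the tm API) of the structure I am after: one document per row, with one column as the text and the remaining columns as metadata. The column names (id, author, abstract) and the sample rows are made up for illustration.

```r
# Hypothetical data: one text column ("abstract") plus metadata columns.
csv_text <- 'id,author,abstract
1,"Smith","First abstract, with a comma inside the quoted field."
2,"Jones","Second abstract without any comma."'

# read.csv parses quoted fields, so a comma inside quotes stays in one cell.
df <- read.csv(textConnection(csv_text), stringsAsFactors = FALSE)

# One document per row: the chosen column is the text, the rest is metadata.
text_col <- "abstract"
docs <- df[[text_col]]
meta <- df[, setdiff(names(df), text_col), drop = FALSE]

print(docs)
print(meta)
```

Here docs is a character vector with one element per row, and meta is a data frame with the matching rows, which is roughly what I would like to end up with inside the collection.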
Here is an example from the vignette (the backslashes in the output are not in the original data):

> cars <- system.file("texts", "cars.csv", package = "tm")
> tdc <- TextDocCol(CSVSource(cars))
Read 5 items
> inspect(tdc)
A text document collection with 5 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
[1] "1997,\"Ford\",\"Mustang\",\"3000.00\""

[[2]]
[1] "1999,\"Chevy\",\"Venture\",4900.00"

[[3]]
[1] "1996,\"Chrylser\",\"Cherokee\",\"4799.00\""

[[4]]
[1] "2005,\"Ferrari\",\"Modena\",\"80999.00\""

[[5]]
[1] "1973,\"Tank\",\"\",\"9900.00\""

I also have a question about the best workflow for text mining/analysis: my original data is in a MySQL table. Is it possible to import the data directly into TextDocCol without creating an intermediate CSV file?

I am using

> R.Version()
$platform
[1] "powerpc-apple-darwin8.10.1"

$arch
[1] "powerpc"

$os
[1] "darwin8.10.1"

$system
[1] "powerpc, darwin8.10.1"

$status
[1] ""

$major
[1] "2"

$minor
[1] "6.1"

$year
[1] "2007"

$month
[1] "11"

$day
[1] "26"

$`svn rev`
[1] "43537"

$language
[1] "R"

$version.string
[1] "R version 2.6.1 (2007-11-26)"

--
Armin Goralczyk, M.D.
--
Universitätsmedizin Göttingen
Abteilung Allgemein- und Viszeralchirurgie
Rudolf-Koch-Str. 40
39099 Göttingen
--
Dept. of General Surgery
University of Göttingen
Göttingen, Germany
--
http://www.gwdg.de/~agoralc