Hello,

I have a series of newspaper articles exported from a Canadian newspaper database (Canadian Newsstand); each one looks like the sample below.
I've read through this vignette (http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf) about creating a custom reader to extract metadata, but I can't see how to apply it to a plain-text document rather than to the tabular format used in the vignette. As you can see, each document contains all kinds of valuable information: author, page number, publication year, section, publication title, and so on.

Can anyone offer some suggestions, for someone unfamiliar with the tm package, on how to go about creating a custom reader for this situation?

Yours truly,
Simon Kiss

____________________________________________________________

Document 1 of 40

First Nation agrees not to block trains

Author: SHAWN BERRY Legislature Bureau

Publication info: Daily Gleaner [Fredericton, N.B] 07 Jan 2013: A.3.

http://remote.libproxy.wlu.ca/login?url=http://search.proquest.com/docview/1266701269?accountid=15090

Abstract: Participants are also concerned about Chief Theresa Spence who stopped eating solid food on Dec. 11 in a bid to secure a meeting between First Nations leaders, Prime Minister Stephen Harper and Gov. Gen. David Johnston to discuss the treaty relationship.

Links: null

Full Text: A bunch of text about a story here

Subject: Railroads; Native North Americans; Meetings; Injunctions
Title: First Nation agrees not to block trains
Publication title: Daily Gleaner
First page: A.3
Publication year: 2013
Publication date: Jan 7, 2013
Year: 2013
Section: Main
Publisher: Infomart, a division of Postmedia Network Inc.
Place of publication: Fredericton, N.B.
Country of publication: Canada
Journal subject: GENERAL INTEREST PERIODICALS--UNITED STATES
ISSN: 08216983
Source type: Newspapers
Language of publication: English
Document type: News
ProQuest document ID: 1266701269
Document URL: http://remote.libproxy.wlu.ca/login?url=http://search.proquest.com/docview/1266701269?accountid=15090
Copyright: (Copyright (c) 2013 The Daily Gleaner (Fredericton))
Last updated: 2013-01-07
Database: Canadian Newsstand Complete

*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada N3T 2C9
Cell: +1 905 746 7606
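
For illustration only, the record layout in the sample above can be split mechanically. What follows is a minimal, language-neutral sketch written in Python; it does not use tm and is not a tm reader. It assumes that every record begins with a line of underscores and that each metadata field sits on its own line as "Label: value". The names FIELDS, FIELD_RE, SEPARATOR_RE and parse_records are hypothetical, and the field labels are simply copied from the one sample record, so a real export may contain others.

```python
import re

# Field labels copied from the single sample record; this list is an
# assumption and may be incomplete for other exports.
FIELDS = (
    "Author", "Publication info", "Abstract", "Links", "Full Text",
    "Subject", "Title", "Publication title", "First page",
    "Publication year", "Publication date", "Year", "Section",
    "Publisher", "Place of publication", "Country of publication",
    "Journal subject", "ISSN", "Source type", "Language of publication",
    "Document type", "ProQuest document ID", "Document URL",
    "Copyright", "Last updated", "Database",
)

# A line counts as a field only if it starts with one of the known labels.
FIELD_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(re.escape(f) for f in FIELDS))

# Assumed record delimiter: a line made up of ten or more underscores.
SEPARATOR_RE = re.compile(r"^_{10,}[ \t]*$", re.MULTILINE)


def parse_records(text):
    """Return one dict of {label: value} per record found in text."""
    records = []
    for chunk in SEPARATOR_RE.split(text):
        meta = {}
        last = None
        for raw in chunk.splitlines():
            line = raw.strip()
            match = FIELD_RE.match(line)
            if match:
                last = match.group(1)
                meta[last] = match.group(2).strip()
            elif last == "Full Text" and line:
                # Unlabeled lines after "Full Text:" are treated as
                # further paragraphs of the article body.
                meta[last] += "\n" + line
        if meta:
            records.append(meta)
    return records


sample = """\
____________________________________________________________
Document 1 of 40
First Nation agrees not to block trains
Author: SHAWN BERRY Legislature Bureau
Publication info: Daily Gleaner [Fredericton, N.B] 07 Jan 2013: A.3.
Full Text: A bunch of text about a story here
Subject: Railroads; Native North Americans; Meetings; Injunctions
Publication year: 2013
Section: Main
"""

records = parse_records(sample)
for label in ("Author", "Publication year", "Section"):
    print(label, "=", records[0][label])
```

Matching against a fixed list of labels, rather than any "word: text" pattern, keeps sentences in the article body that happen to contain a colon from being mistaken for metadata. The trade-off is that labels not in the list are silently ignored, and a body paragraph that begins with one of the listed labels would still be misread. The same logic would have to be ported to an R function before it could serve as a reader for tm.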