Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note. Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>         if i <= 25: print row[8]
>         i = i+1
>
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding an
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific
library you should use the general python list

<http://mail.python.org/mailman/listinfo/python-list>

or a forum dedicated to that library

<http://groups.google.com/group/nltk-users>

If you ask on a general forum you should give some context -- the name of
the library would be the bare minimum.

The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

LIMIT_SIZE = 25  # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that column 9 contains the name of a text file
            # (the corpus reader opens files, so notes stored inline
            # would have to be written to files first) and that
            # columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random
    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))
    files = list(file_to_categories)
    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))

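The question says that each row holds the note text itself, while the corpus reader used in the reply works on files. One possible preparatory step is to write every note to its own text file and build the file-to-categories mapping at the same time. The sketch below is a rough illustration only, written for Python 3 (unlike the Python 2 code in the reply). The function name notes_to_corpus_files, the note_NNNNN.txt naming scheme, the output directory, and the column positions are all assumptions made for the example; none of them come from the original post or from nltk.

```python
import csv
import os


def notes_to_corpus_files(csv_path, out_dir, note_col=8, category_cols=(9,)):
    """Write each note to its own file; return {file_name: [categories]}.

    note_col and category_cols are hypothetical column positions
    (0-based); adjust them to the real layout of the pipe-delimited file.
    """
    os.makedirs(out_dir, exist_ok=True)
    cat_map = {}
    with open(csv_path, newline="", encoding="utf-8") as infile:
        for i, row in enumerate(csv.reader(infile, delimiter="|")):
            # One file per row, named after the row number (assumed scheme).
            file_name = "note_{:05d}.txt".format(i)
            out_path = os.path.join(out_dir, file_name)
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(row[note_col])
            # Collect the category values for this note.
            cat_map[file_name] = [row[c] for c in category_cols]
    return cat_map
```

The returned dict has the same shape as file_to_categories in the reply, so it could presumably be passed as cat_map, with out_dir as the corpus root and the dict keys as the file list. That last step is untested here and should be checked against the nltk documentation.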