[Tutor] Adding to a CSV file?
Hi,

I'm learning Python so I can take advantage of the really cool stuff in the Natural Language Toolkit. But I'm having problems with some basic file manipulation stuff.

My basic question: How do I read data in from a csv, manipulate it, and then add it back to the csv in new columns (keeping the manipulated data in the "right row")?

Here's an example of what my data looks like ("test-8-29-10.csv"):

    MyWord      Category  Ct     CatCt
    !           A         2932   456454
    !           B         2109   64451
    a           C         7856   9
    a           A         19911  456454
    abnormal    C         174    9
    abnormally  D         5      7
    cats        E         1999   886454
    cat         B         160    64451

# I want to read in the MyWord for each row and then do some stuff to it and add in some new columns. Specifically, I want to "lemmatize" and "stem", which basically means I'll turn "abnormally" into "abnormal" and "cats" into "cat".

    import nltk
    wnl=nltk.WordNetLemmatizer()
    porter=nltk.PorterStemmer()
    text=nltk.word_tokenize(TheStuffInMyWordColumn)
    textlemmatized=[wnl.lemmatize(t) for t in text]
    textPort=[porter.stem(t) for t in text]

# This creates the right info, but I don't really want "textlemmatized" and "textPort" to be independent lists, I want them inside the csv in new columns.

# If I didn't want to keep the information in the Category and Counts columns, I would probably do something like this:

    for word in text:
        word2=wnl.lemmatize(word)
        word3=porter.stem(word)
        print word+";"+word2+";"+word3+"\r\n"

# Looking through some of the older discussions about the csv module, I found this code helps identify headers, but I'm still not sure how to use them--or how to word the for-loop that I need correctly so I iterate through each row in the csv file.

    import csv
    f_out.close()
    fp=open(r'c:test-8-29-10.csv', 'r')
    inputfile=csv.DictReader(fp)
    for record in inputfile:
        print record

    {'Category': 'A', 'CatCt': '456454', 'MyWord': '!', 'Ct': '2932'}
    {'Category': 'B', 'CatCt': '64451', 'MyWord': '!', 'Ct': '2109'}
    ...

    fp.close()

# So I feel like I have *some* of the pieces, but I'm just missing a bunch of little connections.
Any and all help would be much appreciated! Tyler ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
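
One way the pieces above could fit together, as a minimal sketch rather than a definitive answer: read each row with csv.DictReader, add the new value to the row's dictionary, and write it out with csv.DictWriter so every value stays in its own row. The sketch is written for Python 3 (the thread itself uses Python 2 print statements). The names add_columns, toy_stem and the "Stem" column are hypothetical; toy_stem is only a stand-in so the example runs without NLTK, and a real script would pass porter.stem or wnl.lemmatize instead. It also assumes the file is comma-separated.

```python
import csv
import io


def toy_stem(word):
    """Hypothetical stand-in for porter.stem: strips a final 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def add_columns(in_file, out_file, transform=toy_stem):
    """Copy every row from in_file to out_file, appending a 'Stem' column."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["Stem"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in reader:
        # The new value is stored on the same record, so it lands in the same row.
        record["Stem"] = transform(record["MyWord"])
        writer.writerow(record)


# In-memory stand-ins for the input and output files.
sample = io.StringIO("MyWord,Category,Ct,CatCt\ncats,E,1999,886454\ncat,B,160,64451\n")
result = io.StringIO()
add_columns(sample, result)
print(result.getvalue())
```

Writing to a second file and renaming it afterwards is usually safer than rewriting the input csv in place.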
Re: [Tutor] Adding to a CSV file?
I checked out the csv module and got a little further along, but still can't quite figure out how to iterate line by line properly.

# This shows that I'm reading the file in correctly:

    input_file=open("test-8-29-10.csv","rb")
    for row in input_file:
        print row

    MyWord,Category,Ct,CatCt
    !,A,2932,456454
    !,B,2109,64451
    a,C,7856,9
    abandoned,A,11,456454

# But when I try to add columns, I'm only filling in some static value. So there's something wrong with my looping.

    testReader=csv.reader(open('test-8-29-10.csv', 'rb'))
    for line in testReader:
        for MyWord, Category, Ct, CatCt in testReader:
            text=nltk.word_tokenize(MyWord)
            word2=wnl.lemmatize(word)
            word3=porter.stem(word)
            print MyWord+","+Category+","+Ct+","+CatCt+","+word+","+word2+","+word3+"\r\n"

    !,A,2932,456454,yrs,yr,yr
    !,B,2109,64451,yrs,yr,yr
    a,C,7856,9,yrs,yr,yr
    abandoned,A,11,456454,yrs,yr,yr
    ...

# I tried adding another loop, but it gives me an error.

    testReader=csv.reader(open('test-8-29-10.csv', 'rb'))
    for line in testReader:
        for MyWord, Category, Ct, CatCt in line: # I thought this line inside the other was clever, but, uh, not so much
            text=nltk.word_tokenize(MyWord)
            word2=wnl.lemmatize(word)
            word3=porter.stem(word)
            print MyWord+","+Category+","+Ct+","+CatCt+","+word+","+word2+","+word3+"\r\n"

    Traceback (most recent call last):
      File "", line 2, in
        for MyWord, Category, Ct, CatCt in line:
    ValueError: too many values to unpack

My hope is that once I can figure out this problem, it'll be easy to write the csv file with the csv module. But I'm stumped about the looping.

Thanks for any suggestions,

Tyler
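
A few things appear to be going on in the loops above. The inner "for ... in testReader" pulls from the same reader as the outer loop, so the two loops compete for rows. The variable "word" is never assigned from the current row, which would explain the constant "yrs" left over from an earlier session. And "for MyWord, Category, Ct, CatCt in line" walks over the cells of one row, so Python tries to unpack the six characters of the string 'MyWord' into four names. A single loop that unpacks each row is probably enough. The sketch below is written for Python 3, uses an in-memory file, and takes any function as the transform; rows_with_extra and the "Transformed" header are hypothetical names.

```python
import csv
import io


def rows_with_extra(in_file, transform):
    """Return the header plus every data row with one transformed column appended."""
    reader = csv.reader(in_file)
    header = next(reader)  # consume the header row so it is not processed as data
    out = [header + ["Transformed"]]
    # One loop only: each row is a list of four strings, unpacked directly.
    for my_word, category, ct, cat_ct in reader:
        out.append([my_word, category, ct, cat_ct, transform(my_word)])
    return out


data = io.StringIO("MyWord,Category,Ct,CatCt\n!,A,2932,456454\nabandoned,A,11,456454\n")
rows = rows_with_extra(data, str.upper)  # str.upper stands in for a stemmer
for row in rows:
    print(",".join(row))
```

Passing the transform as an argument keeps the looping logic separate from NLTK, so porter.stem could be dropped in without touching the loop.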
[Tutor] If/elif/else when a list is empty
Hi,

I'm parsing IMDB movie reviews (each movie is in its own text file). In my script, I'm trying to extract genre information. Movies have up to three categories of genres--but not all have a "genre" tag, and that fact is making my script abort whenever it encounters a movie text file that doesn't have a "genre" tag.

I thought the following should solve it, but it doesn't. The basic question is how I say "if genre information doesn't exist at all, just make rg1=rg2=rg3="NA""?

    rgenre = re.split(r';', rf.info["genre"]) # When movies have genre information they store it as Drama;Western;Thriller

    if len(rgenre)>0:
        if len(rgenre)>2:
            rg1=rgenre[0]
            rg2=rgenre[1]
            rg3=rgenre[2]
        elif len(rgenre)==2:
            rg1=rgenre[0]
            rg2=rgenre[1]
            rg3="NA"
        elif len(rgenre)==1:
            rg1=rgenre[0]
            rg2="NA"
            rg3="NA"
    else len(rgenre)<1: # I was hoping this would take care of the "there is no genre information" scenario but it doesn't
        rg1=rg2=rg3="NA"

This probably does a weird nesting thing, but even the simpler versions I have tried don't work.

Thanks very much for any help!

Tyler
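
Two details in the chain above are worth noting: "else" cannot take a condition, so "else len(rgenre)<1:" is a syntax error, and re.split on an empty string still returns a one-element list, so a length below 1 would not occur even when the tag is empty. As a minimal sketch, the whole chain could collapse into one helper that truncates and pads a list to a fixed size. The name pad_genres and its parameters are hypothetical; the code runs on Python 3.

```python
def pad_genres(genres, size=3, filler="NA"):
    """Return exactly `size` items: the first genres, then filler values."""
    trimmed = list(genres)[:size]  # never more than `size` entries
    return trimmed + [filler] * (size - len(trimmed))


rg1, rg2, rg3 = pad_genres("Drama;Western".split(";"))
print(rg1, rg2, rg3)
```

This only covers the padding; the case where the "genre" key is missing altogether has to be handled before the split.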
Re: [Tutor] If/elif/else when a list is empty
Hi Vince,

Thanks very much for the one-line version--unfortunately, I still get errors. The overall script runs over every text file in a directory, but as soon as it hits a text file without a tag, it gives this error:

    Traceback (most recent call last):
      File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 168, in
        main(".","output.csv")
      File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 166, in main
        os.path.walk(top_level_dir, reviewDirectory, writer )
      File "C:\Python26\lib\ntpath.py", line 259, in walk
        func(arg, top, names)
      File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 162, in reviewDirectory
        reviewFile( dirname+'/'+fileName, args )
      File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 74, in reviewFile
        rgenre = re.split(r';', rf.info["genre"])
    KeyError: 'genre'

I'm about to give what may be too much information--I really thought there must be a way to say "don't choke if you don't find any rgenres because rf.info["genre"] was empty". But maybe I need to define the "None" condition earlier?

Basically a text file has this structure:

    High Noon
    Drama;Western # But this tag doesn't exist for all text files
    # etc

    u493498
    9 out of 10
    A great flick blah blah blah
    # etc

    # next review--all about the movie featured in the info tags

-----Original Message-----
From: Vince Spicer
To: aenea...@priest.com
Cc: tutor@python.org
Sent: Mon, Sep 13, 2010 9:08 pm
Subject: Re: [Tutor] If/elif/else when a list is empty

On Mon, Sep 13, 2010 at 9:58 PM, wrote:
> [snip]

Hey Tyler, you can simplify this with a one-liner.

    rg1, rg2, rg3 = rgenre + ["NA"]*(3-len(rgenre[:3]))

Hope that helps, if you have any questions feel free to ask.

Vince
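
The traceback shows the KeyError coming from rf.info["genre"] itself, so it is raised before any padding code gets a chance to run. A minimal sketch of one way around it, assuming rf.info behaves like an ordinary dictionary (the thread does not show its class): look the key up with .get and a default, then truncate and pad. The function name split_genres is hypothetical. Note that the one-liner as posted pads but does not truncate the left-hand list, so a movie with four genres would still fail to unpack; the sketch slices first.

```python
def split_genres(info, size=3, filler="NA"):
    """Return `size` genre slots from a dict-like `info`, tolerating a missing key."""
    raw = info.get("genre", "")  # "" instead of a KeyError when the tag is absent
    genres = [g for g in raw.split(";") if g][:size]  # drop empty pieces, then truncate
    return genres + [filler] * (size - len(genres))


print(split_genres({"title": "High Noon", "genre": "Drama;Western"}))
print(split_genres({"title": "No genre tag here"}))
```

A try/except KeyError around the lookup would work as well; .get just keeps it to one line.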
[Tutor] Can't process all my files (need to close?)
My Python script needs to process 45,000 files, but it seems to blow up after about 10,000. Note that I'm outputting bazillions of rows to a csv, so that may be part of the issue.

Here's the error I get (I'm running it through IDLE on Windows 7):

    Microsoft Visual C++ Runtime Library
    Runtime Error!
    Program: C:\Python26\pythonw.exe
    This application has requested the Runtime to terminate it in an unusual way.

I think this might be because I don't specifically close the files I'm reading. Except that I'm not quite sure where to put the close. I have 3 places where I would think it might work, but I'm not sure which one works or how exactly to do the closing (what it is I append ".close()" to).

1) During the self.string here:

    class ReviewFile:
        # In our movie corpus, each movie is one text file. That means that each text file has some "info" about the movie (genre, director, name, etc), followed by a bunch of reviews. This class extracts the relevant information about the movie, which is then attached to review-specific information.
        def __init__(self, filename):
            self.filename = filename
            self.string = codecs.open(filename, "r", "utf8").read()
            self.info = self.get_fields(self.get_field(self.string, "info")[0])
            review_strings = self.get_field(self.string, "review")
            review_dicts = map(self.get_fields, review_strings)
            self.reviews = map(Review, review_dicts)

2) Maybe here?

    def reviewFile ( file, args):
        for file in glob.iglob("*.txt"):
            print " Reviewing" + file
            rf = ReviewFile(file)

3) Or maybe here?

    def reviewDirectory ( args, dirname, filenames ):
        print 'Directory',dirname
        for fileName in filenames:
            reviewFile( dirname+'/'+fileName, args )

    def main(top_level_dir,csv_out_file_name):
        csv_out_file = open(str(csv_out_file_name), "wb")
        writer = csv.writer(csv_out_file, delimiter=',')
        os.path.walk(top_level_dir, reviewDirectory, writer )

    main(".","output.csv")

Thanks very much for any help!

Tyler
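
Of the three candidate places, the first is where the file object is actually created, so that is the only spot with something to close. Whether unclosed files are really what triggers the runtime error is not established by the message. As a minimal sketch, a small helper with a "with" block closes the file as soon as the text has been read. The helper name read_text is hypothetical, and io.open is swapped in for codecs.open here because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def read_text(filename, encoding="utf8"):
    """Read a whole text file; the with-block closes it even if read() fails."""
    with io.open(filename, "r", encoding=encoding) as handle:
        return handle.read()


# Demo on a throwaway file so the sketch is self-contained.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf8") as handle:
    handle.write(u"High Noon")
demo_text = read_text(demo_path)
os.remove(demo_path)
print(demo_text)
```

In the class above, the assignment would then read self.string = read_text(filename). The csv output file opened in main is never closed either, so the same pattern may be worth applying there.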
[Tutor] Getting total counts
Hi,

I have created a csv file that lists how often each word in the Internet Movie Database occurs with different star-ratings and in different genres. The input file looks something like this--since movies can have multiple genres, there are three genre columns. (This is fake, simplified data.)

    ID    | Genre1  | Genre2   | Genre3    | Star-rating | Word | Count
    film1 | Drama   | Thriller | Western   | 1           | the  | 20
    film2 | Comedy  | Musical  | NA        | 2           | the  | 20
    film3 | Musical | History  | Biography | 1           | the  | 20
    film4 | Drama   | Thriller | Western   | 1           | the  | 10
    film5 | Drama   | Thriller | Western   | 9           | the  | 20

I can get the program to tell me how many occurrences of "the" there are in Thrillers (50), how many "the"'s in 1-stars (50), and how many 1-star drama "the"'s there are (30). But I need to be able to expand beyond a particular word and say "how many words total are in "Drama"? How many total words are in 1-star ratings? How many words are there in the whole corpus?" On these all-word totals, I'm stumped.

What I've done so far: I used shelve() to store my input csv in a database format. Here's how I get count information so far:

    def get_word_count(word, db, genre=None, rating=None):
        c = 0
        vals = db[word]
        for val in vals:
            if not genre and not rating:
                c += val['count']
            elif genre and not rating:
                if genre in val['genres']:
                    c += val['count']
            elif rating and not genre:
                if rating == val['rating']:
                    c += val['count']
            else:
                if rating == val['rating'] and genre in val['genres']:
                    c += val['count']
        return c

(I think there's something a little wrong with the rating stuff, here, but this code generally works and produces the right counts.)

With "get_word_count" I can do stuff like this to figure out how many times "the" appears in a particular genre.

    vals=db[word]
    for val in vals:
        genre_ct_for_word = get_word_count(word, db, genre, rating=None)
    return genre_ct_for_word

I've tried to extend this thinking to get TOTAL genre/rating counts for all words, but it doesn't work. I get a type error saying that string indices must be integers. I'm not sure how to overcome this.

    # Doesn't work:
    def get_full_rating_count(db, rating=None):
        full_rating_ct = 0
        vals = db
        for val in vals:
            if not rating:
                full_rating_ct += val['count']
            elif rating == val['rating']:
                if rating == val['rating']: # Um, I know this looks dumb, but in the other code it seems to be necessary for things to work.
                    full_rating_ct += val['count']
        return full_rating_ct

Can anyone suggest how to do this? Thanks!

Tyler

Background for the curious: What I really want to know is which words are over- or under-represented in different Genre x Rating categories. "The" should be flat, but something like "wow" should be over-represented in 1-star and 10-star ratings and under-represented in 5-star ratings. Something like "gross" may be over-represented in low-star ratings for romances, but if grossness is a good thing in horror movies, then we'll see "gross" over-represented in HIGH-star ratings for horror.

To figure out over-representation and under-representation I need to compare "observed" counts to "expected" counts. The expected counts are probabilities, and they require me to understand how many words I have in the whole corpus, how many words in each rating category, and how many words in each genre category.
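
The "string indices must be integers" error is consistent with looping directly over db: iterating a dictionary or shelf yields its keys (the words, which are strings), so val['count'] ends up indexing a string. A minimal sketch of an all-words total, assuming each db[word] is a list of records with "genres", "rating" and "count" entries as in get_word_count; the function name get_total_count and the toy data are hypothetical. It runs on Python 3 and uses a plain dict in place of the shelf.

```python
def get_total_count(db, genre=None, rating=None):
    """Sum counts over every word, optionally restricted by genre and/or rating."""
    total = 0
    for word in db:           # iterating the mapping yields keys, i.e. the words
        for val in db[word]:  # each val is one record for that word
            if genre is not None and genre not in val["genres"]:
                continue
            if rating is not None and rating != val["rating"]:
                continue
            total += val["count"]
    return total


# Toy data standing in for the shelve database.
toy_db = {
    "the": [
        {"genres": {"Drama": True, "Thriller": True, "Western": True}, "rating": 1, "count": 20},
        {"genres": {"Comedy": True, "Musical": True, "NA": True}, "rating": 2, "count": 20},
    ],
    "wow": [
        {"genres": {"Drama": True}, "rating": 1, "count": 5},
    ],
}

print(get_total_count(toy_db))                 # whole corpus
print(get_total_count(toy_db, genre="Drama"))  # all words in Drama
print(get_total_count(toy_db, rating=1))       # all words in 1-star reviews
```

Comparing against None rather than using "not rating" avoids treating a rating of 0 as "no rating given".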
Re: [Tutor] Getting total counts (Steven D'Aprano)
Thanks very much for the extensive comments, Steve. I can get the code you wrote to work on my toy data, but my real input data is actually contained in 10 files that are about 1.5 GB each--when I try to run the code on one of those files, everything freezes.

To solve this, I tried just having the data write to a different csv file:

    lines = csv.reader(file(src_filename))
    csv_writer = csv.writer(file(output_filename, 'w'))
    for line in lines:
        doc, g1, g2, g3, rating, ratingmax, reviewer, helpful, h_total, word, count = line
        row = [add_word(g1, word, count), add_word(g2, word, count), add_word(g3, word, count)]
        csv_writer.writerow(row)

This doesn't work--I think there are problems in how the iterations happen. But my guess is that converting from one CSV to another isn't going to be as efficient as creating a shelve database.

I have some code that works to create a db when I release it on a small subset of my data, but when I try to turn one of the 1.5 GB files into a db, it can't do it. I don't understand why it works for small data and not big (it makes sense to me that your table approach might choke on big amounts of data--but why the shelve() code below?)

I think these are the big things I'm trying to get the code to do:

- Get my giant CSV files into a useful format, probably a db (can do for small amounts of data, but not large)
- Extract genre and star-rating information about particular words from the db (I seem to be able to do this)
- Get total counts for all words in each genre, and for all words in each star-rating category (your table approach works on small data, but I can't get it to scale)

    def csv2shelve(src_filename, shelve_filename):
        # I open the shelve file for writing.
        if os.path.exists(shelve_filename):
            os.remove(shelve_filename)
        # I create the shelve db.
        db = shelve.open(shelve_filename, writeback=True) # The writeback stuff is a little confusing in the help pages, maybe this is a problem?
        # I open the src file.
        lines = csv.reader(file(src_filename))
        for line in lines:
            doc, g1, g2, g3, rating, word, count = line
            if word not in db:
                db[word] = []
            try:
                rating = int(rating)
            except:
                pass
            db[word].append({
                "genres":{g1:True, g2:True, g3:True},
                "rating":rating,
                "count":int(count)
                })
        db.close()

Thanks again, Steve. (And everyone/anyone else.)

Tyler

-----Original Message-----
From: tutor-requ...@python.org
To: tutor@python.org
Sent: Sat, Oct 2, 2010 1:36 am
Subject: Tutor Digest, Vol 80, Issue 10

Today's Topics:

1. Re: (de)serialization questions (Lee Harr)
2. Re: regexp: a bit lost (Steven D'Aprano)
3. Re: regexp: a bit lost (Alex Hall)
4. Re: (de)serialization questions (Alan Gauld)
5. Re: Getting total counts (Steven D'Aprano)
6. data question (Roelof Wobben)

Message: 1
Date: Sat, 2 Oct 2010 03:26:21 +0430
From: Lee Harr
Subject: Re: [Tutor] (de)serialization questions

>> I have data about zip codes, street and city names (and perhaps later also of
>> street numbers). I made a dictionary of the form {zipcode: (street, city)}
>
> One dictionary with all of the data?
>
> That does not seem like it will work. What happens when
> 2 addresses have the same zip code?

You did not answer this question. Did you think about it?

Maybe my main question is as follows: what permanent object is most suitable to store a large amount of entries (maybe too many to fit into the computer's memory), which can be looked up very fast.
One thing about Python is that you don't normally need to think about how your objects are stored (memory management).

It's an advantage in the normal case -- you just use the most convenient object, and if it's fast enough and small enough you're good to go.

Of course, that means that if it is not fast enough, or not small enough, then you've got a bit more work to do.

Eventually, I want to create two objects: 1-one to look up street name and city using zip code

So... you want to have a function like:

    def addresses_by_zip(zipcode):
        '''returns list of all addresses in the given zipcode'''

2-one to look up zip code using street name, apartment number and city

and another one like:

    def zip_by_address(street_name, apt, city):
        '''returns the zipcode for the given street
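
Returning to the 1.5 GB csv files at the top of this message: with writeback=True, shelve keeps every entry that has been accessed cached in memory until sync() or close(), which may be why csv2shelve copes with small inputs but not large ones. If only the per-genre and per-rating totals are needed, one option is to skip the database and accumulate the totals while streaming through the file one row at a time. The sketch below is written for Python 3 and assumes the seven-column layout unpacked in csv2shelve; stream_totals is a hypothetical name, and collections.Counter requires Python 2.7 or later.

```python
import csv
import io
from collections import Counter


def stream_totals(lines):
    """Accumulate word totals per genre and per rating in a single pass."""
    genre_totals = Counter()
    rating_totals = Counter()
    grand_total = 0
    for doc, g1, g2, g3, rating, word, count in csv.reader(lines):
        n = int(count)
        grand_total += n
        rating_totals[rating] += n
        # A set avoids double counting if a genre repeats; "NA" marks an empty slot.
        for genre in set((g1, g2, g3)) - set(["NA"]):
            genre_totals[genre] += n
    return genre_totals, rating_totals, grand_total


sample = io.StringIO(
    "film1,Drama,Thriller,Western,1,the,20\n"
    "film2,Comedy,Musical,NA,2,the,20\n"
    "film4,Drama,Thriller,Western,1,the,10\n"
)
genre_totals, rating_totals, grand_total = stream_totals(sample)
print(genre_totals["Drama"], rating_totals["1"], grand_total)
```

Only the running totals are held in memory, so the size of the input file should matter much less. Ratings stay as strings here because that is how csv.reader delivers them.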
[Tutor] If/else in Python 2.6.5 vs. Python 2.4.3
Hi,

I have code that works fine when I run it on Python 2.6.5, but I get an "invalid syntax" error in Python 2.4.3. I'm hoping you can help me fix it.

The line in question splits a chunk of semi-colon separated words into separate elements.

    rgenre = re.split(r';', rf.info["genre"]) if "genre" in rf.info else []

I get a syntax error at "if" in 2.4.3.

I tried doing the following but it doesn't work (I don't understand why).

    if "genre" in rf.info:
        rgenre = re.split(r';', rf.info["genre"])
    else:
        []

And I tried this, too:

    if "genre" in rf.info:
        rgenre = re.split(r';', rf.info["genre"])
    if "genre" not in rf.info:
        []

Both of these cause problems later on in the program, specifically "UnboundLocalError: local variable 'rgenre' referenced before assignment", about this line in the code (where each of these is one of the 30 possible genres):

    rg1, rg2, rg3, rg4, rg5, rg6, rg7, rg8, rg9, rg10, rg11, rg12, rg13, rg14, rg15, rg16, rg17, rg18, rg19, rg20, rg21, rg22, rg23, rg24, rg25, rg26, rg27, rg28, rg29, rg30= rgenre + ["NA"]*(30-len(rgenre[:30]))

Thanks for any help--I gave a little extra information, but I'm hoping there's a simple rewrite of the first line I gave that'll work in Python 2.4.3.

Thanks,

Tyler
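
The "X if cond else Y" conditional expression was added in Python 2.5, which is why 2.4.3 rejects it. In both rewrites above, the fallback branch contains a bare [] that is evaluated and thrown away, so rgenre is never assigned when the key is missing, and that matches the UnboundLocalError. A minimal sketch of an if/else statement that assigns in both branches, which should also be valid on 2.4; the function name genres_from_info is hypothetical and a plain dict stands in for rf.info.

```python
import re


def genres_from_info(info):
    """Split the ';'-separated genre string, or return [] when the key is absent."""
    if "genre" in info:
        rgenre = re.split(r";", info["genre"])
    else:
        rgenre = []  # assign the name here as well, not just a bare []
    return rgenre


with_tag = genres_from_info({"genre": "Drama;Western;Thriller"})
without_tag = genres_from_info({"title": "High Noon"})
padded = without_tag[:3] + ["NA"] * (3 - len(without_tag[:3]))
print(with_tag, without_tag, padded)
```

With rgenre always bound, the long padding line further down can stay as it is.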