[Tutor] Adding to a CSV file?

2010-08-29 Thread aeneas24

Hi,
 
I'm learning Python so I can take advantage of the really cool stuff in the 
Natural Language Toolkit. But I'm having problems with some basic file 
manipulation stuff.
 
My basic question: How do I read data in from a csv, manipulate it, and then 
add it back to the csv in new columns (keeping the manipulated data in the 
"right row")?
 
Here's an example of what my data looks like ("test-8-29-10.csv"):
 



MyWord,Category,Ct,CatCt
!,A,2932,456454
!,B,2109,64451
a,C,7856,9
a,A,19911,456454
abnormal,C,174,9
abnormally,D,5,7
cats,E,1999,886454
cat,B,160,64451



 
# I want to read in the MyWord for each row and then do some stuff to it and 
add in some new columns. Specifically, I want to "lemmatize" and "stem", which 
basically means I'll turn "abnormally" into "abnormal" and "cats" into "cat".
 
import nltk
wnl=nltk.WordNetLemmatizer()
porter=nltk.PorterStemmer()
text=nltk.word_tokenize(TheStuffInMyWordColumn)
textlemmatized=[wnl.lemmatize(t) for t in text]
textPort=[porter.stem(t) for t in text]
 
# This creates the right info, but I don't really want "textlemmatized" and 
"textPort" to be independent lists, I want them inside the csv in new columns. 
 
# If I didn't want to keep the information in the Category and Counts columns, 
I would probably do something like this:
 
for word in text:
    word2=wnl.lemmatize(word)
    word3=porter.stem(word)
    print word+";"+word2+";"+word3+"\r\n"
 
# Looking through some of the older discussions about the csv module, I found 
this code helps identify headers, but I'm still not sure how to use them--or 
how to word the for-loop that I need correctly so I iterate through each row in 
the csv file. 
 
f_out.close()
fp=open(r'c:test-8-29-10.csv', 'r')
inputfile=csv.DictReader(fp)
for record in inputfile:
    print record
{'Category': 'A', 'CatCt': '456454', 'MyWord': '!', 'Ct': '2932'}
{'Category': 'B', 'CatCt': '64451', 'MyWord': '!', 'Ct': '2109'}
...
fp.close() 
 
# So I feel like I have *some* of the pieces, but I'm just missing a bunch of 
little connections. Any and all help would be much appreciated!
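
A minimal sketch of one way to connect those pieces: read each row with csv.DictReader, add the new values under new keys, and write the row back out with csv.DictWriter so every derived value stays in its own row. It is written for Python 3 (the thread itself uses Python 2). The column names "Lemma" and "Stem", the function add_columns, and the two toy_ functions are hypothetical stand-ins so the example runs without NLTK.

```python
import csv
import io


def add_columns(in_file, out_file, lemmatize, stem):
    """Copy a CSV, appending Lemma and Stem columns derived from MyWord."""
    reader = csv.DictReader(in_file)
    # Keep the original columns in order and append the two new ones.
    fieldnames = list(reader.fieldnames) + ["Lemma", "Stem"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in reader:
        word = record["MyWord"]
        record["Lemma"] = lemmatize(word)   # new column, same row
        record["Stem"] = stem(word)         # new column, same row
        writer.writerow(record)


# Hypothetical stand-ins for wnl.lemmatize and porter.stem.
def toy_lemmatize(word):
    return word[:-1] if word.endswith("s") else word


def toy_stem(word):
    return word[:-2] if word.endswith("ly") else word


# In-memory files keep the sketch self-contained; real code would use open().
source = io.StringIO(
    "MyWord,Category,Ct,CatCt\ncats,E,1999,886454\nabnormally,D,5,7\n"
)
result = io.StringIO()
add_columns(source, result, toy_lemmatize, toy_stem)
output_text = result.getvalue()
print(output_text)
```

With NLTK installed, wnl.lemmatize and porter.stem could presumably be passed in place of the toy functions. Writing to a second file rather than back into the input avoids overwriting the data while it is still being read.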
 
Tyler
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Adding to a CSV file?

2010-08-30 Thread aeneas24

I checked out the csv module and got a little further along, but still can't 
quite figure out how to iterate line by line properly. 

# This shows that I'm reading the file in correctly:

input_file=open("test-8-29-10.csv","rb")
for row in input_file:
   print row

MyWord,Category,Ct,CatCt
!,A,2932,456454
!,B,2109,64451
a,C,7856,9
abandoned,A,11,456454


# But when I try to add columns, I'm only filling in some static value. So 
there's something wrong with my looping.

testReader=csv.reader(open('test-8-29-10.csv', 'rb'))
for line in testReader:
 for MyWord, Category, Ct, CatCt in testReader:
   text=nltk.word_tokenize(MyWord)
   word2=wnl.lemmatize(word)
   word3=porter.stem(word)
   print MyWord+","+Category+","+Ct+","+CatCt+","+word+","+word2+","+word3+"\r\n"
  
!,A,2932,456454,yrs,yr,yr
!,B,2109,64451,yrs,yr,yr
a,C,7856,9,yrs,yr,yr
abandoned,A,11,456454,yrs,yr,yr
...

# I tried adding another loop, but it gives me an error.

testReader=csv.reader(open('test-8-29-10.csv', 'rb'))
for line in testReader:
   for MyWord, Category, Ct, CatCt in line:  # I thought this line inside the other was clever, but, uh, not so much
      text=nltk.word_tokenize(MyWord)
      word2=wnl.lemmatize(word)
      word3=porter.stem(word)
      print MyWord+","+Category+","+Ct+","+CatCt+","+word+","+word2+","+word3+"\r\n"
  
Traceback (most recent call last):
  File "", line 2, in 
for MyWord, Category, Ct, CatCt in line:
ValueError: too many values to unpack



My hope is that once I can figure out this problem, it'll be easy to write the 
csv file with the csv module. But I'm stumped about the looping.
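
A short sketch of the looping, written for Python 3 and using an in-memory sample in place of the real file. Each row produced by csv.reader is already a list of four strings, so a single loop can unpack it directly. The second attempt fails because "for MyWord, Category, Ct, CatCt in line" tries to unpack each individual string (such as "MyWord", six characters) into four names. In the first attempt the name "word" is never assigned inside the loop, so word2 and word3 are computed from whatever "word" held beforehand, which would explain the static values. The upper-casing step below is only a placeholder for the NLTK calls.

```python
import csv
import io

# Hypothetical sample mirroring the rows shown in the post.
sample = io.StringIO(
    "MyWord,Category,Ct,CatCt\n!,A,2932,456454\ncats,E,1999,886454\n"
)

reader = csv.reader(sample)
header = next(reader)  # consume the header row once, before the loop

rows_out = []
# One loop is enough: each row is a 4-item list, unpacked in the for statement.
for my_word, category, ct, cat_ct in reader:
    # Placeholder transformation; the real script would call
    # wnl.lemmatize(my_word) and porter.stem(my_word) here.
    derived = my_word.upper()
    rows_out.append([my_word, category, ct, cat_ct, derived])

print(rows_out)
```

The key point is that the derived value is computed from the variable bound by the current row, not from a name left over from earlier code.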

Thanks for any suggestions,

Tyler


[Tutor] If/elif/else when a list is empty

2010-09-13 Thread aeneas24

Hi,

I'm parsing IMDB movie reviews (each movie is in its own text file). In my 
script, I'm trying to extract genre information. Movies have up to three 
categories of genres--but not all have a "genre" tag and that fact is making my 
script abort whenever it encounters a movie text file that doesn't have a 
"genre" tag. 

I thought the following should solve it, but it doesn't. The basic question is 
how do I say "if genre information doesn't exist at all, just make rg1=rg2=rg3="NA""?

rgenre = re.split(r';', rf.info["genre"]) # When movies have genre information they store it as Drama;Western;Thriller

if len(rgenre)>0:
   if len(rgenre)>2:
      rg1=rgenre[0]
      rg2=rgenre[1]
      rg3=rgenre[2]
   elif len(rgenre)==2:
      rg1=rgenre[0]
      rg2=rgenre[1]
      rg3="NA"
   elif len(rgenre)==1:
      rg1=rgenre[0]
      rg2="NA"
      rg3="NA"
else: # len(rgenre)<1 -- I was hoping this would take care of the "there is no genre information" scenario but it doesn't
   rg1=rg2=rg3="NA"

This probably does a weird nesting thing, but even the simpler versions I have 
tried don't work. 

Thanks very much for any help!

Tyler
  





Re: [Tutor] If/elif/else when a list is empty

2010-09-13 Thread aeneas24

Hi Vince,

Thanks very much for the one-line version--unfortunately, I still get errors. 
The overall script runs over every text file in a directory, but as soon as it 
hits a text file without a "genre" tag, it gives this error:

Traceback (most recent call last):
  File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 168, in <module>
    main(".","output.csv")
  File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 166, in main
    os.path.walk(top_level_dir, reviewDirectory, writer )
  File "C:\Python26\lib\ntpath.py", line 259, in walk
    func(arg, top, names)
  File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 162, in reviewDirectory
    reviewFile( dirname+'/'+fileName, args )
  File "C:\Users\tylersc\Desktop\Tyler2\Tyler\words_per_review_IMDB_9-13-10.py", line 74, in reviewFile
    rgenre = re.split(r';', rf.info["genre"])
KeyError: 'genre'

I'm about to give what may be too much information--I really thought there must 
be a way to say "don't choke if you don't find any rgenres because 
rf.info["genre"] was empty". But maybe I need to define the "None" condition 
earlier?
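
The KeyError is raised by rf.info["genre"] itself, before any length check can run, so the missing key has to be handled at the lookup. A minimal sketch, assuming rf.info behaves like a dictionary (the KeyError suggests it does); the helper name split_genres and its parameters are hypothetical:

```python
import re


def split_genres(info, slots=3, filler="NA"):
    """Return exactly `slots` genre values, padded with `filler`."""
    raw = info.get("genre", "")  # "" instead of KeyError when the key is absent
    # re.split on an empty string returns [""], so drop empty pieces.
    genres = [g for g in re.split(r";", raw) if g]
    genres = genres[:slots]      # ignore any genres beyond the available slots
    return genres + [filler] * (slots - len(genres))


rg1, rg2, rg3 = split_genres({"name": "High Noon", "genre": "Drama;Western"})
print(rg1, rg2, rg3)
```

dict.get returns the supplied default when the key is missing, which is what lets the same padding logic cover the "no genre at all" case.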

Basically a text file has this structure:

High Noon
Drama;Western # But this tag doesn't exist for all text files
# etc


u493498
9 out of 10
A great flick
blah blah blah
# etc

# next review--all about the movie featured in the info tags






-Original Message-
From: Vince Spicer 
To: aenea...@priest.com
Cc: tutor@python.org
Sent: Mon, Sep 13, 2010 9:08 pm
Subject: Re: [Tutor] If/elif/else when a list is empty





Hey Tyler, you can simplify this with a one-liner.


rg1, rg2, rg3 = rgenre[:3] + ["NA"]*(3-len(rgenre[:3]))


Hope that helps, if you have any questions feel free to ask.


Vince



[Tutor] Can't process all my files (need to close?)

2010-09-20 Thread aeneas24

My Python script needs to process 45,000 files, but it seems to blow up after 
about 10,000. Note that I'm outputting bazillions of rows to a csv, so that may 
be part of the issue.

Here's the error I get (I'm running it through IDLE on Windows 7):

Microsoft Visual C++ Runtime Library
Runtime Error!
Program: C:\Python26\pythonw.exe
This application has requested the Runtime to terminate it in an unusual way. 

I think this might be because I don't specifically close the files I'm reading. 
Except that I'm not quite sure where to put the close. I have 3 places where I 
would think it might work but I'm not sure which one works or how exactly to do 
the closing (what it is I append ".close()" to). 

1) During the self.string here:

class ReviewFile:
    # In our movie corpus, each movie is one text file. That means that each text
    # file has some "info" about the movie (genre, director, name, etc), followed by
    # a bunch of reviews. This class extracts the relevant information about the
    # movie, which is then attached to review-specific information.
    def __init__(self, filename):
        self.filename = filename
        self.string = codecs.open(filename, "r", "utf8").read()
        self.info = self.get_fields(self.get_field(self.string, "info")[0])
        review_strings = self.get_field(self.string, "review")
        review_dicts = map(self.get_fields, review_strings)
        self.reviews = map(Review, review_dicts)

2) Maybe here?
def reviewFile ( file, args):
    for file in glob.iglob("*.txt"):
        print "  Reviewing" + file
        rf = ReviewFile(file)

3) Or maybe here?

def reviewDirectory ( args, dirname, filenames ):
    print 'Directory',dirname
    for fileName in filenames:
        reviewFile( dirname+'/'+fileName, args )

def main(top_level_dir,csv_out_file_name):
    csv_out_file  = open(str(csv_out_file_name), "wb")
    writer = csv.writer(csv_out_file, delimiter=',')
    os.path.walk(top_level_dir, reviewDirectory, writer )

main(".","output.csv")
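
Of the three places, the first is where the file object is actually created, so that is where it can be closed. A minimal sketch using a with block, which closes the file as soon as the read finishes; the helper name read_text and the temporary demo file are hypothetical, and the code is written for Python 3. Whether unclosed files are really what causes the crash is not established here.

```python
import codecs
import os
import tempfile


def read_text(filename):
    """Read a UTF-8 file and close it as soon as the read finishes."""
    with codecs.open(filename, "r", "utf8") as handle:
        return handle.read()  # the file is closed when the with block exits


# Small demonstration with a temporary file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "movie.txt")
with codecs.open(path, "w", "utf8") as out:
    out.write(u"High Noon\n")

text = read_text(path)
print(repr(text))
```

Inside ReviewFile.__init__ this would amount to something like self.string = read_text(filename). The CSV output file opened in main could be wrapped in a with block in the same way, so that buffered rows are flushed and the handle is released when the walk finishes.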

Thanks very much for any help!

Tyler




[Tutor] Getting total counts

2010-10-01 Thread aeneas24

Hi,
 
I have created a csv file that lists how often each word in the Internet Movie 
Database occurs with different star-ratings and in different genres. The input 
file looks something like this--since movies can have multiple genres, there 
are three genre columns. (This is fake, simplified data.)
 
ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count
film1 | Drama | Thriller | Western | 1 | the | 20
film2 | Comedy | Musical | NA | 2 | the | 20
film3 | Musical | History | Biography | 1 | the | 20
film4 | Drama | Thriller | Western | 1 | the | 10
film5 | Drama | Thriller | Western | 9 | the | 20
I can get the program to tell me how many occurrences of "the" there are in 
Thrillers (50), how many "the"s in 1-stars (50), and how many 1-star drama 
"the"s there are (30). But I need to be able to expand beyond a particular 
word and ask: how many words total are in "Drama"? How many total words are in 
1-star ratings? How many words are there in the whole corpus? On these all-word 
totals, I'm stumped. 
 
What I've done so far:
I used the shelve module to store my input csv in a database format. 
 
Here's how I get count information so far:
def get_word_count(word, db, genre=None, rating=None):
    c = 0
    vals = db[word]
    for val in vals:
        if not genre and not rating:
            c += val['count']
        elif genre and not rating:
            if genre in val['genres']:
                c += val['count']
        elif rating and not genre:
            if rating == val['rating']:
                c += val['count']
        else:
            if rating == val['rating'] and genre in val['genres']:
                c += val['count']
    return c
 
(I think there's something a little wrong with the rating stuff, here, but this 
code generally works and produces the right counts.)
 
With "get_word_count" I can do stuff like this to figure out how many times 
"the" appears in a particular genre. 
vals=db[word]
for val in vals:
    genre_ct_for_word = get_word_count(word, db, genre, rating=None)
return genre_ct_for_word
 
I've tried to extend this thinking to get TOTAL genre/rating counts for all 
words, but it doesn't work. I get a type error saying that string indices must 
be integers. I'm not sure how to overcome this.
 
# Doesn't work:
def get_full_rating_count(db, rating=None):
    full_rating_ct = 0
    vals = db
    for val in vals:
        if not rating:
            full_rating_ct += val['count']
        elif rating == val['rating']:
            if rating == val['rating']: # Um, I know this looks dumb, but in the other code it seems to be necessary for things to work.
                full_rating_ct += val['count']
    return full_rating_ct
 
Can anyone suggest how to do this? 
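
Iterating over a shelf (or any mapping) yields its keys, so in the function above each "val" is a word string, and val['count'] is what raises "string indices must be integers". A minimal sketch of an all-words total that loops over the words first and then over each word's records. It is written for Python 3, uses a plain dict as a stand-in for the shelf, and assumes records shaped like the ones get_word_count reads ('genres', 'rating', 'count'). The name get_total_count and the toy data are hypothetical.

```python
def get_total_count(db, genre=None, rating=None):
    """Sum counts over every word, optionally filtered by genre and/or rating."""
    total = 0
    for word in db:              # iterating a mapping yields its keys (the words)
        for val in db[word]:     # each value is a list of record dicts
            if genre is not None and genre not in val["genres"]:
                continue
            if rating is not None and rating != val["rating"]:
                continue
            total += val["count"]
    return total


# Toy data standing in for the shelf; the layout is an assumption.
toy_db = {
    "the": [
        {"genres": {"Drama": True, "Thriller": True, "Western": True},
         "rating": 1, "count": 20},
        {"genres": {"Comedy": True, "Musical": True, "NA": True},
         "rating": 2, "count": 20},
    ],
    "wow": [
        {"genres": {"Drama": True, "Thriller": True, "Western": True},
         "rating": 1, "count": 3},
    ],
}

print(get_total_count(toy_db))                   # whole corpus
print(get_total_count(toy_db, genre="Drama"))    # all words in Drama
print(get_total_count(toy_db, rating=2))         # all words in 2-star reviews
```

Comparing against None rather than testing truthiness means a rating of 0 would still be treated as a real filter value.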
 
Thanks!
 
Tyler
 
 
Background for the curious:
What I really want to know is which words are over- or under-represented in 
different Genre x Rating categories. "The" should be flat, but something like 
"wow" should be over-represented in 1-star and 10-star ratings and 
under-represented in 5-star ratings. Something like "gross" may be 
over-represented in low-star ratings for romances but if grossness is a good 
thing in horror movies, then we'll see "gross" over-represented in HIGH-star 
ratings for horror. 
 
To figure out over-representation and under-representation I need to compare 
"observed" counts to "expected" counts. The expected counts are probabilities 
and they require me to understand how many words I have in the whole corpus and 
how many words in each rating category and how many words in each genre 
category.
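
One common way to turn those totals into an expected count (an assumption here, not necessarily the calculation intended above) is to treat word and category as independent: expected = word_total * category_total / corpus_total. A small sketch in Python 3 with hypothetical numbers and function names:

```python
def expected_count(word_total, category_total, corpus_total):
    """Expected occurrences of a word in a category if the two were independent."""
    return word_total * category_total / corpus_total


def representation_ratio(observed, expected):
    """Above 1 suggests over-representation, below 1 under-representation."""
    return observed / expected


# Hypothetical figures: a word seen 200 times overall, a category holding
# 1,000 of the corpus's 10,000 words, and 30 observed hits in that category.
expected = expected_count(200, 1000, 10000)
ratio = representation_ratio(30, expected)
print(expected, ratio)
```

A ratio near 1 is what a flat word such as "the" would be expected to show under this assumption.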
 
 
 





Re: [Tutor] Getting total counts (Steven D'Aprano)

2010-10-02 Thread aeneas24


Thanks very much for the extensive comments, Steve. I can get the code you 
wrote to work on my toy data, but my real input data is actually contained in 
10 files that are about 1.5 GB each--when I try to run the code on one of those 
files, everything freezes. 

To solve this, I tried just having the data write to a different csv file:

lines = csv.reader(file(src_filename))
csv_writer = csv.writer(file(output_filename, 'w'))
for line in lines:
    doc, g1, g2, g3, rating, ratingmax, reviewer, helpful, h_total, word, count = line
    row = [add_word(g1, word, count), add_word(g2, word, count), add_word(g3, word, count)]
    csv_writer.writerow(row)


This doesn't work--I think there are problems in how the iterations happen. But 
my guess is that converting from one CSV to another isn't going to be as 
efficient as creating a shelve database. I have some code that works to create 
a db when I release it on a small subset of my data, but when I try to turn one 
of the 1.5 GB files into a db, it can't do it. I don't understand why it works 
for small data and not big (it makes sense to me that your table approach might 
choke on big amounts of data--but why the shelve() code below?)

I think these are the big things I'm trying to get the code to do:
- Get my giant CSV files into a useful format, probably a db (can do for small 
amounts of data, but not large)
- Extract genre and star-rating information about particular words from the db 
(I seem to be able to do this)
- Get total counts for all words in each genre, and for all words in each 
star-rating category (your table approach works on small data, but I can't get 
it to scale)

def csv2shelve(src_filename, shelve_filename):
    # I open the shelve file for writing.
    if os.path.exists(shelve_filename):
        os.remove(shelve_filename)
    # I create the shelve db.
    db = shelve.open(shelve_filename, writeback=True) # The writeback stuff is a little confusing in the help pages, maybe this is a problem?
    # I open the src file.
    lines = csv.reader(file(src_filename))
    for line in lines:
        doc, g1, g2, g3, rating, word, count = line
        if word not in db:
            db[word] = []
        try:
            rating = int(rating)
        except:
            pass
        db[word].append({
            "genres":{g1:True, g2:True, g3:True},
            "rating":rating,
            "count":int(count)
        })

    db.close()
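
On the writeback question: with writeback=True, shelve keeps every entry that has been accessed in an in-memory cache and only writes it out on sync() or close(), so on a 1.5 GB input the whole data set can end up in RAM before anything reaches disk. If only the genre, rating, and corpus totals are needed, one pass that keeps running counters may be enough. A minimal sketch in Python 3, assuming the seven-column layout unpacked above, no header row, and "NA" as the filler genre; the name stream_totals is hypothetical.

```python
import csv
import io
from collections import Counter


def stream_totals(rows):
    """Accumulate word totals per genre and per rating in a single pass."""
    genre_totals = Counter()
    rating_totals = Counter()
    grand_total = 0
    for doc, g1, g2, g3, rating, word, count in rows:
        n = int(count)
        grand_total += n
        rating_totals[rating] += n
        # A set avoids double counting if a genre is repeated; "NA" is filler.
        for genre in {g1, g2, g3} - {"NA"}:
            genre_totals[genre] += n
    return genre_totals, rating_totals, grand_total


# In-memory sample standing in for one of the large files.
sample = io.StringIO(
    "film1,Drama,Thriller,Western,1,the,20\n"
    "film2,Comedy,Musical,NA,2,the,20\n"
    "film4,Drama,Thriller,Western,1,the,10\n"
)
genre_totals, rating_totals, grand_total = stream_totals(csv.reader(sample))
print(genre_totals, rating_totals, grand_total)
```

Because csv.reader yields one row at a time, memory use depends on the number of distinct genres and ratings rather than on the size of the file. Ratings stay as strings here, exactly as they appear in the CSV.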

Thanks again, Steve. (And everyone/anyone else.)

Tyler



-Original Message-
From: tutor-requ...@python.org
To: tutor@python.org
Sent: Sat, Oct 2, 2010 1:36 am
Subject: Tutor Digest, Vol 80, Issue 10


Send Tutor mailing list submissions to
   tutor@python.org
To subscribe or unsubscribe via the World Wide Web, visit
   http://mail.python.org/mailman/listinfo/tutor
or, via email, send a message with subject or body 'help' to
   tutor-requ...@python.org
You can reach the person managing the list at
   tutor-ow...@python.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Tutor digest..."

Today's Topics:
   1. Re: (de)serialization questions (Lee Harr)
   2. Re: regexp: a bit lost (Steven D'Aprano)
   3. Re: regexp: a bit lost (Alex Hall)
   4. Re: (de)serialization questions (Alan Gauld)
   5. Re: Getting total counts (Steven D'Aprano)
   6. data question (Roelof Wobben)

-
Message: 1
Date: Sat, 2 Oct 2010 03:26:21 +0430
From: Lee Harr 
To: 
Subject: Re: [Tutor] (de)serialization questions
Message-ID: 
Content-Type: text/plain; charset="windows-1256"

>> I have data about zip codes, street and city names (and perhaps later also
>> of street numbers). I made a dictionary of the form {zipcode: (street, city)}
>
> One dictionary with all of the data?
>
> That does not seem like it will work. What happens when
> 2 addresses have the same zip code?

You did not answer this question.
Did you think about it?

> Maybe my main question is as follows: what permanent object is most suitable
> to store a large amount of entries (maybe too many to fit into the computer's
> memory), which can be looked up very fast.

One thing about Python is that you don't normally need to
think about how your objects are stored (memory management).
It's an advantage in the normal case -- you just use the most
convenient object, and if it's fast enough and small enough
you're good to go.

Of course, that means that if it is not fast enough, or not
small enough, then you've got a bit more work to do.

> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code

So... you want to have a function like:

def addresses_by_zip(zipcode):
    '''returns list of all addresses in the given zipcode'''

> 2-one to look up zip code using street name, apartment number and city

and another one like:

def zip_by_address(street_name, apt, city):
    '''returns the zipcode for the given street 

[Tutor] If/else in Python 2.6.5 vs. Python 2.4.3

2010-10-11 Thread aeneas24

Hi,

I have code that works fine when I run it on Python 2.6.5, but I get an 
"invalid syntax" error in Python 2.4.3. I'm hoping you can help me fix it.

The line in question splits a chunk of semi-colon separated words into separate 
elements.

rgenre = re.split(r';', rf.info["genre"]) if "genre" in rf.info else []

I get a syntax error at "if" in 2.4.3. 

I tried doing the following but it doesn't work (I don't understand why). 



if "genre" in rf.info:
  rgenre = re.split(r';', rf.info["genre"])
else:
  []


And I tried this, too: 


if "genre" in rf.info:
  rgenre = re.split(r';', rf.info["genre"])
if "genre" not in rf.info:
  []

Both of these cause problems later on in the program, specifically 
"UnboundLocalError: local variable 'rgenre' referenced before assignment", 
about this line in the code (where each of these is one of the 30 possible 
genres):

rg1, rg2, rg3, rg4, rg5, rg6, rg7, rg8, rg9, rg10, rg11, rg12, rg13, rg14, 
rg15, rg16, rg17, rg18, rg19, rg20, rg21, rg22, rg23, rg24, rg25, rg26, rg27, 
rg28, rg29, rg30= rgenre + ["NA"]*(30-len(rgenre[:30]))

Thanks for any help--I gave a little extra information, but I'm hoping there's 
a simple rewrite of the first line I gave that'll work in Python 2.4.3.
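
Conditional expressions ("x if cond else y") were added in Python 2.5, which is why 2.4.3 reports a syntax error at "if". Both attempts above fail for a different reason: a bare [] in the else branch builds an empty list and discards it, so rgenre is never assigned when the key is missing. A minimal sketch of a statement form that avoids both problems; the function name genres_from_info is hypothetical and a plain dict stands in for rf.info.

```python
import re


def genres_from_info(info):
    # Plain if/else statements work on interpreters that predate the
    # conditional expression.
    if "genre" in info:
        rgenre = re.split(r";", info["genre"])
    else:
        rgenre = []  # assign the fallback; a bare [] would simply be discarded
    return rgenre


print(genres_from_info({"genre": "Drama;Western"}))
print(genres_from_info({}))
```

With rgenre always bound, the padding line that follows can run whether or not the file has genre information.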

Thanks,

Tyler


