Thanks very much for the extensive comments, Steve. I can get the code you wrote to work on my toy data, but my real input data is actually contained in 10 files that are about 1.5 GB each--when I try to run the code on one of those files, everything freezes.
To solve this, I tried just having the data write to a different csv file:

lines = csv.reader(file(src_filename))
csv_writer = csv.writer(file(output_filename, 'w'))
for line in lines:
    doc, g1, g2, g3, rating, ratingmax, reviewer, helpful, h_total, word, count = line
    row = [add_word(g1, word, count),
           add_word(g2, word, count),
           add_word(g3, word, count)]
    csv_writer.writerow(row)

This doesn't work--I think there are problems in how the iterations happen. But my guess is that converting from one CSV to another isn't going to be as efficient as creating a shelve database. I have some code that works to create a db when I release it on a small subset of my data, but when I try to turn one of the 1.5 GB files into a db, it can't do it. I don't understand why it works for small data and not big (it makes sense to me that your table approach might choke on big amounts of data--but why the shelve() code below?)

I think these are the big things I'm trying to get the code to do:

- Get my giant CSV files into a useful format, probably a db (can do for small amounts of data, but not large)
- Extract genre and star-rating information about particular words from the db (I seem to be able to do this)
- Get total counts for all words in each genre, and for all words in each star-rating category (your table approach works on small data, but I can't get it to scale)

def csv2shelve(src_filename, shelve_filename):
    # I open the shelve file for writing.
    if os.path.exists(shelve_filename):
        os.remove(shelve_filename)
    # I create the shelve db.
    db = shelve.open(shelve_filename, writeback=True)
    # The writeback stuff is a little confusing in the help pages, maybe this is a problem?
    # I open the src file.
    lines = csv.reader(file(src_filename))
    for line in lines:
        doc, g1, g2, g3, rating, word, count = line
        if word not in db:
            db[word] = []
        try:
            rating = int(rating)
        except ValueError:
            pass
        db[word].append({
            "genres": {g1: True, g2: True, g3: True},
            "rating": rating,
            "count": int(count),
            })
    db.close()

Thanks again, Steve. (And everyone/anyone else.)

Tyler

-----Original Message-----
From: tutor-requ...@python.org
To: tutor@python.org
Sent: Sat, Oct 2, 2010 1:36 am
Subject: Tutor Digest, Vol 80, Issue 10

Send Tutor mailing list submissions to
    tutor@python.org

To subscribe or unsubscribe via the World Wide Web, visit
    http://mail.python.org/mailman/listinfo/tutor
or, via email, send a message with subject or body 'help' to
    tutor-requ...@python.org

You can reach the person managing the list at
    tutor-ow...@python.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Tutor digest..."

Today's Topics:

   1. Re: (de)serialization questions (Lee Harr)
   2. Re: regexp: a bit lost (Steven D'Aprano)
   3. Re: regexp: a bit lost (Alex Hall)
   4. Re: (de)serialization questions (Alan Gauld)
   5. Re: Getting total counts (Steven D'Aprano)
   6. data question (Roelof Wobben)

----------------------------------------------------------------------

Message: 1
Date: Sat, 2 Oct 2010 03:26:21 +0430
From: Lee Harr <miss...@hotmail.com>
To: <tutor@python.org>
Subject: Re: [Tutor] (de)serialization questions
Message-ID: <snt106-w199add4fdc9da1c977f89cb1...@phx.gbl>
Content-Type: text/plain; charset="windows-1256"

>> I have data about zip codes, street and city names (and perhaps later
>> also of street numbers). I made a dictionary of the form
>> {zipcode: (street, city)}
>
> One dictionary with all of the data?
>
> That does not seem like it will work. What happens when
> 2 addresses have the same zip code?

You did not answer this question. Did you think about it?
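On the csv2shelve code earlier in this mail, one likely culprit is writeback=True: with it, shelve keeps every entry you touch in an in-memory cache until sync() or close(), so a 1.5 GB input can exhaust memory even though a small subset works fine. Here is a sketch of the same loop without writeback, in modern Python (the seven-column layout is taken from the code above); with the default writeback=False you must reassign db[word] for the change to be stored:

```python
import csv
import os
import shelve

def csv2shelve(src_filename, shelve_filename):
    """Sketch: build the word db without writeback=True, so shelve
    does not cache every touched entry in memory."""
    if os.path.exists(shelve_filename):
        os.remove(shelve_filename)
    db = shelve.open(shelve_filename)  # writeback defaults to False
    with open(src_filename, newline='') as f:
        for line in csv.reader(f):
            doc, g1, g2, g3, rating, word, count = line
            try:
                rating = int(rating)
            except ValueError:
                pass
            entries = db.get(word, [])
            entries.append({
                "genres": {g1: True, g2: True, g3: True},
                "rating": rating,
                "count": int(count),
                })
            db[word] = entries  # reassign so the change is written back
    db.close()
```

The read-modify-reassign step costs a little per row, but memory use stays bounded by the largest single entry rather than by the whole file.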
> Maybe my main question is as follows: what permanent object is most
> suitable to store a large amount of entries (maybe too many to fit into
> the computer's memory), which can be looked up very fast.

One thing about Python is that you don't normally need to
think about how your objects are stored (memory management).

It's an advantage in the normal case -- you just use the most
convenient object, and if it's fast enough and small enough
you're good to go.

Of course, that means that if it is not fast enough, or not
small enough, then you've got a bit more work to do.

> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code

So... you want to have a function like:

def addresses_by_zip(zipcode):
    '''returns list of all addresses in the given zipcode'''
    ....

> 2-one to look up zip code using street name, apartment number and city

and another one like:

def zip_by_address(street_name, apt, city):
    '''returns the zipcode for the given street name, apartment, and city'''
    ....

To me, it sounds like a job for a database (at least behind the scenes),
but you could try just creating a custom Python object that holds
these things:

class Address(object):
    street_number = '345'
    street_name = 'Main St'
    apt = 'B'
    city = 'Springfield'
    zipcode = '99999'

Then create another object that holds a collection of these addresses
and has methods addresses_by_zip(self, zipcode) and
zip_by_address(self, street_number, street_name, apt, city)

> I stored object1 in a marshalled dictionary. Its length is about 450.000
> (I live in Holland, not THAT many streets). Look-ups are incredibly fast
> (it has to, because it's part of an autocompletion feature of a data
> entry program). I haven't got the street number data needed for object2
> yet, but it's going to be much larger. Many streets have different zip
> codes for odd or even numbers, or the zip codes are divided into street
> number ranges (for long streets).
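The sketch above can be fleshed out into something runnable. The collection name (AddressBook) and the lookup key (street name, apt, city) are invented for illustration; storing a list of addresses per zip code also answers the earlier "two addresses, same zip code" question:

```python
class Address(object):
    def __init__(self, street_number, street_name, apt, city, zipcode):
        self.street_number = street_number
        self.street_name = street_name
        self.apt = apt
        self.city = city
        self.zipcode = zipcode

class AddressBook(object):
    """Hypothetical collection with the two lookups from the message above."""
    def __init__(self):
        self._by_zip = {}   # zipcode -> list of Address
        self._by_addr = {}  # (street_name, apt, city) -> zipcode

    def add(self, address):
        # A list per zipcode, so several addresses can share one zip code.
        self._by_zip.setdefault(address.zipcode, []).append(address)
        key = (address.street_name, address.apt, address.city)
        self._by_addr[key] = address.zipcode

    def addresses_by_zip(self, zipcode):
        '''returns list of all addresses in the given zipcode'''
        return self._by_zip.get(zipcode, [])

    def zip_by_address(self, street_name, apt, city):
        '''returns the zipcode for the given street name, apartment, and city'''
        return self._by_addr.get((street_name, apt, city))
```

Both lookups are dict lookups, so they stay fast regardless of size, as long as everything fits in memory.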
Remember that you don't want to try to optimize too soon.
Build a simple working system and see what happens. If it
is too slow or takes up too much memory, fix it.

> You suggest to simply use a file. I like simple solutions, but doesn't
> that, by definition, require a slow, linear search?

You could create an index, but then any database will already have
an indexing function built in.

I'm not saying that rolling your own custom database is a bad idea,
but if you are trying to get some work done (and not just playing around
and learning Python) then it's probably better to use something that is
already proven to work.

If you have some code you are trying out, but are not sure you
are going the right way, post it and let people take a look at it.

------------------------------

Message: 2
Date: Sat, 2 Oct 2010 10:19:21 +1000
From: Steven D'Aprano <st...@pearwood.info>
To: Python Tutor <Tutor@python.org>
Subject: Re: [Tutor] regexp: a bit lost
Message-ID: <201010021019.21909.st...@pearwood.info>
Content-Type: text/plain; charset="iso-8859-1"

On Sat, 2 Oct 2010 01:14:27 am Alex Hall wrote:
> >> Here is my test:
> >> s=re.search(r"[\d+\s+\d+\s+\d]", l)
> >
> > Try this instead:
> >
> > re.search(r'\d+\s+\D*\d+\s+\d', l)
[...]
> Understood. My intent was to ask why my regexp would match anything at
> all.

Square brackets create a character set, so your regex tests for a string
that contains a single character matching a digit (\d), a plus sign (+)
or a whitespace character (\s). The additional \d + \s in the square
brackets are redundant and don't add anything.
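To make the difference concrete, here is a quick demonstration (modern Python; the sample strings are invented). The bracketed pattern matches any single character from the set, so it fires even on a string with no digits at all, while the corrected pattern requires the digits/whitespace structure:

```python
import re

# Character set: one character that is a digit, a '+' or whitespace.
# It matches the first space of a digit-free string:
m = re.search(r"[\d+\s+\d+\s+\d]", "no digits here")
assert m is not None and m.group() == " "

# The corrected pattern needs digits, whitespace, then more digits:
assert re.search(r"\d+\s+\D*\d+\s+\d", "no digits here") is None

# ...and matches a string that actually has that shape:
m = re.search(r"\d+\s+\D*\d+\s+\d", "12  ab34 5")
assert m.group() == "12  ab34 5"
```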
--
Steven D'Aprano

------------------------------

Message: 3
Date: Fri, 1 Oct 2010 20:47:29 -0400
From: Alex Hall <mehg...@gmail.com>
To: "Steven D'Aprano" <st...@pearwood.info>
Cc: Python Tutor <Tutor@python.org>
Subject: Re: [Tutor] regexp: a bit lost
Message-ID: <aanlktin=bajcu0e8py46gjzkmur8mtsembzc6=m8s...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On 10/1/10, Steven D'Aprano <st...@pearwood.info> wrote:
> On Sat, 2 Oct 2010 01:14:27 am Alex Hall wrote:
> >> Here is my test:
> >> s=re.search(r"[\d+\s+\d+\s+\d]", l)
> >
> > Try this instead:
> >
> > re.search(r'\d+\s+\D*\d+\s+\d', l)
> [...]
> > Understood. My intent was to ask why my regexp would match anything
> > at all.
>
> Square brackets create a character set, so your regex tests for a string
> that contains a single character matching a digit (\d), a plus sign (+)
> or a whitespace character (\s). The additional \d + \s in the square
> brackets are redundant and don't add anything.

Ah, that explains it then. :) Thanks.

> --
> Steven D'Aprano
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap

------------------------------

Message: 4
Date: Sat, 2 Oct 2010 02:01:40 +0100
From: "Alan Gauld" <alan.ga...@btinternet.com>
To: tutor@python.org
Subject: Re: [Tutor] (de)serialization questions
Message-ID: <i8609s$l4...@dough.gmane.org>
Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original

"Albert-Jan Roskam" <fo...@yahoo.com> wrote

> Maybe my main question is as follows: what permanent object is most
> suitable to store a large amount of entries (maybe too many to fit into
> the computer's memory), which can be looked up very fast.

It depends on the nature of the object and the lookup but in general
a database would be the best solution.
For special (hierarchical) data an LDAP directory may be more
appropriate. Otherwise you are looking at a custom designed file
structure.

> Eventually, I want to create two objects:
> 1-one to look up street name and city using zip code
> 2-one to look up zip code using street name, apartment number and city

For this a simple relational database would be best. SQLite should do
and is part of the standard library. It can also be used in memory
for faster speed with smaller data sets.

> You suggest to simply use a file. I like simple solutions, but doesn't
> that, by definition, require a slow, linear search?

No, you can use random access provided you can relate the key to the
location - that's what databases do for you under the covers.

> Funny you should mention sqlite: I was just considering it yesterday.
> Gosh, Python has so much interesting stuff to offer!

SQLite operating in-memory would be a good solution for you I think.
You can get a basic tutorial on SQLite and Python in the databases
topic of my tutorial...

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/

------------------------------

Message: 5
Date: Sat, 2 Oct 2010 11:13:04 +1000
From: Steven D'Aprano <st...@pearwood.info>
To: tutor@python.org
Subject: Re: [Tutor] Getting total counts
Message-ID: <201010021113.04679.st...@pearwood.info>
Content-Type: text/plain; charset="utf-8"

On Sat, 2 Oct 2010 06:31:42 am aenea...@priest.com wrote:
> Hi, I have created a csv file that lists how often each word in the
> Internet Movie Database occurs with different star-ratings and in
> different genres.

I would have thought that IMDB would probably have already made that
information available?

http://www.imdb.com/interfaces

> The input file looks something like this--since movies can have
> multiple genres, there are three genre rows. (This is fake, simplified
> data.)
[...]
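Since the real files are too big for in-memory dicts, the SQLite route from the previous message is worth a sketch for exactly this counting job: the rows stay on disk and SQL's GROUP BY computes the per-genre and per-rating totals. The schema, column names, and sample rows below are invented for illustration; a real run would pass a filename instead of ":memory:":

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real filename for on-disk storage
conn.execute(
    "CREATE TABLE wordcounts (genre TEXT, rating INTEGER, word TEXT, cnt INTEGER)")
conn.execute("CREATE INDEX idx_genre ON wordcounts (genre)")

# A few invented rows: (genre, star-rating, word, count)
rows = [("Western", 1, "the", 934), ("Western", 1, "scary", 3),
        ("Thriller", 2, "the", 899), ("Thriller", 2, "scary", 145)]
conn.executemany("INSERT INTO wordcounts VALUES (?, ?, ?, ?)", rows)

# Total words per genre -- the "all-word totals" question:
totals = dict(conn.execute(
    "SELECT genre, SUM(cnt) FROM wordcounts GROUP BY genre"))
# totals == {'Western': 937, 'Thriller': 1044}
```

The same GROUP BY over the rating column gives per-star totals, and SELECT SUM(cnt) with no grouping gives the whole-corpus count, all without loading the file into memory.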
> I can get the program to tell me how many occurrences of "the" there are
> in Thrillers (50), how many "the"'s in 1-stars (50), and how many 1-star
> drama "the"'s there are (30). But I need to be able to expand beyond a
> particular word and say "how many words total are in Drama"? How many
> total words are in 1-star ratings? How many words are there in the whole
> corpus? On these all-word totals, I'm stumped.

The headings of your data look like this:

ID | Genre1 | Genre2 | Genre3 | Star-rating | Word | Count

and you want to map words to genres. Can you tell us how big the CSV
file is? Depending on its size, you may need to use on-disk storage
(perhaps shelve, as you're already doing) but for illustration purposes
I'll assume it all fits in memory and just use regular dicts. I'm going
to create a table that stores the counts for each word versus the
genre:

Genre    | the | scary | silly | exciting | ...
-----------------------------------------------
Western  | 934 |     3 |     5 |      256 |
Thriller | 899 |   145 |    84 |      732 |
Comedy   | 523 |     1 |   672 |       47 |
...

To do this using dicts, I'm going to use a dict for genres:

genre_table = {"Western": table_of_words, ...}

and each table_of_words will look like:

{'the': 934, 'scary': 3, 'silly': 5, ...}

Let's start with a helper function and table to store the data.

# Initialise the table.
genres = {}

def add_word(genre, word, count):
    genre = genre.title()  # force "gEnRe" to "Genre"
    word = word.lower()    # force "wOrD" to "word"
    count = int(count)
    row = genres.get(genre, {})
    n = row.get(word, 0)
    row[word] = n + count
    genres[genre] = row

We can simplify this code by using the confusingly named, but useful,
setdefault method of dicts:

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count

Now let's process the CSV file.
I'm afraid I can't remember how the CSV
module works, and I'm too lazy to look it up, so this is pseudo-code
rather than Python:

for row in csv file:
    genre1 = get column Genre1
    genre2 = get column Genre2
    genre3 = get column Genre3
    word = get column Word
    count = get column Count
    add_word(genre1, word, count)
    add_word(genre2, word, count)
    add_word(genre3, word, count)

Now we can easily query our table for useful information:

# list of unique words for the Western genre
genres["Western"].keys()

# count of unique words for the Romance genre
len(genres["Romance"])

# number of times "underdog" is used in Sports movies
genres["Sport"]["underdog"]

# total count of words for the Comedy genre
sum(genres["Comedy"].values())

Do you want to do lookups efficiently the other way as well? It's easy
to add another table:

Word  | Western | Thriller | ...
--------------------------------
the   |     934 |      899 |
scary |       3 |      145 |
...

Add a second global table:

genres = {}
words = {}

and modify the helper function:

def add_word(genre, word, count):
    genre = genre.title()
    word = word.lower()
    count = int(count)
    # Add word to the genres table.
    row = genres.setdefault(genre, {})
    row[word] = row.get(word, 0) + count
    # And to the words table.
    row = words.setdefault(word, {})
    row[genre] = row.get(genre, 0) + count

--
Steven D'Aprano

------------------------------

Message: 6
Date: Sat, 2 Oct 2010 08:35:13 +0000
From: Roelof Wobben <rwob...@hotmail.com>
To: <tutor@python.org>
Subject: [Tutor] data question
Message-ID: <snt118-w643d8156677eb89d46d414ae...@phx.gbl>
Content-Type: text/plain; charset="iso-8859-1"

Hello,

As a test I would like to write a program where a user can input game data
(like home-team, away-team, home-score, away-score) and it makes a ranking
of it. And I'm not looking for an OOP solution because I'm not comfortable
with OOP.

Now my question is: in which datatype can I put this data?
I thought myself of a dictionary of tuples.
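A dictionary of tuples does fit that: one running (wins, draws, losses, points) tuple per team, no classes needed. A minimal sketch, assuming football-style scoring (3 points for a win, 1 for a draw -- an assumption, adjust as needed):

```python
# table maps team name -> (wins, draws, losses, points); names invented.
table = {}

def add_game(home, away, home_score, away_score):
    """Update both teams' running totals for one game."""
    for team, ours, theirs in ((home, home_score, away_score),
                               (away, away_score, home_score)):
        w, d, l, p = table.get(team, (0, 0, 0, 0))
        if ours > theirs:
            w, p = w + 1, p + 3   # win: 3 points (assumed rule)
        elif ours == theirs:
            d, p = d + 1, p + 1   # draw: 1 point (assumed rule)
        else:
            l += 1                # loss: no points
        table[team] = (w, d, l, p)

def ranking():
    """Team names sorted by points, highest first."""
    return sorted(table, key=lambda team: table[team][3], reverse=True)
```

Tuples are immutable, so each update builds a fresh tuple and stores it back, which keeps the data shape simple.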
Regards,
Roelof

------------------------------

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

End of Tutor Digest, Vol 80, Issue 10
*************************************