Good evening, : It turns out that matters of efficiency appear to be VERY : important in this case. The example in my message was a very : short string but the file that I'm trying to process is pretty : big (20MB of text).
Efficiency is best addressed first and foremost, not by hardware, but by choosing the correct data structure and algorithm for processing the data. You have more than enough hardware to deal with this problem, and appear to be wrestling still with why this apparently simple problem is An earlier responder (Peter Otten) pointed out to you that efficiency was one issue: you repeat counting the number of occurrences of a word that appears N times N times And another (Steven D'Aprano) pointed out that your countWords will be inefficient, but probably tolerable for data sets under about 10000 words. : frequencies and it has been running for over half an hour without : having generated the output file yet. I'm using a pretty powerful : computer (core i7 with 8GB of RAM) so I'm a little surprised (and : a bit worried as well) that the process hasn't finished yet. I : tested the script before with a much smaller file and the output : was as desired. [your comment, snipped in here, out of order] : However, even with countWords2, which is supposed to overcome this : problem, it feels as if I've entered an infinite loop. You have a 20MB file and 8GB of RAM, and it has taken half an hour? You have entered an infinite loop (or some other horrible thing). First, I would say that you don't have to worry too much about efficiency, for your first draft, just work on correctness of result. My machine is not as snappy as yours and finishes the job in the sloppiest way in well under 1 second. However, once you have solved the problem and feel good about the correctness, then learning how to process files/data efficiently would be a step in the direction of processing the same problem on a 20GB or 20TB file. : When I look at the current processes running on my computer, I : see the Python process taking 100% of the CPU. Since my computer : has a multi-core processor, I'm assuming this process is using : only one of the cores because another monitor tells me that the : CPU usage is under 20%. Correct. : This doesn't make much sense to me. I bought a computer with a : powerful CPU precisely to do these kinds of things as fast as : possible. How can it be that Python is only using such a small : amount of processing power? <digression> This is far afield from the question of word count, but may be useful someday. The beauty of a multiple processors is that you can run independent processes simultaneously (I'm not talking about multitasking). Using most languages(*), a single process will only use one of your available processors. Obviously, this was once a significant limitation, and so we came up with the idea of threads as a way to take advantage of multiple processors inside a single process. Python supports threads. An application must be written to take advantage of threading. I do not think I would recommend that for you until you have eked out as much performance as you can using a process based model. Here are the very first links I can find which delve into the topic, though there's much more there: http://docs.python.org/library/threading.html http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/ http://www.dabeaz.com/python/GIL.pdf </digression> OK, on to your code. : def countWords(wordlist): : word_table = {} : for word in wordlist: : count = wordlist.count(word) : print "word_table[%s] = %s" % (word,word_table.get(word,'<none>')) : word_table[word] = count Problem 1: You aren't returning anything from this function. Add: return word_table Problem 2: You are doing more work than you need. Peter Otten pointed out how. To see what he was observing, try this riff on your function: def countWords(wordlist): word_table = dict() for word in wordlist: count = wordlist.count(word) print "word_table[%s] = %s\n" % (word,word_table.get(word,'<none>')) word_table[word] = count return word_table What you should see is evidence that the second and third times that you iterate over the word 'the', you are updating an entry in the 'word_table' dictionary that already exists with the correct value. : def countWords2(wordlist): #as proposed by Peter Otten : word_table = {} : for word in wordlist: : if word in word_table: : word_table[word] += 1 : else: : word_table[word] = 1 : count = wordlist.count(word) : word_table[word] = count : return sorted( : word_table.items(), key=lambda item: item[1], reverse=True : ) In the above, countWords2, why not omit these lines: : count = wordlist.count(word) : word_table[word] = count And, try the function again. Let try a (bit of a labored) analogy of your problem. To approximate your algorithm. I have a clear tube with gumballs of a variety of colors. I open up one end of the tube, and mark where I'm starting. I pull out a gumball, and discover it's red. I mark down on my chalkboard that I have 1 red gumball. I count all of the red gumballs I can see in the clear tube. I mark down that I have 5 red gumballs. I put that gumball back in the other end of the tube. I take the second gumball from the tube and discover it's green. I mark down that I have 1 green gumball. I count all of the green gumballs I can see in the clear tube. I mark down that I have 1 green gumball. I put that gumball back in the other end of the tube. I pull out a gumball, and discover it's red. I mark down on my chalkboard that I have 1 red gumball. I count all of the red gumballs I can see in the clear tube. I mark down that I have 5 red gumballs. I put that gumball back in the other end of the tube. ... Do you see the duplicated effort? I would also suggest re-reading both the mail from Peter Otten and Steven D'Aprano. You appear to be much closer than earlier, but both of them are pointing out something that you don't appear to quite have grasped yet. As a side note, you may find it useful to simply play with some lists and dictionaries in the python interpreter: >>> words = 'in the garden on the bank behind the tree'.split() >>> words ['in', 'the', 'garden', 'on', 'the', 'bank', 'behind', 'the', 'tree'] >>> word_table = dict() >>> w = words[0] >>> word_table[w] = words.count(w) >>> word_table {'in': 1} >>> w = words[1] >>> word_table[w] = words.count(w) >>> word_table {'the': 3, 'in': 1} Once you gain familiarity with the lists and dicts, you can try out collections, as suggested by Peter Otten. Enjoy and good luck! -Martin * Somebody will be certain to point out a language or languages that provide some sort of facility to abstract the use of multiple processors without the explicit use of threads. -- Martin A. Brown http://linux-ip.net/ _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor