[Tutor] Simple counter to determine frequencies of words in a document
Hi,

I'm trying to do something that should be very simple. I want to generate a list of the words that appear in a document according to their frequencies. So, the list generated by the script should be something like this:

the : 3
book: 2
was : 2
read: 1
by: 1
[...]

This would be obtained from a document that contained, for example, the following text:

"The book was read by an unknown person before the librarian found that the book was missing."

The code I started writing to achieve this result can be seen below. You will see that first I'm trying to create a dictionary that contains the word as the key with the frequency as its value. Later on I will transform the dictionary into a text file with the desired formatting.

The problem is that, in the first test I ran, the output file that should contain the dictionary is empty. I'm afraid that this is the consequence of a very basic misunderstanding of how Python works. I've tried to piece this together from other scripts but obviously the program is not doing what it is supposed to do. I know the function should work so the problem is obviously in how I call the function. That is, how I'm trying to write the stuff (a dictionary) that the function returns into the output file.

The relevant part of the code is so short that I'm sure it will take seconds for most people in the list to spot the problem but I've spent quite a lot of time changing things around and I cannot get it to work as desired. Can anybody tell me what's wrong so that I can say "duh" to myself once again?

---
def countWords(a_list):
    words = {}
    for i in range(len(a_list)):
        item = a_list[i]
        count = a_list.count(item)
        words[item] = count
    return sorted(words.items(), key=lambda item: item[1], reverse=True)

with open('output.txt', 'a') as token_freqs:
    with open('input.txt', 'r') as out_tokens:
        token_list = countWords(out_tokens.read())
        token_freqs.write(token_list)
--

Thanks in advance.

Josep M.
___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Simple counter to determine frequencies of words in a document
"Josep M. Fontana" wrote The code I started writing to achieve this result can be seen below. You will see that first I'm trying to create a dictionary that contains the word as the key with the frequency as its value. Later on I will transform the dictionary into a text file with the desired formatting. Thats the right approach... things around and I cannot get it to work as desired. Can anybody tell me what's wrong so that I can say "duh" to myself once again? I'll give some comments --- def countWords(a_list): words = {} for i in range(len(a_list)): item = a_list[i] count = a_list.count(item) words[item] = count return sorted(words.items(), key=lambda item: item[1], reverse=True) The loop is a bit clunky. it would be clearer just to iterate over a_list: for item in a_list: words[item] = a_list.count(item) And the return value is a list of tuples, which when you write it will be a single long line containing the string representation. Is tat what you want? with open('output.txt', 'a') as token_freqs: with open('input.txt', 'r') as out_tokens: token_list = countWords(out_tokens.read()) token_freqs.write(token_list) read returns a single string. Using a for loop on a string will get you the characters in the string not the words. Also you probably want to use 'w' mode for your output file to create a new one each time, otherwise the file will keep getting bigger everytime you run the code. HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] JOB AD PROJECT
wrote

> I have done some extensive reading on python.
> i want to design a classifieds site for jobs.

Have you done any extensive programming yet? Reading alone will be of limited benefit, and jumping into a fairly complex web project before you have the basics mastered will be a painful experience. I'm assuming you have at least some experience of web programming in other languages for you to take such a bold step?

> I am taking this as my first project and would appreciate any help.
> I am looking in the direction of python, maybe django.

The choice of tools is fine, but it's a big task for a first ever Python project!

HTH,

-- Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/
Re: [Tutor] Simple counter to determine frequencies of words in a document
Alan Gauld wrote:

> The loop is a bit clunky. it would be clearer just to iterate over
> a_list:
>
> for item in a_list:
>     words[item] = a_list.count(item)

This is a very inefficient approach because, for a word that appears N times, you count its occurrences N times:

>>> words = {}
>>> a_list = "in the garden on the bank behind the tree".split()
>>> for word in a_list:
...     print "counting", word
...     words[word] = a_list.count(word)
...
counting in
counting the # <-- 1
counting garden
counting on
counting the # <-- 2
counting bank
counting behind
counting the # <-- 3
counting tree
>>> words
{'on': 1, 'garden': 1, 'tree': 1, 'behind': 1, 'in': 1, 'the': 3, 'bank': 1}

To avoid the duplicate effort you can check if the word was already counted:

>>> words2 = {}
>>> for word in a_list:
...     if word not in words2:
...         print "counting", word
...         words2[word] = a_list.count(word)
...
counting in
counting the
counting garden
counting on
counting bank
counting behind
counting tree
>>> words == words2
True

Inside the count() method you are still implicitly iterating over the entire list once for every distinct word. I would instead prefer counting manually while iterating over the list once. This has the advantage that it will even work if you don't keep the whole sequence of words in a list in memory (e. g. if you read them from a file one line at a time):

>>> words3 = {}
>>> for word in a_list:
...     if word in words3:
...         words3[word] += 1
...     else:
...         words3[word] = 1
...
>>> words3 == words
True

Finally there's collections.defaultdict or, in Python 2.7, collections.Counter when you are more interested in the result than the way to achieve it.

Peter
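As a rough illustration of the collections.Counter route mentioned at the end of the message, here is a small sketch. It is written for Python 3 (Counter also exists in Python 2.7), and the function name count_words is an assumption made for this example. It tallies the same sample sentence in one pass, and also shows how the tally could be fed one line at a time so that the full word list never has to sit in memory.

```python
from collections import Counter

a_list = "in the garden on the bank behind the tree".split()

# Counter makes a single pass over the words and tallies each one.
counts = Counter(a_list)

# most_common(n) returns (word, count) pairs, highest count first.
top = counts.most_common(1)
print(top)


def count_words(lines):
    """Tally words from any iterable of lines (hypothetical helper).

    An open file object can be passed directly, since iterating over a
    file yields one line at a time.
    """
    tally = Counter()
    for line in lines:
        tally.update(line.split())
    return tally


streamed = count_words(["in the garden on the bank", "behind the tree"])
print(streamed == counts)
```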
Re: [Tutor] Simple counter to determine frequencies of words in a document
Josep M. Fontana wrote:

> def countWords(a_list):
>     words = {}
>     for i in range(len(a_list)):
>         item = a_list[i]
>         count = a_list.count(item)
>         words[item] = count
>     return sorted(words.items(), key=lambda item: item[1], reverse=True)
>
> with open('output.txt', 'a') as token_freqs:
>     with open('input.txt', 'r') as out_tokens:
>         token_list = countWords(out_tokens.read())
>         token_freqs.write(token_list)

When you run that code, are you SURE that it merely results in the output file being blank? When I run it, I get an obvious error:

Traceback (most recent call last):
  File "", line 4, in
TypeError: argument 1 must be string or read-only character buffer, not list

Don't you get this error too?

The first problem is that file.write() doesn't take a list as argument, it requires a string. You feed it a list of (word, frequency) pairs. You need to decide how you want to format the output.

The second problem is that you don't actually generate word frequencies, you generate letter frequencies. When you read a file, you get a string, not a list of words. A string is equivalent to a list of letters:

>>> for item in "hello":
...     print(item)
...
h
e
l
l
o

Your countWords function itself is reasonable, apart from some stylistic issues, and some inefficiencies which are unnoticeable for small numbers of words, but will become extremely costly for large lists of words. Ignoring that, here's my suggested code: you might like to look at the difference between what I have written, and what you have, and see if you can tell why I've written what I have.
def countWords(wordlist):
    word_table = {}
    for word in wordlist:
        count = wordlist.count(word)
        word_table[word] = count
    return sorted(
        word_table.items(), key=lambda item: item[1], reverse=True
        )

def getWords(filename):
    with open(filename, 'r') as f:
        words = f.read().split()
    return words

def writeTable(filename, table):
    with open(filename, 'w') as f:
        for word, count in table:
            f.write("%s %s\n" % (word, count))

words = getWords('input.txt')
table = countWords(words)
writeTable('output.txt', table)

For bonus points, you might want to think about why countWords will be so inefficient for large word lists, although you probably won't see any problems until you're dealing with thousands or tens of thousands of words.

-- Steven
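One way to think about the bonus question is a back-of-the-envelope model, sketched below in Python 3. The two function names are invented for this illustration and the numbers are rough estimates, not measurements: list.count() has to scan the whole list, and countWords calls it once per word, so the work grows roughly with the square of the list length, while a single pass that updates a dictionary grows roughly in step with it.

```python
def comparisons_with_count(n_words):
    # Rough model: list.count() scans all n_words items, and it is
    # called once for every item in the list.
    return n_words * n_words


def comparisons_single_pass(n_words):
    # Rough model: one dictionary update per word.
    return n_words


for n in (1000, 100000):
    print(n, comparisons_with_count(n), comparisons_single_pass(n))
```

Under this model, making the list a hundred times longer multiplies the count()-based work by about ten thousand, which is why the slowdown only becomes visible on large inputs.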
[Tutor] code quest
OK, I need to create or find a function that will return a list of DIRECTORIES (only) which are under 'the current directory'. Anyone got some clue on this? Please advise. -- end Very Truly yours, - Kirk Bailey, Largo Florida kniht +-+ | BOX | +-+ think ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] code quest
On 11/20/2010 11:03 AM Kirk Bailey said... OK, I need to create or find a function that will return a list of DIRECTORIES (only) which are under 'the current directory'. Anyone got some clue on this? Please advise. Use os.walk Emile Help on function walk in module os: walk(top, topdown=True, onerror=None, followlinks=False) Directory tree generator. For each directory in the directory tree rooted at top (including top itself, but excluding '.' and '..'), yields a 3-tuple dirpath, dirnames, filenames dirpath is a string, the path to the directory. dirnames is a list of the names of the subdirectories in dirpath (excluding '.' and '..'). filenames is a list of the names of the non-directory files in dirpath. Note that the names in the lists are just names, with no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name). ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
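Building on the os.walk documentation quoted above, the following is one possible sketch in Python 3. The function names immediate_subdirs and all_subdirs are made up for this example. The first returns only the directories directly under the starting point, the second collects every directory at any depth.

```python
import os


def immediate_subdirs(top="."):
    """Names of the directories directly under top (hypothetical helper)."""
    # The first tuple yielded by os.walk() describes top itself, so its
    # dirnames list is exactly the set of direct subdirectories.
    for dirpath, dirnames, filenames in os.walk(top):
        return sorted(dirnames)
    return []  # nothing was yielded, e.g. top does not exist


def all_subdirs(top="."):
    """Paths of every directory below top, at any depth."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames:
            # Names come without path components, so join them to dirpath.
            found.append(os.path.join(dirpath, name))
    return sorted(found)


print(immediate_subdirs("."))
```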
Re: [Tutor] Simple counter to determine frequencies of words in a document
Thanks Alan, Peter and Steve, Instead of answering each one of you independently let me try to use my response to Steve's message as the basis for an answer to all of you. It turns out that matters of efficiency appear to be VERY important in this case. The example in my message was a very short string but the file that I'm trying to process is pretty big (20MB of text). I'm writing to you as my computer is about to burst in flames. I'm exaggerating a little bit because I'm checking the temperature and things so far seem to be under control but I ran the script that I made up following your recommendations (see below) on the real file for which I wanted to get word frequencies and it has been running for over half an hour without having generated the output file yet. I'm using a pretty powerful computer (core i7 with 8GB of RAM) so I'm a little surprised (and a bit worried as well) that the process hasn't finished yet. I tested the script before with a much smaller file and the output was as desired. When I look at the current processes running on my computer, I see the Python process taking 100% of the CPU. Since my computer has a multi-core processor, I'm assuming this process is using only one of the cores because another monitor tells me that the CPU usage is under 20%. This doesn't make much sense to me. I bought a computer with a powerful CPU precisely to do these kinds of things as fast as possible. How can it be that Python is only using such a small amount of processing power? But I digress, I will start another thread to ask about this because I'm curious to know whether this can be changed in any way. Now, however, I'm more interested in getting the right answer to my original question. OK, I'll start with Steve's answer first. > When you run that code, are you SURE that it merely results in the output > file being blank? 
When I run it, I get an obvious error: > > Traceback (most recent call last): > File "", line 4, in > TypeError: argument 1 must be string or read-only character buffer, not list > > Don't you get this error too? Nope. I was surprised myself, but I did not get any errors. But I suspect that this is because I don't have my IDE well configured. Although (see below) I do get many other error messages, I didn't get any in this case. See, I'm not only a newbie in Python but a newbie with IDEs as well. I'm using Eclipse (probably I should have started with something smaller and simpler) and I see the following error message: Pylint: Executing command line:' /Applications/eclipse/Eclipse.app/Contents/MacOS --include-ids=y /Volumes/DATA/Documents/workspace/GCA/src/prova.py 'Pylint: The stdout of the command line is: Pylint: The stderr of the command line is: /usr/bin/python: can't find '__main__.py' in '/Applications/eclipse/Eclipse.app/Contents/MacOS' - Anyway, I tried the different alternatives all of you suggested with a small test file and everything worked perfectly. With the big file, however, none of the alternatives seems to work. Well, I don't know whether they work or not because the process takes so long that I have had to kill it out of desperation. The process I talk about at the beginning of this message is the one involving Peter's alternative. I think I'm going to kill it as well because now it has been running for 45 minutes and this seems way too long. So, here is how I wrote the code. You'll see that there are two different functions that do the same thing: countWords(wordlist) and countWords2(wordlist). countWords2 is adapted from Peter Otten's suggestion. This was the one that according to him would be more efficient. However, none of the versions (including Alan's as well) work when the file being processed is a large file. 
def countWords(wordlist):
    word_table = {}
    for word in wordlist:
        count = wordlist.count(word)
        word_table[word] = count

def countWords2(wordlist): #as proposed by Peter Otten
    word_table = {}
    for word in wordlist:
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
        count = wordlist.count(word)
        word_table[word] = count
    return sorted(
        word_table.items(), key=lambda item: item[1], reverse=True
        )

def getWords(filename):
    with open(filename, 'r') as f:
        words = f.read().split()
    return words

def writeTable(filename, table):
    with open(filename, 'w') as f:
        for word, count in table:
            f.write("%s\t%s\n" % (word, count))

words = getWords('tokens_short.txt')
table = countWords(words) # or table = countWords2(words)
writeTable('output.txt', table)

> For bonus points, you might want to think about why countWords will be so
> inefficient for large word lists, although you probably won't see any
> problems until you're dealing with thousands or tens of thousands of words.

Well, now it will be clear to you that I AM seeing bi
[Tutor] List help
x=0
y=0
w=raw_input("Input: ")
w=list(w)
for x in range(len(w)):
    a=w[x]
    t=0
    print a
    if a==2 or a==4 or a==6 or a==8 or a==10:
        t=a/2
        print "hi"

When I run this program, it doesn't print "hi". Can you please tell me why?
Re: [Tutor] code quest
"Kirk Bailey" wrote OK, I need to create or find a function that will return a list of DIRECTORIES (only) which are under 'the current directory'. Anyone got some clue on this? Please advise. You can use os.walk() to get a list of directories and files and then just throw away the files... HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Simple counter to determine frequencies of words in a document
If the file is big use Peter's method, but 45 minutes still seems very long so it may be there's a hidden bug in there somewhere. However...

> When I look at the current processes running on my computer, I see the
> Python process taking 100% of the CPU. Since my computer has a
> multi-core processor, I'm assuming this process is using only one of
> the cores because another monitor tells me that the CPU usage is under
> 20%. This doesn't make much sense to me.

It's perfectly normal. The computer assigns Python to one core and uses the other cores to run other tasks. That's why it's called multi-tasking. There are tricks to spread the Python load over multiple cores but that is rarely necessary, and I don't think we need it here.

> any in this case. See, I'm not only a newbie in Python but a newbie
> with IDEs as well. I'm using Eclipse (probably I should have started
> with something smaller and simpler) and I see the following error
> message:

Don't run your code inside the IDE except for testing. IDEs are Development Environments, they are not ideal for executing production code. Run your file from the Terminal command prompt directly.

> def countWords2(wordlist): #as proposed by Peter Otten
>     word_table = {}
>     for word in wordlist:
>         if word in word_table:
>             word_table[word] += 1
>         else:
>             word_table[word] = 1

OK to here...

>         count = wordlist.count(word)
>         word_table[word] = count

But you don't need these lines. They call count() for every word, which makes Python rescan the whole word list for every word. In this approach you are already counting the occurrences as you go, with the += 1 line. And in fact the assignment to word_table here overwrites the incremental counter and negates the value of the optimisation!

>     return sorted(
>         word_table.items(), key=lambda item: item[1], reverse=True
>         )

> words = getWords('tokens_short.txt')
> table = countWords(words) # or table = countWords2(words)
> writeTable('output.txt', table)

It would be worth putting some print statements between these functions just to monitor progress. Something like

print " reading file..."
print " counting words..."
print "writing file..."

That way you can see which function is running slowly, although it is almost certainly the counting. But as a general debugging tip it's worth remembering. A few (and I mean a few, don't go mad!) print statements can narrow things down very quickly.

> every time you encounter the same word in the loop. This is more or
> less what Peter said of the solution proposed by Alan, right?

Correct, but you have replicated that in Peter's optimised version.

HTH,

Alan G.
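The progress-print idea above could be taken one small step further by also timing each stage. The following is only a sketch, in Python 3 rather than the Python 2 used in the thread; the helper names timed and count_words and the sample sentence are assumptions for illustration.

```python
import time


def timed(label, func, *args):
    """Announce a stage, run func(*args), report the elapsed time."""
    print(label + "...")
    start = time.time()
    result = func(*args)
    print("  done in %.2f seconds" % (time.time() - start))
    return result


def count_words(wordlist):
    # Single pass: bump the running total for each word as it is seen.
    table = {}
    for word in wordlist:
        table[word] = table.get(word, 0) + 1
    return table


words = timed("splitting text", "the book was the book".split)
table = timed("counting words", count_words, words)
```

With a real file, the stage that prints its label and then never reports "done" is the one worth investigating.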
Re: [Tutor] Simple counter to determine frequencies of words in a document
Good evening, : It turns out that matters of efficiency appear to be VERY : important in this case. The example in my message was a very : short string but the file that I'm trying to process is pretty : big (20MB of text). Efficiency is best addressed first and foremost, not by hardware, but by choosing the correct data structure and algorithm for processing the data. You have more than enough hardware to deal with this problem, and appear to be wrestling still with why this apparently simple problem is An earlier responder (Peter Otten) pointed out to you that efficiency was one issue: you repeat counting the number of occurrences of a word that appears N times N times And another (Steven D'Aprano) pointed out that your countWords will be inefficient, but probably tolerable for data sets under about 1 words. : frequencies and it has been running for over half an hour without : having generated the output file yet. I'm using a pretty powerful : computer (core i7 with 8GB of RAM) so I'm a little surprised (and : a bit worried as well) that the process hasn't finished yet. I : tested the script before with a much smaller file and the output : was as desired. [your comment, snipped in here, out of order] : However, even with countWords2, which is supposed to overcome this : problem, it feels as if I've entered an infinite loop. You have a 20MB file and 8GB of RAM, and it has taken half an hour? You have entered an infinite loop (or some other horrible thing). First, I would say that you don't have to worry too much about efficiency, for your first draft, just work on correctness of result. My machine is not as snappy as yours and finishes the job in the sloppiest way in well under 1 second. However, once you have solved the problem and feel good about the correctness, then learning how to process files/data efficiently would be a step in the direction of processing the same problem on a 20GB or 20TB file. 
: When I look at the current processes running on my computer, I : see the Python process taking 100% of the CPU. Since my computer : has a multi-core processor, I'm assuming this process is using : only one of the cores because another monitor tells me that the : CPU usage is under 20%. Correct. : This doesn't make much sense to me. I bought a computer with a : powerful CPU precisely to do these kinds of things as fast as : possible. How can it be that Python is only using such a small : amount of processing power? This is far afield from the question of word count, but may be useful someday. The beauty of a multiple processors is that you can run independent processes simultaneously (I'm not talking about multitasking). Using most languages(*), a single process will only use one of your available processors. Obviously, this was once a significant limitation, and so we came up with the idea of threads as a way to take advantage of multiple processors inside a single process. Python supports threads. An application must be written to take advantage of threading. I do not think I would recommend that for you until you have eked out as much performance as you can using a process based model. Here are the very first links I can find which delve into the topic, though there's much more there: http://docs.python.org/library/threading.html http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/ http://www.dabeaz.com/python/GIL.pdf OK, on to your code. : def countWords(wordlist): : word_table = {} : for word in wordlist: : count = wordlist.count(word) : print "word_table[%s] = %s" % (word,word_table.get(word,'')) : word_table[word] = count Problem 1: You aren't returning anything from this function. Add: return word_table Problem 2: You are doing more work than you need. Peter Otten pointed out how. 
To see what he was observing, try this riff on your function: def countWords(wordlist): word_table = dict() for word in wordlist: count = wordlist.count(word) print "word_table[%s] = %s\n" % (word,word_table.get(word,'')) word_table[word] = count return word_table What you should see is evidence that the second and third times that you iterate over the word 'the', you are updating an entry in the 'word_table' dictionary that already exists with the correct value. : def countWords2(wordlist): #as proposed by Peter Otten : word_table = {} : for word in wordlist: : if word in word_table: : word_table[word] += 1 : else: : word_table[word] = 1 : count = wordlist.count(word) : word_table[word] = count : return sorted( : word_table.items(), key=lambda item: item[1], reverse=True : ) In the above, countWords2, why not omit these lines: : count = wordlist.count(w
Re: [Tutor] List help
On 11/20/2010 11:06 AM george wu said...

> x=0
> y=0
> w=raw_input("Input: ")
> w=list(w)
> for x in range(len(w)):
>     a=w[x]
>     t=0
>     print a
>     if a==2 or a==4 or a==6 or a==8 or a==10:
>         t=a/2
>         print "hi"
>
> When I run this program, it doesn't print "hi". Can you please tell me why?

When you're comparing a to 2,4,6,8,10, a is a string: the character at the nth position of the input value w, as iterated over with x. Specifically, a=w[x], where w is the list made from the input value.

>>> list("hello")
['h', 'e', 'l', 'l', 'o']
>>> list("1234")
['1', '2', '3', '4']

You then compare a string to the numbers 2,4,6,8,10, which never matches. Changing the test line to:

if a=="2" or a=="4" or a=="6" or a=="8":

should make it print hi. Note I dropped the 10 as a single character string would never match.

All this is likely beside the point -- what were you trying to have happen?

Emile
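If the goal was to pick out the even digits of the input and do arithmetic on them (which is only a guess at the intent), one possible approach is to convert each character to an integer before testing it. The sketch below is in Python 3, the function name even_digits is invented for the example, and unlike the original test it also treats 0 as even.

```python
def even_digits(text):
    """Return the even digits found in text, as integers."""
    found = []
    for ch in text:
        # ch is a one-character string, so convert it before doing maths.
        if ch in "0123456789" and int(ch) % 2 == 0:
            found.append(int(ch))
    return found


# A string never equals a number, which is why the original test failed.
print("2" == 2)
print(even_digits("12345678"))
```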
Re: [Tutor] Simple counter to determine frequencies of words in a document
"Martin A. Brown" wrote * Somebody will be certain to point out a language or languages that provide some sort of facility to abstract the use of multiple processors without the explicit use of threads. ISTR Occam did that? Occam being the purpose designed language for the transputer, one of the first multiprocessor hardware architectures :-) Alan G. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Simple counter to determine frequencies of words in a document
--> snip >However, even with countWords2, which is supposed to overcome this >problem, it feels as if I've entered an infinite loop. >Josep M. Just my twopenneth, I'm a noob and I'm not going to try such a big file on my old machine, but: 1. Maybe create a *set* from the wordlist, loop through that, so you call "count" on wordlist only once. OR 2. Use collections.defaultdict(int) and loop through wordlist and do dic[word] += 1 Maybe, possibly. Regards Colin -- if not ErrorMessage: check_all() >>>check_all.__doc__ " noErrorMessage != correctAnswer" ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
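The two suggestions above can be sketched side by side. This is a small Python 3 illustration using a made-up sample sentence: the first option calls count() once per distinct word, so each call still scans the whole list, and the second makes a single pass with collections.defaultdict(int).

```python
from collections import defaultdict

wordlist = "in the garden on the bank behind the tree".split()

# Option 1: one count() call per distinct word; each call scans the list.
by_set = {}
for word in set(wordlist):
    by_set[word] = wordlist.count(word)

# Option 2: a single pass; a missing key starts at int(), which is 0.
by_defaultdict = defaultdict(int)
for word in wordlist:
    by_defaultdict[word] += 1

print(by_set == by_defaultdict)
```

Both produce the same table; they differ only in how many times the word list is traversed.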
Re: [Tutor] Simple counter to determine frequencies of words in a document
"col speed" wrote Just my twopenneth, I'm a noob and I'm not going to try such a big file on my old machine, but: 1. Maybe create a *set* from the wordlist, loop through that, so you call "count" on wordlist only once. OR This would be an improvement but still involves traversing the entire list N times where N is the number of unique words. 2. Use collections.defaultdict(int) and loop through wordlist and do dic[word] += 1 This is what his second version is supposed to do and is the best solution since it only involves a single trasverse of the file. Alan G. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] JOB AD PROJECT
Hi People,

I am afraid only Alan has said something to me. Is it that solution would never come or u guys are waiting to gimme the best? Please help.

Sent from my BlackBerry wireless device from MTN

-Original Message-
From: "Alan Gauld"
Sender: tutor-bounces+delegbede=dudupay@python.org
Date: Sat, 20 Nov 2010 09:00:52
To:
Subject: Re: [Tutor] JOB AD PROJECT

wrote

> I have done some extensive reading on python.
> i want to design a classifieds site for jobs.

Have you done any extensive programming yet? Reading alone will be of limited benefit, and jumping into a fairly complex web project before you have the basics mastered will be a painful experience. I'm assuming you have at least some experience of web programming in other languages for you to take such a bold step?

> I am taking this as my first project and would appreciate any help.
> I am looking in the direction of python, maybe django.

The choice of tools is fine, but it's a big task for a first ever Python project!

HTH,

-- Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/