On 5/6/2010 11:14 AM, Dave Angel wrote:
Art Kendall wrote:
I am running Windows 7 64-bit Home Premium with quad CPUs and 8 GB of
memory. I am using Python 2.6.2.
I have all the Federalist Papers concatenated into one .txt file.
Which is how big? Currently you (unnecessarily) load the entire thing
into memory with readlines(). Then you do confusing work to split it
apart again, into one list element per paper, and for a while there
you have three copies of the entire text. You're keeping two copies,
in the form of alltext and papers.
You print out len(papers). What do you see there? Is it correctly 87?
If it's not, you have to fix the problem here before even going on.
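A minimal sketch of the single-read alternative being described here
(it uses the same file path as in the script below; note that
re.split() leaves whatever precedes the first "FEDERALIST No." header
as the first list element, which affects the count):

import re

# read the whole file once: one string instead of a list of lines
alltext = open("C:/Users/Art/Desktop/fed/feder16v3.txt").read()

# split on the paper headers; papers[0] is anything before the first header
papers = re.split(r'FEDERALIST No\.', alltext)
print len(papers)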
I want to prepare a file with a row for each paper and a column for
each term. The cells would contain the count of a term in that paper.
In the original application in the 1950s, 30 single-word terms were
used. I can now use NoteTab to get a list of all 8708 separate words
in allWords.txt. I can then use that data in statistical exploration
of the set of texts.
I have the Python program(?) syntax(?) script(?) below that I am
using to learn Python. The comments starting with "later" are things
I will try to do to make this more useful. I am getting one step at
a time to work.
It works when the number of terms in the term list is small, e.g.,
10: I get a file with the correct number of rows (87) and count
columns (10) in termcounts.txt. The termcounts.txt file is not
correct when I have a larger number of terms, e.g., 100: I get a file
with only 40 rows and the correct number of columns. With 8700 terms
I also get only 40 rows, and I need to be able to have about 8700
terms. (If this were FORTRAN I would say that the subscript indices
were getting scrambled.) (As I develop this I would like to be
open-ended with the number of input papers and open-ended with the
number of words/terms.)
# word counts: Federalist papers
import re, textwrap
# read the combined file and split into individual papers
# later create a new version that deals with all files in a folder
# rather than having papers concatenated
alltext = file("C:/Users/Art/Desktop/fed/feder16v3.txt").readlines()
papers = re.split(r'FEDERALIST No\.', " ".join(alltext))
print len(papers)
countsfile = file("C:/Users/Art/desktop/fed/TermCounts.txt", "w")
syntaxfile = file("C:/Users/Art/desktop/fed/TermCounts.sps", "w")
# later create a python program that extracts all words instead of
# using NoteTab
termfile = open("C:/Users/Art/Desktop/fed/allWords.txt")
termlist = termfile.readlines()
termlist = [item.rstrip("\n") for item in termlist]
print len(termlist)
# check for SPSS reserved words
varnames = textwrap.wrap(" ".join(
    [v.lower() in ['and', 'or', 'not', 'eq', 'ge', 'gt', 'le',
                   'lt', 'ne', 'all', 'by', 'to', 'with']
     and (v + "_r") or v
     for v in termlist]))
syntaxfile.write("data list file="
                 "'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
syntaxfile.writelines([v + "\n" for v in varnames])
syntaxfile.write(".\n")
# before using the syntax, manually replace spaces internal to a
# string with underscore // replace(ltrim(rtrim(varname)), " ", "_")
# and replace any special characters with @ in variable names
for p in range(len(papers)):

range(len()) is un-pythonic. Simply do

    for paper in papers:

and of course use paper below instead of papers[p].

    counts = []
    for t in termlist:
        counts.append(len(re.findall(r"\b" + t + r"\b", papers[p],
                                     re.IGNORECASE)))
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", papers[p]).group(0)
        countsfile.write(str(papernum) + " " +
                         " ".join([str(s) for s in counts]) + "\n")
Art
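As an illustration of the inline note above about iterating directly
over papers, here is a sketch of the counting loop rewritten that way,
with the reserved-word rename spelled as a conditional expression. It
reuses termlist and countsfile from the script, and re.escape() is
added on the assumption that some entries in allWords.txt might
contain punctuation:

import re

SPSS_RESERVED = ['and', 'or', 'not', 'eq', 'ge', 'gt', 'le', 'lt',
                 'ne', 'all', 'by', 'to', 'with']

# clearer spelling of the reserved-word rename; textwrap.wrap would
# still be applied to " ".join(varnames) for the syntax file
varnames = [(v + "_r") if v.lower() in SPSS_RESERVED else v
            for v in termlist]

# iterate over the papers directly instead of indexing with
# range(len(papers))
for paper in papers:
    counts = [len(re.findall(r"\b" + re.escape(t) + r"\b", paper,
                             re.IGNORECASE))
              for t in termlist]
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", paper).group(0)
        countsfile.write(papernum + " " +
                         " ".join(str(c) for c in counts) + "\n")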
If you're memory limited, you really should sequence through the
files, only loading one at a time, rather than all at once. It's no
harder. Use os.listdir() (or glob.glob()) to make a list of files,
then your loop becomes something like:

    for infile in filelist:
        paper = " ".join(open(infile, "r").readlines())
Naturally, to do it right, you should use with... Or at least
close each file when done.
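A slightly fuller sketch of that per-file approach, assuming the
papers have been split into one .txt file each in a folder (the folder
and pattern here are only illustrative); glob.glob() builds the file
list and the with-block closes each file:

import glob

# process one paper per file, holding only one paper in memory at a time
for infile in sorted(glob.glob("C:/Users/Art/Desktop/fed/papers/*.txt")):
    with open(infile, "r") as f:
        paper = f.read()
    # ... count the terms in paper and write one output row here ...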
DaveA
Thank you for getting back to me. I am trying to generalize a process
that 50 years ago used 30 terms on the whole file, and I am using the
task of generalizing the process to learn Python. In the post I sent
there were comments to myself about things that I would want to learn
about. One of the first is to learn about processing all files in a
folder, so your reply will be very helpful. It seems that os.listdir()
should allow me to include the filespec in the output file, which
would be very useful.
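A small sketch of carrying the file name into the output, building on
the per-file loop sketched above (counts is assumed to have already
been computed for the current file, as in the script):

import os

# start each output row with the source file's base name so every row
# of TermCounts.txt records which paper (file) it came from
name = os.path.splitext(os.path.basename(infile))[0]
countsfile.write(name + " " + " ".join(str(c) for c in counts) + "\n")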
To rephrase my questions:

Is there a way to tell Python to use more RAM?

Does Python use the same array space over and over as it counts the
occurrences for each input document? Or does it keep every row of the
output someplace even after it has written it to the output? If it
does keep old arrays, is there a way to "close" the output array in
RAM between documents?
I narrowed down the problem. With 4035 terms it runs OK. With 4040
the end of the output matrix is messed up. I do not think it is a
limit of my resources that gets in the way. I have 352 GB of free
hard disk if it goes virtual, and 8 GB of RAM. Even if Python turns
out to be strictly 32-bit, I think it would be able to use 3 GB of
RAM. The input file is 1.1 MB, so it should be able to fit in RAM
many times over.
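One thing worth checking, following the earlier point about closing
files: countsfile and syntaxfile are never closed in the script, so
the last buffered block of output may never be flushed to disk, which
could show up as missing rows at the end of termcounts.txt. A sketch
of the with-based form (nested, since Python 2.6 does not accept two
context managers in a single with statement):

# with-blocks close (and therefore flush) the files when they end, so
# the final buffered rows reach the disk
with open("C:/Users/Art/desktop/fed/TermCounts.txt", "w") as countsfile:
    with open("C:/Users/Art/desktop/fed/TermCounts.sps", "w") as syntaxfile:
        # ... write the syntax, then run the counting loop here ...
        pass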
P.S. I hope I remembered correctly that this list puts replies at the bottom.
Art
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor