Re: Finding size of Variable

Ayushi Dalmia Tue, 04 Feb 2014 21:22:22 -0800

On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <[email protected]> wrote in message:
> > Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
> code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings, try (untested):
>
> a = sys.getsizeof(mylist)
> for item in mylist:
>     a += sys.getsizeof(item)
>
> This can be high if some of the strings are interned and get
> counted twice. But you're not likely to get closer without some
> knowledge of the data objects and where they come from.
>
> -- 
> DaveA


Hello Dave, 

I thought it would save others' time if I explained only a subset of my 
problem. Here is what I am trying to do:

I am trying to index the current Wikipedia dump without using databases and 
create a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.

My approach:

I am parsing the Wikipedia pages using a SAX parser. After reading 'X' pages, 
I dump the words along with their posting lists (a posting list is the list of 
doc ids in which the word is present) into a file, so I end up with several 
intermediate files. The same word may appear in more than one of these files, 
so I need to merge them and write the final index. Each final index file must 
be of limited size. This is where I am stuck: I need to know how to determine 
the size of the content held in a variable before I write it to a file.
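
To make the distinction concrete, here is a minimal sketch (not part of the original code) that compares two notions of "size" for the same posting data: the approximate in-memory footprint, summed with sys.getsizeof as suggested in the quoted reply, and the number of bytes the data would occupy once written out. The function names are hypothetical, and the line format "word id1 id2 ..." is an assumption based on how mergeFiles splits each line on spaces. The on-disk figure is the size before bz2 compression.

```python
import sys


def in_memory_size(postings):
    """Approximate bytes used in memory by a dict of word -> list of doc ids."""
    total = sys.getsizeof(postings)
    for word, doc_ids in postings.items():
        # Container sizes do not include the objects they refer to,
        # so each key, list, and list element is added separately.
        total += sys.getsizeof(word) + sys.getsizeof(doc_ids)
        total += sum(sys.getsizeof(d) for d in doc_ids)
    return total


def on_disk_size(postings):
    """Bytes needed to write one 'word id1 id2 ...' line per word in UTF-8."""
    total = 0
    for word, doc_ids in postings.items():
        line = ' '.join([word] + list(doc_ids)) + '\n'
        total += len(line.encode('utf-8'))
    return total


sample = {'apple': ['1', '2'], 'bat': ['3']}
memory_bytes = in_memory_size(sample)
disk_bytes = on_disk_size(sample)
```

The in-memory figure is usually much larger than the on-disk one because every Python object carries its own overhead, so which of the two to limit depends on whether the constraint is RAM or the size of the final index files.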

Here is the code for my merging:

import bz2
import heapq
import os
from collections import defaultdict

def mergeFiles(pathOfFolder, countFile):
    listOfWords={}
    indexFile={}
    topOfFile={}
    flag=[0]*countFile
    data=defaultdict(list)
    heap=[]
    countFinalFile=0
    for i in xrange(countFile):
        fileName = pathOfFolder+'\\index'+str(i)+'.txt.bz2'
        indexFile[i]= bz2.BZ2File(fileName, 'rb')
        flag[i]=1
        topOfFile[i]=indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])        
            
    while any(flag)==1:
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i]==1:
                if listOfWords[i][0]==temp:

                    # This is where I am stuck. I cannot wait until a
                    # MemoryError, as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data=defaultdict(list)
                        countFinalFile+=1

                    topOfFile[i]=indexFile[i].readline().strip()   
                    if topOfFile[i]=='':
                        flag[i]=0
                        indexFile[i].close()
                        os.remove(pathOfFolder+'\\index'+str(i)+'.txt.bz2')
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of intermediate files, and the writeFinalIndex function 
writes the merged data to a final index file.
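
One possible alternative to waiting for MemoryError, sketched here under assumptions rather than as a tested fix: keep a running count of the bytes the buffered data would occupy when written, and flush once that count passes a chosen limit. BoundedIndexBuffer, max_bytes and write_func are hypothetical names introduced for illustration. The byte count assumes one "word id1 id2 ..." line per word, encoded as UTF-8, before compression.

```python
from collections import defaultdict


class BoundedIndexBuffer(object):
    """Collects word -> doc ids and tracks the size of the pending output."""

    def __init__(self, max_bytes, write_func):
        self.max_bytes = max_bytes    # flush threshold (assumed to be user-chosen)
        self.write_func = write_func  # called as write_func(data, file_number)
        self.data = defaultdict(list)
        self.size = 0                 # bytes the buffered lines would occupy
        self.file_count = 0

    def add(self, word, doc_ids):
        if word not in self.data:
            # A new line: the word itself plus the trailing newline.
            self.size += len(word.encode('utf-8')) + 1
        for doc_id in doc_ids:
            # Each doc id is preceded by one space.
            self.size += 1 + len(doc_id.encode('utf-8'))
        self.data[word].extend(doc_ids)

    def is_full(self):
        return self.size >= self.max_bytes

    def flush(self):
        if self.data:
            self.write_func(self.data, self.file_count)
            self.file_count += 1
        self.data = defaultdict(list)
        self.size = 0


# Small demonstration with an in-memory "writer" instead of a real file.
flushed = []
buf = BoundedIndexBuffer(20, lambda data, n: flushed.append((n, dict(data))))
buf.add('apple', ['1', '2'])
buf.add('bat', ['3'])
buf.add('cat', ['10', '11'])
if buf.is_full():
    buf.flush()
```

In mergeFiles, buf.add(temp, listOfWords[i][1:]) would take the place of data[temp].extend(...), with write_func wrapping writeFinalIndex. Checking is_full() only after the inner for loop has finished for the current word keeps all postings of a word in the same final file, at the cost of slightly overshooting the limit.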
-- 
https://mail.python.org/mailman/listinfo/python-list
