On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <[email protected]> wrote in message:
> > Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
> code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings, try (untested):
>
>     a = sys.getsizeof(mylist)
>     for item in mylist:
>         a += sys.getsizeof(item)
>
> This can be high if some of the strings are interned and get
> counted twice. But you're not likely to get closer without some
> knowledge of the data objects and where they come from.
>
> --
> DaveA

Hello Dave,

I thought it would save everyone's time if I explained only a subset of my
problem. Here is the whole of what I am trying to do:

I am trying to index the current Wikipedia dump without using databases and
create a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.

My approach:

I parse the Wikipedia pages with a SAX parser and, after every 'X' pages, I
dump the words along with their posting lists (a posting list is the list of
doc ids in which the word is present) into a separate file. The same word may
appear in several of these files, so I need to merge them and write the final
index again. Each final index file must be of limited size. This is where I am
stuck: I need to know how to determine the size of the content held in a
variable before I write it to the file.
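
To make the requirement concrete, here is a minimal sketch of two ways the size could be estimated. It is an illustration only, not part of my merging code. The helper names format_entry, entry_size and deep_list_size are hypothetical, the line format (a word followed by space-separated doc ids) is assumed from the split(' ') calls in the merge code, and the sketch is written for Python 3. Since the limit applies to the file, counting the bytes a line would occupy when written may be more useful than measuring the in-memory objects; sys.getsizeof reports only the object itself, not the objects it refers to. Note that the byte count is the uncompressed size; the bz2 output would normally be smaller.

```python
import sys

def format_entry(word, doc_ids):
    """Render one index line: the word, then its doc ids, space-separated."""
    return word + ' ' + ' '.join(doc_ids) + '\n'

def entry_size(word, doc_ids):
    """Bytes this entry would occupy in the uncompressed output file."""
    return len(format_entry(word, doc_ids).encode('utf-8'))

def deep_list_size(strings):
    """In-memory estimate: the list object plus each distinct string once."""
    seen = set()
    total = sys.getsizeof(strings)
    for s in strings:
        if id(s) not in seen:  # the same object (e.g. an interned string) counts once
            seen.add(id(s))
            total += sys.getsizeof(s)
    return total
```

deep_list_size follows the suggestion quoted above but skips repeated objects, so interned strings are not counted twice. A running total of entry_size values could be compared against the file size limit before each write.
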
Here is the code for my merging:

import bz2
import heapq
import os
from collections import defaultdict

def mergeFiles(pathOfFolder, countFile):
    listOfWords={}
    indexFile={}
    topOfFile={}
    flag=[0]*countFile
    data=defaultdict(list)
    heap=[]
    countFinalFile=0
    # Open every partial index and push its first word onto the heap.
    for i in xrange(countFile):
        fileName = pathOfFolder+'\\index'+str(i)+'.txt.bz2'
        indexFile[i]= bz2.BZ2File(fileName, 'rb')
        flag[i]=1
        topOfFile[i]=indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])
    # Repeatedly take the smallest word and collect its postings from each file.
    while any(flag):
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i]==1:
                if listOfWords[i][0]==temp:
                    # This is where I am stuck. I cannot wait until memory
                    # error, as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data=defaultdict(list)
                        countFinalFile+=1
                    # Advance file i to its next line.
                    topOfFile[i]=indexFile[i].readline().strip()
                    if topOfFile[i]=='':
                        # File i is exhausted: close and delete it.
                        flag[i]=0
                        indexFile[i].close()
                        os.remove(pathOfFolder+'\\index'+str(i)+'.txt.bz2')
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of files, and the writeFinalIndex function writes into the file.
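
For comparison, here is a rough sketch of a merge that flushes on a byte budget instead of waiting for a MemoryError. It is one possible approach under stated assumptions, not a drop-in replacement for the code above: the names parse, merge_postings, write_bounded and flush are hypothetical, the inputs are assumed to be iterables of lines already sorted by word, and it needs Python 3.5 or later for the key argument of heapq.merge. Because equal words are adjacent in the merged stream, only one word's posting list and the current chunk are held in memory at a time.

```python
import heapq
from itertools import groupby

def parse(lines):
    """Turn 'word id id ...' lines into (word, [ids]) pairs, skipping blanks."""
    for raw in lines:
        parts = raw.split()
        if parts:
            yield parts[0], parts[1:]

def merge_postings(sources):
    """Yield (word, doc_ids) with the postings of equal words combined.

    Every source must already be sorted by word.
    """
    streams = [parse(src) for src in sources]
    merged = heapq.merge(*streams, key=lambda entry: entry[0])
    for word, group in groupby(merged, key=lambda entry: entry[0]):
        doc_ids = []
        for _, ids in group:
            doc_ids.extend(ids)
        yield word, doc_ids

def write_bounded(entries, max_bytes, flush):
    """Split entries into chunks of at most max_bytes (uncompressed).

    flush(chunk_number, lines) is called once per chunk. A single line
    larger than max_bytes still gets a chunk of its own.
    Returns the number of chunks produced.
    """
    lines, used, chunk_number = [], 0, 0
    for word, doc_ids in entries:
        line = word + ' ' + ' '.join(doc_ids) + '\n'
        size = len(line.encode('utf-8'))
        if lines and used + size > max_bytes:
            flush(chunk_number, lines)
            lines, used = [], 0
            chunk_number += 1
        lines.append(line)
        used += size
    if lines:
        flush(chunk_number, lines)
        chunk_number += 1
    return chunk_number
```

In this sketch flush is where the postprocessing and the actual write would go, for example opening a bz2 file named after chunk_number and writing the lines to it. The sources could be the open partial index files, provided they yield text lines.
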