A faster method to generate a nested list from a template?

2005-05-04 Thread DJTB
Hi all,

I'm new to Python. I'm trying to create a fast function to do the following:

>>> t = [['a1','a2'],['b1'],['c1'],['d1']]
>>> l = [1,2,3,4,5]
>>> create_nested_list(t,l)
[[1, 2], [3], [4], [5]]

t is some sort of template. This is what I have now:

def create_nested_list(template, l_orig):
    '''Uses a template to create a new nested list

    >>> t = [['a1','a2'],['b1'],['c1'],['d1']]
    >>> l = [1,2,3,4,5]
    >>> create_nested_list(t,l)
    [[1, 2], [3], [4], [5]]
    '''
    tl = map(len, template)
    # Input size check
    if sum(tl) != len(l_orig):
        raise ValueError("Wrong input size")

    l = list(l_orig)   # work on a copy so the caller's list is not emptied
    new_nested_list = []
    for x in tl:
        q = []
        i = 0
        while i < x:
            q.append(l.pop(0))
            i += 1
        new_nested_list.append(q)
    return new_nested_list

I'd like to know if it is possible to make this faster (using Python magic I
don't know of yet), because this function will be called a lot
('constantly').
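
One direction I've tried to think of is building the pieces with slices
instead of pop(0) (a rough sketch, not benchmarked -- maybe there's
something better still):

def create_nested_list_sliced(template, l_orig):
    '''Same result as create_nested_list, built from slices.'''
    lengths = map(len, template)
    if sum(lengths) != len(l_orig):
        raise ValueError("Wrong input size")
    new_nested_list = []
    pos = 0
    for n in lengths:
        # each piece is one contiguous slice of the input list
        new_nested_list.append(l_orig[pos:pos + n])
        pos += n
    return new_nested_list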

Thanks in advance,
Stan.


-- 
http://mail.python.org/mailman/listinfo/python-list


processing a Very Large file

2005-05-17 Thread DJTB
Hi,

I'm trying to manually parse a dataset stored in a file. The data should be
converted into Python objects.

Here is an example of a single line of a (small) dataset:

3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549 955187488
589884464 -1547848504 857311165 585616830 -749910209 194940864 -1102778558
-1282985276 -1220931512 792256075 -340699912 1496177106 1760327384
-1068195107 95705193 1286147818 -416474772 745439854 1932457456 -1266423822
-1150051085 1359928308 129778935 1235905400 532121853

The first integer specifies the length of a tuple object. In this case, the
tuple has three elements: (13, 17, 19).
The other values (-626177023 to 532121853) are elements of a Set.

I use the following code to process a file:


from time import time
from sets import Set
from string import split
file = 'pathtable_ht.dat'
result = []
start_time = time ()
f=open(file,'r')
for line in f:
    splitres = line.split()
    tuple_size = int(splitres[0])+1
    path_tuple = tuple(splitres[1:tuple_size])
    conflicts = Set(map(int,splitres[tuple_size:-1]))
    # do something with 'path_tuple' and 'conflicts'
    # ... do some processing ...
    result.append(( path_tuple, conflicts))

f.close()
print time() - start_time


The elements (integer objects) in these Sets are shared between the sets;
in fact, there are as many distinct elements as there are lines in the
file (e.g. 1000 lines -> 1000 distinct set elements). AFAIK, the elements
are stored only once and each Set contains a pointer to the actual object.
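
To check that assumption, I suppose one can compare the identity of two
ints parsed from the same text (a quick sketch, not run on the real data;
as far as I know CPython only caches small integers, so large parsed ints
usually end up as separate objects):

a = int('-626177023')
b = int('-626177023')
print a == b   # True: equal values
print a is b   # usually False: two distinct int objects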

This works fine with relatively small datasets, but it doesn't work at all
with large datasets (4500 lines, 45000 chars per line).

After a few seconds of loading, all main memory is consumed by the Python
process and the computer starts swapping. After a few more seconds, CPU
usage drops from 99% to 1% and all swap memory is consumed:

Mem:   386540k total,   380848k used,     4692k free,      796k buffers
Swap:  562232k total,   562232k used,        0k free,    27416k cached

At this point, my computer becomes unusable.

I'd like to know if I should buy some more memory (a few GB?) or if it is
possible to make my code more memory efficient.

Thanks in advance,
Stan.
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: processing a Very Large file

2005-05-18 Thread DJTB
Robert Brewer wrote:

> DJTB wrote:
>> I'm trying to manually parse a dataset stored in a file. The
>> data should be converted into Python objects.
>> 
> 
> The first question I would ask is: what are you doing with "result", and
> can the consumption of "result" be done iteratively?
> 
> 

The processed data in 'result' is input data for a real-time simulator. The
simulator chooses the data it needs from 'result'; this should be done
fast, so all data in 'result' should always be accessible in O(1) -- and as
fast as possible. That's why everything should be in RAM.

In other words: the data in 'result' is not being 'consumed' sequentially
and the data is not thrown away after 'consumption' by the simulator
process. Furthermore, after loading the file, 'result' is read-only.

By the way, 'result' is now a dict:


1. Read raw data from file
2. Process data (convert to Python Object, do some preprocessing)
3. Add mapping hash(path_tuple) --> Object to 'result' dictionary
4. Start simulation process (roughly as in the sketch below): while True:
   a. given a hash, retrieve the Object from the dictionary
   b. use the Object for further calculations
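
In code, the lookup pattern is roughly this (a sketch with made-up data,
not my actual code):

from sets import Set

result = {}

# steps 1-3: load and preprocess (here just one hard-coded example line)
path_tuple = ('13', '17', '19')
conflicts = Set([-626177023, -1688330994])
result[hash(path_tuple)] = (path_tuple, conflicts)

# step 4: the simulator repeatedly looks objects up by hash
given_hash = hash(('13', '17', '19'))
obj = result[given_hash]        # average O(1) dict lookup
# ... use obj for further calculations ...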



Jeffrey Maitland wrote:

>
> Well, one thing I would do is use the for loop on the file itself.
> The thing is that you are reading the file into a list, and the more you
> add to the list the more memory it will take (as you know). What this is
> doing is a line-by-line read instead of opening the entire file and then
> reading it. I'm not sure if it is due to the fact that you are creating
> sets in addition to the list. Something else that might help (not 100%
> sure if it will) is using del on the sets after they are put into the
> list. That way the memory is freed first, rather than still being held
> and then overwritten. Not sure if any of this will help exactly, but I
> hope it does.
>
> from time import time
> from sets import Set
> from string import split
> file_name = 'pathtable_ht.dat' #change the name to avoid confusion
> result = []
> start_time = time ()
> #Note there is no open(file_name, 'r') needed here.

What if the data file is gzipped?
(I'm now using gzip to keep the data file small; the added time is
negligible compared to the time needed for the actual processing.)
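
If it is, gzip.open() returns a file-like object, so I'd expect something
like this to drop in for file(file_name) -- just a sketch, not timed, and
the .gz name is only an example:

import gzip

file_name = 'pathtable_ht.dat.gz'   # example name
f = gzip.open(file_name)            # instead of file(file_name)
while True:
    line = f.readline()
    if not line:
        break
    # ... same per-line processing as in the loop below ...
f.close()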

> for line in file(file_name):
>     splitres = line.split()
>     tuple_size = int(splitres[0])+1
>     path_tuple = tuple(splitres[1:tuple_size])
>     conflicts = Set(map(int,splitres[tuple_size:-1]))
>     # do something with 'path_tuple' and 'conflicts'
>     # ... do some processing ...
>     result.append(( path_tuple, conflicts))
>     del(conflicts) # freeing the mem
>     del(path_tuple) # by deleting these 2 objects.
>

I'm not a Python memory specialist, but does del immediately release/free
the memory to the OS? I thought it was impossible to let Python immediately
release memory.


By the way, does anyone know if there's a profile-like module to keep track
of memory usage per object?

Thanks in advance,
Stan.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: processing a Very Large file

2005-05-18 Thread DJTB
Tim Peters wrote:

> 
>>tuple_size = int(splitres[0])+1
>>path_tuple = tuple(splitres[1:tuple_size])
>>conflicts = Set(map(int,splitres[tuple_size:-1]))
> 
> Do you really mean to throw away the last value on the line?  That is,
> why is the slice here [tuple_size:-1] rather than [tuple_size:]?
> 

Thanks, you saved me from another bug-hunting hell...
(In a previous test version, split returned a '\n' as the last item in the
list...)

> 
> You could manually do something akin to Python's "string interning" to
> store ints uniquely, like:
> 
> int_table = {}
> def uniqueint(i):
>     return int_table.setdefault(i, i)
> 
> Then, e.g.,
> 
> >>> uniqueint(100 * 100) is uniqueint(100 * 100)
> True
> >>> uniqueint(int("12345")) is uniqueint(int("12345"))
> True
> 
> Doing Set(map(uniqueint, etc)) would then feed truly shared int
> (and/or long) objects to the Set constructor.
> 

I've implemented this and it does seem to work, thanks.
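
For the record, the change to the parsing loop was roughly this (sketched
from memory, the real code differs a bit):

from sets import Set

int_table = {}
def uniqueint(i):
    return int_table.setdefault(i, i)

result = []
f = open('pathtable_ht.dat', 'r')
for line in f:
    splitres = line.split()
    tuple_size = int(splitres[0]) + 1
    path_tuple = tuple(splitres[1:tuple_size])
    # [tuple_size:] rather than [tuple_size:-1], as pointed out above
    conflicts = Set([uniqueint(int(s)) for s in splitres[tuple_size:]])
    result.append((path_tuple, conflicts))
f.close()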

Stan.
-- 
http://mail.python.org/mailman/listinfo/python-list


ZODB memory problems (was: processing a Very Large file)

2005-05-21 Thread DJTB
[posted to comp.lang.python, mailed to [EMAIL PROTECTED]]

Hi,

I'm having problems storing large amounts of objects in a ZODB.
After committing changes to the database, elements are not cleared from
memory. Since the number of objects I'd like to store in the ZODB is too
large to fit in RAM, my program gets killed with signal 11 or signal 9...

Below is a minimal working example (or actually: it doesn't work, because
of memory errors) with hopefully enough comments:



# This was suggested by Tim Peters in comp.lang.python thread
# 'processing a Very Large file'
# It is to make sure that no two or more copies of the same object
# reside in memory
class ObjectInterning:
    def __init__(self):
        self.object_table = {}

    def object_intern(self, o):
        return self.object_table.setdefault(o, o)


from sets import Set

# An ExtendedTuple is a tuple with some extra information
# (hence: 'Extended'). Furthermore, the elements of the tuple are
# unique.
# As you can see, ExtendedTuple does not inherit from Persistent.
# It will not be stored in the root of a database directly, it will
# be stored in a Persistent ExtendedTupleTable (see below).
class ExtendedTuple(tuple):

    def __init__(self, els):
        tuple.__init__(self, els)

        # This is a set containing other ExtendedTuple objects
        # which conflict with self,
        # e.g. if self = ExtendedTuple([1,2,3,4]) and
        # other = ExtendedTuple([3,4,5]) then self conflicts with
        # other, because they share one or more elements (in this
        # case: 3 and 4)
        # So, self.conflicts = Set([ExtendedTuple([3,4,5])])
        #     other.conflicts = Set([ExtendedTuple([1,2,3,4])])
        self.conflicts = Set()

    def __hash__(self):
        return hash(tuple(self))

    def __repr__(self):
        return 'ExtendedTuple(%s)' % str(list(self))


import ZODB
from persistent import Persistent
import random

# The Persistent ExtendedTupleTable generates and stores a large
# amount of ExtendedTuple objects. Since ExtendedTuple contains a
# Set with other ExtendedTuple objects, each ExtendedTuple object
# may get very large.
class ExtendedTupleTable(Persistent):
    def __init__(self):
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set()  # et(s): ExtendedTuple object(s)
        # This dictionary stores a mapping of elements to Sets of
        # ExtendedTuples.
        # e.g.: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #       self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {}  # el: element of an ExtendedTuple object

        # These dictionaries are here for performance optimizations.
        # They are used to prevent billions of hash()
        # calculations (relatively slow compared to dictionary
        # lookups)
        self._v_el2hs = {}  # h(s): hash(es) of ExtendedTuple object(s)
        self._v_h2et = {}
        self._v_et2h = {}

        # The keys of el2ets (and thus the elements of the
        # ExtendedTuple objects) are all in a prespecified range.
        # In this example: range(200):
        self.__el_count = 200
        # Number of ExtendedTuple objects in this ExtendedTupleTable
        self.__et_count = 5000

    # Start generation of ExtendedTuple objects and calculation of
    # conflicts for each ExtendedTuple object
    def calculate_all(self):
        self.calc_ets()
        self.calc_el2ets()
        self.calc_conflicts()

    def add(self, et_uninterned):
        et = self.interning.object_intern(et_uninterned)
        h = self.interning.object_intern(hash(et))
        self.ets.add(et)
        self._v_h2et[h] = et
        self._v_et2h[et] = h
        self._p_changed = True

    def calc_ets(self):
        # Calculate a large amount of ExtendedTuple objects.
        # In this example, the tuples are random, the elements of
        # the tuples are within a prespecified range.
        # The elements of each tuple are unique.
        print 'generating %s random ExtendedTuple objects' % self.__et_count
        for i in xrange(self.__et_count):
            # Create random tuple with unique elements
            l = []
            for el in xrange(self.__el_count / 3):
                l.append(random.randint(0, self.__el_count - 1))
            et = ExtendedTuple(tuple(Set(l)))
            self.add(et)
        self.__et_count = len(self.ets)

    def calc_el2ets(self):
        '''For each el, calculate which et uses that el'''
        for el in xrange(self.__el_count):
            print 'calculating all ExtendedTuple objects using', el
            self.el2ets[el] = Set([et for et in self.ets if el in et])
            self._v_el2hs[el] = Set([self._v_et2h[et] for et in
                                     self.el2ets[el]])
        self._p_changed = True

    def calc_conflicts(self):
        '''For each et, calculate the set of conflicting ets'''
        self.__et_count = len(self.ets)
        commit_interval = 100
        for i, e