writing pickle function

2009-01-23 Thread perfreem
hello,

i am using a nested defaultdict from collections and i would like to
write it as a pickle object to a file. when i try:

from collections import defaultdict
x = defaultdict(lambda: defaultdict(list))

and then try to write it to a pickle file, it fails with:

TypeError: can't pickle function objects

is there a way around this? it's simply a dictionary that i want to
write to a file... this works with no problems for ordinary dicts.

thank you.
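
the error comes from the lambda: pickle can serialize a defaultdict, but
not the anonymous function stored as its default_factory. a rough sketch
of two workarounds (python 2 with cPickle assumed; the helper and file
names are only illustrative):

import cPickle as pickle
from collections import defaultdict

# workaround 1: use a named, module-level factory instead of a lambda.
# pickle stores functions by reference, so a defaultdict built this way
# can be dumped directly.
def list_dict():
    return defaultdict(list)

x = defaultdict(list_dict)
x['outer']['inner'].append(1)

# workaround 2: convert the nested defaultdicts to plain dicts right
# before writing, so no function object is involved at all.
def to_plain(d):
    if isinstance(d, defaultdict):
        return dict((k, to_plain(v)) for k, v in d.iteritems())
    return d

pfile = open('nested.pkl', 'wb')    # 'nested.pkl' is just an example name
pickle.dump(to_plain(x), pfile, pickle.HIGHEST_PROTOCOL)
pfile.close()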


scatterhist and resizing figures

2009-01-26 Thread perfreem
i am using scatterhist to plot some data. i find that when i plot the
data and matlab shows it in the figure window, stretching the figure
window (with the mouse) to enlarge it actually changes the properties
of the figure. for example, making it bigger sometimes reveals more
tick marks - like the y limit of the y axis, which i have set but which
was not shown until i enlarged the window. also, more crucially,
enlarging can make bars that appeared at 0, or not at all, show up...
when i save the figure window as pdf, i get different pdfs depending on
which of these is shown.

here's an example:

x = rand(1, 100);
y = x + 5;
scatterhist(x, y);
set(gca, 'Box', 'off', 'LineWidth', 1);
set(gca, 'FontSize', 12);
set(gca, 'FontName', 'Helvetica');
set(gca, 'TickDir', 'out');

first question: how can i programmatically save the figure as pdf in a
way that shows maximal info? i don't want to lose tick marks on my axes
or bars in my histogram.

second: how can i plot with scatterhist but make the scatter plot
points filled? with ordinary scatter, i can simply do: scatter(x, y,
'filled') but the 'filled' argument doesn't appear to work for
scatterhist.

thank you.


how to optimize object creation/reading from file?

2009-01-28 Thread perfreem
hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

import time
from collections import defaultdict

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" %(self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)
    __repr__ = __str__

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2-t1)/60.0)

this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from file here but in my real code i
do. also, i do 'a' + str(k) but in my real code this is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach which is even slower.

in the above code is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary ones so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only 3 seconds
to read 15 million lines into memory, it doesn't make sense to me
that making them into simple objects while at it would take that much
more...
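
a quick way to see where the time actually goes is to time each candidate
cost in isolation. a rough sketch along those lines, reusing the myclass
definition above (the helper names and the 1,000,000-iteration count are
only illustrative):

import time
from collections import defaultdict

def bench(label, func, n=1000000):
    # run one candidate bottleneck in isolation and report wall-clock time
    t0 = time.time()
    func(n)
    print "%-30s %.2f s" % (label, time.time() - t0)

def create_instances(n):
    for k in xrange(n):
        myclass('a' + str(k), 'b', 'c', 'd')

def create_tuples(n):
    for k in xrange(n):
        ('a' + str(k), 'b', 'c', 'd')

def fill_dict_with_tuple_keys(n):
    table = defaultdict(int)
    for k in xrange(n):
        table[('a' + str(k), 'b', 'c', 'd')] = 1

bench("class instances only", create_instances)
bench("tuples only", create_tuples)
bench("tuple keys into defaultdict", fill_dict_with_tuple_keys)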


Re: how to optimize object creation/reading from file?

2009-01-28 Thread perfreem
On Jan 28, 10:06 am, Bruno Desthuilliers  wrote:
> [email protected] wrote:
>
>
>
> > hi,
>
> > i am doing a series of very simple string operations on lines i am
> > reading from a large file (~15 million lines). i store the result of
> > these operations in a simple instance of a class, and then put it
> > inside of a hash table. i found that this is unusually slow... for
> > example:
>
> > class myclass(object):
> >     __slots__ = ("a", "b", "c", "d")
> >     def __init__(self, a, b, c, d):
> >         self.a = a
> >         self.b = b
> >         self.c = c
> >         self.d = d
> >     def __str__(self):
> >         return "%s_%s_%s_%s" %(self.a, self.b, self.c, self.d)
> >     def __hash__(self):
> >         return hash((self.a, self.b, self.c, self.d))
> >     def __eq__(self, other):
> >         return (self.a == other.a and \
> >                 self.b == other.b and \
> >                 self.c == other.c and \
> >                 self.d == other.d)
> >     __repr__ = __str__
>
> If your class really looks like that, a tuple would be enough.
>
> > n = 15000000
> > table = defaultdict(int)
> > t1 = time.time()
> > for k in range(1, n):
>
> hint : use xrange instead.
>
> >     myobj = myclass('a' + str(k), 'b', 'c', 'd')
> >     table[myobj] = 1
>
> hint : if all you want is to ensure unicity, use a set instead.
>
> > t2 = time.time()
> > print "time: ", float((t2-t1)/60.0)
>
> hint : use timeit instead.
>
> > this takes a very long time to run: 11 minutes! for the sake of the
> > example i am not reading anything from file here but in my real code i
> > do. also, i do 'a' + str(k) but in my real code this is some simple
> > string operation on the line i read from the file. however, i found
> > that the above code shows the real bottleneck, since reading my file
> > into memory (using readlines()) takes only about 4 seconds. i then
> > have to iterate over these lines, but i still think that is more
> > efficient than the 'for line in file' approach which is even slower.
>
> iterating over the file, while indeed a bit slower on a per-line basis,
> avoids useless memory consumption which can lead to disk swapping - so for
> "huge" files, it might still be better wrt/ overall performance.
>
> > in the above code is there a way to optimize the creation of the class
> > instances? i am using defaultdicts instead of ordinary ones so i don't
> > know how else to optimize that part of the code. is there a way to
> > perhaps optimize the way the class is written? if it takes only 3 seconds
> > to read 15 million lines into memory, it doesn't make sense to me
> > that making them into simple objects while at it would take that much
> > more...
>
> Did you bench the creation of a 15.000.000 ints list ?-)
>
> But anyway, creating 15.000.000 instances (which is not a small number)
> of your class takes many seconds - 23.466073989868164 seconds on my
> (already heavily loaded) machine. Building the same number of tuples
> only takes about 2.5 seconds - that is, almost 10 times less. FWIW,
> tuples have all the useful characteristics of your above class (wrt/
> hashing and comparison).
>
> My 2 cents...

thanks for your insightful reply - changing to tuples made a big
difference!
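
for reference, a sketch of what the tuple-based version of the loop might
look like (a guess at the change, not the actual code):

import time
from collections import defaultdict

n = 15000000
table = defaultdict(int)

t1 = time.time()
for k in xrange(1, n):
    # a tuple hashes and compares on its contents just like the class did,
    # but it is built in C without a Python-level __init__ call
    key = ('a' + str(k), 'b', 'c', 'd')
    table[key] = 1
t2 = time.time()
print "time: ", (t2 - t1) / 60.0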


writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
hello all,

i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,

mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'},
                 {'d': 3, 'e': 4, 'f': 'world'}, ...],
          key2: [...]}

in total there are about 10 to 15 million lists if we concatenate
together all the values of every key in 'mydict'. mydict is a
structure that represents data in a very large file (about 800
megabytes).

what is the fastest way to pickle 'mydict' into a file? right now i am
experiencing a lot of difficulties with cPickle when using it like
this:

import cPickle as pickle
pfile = open(my_file, 'w')
pickle.dump(mydict, pfile)
pfile.close()

this creates extremely large files (~ 300 MB) though it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour if not more to
write this pickle object to file.

is there any way to speed this up? i don't mind the large file... after
all the text file with the data used to make the dictionary was larger
(~ 800 MB) than the file it eventually creates, which is 300 MB. but
i do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas? is there a different module
i could use that's more suitable for large dictionaries?
thank you very much.
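
two general cPickle tips that sometimes help here, not a guaranteed fix:
open the file in binary mode and pass an explicit protocol (the default
protocol 0 writes a slower, larger text format), and temporarily disable
the cyclic garbage collector while dumping millions of small containers.
a rough sketch:

import cPickle as pickle
import gc

# my_file and mydict as in the original post
pfile = open(my_file, 'wb')   # binary mode, since protocol >= 1 writes binary data
gc.disable()                  # skip gc passes over the millions of small dicts
try:
    pickle.dump(mydict, pfile, pickle.HIGHEST_PROTOCOL)
finally:
    gc.enable()
    pfile.close()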


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
On Jan 28, 11:32 am, [email protected] wrote:
> Hi,
>
> Change:
>
> pickle.dump(mydict, pfile)
>
> to:
>
> pickle.dump(mydict, pfile, -1 )
>
> I think you will see a big difference in performance and also a much
> smaller file on disk.
>
> BTW: What type of application are you developing that creates so many
> dictionaries? Sounds interesting.
>
> Malcolm

hi!

thank you for your reply. unfortunately i tried this, but it doesn't
change the speed. it's still writing the file extremely slowly. i'm
not sure why.

thank you.


Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
On Jan 28, 5:14 pm, John Machin  wrote:
> On Jan 29, 3:13 am, [email protected] wrote:
>
>
>
> > hello all,
>
> > i have a large dictionary which contains about 10 keys, each key has a
> > value which is a list containing about 1 to 5 million (small)
> > dictionaries. for example,
>
> > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > 'world'}, ...],
> >                 key2: [...]}
>
> > in total there are about 10 to 15 million lists if we concatenate
> > together all the values of every key in 'mydict'. mydict is a
> > structure that represents data in a very large file (about 800
> > megabytes).
>
> > what is the fastest way to pickle 'mydict' into a file? right now i am
> > experiencing a lot of difficulties with cPickle when using it like
> > this:
>
> > import cPickle as pickle
> > pfile = open(my_file, 'w')
> > pickle.dump(mydict, pfile)
> > pfile.close()
>
> > this creates extremely large files (~ 300 MB) though it does so
> > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
> > it gets slower and slower. it takes almost an hour if not more to
> > write this pickle object to file.
>
> > is there any way to speed this up? i don't mind the large file... after
> > all the text file with the data used to make the dictionary was larger
> > (~ 800 MB) than the file it eventually creates, which is 300 MB.  but
> > i do care about speed...
>
> > i have tried optimizing this by using this:
>
> > s = pickle.dumps(mydict, 2)
> > pfile.write(s)
>
> > but this takes just as long... any ideas? is there a different module
> > i could use that's more suitable for large dictionaries?
> > thank you very much.
>
> Pardon me if I'm asking the "bleedin' obvious", but have you checked
> how much virtual memory this is taking up compared to how much real
> memory you have? If the slowness is due to pagefile I/O, consider
> doing "about 10" separate pickles (one for each key in your top-level
> dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine
i am using has several GB of ram, and memory is not heavily taxed at
all. do you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle.
if that's the case, it wouldn't be any faster, right?
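
shelve does pickle each entry, so a single write-once dump won't be
faster than cPickle, but it stores every top-level key separately, which
lets you write and later read one key at a time instead of the whole
structure. a rough sketch (shelve keys must be strings; the file name is
just an example):

import shelve

# mydict as in the original post
db = shelve.open('mydict.shelf', protocol=2)
for key, records in mydict.iteritems():
    db[str(key)] = records      # each key/value pair is pickled on its own
db.close()

# later: pull back a single key instead of the whole 300 MB structure
db = shelve.open('mydict.shelf')
first_key = db.keys()[0]
records = db[first_key]
db.close()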


Re: writing large dictionaries to file using cPickle

2009-01-30 Thread perfreem
On Jan 28, 6:08 pm, Aaron Brady  wrote:
> On Jan 28, 4:43 pm, [email protected] wrote:
>
> > On Jan 28, 5:14 pm, John Machin  wrote:
>
> > > On Jan 29, 3:13 am, [email protected] wrote:
>
> > > > hello all,
>
> > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > value which is a list containing about 1 to 5 million (small)
> > > > dictionaries. for example,
>
> > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > > > 'world'}, ...],
> > > >                 key2: [...]}
>
> > > > in total there are about 10 to 15 million lists if we concatenate
> > > > together all the values of every key in 'mydict'. mydict is a
> > > > structure that represents data in a very large file (about 800
> > > > megabytes).
>
> snip
>
> > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > if that's the case, it wouldn't be any faster, right?
>
> Yes, but not all at once.  It's a clear winner if you need to update
> any of them later, but if it's just write-once, read-many, it's about
> the same.
>
> You said you have a million dictionaries.  Even if each took only one
> byte, you would still have a million bytes.  Do you expect a faster
> I/O time than the time it takes to write a million bytes?
>
> I want to agree with John's worry about RAM, unless you have several+
> GB, as you say.  You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in
later. in that case, cPickle and shelve would be identical, if i
understand correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is around
300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up, for example?

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer,
and it is still very slow. does anyone know how to test whether i/o is
the bottleneck, or whether it's something specific about pickle?

thanks.
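
one way to separate the two costs is to build the pickle bytes in memory
first and only then write them to disk, timing each step on its own. a
rough sketch (assumes the structure and its pickle both fit in memory):

import cPickle as pickle
import time

# mydict as in the original posts
t0 = time.time()
data = pickle.dumps(mydict, -1)          # serialization only, no disk involved
t1 = time.time()
out = open('mydict.pkl', 'wb')           # raw write of the same bytes
out.write(data)
out.close()
t2 = time.time()

print "pickling: %.1f s   writing %d MB: %.1f s" % (
    t1 - t0, len(data) / (1024 * 1024), t2 - t1)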