writing pickle function
hello, i am using a nested defaultdict from collections and i would like
to write it as a pickle object to a file. when i try:

from collections import defaultdict
x = defaultdict(lambda: defaultdict(list))

and then try to write it to a pickle file, it says:

TypeError: can't pickle function objects

is there a way around this? it's simply a dictionary that i want to
write to a file... this works with no problems for ordinary dicts.
thank you.
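for reference, a minimal sketch of one workaround, assuming the nesting
is only two levels deep: pickle can serialize a module-level function by
reference, so replacing the lambda with a named default factory avoids
the error; converting to plain dicts before dumping also works. the file
names and keys below are placeholders.

import pickle
from collections import defaultdict

def nested_list_dict():
    # named module-level factory: picklable by reference, unlike a lambda
    return defaultdict(list)

x = defaultdict(nested_list_dict)
x['outer']['inner'].append(1)   # placeholder keys, just to have some data

# the defaultdict now pickles, because its factory is a named function
pfile = open('nested.pkl', 'wb')
pickle.dump(x, pfile)
pfile.close()

# alternative: strip the defaultdict machinery entirely before dumping
plain = dict((k, dict(v)) for k, v in x.items())
pickle.dump(plain, open('plain.pkl', 'wb'))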
scatterhist and resizing figures
i am using scatterhist to plot some data. i find that when i plot the
data and matlab shows it in the figure window, stretching the figure
window (with the mouse) to enlarge it actually changes the properties of
the figure. for example, making it bigger sometimes reveals more tick
marks - like the y limit of the y axis, which i have set but which was
not shown until i enlarged the window. also, more crucially, enlarging
can make bars that appear at 0, or not at all, show up... when i save
the figure window as pdf, depending on which of these is shown, i get
different pdfs. here's an example:

x = rand(1, 100);
y = x + 5;
scatterhist(x, y);
set(gca, 'Box', 'off', 'LineWidth', 1);
set(gca, 'FontSize', 12);
set(gca, 'FontName', 'Helvetica');
set(gca, 'TickDir', 'out');

first question: how can i programmatically save the figure as pdf in a
way that shows maximal info? i don't want to lose tick marks on my axis
or bars in my histogram.

second: how can i plot with scatterhist but make the scatter plot points
filled? with ordinary scatter, i can simply do:

scatter(x, y, 'filled')

but the 'filled' argument doesn't appear to work for scatterhist. thank you.
how to optimize object creation/reading from file?
hi,
i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:
# imports needed for the example below
import time
from collections import defaultdict

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)
    __repr__ = __str__
n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2-t1)/60.0)
this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from file here, but in my real code i
do. also, i do 'a' + str(k), but in my real code this is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then have
to iterate over these lines, but i still think that is more efficient
than the 'for line in file' approach, which is even slower.

in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary ones, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to
me that making them into simple objects while at it would take that
much more...
Re: how to optimize object creation/reading from file?
On Jan 28, 10:06 am, Bruno Desthuilliers wrote:
> [email protected] wrote:
> > hi,
> >
> > i am doing a series of very simple string operations on lines i am
> > reading from a large file (~15 million lines). i store the result of
> > these operations in a simple instance of a class, and then put it
> > inside of a hash table. i found that this is unusually slow... for
> > example:
> >
> > class myclass(object):
> >     __slots__ = ("a", "b", "c", "d")
> > [class definition snipped]
>
> If your class really looks like that, a tuple would be enough.
>
> > n = 15000000
> > table = defaultdict(int)
> > t1 = time.time()
> > for k in range(1, n):
>
> hint : use xrange instead.
>
> >     myobj = myclass('a' + str(k), 'b', 'c', 'd')
> >     table[myobj] = 1
>
> hint : if all you want is to ensure unicity, use a set instead.
>
> > t2 = time.time()
> > print "time: ", float((t2-t1)/60.0)
>
> hint : use timeit instead.
>
> > this takes a very long time to run: 11 minutes! [snip] i found that
> > the above code shows the real bottleneck, since reading my file into
> > memory (using readlines()) takes only about 4 seconds. i then have to
> > iterate over these lines, but i still think that is more efficient
> > than the 'for line in file' approach, which is even slower.
>
> iterating over the file, while indeed a bit slower on a per-line basis,
> avoids useless memory consumption which can lead to disk swapping - so
> for "huge" files, it might still be better wrt/ overall performance.
>
> > in the above code, is there a way to optimize the creation of the
> > class instances? [snip] if it takes only a few seconds to read 15
> > million lines into memory, it doesn't make sense to me that making
> > them into simple objects while at it would take that much more...
>
> Did you bench the creation of a 15.000.000 ints list ?-)
>
> But anyway, creating 15.000.000 instances (which is not a small number)
> of your class takes many seconds - 23.466073989868164 seconds on my
> (already heavily loaded) machine. Building the same number of tuples
> only takes about 2.5 seconds - that is, almost 10 times less. FWIW,
> tuples have all the useful characteristics of your above class (wrt/
> hashing and comparison).
>
> My 2 cents...

thanks for your insightful reply - changing to tuples made a big change!
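for reference, a minimal sketch of the tuple-plus-set version suggested
above (the set stands in for the defaultdict, since the table is only
used to record which keys were seen; timings will vary by machine):

import time

n = 15000000
seen = set()
t1 = time.time()
for k in xrange(1, n):
    # a plain tuple hashes and compares field by field, like the
    # __slots__ class above, but is built in C and is far cheaper to create
    myobj = ('a' + str(k), 'b', 'c', 'd')
    seen.add(myobj)
t2 = time.time()
print "time: ", (t2 - t1) / 60.0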
writing large dictionaries to file using cPickle
hello all,
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
'world'}, ...],
          key2: [...]}
in total there are about 10 to 15 million of these small dictionaries
if we concatenate together all the values of every key in 'mydict'.
mydict is a structure that represents data in a very large file (about
800 megabytes).
what is the fastest way to pickle 'mydict' into a file? right now i am
experiencing a lot of difficulties with cPickle when using it like
this:
import cPickle as pickle
pfile = open(my_file, 'w')
pickle.dump(mydict, pfile)
pfile.close()
this creates an extremely large file (~300 MB), and it does so
*extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds and
it gets slower and slower. it takes almost an hour, if not more, to
write this pickle object to file.

is there any way to speed this up? i don't mind the large file... after
all, the text file with the data used to make the dictionary was larger
(~800 MB) than the file it eventually creates, which is 300 MB. but i
do care about speed...

i have tried optimizing this by using this:

s = pickle.dumps(mydict, 2)
pfile.write(s)

but this takes just as long... any ideas? is there a different module
i could use that's more suitable for large dictionaries?
thank you very much.
Re: writing large dictionaries to file using cPickle
On Jan 28, 11:32 am, [email protected] wrote:
> Hi,
>
> Change:
>
> pickle.dump(mydict, pfile)
>
> to:
>
> pickle.dump(mydict, pfile, -1)
>
> I think you will see a big difference in performance and also a much
> smaller file on disk.
>
> BTW: What type of application are you developing that creates so many
> dictionaries? Sounds interesting.
>
> Malcolm

hi! thank you for your reply. unfortunately i tried this but it doesn't
change the speed. it's still writing the file extremely slowly. i'm not
sure why? thank you.
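for what it's worth, a small sketch combining the suggestions in this
thread: open the file in binary mode and pass the highest pickle
protocol. the 'wb' mode and the file name are assumptions on top of the
original code (which used 'w'), and 'mydict' here is a tiny stand-in for
the real structure.

import cPickle as pickle

mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}], 'key2': []}  # stand-in
my_file = 'mydict.pkl'  # placeholder path

# binary mode plus the newest protocol (-1) writes a compact binary pickle;
# protocol 0 (the default) produces a much larger text format
pfile = open(my_file, 'wb')
pickle.dump(mydict, pfile, -1)
pfile.close()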
Re: writing large dictionaries to file using cPickle
On Jan 28, 5:14 pm, John Machin wrote:
> On Jan 29, 3:13 am, [email protected] wrote:
> > hello all,
> >
> > i have a large dictionary which contains about 10 keys, each key has
> > a value which is a list containing about 1 to 5 million (small)
> > dictionaries. [snip] mydict is a structure that represents data in a
> > very large file (about 800 megabytes).
> >
> > what is the fastest way to pickle 'mydict' into a file? right now i
> > am experiencing a lot of difficulties with cPickle when using it
> > like this:
> >
> > import cPickle as pickle
> > pfile = open(my_file, 'w')
> > pickle.dump(mydict, pfile)
> > pfile.close()
> >
> > this creates an extremely large file (~300 MB), and it does so
> > *extremely* slowly. it writes about 1 megabyte per 5 or 10 seconds
> > and it gets slower and slower. it takes almost an hour, if not more,
> > to write this pickle object to file.
> > [snip]
>
> Pardon me if I'm asking the "bleedin' obvious", but have you checked
> how much virtual memory this is taking up compared to how much real
> memory you have? If the slowness is due to pagefile I/O, consider
> doing "about 10" separate pickles (one for each key in your top-level
> dictionary).

the slowness is due to CPU when i profile my program using the unix
program 'top'... i think all the work is in the file I/O. the machine i
am using has several GB of ram, and ram is not heavily taxed at all. do
you know how file I/O can be sped up?

in reply to the other poster: i thought 'shelve' simply calls pickle. if
that's the case, it wouldn't be any faster, right?
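a rough sketch of the "one pickle per top-level key" idea from the reply
above, assuming the keys are strings that can be turned into file names;
'mydict' and the '%s.pkl' naming scheme are stand-ins for the example.

import cPickle as pickle

mydict = {'key1': [{'a': 1}], 'key2': [{'b': 2}]}  # stand-in for the real data

# write each top-level key's list to its own binary pickle file,
# so each individual dump (and each later load) stays smaller
for key, value in mydict.iteritems():
    pfile = open('%s.pkl' % key, 'wb')   # made-up naming scheme
    pickle.dump(value, pfile, -1)
    pfile.close()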
Re: writing large dictionaries to file using cPickle
On Jan 28, 6:08 pm, Aaron Brady wrote:
> On Jan 28, 4:43 pm, [email protected] wrote:
> > On Jan 28, 5:14 pm, John Machin wrote:
> > > On Jan 29, 3:13 am, [email protected] wrote:
> > > > hello all,
> > > >
> > > > i have a large dictionary which contains about 10 keys, each key
> > > > has a value which is a list containing about 1 to 5 million
> > > > (small) dictionaries. [snip]
> >
> > [snip]
> >
> > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > if that's the case, it wouldn't be any faster, right?
>
> Yes, but not all at once. It's a clear winner if you need to update
> any of them later, but if it's just write-once, read-many, it's about
> the same.
>
> You said you have a million dictionaries. Even if each took only one
> byte, you would still have a million bytes. Do you expect a faster I/O
> time than the time it takes to write a million bytes?
>
> I want to agree with John's worry about RAM, unless you have several+
> GB, as you say. You are not dealing with small numbers.

in my case, i just write the pickle file once and then read it in later.
in that case, cPickle and shelve would be identical, if i understand
correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is
around 300 MB. even if it were 800 MB, it doesn't make sense to me that
python's i/o would be that slow... it takes roughly 5 seconds to write
one megabyte of a binary file (the pickled object in this case), which
just seems wrong. does anyone know anything about this? about how i/o
can be sped up, for example?

the dictionary might have a million keys, but each key's value is very
small. i tried the same example where the keys are short strings (and
there are about 10-15 million of them) and each value is an integer, and
it is still very slow. does anyone know how to test whether i/o is the
bottleneck, or whether it's something specific about pickle? thanks.
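one way to answer the last question is to time the pickling and the file
write separately; a rough sketch, with 'mydict' as a tiny stand-in and
the file name made up:

import time
import cPickle as pickle

mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}], 'key2': []}  # stand-in

# serialize in memory first, so pickling time and disk time are measured apart
t1 = time.time()
s = pickle.dumps(mydict, -1)
t2 = time.time()

pfile = open('mydict.pkl', 'wb')
pfile.write(s)
pfile.close()
t3 = time.time()

print "pickling took:", t2 - t1, "seconds"
print "writing", len(s), "bytes took:", t3 - t2, "seconds"

if most of the time shows up in the dumps() call, the bottleneck is
pickling itself rather than file i/o.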
