On Thursday 02 August 2007, Joshua J. Kugler wrote:
> I am using shelve to store some data since it is probably the best solution
> to my "data formats, number of columns, etc can change at any time"
> problem. However, I seem to be dealing with bloat.
>
> My original data is 33MB. When each row is converted to python lists, and
> inserted into a shelve DB, it balloons to 69MB. Now, there is some
> additional data in there namely a list of all the keys containing data (vs.
> the keys that contain version/file/config information), BUT if I copy all
> the data over to a dict and dump the dict to a file using cPickle, that
> file is only 49MB. I'm using pickle protocol 2 in both cases.
>
> Is this expected? Is there really that much overhead to using shelve and
> dbm files? Are there any similar solutions that are more space efficient?
> I'd use straight pickle.dump, but loading requires pulling the entire thing
> into memory, and I don't want to have to do that every time.
>
> [Note, for those that might suggest a standard DB. Yes, I'd like to use a
> regular DB, but I have a domain where the number of data points in a sample
> may change at any time, so a timestamp-keyed dict is arguably the best
> solution, thus my use of shelve.]

Have you considered a directory full of pickle files (in effect, replacing
the dbm with the file system)? Something like this (untested):

import os
import cPickle

class DirShelf(dict):
    """Dict-like store that keeps each value in its own pickle file."""

    def __init__(self, dirname):
        self.dir = dirname
        self.__repl_dict = {}  # in-memory cache of loaded/assigned values

    def __contains__(self, key):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        return os.path.exists(os.path.join(self.dir, key))

    def has_key(self, key):
        return key in self

    def __getitem__(self, key):
        try:
            if key not in self.__repl_dict:
                f = file(os.path.join(self.dir, key), 'rb')
                try:
                    # load() detects the pickle protocol by itself and
                    # takes no protocol argument
                    self.__repl_dict[key] = cPickle.load(f)
                finally:
                    f.close()
            return self.__repl_dict[key]
        except IOError:
            raise KeyError(key)

    def __setitem__(self, key, val):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        self.__repl_dict[key] = val
        self.flush()

    def flush(self):
        # note: rewrites every cached entry, not only the changed one
        for k, v in self.__repl_dict.iteritems():
            f = file(os.path.join(self.dir, k), 'wb')
            try:
                cPickle.dump(v, f, protocol=2)
            finally:
                f.close()

    def __del__(self):
        self.flush()

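
For anyone reading this on Python 3, where cPickle, file() and iteritems() are gone, the same idea might look roughly like the sketch below. It is not a drop-in replacement for the class above: the _path helper and the _cache attribute are names made up for this illustration, invalid keys raise ValueError instead of tripping an assert, and each assignment writes only the changed entry rather than flushing the whole cache.

```python
import os
import pickle


class DirShelf:
    """One pickle file per key; values are cached after first access."""

    def __init__(self, dirname):
        self.dir = dirname
        self._cache = {}  # hypothetical name: in-memory cache of values
        os.makedirs(dirname, exist_ok=True)

    def _path(self, key):
        # Keys become file names, so restrict them to a portable subset.
        if not (isinstance(key, str) and key.isalnum()):
            raise ValueError("key must be an alphanumeric str: %r" % (key,))
        return os.path.join(self.dir, key)

    def __contains__(self, key):
        return key in self._cache or os.path.exists(self._path(key))

    def __getitem__(self, key):
        if key not in self._cache:
            try:
                with open(self._path(key), "rb") as f:
                    # pickle.load detects the protocol on its own.
                    self._cache[key] = pickle.load(f)
            except FileNotFoundError:
                raise KeyError(key) from None
        return self._cache[key]

    def __setitem__(self, key, val):
        # Write only the entry that changed, then update the cache.
        with open(self._path(key), "wb") as f:
            pickle.dump(val, f, protocol=2)
        self._cache[key] = val
```

With this version only the requested keys are read from disk, so a timestamp-keyed lookup never pulls the whole data set into memory. One caveat of the sketch: a value mutated in place after retrieval is not written back until it is assigned to its key again.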
--
Regards, Thomas Jollans
