Re: [Python-Dev] Pickler/Unpickler API clarification
Collin Winter wrote:
> [...] I've found a few examples of code using the memo attribute ([1], [2],
> [3]) [...]

As author of [2] (current version here [4]) I can tell you my reason.

cvs2svn has to store a vast number of small objects in a database, then read them in random order. I spent a lot of time optimizing this part of the code because it is crucial for the overall performance when converting large CVS repositories. The objects are not all of the same class and sometimes contain other objects, so it is convenient to use pickling instead of, say, marshaling.

It is easy to optimize the pickling of instances by giving them __getstate__() and __setstate__() methods. But the pickler still records the type of each object (essentially, the name of its class) in each record. The space for these strings constituted a large fraction of the database size.

So I "prime" the picklers/unpicklers by pickling then unpickling a "primer" that contains the classes that I know will appear, and storing the resulting memos once in the database. Then for each record I create a new pickler/unpickler and initialize its memo to the "primer"'s memo before using it to read the actual object. This removes a lot of redundancy across database records.

I only prime my picklers/unpicklers with the classes. But note that the same technique could be used for any repeated subcomponents. This would have the added advantage that all of the unpickled instances would share copies of the objects that appear in the primer, which could be a semantic advantage and a significant saving in RAM in addition to the space and processing time advantages described above. It might even be a useful feature for the "shelve" module.

> So my questions are these:
> 1) Should Pickler/Unpickler objects automatically clear their memos
> when dumping/loading?
> 2) Is memo an intentionally exposed, supported part of the
> Pickler/Unpickler API, despite the lack of documentation and tests?
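[Editorial note: a minimal, runnable sketch of the priming technique described above, using the (undocumented) memo attribute as exposed by CPython's pickle module. The class Record and the helper names are illustrative, not from cvs2svn.]

```python
import io
import pickle

class Record(object):
    """Stand-in for one of the many small classes stored in the database."""
    def __init__(self, value):
        self.value = value

# Pickle a "primer" mentioning the classes that will recur in every record.
primer_file = io.BytesIO()
primer_pickler = pickle.Pickler(primer_file, protocol=2)
primer_pickler.dump(Record)      # pickling the class by reference memoizes it
primer_bytes = primer_file.getvalue()

# A pickler's memo and an unpickler's memo are keyed differently, so the
# unpickler side is primed by loading the primer's bytes.
primer_unpickler = pickle.Unpickler(io.BytesIO(primer_bytes))
primer_unpickler.load()

def dump_record(obj):
    """Pickle one record with a fresh pickler primed from the primer's memo."""
    f = io.BytesIO()
    p = pickle.Pickler(f, protocol=2)
    p.memo = primer_pickler.memo   # class reference becomes a short memo GET
    p.dump(obj)
    return f.getvalue()

def load_record(data):
    """Unpickle one record with a fresh unpickler primed the same way."""
    u = pickle.Unpickler(io.BytesIO(data))
    u.memo = primer_unpickler.memo
    return u.load()

primed = dump_record(Record(42))
assert load_record(primed).value == 42

# Without priming, every record repeats the class's module and name:
unprimed = pickle.dumps(Record(42), protocol=2)
assert len(primed) < len(unprimed)
```

Each record is pickled against a copy of the primed memo, so records stay independently loadable while the class names live only in the primer.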
For my application, either of the following alternatives would also suffice:

- The ability to pickle picklers and unpicklers themselves (including their memos). This is, of course, awkward because they are hard-wired to files.

- Picklers and unpicklers could have get_memo() and set_memo() methods that return an opaque (but pickleable) memo object.

In other words, I don't need to muck around inside the memo object; I just need to be able to save and restore it. Please note that the memo for a pickler is not equal to the memo of the corresponding unpickler.

A similar effect could *almost* be obtained without accessing the memos by saving the pickled primer itself in the database. The unpickler would be primed by using it to load the primer before loading the record of interest. But AFAIK there is no way to prime new picklers, because there is no guarantee that pickling the same primer twice will result in the same id->object mapping in the pickler's memo.

Michael

> [2] http://google.com/codesearch/p?hl=en#M-DDI-lCOgE/lib/python2.4/site-packages/cvs2svn_lib/primed_pickle.py&q=lang:py%20%5C.memo

[4] http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/cvs2svn_lib/serializer.py?view=markup

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Pickler/Unpickler API clarification
Antoine Pitrou wrote:
> Michael Haggerty alum.mit.edu> writes:
>> It is easy to optimize the pickling of instances by giving them
>> __getstate__() and __setstate__() methods. But the pickler still
>> records the type of each object (essentially, the name of its class) in
>> each record. The space for these strings constituted a large fraction
>> of the database size.
>
> If these strings are not interned, then perhaps they should be.
> There is a similar optimization proposal (w/ patch) for attribute names:
> http://bugs.python.org/issue5084

If I understand correctly, this would not help:

- on writing, the strings are identical anyway, because they are read out of the class's __name__ and __module__ fields. Therefore the Pickler's usual memoizing behavior will prevent the strings from being written more than once.

- on reading, the strings are only used to look up the class. Therefore they are garbage collected almost immediately.

This is a different situation than that of attribute names, which are stored persistently as the keys in the instance's __dict__.

Michael
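[Editorial note: a small CPython-specific check of the "on writing" claim above. A class's __name__ and __module__ come back as the same string objects on every access, so a single pickler's memoization already shares them; interning would add nothing. Example is illustrative.]

```python
# Accessing __name__/__module__ repeatedly yields the same string objects
# in CPython, so identity-based memoization already deduplicates them.
class Example(object):
    pass

assert Example.__name__ is Example.__name__
assert Example.__module__ is Example.__module__
```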
Re: [Python-Dev] Pickler/Unpickler API clarification
Antoine Pitrou wrote:
> Le vendredi 06 mars 2009 à 13:44 +0100, Michael Haggerty a écrit :
>> Antoine Pitrou wrote:
>>> Michael Haggerty alum.mit.edu> writes:
>>>> It is easy to optimize the pickling of instances by giving them
>>>> __getstate__() and __setstate__() methods. But the pickler still
>>>> records the type of each object (essentially, the name of its class) in
>>>> each record. The space for these strings constituted a large fraction
>>>> of the database size.
>>> If these strings are not interned, then perhaps they should be.
>>> There is a similar optimization proposal (w/ patch) for attribute names:
>>> http://bugs.python.org/issue5084
>> If I understand correctly, this would not help:
>>
>> - on writing, the strings are identical anyway, because they are read
>> out of the class's __name__ and __module__ fields. Therefore the
>> Pickler's usual memoizing behavior will prevent the strings from being
>> written more than once.
>
> Then why did you say that "the space for these strings constituted a
> large fraction of the database size", if they are already shared? Are
> your objects so tiny that even the space taken by the pointer to the
> type name grows the size of the database significantly?

Sorry for the confusion. I thought you were suggesting the change to help the more typical use case, in which a single Pickler is used for a lot of data. That use case will not be helped by interning the class __name__ and __module__ strings, for the reasons given in my previous email.

In my case, the strings are shared via the Pickler memoizing mechanism because I pre-populate the memo (using the API that the OP proposes to remove), so your suggestion won't help my current code, either. It was before I implemented the pre-populated memoizer that "the space for these strings constituted a large fraction of the database size". But your suggestion wouldn't help that case, either.

Here are the main use cases:

1. Saving and loading one large record. A class's __name__ string is the same string object every time it is retrieved, so it only needs to be stored once, and the Pickler memo mechanism works. Similarly for the class's __module__ string.

2. Saving and loading lots of records sequentially. Provided a single Pickler is used for all records and its memo is never cleared, this works just as well as case 1.

3. Saving and loading lots of records in random order, as for example in the shelve module. It is not possible to reuse a Pickler with a retained memo, because the Unpickler might not encounter objects in the right order. There are two subcases:

   a. Use a clean Pickler/Unpickler object for each record. In this case the __name__ and __module__ of a class will appear once in each record in which the class appears. (This is the case regardless of whether they are interned.) On reading, the __name__ and __module__ are only used to look up the class, so interning them won't help. It is thus impossible to avoid wasting a lot of space in the database.

   b. Use a Pickler/Unpickler with a preset memo for each record (my unorthodox technique). In this case the class __name__ and __module__ will be memoized in the shared memo, so in other records only their ID needs to be stored (in fact, only the ID of the class object itself). This allows the database to be smaller, but does not have any effect on the RAM usage of the loaded objects.

If the OP's proposal is accepted, 3b will become impossible. The technique seems not to be well known, so maybe it doesn't need to be supported. It would mean some extra work for me on the cvs2svn project, though :-(

Michael
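[Editorial note: case 2 above can be demonstrated directly. This sketch reuses one Pickler for sequential records and shows that later records shrink because the class is only named once; Item is an illustrative class.]

```python
import io
import pickle

class Item(object):
    def __init__(self, x):
        self.x = x

stream = io.BytesIO()
pickler = pickle.Pickler(stream, protocol=2)

pickler.dump(Item(1))
first_len = stream.tell()
pickler.dump(Item(2))
second_len = stream.tell() - first_len

# The second record is smaller: the class reference is a memo back-reference.
assert second_len < first_len

# A single Unpickler reading the records in order resolves those references,
# because its memo is likewise retained across load() calls.
unpickler = pickle.Unpickler(io.BytesIO(stream.getvalue()))
a, b = unpickler.load(), unpickler.load()
assert (a.x, b.x) == (1, 2)
```

This is exactly why case 3 is hard: a random-access reader cannot guarantee it sees the record that introduced a memo entry before the records that reference it.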
Re: [Python-Dev] Pickler/Unpickler API clarification
Greg Ewing wrote:
> Michael Haggerty wrote:
>> A similar effect could *almost* be obtained without accessing the memos
>> by saving the pickled primer itself in the database. The unpickler
>> would be primed by using it to load the primer before loading the record
>> of interest. But AFAIK there is no way to prime new picklers, because
>> there is no guarantee that pickling the same primer twice will result in
>> the same id->object mapping in the pickler's memo.
>
> Would it help if, when creating a pickler or unpickler,
> you could specify another pickler or unpickler whose
> memo is used to initialise the memo of the new one?
>
> Then you could keep the pickler that you used to pickle
> the primer and "fork" new picklers off it, and similarly
> with the unpicklers.

Typically, the purpose of a database is to persist data across program runs. So typically, your suggestion would only help if there were a way to persist the primed Pickler across runs. (The primed Unpickler is not quite so important, because it can be primed by reading a pickle of the primer, which in turn can be stored somewhere in the DB.)

In the particular case of cvs2svn, each of our databases is in fact written in a single pass, and then in later passes only read, not written. So I suppose we could do entirely without pickleable Picklers, if they were copyable within a single program run. But that constraint would make the feature even less general.

Michael
Re: [Python-Dev] Pickler/Unpickler API clarification
Guido van Rossum wrote:
> On Sat, Mar 7, 2009 at 8:04 AM, Michael Haggerty wrote:
>> Typically, the purpose of a database is to persist data across program
>> runs. So typically, your suggestion would only help if there were a way
>> to persist the primed Pickler across runs.
>
> I haven't followed all this, but isn't it at least possible to
> conceive of the primed pickler as being recreated from scratch from
> constant data each run?

If there were a guarantee that pickling the same data would result in the same memo ID -> object mapping, that would also work. But that doesn't seem to be a realistic guarantee to make.

AFAIK the memo IDs are integers chosen consecutively in the order that objects are pickled, which doesn't seem so bad. But compound objects are a problem. For example, when pickling a map, the map entries would have to be pickled in an order that remains consistent across runs (and even across Python versions). Even worse, all user-written __getstate__() methods would have to return exactly the same result, even across program runs.

>> (The primed Unpickler is not quite so important because it can be primed
>> by reading a pickle of the primer, which in turn can be stored somewhere
>> in the DB.)
>>
>> In the particular case of cvs2svn, each of our databases is in fact
>> written in a single pass, and then in later passes only read, not
>> written. So I suppose we could do entirely without pickleable Picklers,
>> if they were copyable within a single program run. But that constraint
>> would make the feature even less general.
>
> Being copyable is mostly equivalent to being picklable, but it's
> probably somewhat weaker because it's easier to define it as a pointer
> copy for some types that aren't easily picklable.

Indeed. And pickling the memo should not present any fundamental problems, since by construction it can only contain pickleable objects.
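[Editorial note: the order-dependence of memo IDs can be shown concretely. In this sketch the same record pickles to different bytes depending on the order in which the primer's contents were memoized, which is why recreating a primed pickler "from scratch" is fragile; the objects are illustrative.]

```python
import io
import pickle

x = ("alpha",)
y = ("beta",)

def record_bytes(primer):
    f = io.BytesIO()
    p = pickle.Pickler(f, protocol=2)
    p.dump(primer)          # prime the memo by pickling the primer
    start = f.tell()
    p.dump(x)               # the record of interest: just a memo reference
    return f.getvalue()[start:]

# Same record, different primer ordering -> different memo IDs -> different
# bytes, so records written against one primer cannot be read against the
# other.
assert record_bytes((x, y)) != record_bytes((y, x))
```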
Michael
Re: [Python-Dev] suggestion for try/except program flow
Mark Donald wrote:
> I frequently have this situation:
>
> try:
>     try:
>         raise Thing
>     except Thing, e:
>         # handle Thing exceptions
>         raise
> except:
>     # handle all exceptions, including Thing

This seems like an unusual pattern. Are you sure you can't use

    try:
        raise Thing
    except Thing, e:
        # handle Thing exceptions
        raise
    finally:
        # handle *all situations*, including Thing

Obviously, the finally: block is also invoked in the case that no exceptions are triggered, but often this is what you want anyway...

Michael
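[Editorial note: a runnable comparison of the two control flows, in modern Python 3 syntax; Thing and the log list are illustrative. It also shows the semantic difference: finally runs whether or not an exception occurred, and it does not swallow the re-raised exception the way a bare except clause can.]

```python
class Thing(Exception):
    pass

log = []

def nested():
    # Mark's original pattern: specific handler, re-raise, generic handler.
    try:
        try:
            raise Thing("boom")
        except Thing:
            log.append("specific handler")
            raise
    except Exception:
        log.append("generic handler")   # swallows the exception

def with_finally():
    # The suggested alternative: finally instead of an outer bare except.
    try:
        raise Thing("boom")
    except Thing:
        log.append("specific handler")
        raise
    finally:
        log.append("cleanup (also runs when nothing is raised)")

nested()
assert log == ["specific handler", "generic handler"]

log.clear()
try:
    with_finally()
except Thing:
    pass  # unlike the bare except, finally lets the exception propagate
assert log == ["specific handler", "cleanup (also runs when nothing is raised)"]
```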
Re: [Python-Dev] PEP 389: argparse - new command line parsing module
Steven Bethard wrote:
> On Tue, Sep 29, 2009 at 1:31 PM, Paul Moore wrote:
>> 2009/9/28 Yuvgoog Greenle :
>>> 1. There is no chance of the script killing itself. In argparse and optparse
>>> exit() is called on every parsing error (btw because of this it sucks to
>>> debug parse_args in an interpreter).
> [...]
>
> This is behavior that argparse inherits from optparse, but I believe
> it's still what 99.9% of users want. If you're writing a command line
> interface, you don't want a stack trace when there's an error message
> (which is what you'd get if argparse just raised exceptions), you want
> an exit with an error code. That's what command line applications are
> supposed to do.
>
> If you're not using argparse to write command line applications, then
> I don't feel bad if you have to do a tiny bit of extra work to take
> care of that use case. In this particular situation, all you have to
> do is subclass ArgumentParser and override exit() to do whatever you
> think it should do.
>
>>> 2. There is no chance the parser will print things I don't want it to print.
> [...]
>
> There is only a single method in argparse that prints things,
> _print_message(). So if you want it to do something else, you can
> simply override it in a subclass. I can make that method public if
> this is a common use case.

Instead of forcing the user to override the ArgumentParser class to change how errors are handled, I suggest adding a separate method ArgumentParser.parse_args_with_exceptions() that raises exceptions instead of writing to stdout/stderr and never calls sys.exit(). Then implement ArgumentParser.parse_args() as a wrapper around parse_args_with_exceptions():

    class ArgparseError(Exception):
        """argparse-specific exception type."""
        pass

    class ArgumentError(ArgparseError):
        # ...

    class ArgumentParser(...):
        # ...

        def parse_args_with_exceptions(...):
            # like the old parse_args(), except raises exceptions instead of
            # writing to stdout/stderr or calling sys.exit()...

        def parse_args(self, *args, **kwargs):
            try:
                self.parse_args_with_exceptions(*args, **kwargs)
            except ArgparseError as e:
                self.print_usage(_sys.stderr)
                self.exit(status=2,
                          message=(_('%s: error: %s\n') % (self.prog, e,)))
            # perhaps catch other exceptions that need special handling...

        def error(self, message):
            raise ArgparseError(message)

The exception classes should hold enough information to be useful to non-command-line users, and obviously contain error messages that are output to stderr by default. This would allow non-command-line users to call parse_args_with_exceptions() and handle the exceptions however they like.

Michael
[Python-Dev] The interpreter accepts f(**{'5':'foo'}); is this intentional?
I can't find documentation about whether there are constraints imposed on the keys in the map passed to a function via **, as in f(**d). According to http://docs.python.org/reference/expressions.html#id9 , d must be a mapping. test_extcall.py implies that the keys of this map must be strings in the following test:

    >>> f(**{1:2})
    Traceback (most recent call last):
      ...
    TypeError: f() keywords must be strings

But must the keys be valid Python identifiers? In particular, the following is allowed by the Python 2.5.2 and the Jython 2.2.1 interpreters:

    >>> f(**{'1':2})
    {'1': 2}

Is this behavior required somewhere by the Python language spec, or is it an error that just doesn't happen to be checked, or is it intentionally undefined whether this is allowed?

Michael
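[Editorial note: the behavior in question is easy to check; this is what CPython 3.x does. Non-identifier string keys are accepted and are reachable only through a **kwargs dict, while non-string keys raise TypeError.]

```python
def f(**kwargs):
    return kwargs

# A non-identifier string key is accepted and lands in kwargs.
assert f(**{'1': 2}) == {'1': 2}

# A non-string key is rejected at the call site.
try:
    f(**{1: 2})
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```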