Re: [Python-Dev] Pickler/Unpickler API clarification

2009-03-06 Thread Michael Haggerty
Collin Winter wrote:
> [...] I've found a few examples of code using the memo attribute ([1], [2],
> [3]) [...]

As author of [2] (current version here [4]) I can tell you my reason.
cvs2svn has to store a vast number of small objects in a database, then
read them in random order.  I spent a lot of time optimizing this part
of the code because it is crucial for the overall performance when
converting large CVS repositories.  The objects are not all of the same
class and sometimes contain other objects, so it is convenient to use
pickling instead of, say, marshaling.

It is easy to optimize the pickling of instances by giving them
__getstate__() and __setstate__() methods.  But the pickler still
records the type of each object (essentially, the name of its class) in
each record.  The space for these strings constituted a large fraction
of the database size.

So I "prime" the picklers/unpicklers by pickling then unpickling a
"primer" that contains the classes that I know will appear, and storing
the resulting memos once in the database.  Then for each record I create
a new pickler/unpickler and initialize its memo to the primer's memo
before using it to write or read the actual object.  This removes a lot of
redundancy across database records.
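
In outline, the technique looks like this (a minimal sketch; FooRecord
and BarRecord stand in for the real classes, and [4] has the actual
code):

import cPickle
from cStringIO import StringIO

PRIMER = (FooRecord, BarRecord)  # classes known to recur in records

# Pickle then unpickle the primer once, keeping both memos.
f = StringIO()
primer_pickler = cPickle.Pickler(f, -1)
primer_pickler.dump(PRIMER)
pickler_memo = primer_pickler.memo

primer_unpickler = cPickle.Unpickler(StringIO(f.getvalue()))
primer_unpickler.load()
unpickler_memo = primer_unpickler.memo

def dumps(obj):
    # A fresh pickler per record, primed with a copy of the memo.
    g = StringIO()
    pickler = cPickle.Pickler(g, -1)
    pickler.memo = pickler_memo.copy()
    pickler.dump(obj)
    return g.getvalue()

def loads(s):
    # A fresh unpickler per record, primed the same way.
    unpickler = cPickle.Unpickler(StringIO(s))
    unpickler.memo = unpickler_memo.copy()
    return unpickler.load()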

I only prime my picklers/unpicklers with the classes.  But note that the
same technique could be used for any repeated subcomponents.  This would
have the added advantage that all of the unpickled instances would share
copies of the objects that appear in the primer, which could be a
semantic advantage and a significant savings in RAM in addition to the
space and processing time advantages described above.  It might even be
a useful feature for the "shelve" module.

> So my questions are these:
> 1) Should Pickler/Unpickler objects automatically clear their memos
> when dumping/loading?
> 2) Is memo an intentionally exposed, supported part of the
> Pickler/Unpickler API, despite the lack of documentation and tests?

For my application, either of the following alternatives would also suffice:

- The ability to pickle picklers and unpicklers themselves (including
their memos).  This is, of course, awkward because they are hard-wired
to files.

- Picklers and unpicklers could have get_memo() and set_memo() methods
that return an opaque (but pickleable) memo object.  In other words, I
don't need to muck around inside the memo object; I just need to be able
to save and restore it.
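
Hypothetically, the second alternative might look like this (get_memo()
and set_memo() do not exist today; this is only the shape of the API I
am asking for):

import cPickle
from cStringIO import StringIO

# Saving: the memo is an opaque but pickleable token.
f = StringIO()
primer_pickler = cPickle.Pickler(f, -1)
primer_pickler.dump(PRIMER)                            # as above
memo_token = cPickle.dumps(primer_pickler.get_memo())  # store in the DB

# Restoring, possibly in a later program run:
record_pickler = cPickle.Pickler(StringIO(), -1)
record_pickler.set_memo(cPickle.loads(memo_token))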

Please note that the memo of a pickler is not equal to the memo of the
corresponding unpickler (roughly speaking, the pickler's memo maps
objects to memo indices, while the unpickler's memo maps memo indices
back to objects), so both memos have to be saved.

A similar effect could *almost* be obtained without accessing the memos
by saving the pickled primer itself in the database.  The unpickler
would be primed by using it to load the primer before loading the record
of interest.  But AFAIK there is no way to prime new picklers, because
there is no guarantee that pickling the same primer twice will result in
the same id->object mapping in the pickler's memo.
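
The unpickler half of that workaround would look something like this
(sketch; primer_pickle is the pickled primer as stored in the DB):

import cPickle
from cStringIO import StringIO

def load_record(primer_pickle, record_pickle):
    # Loading the primer first executes its PUT opcodes and fills the
    # unpickler's memo; the record's GET opcodes then resolve against
    # that primed memo.
    unpickler = cPickle.Unpickler(StringIO(primer_pickle + record_pickle))
    unpickler.load()         # prime the memo
    return unpickler.load()  # load the record of interest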

Michael

> [2] http://google.com/codesearch/p?hl=en#M-DDI-lCOgE/lib/python2.4/site-packages/cvs2svn_lib/primed_pickle.py&q=lang:py%20%5C.memo

[4] http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/cvs2svn_lib/serializer.py?view=markup


Re: [Python-Dev] Pickler/Unpickler API clarification

2009-03-06 Thread Michael Haggerty
Antoine Pitrou wrote:
> Michael Haggerty <...@alum.mit.edu> writes:
>> It is easy to optimize the pickling of instances by giving them
>> __getstate__() and __setstate__() methods.  But the pickler still
>> records the type of each object (essentially, the name of its class) in
>> each record.  The space for these strings constituted a large fraction
>> of the database size.
> 
> If these strings are not interned, then perhaps they should be.
> There is a similar optimization proposal (w/ patch) for attribute names:
> http://bugs.python.org/issue5084

If I understand correctly, this would not help:

- on writing, the strings are identical anyway, because they are read
out of the class's __name__ and __module__ fields.  Therefore the
Pickler's usual memoizing behavior will prevent the strings from being
written more than once.

- on reading, the strings are only used to look up the class.  Therefore
they are garbage collected almost immediately.

This is a different situation from that of attribute names, which are
stored persistently as the keys in the instance's __dict__.
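
A quick way to watch the memoization at work within a single Pickler
(sketch):

import cPickle
from cStringIO import StringIO

class C(object):
    pass

f = StringIO()
pickler = cPickle.Pickler(f, -1)
pickler.dump(C())
first = f.tell()           # spells out C's __module__ and __name__
pickler.dump(C())
second = f.tell() - first  # refers back to the memoized class
print first, second        # the second record is much smaller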

Michael


Re: [Python-Dev] Pickler/Unpickler API clarification

2009-03-06 Thread Michael Haggerty
Antoine Pitrou wrote:
> On Friday, 06 March 2009 at 13:44 +0100, Michael Haggerty wrote:
>> Antoine Pitrou wrote:
>>> Michael Haggerty <...@alum.mit.edu> writes:
>>>> It is easy to optimize the pickling of instances by giving them
>>>> __getstate__() and __setstate__() methods.  But the pickler still
>>>> records the type of each object (essentially, the name of its class) in
>>>> each record.  The space for these strings constituted a large fraction
>>>> of the database size.
>>> If these strings are not interned, then perhaps they should be.
>>> There is a similar optimization proposal (w/ patch) for attribute names:
>>> http://bugs.python.org/issue5084
>> If I understand correctly, this would not help:
>>
>> - on writing, the strings are identical anyway, because they are read
>> out of the class's __name__ and __module__ fields.  Therefore the
>> Pickler's usual memoizing behavior will prevent the strings from being
>> written more than once.
> 
> Then why did you say that "the space for these strings constituted a
> large fraction of the database size", if they are already shared? Are
> your objects so tiny that even the space taken by the pointer to the
> type name grows the size of the database significantly?

Sorry for the confusion.  I thought you were suggesting the change to
help the more typical use case, when a single Pickler is used for a lot
of data.  That use case will not be helped by interning the class
__name__ and __module__ strings, for the reasons given in my previous email.

In my case, the strings are shared via the Pickler memoizing mechanism
because I pre-populate the memo (using the API that the OP proposes to
remove), so your suggestion won't help my current code, either.  The
statement that "the space for these strings constituted a large fraction
of the database size" described the situation before I implemented the
pre-populated memo; your suggestion wouldn't have helped that case,
either.

Here are the main use cases:

1. Saving and loading one large record.  A class's __name__ string is
the same string object every time it is retrieved, so it only needs to
be stored once and the Pickler memo mechanism works.  Similarly for the
class's __module__ string.

2. Saving and loading lots of records sequentially.  Provided a single
Pickler is used for all records and its memo is never cleared, this
works just as well as case 1.

3. Saving and loading lots of records in random order, as for example in
the shelve module.  Here a single Pickler with a retained memo cannot be
reused across records, because an Unpickler reading the records in a
different order would encounter memo references before the objects that
define them.  There are two subcases:

   a. Use a clean Pickler/Unpickler object for each record.  In this
case the __name__ and __module__ of a class will appear once in each
record in which the class appears.  (This is the case regardless of
whether they are interned.)  On reading, the __name__ and __module__ are
only used to look up the class, so interning them won't help.  It is
thus impossible to avoid wasting a lot of space in the database.

   b. Use a Pickler/Unpickler with a preset memo for each record (my
unorthodox technique).  In this case the class __name__ and __module__
will be memoized in the shared memo, so in other records only their ID
needs to be stored (in fact, only the ID of the class object itself).
This allows the database to be smaller, but does not have any effect on
the RAM usage of the loaded objects.

If the OP's proposal is accepted, 3b will become impossible.  The
technique seems not to be well known, so maybe it doesn't need to be
supported.  It would mean some extra work for me on the cvs2svn project
though :-(

Michael



Re: [Python-Dev] Pickler/Unpickler API clarification

2009-03-07 Thread Michael Haggerty
Greg Ewing wrote:
> Michael Haggerty wrote:
>> A similar effect could *almost* be obtained without accessing the memos
>> by saving the pickled primer itself in the database.  The unpickler
>> would be primed by using it to load the primer before loading the record
>> of interest.  But AFAIK there is no way to prime new picklers, because
>> there is no guarantee that pickling the same primer twice will result in
>> the same id->object mapping in the pickler's memo.
> 
> Would it help if, when creating a pickler or unpickler,
> you could specify another pickler or unpickler whose
> memo is used to initialise the memo of the new one?
> 
> Then you could keep the pickler that you used to pickle
> the primer and "fork" new picklers off it, and similarly
> with the unpicklers.

Typically, the purpose of a database is to persist data across program
runs.  So typically, your suggestion would only help if there were a way
to persist the primed Pickler across runs.

(The primed Unpickler is not quite so important because it can be primed
by reading a pickle of the primer, which in turn can be stored somewhere
in the DB.)

In the particular case of cvs2svn, each of our databases is in fact
written in a single pass, and then in later passes only read, not
written.  So I suppose we could do entirely without pickleable Picklers,
if they were copyable within a single program run.  But that constraint
would make the feature even less general.

Michael


Re: [Python-Dev] Pickler/Unpickler API clarification

2009-03-07 Thread Michael Haggerty
Guido van Rossum wrote:
> On Sat, Mar 7, 2009 at 8:04 AM, Michael Haggerty wrote:
>> Typically, the purpose of a database is to persist data across program
>> runs.  So typically, your suggestion would only help if there were a way
>> to persist the primed Pickler across runs.
> 
> I haven't followed all this, but isn't it at least possible to
> conceive of the primed pickler as being recreated from scratch from
> constant data each run?

If there were a guarantee that pickling the same data would result in
the same memo ID -> object mapping, that would also work.  But that
doesn't seem to be a realistic guarantee to make.  AFAIK the memo IDs
are integers chosen consecutively in the order that objects are pickled,
which doesn't seem so bad.  But compound objects are a problem.  For
example, when pickling a map, the map entries would have to be pickled
in an order that remains consistent across runs (and even across Python
versions).  Even worse, all user-written __getstate__() methods would
have to return exactly the same result, even across program runs.

>> (The primed Unpickler is not quite so important because it can be primed
>> by reading a pickle of the primer, which in turn can be stored somewhere
>> in the DB.)
>>
>> In the particular case of cvs2svn, each of our databases is in fact
>> written in a single pass, and then in later passes only read, not
>> written.  So I suppose we could do entirely without pickleable Picklers,
>> if they were copyable within a single program run.  But that constraint
>> would make the feature even less general.
> 
> Being copyable is mostly equivalent to being picklable, but it's
> probably somewhat weaker because it's easier to define it as a pointer
> copy for some types that aren't easily picklable.

Indeed.  And pickling the memo should not present any fundamental
problems, since by construction it can only contain pickleable objects.

Michael


Re: [Python-Dev] suggestion for try/except program flow

2009-03-28 Thread Michael Haggerty
Mark Donald wrote:
> I frequently have this situation:
> 
> try:
>     try:
>         raise Thing
>     except Thing, e:
>         # handle Thing exceptions
>         raise
> except:
>     # handle all exceptions, including Thing

This seems like an unusual pattern.  Are you sure you can't use

try:
    raise Thing
except Thing, e:
    # handle Thing exceptions
    raise
finally:
    # handle *all situations*, including Thing

Obviously, the finally: block is also invoked in the case that no
exceptions are triggered, but often this is what you want anyway...
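
For example (a minimal runnable sketch, with a stand-in Thing):

class Thing(Exception):
    pass

def process(trigger):
    try:
        try:
            if trigger:
                raise Thing('oops')
        except Thing, e:
            print 'Thing-specific handling:', e
            raise
    finally:
        print 'handling common to all outcomes'

process(False)     # prints only the common handling
try:
    process(True)  # prints both; Thing then propagates to the caller
except Thing:
    pass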

Michael


Re: [Python-Dev] PEP 389: argparse - new command line parsing module

2009-09-29 Thread Michael Haggerty
Steven Bethard wrote:
> On Tue, Sep 29, 2009 at 1:31 PM, Paul Moore wrote:
>> 2009/9/28 Yuvgoog Greenle:
>>> 1. There is no chance of the script killing itself. In argparse and optparse
>>> exit() is called on every parsing error (btw because of this it sucks to
>>> debug parse_args in an interpreter).
> [...]
> 
> This is behavior that argparse inherits from optparse, but I believe
> it's still what 99.9% of users want.  If you're writing a command line
> interface, you don't want a stack trace when there's an error message
> (which is what you'd get if argparse just raised exceptions); you want
> an exit with an error code.  That's what command line applications are
> supposed to do.
> 
> If you're not using argparse to write command line applications, then
> I don't feel bad if you have to do a tiny bit of extra work to take
> care of that use case. In this particular situation, all you have to
> do is subclass ArgumentParser and override exit() to do whatever you
> think it should do.
> 
>>> 2. There is no chance the parser will print things I don't want it to print.
> [...]
> 
> There is only a single method in argparse that prints things,
> _print_message(). So if you want it to do something else, you can
> simply override it in a subclass. I can make that method public if
> this is a common use case.

Instead of forcing the user to override the ArgumentParser class to
change how errors are handled, I suggest adding a separate method
ArgumentParser.parse_args_with_exceptions() that raises exceptions
instead of writing to stdout/stderr and never calls sys.exit().  Then
implement ArgumentParser.parse_args() as a wrapper around
parse_args_with_exceptions():

class ArgparseError(Exception):
    """argparse-specific base exception type."""
    pass

class ArgumentError(ArgparseError):
    # ...
    pass

class ArgumentParser(...):
    # ...

    def parse_args_with_exceptions(self, *args, **kwargs):
        # like the old parse_args(), except raises exceptions instead
        # of writing to stdout/stderr or calling sys.exit()...
        ...

    def parse_args(self, *args, **kwargs):
        try:
            return self.parse_args_with_exceptions(*args, **kwargs)
        except ArgparseError as e:
            # _sys and _ are argparse's internal aliases for sys and
            # gettext.gettext:
            self.print_usage(_sys.stderr)
            self.exit(status=2,
                      message=(_('%s: error: %s\n') % (self.prog, e)))
        # perhaps catch other exceptions that need special handling...

    def error(self, message):
        raise ArgparseError(message)

The exception classes should hold enough information to be useful to
non-command-line users, and should of course carry the error messages
that are written to stderr by default.  This would allow
non-command-line users to call parse_args_with_exceptions() and handle
the exceptions however they like.
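
For example, non-command-line code could then do something like this
(hypothetical, using the names sketched above):

import logging

parser = ArgumentParser(prog='frob')
parser.add_argument('--level', type=int)
try:
    args = parser.parse_args_with_exceptions(['--level', 'three'])
except ArgparseError as e:
    # No sys.exit() and nothing printed; the caller decides.
    logging.warning('bad command line: %s', e)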

Michael


[Python-Dev] The interpreter accepts f(**{'5':'foo'}); is this intentional?

2009-02-05 Thread Michael Haggerty
I can't find documentation about whether there are constraints imposed
on the keys in the map passed to a function via **, as in f(**d).

According to http://docs.python.org/reference/expressions.html#id9,
d must be a mapping.

The following test in test_extcall.py implies that the keys of this
map must be strings:

>>> f(**{1:2})
Traceback (most recent call last):
  ...
TypeError: f() keywords must be strings

But must the keys be valid Python identifiers?

In particular, the following is allowed by both the Python 2.5.2 and
Jython 2.2.1 interpreters:

>>> f(**{'1':2})
{'1': 2}
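
For comparison, such a key can never match a named parameter, so it is
only reachable through a ** mapping (sketch of an interactive session):

>>> def g(a=None):
...     return a
>>> g(**{'1': 2})
Traceback (most recent call last):
  ...
TypeError: g() got an unexpected keyword argument '1'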

Is this behavior required somewhere by the Python language spec, or is
it an error that just doesn't happen to be checked, or is it
intentionally undefined whether this is allowed?

Michael