[Python-Dev] On a new version of pickle [PEP 3154]: self-referential frozensets
Hello,

I'm one of this year's Google Summer of Code students working on improving pickle by creating a new version. My name is Stefan and my mentor is Alexandre Vassalotti. If you're interested, you can monitor the progress on the dedicated blog at [2] and in the bitbucket repository at [3].

One of the goals for picklev4 is to add native opcodes for pickling sets and frozensets. So far, these 4 opcodes have been added:

* EMPTY_SET, EMPTY_FROZENSET:
    push an empty set/frozenset onto the stack
* UPDATE_SET:
    update the set on the stack with the top stack slice
    stack before: ... pyset mark stackslice
    stack after : ... pyset
    effect: pyset.update(stackslice) # inplace union
* UNION_FROZENSET:
    like UPDATE_SET, but create a new frozenset
    stack before: ... pyfrozenset mark stackslice
    stack after : ... pyfrozenset.union(stackslice)

While this design allows pickling of self-referential sets, self-referential frozensets are still problematic. For instance, trying to pickle `fs`:

    a=A(); fs=frozenset([a]); a.fs = fs

(when unpickling, the object a has to be initialized before it is added to the frozenset)

The only way I can think of to make this work is to postpone the initialization of all the objects inside the frozenset until after UNION_FROZENSET. I believe this is doable, but there might be memory penalties if the approach is to simply store all the initialization opcodes in memory until pickling the frozenset is finished.
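To make the ordering problem concrete, here is a minimal sketch that replays the unpickling steps by hand instead of going through pickle. The class A and both helper functions are illustrative names only; `A.__new__(A)` stands in for NEWOBJ and the `__dict__` update stands in for BUILD, which is an assumption about how those opcodes behave for a plain class.

```python
class A:
    """Hypothetical stand-in for the class A used in the example above."""


def rebuild_state_first():
    # State of `a` is set while the memoized frozenset is still empty.
    fs_empty = frozenset()            # EMPTY_FROZENSET, memoized
    a = A.__new__(A)                  # NEWOBJ
    a.__dict__.update(fs=fs_empty)    # BUILD: a.fs is the empty frozenset
    fs = fs_empty.union([a])          # UNION_FROZENSET creates a new object
    return a, fs


def rebuild_state_last():
    # Postponed initialization: create `a` bare, finish the frozenset,
    # and only then set the state of `a`.
    a = A.__new__(A)                  # NEWOBJ, state not set yet
    fs = frozenset().union([a])       # UNION_FROZENSET
    a.__dict__.update(fs=fs)          # BUILD, now pointing at the final set
    return a, fs


a1, fs1 = rebuild_state_first()
a2, fs2 = rebuild_state_last()
print(a1.fs is fs1, len(a1.fs))   # False 0
print(a2.fs is fs2, a2 in a2.fs)  # True True
```

The second ordering only works because instances of A are hashable before their state is set; a class whose hash depends on its attributes would need more care.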
Currently, pickle.dumps(fs,4) generates:

    EMPTY_FROZENSET
    BINPUT 0
    MARK
        BINGLOBAL_COMMON '0 A'  # same as GLOBAL '__main__ A' in v3
        EMPTY_TUPLE
        NEWOBJ
        EMPTY_DICT
        SHORT_BINUNICODE 'fs'
        BINGET 0  # retrieves the frozenset which is empty at this point, and it
                  # will never be filled because it's immutable
        SETITEM
        BUILD     # a.__setstate__({'fs' : frozenset()})
    UNION_FROZENSET

By postponing the initialization of a, it should instead generate:

    EMPTY_FROZENSET
    BINPUT 0
    MARK
        BINGLOBAL_COMMON '0 A'  # same as GLOBAL '__main__ A' in v3
        EMPTY_TUPLE
        NEWOBJ    # create the object but don't initialize its state yet
        BINPUT 1
    UNION_FROZENSET
    BINGET 1
    EMPTY_DICT
    SHORT_BINUNICODE 'fs'
    BINGET 0
    SETITEM
    BUILD
    POP

While self-referential frozensets are uncommon, a far more problematic situation is with the self-referential objects created with REDUCE. While pickle uses the idea of creating empty collections and then filling them, reduce typically creates already-filled objects. For instance:

    cnt = collections.Counter(); cnt[a]=3; a.cnt=cnt; cnt.__reduce__()
    (<class 'collections.Counter'>, ({<__main__.A object at 0x0286E8F8>: 3},))

where the A object contains a reference to the counter. Unpickling an object pickled with this reduce function is not possible, because the reduce function, which "explains" how to create the object, is asking for the object to exist before being created.

The fix here would be to pass Counter's dictionary in the state argument, as opposed to the "constructor parameters" one, as follows:

    (<class 'collections.Counter'>, (), {<__main__.A object at 0x0286E8F8>: 3})

When unpickling this, an empty Counter will be created first, and then __setstate__ will be called to fill it, at which point self-references are allowed. I assume this modification has to be done in the implementations of the data structures rather than in pickle itself. Pickle could try to fix this by detecting when reduce returns a class type as the first tuple arg and move the dict ctor parameter to the state, but this may not always be intended.
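The difference between the two reduce tuples can be sketched without pickle. `Bag` and `Node` below are hypothetical classes, not part of any library, and the two numbered steps are a simplified model of "call the constructor, then apply the state", not the actual unpickler code.

```python
class Node:
    """Illustrative object that will point back at its container."""


class Bag(dict):
    """Hypothetical dict subclass that is filled through __setstate__."""

    def __setstate__(self, state):
        self.update(state)


# Contents passed as state: the empty Bag exists before its contents
# are attached, so a member can already refer to it.
node = Node()
bag = Bag()                     # step 1: callable(*args) with empty args
node.bag = bag                  # members may now reference the new object
bag.__setstate__({node: 3})     # step 2: state applied afterwards
print(bag[node], node.bag is bag)   # 3 True

# Contents passed as constructor arguments: the argument dict must be
# complete before Bag(...) runs, so `other` cannot reference the Bag
# that is about to be created.
other = Node()
bag2 = Bag({other: 3})
print(hasattr(other, "bag"))        # False
```

In the first ordering the back-reference is available at exactly the point where __setstate__ runs, which is the property the proposed three-element tuple relies on.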
It's also a bit strange that __getstate__ is never used anywhere in pickle directly.

I'm looking forward to hearing your suggestions and opinions on this matter.

Regards,
Stefan

[1] http://www.python.org/dev/peps/pep-3154/
[2] http://pypickle4.wordpress.com/
[3] http://bitbucket.org/mstefanro/pickle4
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 0424: A method for exposing a length hint
On 7/16/2012 9:54 AM, Stefan Behnel wrote:
> Mark Shannon, 15.07.2012 16:14:
>> Alex Gaynor wrote:
>>> CPython currently defines an ``__length_hint__`` method on several types, such
>>> as various iterators. This method is then used by various other functions (such as
>>> ``map``) to presize lists based on the estimated returned by
>>
>> Don't use "map" as an example.
>> map returns an iterator so it doesn't need __length_hint__
>
> Right. It's a good example for something else, though. As I mentioned
> before, iterators should be able to propagate the length hint of an
> underlying iterator, e.g. in generator expressions or map(). I consider
> that an important feature that the protocol must support.
>
> Stefan

map() is quite problematic in this matter, and may actually benefit from the existence of __length_hint__. It is very easy to create an infinite loop currently by doing stuff like x=[1]; x+=map(str,x):

    [61081 refs]
    >>> x=[1]; x+=map(str,x)
    Traceback (most recent call last):
      ...
    MemoryError
    [120959834 refs]
    >>> len(x)
    120898752

Obviously, this won't cause an infinite loop in Python2 where map is non-lazy. Also, this won't work for all mutable containers, because not all of them permit adding elements while iterating:

    >>> s=set([1]); s.update(map(str,s))
    Traceback (most recent call last):
      ...
    RuntimeError: Set changed size during iteration
    [61101 refs]
    >>> s
    {1, '1'}
    [61101 refs]
    >>> del s
    [61099 refs]

If map objects were to disallow changing the size of the container while iterating (I can't really think of a use case in which such a limitation would be harmful), it might as well be with __length_hint__.

Also, what would iter([1,2,3]).__length_hint__() return? 3 or unknown? If 3, then the semantics of l=[1,2,3]; l += iter(l) will change (infinite loop without __length_hint__ vs. list of 6 elements with __length_hint__). If unknown, then it doesn't seem like there are very many places where __length_hint__ can return anything but unknown.

Regards,
Stefan M
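For readers who want to poke at these questions in an interpreter, here is a small sketch. It relies on `operator.length_hint`, which PEP 424 added in Python 3.4, so it reflects the protocol as eventually released rather than the state of this discussion. The `itertools.islice` cap is an addition made here so that the self-extending list stops after five items instead of running out of memory.

```python
import itertools
import operator

# A list iterator reports how many items remain.
it = iter([1, 2, 3])
print(operator.length_hint(it))       # 3
next(it)
print(operator.length_hint(it))       # 2

# A generator offers no hint, so the supplied default comes back.
gen = (n for n in [1, 2, 3])
print(operator.length_hint(gen, 10))  # 10

# Bounded version of x=[1]; x+=map(str,x): the lazy map keeps reading
# the items that extend() has just appended, until islice cuts it off.
x = [1]
x += itertools.islice(map(str, x), 5)
print(x)                              # [1, '1', '1', '1', '1', '1']
```

The hint is used only for presizing here; the list iterator still walks the list as it grows, which is why the unbounded form never terminates.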
[Python-Dev] Unbinding of methods
Hey,

As part of pickle4, I found it interesting to add the possibility of pickling bound functions (instance methods). This is done by pickling f.__self__ and f.__func__ separately, and then adding a BIND opcode to tie them together.

While this appears to work fine for python methods (non-builtin), some issues arise with builtins. These are caused partly because not all builtin function types support __func__, partly because not all of them fill __module__ when they should and partly because there are many (7) types a function can actually have:

    ClassMethodDescriptorType = type(??)
    BuiltinFunctionType = type(len)
    FunctionType = type(f)
    MethodType = type(A().f)
    MethodDescriptorType = type(list.append)
    WrapperDescriptorType = type(list.__add__)
    MethodWrapperType = type([].__add__)

    AllFunctionTypes = (ClassMethodDescriptorType, BuiltinFunctionType,
                        FunctionType, MethodType, MethodDescriptorType,
                        WrapperDescriptorType, MethodWrapperType)

    repr(AllFunctionTypes) = (
        <class 'classmethod_descriptor'>,
        <class 'builtin_function_or_method'>, <class 'function'>,
        <class 'method'>, <class 'method_descriptor'>,
        <class 'wrapper_descriptor'>, <class 'method-wrapper'>)

I have created a patch at [1], which adds __func__ to some other function types, as well as:
1) adds AllFunctionTypes etc. to Lib/types.py
2) inspect.isanyfunction(), inspect.isanyboundfunction(), inspect.isanyunboundfunction()
3) functools.unbind

Note that I am not knowledgeable about CPython internals and therefore the patch needs to be carefully reviewed.

Possible issues: Should classmethods be considered bound or unbound? If cm is a classmethod, then should cm.__func__.__self__ = cm.__self__ or cm.__func__.__self__ = None? Currently it does the latter:

    >>> cm.__self__, hasattr(cm,'__self__'), hasattr(cm.__func__, '__self__')
    (<class ...>, True, False)

This requires treating classmethods separately when pickling, so I'm not sure if this is ideal.

Let me know if I should have opened an issue instead. I look forward to hearing your opinions/suggestions on this matter.
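As a point of comparison, the sketch below checks the same seven kinds of callables against the names that the standard `types` module exposes in Python 3.7 and later, and shows the split-and-rebind step for a Python-level method. It does not use the patch at [1]; the class A and the function f are placeholders, and `dict.__dict__["fromkeys"]` is the example the `types` documentation gives for a classmethod descriptor.

```python
import types


class A:
    def f(self):
        return "f"

    @classmethod
    def cm(cls):
        return cls.__name__


def f():
    pass


# One sample object per function type, keyed by its name in `types`.
kinds = {
    types.ClassMethodDescriptorType: dict.__dict__["fromkeys"],
    types.BuiltinFunctionType: len,
    types.FunctionType: f,
    types.MethodType: A().f,
    types.MethodDescriptorType: list.append,
    types.WrapperDescriptorType: list.__add__,
    types.MethodWrapperType: [].__add__,
}
for expected_type, obj in kinds.items():
    assert type(obj) is expected_type

# Splitting a bound method into its parts and binding them again.
a = A()
bound = a.f
rebound = types.MethodType(bound.__func__, bound.__self__)
print(rebound() == bound())   # True

# A classmethod is bound to the class; its __func__ is a plain function
# and carries no __self__ of its own.
print(A.cm.__self__ is A, hasattr(A.cm.__func__, "__self__"))  # True False
```

`types.MethodType(func, obj)` only covers Python-level functions; the builtin kinds have no `__func__` in stock CPython, which is the gap the patch tries to close.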
Regards,
Stefan M

[1] https://gist.github.com/3145210
Re: [Python-Dev] Unbinding of methods
On 7/19/2012 9:54 PM, Antoine Pitrou wrote:
> On Thu, 19 Jul 2012 19:53:27 +0300
> M Stefan wrote:
>> Hey,
>>
>> As part of pickle4, I found it interesting to add the possibility
>> of pickling bound functions (instance methods). This is done by
>> pickling f.__self__ and f.__func__ separately, and then adding
>> a BIND opcode to tie them together.
>
> Instead of a specific opcode, can't you use a suitable __reduce__ magic
> (or __getnewargs__, perhaps)? We want to limit the number of opcodes
> except for performance-critical types (and I don't think bound methods
> are performance-critical for the purpose of serialization).

Yes, I agree that doing it with __reduce__ would be a better approach than adding a new opcode. I'll consider switching.

>> I have created a patch at [1], which adds __func__ to some other
>> function types, as well as:
>> 1) adds AllFunctionTypes etc. to Lib/types.py
>> 2) inspect.isanyfunction(), inspect.isanyboundfunction(),
>> inspect.isanyunboundfunction()
>> 3) functools.unbind
>
> That sounds like a lot of changes if the goal is simply to make those
> types picklable.
>
> Regards
>
> Antoine.

Indeed they are, I just thought there may be a chance this code would be used elsewhere too. It's a bit weird that you can use inspect to check for certain types of functions but not others, as well as be able to "unbind" certain types of methods but not others. Admittedly, these changes have few use cases and are not a priority.

Yours,
Stefan M
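One way to picture the __reduce__-based route is to describe a bound method as "look this name up on that instance". The sketch below is only an illustration of that idea: `reduce_bound_method` and `rebuild` are hypothetical helpers rather than pickle APIs, and the lookup by `__name__` is an assumption that would not hold for name-mangled private methods.

```python
def reduce_bound_method(method):
    """Hypothetical helper: describe a bound method as (callable, args)."""
    return getattr, (method.__self__, method.__name__)


def rebuild(reduced):
    # Mirrors what REDUCE does with a (callable, args) pair.
    func, args = reduced
    return func(*args)


class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "hello " + self.name


# Works for a Python-level method...
g = Greeter("pickle")
m = rebuild(reduce_bound_method(g.greet))
print(m())              # hello pickle
print(m.__self__ is g)  # True

# ...and for a builtin method, without needing __func__.
items = [1, 2]
append = rebuild(reduce_bound_method(items.append))
append(3)
print(items)            # [1, 2, 3]
```

Because the pair contains only the instance and a string, it needs no new opcode and no `__func__` attribute on builtin function types.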