Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread tds...@gmail.com

Hi,

Yes, I know the main use is generating pyc files. But marshal is also used
for other things, and it is the fastest built-in serialization method. For
some use cases it makes sense to use it instead of pickle or others. People
use it for more than generating pyc files.

I found only one case with a performance regression in the newer protocol
versions in 3.4. We should take care of it and improve it. It can be handled
now, during the beta phase, and fixed for the upcoming release, or at least
documented. I think it is also useful for others to know about the new
versions, their usage and their behavior.

I also noticed that the new versions can be faster in some use cases. I like
the work done for this, and I think reducing the size of the resulting
serialization was useful too. I'm not against it, nor do I want to criticize
it. I only want to improve all this further.

Regards,

Wolfgang

On 28.01.2014 06:14, Kristján Valur Jónsson wrote:

Hi there.
I think you should modify your program to marshal (and load) a compiled module.
This is where the optimizations in versions 3 and 4 become important.
K


-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames@python.org] On Behalf Of Victor Stinner
Sent: Monday, January 27, 2014 23:35
To: Wolfgang
Cc: Python-Dev
Subject: Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

Hi,

I'm surprised: marshal.dumps() doesn't raise an error if you pass an invalid
version. In fact, Python 3.3 only supports versions 0, 1 and 2. If you pass 3,
it will use version 2. (The same applies to version 99.)

Python 3.4 has two new versions: 3 and 4. Version 3 "shares common
object references"; version 4 adds short tuples and short strings
(producing smaller files).

It would be nice to document the differences between marshal versions.

And what do you think of raising an error if the version is unknown in
marshal.dumps()?
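
Until marshal.dumps() itself rejects unknown versions, a caller can guard the call. The following is only a sketch of that idea, not standard-library behaviour: checked_dumps is a hypothetical helper name, and it assumes that marshal.version (the newest format the running interpreter writes) is the right upper bound.

```python
import marshal


def checked_dumps(value, version=marshal.version):
    """Hypothetical wrapper: refuse versions this interpreter does not know."""
    # Assumption: valid versions are exactly 0..marshal.version.
    if not 0 <= version <= marshal.version:
        raise ValueError(
            "unsupported marshal version %r (supported: 0..%d)"
            % (version, marshal.version)
        )
    return marshal.dumps(value, version)


if __name__ == "__main__":
    print(len(checked_dumps([1, 2, 3], 2)))
    try:
        checked_dumps([1, 2, 3], 99)
    except ValueError as exc:
        print("rejected:", exc)
```

With such a guard, a typo like version 99 fails loudly instead of silently falling back to an older format.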

I modified your benchmark to also test loads() and ran the benchmark
10 times. Results:
---
Python 3.3.3+ (3.3:50aa9e3ab9a4, Jan 27 2014, 16:11:26) [GCC 4.8.2 20131212
(Red Hat 4.8.2-7)] on linux

dumps v0: 391.9 ms
data size v0: 45582.9 kB
loads v0: 616.2 ms

dumps v1: 384.3 ms
data size v1: 45582.9 kB
loads v1: 594.0 ms

dumps v2: 153.1 ms
data size v2: 41395.4 kB
loads v2: 549.6 ms

dumps v3: 152.1 ms
data size v3: 41395.4 kB
loads v3: 535.9 ms

dumps v4: 152.3 ms
data size v4: 41395.4 kB
loads v4: 549.7 ms
---

And:
---
Python 3.4.0b3+ (default:dbad4564cd12, Jan 27 2014, 16:09:40) [GCC 4.8.2
20131212 (Red Hat 4.8.2-7)] on linux

dumps v0: 389.4 ms
data size v0: 45582.9 kB
loads v0: 564.8 ms

dumps v1: 390.2 ms
data size v1: 45582.9 kB
loads v1: 545.6 ms

dumps v2: 165.5 ms
data size v2: 41395.4 kB
loads v2: 470.9 ms

dumps v3: 425.6 ms
data size v3: 41395.4 kB
loads v3: 528.2 ms

dumps v4: 369.2 ms
data size v4: 37000.9 kB
loads v4: 550.2 ms
---

Version 2 is the fastest in Python 3.3 and 3.4, but version 4 with Python 3.4
produces the smallest file.
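
The modified benchmark itself is not included in this message. A rough sketch of a comparable measurement is shown below; the names gen_data and bench, the list size of 100000, and the choice to report the best of the repeated runs are all assumptions made for illustration, so the numbers it prints are not comparable to the tables above.

```python
import marshal
from time import perf_counter


def gen_data(amount):
    # Same record shape as the test data posted in this thread.
    for i in range(amount):
        yield (i, i + 2, i * 2, (i + 1, i + 4, i, 4),
               "my string template %s" % i, 1.01 * i, True)


def bench(data, versions=range(5), repeat=10):
    """Return {version: (best dumps seconds, best loads seconds, size in bytes)}."""
    results = {}
    for v in versions:
        if v > marshal.version:
            continue  # skip formats this interpreter does not know
        dump_times, load_times = [], []
        for _ in range(repeat):
            t0 = perf_counter()
            blob = marshal.dumps(data, v)
            t1 = perf_counter()
            marshal.loads(blob)
            t2 = perf_counter()
            dump_times.append(t1 - t0)
            load_times.append(t2 - t1)
        results[v] = (min(dump_times), min(load_times), len(blob))
    return results


if __name__ == "__main__":
    for v, (d, ld, size) in bench(list(gen_data(100000))).items():
        print("v%d: dumps %.1f ms, loads %.1f ms, size %.1f kB"
              % (v, d * 1e3, ld * 1e3, size / 1e3))
```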

Victor

2014-01-27 Wolfgang :

Hi,

I tested the latest beta from 3.4 (b3) and noticed there is a new
marshal protocol version 3.
The documentation is a little silent about the new features, not going
into detail.

I've run a performance test with the new protocol version and noticed
the new version is two times slower in serialization than version 2. I
tested it with a simple value tuple in a list (50 elements).
Nothing special. (happens only if the tuple contains also a tuple)

Copy of the test code:


from time import time
from marshal import dumps

def genData(amount=50):
    for i in range(amount):
        yield (i, i+2, i*2, (i+1,i+4,i,4), "my string template %s" % i,
               1.01*i, True)

data = list(genData())
print(len(data))
t0 = time()
result = dumps(data, 2)
t1 = time()
print("duration p2: %f" % (t1-t0))
t0 = time()
result = dumps(data, 3)
t1 = time()
print("duration p3: %f" % (t1-t0))



Is the overhead for the recursion detection so high ?

Note this happens only if there is a tuple in the tuple of the datalist.


Regards,

Wolfgang


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com





[Python-Dev] Add PyType_GetSlot

2014-01-28 Thread Martin v. Löwis
I'd like to resolve a long-standing issue of the stable ABI in 3.4:

http://bugs.python.org/issue17162

The issue is that, since PyTypeObject is opaque, module authors cannot
get at tp_free, which they may need to in order to implement tp_dealloc
properly.

Rather than providing the proposed specific wrapper for tp_dealloc, I
propose to add a generic PyType_GetSlot function. From a stability point
of view, exposing slot values is uncritical - it's just that the layout
of the type object is hidden.

Any objection to adding this before RC1?

Regards,
Martin


Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Martin v. Löwis
I've debugged this a little bit. I couldn't originally see where the
problem is, since I expected that the code dealing with shared
references shouldn't ever trigger - none of the tuples in the example
are actually shared (i.e. they all have a ref-count of 1, except for
the outer list, which is both a parameter and bound in a variable).

Debugging reveals that it is actually the many integer objects which
trigger the sharing code. So a much simplified example of Victor's
benchmarking code can use

data = [0]*1000

The difference between version 2 and version 3 here is that v2 marshals
a lot of "0" integers, whereas version 3 marshals a single one, and then
a lot of references to this integer.

Since "0" is a small integer, and thus a singleton anyway, this doesn't
affect the unmarshal result. If the integers were larger, and actually
shared, the unmarshal result under v2 would be "more correct".

If the integers are not shared, v2 and v3 have about the same runtime,
e.g. seen when using

data = [1000*1000 for i in range(1000)]
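
Whether sharing survives a round trip can be checked directly. The sketch below is an illustration only: shares_identity is a hypothetical helper, and 10 ** 30 is just an arbitrary integer large enough not to be one of the cached small integers. It reports what the running interpreter does rather than asserting a particular outcome.

```python
import marshal


def shares_identity(version, value):
    """Round-trip [value, value] and report whether both slots are one object."""
    data = [value, value]
    first, second = marshal.loads(marshal.dumps(data, version))
    return first is second


if __name__ == "__main__":
    big = 10 ** 30  # arbitrary large integer, not a cached small int
    for version in (2, marshal.version):
        print("version", version, "shared after load:",
              shares_identity(version, big))
```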

Regards,
Martin



Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Barry Warsaw
On Jan 28, 2014, at 09:17 AM, tds...@gmail.com wrote:

>yes I know the main usage is to generate pyc files. But marshal is also used
>for other stuff and is the fastest built in serialization method. For some
>use cases it makes sense to use it instead of pickle or others. And people
>use it not only to generate pyc files.

marshal is not guaranteed to be backward compatible between Python versions,
so it's generally not a good idea to use it for serialization.

-Barry


Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Victor Stinner
2014-01-28 "Martin v. Löwis" :
> Debugging reveals that it is actually the many integer objects which
> trigger the sharing code. So a much simplified example of Victor's
> benchmarking code can use
>
> data = [0]*1000
>
> The difference between version 2 and version 3 here is that v2 marshals
> a lot of "0" integers, whereas version 3 marshals a single one, and then
> a lot of references to this integer.

Since the output size looks to be the same, it may be interesting to
special-case small integers, or even integers and floats in general.
Handling references to these numbers takes probably more CPU, whereas
the gain on the file size is probably minor.

I wrote a short patch:
http://bugs.python.org/issue20416

"dumps v3 is 60% faster, loads v3 is also 14% *faster*."

"dumps v4 is 66% faster, loads v4 is 16% faster."

"file size (on version 3 and 4) is unchanged with my patch."

"So with the patch, the Python 3.4 default version (4) is *faster*
(dump 20% faster, load 16% faster) and produces *smaller files* (10%
smaller)."

It looks like a win-win patch :-)

The drawback is that files storing many duplicated huge numbers will
not be smaller with marshal version >= 3.
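
To see how much this drawback matters for a given data set, the output size can be compared per version. This is a small sketch with a hypothetical helper name (sizes_by_version) and an arbitrary test value (2 ** 4000, repeated 1000 times); the result depends on the interpreter in use and on whether a patch like the one above is applied.

```python
import marshal


def sizes_by_version(data, versions=(2, 3, 4)):
    """Return {version: serialized size in bytes} for the supported versions."""
    return {v: len(marshal.dumps(data, v))
            for v in versions if v <= marshal.version}


if __name__ == "__main__":
    huge = 2 ** 4000            # one big integer object ...
    data = [huge] * 1000        # ... referenced 1000 times from the list
    for version, size in sorted(sizes_by_version(data).items()):
        print("version %d: %d bytes" % (version, size))
```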

Victor


Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Antoine Pitrou
On Tue, 28 Jan 2014 11:22:40 +0100
Victor Stinner  wrote:
> 2014-01-28 "Martin v. Löwis" :
> > Debugging reveals that it is actually the many integer objects which
> > trigger the sharing code. So a much simplified example of Victor's
> > benchmarking code can use
> >
> > data = [0]*1000
> >
> > The difference between version 2 and version 3 here is that v2 marshals
> > a lot of "0" integers, whereas version 3 marshals a single one, and then
> > a lot of references to this integer.
> 
> Since the output size looks to be the same, it may be interesting to
> special-case small integers, or even integers and floats in general.
> Handling references to these numbers takes probably more CPU, whereas
> the gain on the file size is probably minor.

Please remember file size is only one factor. Another factor is runtime
size after unmarshalling.

For the typical case of pyc files, dump times are not very important.
Load times are.

Regards

Antoine.




Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread tds...@gmail.com

On 28.01.2014 10:23, Barry Warsaw wrote:
> On Jan 28, 2014, at 09:17 AM, tds...@gmail.com wrote:
>
> > yes I know the main usage is to generate pyc files. But marshal is also used
> > for other stuff and is the fastest built in serialization method. For some
> > use cases it makes sense to use it instead of pickle or others. And people
> > use it not only to generate pyc files.
>
> marshal is not guaranteed to be backward compatible between Python versions,
> so it's generally not a good idea to use it for serialization.



Yes, I know. Because of that I use it only if nothing persists and the
exchange is between the same Python version (even the same architecture and
interpreter type). But there are use cases for inter-process communication
with no persistence and no need to serialize custom classes and so on. If
speed matters and security is not a concern, you use the marshal module to
serialize data.

Assume something like multiprocessing on Windows (no fork available) with
only a pipe to exchange a lot of simple data, where pickle is too slow.
(Sometimes the work is distributed to other computers.)
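
As an illustration of this pipe use case, here is a minimal sketch that frames marshal data over a multiprocessing connection. The helper names send_obj and recv_obj are hypothetical, and for brevity both ends of the pipe live in one process; in real use each end would belong to a different process running the same Python version.

```python
import marshal
from multiprocessing import Pipe


def send_obj(conn, obj):
    # send_bytes() frames the payload, so the receiver gets it back whole.
    conn.send_bytes(marshal.dumps(obj))


def recv_obj(conn):
    # Only do this with a trusted peer: marshal is not meant to be secure
    # against erroneous or maliciously constructed data.
    return marshal.loads(conn.recv_bytes())


if __name__ == "__main__":
    parent_end, child_end = Pipe()
    send_obj(parent_end, [(1, 2.5, "text", (3, 4))])
    print(recv_obj(child_end))
```

Going through send_bytes() and recv_bytes() bypasses the pickling that Connection.send() would otherwise do.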


Another use case is a persistent cache that needs ultra-fast serialization
(dump/load) but holds no critical data, which normally lives in a database.
The cache can easily be regenerated from the main data if the Python version
changes. (I think pyc files are such a use case.)
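
A cache of this kind could be sketched as follows. Everything here is an assumption for illustration: the names CACHE_TAG, save_cache and load_cache are hypothetical, and tagging the file with the implementation name and the full interpreter version is just one conservative way to decide when to regenerate.

```python
import marshal
import os
import sys
import tempfile

# Assumption: implementation name plus full version is a strict enough tag.
CACHE_TAG = (sys.implementation.name,) + tuple(sys.version_info[:3])


def save_cache(path, payload):
    with open(path, "wb") as f:
        marshal.dump((CACHE_TAG, payload), f)


def load_cache(path, regenerate):
    """Return the cached payload, rebuilding it if missing, unreadable or stale."""
    try:
        with open(path, "rb") as f:
            tag, payload = marshal.load(f)
        if tag == CACHE_TAG:
            return payload
    except (OSError, EOFError, ValueError, TypeError):
        pass  # fall through and rebuild from the main data
    payload = regenerate()
    save_cache(path, payload)
    return payload


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "cache.bin")
    print(load_cache(path, lambda: list(range(5))))  # built and stored
    print(load_cache(path, lambda: []))              # served from the cache
```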

I have tested a lot of modules for some needs (JSON, Thrift, MessagePack,
Pickle, ProtoBuffers, ...). All are very useful and have their usage
scenarios. The same applies to marshal if its limitations are no problem for
you. (I've read the manual and have some knowledge of the limitations.)

But none of these serialization modules is as fast as marshal (for my use
case).

I hear you and have registered the warning about this, and I will not
complain if something becomes incompatible. :-)

If someone knows something faster for serializing basic Python types, I'm
glad to use it.



Regards,

Wolfgang




Re: [Python-Dev] Negative times behaviour in itertools.repeat for Python maintenance releases (2.7, 3.3 and maybe 3.4)

2014-01-28 Thread Steven D'Aprano
On Mon, Jan 27, 2014 at 10:06:57PM -0800, Larry Hastings wrote:

> If I were writing it, it might well come out like this:
[snip example]

+1 on this wording, with one minor caveat:

>.. note:  if "times" is specified using a keyword argument, and
>provided with a negative value, repeat yields the object forever.
>This is a bug, its use is unsupported, and this behavior may be
>removed in a future version of Python.

How about changing "may be removed" to "will be removed", he asks 
hopefully? :-)
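
For anyone who wants to check what a particular interpreter does, the sketch below compares the positional and keyword forms while capping the output with islice, so the endless case cannot run away. The helper name first_items is hypothetical, and the result of the keyword form depends on whether the interpreter still has the bug described in the note.

```python
from itertools import islice, repeat


def first_items(iterator, limit=5):
    """Take at most `limit` items so an endless iterator cannot run away."""
    return list(islice(iterator, limit))


if __name__ == "__main__":
    print("positional:", first_items(repeat("a", -1)))
    print("keyword:   ", first_items(repeat("a", times=-1)))
```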


-- 
Steven


Re: [Python-Dev] Negative times behaviour in itertools.repeat for Python maintenance releases (2.7, 3.3 and maybe 3.4)

2014-01-28 Thread Ethan Furman

On 01/28/2014 04:37 AM, Steven D'Aprano wrote:
> On Mon, Jan 27, 2014 at 10:06:57PM -0800, Larry Hastings wrote:
>
> > If I were writing it, it might well come out like this:
> [snip example]
>
> +1 on this wording, with one minor caveat:
>
> > .. note:  if "times" is specified using a keyword argument, and
> > provided with a negative value, repeat yields the object forever.
> > This is a bug, its use is unsupported, and this behavior may be
> > removed in a future version of Python.
>
> How about changing "may be removed" to "will be removed", he asks
> hopefully? :-)


+1

--
~Ethan~


Re: [Python-Dev] Negative times behaviour in itertools.repeat for Python maintenance releases (2.7, 3.3 and maybe 3.4)

2014-01-28 Thread Armin Rigo
Hi Vajrasky,

On 28 January 2014 03:05, Vajrasky Kok  wrote:
> I get your point. But strangely enough, I can still recover from
> list(repeat('a', 2**29)). It only slows down my computer. I can ^Z the
> application then kill it later. But with list(repeat('a', times=-1)),
> rebooting the machine is compulsory.

Actually you get the early OverflowError if the value doesn't fit a C
long, and any value up to sys.maxint gets past this check.  Try with
2**31-1.
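
The early check can be observed without allocating anything, because building the repeat object is lazy. This sketch uses sys.maxsize as the Python 3 counterpart of the bound mentioned above (sys.maxint exists only in Python 2); that mapping, and the helper name try_repeat, are assumptions for illustration.

```python
import sys
from itertools import repeat


def try_repeat(times):
    """Return the exception class raised when building repeat(), or None."""
    try:
        repeat("a", times)  # only builds the iterator, allocates no list
    except OverflowError as exc:
        return type(exc)
    return None


if __name__ == "__main__":
    print(try_repeat(sys.maxsize))      # accepted by the argument check
    print(try_repeat(sys.maxsize + 1))  # rejected before any work is done
```

Wrapping an accepted value in list() is what exhausts memory, so the sketch deliberately never does that.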


A bientôt,

Armin.


Re: [Python-Dev] Add PyType_GetSlot

2014-01-28 Thread Larry Hastings

On 01/28/2014 12:27 AM, "Martin v. Löwis" wrote:
> I'd like to resolve a long-standing issue of the stable ABI in 3.4:
>
> http://bugs.python.org/issue17162
>
> The issue is that, since PyTypeObject is opaque, module authors cannot
> get at tp_free, which they may need to in order to implement tp_dealloc
> properly.
>
> Rather than providing the proposed specific wrapper for tp_dealloc, I
> propose to add a generic PyType_GetSlot function. From a stability point
> of view, exposing slot values is uncritical - it's just that the layout
> of the type object is hidden.
>
> Any objection to adding this before RC1?


So this would be a new public ABI function?

Would it be 100% new code, or would you need to refactor code internally 
to achieve it?


In general I'm in favor of it but I'd like to review the patch before it 
goes in.


Also, just curious: what is typeslots.h used for?  I tried searching for 
a couple of those macros, and their only appearance in trunk was their 
definition.



Cheers,


//arry/


Re: [Python-Dev] Negative times behaviour in itertools.repeat for Python maintenance releases (2.7, 3.3 and maybe 3.4)

2014-01-28 Thread Larry Hastings

On 01/28/2014 06:18 AM, Ethan Furman wrote:
> On 01/28/2014 04:37 AM, Steven D'Aprano wrote:
> > On Mon, Jan 27, 2014 at 10:06:57PM -0800, Larry Hastings wrote:
> > > .. note:  if "times" is specified using a keyword argument, and
> > > provided with a negative value, repeat yields the object forever.
> > > This is a bug, its use is unsupported, and this behavior may be
> > > removed in a future version of Python.
> >
> > How about changing "may be removed" to "will be removed", he asks
> > hopefully? :-)
>
> +1



See the recent discussion "Deprecation policy" right here in python-dev 
for a cogent discussion on this issue.  I agree with Raymond's view, 
posted on 1/25:


   * A good use for deprecations is for features that were flat-out
   misdesigned and prone to error.  For those, there is nothing wrong with
   deprecating them right away.  Once deprecated though, there doesn't need
   to be a rush to actually remove it -- that just makes it harder for
   people with currently working code to upgrade to newer versions of Python.

   * When I became a core developer well over a decade ago, I was a little
   deprecation happy (old stuff must go, keep everything nice and clean,
   etc).  What I learned though is that deprecations are very hard on users
   and that the purported benefits usually aren't really important.

I think the "times behaves differently when passed by name versus passed 
by position" behavior falls exactly into this category, and its advice 
on how to handle it is sound.


Cheers,


//arry/


[Python-Dev] Need help designing subprocess API for Tulip

2014-01-28 Thread Guido van Rossum
If you're interested, please see us on the python-tulip mailing list at
Google Groups.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Kristján Valur Jónsson
“Note this happens only if there is a tuple in the tuple of the datalist.”
This is rather odd.
Protocol 3 adds support for object instancing.  Non-trivial Objects are looked 
up in the memo dictionary if they have a reference count larger than 1.
I suspect that the internal tuple has this property, for some reason.
However, my little test in 2.7 does not bear out this hypothesis:


def genData(amount=50):
    for i in range(amount):
        yield (i, i+2, i*2, (i+1,i+4,i,4), "my string template %s" % i,
               1.01*i, True)

l = list(genData())
import sys
print sys.getrefcount(l[1000])
print sys.getrefcount(l[1000][0])
print sys.getrefcount(l[1000][3])

C:\Program Files\Perforce>python d:\pyscript\data.py
2
3
2

K

From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames@python.org] 
On Behalf Of Wolfgang
Sent: Monday, January 27, 2014 22:41
To: Python-Dev
Subject: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

Hi,
I tested the latest beta from 3.4 (b3) and noticed there is a new marshal 
protocol version 3.
The documentation is a little silent about the new features, not going into 
detail.
I've run a performance test with the new protocol version and noticed the new 
version is two times slower in serialization than version 2. I tested it with a 
simple value tuple in a list (50 elements).
Nothing special. (happens only if the tuple contains also a tuple)
Copy of the test code:


from time import time
from marshal import dumps

def genData(amount=50):
    for i in range(amount):
        yield (i, i+2, i*2, (i+1,i+4,i,4), "my string template %s" % i,
               1.01*i, True)

data = list(genData())
print(len(data))
t0 = time()
result = dumps(data, 2)
t1 = time()
print("duration p2: %f" % (t1-t0))
t0 = time()
result = dumps(data, 3)
t1 = time()
print("duration p3: %f" % (t1-t0))


Is the overhead for the recursion detection so high ?

Note this happens only if there is a tuple in the tuple of the datalist.


Regards,

Wolfgang



Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Kristján Valur Jónsson
How often I hear this argument :)
For many people, serialized data is not persisted but used, e.g., for sending
information over the wire or between processes.
Marshal is very good for that.  Additionally, it doesn't have any side effects
since it just stores primitive types and is thus "safe".
EVE Online uses its own extended version of the marshal system, and has for
years, because it is fast and it can be tuned to an application domain by
adding custom opcodes.

> -----Original Message-----
> From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames@python.org] On Behalf Of Barry Warsaw
> Sent: Tuesday, January 28, 2014 17:23
> To: python-dev@python.org
> Subject: Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)
>
> marshal is not guaranteed to be backward compatible between Python
> versions, so it's generally not a good idea to use it for serialization.




Re: [Python-Dev] Python 3.4, marshal dumps slower (version 3 protocol)

2014-01-28 Thread Terry Reedy

On 1/28/2014 10:02 PM, Kristján Valur Jónsson wrote:
> > marshal is not guaranteed to be backward compatible between Python
> > versions, so it's generally not a good idea to use it for serialization.
>
> How often I hear this argument :)
> For many people, serialized data is not persisted but used, e.g., for sending
> information over the wire or between processes.
> Marshal is very good for that.  Additionally, it doesn't have any side effects
> since it just stores primitive types and is thus "safe".
> EVE Online uses its own extended version of the marshal system, and has for
> years, because it is fast and it can be tuned to an application domain by
> adding custom opcodes.


I think the proper message is this:

"Marshal is designed for caching compiled message objects and has the 
function needed for that goal. When the need changes, marshal changes 
(with a change in magic number). Other uses should take into account the 
limitations of function and stability."


It appears you did just that by making a custom version with the 
function and stability you need.


--
Terry Jan Reedy




Re: [Python-Dev] Negative times behaviour in itertools.repeat for Python maintenance releases (2.7, 3.3 and maybe 3.4)

2014-01-28 Thread Ethan Furman

On 01/28/2014 06:50 PM, Larry Hastings wrote:
> See the recent discussion "Deprecation policy" right here in python-dev for a
> cogent discussion on this issue.  I agree with Raymond's view, posted on 1/25:
>
> > * A good use for deprecations is for features that were flat-out misdesigned
> > and prone to error.  For those, there is nothing wrong with deprecating them
> > right away.  Once deprecated though, there doesn't need to be a rush to
> > actually remove it -- that just makes it harder for people with currently
> > working code to upgrade to newer versions of Python.
> >
> > * When I became a core developer well over a decade ago, I was a little
> > deprecation happy (old stuff must go, keep everything nice and clean, etc).
> > What I learned though is that deprecations are very hard on users and that
> > the purported benefits usually aren't really important.

I also agree with this view.

> I think the "times behaves differently when passed by name versus passed by
> position" behavior falls exactly into this category, and its advice on how
> to handle it is sound.

I don't agree with this.  This is a bug.  Somebody going through (for example)
a code review and making minor changes so the code is more readable shouldn't
have to be afraid that [inserting | removing] the keyword in the function call
is going to *drastically* [1] change the behavior.  I understand the need for
a cycle of deprecation [2], but not fixing it in 3.5 is folly.


--
~Ethan~

[1] or change the behavior *at all*, for that matter

[2] speaking of deprecations, are all the 3.1, 3.2, etc., etc., deprecations 
being added to 2.7?