[Python-Dev] Rethinking intern() and its data structure
I've been doing some memory profiling of my application, and I've found some interesting results in how intern() works. I was pretty surprised to see that the "interned" dict was actually consuming a significant amount of total memory.

To give specific values: after doing "bzr branch A B" of a small project, the total memory consumption is ~21MB. Of that, the largest single object is the 'interned' dict, at 1.57MB, which contains 22k strings. One interesting bit: the size of the dict plus the referenced strings is only 2.4MB, so the "interned" dict *by itself* is two thirds the size of the dict + strings it contains. It also means that the average size of a referenced string is 37.4 bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5 characters long. So to keep references to 13.5 * 22k ~ 300kB of character data, we are paying 2.4MB, or about 8:1 overhead.

When I looked at the actual references from interned, I saw mostly variable names. Consider that every variable name goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic; it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on hit rate, etc. (I saw a whole lot of strings for the error codes in win32all / winerror.py, and windows error code names tend to be longer than the average variable name.)

Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:

a) Don't keep a double reference to both key and value to the same object (1 pointer per entry); this could be as simple as using a Set() instead of a dict().

b) Don't cache the hash key in the set, as strings already cache them (1 long per entry). This is a big win for space, but would need to be balanced against lookup and collision-resolving speed. My guess is that reducing the size of the set will actually improve speed more, because more items can fit in cache. It depends on how many times you need to resolve a collision. If the string hash is sufficiently spread out, and the load factor is reasonable, then likely when you actually find an item in the set, it will be the item you want, and you'll need to bring the string object into cache anyway, so that you can do a string comparison (rather than just a hash comparison).

c) Use the existing lookup function one time (PySet->lookup()). Sets already have a "lookup" which is optimized for strings, and returns a pointer to where the object would go if it existed. Which means the intern() function can do a single lookup, resolving any collisions, and either return the object or insert it without doing a second lookup.

d) Having a special structure might also allow for separate tuning of things like 'default size', 'growth rate', 'load factor', etc. A lot of this could be tuned specifically knowing that we really only have 1 of these objects, and it is going to be pointing at a lot of strings that are < 50 bytes long. If hashes of variable-name strings are well distributed, we could probably get away with a load factor of 2. If we know we are likely to have lots and lots that never go away (you rarely *unload* modules, and all variable names are in the intern dict), that would suggest having a large initial size, and probably a large growth factor, to avoid spending a lot of time resizing the set.

e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').)

Also note that the performance of the "interned" dict gets even worse on 64-bit platforms, where the size of a 'dictentry' doubles but the average length of a variable name doesn't change.

Anyway, I would be happy to implement something along the lines of a "StringSet", or maybe an "InternSet", etc. I just wanted to check whether people would be interested or not.

John =:->

PS> I'm not yet subscribed to python-dev, so if you could make sure to CC me in replies, I would appreciate it.
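A rough pure-Python illustration of the single-lookup idea in (c) above (this is only a sketch of the semantics, not CPython's C implementation; the real intern() lives in stringobject.c):

    # Hypothetical pure-Python model of interning.  dict.setdefault()
    # expresses the whole find-or-insert as one operation; point (c) is
    # about doing the same with a single probe inside the C set lookup.
    _interned = {}

    def py_intern(s):
        return _interned.setdefault(s, s)

    a = py_intern(''.join(['var', '_name']))
    b = py_intern('var_name')
    assert a is b    # both names now refer to the same string object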
Re: [Python-Dev] Rethinking intern() and its data structure
...
>> Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:
>
> [snip]
>
> Memory usage is definitely something we're interested in improving. Since you've already looked at this in some detail, could you try implementing one or two of your ideas and see if it makes a difference in memory consumption? Changing from a dict to a set looks promising, and should be a fairly self-contained way of starting on this. If it works, please post the patch on http://bugs.python.org with your results and assign it to me for review.
>
> Thanks,
> Collin Winter

(I did end up subscribing, just with a different email address :)

What is the best branch to start working from? "trunk"?

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
Christian Heimes wrote:
> John Arbash Meinel wrote:
>> When I looked at the actual references from interned, I saw mostly variable names. Considering that every variable goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. (I saw a whole lot of strings as the error codes in win32all / winerror.py, and windows error codes tend to be longer-than-average variable length.)
>
> I've read your posting twice but I'm still not sure if you are aware of the most important feature of interned strings. In the first place interning is not about saving some bytes of memory but a speed optimization. Interned strings can be compared with a simple and fast pointer comparison. With interned strings you can simply write:
>
>     char *a, *b;
>     if (a == b) {
>         ...
>     }
>
> instead of:
>
>     char *a, *b;
>     if (strcmp(a, b) == 0) {
>         ...
>     }
>
> A compiler can optimize the pointer comparison much better than a function call.

Certainly. But there is a cost associated with calling intern() in the first place. You created a string, and you are now trying to de-dup it. That cost is both in the memory to track all strings interned so far, and the cost of a dict lookup. And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. I'll also note that increasing memory does have a semi-direct effect on performance, because more memory requires more time to move data back and forth between main memory and the CPU caches.

...

> I agree that a dict is not the most memory efficient data structure for interned strings. However dicts are extremely well tested and highly optimized. Any specialized data structure needs to be designed and tested very carefully. If you happen to break the interning system it's going to lead to rather nasty and hard to debug problems.

Sure. My plan was to basically take the existing Set/Dict design, and just tweak it slightly for the expected operations of "interned".

>> e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').
>
> Python 3.0 uses unicode for all names. You have to design something that can be adopted to unicode, too. By the way do you know that dicts have an optimized lookup function for strings? It's called lookdict_unicode / lookdict_string.

Sure, but so does PySet. I'm not sure about lookset_unicode, but I would guess that it exists or should exist for py3k.

>> Also note that the performance of the "interned" dict gets even worse on 64-bit platforms. Where the size of a 'dictentry' doubles, but the average length of a variable name wouldn't change.
>>
>> Anyway, I would be happy to implement something along the lines of a "StringSet", or maybe the "InternSet", etc. I just wanted to check if people would be interested or not.
>
> Since interning is mostly used in the core and extension modules you might want to experiment with a different growth rate. The interning data structure could start with a larger value and have a slower, non progressive growth rate.
>
> Christian

I'll also mention that there are other uses for intern() where it is uniquely suitable. Namely, if you are parsing lots of text with redundant strings, it is a way to decrease total memory consumption (and potentially speed up future comparisons, etc.). The main reason why intern() is useful for this is that it doesn't make strings immortal, as would happen if you used some other structure, because strings know about the "interned" object. The options for a 3rd-party structure fall into something like:

1) A cache that makes the strings immortal. (IIRC this is what older versions of Python did.)

2) A cache that is periodically walked to see if any of the objects are no longer externally referenced. The main problem here is that walking is O(all-objects), whereas doing the checking at refcount=0 time means you only check objects when you think the last reference has gone away.

3) Hijacking PyStringType->dealloc, so that when the refcount goes to 0 and Python wants to destroy the string, you trigger your own cache to drop its entry.
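To make the speed point above concrete at the Python level (a small sketch using Python 2's builtin intern(); the string values here are made up for illustration):

    # Two equal strings built independently are distinct objects...
    a = ''.join(['some_longish_', 'attribute_name'])
    b = ''.join(['some_longish_', 'attribute_name'])
    assert a == b and a is not b

    # ...but after interning, both names point at one object, so the
    # comparison can short-circuit on identity instead of comparing
    # character by character.
    a = intern(a)
    b = intern(b)
    assert a is b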
Re: [Python-Dev] Rethinking intern() and its data structure
Alexander Belopolsky wrote:
> On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel wrote:
> ...
>> a) Don't keep a double reference to both key and value to the same object (1 pointer per entry), this could be as simple as using a Set() instead of a dict()
>
> There is a rejected patch implementing just that:
> http://bugs.python.org/issue1507011

Thanks for the heads up. Reading that thread, the final rejection rested on two points:

  Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue.

That specific argument was invalid, because the patch just changed the refcount trickery to use +/- 1 instead of +/- 2. And I'm pretty sure Alexander's argument was just that +/- 2 was weird, not that the "weakref" behavior was bad.

The other argument against the patch was:

  The operation "give me the member equal but not identical to E" is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is *not* a dict with key==value.

I don't know if any consensus was reached on this, since only Martin responded that way. I can say that for my "do some work with a medium size code base" case, the overhead of "interned" as a dictionary was 1.5MB out of 20MB total memory. Simply changing it to a Set would drop this to 1.0MB; changing it to a StringSet could further drop it to 0.5MB. I have no proof about the impact on performance, since I haven't benchmarked it yet. I would guess that any performance impact would depend on whether the total size of 'interned' fits inside the L2 cache or not.

There is also a small bug in the original patch when adding the string to the set failed: it would return t == NULL, which would mean t != s, and the intern-in-place would end up setting your pointer to NULL rather than leaving it alone and clearing the error code.

So I guess some of it comes down to whether "loweis" would also reject this change on the basis that mathematically a "set is not a dict". Though his claim at the time was that "nobody else is speaking in favor of the patch", and at this point at least Collin Winter has expressed some interest.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
...
> I like your rationale (save memory) much more, and was asking in the tracker for specific numbers, which weren't forthcoming.
> ...
> Now that you brought up specific numbers, I tried to verify them, and found them correct (although a bit unfortunate), please see my test script below. Up to 21800 interned strings, the dict takes (only) 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k interned strings is "typical", I still don't know.

Given that every variable name in every file is interned, it can grow pretty rapidly. As an extreme case, consider the file "win32/lib/winerror.py", which tracks all possible win32 errors:

  >>> import winerror
  >>> print len(winerror.__dict__)
  1872

So a single error file adds 1.9k strings. My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise your code looks correct. If all I do is find the interned dict, I see:

  >>> print len(d)
  5037

So stock python, without importing much extra (just os, sys, gc, etc.), has almost 5k strings already. I don't have a great regex yet for extracting how many unique strings there are in a given bit of source code. However, if I do:

  import gc, sys

  def find_interned_dict():
      cand = None
      for o in gc.get_objects():
          if not isinstance(o, dict):
              continue
          if "find_interned_dict" not in o:
              continue
          for k, v in o.iteritems():
              if k is not v:
                  break
          else:
              assert not cand
              cand = o
      return cand

  d = find_interned_dict()
  print len(d)
  # Just import a few of the core structures
  from bzrlib import branch, repository, workingtree, builtins
  print len(d)

I start at 5k strings, and after just importing the important bits of bzrlib, I'm at 19,316. Now, the bzrlib source code isn't particularly huge. It is about 3.7MB / 91k lines of .py files (that is, without importing the test suite). Memory consumption with just importing bzrlib shows up at 15MB, with 300kB taken up by the intern dict.

If I then import some extra bits of bzrlib, like http support, ftp support, and sftp support (which brings in python's httplib, and paramiko, an ssh/sftp implementation), I'm up to:

  >>> print len(d)
  25186

Memory has jumped to 23MB (interned is now 1.57MB), and I haven't actually done anything but import python code yet. If I sum the size of the PyString objects held in intern(), it amounts to 940KB, though they refer to only 335KB of char data (or an average of 13 bytes per string).

> Wrt. your proposed change, I would be worried about maintainability, in particular if it would copy parts of the set implementation.

Right, so in the first part, I would just use Set(), as it could then save 1/3rd of the memory it uses today (dropping down to 1MB from 1.5MB). I don't have numbers on how much that would improve CPU times; I would imagine improving 'intern()' would affect import times more than run times, simply because import time interns a *lot* of strings. Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app.

John =:->
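For comparison, on a Python new enough to have sys.getsizeof() (2.6+), the container overhead of a dict versus a set holding the same strings can be measured directly. A small sketch, with the 22k string count chosen to mirror the numbers above (exact byte counts will vary with pointer width and Python version):

    import sys

    words = ['s%06d' % i for i in range(22000)]
    d = dict((w, w) for w in words)   # how "interned" is stored today
    s = set(words)                    # the proposed storage

    # getsizeof() reports only the container, not the strings it holds.
    print 'dict container: %d bytes' % sys.getsizeof(d)
    print 'set  container: %d bytes' % sys.getsizeof(s)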
Re: [Python-Dev] Rethinking intern() and its data structure
Martin v. Löwis wrote:
>> I don't have numbers on how much that would improve CPU times, I would imagine improving 'intern()' would impact import times more than run times, simply because import time is interning a *lot* of strings.
>>
>> Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app.
>
> Maybe I misunderstand your proposed change: how could the representation of the interning dict possibly change the runtime of interning? (let alone significantly)
>
> Regards,
> Martin

Decreasing memory consumption lets more things fit in cache. Once the size of 'interned' is greater than what fits into L2 cache, you start paying the cost of a full memory fetch, which is usually measured in hundreds of cpu cycles. Avoiding double lookups in the dictionary would be less of a win, since the second lookup is probably pretty fast if there are no collisions: everything would already be in the local CPU cache.

If we were dealing with objects that were KB in size, it wouldn't matter. But as the intern dict quickly gets into MB, it starts to make a bigger difference. How big a difference would be very CPU- and dataset-size specific. But certainly caches make certain things much faster, and once you overflow a cache, performance can take a surprising turn.

So my primary goal is certainly a decrease in memory consumption. I think it will have a small knock-on effect of improving performance, but I don't have anything to give concrete numbers. Also, consider that resizing has to re-insert every entry, thus paging in all X bytes and assigning to another 2X bytes; cutting X by (potentially) 3 would probably have a small but measurable effect.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
Greg Ewing wrote:
> John Arbash Meinel wrote:
>> And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object.
>
> That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module.

Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
...
>> Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms.
>
> I don't want to quash your idealism too severely, but it is extremely unlikely that you are going to get anywhere near that kind of speed up by tweaking string interning. 25k times doing anything (computation) just isn't all that much.
>
>   $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)'
>   100 loops, best of 3: 8.28 msec per loop
>
> Perhaps this isn't representative (int hashing is ridiculously cheap, for instance), but the dict itself is far bigger than the dict you are dealing with and would have similar cache-busting properties. And yet, 25k accesses (plus python->c dispatching costs which you are paying with interning) consume only ~10ms. You could do more good by eliminating a handful of disk seeks by reducing the number of imported modules...
>
> -Mike

You're also using timeit over the same set of 25k keys, which means it only has to load that subset, and as you are using identical runs each time, those keys are already loaded into your cache lines... And given how hash(int) works, they are all sequential in memory, and all 10M keys in your original set have 0 collisions. Actually, at 10M keys you'll have a dict of size 20M entries, with the first 10M entries full and the trailing 10M entries all empty.

That said, you're right, the benefits of a smaller structure are going to be small. I'll just point out what happens with a small tweak to your timing. First, the same run on my machine:

  $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)'
  100 loops, best of 3: 6.27 msec per loop

So slightly faster than yours, *but*, let's try a much smaller dict:

  $ python -mtimeit -s 'd=dict.fromkeys(xrange(25000))' 'for x in xrange(25000): d.get(x)'
  100 loops, best of 3: 6.35 msec per loop

Pretty much the same time, well within the noise margin. But if I go back to the "big dict" and actually select 25k keys spread across the whole set:

  $ TIMEIT -s 'd=dict.fromkeys(xrange(10000000))' \
           -s 'keys=range(0, 10000000, 10000000/25000)' \
           'for x in keys: d.get(x)'
  100 loops, best of 3: 13.1 msec per loop

Now I'm still accessing 25k keys, but I'm doing it across the whole range, and suddenly the time *doubled*. What about slightly more random access:

  $ TIMEIT -s 'import random; d=dict.fromkeys(xrange(10000000))' \
           -s 'bits = range(0, 10000000, 400); random.shuffle(bits)' \
           'for x in bits: d.get(x)'
  100 loops, best of 3: 15.5 msec per loop

Not as big a difference as I thought it would be... But I bet if there was a way to put the random shuffle in the inner loop, so you weren't accessing the same identical 25k keys each pass, you might get more interesting results. As for other bits about exercising caches (varying the step in the shuffled-keys setup above):

  $ shuffle(range(0, 10000000, 400))
  100 loops, best of 3: 15.5 msec per loop

  $ shuffle(range(0, 10000000, 40))
  10 loops, best of 3: 175 msec per loop

10x more keys costs 11.3x the time, pretty close to linear.

  $ shuffle(range(0, 10000000, 10))
  10 loops, best of 3: 739 msec per loop

4x the keys, 4.5x the time; starting to get more into nonlinear effects.

Anyway, you're absolutely right: intern() overhead is a tiny fraction of 'import bzrlib.*' time, so I don't expect to see amazing results. That said, accessing 25k keys in a smaller structure is 2x faster than accessing 25k keys spread across a larger structure.

John =:->
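A runnable version of the "shuffle inside the measured loop" variation wished for above might look like this. It is only a sketch using the timeit module directly: the shuffle cost is included in the measured time, so the absolute numbers are merely indicative, and the 10M-entry dict needs several hundred MB of RAM.

    import timeit

    setup = (
        "import random\n"
        "d = dict.fromkeys(xrange(10000000))\n"
        "keys = range(0, 10000000, 400)")   # 25k keys spread over the dict

    # Re-shuffle the probe order on every pass, so repeated passes do not
    # keep hitting the cache lines warmed up by the previous pass.
    stmt = "random.shuffle(keys)\nfor x in keys: d.get(x)"

    best = min(timeit.Timer(stmt, setup).repeat(3, 10))
    print '%.1f msec per pass' % (best * 1000 / 10)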
Re: [Python-Dev] Easy way to detect filesystem case-sensitivity?
Andrew Bennetts wrote:
> Antoine Pitrou wrote:
>> Robert Kern writes:
>>> Since one may have more than one filesystem side-by-side, this can't just be a system-wide boolean somewhere. One would have to query the target directory for this information. I am not aware of the existence of code that does such a query, though.
>> Or you can just be practical and test for it. Create a file "foobar" and see if you can open "FOOBAR" in read mode...
>
> Agreed. That is how Bazaar's test suite detects this, and it works well.
>
> -Andrew.

Actually, I believe we do:

  open('format', 'wb').close()
  try:
      os.lstat('FoRmAt')
  except IOError, e:
      if e.errno == errno.ENOENT:
          ...

I don't know that it really matters; I just wanted to point out that we use 'lstat' rather than 'open()' to check. I could be wrong about the test suite, but I know that is what we do for 'live' files. (We always create a format file, so we know it is there to 'stat' it via a different name.)

John =:->
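Filled out into a self-contained probe, the lstat-based check described above might look something like this (a sketch only; the 'format' probe name and the cleanup are assumptions, not Bazaar's actual code, and os.lstat reports a missing file as OSError/ENOENT):

    import errno
    import os

    def dir_is_case_sensitive(path):
        probe = os.path.join(path, 'format')
        open(probe, 'wb').close()
        try:
            try:
                os.lstat(os.path.join(path, 'FoRmAt'))
            except OSError, e:
                if e.errno == errno.ENOENT:
                    return True   # differently-cased name not found
                raise
            return False          # found it: case-insensitive lookup
        finally:
            os.remove(probe)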
Re: [Python-Dev] PEP 385: the eol-type issue
Mark Hammond wrote:
> On 5/08/2009 8:14 PM, Dirkjan Ochtman wrote:
>> endings. Typically, in my case, that was either Notepad2 (an awesomely light-weight Notepad replacement) or Komodo (Edit). That solved all of my issues, so I haven't had a need for win32text so far.
>
> FWIW, I use komodo and scite as my primary editors, and as mentioned, am personally responsible for accidentally checking in \r\n files into what should be a \n repo. I am slowly and painfully learning to be more careful - IMO, I shouldn't need to...
>
> Cheers,
>
> Mark

IIRC one of the main problems is Copy & Paste. I believe both Scite and Visual Studio have had issues where they "preserve" the line endings of files, but if you paste from another source, they will also "preserve" the line endings of the pasted content. Beyond that, you also have the "create a new file defaults to CRLF" behaviour, which causes similar problems.

John =:->
Re: [Python-Dev] VC++ versions to match python versions?
Chris Withers wrote:
> Michael Foord wrote:
>> D'oh. For 2.5 I mean. It may be *possible* though - just as you *can* build extensions for Python 2.5 on windows with mingw (with the appropriate distutils configuration), but there are pitfalls with doing this.
>
> Yes, in my case I'm trying to compile guppy (for heapy, which is an amazing tool) but that blows up with mingw...
>
> (But I'm also likely to want to do some python dev on windows, httplib download problems and all...)
>
> Chris

Guppy doesn't compile on Windows. Pretty much full-stop. It uses static references to DLL functions, which on Windows is not allowed. I've tried patching it to remove such things, and I finally got it to compile, only to have it go "boom!" in actual use. If you can get it to work, certainly post something so that I can cheer.

John =:->
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
>> I don't really know how this test works, so I won't claim to understand the results either. However, here you go:
>
> Thanks.
>
> Interesting results. I wonder what they would be like on a multi-core machine. The GIL seems to behave perfectly on your setup (no performance degradation due to concurrency, and zero latencies).

C:\downloads>C:\Python26\python.exe ccbench.py

--- Throughput ---

Pi calculation (Python)

threads=1: 691 iterations/s.
threads=2: 400 ( 57 %)
threads=3: 453 ( 65 %)
threads=4: 467 ( 67 %)
  ^- seems to have some contention

regular expression (C)

threads=1: 592 iterations/s.
threads=2: 598 ( 101 %)
threads=3: 587 ( 99 %)
threads=4: 586 ( 99 %)

bz2 compression (C)

threads=1: 536 iterations/s.
threads=2: 1056 ( 196 %)
threads=3: 1040 ( 193 %)
threads=4: 1060 ( 197 %)
  ^- seems to properly show that I have 2 cores here.

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 38 ms. (std dev: 18 ms.)
CPU threads=2: 173 ms. (std dev: 77 ms.)
CPU threads=3: 518 ms. (std dev: 264 ms.)
CPU threads=4: 661 ms. (std dev: 343 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

John =;->
Re: [Python-Dev] [Python-ideas] Remove GIL with CAS instructions?
Kristján Valur Jónsson wrote:
...
> This depends entirely on the platform and primitives used to implement the GIL. I'm interested in windows. There, I found this article:
> http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html
> So, you may be on to something. Perhaps a simple C test is in order then?
>
> I did that. I found, on my dual-core vista machine, running "release", that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however. "Fairness" was retained in debug builds too, strangely enough.
>
> Now, Python uses none of these. On windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly.
>
> The test application is attached.
>
> I think that you ought to substantiate your claims better, maybe with a specific platform and using some test like the above.
>
> On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on windows. This test ought to show the danger of that. The GIL is different than a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way, if we want very fast mutexes for the rest of the system (cc to python-dev)
>
> Cheers,
>
> Kristján

I can compile and run this, but I'm not sure I know how to interpret the results. If I understand it correctly, then everything but "Critical Sections" is fair on my Windows Vista machine.

To run, I changed the line "#define EVENT" to EVENT, MUTEX, SEMAPHORE and CRIT in turn, then built and ran in the "Release" configuration (using VS 2008 Express). For all but CRIT, I saw things like:

  thread 5532 reclaims GIL
  thread 5532 working 51234 units
  thread 5532 worked 51234 units: 1312435761
  thread 5532 flashing GIL
  thread 5876 reclaims GIL
  thread 5876 working 51234 units
  thread 5876 worked 51234 units: 1312435761
  thread 5876 flashing GIL

i.e. four lines from one thread, then four lines from the other thread. For CRIT, I instead saw something more like 50 lines from one thread, then 50 lines from the other thread.

This is Vista Home Basic, and VS 2008 Express Edition, with a 2-core machine.

John =:->
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
> Sturla Molden writes:
>> It does not crash the interpreter, but it seems it can deadlock.
>
> Kristján sent me a patch which I applied and is supposed to fix this. Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero latency with the Pi calculation in the background) than under Linux, although it may be caused by the limited resolution of time.time() under Windows.
>
> Regards
>
> Antoine.

You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result (2-core machine, python 2.6.2):

$ py ccbench.py

--- Throughput ---

Pi calculation (Python)

threads=1: 675 iterations/s.
threads=2: 388 ( 57 %)
threads=3: 374 ( 55 %)
threads=4: 445 ( 65 %)

regular expression (C)

threads=1: 588 iterations/s.
threads=2: 519 ( 88 %)
threads=3: 511 ( 86 %)
threads=4: 513 ( 87 %)

bz2 compression (C)

threads=1: 536 iterations/s.
threads=2: 949 ( 176 %)
threads=3: 900 ( 167 %)
threads=4: 927 ( 172 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 24727 ms. (std dev: 0 ms.)
CPU threads=1: 27930 ms. (std dev: 0 ms.)
CPU threads=2: 31029 ms. (std dev: 0 ms.)
CPU threads=3: 34170 ms. (std dev: 0 ms.)
CPU threads=4: 37292 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 40454 ms. (std dev: 0 ms.)
CPU threads=1: 43674 ms. (std dev: 21 ms.)
CPU threads=2: 47100 ms. (std dev: 165 ms.)
CPU threads=3: 50441 ms. (std dev: 304 ms.)
CPU threads=4: 53707 ms. (std dev: 377 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 56138 ms. (std dev: 0 ms.)
CPU threads=1: 59332 ms. (std dev: 0 ms.)
CPU threads=2: 62436 ms. (std dev: 0 ms.)
CPU threads=3: 66130 ms. (std dev: 0 ms.)
CPU threads=4: 69859 ms. (std dev: 0 ms.)
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
> Le mercredi 21 octobre 2009 à 12:42 -0500, John Arbash Meinel a écrit :
>> You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result:
> [snip]
>> --- Latency ---
>>
>> Background CPU task: Pi calculation (Python)
>>
>> CPU threads=0: 24727 ms. (std dev: 0 ms.)
>> CPU threads=1: 27930 ms. (std dev: 0 ms.)
>> CPU threads=2: 31029 ms. (std dev: 0 ms.)
>> CPU threads=3: 34170 ms. (std dev: 0 ms.)
>> CPU threads=4: 37292 ms. (std dev: 0 ms.)
>
> Well apparently time.clock() has a per-process time reference, which makes it unusable for this benchmark :-(
> (the numbers above are obviously incorrect)
>
> Regards
>
> Antoine.

I believe that time.clock() is measured as seconds since the start of the process. So yes, I think spawning a background process will reset this counter back to 0.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Vitor Bosshard wrote:
> 2009/10/23 Willi Richert:
>> Hi,
>>
>> recently I wrote an algorithm in which very often I had to get an arbitrary element from a set without removing it.
>>
>> Three possibilities came to mind:
>>
>> 1.
>>   x = some_set.pop()
>>   some_set.add(x)
>>
>> 2.
>>   for x in some_set:
>>       break
>>
>> 3.
>>   x = iter(some_set).next()
>>
>> Of course, the third should be the fastest. It nevertheless goes through all the iterator creation stuff, which costs some time. I wondered why the builtin set does not provide a more direct and efficient way of retrieving some element without removing it. Is there any reason for this?
>>
>> I imagine something like
>>
>>   x = some_set.get()
>
> I see this as being useful for frozensets as well, where you can't get an arbitrary element easily due to the obvious lack of .pop(). I ran into this recently, when I had a frozenset that I knew had 1 element (it was the difference between 2 other sets), but couldn't get to that element easily (get the pun?)

So in my testing, (2) was actually the fastest. I assume that is because .next() is a function-call overhead, while

  for x in some_set: break

is evaluated inline. It probably still has to call PyObject_GetIter, but it doesn't have to create a stack frame for it. This is what "timeit" tells me. All runs are of the form:

  python -m timeit -s "s = set([10])" ...

  0.101us  "for x in s: break; x"
  0.130us  "for x in s: pass; x"
  0.234us  -s "n = next; i = iter" "x = n(i(s)); x"
  0.248us  "x = next(iter(s)); x"
  0.341us  "x = iter(s).next(); x"

So 'for x in s: break' is about 2x faster than next(iter(s)) and 3x faster than iter(s).next(). I was pretty surprised that it was 30% faster than "for x in s: pass"; I assume it has something to do with a potential "else:" clause.

Note that all of these are significantly < 1us, so this only matters if it is something you are doing often. I don't know your specific timings, but I would guess that

  for x in s: break

is actually going to be faster than your s.get(), primarily because s.get() requires an attribute lookup. I would think your version might be faster for something like

  stat2 = "g = s.get; for i in xrange(100): g()"

However, that is still a function call, which may be treated differently by the interpreter than the for/break loop. I certainly suggest you try it and compare.

John =:->
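The inline timings above can be reproduced with a small driver script (a sketch only; the absolute numbers will of course differ by machine and Python version):

    import timeit

    STMTS = [
        'for x in s: break',
        'for x in s: pass',
        'x = next(iter(s))',
        'x = iter(s).next()',
    ]

    for stmt in STMTS:
        # repeat(5, 1000000): best total time for 1M runs == usec per run
        best = min(timeit.Timer(stmt, 's = set([10])').repeat(5, 1000000))
        print '%.3f usec  %s' % (best, stmt)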
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Terry Reedy wrote:
> John Arbash Meinel wrote:
>> So 'for x in s: break' is about 2x faster than next(iter(s)) and 3x faster than (iter(s).next()). I was pretty surprised that it was 30% faster than "for x in s: pass". I assume it has something to do with a potential "else:" statement?
>
> for x in s: pass
>
> iterates through *all* the elements in s and leaves x bound to the arbitrary *last* one instead of the arbitrary *first* one. For a large set, this would be a lot slower, not just a little.
>
> fwiw, I think the use case for this is sufficiently rare that it does not need a separate method just for this purpose.
>
> tjr

The point of my test was that it was a set with a *single* item, and 'break' was 30% faster than 'pass'. Which was surprising. Certainly the difference is huge if there are 10k items in the set.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Adam Olsen wrote:
> On Fri, Oct 23, 2009 at 11:04, Vitor Bosshard wrote:
>> I see this as being useful for frozensets as well, where you can't get an arbitrary element easily due to the obvious lack of .pop(). I ran into this recently, when I had a frozenset that I knew had 1 element (it was the difference between 2 other sets), but couldn't get to that element easily (get the pun?)
>
> item, = set_of_one

Interesting. It depends a bit on the speed of tuple unpacking, but presumably that is quite fast. On my system it is pretty darn good:

  0.101us  "for x in s: break"
  0.112us  "x, = s"
  0.122us  "for x in s: pass"

So not quite as good as the for loop, but quite close.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Martin v. Löwis wrote:
>> Hmm, perhaps when using sets as work queues?
>
> A number of comments:
>
> - it's somewhat confusing to use a set as a *queue*, given that it won't provide FIFO semantics.
> - there are more appropriate and direct container structures available, including a dedicated Queue module (even though this might be doing too much with its thread-safety).
> - if you absolutely want to use a set as a work queue, then the .pop() method should be sufficient, right?
>
> Regards,
> Martin

We were using sets to track the tips of a graph, and to compare whether one node was an ancestor of another. We were caching that answer into frozensets, since that made them immutable. So we have code like:

  res = heads(node1, node2)
  if len(res) == 1:
      # What is the 'obvious' way to get the node out?

I posit that there *isn't* an obvious way to get the single item out of a 1-entry frozenset. These all work:

  for x in res: break
  list(res)[0]
  set(res).pop()
  iter(res).next()
  [x for x in res][0]
  x, = res          # I didn't think of this one until recently

but none of them is what I would consider *obvious*. At the least, none of them is obviously better than the others, so you end up looking at the performance characteristics to give you a reason to pick one over another.

res.get() would be a fairly obvious way to do it. Obvious enough that I would probably never have gone searching for any of the other spellings. Though personally, I think I would call it "set.peek()"; the specific name doesn't really matter to me.

John =:->
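As a plain function, the proposed operation is tiny. A hypothetical helper (spelled peek() here; this is not an existing set method):

    def peek(s):
        # Return an arbitrary element without removing it; works for
        # both set and frozenset, raises if the set is empty.
        for x in s:
            return x
        raise KeyError('peek from an empty set')

    assert peek(frozenset([42])) == 42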
Re: [Python-Dev] Reworking the GIL
Sturla Molden wrote:
> Antoine Pitrou skrev:
>> It certainly is.
>> But once again, I'm no Windows developer and I don't have a native Windows host to test on; therefore someone else (you?) has to try.
>
> I'd love to try, but I don't have VC++ to build Python, I use GCC on Windows.
>
> Anyway, the first thing to try then is to call
>
>   timeBeginPeriod(1);
>
> once on startup, and leave the rest of the code as it is. If 2-4 ms is sufficient we can use timeBeginPeriod(2), etc. Microsoft is claiming Windows performs better with high granularity, which is why it is 10 ms by default.
>
> Sturla

That page claims:

  Windows uses the lowest value (that is, highest resolution) requested by any process.

I would posit that the chance of having some random process on your machine request a high-speed timer is high enough that the overhead of Python doing the same is probably low.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a setwithoutremoving it
geremy condra wrote:
> On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky wrote:
>> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser wrote:
>>> .. and "x = iter(s).next()" raises a StopIteration exception.
>> And that's why the documented recipe should probably recommend next(iter(s), default) instead. Especially because iter(s).next() is not even valid code in 3.0.
>
> This seems reasonably legible to you? Strikes me as coding by incantation. Also, while I've heard people say that the naive approach is slower, I'm not getting that result. Here's my test:
>
>   smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
>   smrt.repeat(10)
>   [1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
>    0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
>    0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
>    0.56846404075622559]
>
>   naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
>   naive.repeat(10)
>   [0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
>    0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
>    0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
>    0.53491711616516113]
>
> Perhaps my test is flawed in some way?
>
> Geremy Condra

Well, it also uses a fairly trivial set. 'set(range(100))' is going to have essentially 0 collisions; everything hashes to a unique bucket. As such, popping an item out of the set is a single "val = table[int & mask]; table[int & mask] = _dummy", and then looking it up again when it is re-added requires 2 table probes (one finds the _dummy, the next finds that the chain is broken, and the _dummy slot can be rewritten).

However, if a set is more full, or has more collisions, then pop() and add() become relatively more expensive. Surprising to me, "next(iter(s))" was actually slower than .pop() + .add() for a 100-node set in my testing:

  $ alias TIMEIT="python -m timeit -s 's = set(range(100))'"
  $ TIMEIT "x = next(iter(s))"
  1000000 loops, best of 3: 0.263 usec per loop
  $ TIMEIT "x = s.pop(); s.add(x)"
  1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:

  $ TIMEIT "for x in s: break"
  10000000 loops, best of 3: 0.0943 usec per loop

Now, what about a set with *lots* of collisions? Create 100 keys that all hash to the same bucket:

  $ alias TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in range(100)])'"
  $ TIMEIT "x = next(iter(s))"
  1000000 loops, best of 3: 0.257 usec per loop
  $ TIMEIT "x = s.pop(); s.add(x)"
  1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

  $ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
  1000 loops, best of 3: 255 usec per loop

Now something seems terribly wrong here: next(iter(s)) suddenly jumps from < 0.3 usec to more than 200 usec, or ~1000x slower. I realize the above has 900k keys, which is big, but 'next(iter(s))' should only touch 1 and, in fact, should always return the *first* entry. My best guess is that the *first* entry in the internal set table is no longer close to offset 0, which means that 'next(iter(s))' has to scan a bunch of empty table slots before it finds a non-empty entry.

Anyway, none of the proposals will really ever be faster than:

  for x in s: break

It is a bit ugly of a construct, but you don't have an attribute lookup, etc. As long as you don't do

  for x in s: pass

it stays nice and fast.

John =:->
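The empty-slot guess is easy to poke at directly: CPython's small ints hash to themselves, and set iteration scans the internal table from slot 0, so a set of large consecutive ints leaves a long empty prefix for iter() to skip over. A sketch (timings are illustrative only):

    import timeit

    SETUPS = [
        "s = set(range(100))",              # entries start right at slot 0
        "s = set(range(100000, 1000100))",  # ~900k keys, low slots all empty
    ]

    for setup in SETUPS:
        # seconds per 1000 runs, numerically equal to msec per run * 1000,
        # i.e. usec per run after multiplying by 1000.
        best = min(timeit.Timer("next(iter(s))", setup).repeat(3, 1000))
        print '%10.3f usec  %s' % (best * 1000, setup)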
Re: [Python-Dev] PEP 3003 - Python Language Moratorium
...
> A moratorium isn't cost-free. With the back-end free to change, patches will go stale over 2+ years. People will lose interest or otherwise move on. Those with good ideas but little patience will be discouraged. I fully expect that, human nature being as it is, those proposing a change, good or bad, will be told not to bother wasting their time, there's a moratorium on, at least as often as they'll be encouraged to bide their time while the moratorium is on.

I believe if you go back to the very beginning of this thread, Guido considers this a *feature*, not a *bug*. He wanted to introduce a moratorium at least partially because he was tired of endless threads about anonymous code blocks, etc., which aren't going to be included in the language anyway, so he may as well make a point to say "and neither will anything else for a while".

I don't mean to put words into his mouth, so please correct me if I'm wrong.

John =:->
Re: [Python-Dev] Splitting something into two steps produces different behavior from doing it in one fell swoop in Python 2.6.2
Roy Hyunjin Han wrote:
> While debugging a network algorithm in Python 2.6.2, I encountered some strange behavior and was wondering whether it has to do with some sort of code optimization that Python does behind the scenes.
>
> After initialization: defaultdict(<type 'set'>, {1: set([1])})
> Popping and updating in two steps: defaultdict(<type 'set'>, {1: set([1])})
>
> After initialization: defaultdict(<type 'set'>, {1: set([1])})
> Popping and updating in one step: defaultdict(<type 'set'>, {})
>
> import collections
> print ''
> x = collections.defaultdict(set)
> x[1].update([1])
> print 'After initialization: %s' % x
> items = x.pop(1)
> x[1].update(items)
> print 'Popping and updating in two steps: %s' % x
> print ''
> y = collections.defaultdict(set)
> y[1].update([1])
> print 'After initialization: %s' % y
> y[1].update(y.pop(1))
> print 'Popping and updating in one step: %s' % y

y[1].update(y.pop(1)) is going to evaluate y[1] before it evaluates y.pop(1). Which means it gets back the original set, which is then removed from the dict by y.pop, and only then updated. You would probably get the same behavior without using a defaultdict:

  y.setdefault(1, set()).update(y.pop(1))
  ^^- evaluated first

Oh, and I should probably give the standard reminder: this list is for the development *of* python, not development *with* python.

John =:->
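A quick check of that claim (a sketch with a plain dict, no defaultdict involved) shows the same surprise, purely from evaluation order:

    y = {1: set([1])}
    # setdefault() runs first and returns the existing set; pop() then
    # removes that same set from the dict, so the update modifies an
    # object the dict no longer holds.
    y.setdefault(1, set()).update(y.pop(1))
    print y    # {} -- the key is gone, just like the defaultdict case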
Re: [Python-Dev] GIL required for _all_ Python calls?
MRAB wrote:
> Hi,
>
> I've been wondering whether it's possible to release the GIL in the regex engine during matching.
>
> I know that it needs to hold the GIL during memory-management calls, but does it for calls like Py_UNICODE_TOLOWER or PyErr_SetString? Is there an easy way to find out? Or is it just a case of checking the source files for mentions of the GIL? The header file for PyList_New, for example, doesn't mention it!
>
> Thanks

Anything that Py_INCREFs or Py_DECREFs should hold the GIL, or you may get concurrent updates of the refcount, and then the final value is wrong (two threads each do 5+1, both getting 6 rather than 7, and when they decref you end up at 4 rather than back at 5).

AFAIK, the only things that don't require the GIL are macro functions, like PyString_AS_STRING or PyTuple_SET_ITEM. PyErr_SetString, for example, will be increfing and setting the exception state, so it certainly needs the GIL to be held.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Collin Winter wrote:
> Hi Dirkjan,
>
> On Wed, Jan 20, 2010 at 10:55 PM, Dirkjan Ochtman wrote:
>> On Thu, Jan 21, 2010 at 02:56, Collin Winter wrote:
>>> Agreed. We are actively working to improve the startup time penalty. We're interested in getting guidance from the CPython community as to what kind of a startup slow down would be sufficient in exchange for greater runtime performance.
>> For some apps (like Mercurial, which I happen to sometimes hack on), increased startup time really sucks. We already have our demandimport code (I believe bzr has something similar) to try and delay imports, to prevent us spending time on imports we don't need. Maybe it would be possible to do something like that in u-s? It could possibly also keep track of the thorny issues, like imports where there's an except ImportError that can do fallbacks.
>
> I added startup benchmarks for Mercurial and Bazaar yesterday (http://code.google.com/p/unladen-swallow/source/detail?r=1019) so we can use them as more macro-ish benchmarks, rather than merely starting the CPython binary over and over again. If you have ideas for better Mercurial/Bazaar startup scenarios, I'd love to hear them. The new hg_startup and bzr_startup benchmarks should give us some more data points for measuring improvements in startup time.
>
> One idea we had for improving startup time for apps like Mercurial was to allow the creation of hermetic Python "binaries", with all necessary modules preloaded. This would be something like Smalltalk images. We haven't yet really fleshed out this idea, though.
>
> Thanks,
> Collin Winter

There is "freeze":
http://wiki.python.org/moin/Freeze

Which IIRC Robert Collins tried in the past, but didn't see a huge gain. It at least tries to compile all of your python files to C files and then build an executable out of that.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Martin v. Löwis wrote:
>> There is "freeze":
>> http://wiki.python.org/moin/Freeze
>>
>> Which IIRC Robert Collins tried in the past, but didn't see a huge gain. It at least tries to compile all of your python files to C files and then build an executable out of that.
>
> "to C files" is a bit of an exaggeration, though. It embeds the byte code into the executable. When loading the byte code, Python still has to perform unmarshalling.
>
> Regards,
> Martin

Sure, though it sounds quite similar to what they were mentioning with "the creation of hermetic Python 'binaries', with all necessary modules preloaded".

My understanding was that because 'stuff' happens at import time, there isn't a lot that you can do to improve startup time. I guess it depends on what sort of state you could persist safely. And, of course, what you could get away with for a library would probably be different than what you could do with a standalone app.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
sstein...@gmail.com wrote:
> On Jan 21, 2010, at 11:32 PM, Chris Bergstresser wrote:
>> On Thu, Jan 21, 2010 at 9:49 PM, Tres Seaver wrote:
>> Generally, that's not going to be the case. But the broader point--that you've no longer got an especially good idea of what's taking time to run in your program--is still very valid.
>
> I'm sure someone's given it a clever name that I don't know, but it's kind of the profiling Heisenbug -- the information you need to optimize disappears when you turn on the JIT optimizer.
>
> S

I would assume that part of the concern is not being able to get per-line profiling out of the JIT'd code. Personally, I'd rather see one big "and I called a JIT'd function that took XX seconds" than get nothing.

At the moment, we have some small issues with cProfile in that it doesn't attribute time to extension functions particularly well. For example, I've never seen a Pyrex "__init__" function show up in the timing; the time spent always gets assigned to the calling function. So if I want to see it, I set up a 'create_foo(*args, **kwargs)' function that just does "return Foo(*args, **kwargs)". I don't remember the other bits I've run into.

But certainly I would say that giving some sort of real-world profiling is better than having it drop back to interpreted code. You could always run --profile -j never if you really wanted to profile the non-JIT'd code.

John =:->
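The wrapper trick mentioned above, spelled out with a stand-in extension type (collections.deque here is just an arbitrary C-implemented constructor used for illustration; the effect on cProfile output may differ between Python versions):

    import cProfile
    from collections import deque

    def create_deque(*args, **kwargs):
        # A pure-Python frame for cProfile to hang the construction time
        # on; calls to the C-level constructor itself tend not to get
        # their own line in the profile output.
        return deque(*args, **kwargs)

    cProfile.run("for _ in xrange(100000): create_deque()")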
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Collin Winter wrote: > Hey Jake, > ... >> Hmm. So cProfile doesn't break, but it causes code to run under a >> completely different execution model so the numbers it produces are >> not connected to reality? >> >> We've found the call graph and associated execution time information >> from cProfile to be extremely useful for understanding performance >> issues and tracking down regressions. Giving that up would be a huge >> blow. > > FWIW, cProfile's call graph information is still perfectly accurate, > but you're right: turning on cProfile does trigger execution under a > different codepath. That's regrettable, but instrumentation-based > profiling is always going to introduce skew into your numbers. That's > why we opted to improve oProfile, since we believe sampling-based > profiling to be a better model. > > Profiling was problematic to support in machine code because in > Python, you can turn profiling on from user code at arbitrary points. > To correctly support that, we would need to add lots of hooks to the > generated code to check whether profiling is enabled, and if so, call > out to the profiler. Those "is profiling enabled now?" checks are > (almost) always going to be false, which means we spend cycles for no > real benefit. At least my point was that I'd rather the machine code generated never called out to the profiler, but that at least calling the machine code itself would be profiled. That would show larger-than-minimal hot spots, but at least it wouldn't change the actual hotspots. > > Can YouTube use oProfile for profiling, or is instrumented profiling > critical? oProfile does have its downsides for profiling user code: > you see all the C-language support functions, not just the pure-Python > functions. That extra data might be useful, but it's probably more > information than most people want. YouTube might want it, though. > Does oprofile actually give you much of the python state? When I've tried that sort of profiling, it seems to tell me the C function that the VM is in, rather than the python functions being executed. Knowing that PyDict_GetItem is being called 20M times doesn't help me a lot if it doesn't tell me that it is being called in def foo(d): for x in d: for y in d: if x != y: assert d[x] != d[y] (Or whatever foolish function you do that in) I'd certainly like to know that 'foo()' was the one to attribute the 20M calls to PyDict_GetItem. Googling to search for oProfile python just gives me Unladen Swallow mentions of making oprofile work... :) > Assuming YouTube can't use oProfile as-is, there are some options: > - Write a script around oProfile's reporting tool to strip out all C > functions from the report. Enhance oProfile to fix any deficiencies > compared to cProfile's reporting. > - Develop a sampling profiler for Python that only samples pure-Python > functions, ignoring C code (but including JIT-compiled Python code). > - Add the necessary profiling hooks to JITted code to better support > cProfile, but add a command-line flag (something explicit like -O3) > that removes the hooks and activates the current behaviour (or > something even more restrictive, possibly). > - Initially compile Python code without the hooks, but have a > trip-wire set to detect the installation of profiling hooks. When > profiling hooks are installed, purge all machine code from the system > and recompile all hot functions to include the profiling hooks. > > Thoughts? 
> > Collin Winter John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
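For reference, here is the toy example from the message above in runnable form, with an arbitrary dictionary as input. It shows the level of attribution being asked for: an instrumenting profiler charges the work to the Python function, whereas a system-wide sampler mostly reports the C support routines.

import cProfile

def foo(d):
    # The quadratic toy function quoted above: much of its time ends up in
    # dict lookups (PyDict_GetItem at the C level).
    for x in d:
        for y in d:
            if x != y:
                assert d[x] != d[y]

d = dict((i, i) for i in range(500))
# cProfile reports the time against foo() itself; a sampler such as
# oprofile would mostly show the C helpers instead.
cProfile.run('foo(d)')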
Re: [Python-Dev] patch to make list.pop(0) work in O(1) time
> Right now the Python programmer looking to aggressively delete elements from > the top of a list has to consider the tradeoff that the operation takes O(N) > time and would possibly churn his memory caches with the O(N) memmove > operation. In some cases, the Python programmer would only have himself to > blame for not using a deque in the first place. But maybe he's a maintenance > programmer, so it's not his fault, and maybe the code he inherits uses lists > in a pervasive way that makes it hard to swap in deque after the fact. What > advice do you give him? > Or he could just set them to None. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
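A small sketch of the two options being weighed here, with made-up sizes:

from collections import deque

items = list(range(1000))

# "Set them to None": O(1) per element and no O(N) memmove, at the cost of
# leaving placeholder entries at the front of the list.
for i in range(100):
    items[i] = None

# The usual fix when the code can be rewritten: a deque pops from the left
# in O(1).
queue = deque(range(1000))
for _ in range(100):
    queue.popleft()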
Re: [Python-Dev] urlparse.urlunsplit should be smarter about +
Stephen J. Turnbull wrote: > David Abrahams writes: > > > > This is a bug report. bugs.python.org seems to be down. > > > > >>> from urlparse import * > > >>> urlunsplit(urlsplit('git+file:///foo/bar/baz')) > > git+file:/foo/bar/baz > > > > Note the dropped slashes after the colon. > > That's clearly wrong, but what does "+" have to do with it? AFAIK, > the only thing special about + in scheme names is that it's not > allowed as the first character. Don't you need to register the "git+file" scheme for urlparse to properly split it?

if protocol not in urlparse.uses_netloc:
    urlparse.uses_netloc.append(protocol)

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
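Sketched out, the registration being suggested looks like this; whether it resolves the reported urlunsplit behaviour may depend on the Python version, and the scheme extraction here is only illustrative:

import urlparse

url = 'git+file:///foo/bar/baz'
scheme = url.split(':', 1)[0]

# Tell urlparse that this scheme carries a network location, so the '//'
# part is split out (and put back) the same way it is for http URLs.
if scheme not in urlparse.uses_netloc:
    urlparse.uses_netloc.append(scheme)

print(urlparse.urlunsplit(urlparse.urlsplit(url)))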
Re: [Python-Dev] Reasons behind misleading TypeError message when passing the wrong number of arguments to a method
Giampaolo Rodolà wrote:
> >>> class A:
> ...     def echo(self, x):
> ...         return x
> ...
> >>> a = A()
> >>> a.echo()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: echo() takes exactly 2 arguments (1 given)
>
> I bet my last 2 cents this has already been raised in past but I want
> to give it a try and revamp the subject anyway.
> Is there a reason why the error shouldn't be adjusted to state that
> *1* argument is actually required instead of 2?

Because you wouldn't want A.echo() to say that it takes 1 argument and (-1 given)?

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
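To spell out the case being pointed at, a short sketch follows; the exact wording of the messages varies across Python versions:

class A(object):
    def echo(self, x):
        return x

a = A()

# Bound call: 'self' is counted, which is where the
# "takes exactly 2 arguments (1 given)" quoted above comes from.
try:
    a.echo()
except TypeError as e:
    print(e)

# Unbound call with nothing supplied at all: the case that would need the
# odd "(-1 given)" phrasing if 'self' were excluded from the count.
try:
    A.echo()
except TypeError as e:
    print(e)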
Re: [Python-Dev] PEP 3148 ready for pronouncement
Brian Quinlan wrote: > The PEP is here: > http://www.python.org/dev/peps/pep-3148/ > > I think the PEP is ready for pronouncement, and the code is pretty much > ready for submission into py3k (I will have to make some minor changes > in the patch like changing the copyright assignment): > http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback/python3/futures%3Fstate%3Dclosed > Your example here: for number, is_prime in zip(PRIMES, executor.map(is_prime, PRIMES)): print('%d is prime: %s' % (number, is_prime)) Overwrites the 'is_prime' function with the return value of the function. Probably better to use a different variable name. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
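A sketch of that example with the loop variable renamed; the primes list and executor setup are illustrative, loosely following the shape of the PEP's example:

import math
from concurrent import futures

PRIMES = [999983, 1000003, 1000033]

def is_prime(n):
    if n < 2 or n % 2 == 0:
        return n == 2
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    with futures.ProcessPoolExecutor() as executor:
        # 'prime' rather than 'is_prime', so the function is not shadowed
        # by its own boolean result after the first iteration.
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))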
Re: [Python-Dev] PEP 3148 ready for pronouncement
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Brian Quinlan wrote: > The PEP is here: > http://www.python.org/dev/peps/pep-3148/ > > I think the PEP is ready for pronouncement, and the code is pretty much > ready for submission into py3k (I will have to make some minor changes > in the patch like changing the copyright assignment): > http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback/python3/futures%3Fstate%3Dclosed > > The tests are here and pass on W2K, Mac OS X and Linux: > http://code.google.com/p/pythonfutures/source/browse/branches/feedback/python3/test_futures.py > > The docs (which also need some minor changes) are here: > http://code.google.com/p/pythonfutures/source/browse/branches/feedback/docs/index.rst > > Cheers, > Brian > I also just noticed that your example uses: zip(PRIMES, executor.map(is_prime, PRIMES)) But your doc explicitly says: map(func, *iterables, timeout=None) Equivalent to map(func, *iterables) but executed asynchronously and possibly out-of-order. So it isn't safe to zip() against something that can return out of order. Which opens up a discussion about how these things should be used. Given that your other example uses a dict to get back to the original arguments, and this example uses zip() [incorrectly], it seems that the Futures object should have the arguments easily accessible. It certainly seems like a common use case that if things are going to be returned in arbitrary order, you'll want an easy way to distinguish which one you have. Having to write a dict map before each call can be done, but seems unoptimal. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkv2cugACgkQJdeBCYSNAAPWzACdE6KepgEmjwhCD1M4bSSVrI97 NIYAn1z5U3CJqZnBSn5XgQ/DyLvcKtbf =TKO7 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
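For what it's worth, a sketch of the dict-keyed-by-future pattern alluded to above, which keeps the original argument reachable no matter what order results arrive in (the names here are made up):

from concurrent import futures

def work(n):
    return n * n

inputs = [1, 2, 3, 4, 5]

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Key each Future by the argument that produced it, so results can be
    # matched back to their inputs even when they complete out of order.
    future_to_arg = dict((executor.submit(work, n), n) for n in inputs)
    for future in futures.as_completed(future_to_arg):
        print('work(%r) -> %r' % (future_to_arg[future], future.result()))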
Re: [Python-Dev] email package status in 3.X
... >> IOW, if you're producing output that has to go into another system >> that doesn't take unicode, it doesn't matter how >> theoretically-correct it would be for your app to process the data in >> unicode form. In that case, unicode is not a feature: it's a bug. >> > This is not always true. If you read a webpage, chop it up so you get > a list of words, create a histogram of word length, and then write the output > as > utf8 to a database. Should you do all your intermediate string operations > on utf8 encoded byte strings? No, you should do them on unicode strings as > otherwise you need to know about the details of how utf8 encodes characters. > You'd still have problems in Unicode, given stuff like å =~ å where one is u'\xe5' and the other is u'a\u030a'. (Those will look the same depending on your Unicode support: IDLE shows them pretty much the same, while Thunderbird on Windows with my current font shows the second as 2 characters.) I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying "decode it to Unicode" isn't really sufficient. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
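A small sketch of the equality wrinkle being described, using unicodedata normalization:

import unicodedata

precomposed = u'\xe5'      # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = u'a\u030a'    # 'a' followed by COMBINING RING ABOVE

# Visually identical in many fonts, but not equal as code point sequences.
print(precomposed == decomposed)
# Canonical normalization makes the comparison behave as a human expects.
print(unicodedata.normalize('NFC', precomposed) ==
      unicodedata.normalize('NFC', decomposed))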
Re: [Python-Dev] versioned .so files for Python 3.2
Scott Dial wrote: > On 6/30/2010 2:53 PM, Barry Warsaw wrote: >> It might be amazing, but it's still a significant overhead. As I've >> described, multiply that by all the py files in all the distro packages >> containing Python source code, and then still try to fit it on a CDROM. > > I decided to prove to myself that it was not a significant issue to have > parallel directory structures in a .tar.bz2, and I was surprised to find > it much worse at that then I had imagined. For example, > > # cd /usr/lib/python2.6/site-packages > # tar --exclude="*.pyc" --exclude="*.pyo" \ > -cjf mercurial.tar.bz2 mercurial > # du -h mercurial.tar.bz2 > 640Kmercurial.tar.bz2 > > # cp -a mercurial mercurial2 > # tar --exclude="*.pyc" --exclude="*.pyo" \ > -cjf mercurial2.tar.bz2 mercurial mercurial2 > # du -h mercurial.tar.bz2 > 1.3Mmercurial2.tar.bz2 > I believe the standard (and largest) block size for .bz2 is 900kB, and I *think* that is uncompressed. Though I know that bz2 can chain, since it can compress all NULL bytes extremely well (multiple GB down to kB, IIRC). There was a question as to whether LZMA would do better here, I'm using 7zip, but .xz should perform similarly. $ du -sh mercurial* 2.6Mmercurial 2.6Mmercurial2 366K mercurial.tar.bz2 734K mercurial2.tar.bz2 303K mercurial.7z 310K mercurial2.7z So LZMA with the 'normal' compression has a big enough window to find almost all of the redundancy, and 310kB is certainly a very small increase over the 303kB. And clearly bz2 does not, since 734kB is actually slightly more than 2x 366kB. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
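The block-size effect is easy to reproduce from Python. A rough sketch follows; the sizes are chosen so the duplicate falls outside bz2's roughly 900kB block, and the lzma module is only in the stdlib from 3.3 onwards (or available as a backport):

import bz2
import lzma
import os

blob = os.urandom(1024 * 1024)   # ~1MiB of incompressible data
doubled = blob + blob            # an exact byte-for-byte duplicate appended

# bz2 compresses in ~900kB blocks, so it cannot "see" the repeat and the
# doubled input should come out at roughly twice the size of the original.
print(len(bz2.compress(blob)), len(bz2.compress(doubled)))

# LZMA's much larger window should spot the repeat, so the doubled input
# should be only slightly larger than the single copy.
print(len(lzma.compress(blob)), len(lzma.compress(doubled)))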
[Python-Dev] Intended behavior of backlash in raw strings
I'm trying to determine if this is intended behavior:

>>> r"\""
'\\"'
>>> r'\''
"\\'"

Normally, the quote would end the string, but it gets escaped by the preceding '\'. However, the preceding slash is interpreted as 'not a backslash' because of the raw indicator, so it gets left in verbatim. Note that it works anywhere:

>>> r"testing \" backslash and quote"
'testing \\" backslash and quote'

It happens that this is the behavior I want, but it seemed just as likely to be an error. I tested it with python2.5 and 2.6 and got the same results. Is this something I can count on? Or is it undefined behavior and I should really not be doing it?

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Fixing #7175: a standard location for Python config files
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > * that said, Windows seems much slower than Linux on equivalent >hardware, perhaps attempting to open files is intrinsically more >expensive there? Certainly it's not safe to assume conclusions drawn >on Linux will apply equally well on Windows, or vice versa. I don't know what the specific issue is here, but adding entries to sys.path makes startup time *significantly* slower. I happen to use easy_install since Windows doesn't have its own package manager. Unfortunately the default of creating a new directory and adding it to easy_install.pth is actually pretty terrible. On my system, 'len(sys.path)' is 72 entries long. 62 of that is from easy-install. A huge amount of that is all the zope and lazr. dependencies that are needed by launchpadlib (not required for bzr itself.) With a fully hot cache, and running the minimal bzr command: time bzr rocks --no-plugins real 0m0.395s vs real 0m0.195s So about 400ms to startup versus 200ms if I use the packaged version of bzr (which has a very small search path). John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkxlm0kACgkQJdeBCYSNAAMSEgCfW24XNG3h20UkFdEODNMob6uR nisAoLes/usoHd1YRDIkzxfIJohPjSer =YO9b -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Mercurial Schedule
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 11/19/2010 7:50 AM, Nick Coghlan wrote: > On Fri, Nov 19, 2010 at 5:43 PM, Georg Brandl wrote: >> Am 19.11.2010 03:23, schrieb Benjamin Peterson: >>> 2010/11/18 Jesus Cea : -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 18/11/10 18:32, "Martin v. Löwis" wrote: > In general, I'm *also* concerned about the lack of volunteers that > are interested in working on the infrastructure. I wish some of the > people who stated that they can't wait for the migration to happen > would work on solving some of the remaining problems. Do we have a exhaustive list of mercurial "to do" things?. >>> >>> http://hg.python.org/pymigr/file/1576eb34ec9f/tasks.txt >> >> Uh, that's the list of things to do *at* the migration. The todo list is >> >> http://hg.python.org/pymigr/file/1576eb34ec9f/todo.txt > > That kind of link is the sort of thing that should really be in the > PEP... (along with the info about where to find the hooks repository, > specific URLs for at least 3.x, 3.1 and 2.7, pointers to a draft FAQ > to replace the current SVN focused FAQ, etc) > Well, if it goes in the pep, you should at least use the 'always the most recent' version :) http://hg.python.org/pymigr/file/tip/todo.txt John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkzmp/gACgkQJdeBCYSNAAOwjgCeOda2XeNvxOR0UnFuQOfN0zZt jGIAoIuarrvIz3oQ+o1jtnH5dFoFk35t =JJo8 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] svn outage on Friday
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 2/15/2011 8:03 AM, Benjamin Peterson wrote: > 2011/2/15 Victor Stinner : >> Le mardi 15 février 2011 à 09:30 +0100, "Martin v. Löwis" a écrit : >>> I'm going to perform a Debian upgrade of svn.python.org on Friday, >>> between 9:00 UTC and 11:00 UTC. I'll be disabling write access during >>> that time. The outage shouldn't be longer than an hour. >> >> It's time to move to Mercurial! :-) > > And doubtless there will be times when Mercurial must be upgraded, too... > True, but on those days you just keep committing locally... John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk1aqpkACgkQJdeBCYSNAAMW3gCgt/Ea75R/HfKM4KlmGmCmfjtL BBYAoI89GYsxrsa4/Eefifg3i6+Euv+T =Kz3A -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Actual Mercurial Roadmap for February (Was: svn outage on Friday)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 2/22/2011 9:41 AM, anatoly techtonik wrote: > On Fri, Feb 18, 2011 at 4:00 PM, Dirkjan Ochtman wrote: >> On Fri, Feb 18, 2011 at 14:41, anatoly techtonik wrote: >>> Do you have a public list of stuff to be done (i.e. Roadmap)? >>> BTW, what is the size of Mercurial clone for Python repository? >> >> There is a TODO file in the pymigr repo (though I think that is >> currently inaccessible). > > Can you provide a link? I don't know where to search. Should we open a > src.python.org domain? > >> I don't have a recent optimized clone to check the size of, yet. > > What is the size of non-optimized clone then? I know that a clone of > such relatively small project as MoinMoin is about 250Mb. ISTM that > Python repository may take more than 1GB, and that's not acceptable > IMHO. BTW, what do you mean by optimization - I hope not stripping the > history? Mercurial repositories are sensitive to the order that data is inserted into them. So re-ordering the topological insert can dramatically improve compression. The quick overview is that in a given file's history, each diff is computed to the previous text in that file. So if you have a history like: foo | \ foo baz bar foo | / baz foo bar This can be stored as either: foo +bar -bar +baz +bar This matters more if you have a long divergent history for a while: A |\ B C | | D E | | F G : : X Y |/ Z In this case, you could end up with contents that look like: A +B +D +F +X -BDFX+C +E +G +Y +ABDFXZ Or you could have the history 'interleaved': A +B -B+C -C+BD -BD+CE -BDF+CEG -... There are tools that take a history file, and rewrite it to be more compact. I don't know much more than that myself. But especially with something like an svn conversion which probably works on each revision at a time, where people are committing to different branches concurrently, I would imagine the conversion could easily end up in the pessimistic case. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk1j6qoACgkQJdeBCYSNAAPzPgCdEOJsHf4Xf4lZH+jHX42FQb8J sQoAn3JuCmDcsyv0JZpXtbVJoGewA+7t =M8DI -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PyCObject_AsVoidPtr removed from python 3.2 - is this documented?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/7/2011 3:56 AM, Terry Reedy wrote: > On 3/6/2011 6:09 PM, Barry Scott wrote: >> I see that PyCObject_AsVoidPtr has been removed from python 3.2. >> The 3.2 docs do not seem to explain this has happened and what >> to replace it with. >> >> I searched the 3.2 docs and failed to find PyCObject_AsVoidPtr. >> I looked at the whats new page and the API PEP. Did I miss >> where this is documented? > > Georg recently reaffirmed on a tracker issue that when something is > removed from the code, it is removed from the docs also. So the place to > look for a deprecation notice and replacement suggestion is in the last > release where present. > It might be interesting to just have a stub entry with: PyCObject_AsVoidPtr (This was deleted in 3.2, last available in [3.1]) Might end up being too cluttered, but certainly helps the people who hit the problem. Especially since, AIUI, deprecations are suppressed by default now. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk10mPEACgkQJdeBCYSNAAMNCwCfUIm79vsW7KSuibBRLZUFA4P2 VooAn1Muo6yeciMBSO+ndlaq10VE5lxV =ewPb -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Suggest reverting today's checkin (recursive constant folding in the peephole optimizer)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > I have always felt uncomfortable with *any* kind of optimization -- > whether AST-based or bytecode-based. I feel the cost in code > complexity is pretty high and in most cases the optimization is not > worth the effort. Also I don't see the point in optimizing expressions > like "3 * 4 * 5" in Python. In C, this type of thing occurs frequently > due to macro expansion and the like, but in Python your code usually > looks pretty silly if you write that. Just as a side comment here, I often do this sort of thing when dealing with time-based constants, or large constants. It is a bit more obvious that 10*1024*1024 is 10MiB than that 10485760 is, especially since you can't put commas in your constants. 10,485,760 would at least make it immediately obvious it was 10M not 104M or something else. Similarly, is 10,800s 3 hours? 3*60*60 certainly is. I don't think I've done that sort of thing in anything performance critical. But I did want to point out that writing "X*Y*Z" as a constant isn't always "pretty silly". John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk17c0sACgkQJdeBCYSNAAMUrQCdEhissWuvTElIc6Wy/2qotzaU xz4AnRO+ND/3NkKWC7Bbu78ACjs2X920 =QR/2 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
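A couple of lines make the point; the names are invented, and the peephole optimizer folds these expressions down to plain constants at compile time anyway:

# 10 MiB: the units stay visible, unlike 10485760.
CACHE_SIZE = 10 * 1024 * 1024
# 3 hours in seconds: easier to verify than 10800.
TIMEOUT = 3 * 60 * 60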
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/20/2011 5:06 AM, R. David Murray wrote: > On Thu, 17 Mar 2011 14:33:00 +0100, wrote: >> On Thu, 17 Mar 2011 09:24:26 -0400 >> "R. David Murray" wrote: >>> >>> It would be great if rebase did work with share, that would make a >>> push race basically a non-issue for me. >> >> rebase as well as strip destroy some history, meaning some of your >> shared clones may end up having their working copy based on a >> non-existent changeset. I'm not sure why rebase would be worse that >> strip in that regard, though. > > Well, it turns out that this completely doesn't work, though at first > it appeared to (and so I pushed). > > I had a push race, so I did hg pull; hg rebase. Then I looked at the > log, and I could (apparently) see my change sets on the top of the > stack. So I pushed. Victor then asked why one of my commits deleted > Tools/demo/README, and then the next commit restored it. > > What I was attempting to push was a doc change in 3.1 that I had then > merged to 3.2 and default. What I saw when looking closer at the log > (after Victor pointed it out) was that my merge commits had lost their > parents. > > I thought that at worst a rebase would screw up my local history, but > apparently I managed to push some sort of damaged history. The doc > change only got applied to default, since that's the branch I > happened to be in when I did the rebase. > > Needless to say, I'm avoiding rebase henceforth. AIUI, rebase defaults to always omitting merge changesets. Under the assumption that the branch you would merge is the one you are targeting to rebase upon. So those merges are 'not interesting' once you are rebased. Obviously this has failure cases (when the branch being merged is not the one you are targeting.) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2GSdMACgkQJdeBCYSNAAMf5wCgox24+LoJRKtzJHmCFTWcZnjI MwIAniISqH9xDR/9g5EiXEsg5Wk66jeN =39Oi -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:44 AM, "Martin v. Löwis" wrote: >> My understanding is that svn does not detect fast forwards, only lack >> of conflicts, and therefore in case of concurrent development it is >> possible that the repository contains a version that never existed in >> any developer's workspace. > > I can't understand how you draw this conclusion ("therefore"). > > If you do an svn up, it merges local changes with remote changes; > if that works without conflicts, it tells you what files it merged, > but lets you commit. > > Still, in this case, the merge result did exist in the sandbox > of the developer performing the merge. Subversion never ever creates > versions in the repository that didn't before exist in some working > copy. The notion that it may have done a server-side merge or some > such is absurd. It does so at the *tree* level, not at an individual file level. 1) svn doesn't require you to run 'svn up' unless there is a direct change to the file you are committing. So there is plenty of opportunity to have cross-file failures. The standard example is I change foo.py's foo() function to add a new mandatory parameter. I 'svn up' and run the test suite, updating all callers to supply that parameter. You update bar.py to call foo.foo() not realizing I'm updating it. You 'svn up' and run the test suite. Both my test suite and your test suite were perfect. So we both 'svn commit'. There is now a race condition. Both of our commits will succeed. (revisions 10 and 11 lets say). Revision 11 is now broken (will fail to pass the test suite.) 2) svn's default lack of tree-wide synchronization means that no matter how diligent we are, there is a race condition. (As I'm pretty sure both 'svn commit' can run concurrently since they don't officially modify the same file.) 3) IIRC, there is a way to tell "svn commit", "this is my base revno, if the server is newer, abort the commit". It is used by bzr-svn to ensure we always commit tree-wide state safely. However, I don't think svn itself makes much use of it, or makes it easy to use. Blindly merging in trunk / rebasing your changes has the same hole. Though you at least can be aware that it is there, rather than the system hiding the fact that you were out of date. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2HMxoACgkQJdeBCYSNAAMaewCfW3DK8hW4hBKOA+5zbyaxyptH MMQAoKGw2uWUWafBK2+Jl5A6XMK+0z9f =5R0+ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:48 PM, "Martin v. Löwis" wrote: >> It does so at the *tree* level, not at an individual file level. > > Thanks - I stand corrected. I was thinking about the file level only (at > which it doesn't do server-side merging - right?). > > Regards, > Martin AIUI, you are correct, svn doesn't do server-side content merging. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbC4ACgkQJdeBCYSNAANhCQCgpmKnD7QmZQ8Cv1DGPkJw22Q0 /uYAoMc0VNoLq2VMFpu3+uzS2M93x08P =TL8l -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 2:05 AM, Terry Reedy wrote: > On 3/21/2011 7:14 AM, Nick Coghlan wrote: > >> hg broadens the check and complains if *any* files are not up to date >> on any of the branches being pushed, thus making it a requirement to >> do a hg pull and merge on all affected branches before the hg push can >> succeed. In theory, this provides an opportunity for the developer >> doing the merge to double check that it didn't break anything, in >> practice (at least in the near term) we're more likely to see an >> SVN-like practice of pushing the merged changes without rerunning the >> full test suite. > > Now that you and John have (finally) explained how 'non-conflict' merges > can actually contain a conflict (and there could be such for docs as > well as code*), I think there is a pretty clear guideline for when to > re-test. > > If my change adds or changes in one file a reference to something in > another file, or changes or subtracts in one file something that might > be referenced by other files, and the the change could affect the > cross-file linkage, and the pulled changes merged with my changes might > have such linkages, then I should rerun tests on the new merged state. > (I say 'might' because it is easier and safer to just rerun than to > check very hard.) Otherwise, it should be safe not to. Correct? > I would agree with 'might' with the very large caveat that cross-file linkage tends to exist in far more places than most people think. I know I've done a few "single line fixes" where the test suite showed me that it wasn't as simple as that. :) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbIUACgkQJdeBCYSNAAPsbQCbB53ctC5DQ8hQo1U/1crNsxLc d5EAoIF1/x9hBW2z4X9EfGNaaGM3V8A+ =2HN+ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:32 PM, Nick Coghlan wrote: > On Tue, Mar 22, 2011 at 3:16 AM, Raymond Hettinger > wrote: >> I don't think that is the main source of complexity. >> The more difficult and fragile part of the workflows are: >> * requiring commits to be cross-linked between branches >> * and wanting changesets to be collapsed or rebased >> (two operations that destroy and rewrite history). > > Yep, that sounds about right. I think in the long run the first one > *will* turn out to be a better work flow, but it's definitely quite a > shift from our historical way of doing things. > > As far as the second point goes, I'm coming to the view that we should > avoid rebase/strip/rollback when intending to push to the main > repository, and do long term work in *separate* cloned repositories. > Then an rdiff with the relevant cpython branch will create a nice > collapsed patch ready for application to the main repository (I have > yet to succeed in generating a nice patch without using rdiff, but I > still have some more experimentation to do with MvL's last proposed > command for that before giving up on the idea). > > Cheers, > Nick. > I don't know what mercurial specifically supports. I believe git has a '--squash' option at merge/commit time. And bzr has "bzr revert - --forget-merges". Which lets you do a merge as normal, and then tell it to forget all of the history that you just merged (treating the commit as just a collapsed patch). It is trivial to do this as a DVCS (it is just *omitting* the extra parent pointer for commit). Though Mercurial's model of extra heads existing in the branch may make it a bit trickier. (If you omit the head when committing, it still stays around looking like it is unmerged, so you need 1 more step of killing the extra head.) Regardless, I'm sure it is something that could be implemented and streamlined for Python's use. Maybe someone knows a Mercurial command to already do it? John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbgIACgkQJdeBCYSNAANAlACfZkegH6t9y0PUH9xufcbCB4PX 8ykAn0A6i7D/+LJ+9+9OwoA27hkAiHUc =eh4I -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 6:53 PM, Barry Warsaw wrote: > On Mar 21, 2011, at 01:19 PM, R. David Murray wrote: > >> So you are worried about the small window between me doing an 'svn up', >> seeing no changes, and doing an 'svn ci'? I suppose that is a legitimate >> concern, but considering the fact that if the same thing happens in hg, >> the only difference is that I know about it and have to do more work[*], >> I don't think it really changes anything. Well, it means that if your >> culture uses the "always test" workflow you can't be *perfect* about it >> if you use svn[**], which I must suppose has been your (and Stephen's) >> point from the beginning. >> >> [*] that is, I'm *not* going to rerun the test suite even if I have to >> pull/up/merge, unless there are conflicts. > > I think if we really want full testing of all changesets landing on > hg.python.org/cpython we're going to need a submit robot like PQM or Tarmac, > although the latter is probably too tightly wedded to the Launchpad API, and I > don't know if the former supports Mercurial. > > With the benefits such robots bring, it's also important to understand the > downsides. There are more moving parts to maintain, and because landings are > serialized, long test suites can sometimes cause significant backlogs. Other > than during the Pycon sprints, the backlog probably wouldn't be that big. > > Another complication we'd have is running the test suite cross-platform, but I > suspect that almost nobody does that today anyway. So the buildbot farm would > still be important. > > -Barry I'm personally a huge fan of 2(multi)-tier testing. So you have a basic (and fast) test suite that runs across all your modules before every commit in your mainline. Then a much larger (and slower) test suite that runs regression testing across all platforms/etc that runs asynchronously. Which gives you some basic protection against brown-bag failures (you committed a typo in the main Python.h file, breaking everyone). And still avoids a huge pushback on throughput. I think Launchpad is currently looking to do batch-PQM. So that every commit to the final mainline must pass the full test suite. However the automated bot can grab multiple requests from the queue at a time, on the premise that 90% of the time, none of them will break anything. So you end up with a 100% stable trunk (any given revision committed by the bot did pass the full test suite), but still get most of the throughput. Also, by working in batch mode, if you have 20 submissions, and submission #2 would have broken the test suite, it only bumps some (say the first 5) submissions, and the other 15 still get to land in an orderly fashion. You could even put any bumped submissions into a deferred 'one-by-one' queue. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Ib6QACgkQJdeBCYSNAAPJWwCggyeS5DZlm/DR7bo+1AmpD9rr YmMAoLFmmu7VBTJJX/khyigaOPU9dDE9 =68rK -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 9:19 PM, Barry Warsaw wrote: > On Mar 21, 2011, at 11:56 AM, Daniel Stutzbach wrote: > >> Keeping the repository clean makes it easier to use a bisection search to >> hunt down the introduction of a bug. If every developer's intermediate >> commits make it into the main repository, it's hard to go back to an older >> revision to test something, because many of the older revisions will be >> broken in some way. > > So maybe this gets at my earlier question about rebase being cultural > vs. technology, and the response about bzr having a strong sense of mainline > where hg doesn't. > > I don't use the bzr-bisect plugin too much, but I think by default it only > follows commits on the main line, unless a bisect point is identified within a > merge (i.e. side) line. So again, those merged intermediate changes are > mostly ignored until they're needed. > > -Barry Bazaar is, to my knowledge, the only DVCS that has a "mainline" emphasis. Which shows up in quite a few areas. The defaults for 'log', having branch-stable revnos[1], and the 'bzr checkout' model for managing a mainline. The cultural aspects of this also show up. Since we default to not showing merge commits in 'bzr log' (or even when shown, they are shown indented), there is less impetus to remove them by default.[2] Guido mentioned that when he does long-term development, there are a lot of advantages to having intermediate snapshots that he can roll back to, which have questionable benefit in being in the final repository, since some states were dead-ends that weren't pursued. I certainly can understand that, but there is at least an argument that preserving "this approach is a dead-end" also has some merit. If someone comes to you and says "why didn't you implement it this way", you can point to it and say "I tried, it didn't work". Which would also give them a point to start if they really think it is an avenue to pursue further. I also remember something my Math teacher once said. That some of the early proofs were so polished that nobody knew how to "reproduce" them. Sure, anyone could follow the final proof and say "yes that is correct", but nobody could *learn* from the proof, because it was missing those human-level steps of intuition that helped understand *how* the proof was developed, rather than just what the final state was. That is not to say that the python.org primary repository should be a teaching repository. However, I know that I'm personally quite curious to see how Guido does his work. Insight into the minds of how interesting people do interesting things and all. Another key point of how tools influence and interact with culture. Because bzr has a strong mainline bias, it leads to people interacting with the tool differently, which strengthens and reinforces it. In Mercurial, it would be trivial to add a "hg log --only-mainline" that would always follow a specific parent and show that to you. However, because Mercurial doesn't default to that view, people don't try to preserve the mainline. For example, in these graphs are logically the same, but result in a very different mainline view: A -- B -- C \\ D -- E A -- B -- C -- E \ / D -- If you consider D as "I did my work" and E as "and I integrated that back into trunk". 
If you merge the trunk revision C into E, and then push, you end up with this graph: A -- B -- D -- E \ / C -- And suddenly the revision which was an "important" C change is now gone on the mainline, and your personal "and I did D" is now a primary revision. It doesn't matter much for a single revision D and C, but for anything long lived, you end up with 100 Guido exploratory D revisions, and some 50+ other python-dev super-stable trunk C revisions. And unless the tool helps you preserve the ordering[3], you really don't want to trying to treat them with different levels of authority. Hence, you tend to collapse, because you really can only trust "E" to be a final stable change. John =:-> [1] any copy of "trunk" has the same revision-id matching revno 1234, in Mercurial the numbers in 'hg log' correspond to the ordering in the physical repository, so depend on what order revisions were merged, etc. [2] The downside is people having their work merged and then wondering "where did my commits go", and it looking like this guy named Patch was an extremely productive developer of Launchpad and Bazaar (Patch Queue Manager.) [3] In Bazaar, you can set 'append_revisions_only = True' for integration branches. Which will refuse to push E if it would remove C as a mainline revision. -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Id5sACgkQJdeBCYSNAAMnjgCbBFiMtdkj8hvJ19dPn3Maz3Bo TrwAmwfgmg0YMGaCPM+W+kAVVDVvrOlY =6oWG -END PGP SIGNATURE-
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 3:22 PM, Barry Warsaw wrote: > On Mar 22, 2011, at 12:07 PM, Adrian Buehlmann wrote: > >> FWIW, Mercurial's "mainline" is the branch with the name 'default'. This >> branch name is reserved, and it implies that the head with the highest >> revision number from that branch will be checked out on 'hg clone'. > > I think that's different than what John was describing, or perhaps Python's > use of it has the effect of being different. IIUC, in Mercurial, within the > default branch there's no clear "main line" of development assigned to a path > within the DAG. All paths are created equal, so it's not possible to > e.g. have log or bisect suppress one path containing feature sub-commits from > the point of departure to the point of recombination (merge). > > -Barry If you think of "mainline" as "trunk" or "master", then yes, Mercurial's concept is "default". If you think of "mainline" as a series of commits in a linear sequence, then there isn't specifically a "main" "line" in Mercurial's world view. In Bazaar, the sequence of first-parents in the history of a branch's tip is considered special. Based on the concept that: cd me bzr merge ../you bzr commit -m "I merged you" Is not the same *social* thing as: cd ../you bzr merge ../me bzr commit -m "You merged me" Logically, the contents of the tree should be the same. The name of the person doing it is often different. If you have "blessed branch" approach to development (which almost all projects do at some level or other) then merging something into the "blessed branch" is not the same as merging the "blessed branch" into your personal branch. For python, the "blessed branch" is http://hg.python.org/cpython. Consider, if someone merges/pushes a change to that branch, it has very different social consequences than if someone merges that branch into their own feature branch. Namely, it is going to be part of the next released tarball. Note that hg does distinguish it a little bit. If you look here: http://hg.python.org/cpython/graph There is a clear separation of what was merged into the line on the left, versus what was committed elsewhere. However, there is no distinction on: http://hg.python.org/cpython/shortlog Note also that because 'hg log' doesn't default to only showing you things on a given branch, you can end up with log views like this: http://hg.python.org/cpython/graph/693415328363 Where "Relax timing check" has nothing to do with "Issue #10833", but because the data in the repository has Relax Timing Check placed after Issue #10833 physically on disk, they end up both getting shown in the log view. It certainly is much worse in http://hg.python.org/cpython/shortlog/693415328363 Where there is no way to tell that the revisions are unrelated, and just happen to physically reside in same repository. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Iu5kACgkQJdeBCYSNAAMVQACgpGd53yMnQBjJXuoVLElxC6qN OqwAmwTdxjIS5tjkf0+iK62DvT/uPLdz =1YLF -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 3:25 PM, Barry Warsaw wrote: > On Mar 22, 2011, at 12:52 AM, Éric Araujo wrote: > >> Bazaar apparently has a notion of mainline whereas Mercurial believes >> that all changesets are created equal. The tools are different. > > I'm curious: what are the benefits of the Mercurial model? > > -Barry > - From discussions in #bzr, I would say the primary complaint to our model is that people feel it means that not all revisions are created equal. Though honestly, that is true. Getting a revision into Linus's kernel tree is not the same as getting it into my own branch of the kernel. I do remember discussions where people felt bad that after integrating someone else's work, by default it showed their name, rather than the names of the people who had done the actual work. (Though you could always use --author if you really wanted to change the name on that commit.) There is a technical argument. A merging B should be idempotent to B merging A. And certainly, the file-content should be the same after either operation. But as I mentioned in the other email, Barry merging my work and committing it to the hg.python.org/cpython branch is very different socially than me merging cpython's branch. So bzr is somewhat distinguishing "things done on this branch" from "things done in other branches that I have included into this branch.". Simplicity. If you enforce the mainline concept with "append_revisions_only=True" on your trunk branches. Then people who try to do: cd my-work merge trunk && commit push trunk Will be blocked. While if you did cd trunk merge my-work && commit push trunk It would succeed. If a tool doesn't give you a benefit from maintaining a mainline, there is overhead in preserving it. If your tool defaults to fast-forward merging, it is also really hard to get the behavior. (git does this with a config option to disable it, bzr has it as an option to merge [defaulting to off], and I'm not sure what Mercurial's default is.) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Ivi4ACgkQJdeBCYSNAAOhzwCg0tD1KdR53fH7OEzhom0IaPOL niYAn2KsY2jPLJmbXWf8sIauMW+y2hHC =NFdA -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Workflow proposal
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/23/2011 4:30 AM, Stephen J. Turnbull wrote: > Antoine Pitrou writes: > > > Now, "hg strip" should definitely be absent of any recommended or even > > suggested workflow. It's a power user tool for the experimented > > developer/admin. Not the average hg command. > > So what you're saying is that Mercurial by itself can't support the > recommended workflow, because any "collapsing" of commits requires > stripping, whether done by hg strip or implicitly by some other > "non-average" hg command. > Just as an aside, and I might be wrong. But reading through what strip does, (and from my knowledge of the disk layout) it can't actually be atomic. So if you kill it at the wrong time, it will have corrupted your repository. At least, that is from what I can tell. Which is that it has to rewrite each file history to omit the revisions you are stripping, and then rewrite the revision history to do the same. It would be possible to be 'stable' if you wrote a write-ahead-log and did all the work on the side, and then any client that tries to read or write to the repository finishes up the steps. But the individual file histories refer to the global revision history (by index), so you don't have a 'top-down' view that makes it all atomic by changing the top level object to point to the new lower level objects. It is possible that they only rewrite the revisions file, leaving blanks for the old references. But the statement "it rewrites the numbers" means it is collapsing the offsets in the index. http://mercurial.selenic.com/wiki/Strip I definitely would be very leery of putting that in any sort of recommended documentation. It also makes me understand more why hg folks value having "hg check" run very quickly... John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J2/0ACgkQJdeBCYSNAAPFCACfeGHMB3/td31EuyateqzcoNFx 87sAoKxDt8i1rqllHogRBMxTGUDzSsdd =RgPe -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 11:05 PM, Barry Warsaw wrote: > On Mar 23, 2011, at 07:31 AM, Nick Coghlan wrote: > >> On Wed, Mar 23, 2011 at 4:25 AM, Barry Warsaw wrote: >>> It probably wouldn't be a bad idea to add a very fast "smoke test" for the >>> case where you get tripped up on the merge dance floor. When that happens, >>> you could run your localized tests, and then a set of tests that run in >>> just a >>> minute or two. >> >> What would such a smoke test cover, though? It's hard to think of >> anything particularly useful in the middle ground between "Can you run >> regrtest *at all*?" and "make quicktest". > > make quicktest runs 340 tests, and I'm certain many don't need to be run in a > smoke test. E.g. > > test_aifc > test_colorsys > test_concurrent_futures > (that's as far as it's gotten so far ;) > > Or you could time each individual test, choose a cut off for total test run > and then run whatever you can within that time (on some reference machine). > Or maybe just remove the longest running 50% of the quick tests. > > -Barry bzr can run its 30,000 tests in about 15-30min depending on your specific hardware and platform (it also supports parallel running of the test suite). I think it takes about 1hr running single-threaded on Windows because "os.rename()" is a fairly slow operation and we do it a lot in most tests (it is part of our locking primitive). Unit-tests (tests directly on a function without setting up lots of external state) generally can execute in milliseconds. I don't specifically know what is in those 340 tests, but 18min/340 = 3.2s for each test. Which is *much* longer than simple smoke tests would have to be. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J3WUACgkQJdeBCYSNAAMI2ACgl6obH9WKlmRiK4K4ib1g6SR7 KqkAn1LNrlBaUTf/sc5s30tZq3F3hmNl =DtQ8 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Workflow proposal
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/23/2011 1:23 PM, Dirkjan Ochtman wrote: > On Wed, Mar 23, 2011 at 12:39, John Arbash Meinel > wrote: >> Just as an aside, and I might be wrong. But reading through what strip >> does, (and from my knowledge of the disk layout) it can't actually be >> atomic. So if you kill it at the wrong time, it will have corrupted your >> repository. >> >> At least, that is from what I can tell. Which is that it has to rewrite >> each file history to omit the revisions you are stripping, and then >> rewrite the revision history to do the same. It would be possible to be >> 'stable' if you wrote a write-ahead-log and did all the work on the >> side, and then any client that tries to read or write to the repository >> finishes up the steps. But the individual file histories refer to the >> global revision history (by index), so you don't have a 'top-down' view >> that makes it all atomic by changing the top level object to point to >> the new lower level objects. > > The reason that shouldn't happen is the ordering. If we strip the > changelog first (what you call the global revision history), other > clients won't be able to "find" the any file-level revisions only > referenced by the revision just stripped, so it should be atomic. > > Cheers, > > Dirkjan http://mercurial.selenic.com/wiki/Strip Thats only true if you are stripping only from the top. According to the strip page, you also might re-order the numbers. Also, even with stripping the changelog first, it still leaves you with data in your repo that is going to suddenly think it is associated with the *next* commit you do. (So I make a change to 'foo.txt' commit it, then strip, the next commit I only change 'bar.txt'. If the strip was canceled 'hg log foo.txt' would include the latest revision as modifying foo.txt) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J6dYACgkQJdeBCYSNAAMQMQCfXvD4dGOVV8LB9LmtMNqXeHys 5xkAoJBWAXXVbZcCKC1GXDPjUMSNbVtn =k1FG -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 396, Module Version Numbers
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > #. ``__version_info__`` SHOULD be of the format returned by PEP 386's >``parse_version()`` function. The only reference to parse_version in PEP 386 I could find was the setuptools implementation which is pretty odd: > > In other words, parse_version will return a tuple for each version string, > that is compatible with StrictVersion but also accept arbitrary version and > deal with them so they can be compared: > from pkg_resources import parse_version as V V('1.2') > ('0001', '0002', '*final') V('1.2b2') > ('0001', '0002', '*b', '0002', '*final') V('FunkyVersion') > ('*funkyversion', '*final') bzrlib has certainly used 'version_info' as a tuple indication such as: version_info = (2, 4, 0, 'dev', 2) and version_info = (2, 4, 0, 'beta', 1) and version_info = (2, 3, 1, 'final', 0) etc. This is mapping what we could sort out from Python's "sys.version_info". The *really* nice bit is that you can do: if sys.version_info >= (2, 6): # do stuff for python 2.6(.0) and beyond Doing that as: if sys.version_info >= ('2', '6'): is pretty ugly. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2cLIcACgkQJdeBCYSNAAPT9wCg01L2s0DcqXE+zBAVPB7/Ts0W HwgAnRRrzR1yiQCSeFGh9jZzuXYrHwPz =0l4b -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
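A short sketch of the comparison idiom being described; the library version tuple here is illustrative:

import sys

# A sys.version_info-style tuple: integer components compare naturally.
version_info = (2, 4, 0, 'beta', 1)

if sys.version_info >= (2, 6):
    print('running on Python 2.6 or newer')

if version_info >= (2, 4):
    print('library is 2.4 or newer')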
Re: [Python-Dev] Test cases not garbage collected after run
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 4/14/2011 1:23 AM, Martin (gzlist) wrote: > On 07/04/2011, Michael Foord wrote: >> On 07/04/2011 20:18, Robert Collins wrote: >>> >>> Testtools did something to address this problem, but I forget what it >>> was offhand. > > Some issues were worked around, but I don't remember any comprehensive > solution. > >> The proposed "fix" is to make test suite runs destructive, either >> replacing TestCase instances with None or pop'ing tests after they are >> run (the latter being what twisted Trial does). run-in-a-loop helpers >> could still repeatedly iterate over suites, just not call the suite. > > Just pop-ing is unlikely to be sufficient in practice. The Bazaar test > suite (which uses testtools nowadays) has code that pops during the > run, but still keeps every case alive for the duration. That trebles > the runtime on my memory-constrained box unless I add a hack that > clears the __dict__ of every testcase after it's run. > > Martin I think we would be ok with merging the __dict__ clearing as long as it doesn't do it for failed tests, etc. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2m1oAACgkQJdeBCYSNAAPHmwCfQSNW8Pk7V7qx6Jl/gYthFVxE p0cAn0XRvRR+Rqb+yiJnaVEzUOBdwOpf =19YJ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
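Something along these lines would be the shape of that hack, sketched against unittest's TextTestResult; the class name is hypothetical, and the "keep failed tests alive" condition is only a sketch:

import unittest

class MemorySavingResult(unittest.TextTestResult):
    """Drop a test's instance attributes once it has run, but only if it
    neither failed nor errored, so broken tests keep their state around
    for post-mortem inspection."""

    def stopTest(self, test):
        super(MemorySavingResult, self).stopTest(test)
        failed = any(t is test for t, _ in self.failures + self.errors)
        if not failed:
            test.__dict__.clear()

# Usage sketch: unittest.TextTestRunner(resultclass=MemorySavingResult)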
Re: [Python-Dev] Disabling cyclic GC in timeit module
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... >> If you are only measuring json encoding of a few select pieces of >> data then it's a microbenchmark. If you are measuring the whole >> application (or a significant part of it) then I'm not sure >> timeit is the right tool for that. >> >> Regards >> >> Antoine. >> > > When you're measuring how much time it takes to encode json, this > is a microbenchmark and yet the time that timeit gives you is > misleading, because it'll take different amount of time in your > application. I guess my proposition would be to not disable gc by > default and disable it when requested, but well, I guess I'll give > up given the strong push against it. > > Cheers, fijal True, but it is also heavily dependent on how much other data your application has in memory at the time. If your application has 1M objects in memory and then goes to encode/decode a JSON string when the gc kicks in, it will take a lot longer because of all the stuff that isn't JSON related. I don't think it can be suggested that timeit should grow a flag for "put garbage into memory, and then run this microbenchmark with gc enabled.". If you really want to know how fast something is in your application, you sort of have to do the timing in your application, at scale and at load. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6dLwMACgkQJdeBCYSNAAOzzACfXpP16589Mu7W8ls9KddacF+g ozwAnRz5ciPg950qcV2uzyTKl1R21+6t =hGgf -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
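For completeness, re-enabling the collector for a timeit run is already possible through the setup string; a quick sketch:

import timeit

stmt = 'x = [[] for _ in range(100)]'

# timeit disables the cyclic GC around the timed code by default; putting
# gc.enable() in the setup string turns it back on for the measurement.
with_gc = timeit.Timer(stmt, setup='import gc; gc.enable()').timeit(number=10000)
without_gc = timeit.Timer(stmt).timeit(number=10000)

print('gc enabled: %.3fs   gc disabled: %.3fs' % (with_gc, without_gc))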