[Python-Dev] Rethinking intern() and its data structure
I've been doing some memory profiling of my application, and I've found some interesting results in how intern() works. I was pretty surprised to see that the "interned" dict was actually consuming a significant amount of total memory.

To give specific values: after doing "bzr branch A B" of a small project, the total memory consumption is ~21MB. Of that, the largest single object is the 'interned' dict, at 1.57MB, which contains 22k strings. One interesting bit: the size of the dict plus the referenced strings is only 2.4MB, so the "interned" dict *by itself* is two thirds the size of the dict + strings it contains. It also means that the average size of a referenced string is 37.4 bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5 characters long. So to keep references to 13.5 * 22k ~ 300kB of character data, we are paying 2.4MB, or about 8:1 overhead.

When I looked at the actual references from interned, I saw mostly variable names. Consider that every variable name goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic; it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on hit rate, etc. (I saw a whole lot of strings for the error codes in win32all / winerror.py, and windows error code names tend to be longer than the average variable name.)

Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:

a) Don't keep a double reference to both key and value to the same object (1 pointer per entry); this could be as simple as using a Set() instead of a dict().

b) Don't cache the hash key in the set, as strings already cache them (1 long per entry). This is a big win for space, but would need to be balanced against lookup and collision-resolving speed. My guess is that reducing the size of the set will actually improve speed more, because more items can fit in cache. It depends on how many times you need to resolve a collision. If the string hash is sufficiently spread out, and the load factor is reasonable, then likely when you actually find an item in the set, it will be the item you want, and you'll need to bring the string object into cache anyway, so that you can do a string comparison (rather than just a hash comparison).

c) Use the existing lookup function one time (PySet->lookup()). Sets already have a "lookup" which is optimized for strings, and returns a pointer to where the object would go if it existed. Which means the intern() function can do a single lookup, resolving any collisions, and either return the object or insert it without doing a second lookup.

d) Having a special structure might also allow for separate tuning of things like 'default size', 'growth rate', 'load factor', etc. A lot of this could be tuned specifically knowing that we really only have 1 of these objects, and it is going to be pointing at a lot of strings that are < 50 bytes long. If hashes of variable-name strings are well distributed, we could probably get away with a load factor of 2. If we know we are likely to have lots and lots that never go away (you rarely *unload* modules, and all variable names are in the intern dict), that would suggest having a large initial size, and probably a large growth factor, to avoid spending a lot of time resizing the set.

e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').)

Also note that the performance of the "interned" dict gets even worse on 64-bit platforms, where the size of a 'dictentry' doubles but the average length of a variable name doesn't change.

Anyway, I would be happy to implement something along the lines of a "StringSet", or maybe an "InternSet", etc. I just wanted to check whether people would be interested or not.

John =:->

PS> I'm not yet subscribed to python-dev, so if you could make sure to CC me in replies, I would appreciate it.
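A rough pure-Python illustration of the single-lookup idea in (c) above (this is only a sketch of the semantics, not CPython's C implementation; the real intern() lives in stringobject.c):

    # Hypothetical pure-Python model of interning.  dict.setdefault()
    # expresses the whole find-or-insert as one operation; point (c) is
    # about doing the same with a single probe inside the C set lookup.
    _interned = {}

    def py_intern(s):
        return _interned.setdefault(s, s)

    a = py_intern(''.join(['var', '_name']))
    b = py_intern('var_name')
    assert a is b    # both names now refer to the same string object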
Re: [Python-Dev] Rethinking intern() and its data structure
...
>> Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:
>
> [snip]
>
> Memory usage is definitely something we're interested in improving. Since you've already looked at this in some detail, could you try implementing one or two of your ideas and see if it makes a difference in memory consumption? Changing from a dict to a set looks promising, and should be a fairly self-contained way of starting on this. If it works, please post the patch on http://bugs.python.org with your results and assign it to me for review.
>
> Thanks,
> Collin Winter

(I did end up subscribing, just with a different email address :)

What is the best branch to start working from? "trunk"?

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
Christian Heimes wrote:
> John Arbash Meinel wrote:
>> When I looked at the actual references from interned, I saw mostly variable names. Considering that every variable goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. (I saw a whole lot of strings as the error codes in win32all / winerror.py, and windows error codes tend to be longer-than-average variable length.)
>
> I've read your posting twice but I'm still not sure if you are aware of the most important feature of interned strings. In the first place interning is not about saving some bytes of memory but a speed optimization. Interned strings can be compared with a simple and fast pointer comparison. With interned strings you can simply write:
>
>     char *a, *b;
>     if (a == b) {
>         ...
>     }
>
> instead of:
>
>     char *a, *b;
>     if (strcmp(a, b) == 0) {
>         ...
>     }
>
> A compiler can optimize the pointer comparison much better than a function call.

Certainly. But there is a cost associated with calling intern() in the first place. You created a string, and you are now trying to de-dup it. That cost is both in the memory to track all strings interned so far, and the cost of a dict lookup. And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. I'll also note that increasing memory does have a semi-direct effect on performance, because more memory requires more time to move data back and forth between main memory and the CPU caches.

...

> I agree that a dict is not the most memory efficient data structure for interned strings. However dicts are extremely well tested and highly optimized. Any specialized data structure needs to be designed and tested very carefully. If you happen to break the interning system it's going to lead to rather nasty and hard to debug problems.

Sure. My plan was to basically take the existing Set/Dict design, and just tweak it slightly for the expected operations of "interned".

>> e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').
>
> Python 3.0 uses unicode for all names. You have to design something that can be adopted to unicode, too. By the way do you know that dicts have an optimized lookup function for strings? It's called lookdict_unicode / lookdict_string.

Sure, but so does PySet. I'm not sure about lookset_unicode, but I would guess that it exists or should exist for py3k.

>> Also note that the performance of the "interned" dict gets even worse on 64-bit platforms. Where the size of a 'dictentry' doubles, but the average length of a variable name wouldn't change.
>>
>> Anyway, I would be happy to implement something along the lines of a "StringSet", or maybe the "InternSet", etc. I just wanted to check if people would be interested or not.
>
> Since interning is mostly used in the core and extension modules you might want to experiment with a different growth rate. The interning data structure could start with a larger value and have a slower, non progressive growth rate.
>
> Christian

I'll also mention that there are other uses for intern() where it is uniquely suitable. Namely, if you are parsing lots of text with redundant strings, it is a way to decrease total memory consumption (and potentially speed up future comparisons, etc.). The main reason why intern() is useful for this is that it doesn't make strings immortal, as would happen if you used some other structure, because strings know about the "interned" object. The options for a 3rd-party structure fall into something like:

1) A cache that makes the strings immortal. (IIRC this is what older versions of Python did.)

2) A cache that is periodically walked to see if any of the objects are no longer externally referenced. The main problem here is that walking is O(all-objects), whereas doing the checking at refcount=0 time means you only check objects when you think the last reference has gone away.

3) Hijacking PyStringType->dealloc, so that when the refcount goes to 0 and Python wants to destroy the string, you trigger your own cache to drop its entry.
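To make the speed point above concrete at the Python level (a small sketch using Python 2's builtin intern(); the string values here are made up for illustration):

    # Two equal strings built independently are distinct objects...
    a = ''.join(['some_longish_', 'attribute_name'])
    b = ''.join(['some_longish_', 'attribute_name'])
    assert a == b and a is not b

    # ...but after interning, both names point at one object, so the
    # comparison can short-circuit on identity instead of comparing
    # character by character.
    a = intern(a)
    b = intern(b)
    assert a is b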
Re: [Python-Dev] Rethinking intern() and its data structure
Alexander Belopolsky wrote:
> On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel wrote:
> ...
>> a) Don't keep a double reference to both key and value to the same object (1 pointer per entry), this could be as simple as using a Set() instead of a dict()
>
> There is a rejected patch implementing just that:
> http://bugs.python.org/issue1507011

Thanks for the heads up. Reading that thread, the final rejection rested on two points:

  Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue.

That specific argument was invalid, because the patch just changed the refcount trickery to use +/- 1 instead of +/- 2. And I'm pretty sure Alexander's argument was just that +/- 2 was weird, not that the "weakref" behavior was bad.

The other argument against the patch was:

  The operation "give me the member equal but not identical to E" is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is *not* a dict with key==value.

I don't know if any consensus was reached on this, since only Martin responded that way. I can say that for my "do some work with a medium size code base" case, the overhead of "interned" as a dictionary was 1.5MB out of 20MB total memory. Simply changing it to a Set would drop this to 1.0MB; changing it to a StringSet could further drop it to 0.5MB. I have no proof about the impact on performance, since I haven't benchmarked it yet. I would guess that any performance impact would depend on whether the total size of 'interned' fits inside the L2 cache or not.

There is also a small bug in the original patch when adding the string to the set failed: it would return t == NULL, which would mean t != s, and the intern-in-place would end up setting your pointer to NULL rather than leaving it alone and clearing the error code.

So I guess some of it comes down to whether "loweis" would also reject this change on the basis that mathematically a "set is not a dict". Though his claim at the time was that "nobody else is speaking in favor of the patch", and at this point at least Collin Winter has expressed some interest.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
...
> I like your rationale (save memory) much more, and was asking in the tracker for specific numbers, which weren't forthcoming.
> ...
> Now that you brought up specific numbers, I tried to verify them, and found them correct (although a bit unfortunate), please see my test script below. Up to 21800 interned strings, the dict takes (only) 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k interned strings is "typical", I still don't know.

Given that every variable name in every file is interned, it can grow pretty rapidly. As an extreme case, consider the file "win32/lib/winerror.py", which tracks all possible win32 errors:

  >>> import winerror
  >>> print len(winerror.__dict__)
  1872

So a single error file adds 1.9k strings. My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise your code looks correct. If all I do is find the interned dict, I see:

  >>> print len(d)
  5037

So stock python, without importing much extra (just os, sys, gc, etc.), has almost 5k strings already. I don't have a great regex yet for extracting how many unique strings there are in a given bit of source code. However, if I do:

  import gc, sys

  def find_interned_dict():
      cand = None
      for o in gc.get_objects():
          if not isinstance(o, dict):
              continue
          if "find_interned_dict" not in o:
              continue
          for k, v in o.iteritems():
              if k is not v:
                  break
          else:
              assert not cand
              cand = o
      return cand

  d = find_interned_dict()
  print len(d)
  # Just import a few of the core structures
  from bzrlib import branch, repository, workingtree, builtins
  print len(d)

I start at 5k strings, and after just importing the important bits of bzrlib, I'm at 19,316. Now, the bzrlib source code isn't particularly huge. It is about 3.7MB / 91k lines of .py files (that is, without importing the test suite). Memory consumption with just importing bzrlib shows up at 15MB, with 300kB taken up by the intern dict.

If I then import some extra bits of bzrlib, like http support, ftp support, and sftp support (which brings in python's httplib, and paramiko, an ssh/sftp implementation), I'm up to:

  >>> print len(d)
  25186

Memory has jumped to 23MB (interned is now 1.57MB), and I haven't actually done anything but import python code yet. If I sum the size of the PyString objects held in intern(), it amounts to 940KB, though they refer to only 335KB of char data (or an average of 13 bytes per string).

> Wrt. your proposed change, I would be worried about maintainability, in particular if it would copy parts of the set implementation.

Right, so in the first part, I would just use Set(), as it could then save 1/3rd of the memory it uses today (dropping down to 1MB from 1.5MB). I don't have numbers on how much that would improve CPU times; I would imagine improving 'intern()' would affect import times more than run times, simply because import time interns a *lot* of strings. Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app.

John =:->
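For comparison, on a Python new enough to have sys.getsizeof() (2.6+), the container overhead of a dict versus a set holding the same strings can be measured directly. A small sketch, with the 22k string count chosen to mirror the numbers above (exact byte counts will vary with pointer width and Python version):

    import sys

    words = ['s%06d' % i for i in range(22000)]
    d = dict((w, w) for w in words)   # how "interned" is stored today
    s = set(words)                    # the proposed storage

    # getsizeof() reports only the container, not the strings it holds.
    print 'dict container: %d bytes' % sys.getsizeof(d)
    print 'set  container: %d bytes' % sys.getsizeof(s)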
Re: [Python-Dev] Rethinking intern() and its data structure
Martin v. Löwis wrote:
>> I don't have numbers on how much that would improve CPU times, I would imagine improving 'intern()' would impact import times more than run times, simply because import time is interning a *lot* of strings.
>>
>> Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app.
>
> Maybe I misunderstand your proposed change: how could the representation of the interning dict possibly change the runtime of interning? (let alone significantly)
>
> Regards,
> Martin

Decreasing memory consumption lets more things fit in cache. Once the size of 'interned' is greater than what fits into L2 cache, you start paying the cost of a full memory fetch, which is usually measured in hundreds of cpu cycles. Avoiding double lookups in the dictionary would be less of a win, since the second lookup is probably pretty fast if there are no collisions: everything would already be in the local CPU cache.

If we were dealing with objects that were KB in size, it wouldn't matter. But as the intern dict quickly gets into MB, it starts to make a bigger difference. How big a difference would be very CPU- and dataset-size specific. But certainly caches make certain things much faster, and once you overflow a cache, performance can take a surprising turn.

So my primary goal is certainly a decrease in memory consumption. I think it will have a small knock-on effect of improving performance, but I don't have anything to give concrete numbers. Also, consider that resizing has to re-insert every entry, thus paging in all X bytes and assigning to another 2X bytes; cutting X by (potentially) 3 would probably have a small but measurable effect.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
Greg Ewing wrote:
> John Arbash Meinel wrote:
>> And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object.
>
> That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module.

Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms.

John =:->
Re: [Python-Dev] Rethinking intern() and its data structure
...
>> Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms.
>
> I don't want to quash your idealism too severely, but it is extremely unlikely that you are going to get anywhere near that kind of speed up by tweaking string interning. 25k times doing anything (computation) just isn't all that much.
>
>   $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)'
>   100 loops, best of 3: 8.28 msec per loop
>
> Perhaps this isn't representative (int hashing is ridiculously cheap, for instance), but the dict itself is far bigger than the dict you are dealing with and would have similar cache-busting properties. And yet, 25k accesses (plus python->c dispatching costs which you are paying with interning) consume only ~10ms. You could do more good by eliminating a handful of disk seeks by reducing the number of imported modules...
>
> -Mike

You're also using timeit over the same set of 25k keys, which means it only has to load that subset, and as you are using identical runs each time, those keys are already loaded into your cache lines... And given how hash(int) works, they are all sequential in memory, and all 10M keys in your original set have 0 collisions. Actually, at 10M keys you'll have a dict of size 20M entries, with the first 10M entries full and the trailing 10M entries all empty.

That said, you're right, the benefits of a smaller structure are going to be small. I'll just point out what happens with a small tweak to your timing. First, the same run on my machine:

  $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)'
  100 loops, best of 3: 6.27 msec per loop

So slightly faster than yours, *but*, let's try a much smaller dict:

  $ python -mtimeit -s 'd=dict.fromkeys(xrange(25000))' 'for x in xrange(25000): d.get(x)'
  100 loops, best of 3: 6.35 msec per loop

Pretty much the same time, well within the noise margin. But if I go back to the "big dict" and actually select 25k keys spread across the whole set:

  $ TIMEIT -s 'd=dict.fromkeys(xrange(10000000))' \
           -s 'keys=range(0, 10000000, 10000000/25000)' \
           'for x in keys: d.get(x)'
  100 loops, best of 3: 13.1 msec per loop

Now I'm still accessing 25k keys, but I'm doing it across the whole range, and suddenly the time *doubled*. What about slightly more random access:

  $ TIMEIT -s 'import random; d=dict.fromkeys(xrange(10000000))' \
           -s 'bits = range(0, 10000000, 400); random.shuffle(bits)' \
           'for x in bits: d.get(x)'
  100 loops, best of 3: 15.5 msec per loop

Not as big a difference as I thought it would be... But I bet if there was a way to put the random shuffle in the inner loop, so you weren't accessing the same identical 25k keys each pass, you might get more interesting results. As for other bits about exercising caches (varying the step in the shuffled-keys setup above):

  $ shuffle(range(0, 10000000, 400))
  100 loops, best of 3: 15.5 msec per loop

  $ shuffle(range(0, 10000000, 40))
  10 loops, best of 3: 175 msec per loop

10x more keys costs 11.3x the time, pretty close to linear.

  $ shuffle(range(0, 10000000, 10))
  10 loops, best of 3: 739 msec per loop

4x the keys, 4.5x the time; starting to get more into nonlinear effects.

Anyway, you're absolutely right: intern() overhead is a tiny fraction of 'import bzrlib.*' time, so I don't expect to see amazing results. That said, accessing 25k keys in a smaller structure is 2x faster than accessing 25k keys spread across a larger structure.

John =:->
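A runnable version of the "shuffle inside the measured loop" variation wished for above might look like this. It is only a sketch using the timeit module directly: the shuffle cost is included in the measured time, so the absolute numbers are merely indicative, and the 10M-entry dict needs several hundred MB of RAM.

    import timeit

    setup = (
        "import random\n"
        "d = dict.fromkeys(xrange(10000000))\n"
        "keys = range(0, 10000000, 400)")   # 25k keys spread over the dict

    # Re-shuffle the probe order on every pass, so repeated passes do not
    # keep hitting the cache lines warmed up by the previous pass.
    stmt = "random.shuffle(keys)\nfor x in keys: d.get(x)"

    best = min(timeit.Timer(stmt, setup).repeat(3, 10))
    print '%.1f msec per pass' % (best * 1000 / 10)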
Re: [Python-Dev] Easy way to detect filesystem case-sensitivity?
Andrew Bennetts wrote:
> Antoine Pitrou wrote:
>> Robert Kern writes:
>>> Since one may have more than one filesystem side-by-side, this can't just be a system-wide boolean somewhere. One would have to query the target directory for this information. I am not aware of the existence of code that does such a query, though.
>> Or you can just be practical and test for it. Create a file "foobar" and see if you can open "FOOBAR" in read mode...
>
> Agreed. That is how Bazaar's test suite detects this, and it works well.
>
> -Andrew.

Actually, I believe we do:

  open('format', 'wb').close()
  try:
      os.lstat('FoRmAt')
  except IOError, e:
      if e.errno == errno.ENOENT:
          ...

I don't know that it really matters; I just wanted to point out that we use 'lstat' rather than 'open()' to check. I could be wrong about the test suite, but I know that is what we do for 'live' files. (We always create a format file, so we know it is there to 'stat' it via a different name.)

John =:->
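Filled out into a self-contained probe, the lstat-based check described above might look something like this (a sketch only; the 'format' probe name and the cleanup are assumptions, not Bazaar's actual code, and os.lstat reports a missing file as OSError/ENOENT):

    import errno
    import os

    def dir_is_case_sensitive(path):
        probe = os.path.join(path, 'format')
        open(probe, 'wb').close()
        try:
            try:
                os.lstat(os.path.join(path, 'FoRmAt'))
            except OSError, e:
                if e.errno == errno.ENOENT:
                    return True   # differently-cased name not found
                raise
            return False          # found it: case-insensitive lookup
        finally:
            os.remove(probe)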
Re: [Python-Dev] PEP 385: the eol-type issue
Mark Hammond wrote:
> On 5/08/2009 8:14 PM, Dirkjan Ochtman wrote:
>> endings. Typically, in my case, that was either Notepad2 (an awesomely light-weight Notepad replacement) or Komodo (Edit). That solved all of my issues, so I haven't had a need for win32text so far.
>
> FWIW, I use komodo and scite as my primary editors, and as mentioned, am personally responsible for accidentally checking in \r\n files into what should be a \n repo. I am slowly and painfully learning to be more careful - IMO, I shouldn't need to...
>
> Cheers,
>
> Mark

IIRC one of the main problems is Copy & Paste. I believe both Scite and Visual Studio have had issues where they "preserve" the line endings of files, but if you paste from another source, they will also "preserve" the line endings of the pasted content. Beyond that, you also have the "create a new file defaults to CRLF" behaviour, which causes similar problems.

John =:->
Re: [Python-Dev] VC++ versions to match python versions?
Chris Withers wrote:
> Michael Foord wrote:
>> D'oh. For 2.5 I mean. It may be *possible* though - just as you *can* build extensions for Python 2.5 on windows with mingw (with the appropriate distutils configuration), but there are pitfalls with doing this.
>
> Yes, in my case I'm trying to compile guppy (for heapy, which is an amazing tool) but that blows up with mingw...
>
> (But I'm also likely to want to do some python dev on windows, httplib download problems and all...)
>
> Chris

Guppy doesn't compile on Windows. Pretty much full-stop. It uses static references to DLL functions, which on Windows is not allowed. I've tried patching it to remove such things, and I finally got it to compile, only to have it go "boom!" in actual use. If you can get it to work, certainly post something so that I can cheer.

John =:->
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
>> I don't really know how this test works, so I won't claim to understand the results either. However, here you go:
>
> Thanks.
>
> Interesting results. I wonder what they would be like on a multi-core machine. The GIL seems to behave perfectly on your setup (no performance degradation due to concurrency, and zero latencies).

C:\downloads>C:\Python26\python.exe ccbench.py

--- Throughput ---

Pi calculation (Python)

threads=1: 691 iterations/s.
threads=2: 400 ( 57 %)
threads=3: 453 ( 65 %)
threads=4: 467 ( 67 %)
  ^- seems to have some contention

regular expression (C)

threads=1: 592 iterations/s.
threads=2: 598 ( 101 %)
threads=3: 587 ( 99 %)
threads=4: 586 ( 99 %)

bz2 compression (C)

threads=1: 536 iterations/s.
threads=2: 1056 ( 196 %)
threads=3: 1040 ( 193 %)
threads=4: 1060 ( 197 %)
  ^- seems to properly show that I have 2 cores here.

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 38 ms. (std dev: 18 ms.)
CPU threads=2: 173 ms. (std dev: 77 ms.)
CPU threads=3: 518 ms. (std dev: 264 ms.)
CPU threads=4: 661 ms. (std dev: 343 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

John =;->
Re: [Python-Dev] [Python-ideas] Remove GIL with CAS instructions?
Kristján Valur Jónsson wrote:
...
> This depends entirely on the platform and primitives used to implement the GIL. I'm interested in windows. There, I found this article:
> http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html
> So, you may be on to something. Perhaps a simple C test is in order then?
>
> I did that. I found, on my dual-core vista machine, running "release", that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however. "Fairness" was retained in debug builds too, strangely enough.
>
> Now, Python uses none of these. On windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly.
>
> The test application is attached.
>
> I think that you ought to substantiate your claims better, maybe with a specific platform and using some test like the above.
>
> On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on windows. This test ought to show the danger of that. The GIL is different than a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way, if we want very fast mutexes for the rest of the system (cc to python-dev)
>
> Cheers,
>
> Kristján

I can compile and run this, but I'm not sure I know how to interpret the results. If I understand it correctly, then everything but "Critical Sections" is fair on my Windows Vista machine.

To run, I changed the line "#define EVENT" to EVENT, MUTEX, SEMAPHORE and CRIT in turn, then built and ran in the "Release" configuration (using VS 2008 Express). For all but CRIT, I saw things like:

  thread 5532 reclaims GIL
  thread 5532 working 51234 units
  thread 5532 worked 51234 units: 1312435761
  thread 5532 flashing GIL
  thread 5876 reclaims GIL
  thread 5876 working 51234 units
  thread 5876 worked 51234 units: 1312435761
  thread 5876 flashing GIL

i.e. four lines from one thread, then four lines from the other thread. For CRIT, I instead saw something more like 50 lines from one thread, then 50 lines from the other thread.

This is Vista Home Basic, and VS 2008 Express Edition, with a 2-core machine.

John =:->
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
> Sturla Molden writes:
>> It does not crash the interpreter, but it seems it can deadlock.
>
> Kristján sent me a patch which I applied and is supposed to fix this. Anyway, thanks for the numbers. The GIL does seem to fare a bit better (zero latency with the Pi calculation in the background) than under Linux, although it may be caused by the limited resolution of time.time() under Windows.
>
> Regards
>
> Antoine.

You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result (2-core machine, python 2.6.2):

$ py ccbench.py

--- Throughput ---

Pi calculation (Python)

threads=1: 675 iterations/s.
threads=2: 388 ( 57 %)
threads=3: 374 ( 55 %)
threads=4: 445 ( 65 %)

regular expression (C)

threads=1: 588 iterations/s.
threads=2: 519 ( 88 %)
threads=3: 511 ( 86 %)
threads=4: 513 ( 87 %)

bz2 compression (C)

threads=1: 536 iterations/s.
threads=2: 949 ( 176 %)
threads=3: 900 ( 167 %)
threads=4: 927 ( 172 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 24727 ms. (std dev: 0 ms.)
CPU threads=1: 27930 ms. (std dev: 0 ms.)
CPU threads=2: 31029 ms. (std dev: 0 ms.)
CPU threads=3: 34170 ms. (std dev: 0 ms.)
CPU threads=4: 37292 ms. (std dev: 0 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 40454 ms. (std dev: 0 ms.)
CPU threads=1: 43674 ms. (std dev: 21 ms.)
CPU threads=2: 47100 ms. (std dev: 165 ms.)
CPU threads=3: 50441 ms. (std dev: 304 ms.)
CPU threads=4: 53707 ms. (std dev: 377 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 56138 ms. (std dev: 0 ms.)
CPU threads=1: 59332 ms. (std dev: 0 ms.)
CPU threads=2: 62436 ms. (std dev: 0 ms.)
CPU threads=3: 66130 ms. (std dev: 0 ms.)
CPU threads=4: 69859 ms. (std dev: 0 ms.)
Re: [Python-Dev] GIL behaviour under Windows
Antoine Pitrou wrote:
> Le mercredi 21 octobre 2009 à 12:42 -0500, John Arbash Meinel a écrit :
>> You can use time.clock() instead to get <15ms resolution. Changing all instances of 'time.time' to 'time.clock' gives me this result:
> [snip]
>> --- Latency ---
>>
>> Background CPU task: Pi calculation (Python)
>>
>> CPU threads=0: 24727 ms. (std dev: 0 ms.)
>> CPU threads=1: 27930 ms. (std dev: 0 ms.)
>> CPU threads=2: 31029 ms. (std dev: 0 ms.)
>> CPU threads=3: 34170 ms. (std dev: 0 ms.)
>> CPU threads=4: 37292 ms. (std dev: 0 ms.)
>
> Well apparently time.clock() has a per-process time reference, which makes it unusable for this benchmark :-(
> (the numbers above are obviously incorrect)
>
> Regards
>
> Antoine.

I believe that time.clock() is measured as seconds since the start of the process. So yes, I think spawning a background process will reset this counter back to 0.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Vitor Bosshard wrote:
> 2009/10/23 Willi Richert:
>> Hi,
>>
>> recently I wrote an algorithm in which very often I had to get an arbitrary element from a set without removing it.
>>
>> Three possibilities came to mind:
>>
>> 1.
>>   x = some_set.pop()
>>   some_set.add(x)
>>
>> 2.
>>   for x in some_set:
>>       break
>>
>> 3.
>>   x = iter(some_set).next()
>>
>> Of course, the third should be the fastest. It nevertheless goes through all the iterator creation stuff, which costs some time. I wondered why the builtin set does not provide a more direct and efficient way of retrieving some element without removing it. Is there any reason for this?
>>
>> I imagine something like
>>
>>   x = some_set.get()
>
> I see this as being useful for frozensets as well, where you can't get an arbitrary element easily due to the obvious lack of .pop(). I ran into this recently, when I had a frozenset that I knew had 1 element (it was the difference between 2 other sets), but couldn't get to that element easily (get the pun?)

So in my testing, (2) was actually the fastest. I assume that is because .next() is a function-call overhead, while

  for x in some_set: break

is evaluated inline. It probably still has to call PyObject_GetIter, but it doesn't have to create a stack frame for it. This is what "timeit" tells me. All runs are of the form:

  python -m timeit -s "s = set([10])" ...

  0.101us  "for x in s: break; x"
  0.130us  "for x in s: pass; x"
  0.234us  -s "n = next; i = iter" "x = n(i(s)); x"
  0.248us  "x = next(iter(s)); x"
  0.341us  "x = iter(s).next(); x"

So 'for x in s: break' is about 2x faster than next(iter(s)) and 3x faster than iter(s).next(). I was pretty surprised that it was 30% faster than "for x in s: pass"; I assume it has something to do with a potential "else:" clause.

Note that all of these are significantly < 1us, so this only matters if it is something you are doing often. I don't know your specific timings, but I would guess that

  for x in s: break

is actually going to be faster than your s.get(), primarily because s.get() requires an attribute lookup. I would think your version might be faster for something like

  stat2 = "g = s.get; for i in xrange(100): g()"

However, that is still a function call, which may be treated differently by the interpreter than the for/break loop. I certainly suggest you try it and compare.

John =:->
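The inline timings above can be reproduced with a small driver script (a sketch only; the absolute numbers will of course differ by machine and Python version):

    import timeit

    STMTS = [
        'for x in s: break',
        'for x in s: pass',
        'x = next(iter(s))',
        'x = iter(s).next()',
    ]

    for stmt in STMTS:
        # repeat(5, 1000000): best total time for 1M runs == usec per run
        best = min(timeit.Timer(stmt, 's = set([10])').repeat(5, 1000000))
        print '%.3f usec  %s' % (best, stmt)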
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Terry Reedy wrote:
> John Arbash Meinel wrote:
>> So 'for x in s: break' is about 2x faster than next(iter(s)) and 3x faster than (iter(s).next()). I was pretty surprised that it was 30% faster than "for x in s: pass". I assume it has something to do with a potential "else:" statement?
>
> for x in s: pass
>
> iterates through *all* the elements in s and leaves x bound to the arbitrary *last* one instead of the arbitrary *first* one. For a large set, this would be a lot slower, not just a little.
>
> fwiw, I think the use case for this is sufficiently rare that it does not need a separate method just for this purpose.
>
> tjr

The point of my test was that it was a set with a *single* item, and 'break' was 30% faster than 'pass'. Which was surprising. Certainly the difference is huge if there are 10k items in the set.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Adam Olsen wrote:
> On Fri, Oct 23, 2009 at 11:04, Vitor Bosshard wrote:
>> I see this as being useful for frozensets as well, where you can't get an arbitrary element easily due to the obvious lack of .pop(). I ran into this recently, when I had a frozenset that I knew had 1 element (it was the difference between 2 other sets), but couldn't get to that element easily (get the pun?)
>
> item, = set_of_one

Interesting. It depends a bit on the speed of tuple unpacking, but presumably that is quite fast. On my system it is pretty darn good:

  0.101us  "for x in s: break"
  0.112us  "x, = s"
  0.122us  "for x in s: pass"

So not quite as good as the for loop, but quite close.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a set without removing it
Martin v. Löwis wrote:
>> Hmm, perhaps when using sets as work queues?
>
> A number of comments:
>
> - it's somewhat confusing to use a set as a *queue*, given that it won't provide FIFO semantics.
> - there are more appropriate and direct container structures available, including a dedicated Queue module (even though this might be doing too much with its thread-safety).
> - if you absolutely want to use a set as a work queue, then the .pop() method should be sufficient, right?
>
> Regards,
> Martin

We were using sets to track the tips of a graph, and to compare whether one node was an ancestor of another. We were caching that answer into frozensets, since that made them immutable. So we have code like:

  res = heads(node1, node2)
  if len(res) == 1:
      # What is the 'obvious' way to get the node out?

I posit that there *isn't* an obvious way to get the single item out of a 1-entry frozenset. These all work:

  for x in res: break
  list(res)[0]
  set(res).pop()
  iter(res).next()
  [x for x in res][0]
  x, = res          # I didn't think of this one until recently

but none of them is what I would consider *obvious*. At the least, none of them is obviously better than the others, so you end up looking at the performance characteristics to give you a reason to pick one over another.

res.get() would be a fairly obvious way to do it. Obvious enough that I would probably never have gone searching for any of the other spellings. Though personally, I think I would call it "set.peek()"; the specific name doesn't really matter to me.

John =:->
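As a plain function, the proposed operation is tiny. A hypothetical helper (spelled peek() here; this is not an existing set method):

    def peek(s):
        # Return an arbitrary element without removing it; works for
        # both set and frozenset, raises if the set is empty.
        for x in s:
            return x
        raise KeyError('peek from an empty set')

    assert peek(frozenset([42])) == 42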
Re: [Python-Dev] Reworking the GIL
Sturla Molden wrote:
> Antoine Pitrou skrev:
>> It certainly is.
>> But once again, I'm no Windows developer and I don't have a native Windows host to test on; therefore someone else (you?) has to try.
>
> I'd love to try, but I don't have VC++ to build Python, I use GCC on Windows.
>
> Anyway, the first thing to try then is to call
>
>   timeBeginPeriod(1);
>
> once on startup, and leave the rest of the code as it is. If 2-4 ms is sufficient we can use timeBeginPeriod(2), etc. Microsoft is claiming Windows performs better with high granularity, which is why it is 10 ms by default.
>
> Sturla

That page claims:

  Windows uses the lowest value (that is, highest resolution) requested by any process.

I would posit that the chance of having some random process on your machine request a high-speed timer is high enough that the overhead of Python doing the same is probably low.

John =:->
Re: [Python-Dev] Retrieve an arbitrary element from a setwithoutremoving it
geremy condra wrote:
> On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky wrote:
>> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser wrote:
>>> .. and "x = iter(s).next()" raises a StopIteration exception.
>> And that's why the documented recipe should probably recommend next(iter(s), default) instead. Especially because iter(s).next() is not even valid code in 3.0.
>
> This seems reasonably legible to you? Strikes me as coding by incantation. Also, while I've heard people say that the naive approach is slower, I'm not getting that result. Here's my test:
>
>   smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
>   smrt.repeat(10)
>   [1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
>    0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
>    0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
>    0.56846404075622559]
>
>   naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
>   naive.repeat(10)
>   [0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
>    0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
>    0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
>    0.53491711616516113]
>
> Perhaps my test is flawed in some way?
>
> Geremy Condra

Well, it also uses a fairly trivial set. 'set(range(100))' is going to have essentially 0 collisions; everything hashes to a unique bucket. As such, popping an item out of the set is a single "val = table[int & mask]; table[int & mask] = _dummy", and then looking it up again when it is re-added requires 2 table probes (one finds the _dummy, the next finds that the chain is broken, and the _dummy slot can be rewritten).

However, if a set is more full, or has more collisions, then pop() and add() become relatively more expensive. Surprising to me, "next(iter(s))" was actually slower than .pop() + .add() for a 100-node set in my testing:

  $ alias TIMEIT="python -m timeit -s 's = set(range(100))'"
  $ TIMEIT "x = next(iter(s))"
  1000000 loops, best of 3: 0.263 usec per loop
  $ TIMEIT "x = s.pop(); s.add(x)"
  1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:

  $ TIMEIT "for x in s: break"
  10000000 loops, best of 3: 0.0943 usec per loop

Now, what about a set with *lots* of collisions? Create 100 keys that all hash to the same bucket:

  $ alias TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in range(100)])'"
  $ TIMEIT "x = next(iter(s))"
  1000000 loops, best of 3: 0.257 usec per loop
  $ TIMEIT "x = s.pop(); s.add(x)"
  1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

  $ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
  1000 loops, best of 3: 255 usec per loop

Now something seems terribly wrong here: next(iter(s)) suddenly jumps from < 0.3 usec to more than 200 usec, or ~1000x slower. I realize the above has 900k keys, which is big, but 'next(iter(s))' should only touch 1 and, in fact, should always return the *first* entry. My best guess is that the *first* entry in the internal set table is no longer close to offset 0, which means that 'next(iter(s))' has to scan a bunch of empty table slots before it finds a non-empty entry.

Anyway, none of the proposals will really ever be faster than:

  for x in s: break

It is a bit ugly of a construct, but you don't have an attribute lookup, etc. As long as you don't do

  for x in s: pass

it stays nice and fast.

John =:->
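The empty-slot guess is easy to poke at directly: CPython's small ints hash to themselves, and set iteration scans the internal table from slot 0, so a set of large consecutive ints leaves a long empty prefix for iter() to skip over. A sketch (timings are illustrative only):

    import timeit

    SETUPS = [
        "s = set(range(100))",              # entries start right at slot 0
        "s = set(range(100000, 1000100))",  # ~900k keys, low slots all empty
    ]

    for setup in SETUPS:
        # seconds per 1000 runs, numerically equal to msec per run * 1000,
        # i.e. usec per run after multiplying by 1000.
        best = min(timeit.Timer("next(iter(s))", setup).repeat(3, 1000))
        print '%10.3f usec  %s' % (best * 1000, setup)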
Re: [Python-Dev] PEP 3003 - Python Language Moratorium
...
> A moratorium isn't cost-free. With the back-end free to change, patches will go stale over 2+ years. People will lose interest or otherwise move on. Those with good ideas but little patience will be discouraged. I fully expect that, human nature being as it is, those proposing a change, good or bad, will be told not to bother wasting their time, there's a moratorium on, at least as often as they'll be encouraged to bide their time while the moratorium is on.

I believe if you go back to the very beginning of this thread, Guido considers this a *feature*, not a *bug*. He wanted to introduce a moratorium at least partially because he was tired of endless threads about anonymous code blocks, etc., which aren't going to be included in the language anyway, so he may as well make a point to say "and neither will anything else for a while".

I don't mean to put words into his mouth, so please correct me if I'm wrong.

John =:->
Re: [Python-Dev] Splitting something into two steps produces different behavior from doing it in one fell swoop in Python 2.6.2
Roy Hyunjin Han wrote:
> While debugging a network algorithm in Python 2.6.2, I encountered some strange behavior and was wondering whether it has to do with some sort of code optimization that Python does behind the scenes.
>
> After initialization: defaultdict(<type 'set'>, {1: set([1])})
> Popping and updating in two steps: defaultdict(<type 'set'>, {1: set([1])})
>
> After initialization: defaultdict(<type 'set'>, {1: set([1])})
> Popping and updating in one step: defaultdict(<type 'set'>, {})
>
> import collections
> print ''
> x = collections.defaultdict(set)
> x[1].update([1])
> print 'After initialization: %s' % x
> items = x.pop(1)
> x[1].update(items)
> print 'Popping and updating in two steps: %s' % x
> print ''
> y = collections.defaultdict(set)
> y[1].update([1])
> print 'After initialization: %s' % y
> y[1].update(y.pop(1))
> print 'Popping and updating in one step: %s' % y

y[1].update(y.pop(1)) is going to evaluate y[1] before it evaluates y.pop(1). Which means it gets back the original set, which is then removed from the dict by y.pop, and only then updated. You would probably get the same behavior without using a defaultdict:

  y.setdefault(1, set()).update(y.pop(1))
  ^^- evaluated first

Oh, and I should probably give the standard reminder: this list is for the development *of* python, not development *with* python.

John =:->
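A quick check of that claim (a sketch with a plain dict, no defaultdict involved) shows the same surprise, purely from evaluation order:

    y = {1: set([1])}
    # setdefault() runs first and returns the existing set; pop() then
    # removes that same set from the dict, so the update modifies an
    # object the dict no longer holds.
    y.setdefault(1, set()).update(y.pop(1))
    print y    # {} -- the key is gone, just like the defaultdict case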
Re: [Python-Dev] GIL required for _all_ Python calls?
MRAB wrote:
> Hi,
>
> I've been wondering whether it's possible to release the GIL in the regex engine during matching.
>
> I know that it needs to hold the GIL during memory-management calls, but does it for calls like Py_UNICODE_TOLOWER or PyErr_SetString? Is there an easy way to find out? Or is it just a case of checking the source files for mentions of the GIL? The header file for PyList_New, for example, doesn't mention it!
>
> Thanks

Anything that Py_INCREFs or Py_DECREFs should hold the GIL, or you may get concurrent updates of the refcount, and then the final value is wrong (two threads each do 5+1, both getting 6 rather than 7, and when they decref you end up at 4 rather than back at 5).

AFAIK, the only things that don't require the GIL are macro functions, like PyString_AS_STRING or PyTuple_SET_ITEM. PyErr_SetString, for example, will be increfing and setting the exception state, so it certainly needs the GIL to be held.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Collin Winter wrote:
> Hi Dirkjan,
>
> On Wed, Jan 20, 2010 at 10:55 PM, Dirkjan Ochtman wrote:
>> On Thu, Jan 21, 2010 at 02:56, Collin Winter wrote:
>>> Agreed. We are actively working to improve the startup time penalty. We're interested in getting guidance from the CPython community as to what kind of a startup slow down would be sufficient in exchange for greater runtime performance.
>> For some apps (like Mercurial, which I happen to sometimes hack on), increased startup time really sucks. We already have our demandimport code (I believe bzr has something similar) to try and delay imports, to prevent us spending time on imports we don't need. Maybe it would be possible to do something like that in u-s? It could possibly also keep track of the thorny issues, like imports where there's an except ImportError that can do fallbacks.
>
> I added startup benchmarks for Mercurial and Bazaar yesterday (http://code.google.com/p/unladen-swallow/source/detail?r=1019) so we can use them as more macro-ish benchmarks, rather than merely starting the CPython binary over and over again. If you have ideas for better Mercurial/Bazaar startup scenarios, I'd love to hear them. The new hg_startup and bzr_startup benchmarks should give us some more data points for measuring improvements in startup time.
>
> One idea we had for improving startup time for apps like Mercurial was to allow the creation of hermetic Python "binaries", with all necessary modules preloaded. This would be something like Smalltalk images. We haven't yet really fleshed out this idea, though.
>
> Thanks,
> Collin Winter

There is "freeze":
http://wiki.python.org/moin/Freeze

Which IIRC Robert Collins tried in the past, but didn't see a huge gain. It at least tries to compile all of your python files to C files and then build an executable out of that.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Martin v. Löwis wrote:
>> There is "freeze":
>> http://wiki.python.org/moin/Freeze
>>
>> Which IIRC Robert Collins tried in the past, but didn't see a huge gain. It at least tries to compile all of your python files to C files and then build an executable out of that.
>
> "to C files" is a bit of an exaggeration, though. It embeds the byte code into the executable. When loading the byte code, Python still has to perform unmarshalling.
>
> Regards,
> Martin

Sure, though it sounds quite similar to what they were mentioning with "the creation of hermetic Python 'binaries', with all necessary modules preloaded".

My understanding was that because 'stuff' happens at import time, there isn't a lot that you can do to improve startup time. I guess it depends on what sort of state you could persist safely. And, of course, what you could get away with for a library would probably be different than what you could do with a standalone app.

John =:->
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
sstein...@gmail.com wrote:
> On Jan 21, 2010, at 11:32 PM, Chris Bergstresser wrote:
>> On Thu, Jan 21, 2010 at 9:49 PM, Tres Seaver wrote:
>> Generally, that's not going to be the case. But the broader point--that you've no longer got an especially good idea of what's taking time to run in your program--is still very valid.
>
> I'm sure someone's given it a clever name that I don't know, but it's kind of the profiling Heisenbug -- the information you need to optimize disappears when you turn on the JIT optimizer.
>
> S

I would assume that part of the concern is not being able to get per-line profiling out of the JIT'd code. Personally, I'd rather see one big "and I called a JIT'd function that took XX seconds" than get nothing.

At the moment, we have some small issues with cProfile in that it doesn't attribute time to extension functions particularly well. For example, I've never seen a Pyrex "__init__" function show up in the timing; the time spent always gets assigned to the calling function. So if I want to see it, I set up a 'create_foo(*args, **kwargs)' function that just does "return Foo(*args, **kwargs)". I don't remember the other bits I've run into.

But certainly I would say that giving some sort of real-world profiling is better than having it drop back to interpreted code. You could always run --profile -j never if you really wanted to profile the non-JIT'd code.

John =:->
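The wrapper trick mentioned above, spelled out with a stand-in extension type (collections.deque here is just an arbitrary C-implemented constructor used for illustration; the effect on cProfile output may differ between Python versions):

    import cProfile
    from collections import deque

    def create_deque(*args, **kwargs):
        # A pure-Python frame for cProfile to hang the construction time
        # on; calls to the C-level constructor itself tend not to get
        # their own line in the profile output.
        return deque(*args, **kwargs)

    cProfile.run("for _ in xrange(100000): create_deque()")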
Re: [Python-Dev] PEP 3146: Merge Unladen Swallow into CPython
Collin Winter wrote: > Hey Jake, > ... >> Hmm. So cProfile doesn't break, but it causes code to run under a >> completely different execution model so the numbers it produces are >> not connected to reality? >> >> We've found the call graph and associated execution time information >> from cProfile to be extremely useful for understanding performance >> issues and tracking down regressions. Giving that up would be a huge >> blow. > > FWIW, cProfile's call graph information is still perfectly accurate, > but you're right: turning on cProfile does trigger execution under a > different codepath. That's regrettable, but instrumentation-based > profiling is always going to introduce skew into your numbers. That's > why we opted to improve oProfile, since we believe sampling-based > profiling to be a better model. > > Profiling was problematic to support in machine code because in > Python, you can turn profiling on from user code at arbitrary points. > To correctly support that, we would need to add lots of hooks to the > generated code to check whether profiling is enabled, and if so, call > out to the profiler. Those "is profiling enabled now?" checks are > (almost) always going to be false, which means we spend cycles for no > real benefit. At least my point was that I'd rather the machine code generated never called out to the profiler, but that at least calling the machine code itself would be profiled. That would show larger-than-minimal hot spots, but at least it wouldn't change the actual hotspots. > > Can YouTube use oProfile for profiling, or is instrumented profiling > critical? oProfile does have its downsides for profiling user code: > you see all the C-language support functions, not just the pure-Python > functions. That extra data might be useful, but it's probably more > information than most people want. YouTube might want it, though. > Does oprofile actually give you much of the python state? When I've tried that sort of profiling, it seems to tell me the C function that the VM is in, rather than the python functions being executed. Knowing that PyDict_GetItem is being called 20M times doesn't help me a lot if it doesn't tell me that it is being called in def foo(d): for x in d: for y in d: if x != y: assert d[x] != d[y] (Or whatever foolish function you do that in) I'd certainly like to know that 'foo()' was the one to attribute the 20M calls to PyDict_GetItem. Googling to search for oProfile python just gives me Unladen Swallow mentions of making oprofile work... :) > Assuming YouTube can't use oProfile as-is, there are some options: > - Write a script around oProfile's reporting tool to strip out all C > functions from the report. Enhance oProfile to fix any deficiencies > compared to cProfile's reporting. > - Develop a sampling profiler for Python that only samples pure-Python > functions, ignoring C code (but including JIT-compiled Python code). > - Add the necessary profiling hooks to JITted code to better support > cProfile, but add a command-line flag (something explicit like -O3) > that removes the hooks and activates the current behaviour (or > something even more restrictive, possibly). > - Initially compile Python code without the hooks, but have a > trip-wire set to detect the installation of profiling hooks. When > profiling hooks are installed, purge all machine code from the system > and recompile all hot functions to include the profiling hooks. > > Thoughts? 
> > Collin Winter John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
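For reference, here is the toy example from the message above in runnable form, with an arbitrary dictionary as input. It shows the level of attribution being asked for: an instrumenting profiler charges the work to the Python function, whereas a system-wide sampler mostly reports the C support routines.

import cProfile

def foo(d):
    # The quadratic toy function quoted above: much of its time ends up in
    # dict lookups (PyDict_GetItem at the C level).
    for x in d:
        for y in d:
            if x != y:
                assert d[x] != d[y]

d = dict((i, i) for i in range(500))
# cProfile reports the time against foo() itself; a sampler such as
# oprofile would mostly show the C helpers instead.
cProfile.run('foo(d)')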
Re: [Python-Dev] patch to make list.pop(0) work in O(1) time
> Right now the Python programmer looking to aggressively delete elements from > the top of a list has to consider the tradeoff that the operation takes O(N) > time and would possibly churn his memory caches with the O(N) memmove > operation. In some cases, the Python programmer would only have himself to > blame for not using a deque in the first place. But maybe he's a maintenance > programmer, so it's not his fault, and maybe the code he inherits uses lists > in a pervasive way that makes it hard to swap in deque after the fact. What > advice do you give him? > Or he could just set them to None. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
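A small sketch of the two options being weighed here, with made-up sizes:

from collections import deque

items = list(range(1000))

# "Set them to None": O(1) per element and no O(N) memmove, at the cost of
# leaving placeholder entries at the front of the list.
for i in range(100):
    items[i] = None

# The usual fix when the code can be rewritten: a deque pops from the left
# in O(1).
queue = deque(range(1000))
for _ in range(100):
    queue.popleft()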
Re: [Python-Dev] urlparse.urlunsplit should be smarter about +
Stephen J. Turnbull wrote: > David Abrahams writes: > > > > This is a bug report. bugs.python.org seems to be down. > > > > >>> from urlparse import * > > >>> urlunsplit(urlsplit('git+file:///foo/bar/baz')) > > git+file:/foo/bar/baz > > > > Note the dropped slashes after the colon. > > That's clearly wrong, but what does "+" have to do with it? AFAIK, > the only thing special about + in scheme names is that it's not > allowed as the first character. Don't you need to register the "git+file" scheme for urlparse to properly split it?

if protocol not in urlparse.uses_netloc:
    urlparse.uses_netloc.append(protocol)

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
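Sketched out, the registration being suggested looks like this; whether it resolves the reported urlunsplit behaviour may depend on the Python version, and the scheme extraction here is only illustrative:

import urlparse

url = 'git+file:///foo/bar/baz'
scheme = url.split(':', 1)[0]

# Tell urlparse that this scheme carries a network location, so the '//'
# part is split out (and put back) the same way it is for http URLs.
if scheme not in urlparse.uses_netloc:
    urlparse.uses_netloc.append(scheme)

print(urlparse.urlunsplit(urlparse.urlsplit(url)))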
Re: [Python-Dev] Reasons behind misleading TypeError message when passing the wrong number of arguments to a method
Giampaolo Rodolà wrote:
> >>> class A:
> ...     def echo(self, x):
> ...         return x
> ...
> >>> a = A()
> >>> a.echo()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: echo() takes exactly 2 arguments (1 given)
>
> I bet my last 2 cents this has already been raised in past but I want
> to give it a try and revamp the subject anyway.
> Is there a reason why the error shouldn't be adjusted to state that
> *1* argument is actually required instead of 2?

Because you wouldn't want A.echo() to say that it takes 1 argument and (-1 given)?

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
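To spell out the case being pointed at, a short sketch follows; the exact wording of the messages varies across Python versions:

class A(object):
    def echo(self, x):
        return x

a = A()

# Bound call: 'self' is counted, which is where the
# "takes exactly 2 arguments (1 given)" quoted above comes from.
try:
    a.echo()
except TypeError as e:
    print(e)

# Unbound call with nothing supplied at all: the case that would need the
# odd "(-1 given)" phrasing if 'self' were excluded from the count.
try:
    A.echo()
except TypeError as e:
    print(e)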
Re: [Python-Dev] PEP 3148 ready for pronouncement
Brian Quinlan wrote: > The PEP is here: > http://www.python.org/dev/peps/pep-3148/ > > I think the PEP is ready for pronouncement, and the code is pretty much > ready for submission into py3k (I will have to make some minor changes > in the patch like changing the copyright assignment): > http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback/python3/futures%3Fstate%3Dclosed > Your example here: for number, is_prime in zip(PRIMES, executor.map(is_prime, PRIMES)): print('%d is prime: %s' % (number, is_prime)) Overwrites the 'is_prime' function with the return value of the function. Probably better to use a different variable name. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
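A sketch of that example with the loop variable renamed; the primes list and executor setup are illustrative, loosely following the shape of the PEP's example:

import math
from concurrent import futures

PRIMES = [999983, 1000003, 1000033]

def is_prime(n):
    if n < 2 or n % 2 == 0:
        return n == 2
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    with futures.ProcessPoolExecutor() as executor:
        # 'prime' rather than 'is_prime', so the function is not shadowed
        # by its own boolean result after the first iteration.
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))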
Re: [Python-Dev] PEP 3148 ready for pronouncement
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Brian Quinlan wrote: > The PEP is here: > http://www.python.org/dev/peps/pep-3148/ > > I think the PEP is ready for pronouncement, and the code is pretty much > ready for submission into py3k (I will have to make some minor changes > in the patch like changing the copyright assignment): > http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback/python3/futures%3Fstate%3Dclosed > > The tests are here and pass on W2K, Mac OS X and Linux: > http://code.google.com/p/pythonfutures/source/browse/branches/feedback/python3/test_futures.py > > The docs (which also need some minor changes) are here: > http://code.google.com/p/pythonfutures/source/browse/branches/feedback/docs/index.rst > > Cheers, > Brian > I also just noticed that your example uses: zip(PRIMES, executor.map(is_prime, PRIMES)) But your doc explicitly says: map(func, *iterables, timeout=None) Equivalent to map(func, *iterables) but executed asynchronously and possibly out-of-order. So it isn't safe to zip() against something that can return out of order. Which opens up a discussion about how these things should be used. Given that your other example uses a dict to get back to the original arguments, and this example uses zip() [incorrectly], it seems that the Futures object should have the arguments easily accessible. It certainly seems like a common use case that if things are going to be returned in arbitrary order, you'll want an easy way to distinguish which one you have. Having to write a dict map before each call can be done, but seems unoptimal. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkv2cugACgkQJdeBCYSNAAPWzACdE6KepgEmjwhCD1M4bSSVrI97 NIYAn1z5U3CJqZnBSn5XgQ/DyLvcKtbf =TKO7 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
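For what it's worth, a sketch of the dict-keyed-by-future pattern alluded to above, which keeps the original argument reachable no matter what order results arrive in (the names here are made up):

from concurrent import futures

def work(n):
    return n * n

inputs = [1, 2, 3, 4, 5]

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Key each Future by the argument that produced it, so results can be
    # matched back to their inputs even when they complete out of order.
    future_to_arg = dict((executor.submit(work, n), n) for n in inputs)
    for future in futures.as_completed(future_to_arg):
        print('work(%r) -> %r' % (future_to_arg[future], future.result()))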
Re: [Python-Dev] email package status in 3.X
... >> IOW, if you're producing output that has to go into another system >> that doesn't take unicode, it doesn't matter how >> theoretically-correct it would be for your app to process the data in >> unicode form. In that case, unicode is not a feature: it's a bug. >> > This is not always true. If you read a webpage, chop it up so you get > a list of words, create a histogram of word length, and then write the output > as > utf8 to a database. Should you do all your intermediate string operations > on utf8 encoded byte strings? No, you should do them on unicode strings as > otherwise you need to know about the details of how utf8 encodes characters. > You'd still have problems in Unicode, given stuff like å =~ å where one is u'\xe5' and the other is u'a\u030a'. (Those will look the same depending on your Unicode support: IDLE shows them pretty much the same, while Thunderbird on Windows with my current font shows the second as 2 characters.) I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying "decode it to Unicode" isn't really sufficient. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
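A small sketch of the equality wrinkle being described, using unicodedata normalization:

import unicodedata

precomposed = u'\xe5'      # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = u'a\u030a'    # 'a' followed by COMBINING RING ABOVE

# Visually identical in many fonts, but not equal as code point sequences.
print(precomposed == decomposed)
# Canonical normalization makes the comparison behave as a human expects.
print(unicodedata.normalize('NFC', precomposed) ==
      unicodedata.normalize('NFC', decomposed))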
Re: [Python-Dev] versioned .so files for Python 3.2
Scott Dial wrote: > On 6/30/2010 2:53 PM, Barry Warsaw wrote: >> It might be amazing, but it's still a significant overhead. As I've >> described, multiply that by all the py files in all the distro packages >> containing Python source code, and then still try to fit it on a CDROM. > > I decided to prove to myself that it was not a significant issue to have > parallel directory structures in a .tar.bz2, and I was surprised to find > it much worse at that then I had imagined. For example, > > # cd /usr/lib/python2.6/site-packages > # tar --exclude="*.pyc" --exclude="*.pyo" \ > -cjf mercurial.tar.bz2 mercurial > # du -h mercurial.tar.bz2 > 640Kmercurial.tar.bz2 > > # cp -a mercurial mercurial2 > # tar --exclude="*.pyc" --exclude="*.pyo" \ > -cjf mercurial2.tar.bz2 mercurial mercurial2 > # du -h mercurial.tar.bz2 > 1.3Mmercurial2.tar.bz2 > I believe the standard (and largest) block size for .bz2 is 900kB, and I *think* that is uncompressed. Though I know that bz2 can chain, since it can compress all NULL bytes extremely well (multiple GB down to kB, IIRC). There was a question as to whether LZMA would do better here, I'm using 7zip, but .xz should perform similarly. $ du -sh mercurial* 2.6Mmercurial 2.6Mmercurial2 366K mercurial.tar.bz2 734K mercurial2.tar.bz2 303K mercurial.7z 310K mercurial2.7z So LZMA with the 'normal' compression has a big enough window to find almost all of the redundancy, and 310kB is certainly a very small increase over the 303kB. And clearly bz2 does not, since 734kB is actually slightly more than 2x 366kB. John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
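The block-size effect is easy to reproduce from Python. A rough sketch follows; the sizes are chosen so the duplicate falls outside bz2's roughly 900kB block, and the lzma module is only in the stdlib from 3.3 onwards (or available as a backport):

import bz2
import lzma
import os

blob = os.urandom(1024 * 1024)   # ~1MiB of incompressible data
doubled = blob + blob            # an exact byte-for-byte duplicate appended

# bz2 compresses in ~900kB blocks, so it cannot "see" the repeat and the
# doubled input should come out at roughly twice the size of the original.
print(len(bz2.compress(blob)), len(bz2.compress(doubled)))

# LZMA's much larger window should spot the repeat, so the doubled input
# should be only slightly larger than the single copy.
print(len(lzma.compress(blob)), len(lzma.compress(doubled)))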
[Python-Dev] Intended behavior of backlash in raw strings
I'm trying to determine if this is intended behavior:

>>> r"\""
'\\"'
>>> r'\''
"\\'"

Normally, the quote would end the string, but it gets escaped by the preceding '\'. However, the preceding slash is interpreted as 'not a backslash' because of the raw indicator, so it gets left in verbatim. Note that it works anywhere:

>>> r"testing \" backslash and quote"
'testing \\" backslash and quote'

It happens that this is the behavior I want, but it seemed just as likely to be an error. I tested it with python2.5 and 2.6 and got the same results. Is this something I can count on? Or is it undefined behavior and I should really not be doing it?

John =:-> ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Fixing #7175: a standard location for Python config files
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > * that said, Windows seems much slower than Linux on equivalent >hardware, perhaps attempting to open files is intrinsically more >expensive there? Certainly it's not safe to assume conclusions drawn >on Linux will apply equally well on Windows, or vice versa. I don't know what the specific issue is here, but adding entries to sys.path makes startup time *significantly* slower. I happen to use easy_install since Windows doesn't have its own package manager. Unfortunately the default of creating a new directory and adding it to easy_install.pth is actually pretty terrible. On my system, 'len(sys.path)' is 72 entries long. 62 of that is from easy-install. A huge amount of that is all the zope and lazr. dependencies that are needed by launchpadlib (not required for bzr itself.) With a fully hot cache, and running the minimal bzr command: time bzr rocks --no-plugins real 0m0.395s vs real 0m0.195s So about 400ms to startup versus 200ms if I use the packaged version of bzr (which has a very small search path). John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkxlm0kACgkQJdeBCYSNAAMSEgCfW24XNG3h20UkFdEODNMob6uR nisAoLes/usoHd1YRDIkzxfIJohPjSer =YO9b -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Mercurial Schedule
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 11/19/2010 7:50 AM, Nick Coghlan wrote: > On Fri, Nov 19, 2010 at 5:43 PM, Georg Brandl wrote: >> Am 19.11.2010 03:23, schrieb Benjamin Peterson: >>> 2010/11/18 Jesus Cea : -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 18/11/10 18:32, "Martin v. Löwis" wrote: > In general, I'm *also* concerned about the lack of volunteers that > are interested in working on the infrastructure. I wish some of the > people who stated that they can't wait for the migration to happen > would work on solving some of the remaining problems. Do we have a exhaustive list of mercurial "to do" things?. >>> >>> http://hg.python.org/pymigr/file/1576eb34ec9f/tasks.txt >> >> Uh, that's the list of things to do *at* the migration. The todo list is >> >> http://hg.python.org/pymigr/file/1576eb34ec9f/todo.txt > > That kind of link is the sort of thing that should really be in the > PEP... (along with the info about where to find the hooks repository, > specific URLs for at least 3.x, 3.1 and 2.7, pointers to a draft FAQ > to replace the current SVN focused FAQ, etc) > Well, if it goes in the pep, you should at least use the 'always the most recent' version :) http://hg.python.org/pymigr/file/tip/todo.txt John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkzmp/gACgkQJdeBCYSNAAOwjgCeOda2XeNvxOR0UnFuQOfN0zZt jGIAoIuarrvIz3oQ+o1jtnH5dFoFk35t =JJo8 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] svn outage on Friday
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 2/15/2011 8:03 AM, Benjamin Peterson wrote: > 2011/2/15 Victor Stinner : >> Le mardi 15 février 2011 à 09:30 +0100, "Martin v. Löwis" a écrit : >>> I'm going to perform a Debian upgrade of svn.python.org on Friday, >>> between 9:00 UTC and 11:00 UTC. I'll be disabling write access during >>> that time. The outage shouldn't be longer than an hour. >> >> It's time to move to Mercurial! :-) > > And doubtless there will be times when Mercurial must be upgraded, too... > True, but on those days you just keep committing locally... John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk1aqpkACgkQJdeBCYSNAAMW3gCgt/Ea75R/HfKM4KlmGmCmfjtL BBYAoI89GYsxrsa4/Eefifg3i6+Euv+T =Kz3A -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Actual Mercurial Roadmap for February (Was: svn outage on Friday)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 2/22/2011 9:41 AM, anatoly techtonik wrote: > On Fri, Feb 18, 2011 at 4:00 PM, Dirkjan Ochtman wrote: >> On Fri, Feb 18, 2011 at 14:41, anatoly techtonik wrote: >>> Do you have a public list of stuff to be done (i.e. Roadmap)? >>> BTW, what is the size of Mercurial clone for Python repository? >> >> There is a TODO file in the pymigr repo (though I think that is >> currently inaccessible). > > Can you provide a link? I don't know where to search. Should we open a > src.python.org domain? > >> I don't have a recent optimized clone to check the size of, yet. > > What is the size of non-optimized clone then? I know that a clone of > such relatively small project as MoinMoin is about 250Mb. ISTM that > Python repository may take more than 1GB, and that's not acceptable > IMHO. BTW, what do you mean by optimization - I hope not stripping the > history? Mercurial repositories are sensitive to the order that data is inserted into them. So re-ordering the topological insert can dramatically improve compression. The quick overview is that in a given file's history, each diff is computed to the previous text in that file. So if you have a history like: foo | \ foo baz bar foo | / baz foo bar This can be stored as either: foo +bar -bar +baz +bar This matters more if you have a long divergent history for a while: A |\ B C | | D E | | F G : : X Y |/ Z In this case, you could end up with contents that look like: A +B +D +F +X -BDFX+C +E +G +Y +ABDFXZ Or you could have the history 'interleaved': A +B -B+C -C+BD -BD+CE -BDF+CEG -... There are tools that take a history file, and rewrite it to be more compact. I don't know much more than that myself. But especially with something like an svn conversion which probably works on each revision at a time, where people are committing to different branches concurrently, I would imagine the conversion could easily end up in the pessimistic case. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk1j6qoACgkQJdeBCYSNAAPzPgCdEOJsHf4Xf4lZH+jHX42FQb8J sQoAn3JuCmDcsyv0JZpXtbVJoGewA+7t =M8DI -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PyCObject_AsVoidPtr removed from python 3.2 - is this documented?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/7/2011 3:56 AM, Terry Reedy wrote: > On 3/6/2011 6:09 PM, Barry Scott wrote: >> I see that PyCObject_AsVoidPtr has been removed from python 3.2. >> The 3.2 docs do not seem to explain this has happened and what >> to replace it with. >> >> I searched the 3.2 docs and failed to find PyCObject_AsVoidPtr. >> I looked at the whats new page and the API PEP. Did I miss >> where this is documented? > > Georg recently reaffirmed on a tracker issue that when something is > removed from the code, it is removed from the docs also. So the place to > look for a deprecation notice and replacement suggestion is in the last > release where present. > It might be interesting to just have a stub entry with: PyCObject_AsVoidPtr (This was deleted in 3.2, last available in [3.1]) Might end up being too cluttered, but certainly helps the people who hit the problem. Especially since, AIUI, deprecations are suppressed by default now. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk10mPEACgkQJdeBCYSNAAMNCwCfUIm79vsW7KSuibBRLZUFA4P2 VooAn1Muo6yeciMBSO+ndlaq10VE5lxV =ewPb -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Suggest reverting today's checkin (recursive constant folding in the peephole optimizer)
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > I have always felt uncomfortable with *any* kind of optimization -- > whether AST-based or bytecode-based. I feel the cost in code > complexity is pretty high and in most cases the optimization is not > worth the effort. Also I don't see the point in optimizing expressions > like "3 * 4 * 5" in Python. In C, this type of thing occurs frequently > due to macro expansion and the like, but in Python your code usually > looks pretty silly if you write that. Just as a side comment here, I often do this sort of thing when dealing with time-based constants, or large constants. It is a bit more obvious that 10*1024*1024 is 10MiB than that 10485760 is, especially since you can't put commas in your constants. 10,485,760 would at least make it immediately obvious it was 10M not 104M or something else. Similarly, is 10,800s 3 hours? 3*60*60 certainly is. I don't think I've done that sort of thing in anything performance critical. But I did want to point out that writing "X*Y*Z" as a constant isn't always "pretty silly". John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk17c0sACgkQJdeBCYSNAAMUrQCdEhissWuvTElIc6Wy/2qotzaU xz4AnRO+ND/3NkKWC7Bbu78ACjs2X920 =QR/2 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
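A couple of lines make the point; the names are invented, and the peephole optimizer folds these expressions down to plain constants at compile time anyway:

# 10 MiB: the units stay visible, unlike 10485760.
CACHE_SIZE = 10 * 1024 * 1024
# 3 hours in seconds: easier to verify than 10800.
TIMEOUT = 3 * 60 * 60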
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/20/2011 5:06 AM, R. David Murray wrote: > On Thu, 17 Mar 2011 14:33:00 +0100, wrote: >> On Thu, 17 Mar 2011 09:24:26 -0400 >> "R. David Murray" wrote: >>> >>> It would be great if rebase did work with share, that would make a >>> push race basically a non-issue for me. >> >> rebase as well as strip destroy some history, meaning some of your >> shared clones may end up having their working copy based on a >> non-existent changeset. I'm not sure why rebase would be worse that >> strip in that regard, though. > > Well, it turns out that this completely doesn't work, though at first > it appeared to (and so I pushed). > > I had a push race, so I did hg pull; hg rebase. Then I looked at the > log, and I could (apparently) see my change sets on the top of the > stack. So I pushed. Victor then asked why one of my commits deleted > Tools/demo/README, and then the next commit restored it. > > What I was attempting to push was a doc change in 3.1 that I had then > merged to 3.2 and default. What I saw when looking closer at the log > (after Victor pointed it out) was that my merge commits had lost their > parents. > > I thought that at worst a rebase would screw up my local history, but > apparently I managed to push some sort of damaged history. The doc > change only got applied to default, since that's the branch I > happened to be in when I did the rebase. > > Needless to say, I'm avoiding rebase henceforth. AIUI, rebase defaults to always omitting merge changesets. Under the assumption that the branch you would merge is the one you are targeting to rebase upon. So those merges are 'not interesting' once you are rebased. Obviously this has failure cases (when the branch being merged is not the one you are targeting.) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2GSdMACgkQJdeBCYSNAAMf5wCgox24+LoJRKtzJHmCFTWcZnjI MwIAniISqH9xDR/9g5EiXEsg5Wk66jeN =39Oi -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:44 AM, "Martin v. Löwis" wrote: >> My understanding is that svn does not detect fast forwards, only lack >> of conflicts, and therefore in case of concurrent development it is >> possible that the repository contains a version that never existed in >> any developer's workspace. > > I can't understand how you draw this conclusion ("therefore"). > > If you do an svn up, it merges local changes with remote changes; > if that works without conflicts, it tells you what files it merged, > but lets you commit. > > Still, in this case, the merge result did exist in the sandbox > of the developer performing the merge. Subversion never ever creates > versions in the repository that didn't before exist in some working > copy. The notion that it may have done a server-side merge or some > such is absurd. It does so at the *tree* level, not at an individual file level. 1) svn doesn't require you to run 'svn up' unless there is a direct change to the file you are committing. So there is plenty of opportunity to have cross-file failures. The standard example is I change foo.py's foo() function to add a new mandatory parameter. I 'svn up' and run the test suite, updating all callers to supply that parameter. You update bar.py to call foo.foo() not realizing I'm updating it. You 'svn up' and run the test suite. Both my test suite and your test suite were perfect. So we both 'svn commit'. There is now a race condition. Both of our commits will succeed. (revisions 10 and 11 lets say). Revision 11 is now broken (will fail to pass the test suite.) 2) svn's default lack of tree-wide synchronization means that no matter how diligent we are, there is a race condition. (As I'm pretty sure both 'svn commit' can run concurrently since they don't officially modify the same file.) 3) IIRC, there is a way to tell "svn commit", "this is my base revno, if the server is newer, abort the commit". It is used by bzr-svn to ensure we always commit tree-wide state safely. However, I don't think svn itself makes much use of it, or makes it easy to use. Blindly merging in trunk / rebasing your changes has the same hole. Though you at least can be aware that it is there, rather than the system hiding the fact that you were out of date. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2HMxoACgkQJdeBCYSNAAMaewCfW3DK8hW4hBKOA+5zbyaxyptH MMQAoKGw2uWUWafBK2+Jl5A6XMK+0z9f =5R0+ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:48 PM, "Martin v. Löwis" wrote: >> It does so at the *tree* level, not at an individual file level. > > Thanks - I stand corrected. I was thinking about the file level only (at > which it doesn't do server-side merging - right?). > > Regards, > Martin AIUI, you are correct, svn doesn't do server-side content merging. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbC4ACgkQJdeBCYSNAANhCQCgpmKnD7QmZQ8Cv1DGPkJw22Q0 /uYAoMc0VNoLq2VMFpu3+uzS2M93x08P =TL8l -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 2:05 AM, Terry Reedy wrote: > On 3/21/2011 7:14 AM, Nick Coghlan wrote: > >> hg broadens the check and complains if *any* files are not up to date >> on any of the branches being pushed, thus making it a requirement to >> do a hg pull and merge on all affected branches before the hg push can >> succeed. In theory, this provides an opportunity for the developer >> doing the merge to double check that it didn't break anything, in >> practice (at least in the near term) we're more likely to see an >> SVN-like practice of pushing the merged changes without rerunning the >> full test suite. > > Now that you and John have (finally) explained how 'non-conflict' merges > can actually contain a conflict (and there could be such for docs as > well as code*), I think there is a pretty clear guideline for when to > re-test. > > If my change adds or changes in one file a reference to something in > another file, or changes or subtracts in one file something that might > be referenced by other files, and the the change could affect the > cross-file linkage, and the pulled changes merged with my changes might > have such linkages, then I should rerun tests on the new merged state. > (I say 'might' because it is easier and safer to just rerun than to > check very hard.) Otherwise, it should be safe not to. Correct? > I would agree with 'might' with the very large caveat that cross-file linkage tends to exist in far more places than most people think. I know I've done a few "single line fixes" where the test suite showed me that it wasn't as simple as that. :) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbIUACgkQJdeBCYSNAAPsbQCbB53ctC5DQ8hQo1U/1crNsxLc d5EAoIF1/x9hBW2z4X9EfGNaaGM3V8A+ =2HN+ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 10:32 PM, Nick Coghlan wrote: > On Tue, Mar 22, 2011 at 3:16 AM, Raymond Hettinger > wrote: >> I don't think that is the main source of complexity. >> The more difficult and fragile part of the workflows are: >> * requiring commits to be cross-linked between branches >> * and wanting changesets to be collapsed or rebased >> (two operations that destroy and rewrite history). > > Yep, that sounds about right. I think in the long run the first one > *will* turn out to be a better work flow, but it's definitely quite a > shift from our historical way of doing things. > > As far as the second point goes, I'm coming to the view that we should > avoid rebase/strip/rollback when intending to push to the main > repository, and do long term work in *separate* cloned repositories. > Then an rdiff with the relevant cpython branch will create a nice > collapsed patch ready for application to the main repository (I have > yet to succeed in generating a nice patch without using rdiff, but I > still have some more experimentation to do with MvL's last proposed > command for that before giving up on the idea). > > Cheers, > Nick. > I don't know what mercurial specifically supports. I believe git has a '--squash' option at merge/commit time. And bzr has "bzr revert - --forget-merges". Which lets you do a merge as normal, and then tell it to forget all of the history that you just merged (treating the commit as just a collapsed patch). It is trivial to do this as a DVCS (it is just *omitting* the extra parent pointer for commit). Though Mercurial's model of extra heads existing in the branch may make it a bit trickier. (If you omit the head when committing, it still stays around looking like it is unmerged, so you need 1 more step of killing the extra head.) Regardless, I'm sure it is something that could be implemented and streamlined for Python's use. Maybe someone knows a Mercurial command to already do it? John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2IbgIACgkQJdeBCYSNAANAlACfZkegH6t9y0PUH9xufcbCB4PX 8ykAn0A6i7D/+LJ+9+9OwoA27hkAiHUc =eh4I -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 6:53 PM, Barry Warsaw wrote: > On Mar 21, 2011, at 01:19 PM, R. David Murray wrote: > >> So you are worried about the small window between me doing an 'svn up', >> seeing no changes, and doing an 'svn ci'? I suppose that is a legitimate >> concern, but considering the fact that if the same thing happens in hg, >> the only difference is that I know about it and have to do more work[*], >> I don't think it really changes anything. Well, it means that if your >> culture uses the "always test" workflow you can't be *perfect* about it >> if you use svn[**], which I must suppose has been your (and Stephen's) >> point from the beginning. >> >> [*] that is, I'm *not* going to rerun the test suite even if I have to >> pull/up/merge, unless there are conflicts. > > I think if we really want full testing of all changesets landing on > hg.python.org/cpython we're going to need a submit robot like PQM or Tarmac, > although the latter is probably too tightly wedded to the Launchpad API, and I > don't know if the former supports Mercurial. > > With the benefits such robots bring, it's also important to understand the > downsides. There are more moving parts to maintain, and because landings are > serialized, long test suites can sometimes cause significant backlogs. Other > than during the Pycon sprints, the backlog probably wouldn't be that big. > > Another complication we'd have is running the test suite cross-platform, but I > suspect that almost nobody does that today anyway. So the buildbot farm would > still be important. > > -Barry I'm personally a huge fan of 2(multi)-tier testing. So you have a basic (and fast) test suite that runs across all your modules before every commit in your mainline. Then a much larger (and slower) test suite that runs regression testing across all platforms/etc that runs asynchronously. Which gives you some basic protection against brown-bag failures (you committed a typo in the main Python.h file, breaking everyone). And still avoids a huge pushback on throughput. I think Launchpad is currently looking to do batch-PQM. So that every commit to the final mainline must pass the full test suite. However the automated bot can grab multiple requests from the queue at a time, on the premise that 90% of the time, none of them will break anything. So you end up with a 100% stable trunk (any given revision committed by the bot did pass the full test suite), but still get most of the throughput. Also, by working in batch mode, if you have 20 submissions, and submission #2 would have broken the test suite, it only bumps some (say the first 5) submissions, and the other 15 still get to land in an orderly fashion. You could even put any bumped submissions into a deferred 'one-by-one' queue. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Ib6QACgkQJdeBCYSNAAPJWwCggyeS5DZlm/DR7bo+1AmpD9rr YmMAoLFmmu7VBTJJX/khyigaOPU9dDE9 =68rK -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/21/2011 9:19 PM, Barry Warsaw wrote: > On Mar 21, 2011, at 11:56 AM, Daniel Stutzbach wrote: > >> Keeping the repository clean makes it easier to use a bisection search to >> hunt down the introduction of a bug. If every developer's intermediate >> commits make it into the main repository, it's hard to go back to an older >> revision to test something, because many of the older revisions will be >> broken in some way. > > So maybe this gets at my earlier question about rebase being cultural > vs. technology, and the response about bzr having a strong sense of mainline > where hg doesn't. > > I don't use the bzr-bisect plugin too much, but I think by default it only > follows commits on the main line, unless a bisect point is identified within a > merge (i.e. side) line. So again, those merged intermediate changes are > mostly ignored until they're needed. > > -Barry Bazaar is, to my knowledge, the only DVCS that has a "mainline" emphasis. Which shows up in quite a few areas. The defaults for 'log', having branch-stable revnos[1], and the 'bzr checkout' model for managing a mainline. The cultural aspects of this also show up. Since we default to not showing merge commits in 'bzr log' (or even when shown, they are shown indented), there is less impetus to remove them by default.[2] Guido mentioned that when he does long-term development, there are a lot of advantages to having intermediate snapshots that he can roll back to, which have questionable benefit in being in the final repository, since some states were dead-ends that weren't pursued. I certainly can understand that, but there is at least an argument that preserving "this approach is a dead-end" also has some merit. If someone comes to you and says "why didn't you implement it this way", you can point to it and say "I tried, it didn't work". Which would also give them a point to start if they really think it is an avenue to pursue further. I also remember something my Math teacher once said. That some of the early proofs were so polished that nobody knew how to "reproduce" them. Sure, anyone could follow the final proof and say "yes that is correct", but nobody could *learn* from the proof, because it was missing those human-level steps of intuition that helped understand *how* the proof was developed, rather than just what the final state was. That is not to say that the python.org primary repository should be a teaching repository. However, I know that I'm personally quite curious to see how Guido does his work. Insight into the minds of how interesting people do interesting things and all. Another key point of how tools influence and interact with culture. Because bzr has a strong mainline bias, it leads to people interacting with the tool differently, which strengthens and reinforces it. In Mercurial, it would be trivial to add a "hg log --only-mainline" that would always follow a specific parent and show that to you. However, because Mercurial doesn't default to that view, people don't try to preserve the mainline. For example, in these graphs are logically the same, but result in a very different mainline view: A -- B -- C \\ D -- E A -- B -- C -- E \ / D -- If you consider D as "I did my work" and E as "and I integrated that back into trunk". 
If you merge the trunk revision C into E, and then push, you end up with this graph: A -- B -- D -- E \ / C -- And suddenly the revision which was an "important" C change is now gone on the mainline, and your personal "and I did D" is now a primary revision. It doesn't matter much for a single revision D and C, but for anything long lived, you end up with 100 Guido exploratory D revisions, and some 50+ other python-dev super-stable trunk C revisions. And unless the tool helps you preserve the ordering[3], you really don't want to trying to treat them with different levels of authority. Hence, you tend to collapse, because you really can only trust "E" to be a final stable change. John =:-> [1] any copy of "trunk" has the same revision-id matching revno 1234, in Mercurial the numbers in 'hg log' correspond to the ordering in the physical repository, so depend on what order revisions were merged, etc. [2] The downside is people having their work merged and then wondering "where did my commits go", and it looking like this guy named Patch was an extremely productive developer of Launchpad and Bazaar (Patch Queue Manager.) [3] In Bazaar, you can set 'append_revisions_only = True' for integration branches. Which will refuse to push E if it would remove C as a mainline revision. -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Id5sACgkQJdeBCYSNAAMnjgCbBFiMtdkj8hvJ19dPn3Maz3Bo TrwAmwfgmg0YMGaCPM+W+kAVVDVvrOlY =6oWG -END PGP SIGNATURE-
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 3:22 PM, Barry Warsaw wrote: > On Mar 22, 2011, at 12:07 PM, Adrian Buehlmann wrote: > >> FWIW, Mercurial's "mainline" is the branch with the name 'default'. This >> branch name is reserved, and it implies that the head with the highest >> revision number from that branch will be checked out on 'hg clone'. > > I think that's different than what John was describing, or perhaps Python's > use of it has the effect of being different. IIUC, in Mercurial, within the > default branch there's no clear "main line" of development assigned to a path > within the DAG. All paths are created equal, so it's not possible to > e.g. have log or bisect suppress one path containing feature sub-commits from > the point of departure to the point of recombination (merge). > > -Barry If you think of "mainline" as "trunk" or "master", then yes, Mercurial's concept is "default". If you think of "mainline" as a series of commits in a linear sequence, then there isn't specifically a "main" "line" in Mercurial's world view. In Bazaar, the sequence of first-parents in the history of a branch's tip is considered special. Based on the concept that: cd me bzr merge ../you bzr commit -m "I merged you" Is not the same *social* thing as: cd ../you bzr merge ../me bzr commit -m "You merged me" Logically, the contents of the tree should be the same. The name of the person doing it is often different. If you have "blessed branch" approach to development (which almost all projects do at some level or other) then merging something into the "blessed branch" is not the same as merging the "blessed branch" into your personal branch. For python, the "blessed branch" is http://hg.python.org/cpython. Consider, if someone merges/pushes a change to that branch, it has very different social consequences than if someone merges that branch into their own feature branch. Namely, it is going to be part of the next released tarball. Note that hg does distinguish it a little bit. If you look here: http://hg.python.org/cpython/graph There is a clear separation of what was merged into the line on the left, versus what was committed elsewhere. However, there is no distinction on: http://hg.python.org/cpython/shortlog Note also that because 'hg log' doesn't default to only showing you things on a given branch, you can end up with log views like this: http://hg.python.org/cpython/graph/693415328363 Where "Relax timing check" has nothing to do with "Issue #10833", but because the data in the repository has Relax Timing Check placed after Issue #10833 physically on disk, they end up both getting shown in the log view. It certainly is much worse in http://hg.python.org/cpython/shortlog/693415328363 Where there is no way to tell that the revisions are unrelated, and just happen to physically reside in same repository. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Iu5kACgkQJdeBCYSNAAMVQACgpGd53yMnQBjJXuoVLElxC6qN OqwAmwTdxjIS5tjkf0+iK62DvT/uPLdz =1YLF -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Hg: inter-branch workflow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 3:25 PM, Barry Warsaw wrote: > On Mar 22, 2011, at 12:52 AM, Éric Araujo wrote: > >> Bazaar apparently has a notion of mainline whereas Mercurial believes >> that all changesets are created equal. The tools are different. > > I'm curious: what are the benefits of the Mercurial model? > > -Barry > - From discussions in #bzr, I would say the primary complaint to our model is that people feel it means that not all revisions are created equal. Though honestly, that is true. Getting a revision into Linus's kernel tree is not the same as getting it into my own branch of the kernel. I do remember discussions where people felt bad that after integrating someone else's work, by default it showed their name, rather than the names of the people who had done the actual work. (Though you could always use --author if you really wanted to change the name on that commit.) There is a technical argument. A merging B should be idempotent to B merging A. And certainly, the file-content should be the same after either operation. But as I mentioned in the other email, Barry merging my work and committing it to the hg.python.org/cpython branch is very different socially than me merging cpython's branch. So bzr is somewhat distinguishing "things done on this branch" from "things done in other branches that I have included into this branch.". Simplicity. If you enforce the mainline concept with "append_revisions_only=True" on your trunk branches. Then people who try to do: cd my-work merge trunk && commit push trunk Will be blocked. While if you did cd trunk merge my-work && commit push trunk It would succeed. If a tool doesn't give you a benefit from maintaining a mainline, there is overhead in preserving it. If your tool defaults to fast-forward merging, it is also really hard to get the behavior. (git does this with a config option to disable it, bzr has it as an option to merge [defaulting to off], and I'm not sure what Mercurial's default is.) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2Ivi4ACgkQJdeBCYSNAAOhzwCg0tD1KdR53fH7OEzhom0IaPOL niYAn2KsY2jPLJmbXWf8sIauMW+y2hHC =NFdA -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Workflow proposal
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/23/2011 4:30 AM, Stephen J. Turnbull wrote: > Antoine Pitrou writes: > > > Now, "hg strip" should definitely be absent of any recommended or even > > suggested workflow. It's a power user tool for the experimented > > developer/admin. Not the average hg command. > > So what you're saying is that Mercurial by itself can't support the > recommended workflow, because any "collapsing" of commits requires > stripping, whether done by hg strip or implicitly by some other > "non-average" hg command. > Just as an aside, and I might be wrong. But reading through what strip does, (and from my knowledge of the disk layout) it can't actually be atomic. So if you kill it at the wrong time, it will have corrupted your repository. At least, that is from what I can tell. Which is that it has to rewrite each file history to omit the revisions you are stripping, and then rewrite the revision history to do the same. It would be possible to be 'stable' if you wrote a write-ahead-log and did all the work on the side, and then any client that tries to read or write to the repository finishes up the steps. But the individual file histories refer to the global revision history (by index), so you don't have a 'top-down' view that makes it all atomic by changing the top level object to point to the new lower level objects. It is possible that they only rewrite the revisions file, leaving blanks for the old references. But the statement "it rewrites the numbers" means it is collapsing the offsets in the index. http://mercurial.selenic.com/wiki/Strip I definitely would be very leery of putting that in any sort of recommended documentation. It also makes me understand more why hg folks value having "hg check" run very quickly... John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J2/0ACgkQJdeBCYSNAAPFCACfeGHMB3/td31EuyateqzcoNFx 87sAoKxDt8i1rqllHogRBMxTGUDzSsdd =RgPe -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] I am now lost - committed, pulled, merged, what is "collapse"?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/22/2011 11:05 PM, Barry Warsaw wrote: > On Mar 23, 2011, at 07:31 AM, Nick Coghlan wrote: > >> On Wed, Mar 23, 2011 at 4:25 AM, Barry Warsaw wrote: >>> It probably wouldn't be a bad idea to add a very fast "smoke test" for the >>> case where you get tripped up on the merge dance floor. When that happens, >>> you could run your localized tests, and then a set of tests that run in >>> just a >>> minute or two. >> >> What would such a smoke test cover, though? It's hard to think of >> anything particularly useful in the middle ground between "Can you run >> regrtest *at all*?" and "make quicktest". > > make quicktest runs 340 tests, and I'm certain many don't need to be run in a > smoke test. E.g. > > test_aifc > test_colorsys > test_concurrent_futures > (that's as far as it's gotten so far ;) > > Or you could time each individual test, choose a cut off for total test run > and then run whatever you can within that time (on some reference machine). > Or maybe just remove the longest running 50% of the quick tests. > > -Barry bzr can run its 30,000 tests in about 15-30min depending on your specific hardware and platform (it also supports parallel running of the test suite). I think it takes about 1hr running single-threaded on Windows because "os.rename()" is a fairly slow operation and we do it a lot in most tests (it is part of our locking primitive). Unit-tests (tests directly on a function without setting up lots of external state) generally can execute in milliseconds. I don't specifically know what is in those 340 tests, but 18min/340 = 3.2s for each test. Which is *much* longer than simple smoke tests would have to be. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J3WUACgkQJdeBCYSNAAMI2ACgl6obH9WKlmRiK4K4ib1g6SR7 KqkAn1LNrlBaUTf/sc5s30tZq3F3hmNl =DtQ8 -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Workflow proposal
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 3/23/2011 1:23 PM, Dirkjan Ochtman wrote: > On Wed, Mar 23, 2011 at 12:39, John Arbash Meinel > wrote: >> Just as an aside, and I might be wrong. But reading through what strip >> does, (and from my knowledge of the disk layout) it can't actually be >> atomic. So if you kill it at the wrong time, it will have corrupted your >> repository. >> >> At least, that is from what I can tell. Which is that it has to rewrite >> each file history to omit the revisions you are stripping, and then >> rewrite the revision history to do the same. It would be possible to be >> 'stable' if you wrote a write-ahead-log and did all the work on the >> side, and then any client that tries to read or write to the repository >> finishes up the steps. But the individual file histories refer to the >> global revision history (by index), so you don't have a 'top-down' view >> that makes it all atomic by changing the top level object to point to >> the new lower level objects. > > The reason that shouldn't happen is the ordering. If we strip the > changelog first (what you call the global revision history), other > clients won't be able to "find" the any file-level revisions only > referenced by the revision just stripped, so it should be atomic. > > Cheers, > > Dirkjan http://mercurial.selenic.com/wiki/Strip Thats only true if you are stripping only from the top. According to the strip page, you also might re-order the numbers. Also, even with stripping the changelog first, it still leaves you with data in your repo that is going to suddenly think it is associated with the *next* commit you do. (So I make a change to 'foo.txt' commit it, then strip, the next commit I only change 'bar.txt'. If the strip was canceled 'hg log foo.txt' would include the latest revision as modifying foo.txt) John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2J6dYACgkQJdeBCYSNAAMQMQCfXvD4dGOVV8LB9LmtMNqXeHys 5xkAoJBWAXXVbZcCKC1GXDPjUMSNbVtn =k1FG -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 396, Module Version Numbers
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... > #. ``__version_info__`` SHOULD be of the format returned by PEP 386's >``parse_version()`` function. The only reference to parse_version in PEP 386 I could find was the setuptools implementation which is pretty odd: > > In other words, parse_version will return a tuple for each version string, > that is compatible with StrictVersion but also accept arbitrary version and > deal with them so they can be compared: > from pkg_resources import parse_version as V V('1.2') > ('0001', '0002', '*final') V('1.2b2') > ('0001', '0002', '*b', '0002', '*final') V('FunkyVersion') > ('*funkyversion', '*final') bzrlib has certainly used 'version_info' as a tuple indication such as: version_info = (2, 4, 0, 'dev', 2) and version_info = (2, 4, 0, 'beta', 1) and version_info = (2, 3, 1, 'final', 0) etc. This is mapping what we could sort out from Python's "sys.version_info". The *really* nice bit is that you can do: if sys.version_info >= (2, 6): # do stuff for python 2.6(.0) and beyond Doing that as: if sys.version_info >= ('2', '6'): is pretty ugly. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2cLIcACgkQJdeBCYSNAAPT9wCg01L2s0DcqXE+zBAVPB7/Ts0W HwgAnRRrzR1yiQCSeFGh9jZzuXYrHwPz =0l4b -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
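A short sketch of the comparison idiom being described; the library version tuple here is illustrative:

import sys

# A sys.version_info-style tuple: integer components compare naturally.
version_info = (2, 4, 0, 'beta', 1)

if sys.version_info >= (2, 6):
    print('running on Python 2.6 or newer')

if version_info >= (2, 4):
    print('library is 2.4 or newer')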
Re: [Python-Dev] Test cases not garbage collected after run
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 4/14/2011 1:23 AM, Martin (gzlist) wrote: > On 07/04/2011, Michael Foord wrote: >> On 07/04/2011 20:18, Robert Collins wrote: >>> >>> Testtools did something to address this problem, but I forget what it >>> was offhand. > > Some issues were worked around, but I don't remember any comprehensive > solution. > >> The proposed "fix" is to make test suite runs destructive, either >> replacing TestCase instances with None or pop'ing tests after they are >> run (the latter being what twisted Trial does). run-in-a-loop helpers >> could still repeatedly iterate over suites, just not call the suite. > > Just pop-ing is unlikely to be sufficient in practice. The Bazaar test > suite (which uses testtools nowadays) has code that pops during the > run, but still keeps every case alive for the duration. That trebles > the runtime on my memory-constrained box unless I add a hack that > clears the __dict__ of every testcase after it's run. > > Martin I think we would be ok with merging the __dict__ clearing as long as it doesn't do it for failed tests, etc. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk2m1oAACgkQJdeBCYSNAAPHmwCfQSNW8Pk7V7qx6Jl/gYthFVxE p0cAn0XRvRR+Rqb+yiJnaVEzUOBdwOpf =19YJ -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
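Something along these lines would be the shape of that hack, sketched against unittest's TextTestResult; the class name is hypothetical, and the "keep failed tests alive" condition is only a sketch:

import unittest

class MemorySavingResult(unittest.TextTestResult):
    """Drop a test's instance attributes once it has run, but only if it
    neither failed nor errored, so broken tests keep their state around
    for post-mortem inspection."""

    def stopTest(self, test):
        super(MemorySavingResult, self).stopTest(test)
        failed = any(t is test for t, _ in self.failures + self.errors)
        if not failed:
            test.__dict__.clear()

# Usage sketch: unittest.TextTestRunner(resultclass=MemorySavingResult)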
Re: [Python-Dev] Disabling cyclic GC in timeit module
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 ... >> If you are only measuring json encoding of a few select pieces of >> data then it's a microbenchmark. If you are measuring the whole >> application (or a significant part of it) then I'm not sure >> timeit is the right tool for that. >> >> Regards >> >> Antoine. >> > > When you're measuring how much time it takes to encode json, this > is a microbenchmark and yet the time that timeit gives you is > misleading, because it'll take different amount of time in your > application. I guess my proposition would be to not disable gc by > default and disable it when requested, but well, I guess I'll give > up given the strong push against it. > > Cheers, fijal True, but it is also heavily dependent on how much other data your application has in memory at the time. If your application has 1M objects in memory and then goes to encode/decode a JSON string when the gc kicks in, it will take a lot longer because of all the stuff that isn't JSON related. I don't think it can be suggested that timeit should grow a flag for "put garbage into memory, and then run this microbenchmark with gc enabled.". If you really want to know how fast something is in your application, you sort of have to do the timing in your application, at scale and at load. John =:-> -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (Cygwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6dLwMACgkQJdeBCYSNAAOzzACfXpP16589Mu7W8ls9KddacF+g ozwAnRz5ciPg950qcV2uzyTKl1R21+6t =hGgf -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
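For completeness, re-enabling the collector for a timeit run is already possible through the setup string; a quick sketch:

import timeit

stmt = 'x = [[] for _ in range(100)]'

# timeit disables the cyclic GC around the timed code by default; putting
# gc.enable() in the setup string turns it back on for the measurement.
with_gc = timeit.Timer(stmt, setup='import gc; gc.enable()').timeit(number=10000)
without_gc = timeit.Timer(stmt).timeit(number=10000)

print('gc enabled: %.3fs   gc disabled: %.3fs' % (with_gc, without_gc))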