Re: Trace KeyboardInterrupt exception?

2006-06-14 Thread andrewdalke
Tony Nelson wrote:
> I'm trying to find out what is eating some KeyboardInterrupt exceptions
> in a fairly large program (yum).  My KeyboardInterrupt handler is called
> for some Ctl-C presses, but for others nothing seems to happen.

>  ... I'd like to use a debugger to trace
> KeyboardInterrupt exceptions, make sure that they're happening, and see
> what is handling them.

I don't know how to do that in Idle.  You can replace the default
Ctrl-C interrupt handler with your own and use that to inspect the
current stack.  For example,

>>> import signal
>>> signal.getsignal(signal.SIGINT)
<built-in function default_int_handler>
>>> prev_handler = signal.getsignal(signal.SIGINT)
>>> def new_int_handler(*args):
...   print "Keyboard Interrupt!"
...   traceback.print_stack()
...   prev_handler(*args)
...
>>> signal.signal(signal.SIGINT, new_int_handler)
<built-in function default_int_handler>
>>> def spin():
...   while 1: pass
...
>>> import traceback
>>> spin()
^CKeyboard Interrupt!
  File "", line 1, in ?
  File "", line 2, in spin
  File "", line 3, in new_int_handler
Traceback (most recent call last):
  File "", line 1, in ?
  File "", line 2, in spin
  File "", line 4, in new_int_handler
KeyboardInterrupt
>>>
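
If you want to instrument a larger program (like yum) rather than an
interactive session, here is a minimal sketch of the same idea as a
module-level snippet.  'debug_int_handler' is just a name I made up:

import signal, traceback

def debug_int_handler(signum, frame):
    # show where we were when the Ctrl-C arrived ...
    traceback.print_stack(frame)
    # ... then fall through to the stock behavior, which raises
    # KeyboardInterrupt
    signal.default_int_handler(signum, frame)

signal.signal(signal.SIGINT, debug_int_handler)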

There's no real need to call the old handler.  You could "raise
KeyboardInterrupt" or SystemExit or just ignore it, as in

>>> count = 0
>>> def new_int_handler(signum, frame):
...     global count
...     print messages[count]
...     if count >= len(messages)-1:
...         raise KeyboardInterrupt
...     count += 1
...
>>> messages = {0: "Sorry, did you want me to do something?",
... 1: "That's ticklish!",
... 2: "Now, where did that off button go to",
... 3: "Do that again and I'll leave.",
... 4: "Shutdown activated"}
>>>
>>> def spin():
...   while 1: pass
...
>>> spin()
^CTraceback (most recent call last):
  File "", line 1, in ?
  File "", line 2, in spin
KeyboardInterrupt
>>>
>>> import signal
>>> signal.signal(signal.SIGINT, new_int_handler)
<built-in function default_int_handler>
>>>
>>> spin()
^CSorry, did you want me to do something?
^CThat's ticklish!
^CNow, where did that off button go to
^CDo that again and I'll leave.
^CShutdown activated
Traceback (most recent call last):
  File "", line 1, in ?
  File "", line 2, in spin
  File "", line 5, in new_int_handler
KeyboardInterrupt
>>> 
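
If you only want the trace temporarily, put things back afterwards by
restoring either the saved handler or Python's stock one, e.g.

    signal.signal(signal.SIGINT, signal.default_int_handler)

(signal.default_int_handler is the handler that raises
KeyboardInterrupt on Ctrl-C.)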

Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: __lt__ slowing the "in" operator even if not called

2006-06-15 Thread andrewdalke
Emanuele Aina wrote:
> I have some code which does a lot of "in" on lists containing objects
> with no __eq__ defined.
>
> It all goes fast until I add the __lt__() method: then I have a
> slowdown comparable to the one I get using the overridden __eq__, while
> the __lt__ method is never called.
>
> Someone can explain me why?

The list's __contains__ method is very simple

    for (i = 0, cmp = 0 ; cmp == 0 && i < a->ob_size; ++i)
        cmp = PyObject_RichCompareBool(el, PyList_GET_ITEM(a, i),
                                       Py_EQ);

The PyObject_RichCompareBool call is more complex.  The
relevant code occurs after the check for object identity.  The
next step is

res = PyObject_RichCompare(v, w, op);

which for your case goes into

    /* If the types are equal, and not old-style instances, try to
       get out cheap (don't bother with coercions etc.). */
    if (v->ob_type == w->ob_type && !PyInstance_Check(v)) {
        cmpfunc fcmp;
        richcmpfunc frich = RICHCOMPARE(v->ob_type);
        /* If the type has richcmp, try it first.  try_rich_compare
           tries it two-sided, which is not needed since we've a
           single type only. */
        if (frich != NULL) {
            res = (*frich)(v, w, op);
            if (res != Py_NotImplemented)
                goto Done;
            Py_DECREF(res);
        }
        /* No richcmp, or this particular richcmp not implemented.
           Try 3-way cmp. */
        fcmp = v->ob_type->tp_compare;

In your case the new-style class defines '__lt__' but not '__eq__'.

  Here's where things get shaky for me.  Instances of new-style
classes are not PyInstanceObjects (those are old-style instances).
New-style classes use 'slot_tp_richcompare' to implement the rich
comparison between two instances, and that function does the
lookup for '__eq__' on the two classes each time.

The tp_richcompare slot is set to slot_tp_richcompare when any
of __lt__, __gt__, __eq__, etc. are defined in a new type.  The macro-
based code looks like

TPSLOT("__lt__", tp_richcompare, slot_tp_richcompare,
richcmp_lt,
   "x.__lt__(y) <==> x x<=y"),
TPSLOT("__eq__", tp_richcompare, slot_tp_richcompare,
richcmp_eq,
   "x.__eq__(y) <==> x==y"),
TPSLOT("__ne__", tp_richcompare, slot_tp_richcompare,
richcmp_ne,
   "x.__ne__(y) <==> x!=y"),
TPSLOT("__gt__", tp_richcompare, slot_tp_richcompare,
richcmp_gt,
   "x.__gt__(y) <==> x>y"),
TPSLOT("__ge__", tp_richcompare, slot_tp_richcompare,
richcmp_ge,
   "x.__ge__(y) <==> x>=y"),

So if you define "__lt__" in your class then the type gets a richcmp
function, and the == test implicit in the 'in' search always incurs
the cost of figuring out that "__eq__" is not defined.
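
If you want to see the cost for yourself, here's a quick timing
sketch (the class names are made up for the test); I'd expect the
WithLt case to be measurably slower even though __lt__ is never
called:

import timeit

class Plain(object):
    pass

class WithLt(object):
    def __lt__(self, other):
        return id(self) < id(other)

for name in ("Plain", "WithLt"):
    setup = ("from __main__ import %s as C\n"
             "items = [C() for i in range(1000)]\n"
             "probe = C()" % name)
    # time 'probe in items', which scans the whole list since
    # probe is not present
    print name, min(timeit.Timer("probe in items", setup).repeat(3, 1000))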


Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Seeking regex optimizer

2006-06-19 Thread andrewdalke
Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a SRE_Match
> object when s starts with an s_i for one i in [0,...,n].

Why do you want to use a regex for this?  When you have constant
strings there are other solutions.  For example, this uses Nicolas
Lehuen's pytst module from http://nicolas.lehuen.com/index.php/Pytst

>>> import tst
>>> tree = tst.TST()
>>> tree["aa"] = (1, "String with 'aa'")
>>> tree["aahhh"] = (2, "At the doctor's")
>>> tree["a"] = (3, "indefinite article")
>>> tree.scan("This is a bunch of text.  Have you seen an aardvark?",
...   tst.TupleListAction())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')),
 (' bunch of text.  H', -18, None), ('a', 1, (3, 'indefinite article')),
 ('ve you seen ', -12, None), ('a', 1, (3, 'indefinite article')),
 ('n ', -2, None), ('aa', 2, (1, "String with 'aa'")),
 ('rdv', -3, None), ('a', 1, (3, 'indefinite article')),
 ('rk?', -3, None)]
>>>

Each returned tuple has three elements:
  For a substring which is not a match these are:
   - the unmatched substring
   - the negative length of the substring
   - None

  For a substring which matches:
   - the matched substring
   - the positive length of the substring
   - the value of the match in the TST (the tree)

It uses Aho-Corasick for the implementation, which is fast and does
what you expect it to do.  Nor does it have the problem with matching
more than 99 possible strings that the regexp approach may have.
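
If pulling in pytst is a problem, here is a dirt-simple pure-Python
fallback for the "does s start with one of these strings" test
(longest match wins; 'make_prefix_matcher' is a made-up name):

def make_prefix_matcher(strings):
    # try longer strings first so the longest prefix wins
    ordered = sorted(strings, key=len, reverse=True)
    def match(s):
        for t in ordered:
            if s.startswith(t):
                return t
        return None
    return match

match = make_prefix_matcher(["a", "aa", "aahhh"])
print match("aah!")      # prints 'aa'
print match("bbb")       # prints None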

Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Seeking regex optimizer

2006-06-19 Thread andrewdalke
Replying to me Mirco Wahab wrote:
> If you pull the strings into (?>( ... )) (atomic groups),
> this would't happen.

Given that Python's re engine doesn't support this feature
it doesn't really help the original poster's problem.

Even if some future Python did support it, the limit
to 100 named groups is unaffected by backtracking.

>>> import re
>>> re.compile("(.)"*100)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre_compile.py", line 506, in compile
    raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups
>>>

There was no backtracking in "(.)"*100.
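
Note that the limit is on capturing groups; as far as I know,
non-capturing groups don't count against it:

>>> pat = re.compile("(?:.)"*100)   # compiles without complaint
>>>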

Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list


europython room share

2006-06-19 Thread andrewdalke
Is anyone here going to Europython and would like a roommate
to help split the cost?  I'll be there for all three days of the
conference plus a few extra days for sprints.  I figure I can move
elsewhere if need be for the sprints.

It looks like the best choices are St. Genis (because it is about a
20 minute walk away) or Geneva (30 minutes by frequent bus, with
more to do and more places to eat).

I did some research looking for places to stay.  My biggest problem
was my inability to read French, though most of the hotels on
the CERN page at
http://housing-service.web.cern.ch/housing-service/listhotel.html
also have alternate pages in English.

My second biggest problem is not knowing the geography.  The
Europython "Places to stay" page at
http://www.europython.org/sections/accommodation says

   Nearby places (useful when consulting the hotels with
   special rates) include Meyrin, Satigny (in Switzerland)
   and St. Genis-Pouilly (in France).

Figuring a lot of students have worked at CERN and written
up their experiences, I found one with this comment in
http://www.sccs.swarthmore.edu/org/swap/reureviews/cern.html

  CERN is about 30 minutes by bus from Geneva and a 20 minute
  walk or bike ride from entertainment in St. Genis, France. Summer
  students hung out most Tuesdays at a pub in St. Genis.

The Europython wiki says the bus to Geneva "leaves four times an
hour during most of a weekday."

I don't know if most people will meet up for dinner and drinks
somewhere close to CERN or in Geneva, which makes it harder to
figure out where to stay.  I'm guessing nearby ... at least for
those staying in the CERN hostel.

Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Seeking regex optimizer

2006-06-20 Thread andrewdalke
Kay Schluehr replied to my question:
> > Why do you want to use a regex for this?
>
> Because it is part of a tokenizer that already uses regexps and I do
> not intend to rewrite / replace it.

Switching to pytst is not a big change - there will be little
impact on the rest of your code.  On the other hand, it does
change your build and deployment process because of the
need for another package.

As long as you constrain yourself to a regex, there isn't
much improvement you can make.

> I found that certain groups of token
> might be represented in a more compact manner: for matching ['+',
> '++'] one might generate '\+|\+\+' or '\+\+?' and I wanted to know if
> there is some generic approach to solve the "inverse regexp" problem in
> a non-trivial fashion.

There are a few questions which come up.  Why do you want to
have a "more compact" representation?  Do you think it will
make the result faster?  Likely it will, because of the reduced
need for backtracking; but unless you have a lot of strings with
the same prefix, the number of backtracks should go as
something like the log of the number of words, which scales
pretty well.

Are you speed limited by the existing implementation?  Likely
not.  In which case the performance isn't an issue.  Consider
this anecdote from http://www.technicat.com/writing/programming.html

    I was asked in a phone interview with Google how I would
    search for a number in an array of ordered numbers.
    Obviously, the questioner was asking for a CS 101 answer -
    binary search.  But in real life, I would probably do the "wrong"
    thing - search the array from beginning to end.  There's no point
    in taking twice the time to write twice as much code that has
    to be maintained and debugged if the application performance
    is good enough, and in particular if that piece of code is not
    the bottleneck in the application.  (And I seriously doubt you'd
    have that data laid out linearly in a fixed array like that if it
    was the bottleneck)

As it turns out, the regexp implementation could do the optimization
you want.  That is, it could recognize that the input sequence is
a set of fixed words (no special characters) and implement something
like Aho-Corasick, in which case you wouldn't need to change
your input.

  I don't know if Python's regexp implementation does this already.
Neither do you.  Test the functionality and performance first
before starting other work.
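
For instance, before optimizing anything I'd measure something like
this (the token list here is made up):

import re, timeit

tokens = ["+", "++", "+=", "-", "--", "-=", "*", "**", "*="]
# put longer alternatives first so the longest token wins
pattern = re.compile("|".join(re.escape(t) for t in
                              sorted(tokens, key=len, reverse=True)))

print pattern.match("**kwargs").group(0)    # prints '**'
print timeit.Timer("pattern.match('**kwargs')",
                   "from __main__ import pattern").timeit(10000)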

I am not the first to give you this advice:

John Machin:
> Optimised with respect to what? speed? ease of construction?

[email protected]:
> As others have suggested, you should first try the most naive
> implementation before making a hard optimization problem out of this.


  As for the general question of the existence of a "generic approach
to solve the 'inverse regexp' problem in a non-trivial fashion":

  Yes, there is.  Kind of.  All (formal/theoretical) regular expressions
can be written as a regular expression without need for backtracking.
If you look up the theory of regular expressions, backtracking occurs
when there are ambiguities in traversing the state machine.  These
are called nondeterministic finite state machines, or NFAs.  It turns
out that every NFA can be written as a DFA, which has no need for
backtracking.
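
To make that concrete, here is a toy sketch of the textbook subset
construction, ignoring epsilon-transitions and groups; the NFA
encoding (state -> {symbol: set of next states}) is one I made up
for the example:

def nfa_to_dfa(nfa, start, accepting):
    start_state = frozenset([start])
    dfa = {}
    todo = [start_state]
    while todo:
        state = todo.pop()
        if state in dfa:
            continue
        # union the moves of every NFA state in this DFA state
        moves = {}
        for s in state:
            for sym, targets in nfa.get(s, {}).items():
                moves.setdefault(sym, set()).update(targets)
        dfa[state] = dict((sym, frozenset(t))
                          for sym, t in moves.items())
        todo.extend(dfa[state].values())
    dfa_accepting = [s for s in dfa if s & accepting]
    return dfa, start_state, dfa_accepting

nfa = {0: {'a': set([0, 1])}, 1: {'b': set([1])}}   # accepts a+b*
dfa, start, accepts = nfa_to_dfa(nfa, 0, set([1]))
print len(dfa), "DFA states"    # prints '3 DFA states'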

  Except.  Actually, two excepts.  The normal algorithm for converting
an NFA to a DFA doesn't support grouping, which you'll need to
figure out which group matched.  Wikipedia says there is an alternate,
slower algorithm which does support groups.

  The second except is that regexps in Python, Perl, etc. are not
regular expressions.  Some patterns, like r"([A-Z][a-z]+)\1", which
matches "WikiWiki", are not "regular grammars" under the definition
used in language theory.  It's possible to use regexps to match
strings of prime length, for example.
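
The prime-length trick, for the curious, works like this (a
throwaway sketch, not something to actually use):

>>> import re
>>> def is_prime(n):
...     # a string of n 1's has composite length exactly when some
...     # group of two or more 1's, repeated, fills it completely
...     return n > 1 and not re.match(r"(11+?)\1+$", "1" * n)
...
>>> [i for i in range(2, 30) if is_prime(i)]
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
>>>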

  So in the general case there is no solution.  There are many
possible optimizations, but no general solution.

  For your problem, which is a limited subset of the general one
as it involves only constant strings, the solution is a prefix
tree, also called a trie.  See http://en.wikipedia.org/wiki/Prefix_tree
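
Here's a tiny sketch of the idea using nested dicts; the class name
and details are made up for illustration:

class Trie(object):
    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[None] = word       # None marks "a word ends here"

    def longest_prefix(self, s):
        # return the longest word which is a prefix of s, or None
        node, best = self.root, None
        for ch in s:
            if None in node:
                best = node[None]
            if ch not in node:
                return best
            node = node[ch]
        return node.get(None, best)

t = Trie(["+", "++", "+="])
print t.longest_prefix("++x")    # prints '++'
print t.longest_prefix("+")      # prints '+'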

  John Machin in an earlier post suggested you search along these
lines when he said
> I would have thought the way to approach this would be a simple
> character-by-character tree/trie-driven lookup. This would be worst case
> O(n) where n is the length of the longest string in your list. There may
> well be a Python-callable gadget for this on the net somewhere. Google
> "Danny Yoo ahocorasick" for a Python-callable solution to a similar but
> more complex problem.

For that matter I suggested it when I pointed you to pytst.  The tst
means "ternary search tree" (some write it "ternary search trie") which
is a different implementation of the same idea.

Your posts read like you ignored those pointers because they didn't
fall into the category "implemented with a regular expression".  If
that's the case

Re: Seeking regex optimizer

2006-06-20 Thread andrewdalke
Mirco Wahab wrote:
> Hi, are you the A.Dalke from the Schulten group (VMD) as
> listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi

Yes.  But I left there nearly a decade ago.

>   # naive regex '\d+9'
>   # find some number only if it ends by 9
>   my $str="10099000";
>
>   print "non-atomic backtracks: $1\n" if $str =~ / (  \d+9 )/x;
>   print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;

What makes me less than enthusiastic about various new regexp
features is that they can be written other ways.  For example,

>>> import re
>>> pat = re.compile("([0-8]*9)+")
>>> pat.match("10099000")
<_sre.SRE_Match object at 0x590a0>
>>> _.group(0)
'10099'
>>> pat = re.compile("([0-8]*9)+(?!\d)")
>>> pat.match("10099000")
>>>

They are different ways to reach the same goal.  Understanding
the alternatives requires some non-trivial understanding of
how regexp engines work.  The diversity of solutions, with no
clear preference of one over the other, overly promotes
idiosyncratic patterns and requires that everyone reading the
regexps understand all the special syntax constructions,
even when a person's working vocabulary only uses a subset
of those.

> Second line would not backtrack and not find the 9 behind \d+
> because atomic grouping prevents that, so line 1 matches
> 10099 and line 2 nothing (but very fast, which is the point).

Granted, my pattern is not as fast.  It's linearly slower than yours
as it does the backtracking.  (Unless the engine is very smart.)
That factor is not something I'm concerned with.

I can also get performance without backtracking (though there
is setup for backtracking) using

>>> pat = re.compile("([0-8]*9)+([0-8]*)")
>>> m = pat.match("10099000")
>>> if m.group(2): print "does not end in a 9"
...
does not end in a 9
>>>
>>> m = pat.match("100990009a")
>>> if m.group(2): print "does not end in a 9"
...
>>>

> Why is there a named group count
> restriction in Python? Any reasons
> one should know?

I'm not certain, but I think allowing more would introduce an
ambiguity with octal escape codes.  (*That's* something to omit in
Python 3000!)  Here's the quote from perlre

    There is no limit to the number of captured substrings that you may
    use.  However Perl also uses \10, \11, etc. as aliases for \010,
    \011, etc.  (Recall that 0 means octal, so \011 is the character at
    number 9 in your coded character set; which would be the 10th
    character, a horizontal tab under ASCII.)  Perl resolves this
    ambiguity by interpreting \10 as a backreference only if at least 10
    left parentheses have opened before it.  Likewise \11 is a
    backreference only if at least 11 left parentheses have opened
    before it.  And so on.  \1 through \9 are always interpreted as
    backreferences.

Python doesn't have "\10", "\11", etc. as aliases for "\010", "\011",
etc.

>>> re.compile("\10")
<_sre.SRE_Pattern object at 0x575c0>
>>> pat = re.compile(r"\10")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 227, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape: '\\10'
>>> pat = re.compile(r"\010")
>>>

so there is no ambiguity until reaching \100 (octal 100 is decimal
64, which is "@" in ASCII).

>>> pat = re.compile(r"\100")
>>> pat.match("@")
<_sre.SRE_Match object at 0x232560>
>>>

Most likely /F (who implemented the regex engine) decided
to refuse the temptation to guess the meaning, eg by
parens counting as Perl does.
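
As an aside, when you need many backreferences in Python, named
groups sidestep the numbering question entirely:

>>> pat = re.compile(r"(?P<word>\w+) (?P=word)")
>>> pat.match("hello hello").group("word")
'hello'
>>>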

Try this variation of your program to see how Perl changes
the meaning of "\100" based on the input string

for (my $i=98; $i<=103; $i++) {
  my $re = '(\d+)\D' x $i . '\100';
  my $text = join ' ', 0..$i;
  $text .= " @";

  print "Testing $i\n";
  if ($text =~ /$re/) {
    print "  Number of matches: $+\n";
    print "  Last group: ${$i}\n";
  } else {
    print "No matches\n";
  }
}

I get

Testing 98
  Number of matches: 98
  Last group: 98
Testing 99
  Number of matches: 99
  Last group: 99
Testing 100
No matches
Testing 101
No matches
Testing 102
No matches
Testing 103
No matches


Andrew
[EMAIL PROTECTED]

-- 
http://mail.python.org/mailman/listinfo/python-list