date:20170201

[Python-Dev] Core Python Projects for GSoC 2017?

2017-02-01 Thread Terri Oda


Hey all, it's Google Summer of Code time again!

For those not familiar with it, Google Summer of Code is a mentoring 
program where Google provides stipends to pay students to work with 
selected open source organizations.  Python has participated for many 
years now, and we're hoping to be selected to participate again.


One of the things we have to do to apply is have a bunch of interesting 
ideas for students to work on and a bunch of mentors who are excited 
about working with students.


The PSF page is http://python-gsoc.org/

That page has contact information and answers to questions like "what does it 
take to be a mentor?" Please feel free to ask if there's more you 
want to know!


A good project idea is one that is useful to the community and can be 
completed by a student with the help of mentors during the 3-month 
duration of the program.  We want at least two mentors available per 
idea (so that it's not a big deal if someone's at a conference or sick) 
and at least one of the mentors (usually the primary one) should be able 
to review code and work with the community to get code merged upstream 
when it's ready.


So... is anyone here interested in mentoring this year, and does anyone 
have any great ideas of things that we need and can support in Core 
Python?  Brainstorming appreciated!  (Although only ideas with actual 
mentoring support will make the final cut.)


I've got a stub of a Core Python ideas page here:
https://wiki.python.org/moin/SummerOfCode/2017/python-core

(If you can't edit the wiki, I can add people's usernames to the editor 
lists, or I can add ideas that are emailed to me.)


And our ideas page template that has a checklist of things students 
usually want to know is here:

https://wiki.python.org/moin/SummerOfCode/OrgIdeasPageTemplate

If you're interested in mentoring, email gsoc-adm...@python.org so I can 
make sure you're on the list of people I ping with reminders.  We need a 
good set of ideas by Feb 7 for our application, and then if we're 
accepted by Google, we can add a few other ideas up until Feb 28 or so.


 Terri
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] SSL certificates recommendations for downstream python packagers

2017-02-01 Thread Cory Benfield

> On 31 Jan 2017, at 18:26, Steve Dower  wrote:
> 
> In short, I want to allow Python code to set OpenSSL's certificate validation 
> callback. Basically, given a raw certificate, return True/False based on 
> whether it should be trusted. I then have separate code (yet to be published) 
> implementing the callback on Windows by calling into the WinVerifyTrust API 
> with the HTTPS provider, which (allegedly) behaves identically to browsers 
> using the same API (again, allegedly - I have absolutely no evidence to 
> support this other than some manual testing).

For context here Steve, this is not quite what Chrome does (and I cannot stress 
enough that the Chrome approach is the best one I’ve seen, the folks working on 
it really do know what they’re doing). The reason here is a bit tricky, but 
essentially the validation callback is called incrementally for each step up 
the chain. This is not normally what a platform validation API actually wants: 
generally they want the entire cert chain the remote peer sent at once.

Chrome, instead, essentially disables the OpenSSL cert validation entirely: 
they still require the certificate be presented, but override the verification 
callback to always say “yeah that’s cool, no big deal”. They then take the 
complete cert chain provided by the remote peer and pass that to the platform 
validation code in one shot after the handshake is complete, but before they 
send/receive any data on the connection. This is still safe: so long as you 
don’t actually expose any data before you have validated the certificates you 
aren’t at risk.

I have actually prototyped this approach for Requests/urllib3 in the past. I 
wrote a small Rust extension to call into the platform-native code, and then 
wrapped it in a CFFI library that exposed a single callable to validate a cert 
chain for a specific hostname (library is here: 
https://github.com/python-hyper/certitude 
). This could then be called from 
urllib3 code that used PyOpenSSL using this patch here: 
https://github.com/shazow/urllib3/pull/802/files 

PLEASE DON’T ACTUALLY USE THIS CODE. I have not validated that certitude does 
entirely the right things with the platform APIs. This is just an example of a 
stripped-down version of what Chrome does, as a potential example of how to get 
something working for your Python use-case.
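
A minimal sketch of the ordering described above (handshake first, then one whole-chain validation, then data), not the certitude or Chrome code. The `platform_validate` callable is a hypothetical stand-in for the OS trust API, and `conn` is assumed to be any already-handshaken object with `sendall()`, `recv()` and `close()`:

```python
class ChainValidationError(Exception):
    """Raised when the platform rejects the peer's certificate chain."""


class ValidatedConnection:
    """Wrap a handshaken connection only if its whole chain validates.

    `platform_validate` is a hypothetical callable(der_chain, hostname) -> bool
    standing in for the OS trust API; it is not a real library function.
    """

    def __init__(self, conn, der_chain, hostname, platform_validate):
        # Validate the complete chain in one shot, after the handshake but
        # before any application data is sent or received.
        if not platform_validate(list(der_chain), hostname):
            conn.close()
            raise ChainValidationError(hostname)
        self._conn = conn

    def sendall(self, data):
        return self._conn.sendall(data)

    def recv(self, bufsize):
        return self._conn.recv(bufsize)
```

Because the wrapper cannot be constructed for a rejected chain, calling code that only ever talks through it has no path to expose data on an unvalidated connection.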

Cory

Re: [Python-Dev] SSL certificates recommendations for downstream python packagers

2017-02-01 Thread Steve Dower

Sorry, I misspoke when I said "certificate validation callback", I meant the 
same callback Cory uses below (name escapes me now, but it's unfortunately 
similar to what I said). There are two callbacks in OpenSSL, one that allows 
you to verify each certificate in the chain individually, and one that requires 
you to validate the entire chain.

I do indeed take the entire chain in one go and pass it to the OS API. 
Christian also didn't like that I was bypassing *all* of OpenSSL's certificate 
handling here, but maybe there's a way to make it reliable if Chrome has done 
it?

Top-posted from my Windows Phone

-----Original Message-----
From: "Cory Benfield"
Sent: 2/1/2017 2:03
To: "Steve Dower"
Cc: "Christian Heimes"; "David Cournapeau"; "python-dev"
Subject: Re: [Python-Dev] SSL certificates recommendations for downstream python packagers



On 31 Jan 2017, at 18:26, Steve Dower  wrote:

In short, I want to allow Python code to set OpenSSL's certificate validation 
callback. Basically, given a raw certificate, return True/False based on 
whether it should be trusted. I then have separate code (yet to be published) 
implementing the callback on Windows by calling into the WinVerifyTrust API 
with the HTTPS provider, which (allegedly) behaves identically to browsers 
using the same API (again, allegedly - I have absolutely no evidence to support 
this other than some manual testing).



For context here Steve, this is not quite what Chrome does (and I cannot stress 
enough that the Chrome approach is the best one I’ve seen, the folks working on 
it really do know what they’re doing). The reason here is a bit tricky, but 
essentially the validation callback is called incrementally for each step up 
the chain. This is not normally what a platform validation API actually wants: 
generally they want the entire cert chain the remote peer sent at once.


Chrome, instead, essentially disables the OpenSSL cert validation entirely: 
they still require the certificate be presented, but override the verification 
callback to always say “yeah that’s cool, no big deal”. They then take the 
complete cert chain provided by the remote peer and pass that to the platform 
validation code in one shot after the handshake is complete, but before they 
send/receive any data on the connection. This is still safe: so long as you 
don’t actually expose any data before you have validated the certificates you 
aren’t at risk.


I have actually prototyped this approach for Requests/urllib3 in the past. I 
wrote a small Rust extension to call into the platform-native code, and then 
wrapped it in a CFFI library that exposed a single callable to validate a cert 
chain for a specific hostname (library is here: 
https://github.com/python-hyper/certitude). This could then be called from 
urllib3 code that used PyOpenSSL using this patch here: 
https://github.com/shazow/urllib3/pull/802/files


PLEASE DON’T ACTUALLY USE THIS CODE. I have not validated that certitude does 
entirely the right things with the platform APIs. This is just an example of a 
stripped-down version of what Chrome does, as a potential example of how to get 
something working for your Python use-case.


Cory

Re: [Python-Dev] SSL certificates recommendations for downstream python packagers

2017-02-01 Thread Cory Benfield

> On 1 Feb 2017, at 14:20, Steve Dower  wrote:
> 
> Sorry, I misspoke when I said "certificate validation callback", I meant the 
> same callback Cory uses below (name escapes me now, but it's unfortunately 
> similar to what I said). There are two callbacks in OpenSSL, one that allows 
> you to verify each certificate in the chain individually, and one that 
> requires you to validate the entire chain.
> 
> I do indeed take the entire chain in one go and pass it to the OS API. 
> Christian also didn't like that I was bypassing *all* of OpenSSL's 
> certificate handling here, but maybe there's a way to make it reliable if 
> Chrome has done it?

So, my understanding is that bypassing OpenSSL’s cert handling is basically 
fine. The risks are only in cases where OpenSSL’s cert handling would be a 
supplement to what the OS provides, which is not really very common and I don’t 
think is a major risk for Python.

So in general, it is not unreasonable to ask your OS “are these certificates 
valid for this connection based on your trust DB” and to circumvent OpenSSL 
entirely there. Please do bear in mind you need to ask your OS the right 
question. For Windows this stuff is actually kinda hard because the API is 
somewhat opaque, but you have to worry about setting correct certificate 
usages, building up chain policies, and then doing appropriate error handling 
(AFAIK the crypto API can “fail validation” for some reasons that have nothing 
to do with validation itself, so worth bearing that in mind).

The TL;DR is: I understand Christian’s concern, but I don’t think it’s 
important if you’re very, very careful.

Cory


Re: [Python-Dev] Investigating Python memory footprint of one real Web application

2017-02-01 Thread Ivan Levkivskyi

Inada-san,

I have made a PR for the typing module upstream:
https://github.com/python/typing/pull/383
It should reduce memory consumption significantly (and also increase
isinstance() speed).
Could you please try it with your real code base and compare memory
consumption (and maybe speed) against master?

--
Ivan


On 23 January 2017 at 12:25, INADA Naoki  wrote:

> On Fri, Jan 20, 2017 at 8:52 PM, Ivan Levkivskyi 
> wrote:
> > On 20 January 2017 at 11:49, INADA Naoki  wrote:
> >>
> >> * typing may increase memory footprint, through functions'
> >> __annotations__ and abc.
> >>    * Can we add an option to remove or lazily evaluate __annotations__?
> >
> >
> > This idea has already come up a few times. I proposed introducing a flag
> > (e.g. -OOO) to ignore function and variable annotations in compile.c.
> > It was decided to postpone this, but maybe we can get back to this idea.
> >
> > In 3.6, typing is already (quite heavily) optimized for both speed and
> > space.
> > I remember doing an experiment comparing the memory footprint with and
> > without annotations; the difference was a few percent.
> > Do you have such a comparison (with and without annotations) for your app?
> > It would be nice to have a realistic number to estimate what the
> > additional optimization flag would save.
> >
> > --
> > Ivan
> >
> >
>
> Hi, Ivan.
>
> Today I investigated why our app has so many WeakSets.
>
> We have dozens or hundreds of annotations like Iterable[User] or List[User].
> (User is one example of the application's domain objects.  There are
> hundreds of such classes.)
>
> On the other hand, SQLAlchemy calls isinstance(obj,
> collections.Iterable) many times, in the
> sqlalchemy.util._collections.to_list function:
> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/util/_collections.py#L795-L804
>
> So there are (# of iterable subclasses) weaksets for the negative cache,
> and each weakset contains (# of column types) entries.  That's why
> WeakSet ate so much RAM.
>
> It may slow down application startup too, because __subclasscheck__
> is called thousands of times.
>
> I advised the project's team to use 'List[User]' instead of List[User]
> if the team thinks RAM usage or boot speed is important.
>
> FWIW, stacktrace is like this:
>
>   File "/Users/inada-n/local/py37dbg/lib/python3.7/_weakrefset.py", line 84
>     self.data.add(ref(item, self._remove))
>   File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 233
>     cls._abc_negative_cache.add(subclass)
>   File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226
>     if issubclass(subclass, scls):
>   File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226
>     if issubclass(subclass, scls):
>   File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 191
>     return cls.__subclasscheck__(subclass)
>   File "venv/lib/python3.7/site-packages/sqlalchemy/util/_collections.py", line 803
>     or not isinstance(x, collections.Iterable):
>   File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1680
>     columns = util.to_list(prop)
>   File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1575
>     prop = self._property_from_column(key, prop)
>   File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1371
>     setparent=True)
>   File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 675
>     self._configure_properties()
>
> Regards,
>
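
To illustrate the 'List[User]' versus List[User] advice quoted above, here is a small sketch, assuming a placeholder `User` class and two made-up functions. It only shows that a quoted annotation is stored as a plain string and can still be resolved on demand; it does not measure memory:

```python
from typing import List, get_type_hints


class User:
    """Stand-in for one of the application's domain classes."""


def eager(users: List[User]) -> None:
    """On the Python versions discussed in this thread, List[User] is
    built when this def statement runs."""


def lazy(users: 'List[User]') -> None:
    """The annotation stays a plain string until someone resolves it."""


# The quoted form stores only a str in __annotations__.
lazy_annotation = lazy.__annotations__['users']

# Tools that need the real type can still resolve it on demand.
resolved = get_type_hints(lazy, localns={'List': List, 'User': User})
```

Whether this saves a meaningful amount of RAM in a given application would still need to be measured, as the thread itself asks.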

[Python-Dev] Why doesn't module finalization delete names as expected?

2017-02-01 Thread Philippe Proulx

Hello. I'm not sure if I'm posting to the right list. If it's not the
case, please tell me which one to post to.

Using Python 3.5.2.

I'm developing a C module with the help of SWIG. My library manages
objects with reference counting, much like Python, except that it's
deterministic: there's no GC.

I create two Python objects like this:

bakery = Bakery()
bread = bakery.create_bread()

Behind the scenes, the situation looks like this:

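                 +--------------------+
                 | UserBread obj (Py) |
                 +----------^---+-----+
                            |   :
                            |   :
                            |   :
+------------------+    +---+---V---------+
| Bakery obj (lib) <----+ Bread obj (lib) |
+--------^---+-----+    +--------^--------+
         |   :                   |
         |   :                   |
+--------+---V----+     +--------+-------+
| Bakery obj (Py) |     | Bread obj (Py) |
+--------^--------+     +--------^-------+
         |                       |
         |                       |
         +                       +
      bakery                   bread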

A pipe link means one "strong" reference and a colon link means one
borrowed/weak reference.

I have some ownership inversion magic for the Bakery lib. and Python
objects to always coexist.

So here it's pretty clear what can happen. I don't know which reference
gets deleted first, but let's assume it's `bakery`. Then the situation
looks like this:

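                 +--------------------+
                 | UserBread obj (Py) |
                 +----------^---+-----+
                            |   :
                            |   :
                            |   :
+------------------+    +---+---V---------+
| Bakery obj (lib) <----+ Bread obj (lib) |
+--------^---+-----+    +--------^--------+
         :   |                   |
         :   |                   |
+--------+---V----+     +--------+-------+
| Bakery obj (Py) |     | Bread obj (Py) |
+-----------------+     +--------^-------+
                                 |
                                 |
                                 +
                               bread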

The Bakery Python object's __del__() drops the reference to its library
object, but first its reference count is incremented during this call
(so it's not really destroyed) and it's marked as only owned by the
library object from now on.

When `bread` gets deleted:

1. The Bread Python object's __del__() method gets called: the reference
   to its library object is dropped.
2. The Bread library object's destroy function drops its reference to
   the Bakery library object.
3. The Bakery library object's destroy function drops its reference to
   the Bakery Python object.
4. The Bakery Python object's __del__() method does nothing this time,
   since the object is marked as only owned by its library object
   (inverted ownership).
5. The Bread library object's destroy function then drops its reference
   to the UserBread Python object.

In the end, everything is correctly destroyed and released. This also
works if `bread` is deleted before `bakery`.

My problem is that this works as expected when used like this:

def func():
    bakery = Bakery()
    bread = bakery.create_bread()

if __name__ == '__main__':
    func()

but NOTHING is destroyed when used like this:

bakery = Bakery()
bread = bakery.create_bread()

That is, directly during the module's initialization. It works, however,
if I delete `bread` manually:

bakery = Bakery()
bread = bakery.create_bread()
del bread

It also works with `bakery` only:

bakery = Bakery()

My question is: what could explain this?

My guess is that my logic is correct since it works fine in the function
call situation.

It feels like `bread` is never deleted in the module initialization
situation, but I don't know why: the only reference to the Bread Python
object is this `bread` name in the module... what could prevent this
object's __del__() method from being called? It works when I call
`del bread` manually: I would expect module finalization to do the exact
same thing.

Am I missing anything?

Thanks,
Phil

Re: [Python-Dev] Why doesn't module finalization delete names as expected?

2017-02-01 Thread Antoine Pitrou

On Wed, 1 Feb 2017 03:23:02 -0500
Philippe Proulx  wrote:
> 
> It feels like `bread` is never deleted in the module initialization
> situation, but I don't know why: the only reference to the Bread Python
> object is this `bread` name in the module... what could prevent this
> object's __del__() method from being called? It works when I call
> `del bread` manually: I would expect module finalization to do the exact
> same thing.

When do you expect module finalization to happen?  Your module is
recorded in sys.modules so, unless you explicitly remove it from there,
module finalization will only happen at interpreter shutdown.

(and this question is more appropriate for the python-list, anyway)
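
The point about sys.modules can be seen in a small sketch, assuming CPython and a throwaway module object. The module name "demo_bakery" and the `Bread` class are made up for illustration and are unrelated to the SWIG library in the question:

```python
import gc
import sys
import types


class Bread:
    """Toy object that records when its finalizer runs."""

    log = []

    def __del__(self):
        type(self).log.append("bread finalized")


# "demo_bakery" is a made-up module name used only for this sketch.
module = types.ModuleType("demo_bakery")
module.bread = Bread()
sys.modules["demo_bakery"] = module
del module  # sys.modules still holds a reference to the module

gc.collect()
finalized_while_registered = list(Bread.log)

del sys.modules["demo_bakery"]  # drop the last reference to the module
gc.collect()
finalized_after_removal = list(Bread.log)
```

While the module is registered in sys.modules, its globals stay alive and the finalizer does not run; it runs only once the registration is removed (or, for a real module, at interpreter shutdown).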

Regards

Antoine.


Re: [Python-Dev] Heads up: possible double-comments on bpo for commits

2017-02-01 Thread Victor Stinner

Hi,

I noticed a strange issue with Roundup Robot on the issue #29318.

I closed http://bugs.python.org/issue29318 as rejected:

   resolution:  -> rejected
   status: open -> closed

Roundup Robot made a first change:

   stage:  -> resolved

But then it made a second change:

   resolution: rejected -> fixed

To finish with an empty change:

=> I got an email to notify me "Changes by Roundup Robot
", but with no useful email body.
--- 8< ---

Changes by Roundup Robot :


___
Python tracker 


--- 8< ---

Victor

Re: [Python-Dev] re performance

2017-02-01 Thread Lukasz Langa

> On Jan 31, 2017, at 11:40 AM, Wang, Peter Xihong wrote:
> 
> Regarding the performance difference between "re" and "regex" and 
> packaging-related options, we did a performance comparison using Python 3.6.0 
> to run some micro-benchmarks in the Python Benchmark Suite 
> (https://github.com/python/performance):
> 
> Results in ms, lower is better (running on Ubuntu 15.10). "regex" was 
> installed via pip install regex, with "import re" replaced by 
> "import regex as re".
> 
>                         re      regex
> bm_regex_compile.py     229     298
> bm_regex_dna.py         171     267
> bm_regex_effbot.py      2.77    3.04
> bm_regex_v8.py          24.8    14.1
> 
> This data shows "re" is better than "regex" in terms of performance in 3 out 
> of 4 of the above micro-benchmarks.

This is very informative, thank you! This clearly shows we should rather pursue 
the PyPI route (with a documentation endorsement and possible bundling for 3.7) 
than full-blown replacement.

However, this benchmark is incomplete in the sense that it only checks the 
compatibility mode of `regex`, whereas it's the new mode that lends the biggest 
performance gains. So, providing checks for the other engine would show us the 
full picture. We'd need to add checks that prove the regular expressions in 
said benchmarks end up with equivalent matches, to be sure we're testing the 
same thing.

- Ł

Re: [Python-Dev] re performance

2017-02-01 Thread Serhiy Storchaka


On 31.01.17 21:40, Wang, Peter Xihong wrote:

> Regarding the performance difference between "re" and "regex" and
> packaging-related options, we did a performance comparison using Python 3.6.0
> to run some micro-benchmarks in the Python Benchmark Suite
> (https://github.com/python/performance):
>
> Results in ms, lower is better (running on Ubuntu 15.10). "regex" was
> installed via pip install regex, with "import re" replaced by
> "import regex as re".
>
>                         re      regex
> bm_regex_compile.py     229     298
> bm_regex_dna.py         171     267
> bm_regex_effbot.py      2.77    3.04
> bm_regex_v8.py          24.8    14.1
>
> This data shows "re" is better than "regex" in terms of performance in 3 out
> of 4 of the above micro-benchmarks.


bm_regex_v8 is the one that is intended to reflect real-world use of 
regular expressions.


See also a different comparison at 
https://mail.python.org/pipermail/speed/2016-March/000311.html. In some 
tests regex surpasses re, in other tests re surpasses regex. re2 is much 
faster than the other engines in all tests except the one in which it is 
much slower (and this engine is the least featured).




Re: [Python-Dev] re performance

2017-02-01 Thread Victor Stinner

2017-02-01 20:42 GMT+01:00 Lukasz Langa :
> However, this benchmark is incomplete in the sense that it only checks the
> compatibility mode of `regex`, whereas it's the new mode that lends the
> biggest performance gains. So, providing checks for the other engine would
> show us the full picture.

Would you mind writing a pull request for performance to add a
command line option to test "re" (stdlib) or "regex" (PyPI, in the new
mode)? Or maybe even regex and regex_compatibility_mode :-)

Source:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_regex_v8.py#L1789

Example of benchmark with cmdline options:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_raytrace.py#L385
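
A rough sketch of what such an option could look like, together with the equivalence check Lukasz asks for. The option names `--engine` and `--new-mode` and all three helper functions are hypothetical, not part of the performance project; the only assumption about the PyPI `regex` module is that it exposes a `VERSION1` flag for its new mode:

```python
import argparse
import importlib


def parse_args(argv=None):
    """Parse a hypothetical --engine/--new-mode pair of options."""
    parser = argparse.ArgumentParser(description="regex benchmark sketch")
    parser.add_argument("--engine", choices=("re", "regex"), default="re")
    parser.add_argument("--new-mode", action="store_true",
                        help="use regex's VERSION1 behaviour if available")
    return parser.parse_args(argv)


def load_engine(name, new_mode=False):
    """Import the engine and return (module, flags)."""
    module = importlib.import_module(name)
    # regex exposes VERSION1 for its new mode; stdlib re has no such flag,
    # so fall back to 0 there.
    flags = getattr(module, "VERSION1", 0) if new_mode else 0
    return module, flags


def same_matches(engines, pattern, text):
    """True if every (module, flags) pair finds the same matches."""
    results = [mod.findall(pattern, text, flags) for mod, flags in engines]
    return all(result == results[0] for result in results)
```

Running `same_matches` over each benchmark pattern before timing it would give some confidence that the engines are being compared on the same work.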

Victor

Re: [Python-Dev] re performance

2017-02-01 Thread Wang, Peter Xihong

+1

We'd like to get more details on how to try this "new mode", and do a 
full/comprehensive comparison between the "re" vs "regex".

Peter


 

Re: [Python-Dev] SSL certificates recommendations for downstream python packagers

2017-02-01 Thread Stephen J. Turnbull

Cory Benfield writes:

 > The TL;DR is: I understand Christian’s concern, but I don’t think
 > it’s important if you’re very, very careful.

But AIUI, the "you" above is the end-user or the admin of the end-user's
system, no?  We know that they aren't very careful (or perhaps more
accurately, this is too fsckin' complicated for anybody but an infosec
expert to do very well).

I[1] still agree with you that it's *unlikely* that end-users/admins
will need to worry about it.  But we need to be really careful about
what we say here, or at least where the responsible parties will be
looking.

Thanks to all who are contributing so much time and skull sweat on
this.  This is insanely hard, but important.

Footnotes: 
[1]  Infosec wannabe, I've thought carefully but don't claim real
expertise.


Re: [Python-Dev] re performance

2017-02-01 Thread Franklin? Lee

On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze  wrote:
> Hi folks,
>
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
>
> However, reaching the chart in the lower third of the article, I saw Python
> 2.4 measured against a naive Thompson matching implementation. And I was
> surprised about how bad it performed compared to an unoptimized version of
> an older than dirt algorithm.

> From my perspective, I can say that regular expressions might be worth
> optimizing, especially for web applications (url matching usually uses
> regexes) but also for other applications where I've seen many tight loops
> using regexes as well. So, I am probing interest on this topic here.

What I (think I) know:
- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows
capture groups (but only up to a limit) and bounded repetitions (up to
a limit).
- Perl has started to optimize some regex patterns.

What I think:
- The example is uncharacteristic of what people write, like URL matching.
- But enabling naive code to perform well is usually a good thing. The
fewer details newbies need to know to write code, the better.
- It's possible for Python to optimize a large category of patterns,
even if not all.
- It's also possible to optimize large parts of patterns with
backrefs. E.g. If there's a backref, the group that the backref refers
to can still be made into an NFA.
- To do the above, you'd need a way to generate all possible matches.
- Optimization can be costly. The full NFA construction could be
generated only upon request, or maybe the code automatically tries to
optimize after 100 uses (like a JIT). This should only be considered
if re2's construction really is costly.
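
For reference, the pathological case charted in the article linked earlier in this thread is the pattern a?^n a^n matched against the string a^n. A small sketch for trying it against the stdlib `re` module follows; the helper names are made up, and the timings will vary by machine, so none are claimed here:

```python
import re
import time


def pathological_pattern(n):
    """Build (a?)^n a^n written out longhand, e.g. 'a?a?aa' for n=2."""
    return "a?" * n + "a" * n


def time_match(n):
    """Return (matched, seconds) for matching the pattern against 'a' * n."""
    pattern = re.compile(pathological_pattern(n))
    start = time.perf_counter()
    matched = pattern.fullmatch("a" * n) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: a backtracking engine's work grows quickly with n here.
    for n in (10, 14, 18):
        matched, seconds = time_match(n)
        print(f"n={n:2d} matched={matched} {seconds:.4f}s")
```

The same two helpers can be pointed at another engine by swapping the import, which makes it easy to see whether a given engine backtracks on this input.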

If people want NFAs, I think the "easiest" way is to use re2. Jakub
Wilk mentioned it before, but here it is again.
https://github.com/google/re2

re2 features:
https://github.com/google/re2/wiki/Syntax

Named references aren't supported, but I think that can be worked
around with some renumbering. It's just a matter of translating the
pattern.

It also doesn't like lookarounds, which AFAIK are perfectly doable
with NFAs. Looks like lookaheads and lookbehinds are hard to do
without a lot of extra space (https://github.com/google/re2/issues/5).

Facebook has a Python wrapper for re2.
https://github.com/facebook/pyre2/

In a post linked to from this thread, Serhiy mentioned another Python
wrapper for re2. This wrapper is designed to be like re, and should
probably be the basis of any efforts. It's not been updated for 11
months, though.
https://pypi.python.org/pypi/re2/
https://github.com/axiak/pyre2/
