date:20201017

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Antoine Pitrou

On Fri, 16 Oct 2020 20:24:22 -0500
Tim Peters  wrote:
> 
> Note that no "extra" storage is needed to exploit this. No character
> lookups, no extra expenses in time or space of any kind.  Just "if we
> mismatch on the k'th try, we can jump ahead k positions".

Ok, so that means that on a N-character haystack, it'll always do at
least N character comparisons?

IIRC, the current algorithm is able to do much less than that in
favorable cases.  If the needle is k characters long and they are all
distinct, it's able to do N/k comparisons.

Regards

Antoine.

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QTB3XVIPZOO3QBVONPWLUN5KHFD3UWBC/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Tim Peters

[Tim]
>> Note that no "extra" storage is needed to exploit this. No character
>> lookups, no extra expenses in time or space of any kind.  Just "if we
>> mismatch on the k'th try, we can jump ahead k positions".

[Antoine Pitrou ]
> Ok, so that means that on a N-character haystack, it'll always do at
> least N character comparisons?
>
> IIRC, the current algorithm is able to do much less than that in
> favorable cases.  If the needle is k characters long and they are all
> distinct, it's able to do N/k comparisons.

It's an excellent question, and from what I said it's a correct inference.

But I only described how the algorithm matches the "v" part of the
"needle = u + v" splitting.  When matching the "u" part, skips
materially exceeding the count of comparisons made are possible.

The paper claims the minimal number of comparisons needed (ignoring
preprocessing) is 2N/k, so same order as the current algorithm. But,
for the constant factor, Dennis's code achieves N/k, because he
augmented the Crochemre-Perrin algorithm with a variant of Sunday's
algorithm (a simplification, but not as extreme as the current code's
simplification). Note that best case near N/k is a known lower bound
for best cases - impossible to do materially better.

In the 'x' * 100 + 'y' example I ended my msg with, recall that  v wa
just "y", so the part I described was of minimum value - if the last
haystack character in the current search window isn't "y", that the
mismatch occurred on the first try leads to a shift of just 1.

But if the character one beyond the end of the current search window
isn't "x" or "y" then, Sunday's algorithm allows to skip 102 positions
(len(needle) + 1).  His algorithm requires precomputing a skip-count
table of O(len(sigma)) space, where sigma is the universe of possible
characters ("the alphabet").  That's "a whole lot" for Unicode.

Fredrik and Dennis simplified Sunday's algorithm to weaker versions
that require relatively small constant space instead, independent of
needle, pattern, and alphabet size.

Despite the current code's simplification of that nearly to the point
of non-existence (it reduces the space needed to 32 bits, and with
only one possible skip count), it's nevertheless so effective that it
beat Dennis's initially pure Crochemre-Perrin code by dramatic margins
in a significant number of exceptionally lucky (for the current code)
cases.  After adding a variation of Sunday (the current PR uses 256
bytes, and has 64K possible skip counts - but perhaps other tradeoffs
would work better), we haven't yet seen a case (contrived or natural)
where the current code is dramatically faster.

But we have seen non-contrived cases where the current PR is
dramatically faster, including in the benchmark code Fredrik gifted us
with (Tools/stringbench/stringbench.py in your Python tree).  Alas,
the higher preprocessing costs leave the current PR slower in "too
many" cases too, especially when the needle is short and found early
in the haystack. Then any preprocessing cost approaches a pure waste
of time.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/MHGNAZH7X4HS2KSWTRDVSX2AJXN7POXZ/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Modified parser does not seem to work. What am I doing wrong?

2020-10-17 Thread Stefano Borini

That's what I suspected looking at the parser debug in fact. Good to
see that I was on the right track. Thanks! I'll play with it a bit
more tonight.

On Fri, 16 Oct 2020 at 23:43, Pablo Galindo Salgado  wrote:
>
> Hi Stefano,
>
> One of the problems you have is that the rule for slices has a negative 
> lookahead for the comma:
>
>  slices[expr_ty]:
> | a=slice !',' { a }
>
> IIRC the reason that is there is to allow "x[3,]" to be parsed.
>
> Also, to allow "a[k=3]" you need to to create a rule that allows skipping the 
> "slices" part of the subscript and allow the 2nd argument of  _Py_Subscript  
> to be NULL. For example:
>
> | a=t_primary '[' k=kwargs ']' !t_lookahead { _Py_Subscript(a, NULL, k, 
> Store, EXTRA) }
>
> Regards,
> Pablo
>
> On Fri, 16 Oct 2020 at 22:07, Stefano Borini  wrote:
>>
>> Hello,
>>
>> I am trying to implement PEP-637, and I started modifying the parser
>> and the grammar, but I don't know what I am missing.
>>
>> The PR is here
>>
>> https://github.com/python/cpython/compare/master...stefanoborini:PEP-637-implementation-attempt-1?expand=1
>>
>> It includes other stuff but the core is that I extended the Subscript
>> in the asdl to accept the keyword args
>>
>> | Subscript(expr value, expr slice, keyword* keywords, expr_context ctx)
>>
>> which seems to work:
>>
>> >>> ast.parse('a[3]').body[0].value.keywords
>> []
>>
>> I also added a few "productions" (I believe they are called like this,
>> my compiler theory is very approximated), one of which is now
>>
>> primary[expr_ty]:
>> 
>> | a=primary '[' b=slices c=[',' k=kwargs {k}]']' {
>> _Py_Subscript(a, b, c, Load, EXTRA) }
>>
>> I also tried with additional formats like
>>
>> | a=primary '[' b=slices c=kwargs ']' { _Py_Subscript(a, b, c,
>> Load, EXTRA) }
>>
>> or even just
>>
>> | a=primary '[' c=kwargs ']' { _Py_Subscript(a, NULL, c, Load, EXTRA) }
>>
>> but I always get a SyntaxError:
>>
>> >>> ast.parse('a[k=3]').body[0].value.keywords
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/Users/sbo/Work/Projects/stefanoborini/cpython/Lib/ast.py",
>> line 50, in parse
>> return compile(source, filename, mode, flags,
>>   File "", line 1
>> a[k=3]
>>^
>> SyntaxError: invalid syntax
>>
>> Note that I always recreated ast parser and pegen.c with 'make
>> regen-all' and recompiled with make.
>> I tried to enable debug and print the pegen.c debug, but it's a bit
>> heavy and I could not immediately spot the issue. I suspect I am
>> missing something somewhere.
>>
>> Thanks
>>
>>
>> --
>> Kind regards,
>>
>> Stefano Borini
>> ___
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-le...@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at 
>> https://mail.python.org/archives/list/python-dev@python.org/message/RZYDHYZPTPGLBX4VCR6HTYGJ3GL2CWIX/
>> Code of Conduct: http://python.org/psf/codeofconduct/



-- 
Kind regards,

Stefano Borini
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/266N3PGL4IXGVZB4QNQGR54DP5PHNABP/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Tim Peters

[Tim]
> ...
> Alas, the higher preprocessing costs leave the current PR slower in "too
> many" cases too, especially when the needle is short and found early
> in the haystack. Then any preprocessing cost approaches a pure waste
> of time.

But that was this morning.  Since then, Dennis changed the PR to back
off to the current code when the needle is "too small". There are very
few cases we know of now where the PR code is slower at all than the
current code, none where it's dramatically slower, many where it's
significantly faster, and some non-contrived cases where it's
dramatically faster (for example, over a factor of 10 in
stringbench.py's "late match, 100 characters" forward-search tests,
and especially beneficial for Unicode (as opposed to bytes)).  Then
there are the pathological cases like in the original issue report,
where it's multiple orders of magnitude faster (about 3 1/2 hours to
under a tenth of a second in the issue report's case).

Still waiting for someone who thinks string search speed is critical
in their real app to give it a try.  In the absence of that, I endorse
merging this.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5K2ISKORAH327WLQ43QG5QMPWXNHMFTC/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] os.scandir bug in Windows?

2020-10-17 Thread Rob Cliffe via Python-Dev



TLDR: In os.scandir directory entries, atime is always a copy of mtime 
rather than the actual access time.


Demo program: Windows 10, Python 3.8.3:

# osscandirtest.py
import time, os
with open('Test', 'w') as f: f.write('Anything\n') # Write to a file
time.sleep(10)
with open('Test', 'r') as f: f.readline() # Read the file
print(os.stat('Test'))
for DirEntry in os.scandir('.'):
    if DirEntry.name == 'Test':
    stat = DirEntry.stat()
    print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} 
{stat.st_atime=}')


Sample output:

os.stat_result(st_mode=33206, st_ino=8162774324687317, 
st_dev=2230120362, st_nlink=1, st_uid=0,
st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371, 
st_ctime=1600631262)
scandir DirEntry stat.st_ctime=1600631262.951019 
stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848


For os.stat, atime is 10 seconds more than mtime, as would be expected.
But for os.scandir, atime is a copy of mtime.
ISTM that this is a bug, and in fact recently it stopped me from using 
os.scandir in a program where I needed the access timestamp. No big 
deal, but ...

If it is a feature for some reason, presumably it should be documented.

Best wishes
Rob Cliffe
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: os.scandir bug in Windows?

2020-10-17 Thread Gregory P. Smith

Could you please file this as an issue on bugs.python.org?

Thanks!
-Greg


On Sat, Oct 17, 2020 at 7:25 PM Rob Cliffe via Python-Dev <
python-dev@python.org> wrote:

>
> TLDR: In os.scandir directory entries, atime is always a copy of mtime
> rather than the actual access time.
>
> Demo program: Windows 10, Python 3.8.3:
>
> # osscandirtest.py
> import time, os
> with open('Test', 'w') as f: f.write('Anything\n') # Write to a file
> time.sleep(10)
> with open('Test', 'r') as f: f.readline() # Read the file
> print(os.stat('Test'))
> for DirEntry in os.scandir('.'):
>  if DirEntry.name == 'Test':
>  stat = DirEntry.stat()
>  print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=}
> {stat.st_atime=}')
>
> Sample output:
>
> os.stat_result(st_mode=33206, st_ino=8162774324687317,
> st_dev=2230120362, st_nlink=1, st_uid=0,
> st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371,
> st_ctime=1600631262)
> scandir DirEntry stat.st_ctime=1600631262.951019
> stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848
>
> For os.stat, atime is 10 seconds more than mtime, as would be expected.
> But for os.scandir, atime is a copy of mtime.
> ISTM that this is a bug, and in fact recently it stopped me from using
> os.scandir in a program where I needed the access timestamp. No big
> deal, but ...
> If it is a feature for some reason, presumably it should be documented.
>
> Best wishes
> Rob Cliffe
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/INJBNXRKOBYFGFJ7CLHNJKVQQKU6X6NM/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: os.scandir bug in Windows?

2020-10-17 Thread Tal Einat

Interesting! Indeed, please create an issue and post a link here.

>From a quick look at the code, I can't see any obvious bugs here, the info
seems to be coming directly from FindNextFileW. This will likely require
some more digging.


On Sun, Oct 18, 2020 at 7:37 AM Gregory P. Smith  wrote:

> Could you please file this as an issue on bugs.python.org?
>
> Thanks!
> -Greg
>
>
> On Sat, Oct 17, 2020 at 7:25 PM Rob Cliffe via Python-Dev <
> python-dev@python.org> wrote:
>
>>
>> TLDR: In os.scandir directory entries, atime is always a copy of mtime
>> rather than the actual access time.
>>
>> Demo program: Windows 10, Python 3.8.3:
>>
>> # osscandirtest.py
>> import time, os
>> with open('Test', 'w') as f: f.write('Anything\n') # Write to a file
>> time.sleep(10)
>> with open('Test', 'r') as f: f.readline() # Read the file
>> print(os.stat('Test'))
>> for DirEntry in os.scandir('.'):
>>  if DirEntry.name == 'Test':
>>  stat = DirEntry.stat()
>>  print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=}
>> {stat.st_atime=}')
>>
>> Sample output:
>>
>> os.stat_result(st_mode=33206, st_ino=8162774324687317,
>> st_dev=2230120362, st_nlink=1, st_uid=0,
>> st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371,
>> st_ctime=1600631262)
>> scandir DirEntry stat.st_ctime=1600631262.951019
>> stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848
>>
>> For os.stat, atime is 10 seconds more than mtime, as would be expected.
>> But for os.scandir, atime is a copy of mtime.
>> ISTM that this is a bug, and in fact recently it stopped me from using
>> os.scandir in a program where I needed the access timestamp. No big
>> deal, but ...
>> If it is a feature for some reason, presumably it should be documented.
>>
>> Best wishes
>> Rob Cliffe
>> ___
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-le...@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/INJBNXRKOBYFGFJ7CLHNJKVQQKU6X6NM/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WAVJFASQWS7RDMZEHI4AHQMJ74COQO7O/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Changing Python's string search algorithms

[Python-Dev] Re: Changing Python's string search algorithms

[Python-Dev] Re: Modified parser does not seem to work. What am I doing wrong?

[Python-Dev] Re: Changing Python's string search algorithms

[Python-Dev] os.scandir bug in Windows?

[Python-Dev] Re: os.scandir bug in Windows?

[Python-Dev] Re: os.scandir bug in Windows?

7 matches

Site Navigation

Mail list logo

Footer information