[Python-Dev] Re: Changing Python's string search algorithms
On Fri, 16 Oct 2020 20:24:22 -0500 Tim Peters wrote: > > Note that no "extra" storage is needed to exploit this. No character > lookups, no extra expenses in time or space of any kind. Just "if we > mismatch on the k'th try, we can jump ahead k positions". Ok, so that means that on a N-character haystack, it'll always do at least N character comparisons? IIRC, the current algorithm is able to do much less than that in favorable cases. If the needle is k characters long and they are all distinct, it's able to do N/k comparisons. Regards Antoine. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QTB3XVIPZOO3QBVONPWLUN5KHFD3UWBC/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Tim] >> Note that no "extra" storage is needed to exploit this. No character >> lookups, no extra expenses in time or space of any kind. Just "if we >> mismatch on the k'th try, we can jump ahead k positions". [Antoine Pitrou ] > Ok, so that means that on a N-character haystack, it'll always do at > least N character comparisons? > > IIRC, the current algorithm is able to do much less than that in > favorable cases. If the needle is k characters long and they are all > distinct, it's able to do N/k comparisons. It's an excellent question, and from what I said it's a correct inference. But I only described how the algorithm matches the "v" part of the "needle = u + v" splitting. When matching the "u" part, skips materially exceeding the count of comparisons made are possible. The paper claims the minimal number of comparisons needed (ignoring preprocessing) is 2N/k, so same order as the current algorithm. But, for the constant factor, Dennis's code achieves N/k, because he augmented the Crochemre-Perrin algorithm with a variant of Sunday's algorithm (a simplification, but not as extreme as the current code's simplification). Note that best case near N/k is a known lower bound for best cases - impossible to do materially better. In the 'x' * 100 + 'y' example I ended my msg with, recall that v wa just "y", so the part I described was of minimum value - if the last haystack character in the current search window isn't "y", that the mismatch occurred on the first try leads to a shift of just 1. But if the character one beyond the end of the current search window isn't "x" or "y" then, Sunday's algorithm allows to skip 102 positions (len(needle) + 1). His algorithm requires precomputing a skip-count table of O(len(sigma)) space, where sigma is the universe of possible characters ("the alphabet"). That's "a whole lot" for Unicode. Fredrik and Dennis simplified Sunday's algorithm to weaker versions that require relatively small constant space instead, independent of needle, pattern, and alphabet size. Despite the current code's simplification of that nearly to the point of non-existence (it reduces the space needed to 32 bits, and with only one possible skip count), it's nevertheless so effective that it beat Dennis's initially pure Crochemre-Perrin code by dramatic margins in a significant number of exceptionally lucky (for the current code) cases. After adding a variation of Sunday (the current PR uses 256 bytes, and has 64K possible skip counts - but perhaps other tradeoffs would work better), we haven't yet seen a case (contrived or natural) where the current code is dramatically faster. But we have seen non-contrived cases where the current PR is dramatically faster, including in the benchmark code Fredrik gifted us with (Tools/stringbench/stringbench.py in your Python tree). Alas, the higher preprocessing costs leave the current PR slower in "too many" cases too, especially when the needle is short and found early in the haystack. Then any preprocessing cost approaches a pure waste of time. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MHGNAZH7X4HS2KSWTRDVSX2AJXN7POXZ/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Modified parser does not seem to work. What am I doing wrong?
That's what I suspected looking at the parser debug in fact. Good to see that I was on the right track. Thanks! I'll play with it a bit more tonight. On Fri, 16 Oct 2020 at 23:43, Pablo Galindo Salgado wrote: > > Hi Stefano, > > One of the problems you have is that the rule for slices has a negative > lookahead for the comma: > > slices[expr_ty]: > | a=slice !',' { a } > > IIRC the reason that is there is to allow "x[3,]" to be parsed. > > Also, to allow "a[k=3]" you need to to create a rule that allows skipping the > "slices" part of the subscript and allow the 2nd argument of _Py_Subscript > to be NULL. For example: > > | a=t_primary '[' k=kwargs ']' !t_lookahead { _Py_Subscript(a, NULL, k, > Store, EXTRA) } > > Regards, > Pablo > > On Fri, 16 Oct 2020 at 22:07, Stefano Borini wrote: >> >> Hello, >> >> I am trying to implement PEP-637, and I started modifying the parser >> and the grammar, but I don't know what I am missing. >> >> The PR is here >> >> https://github.com/python/cpython/compare/master...stefanoborini:PEP-637-implementation-attempt-1?expand=1 >> >> It includes other stuff but the core is that I extended the Subscript >> in the asdl to accept the keyword args >> >> | Subscript(expr value, expr slice, keyword* keywords, expr_context ctx) >> >> which seems to work: >> >> >>> ast.parse('a[3]').body[0].value.keywords >> [] >> >> I also added a few "productions" (I believe they are called like this, >> my compiler theory is very approximated), one of which is now >> >> primary[expr_ty]: >> >> | a=primary '[' b=slices c=[',' k=kwargs {k}]']' { >> _Py_Subscript(a, b, c, Load, EXTRA) } >> >> I also tried with additional formats like >> >> | a=primary '[' b=slices c=kwargs ']' { _Py_Subscript(a, b, c, >> Load, EXTRA) } >> >> or even just >> >> | a=primary '[' c=kwargs ']' { _Py_Subscript(a, NULL, c, Load, EXTRA) } >> >> but I always get a SyntaxError: >> >> >>> ast.parse('a[k=3]').body[0].value.keywords >> Traceback (most recent call last): >> File "", line 1, in >> File "/Users/sbo/Work/Projects/stefanoborini/cpython/Lib/ast.py", >> line 50, in parse >> return compile(source, filename, mode, flags, >> File "", line 1 >> a[k=3] >>^ >> SyntaxError: invalid syntax >> >> Note that I always recreated ast parser and pegen.c with 'make >> regen-all' and recompiled with make. >> I tried to enable debug and print the pegen.c debug, but it's a bit >> heavy and I could not immediately spot the issue. I suspect I am >> missing something somewhere. >> >> Thanks >> >> >> -- >> Kind regards, >> >> Stefano Borini >> ___ >> Python-Dev mailing list -- python-dev@python.org >> To unsubscribe send an email to python-dev-le...@python.org >> https://mail.python.org/mailman3/lists/python-dev.python.org/ >> Message archived at >> https://mail.python.org/archives/list/python-dev@python.org/message/RZYDHYZPTPGLBX4VCR6HTYGJ3GL2CWIX/ >> Code of Conduct: http://python.org/psf/codeofconduct/ -- Kind regards, Stefano Borini ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/266N3PGL4IXGVZB4QNQGR54DP5PHNABP/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Tim] > ... > Alas, the higher preprocessing costs leave the current PR slower in "too > many" cases too, especially when the needle is short and found early > in the haystack. Then any preprocessing cost approaches a pure waste > of time. But that was this morning. Since then, Dennis changed the PR to back off to the current code when the needle is "too small". There are very few cases we know of now where the PR code is slower at all than the current code, none where it's dramatically slower, many where it's significantly faster, and some non-contrived cases where it's dramatically faster (for example, over a factor of 10 in stringbench.py's "late match, 100 characters" forward-search tests, and especially beneficial for Unicode (as opposed to bytes)). Then there are the pathological cases like in the original issue report, where it's multiple orders of magnitude faster (about 3 1/2 hours to under a tenth of a second in the issue report's case). Still waiting for someone who thinks string search speed is critical in their real app to give it a try. In the absence of that, I endorse merging this. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5K2ISKORAH327WLQ43QG5QMPWXNHMFTC/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] os.scandir bug in Windows?
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time. Demo program: Windows 10, Python 3.8.3: # osscandirtest.py import time, os with open('Test', 'w') as f: f.write('Anything\n') # Write to a file time.sleep(10) with open('Test', 'r') as f: f.readline() # Read the file print(os.stat('Test')) for DirEntry in os.scandir('.'): if DirEntry.name == 'Test': stat = DirEntry.stat() print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} {stat.st_atime=}') Sample output: os.stat_result(st_mode=33206, st_ino=8162774324687317, st_dev=2230120362, st_nlink=1, st_uid=0, st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371, st_ctime=1600631262) scandir DirEntry stat.st_ctime=1600631262.951019 stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848 For os.stat, atime is 10 seconds more than mtime, as would be expected. But for os.scandir, atime is a copy of mtime. ISTM that this is a bug, and in fact recently it stopped me from using os.scandir in a program where I needed the access timestamp. No big deal, but ... If it is a feature for some reason, presumably it should be documented. Best wishes Rob Cliffe ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: os.scandir bug in Windows?
Could you please file this as an issue on bugs.python.org? Thanks! -Greg On Sat, Oct 17, 2020 at 7:25 PM Rob Cliffe via Python-Dev < python-dev@python.org> wrote: > > TLDR: In os.scandir directory entries, atime is always a copy of mtime > rather than the actual access time. > > Demo program: Windows 10, Python 3.8.3: > > # osscandirtest.py > import time, os > with open('Test', 'w') as f: f.write('Anything\n') # Write to a file > time.sleep(10) > with open('Test', 'r') as f: f.readline() # Read the file > print(os.stat('Test')) > for DirEntry in os.scandir('.'): > if DirEntry.name == 'Test': > stat = DirEntry.stat() > print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} > {stat.st_atime=}') > > Sample output: > > os.stat_result(st_mode=33206, st_ino=8162774324687317, > st_dev=2230120362, st_nlink=1, st_uid=0, > st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371, > st_ctime=1600631262) > scandir DirEntry stat.st_ctime=1600631262.951019 > stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848 > > For os.stat, atime is 10 seconds more than mtime, as would be expected. > But for os.scandir, atime is a copy of mtime. > ISTM that this is a bug, and in fact recently it stopped me from using > os.scandir in a program where I needed the access timestamp. No big > deal, but ... > If it is a feature for some reason, presumably it should be documented. > > Best wishes > Rob Cliffe > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/ > Code of Conduct: http://python.org/psf/codeofconduct/ > ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/INJBNXRKOBYFGFJ7CLHNJKVQQKU6X6NM/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: os.scandir bug in Windows?
Interesting! Indeed, please create an issue and post a link here. >From a quick look at the code, I can't see any obvious bugs here, the info seems to be coming directly from FindNextFileW. This will likely require some more digging. On Sun, Oct 18, 2020 at 7:37 AM Gregory P. Smith wrote: > Could you please file this as an issue on bugs.python.org? > > Thanks! > -Greg > > > On Sat, Oct 17, 2020 at 7:25 PM Rob Cliffe via Python-Dev < > python-dev@python.org> wrote: > >> >> TLDR: In os.scandir directory entries, atime is always a copy of mtime >> rather than the actual access time. >> >> Demo program: Windows 10, Python 3.8.3: >> >> # osscandirtest.py >> import time, os >> with open('Test', 'w') as f: f.write('Anything\n') # Write to a file >> time.sleep(10) >> with open('Test', 'r') as f: f.readline() # Read the file >> print(os.stat('Test')) >> for DirEntry in os.scandir('.'): >> if DirEntry.name == 'Test': >> stat = DirEntry.stat() >> print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} >> {stat.st_atime=}') >> >> Sample output: >> >> os.stat_result(st_mode=33206, st_ino=8162774324687317, >> st_dev=2230120362, st_nlink=1, st_uid=0, >> st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371, >> st_ctime=1600631262) >> scandir DirEntry stat.st_ctime=1600631262.951019 >> stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848 >> >> For os.stat, atime is 10 seconds more than mtime, as would be expected. >> But for os.scandir, atime is a copy of mtime. >> ISTM that this is a bug, and in fact recently it stopped me from using >> os.scandir in a program where I needed the access timestamp. No big >> deal, but ... >> If it is a feature for some reason, presumably it should be documented. >> >> Best wishes >> Rob Cliffe >> ___ >> Python-Dev mailing list -- python-dev@python.org >> To unsubscribe send an email to python-dev-le...@python.org >> https://mail.python.org/mailman3/lists/python-dev.python.org/ >> Message archived at >> https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZVUAQBLECFMNN2PUOH322B2BYD/ >> Code of Conduct: http://python.org/psf/codeofconduct/ >> > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/INJBNXRKOBYFGFJ7CLHNJKVQQKU6X6NM/ > Code of Conduct: http://python.org/psf/codeofconduct/ > ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WAVJFASQWS7RDMZEHI4AHQMJ74COQO7O/ Code of Conduct: http://python.org/psf/codeofconduct/