Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 28 June 2014 19:17, Nick Coghlan wrote:
> Agreed, but walking even a moderately large tree over the network can
> really hammer home the point that this offers a significant
> performance enhancement as the latency of access increases. I've found
> that kind of comparison can be eye-opening for folks that are used to
> only operating on local disks (even spinning disks, let alone SSDs)
> and/or relatively small trees (distro build trees aren't *that* big,
> but they're big enough for this kind of difference in access overhead
> to start getting annoying).

Oops, forgot to add - I agree this isn't a blocking issue for the PEP,
it's definitely only in "nice to have" territory.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 28 June 2014 16:17, Gregory P. Smith wrote:
> On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan wrote:
>> * it would be nice to see some relative performance numbers for NFS and
>> CIFS network shares - the additional network round trips can make excessive
>> stat calls absolutely brutal from a speed perspective when using a network
>> drive (that's why the stat caching added to the import system in 3.3
>> dramatically sped up the case of having network drives on sys.path, and why
>> I thought AJ had a point when he was complaining about the fact we didn't
>> expose the dirent data from os.listdir)
>
> fwiw, I wouldn't wait for benchmark numbers.
>
> A needless stat call when you've got the information from an earlier API
> call is already brutal. It is easy to compute from existing ballparks remote
> file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms.
> fetch of stat info cached in memory on file server on the local network:
> ~500us. You can go down further to local system call overhead which can
> vary wildly but should likely be assumed to be at least 10us.
>
> You don't need a benchmark to tell you that adding needless >= 500us-100ms
> blocking operations to your program is bad. :)

Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
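The ballpark latencies quoted above lend themselves to a quick back-of-the-envelope calculation. The sketch below multiplies each quoted figure by a hypothetical entry count (10,000, chosen purely for illustration and not taken from the thread) to estimate the time lost if every directory entry costs one avoidable stat call. The names `LATENCY` and `needless_stat_cost` are made up for this example.

```python
# Ballpark per-call latencies as quoted in the message above, in seconds.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def needless_stat_cost(entries, latency):
    """Seconds spent if every entry triggers one avoidable stat call."""
    return entries * latency


# Hypothetical tree size, used only to make the arithmetic concrete.
ENTRIES = 10_000

for label, latency in LATENCY.items():
    seconds = needless_stat_cost(ENTRIES, latency)
    print(f"{label:32s} {seconds:10.2f} s")
```

Even at the cheapest network figure the total runs to seconds for a tree of that size, which is the point both posters are making.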
Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7
Hello,

On Thu, 26 Jun 2014 22:49:40 +1000
Chris Angelico wrote:

> On Thu, Jun 26, 2014 at 9:04 PM, Antoine Pitrou
> wrote:
> > For the same reason, I agree with Victor that we should ditch the
> > threading-disabled builds. It's too much of a hassle for no actual,
> > practical benefit. People who want a threadless unicodeless Python
> > can install Python 1.5.2 for all I care.
>
> Or some other implementation of Python. It's looking like micropython
> will be permanently supporting a non-Unicode build

Yes.

> (although I stepped
> away from the project after a strong disagreement over what would and
> would not make sense, and haven't been following it since).

Your patches with my further additions were finally merged. Unicode
strings still cannot be enabled by default due to
https://github.com/micropython/micropython/issues/726 . Any help with
reviewing/testing what's currently available is welcome.

> If someone
> wants a Python that doesn't have stuff that the core CPython devs
> treat as essential, s/he probably wants something like uPy anyway.

I hinted at it during previous discussions of MicroPython, and would
like to say it again, that MicroPython already embraced a lot of ideas
rejected from CPython, like GC-only operation (which alone is not
something to be proud of, but can you start up and do something in 2K
heap?) or tagged pointers
(https://mail.python.org/pipermail/python-dev/2004-July/046139.html).
So, it should be a good vehicle to try any unorthodox ideas(*) or
implementations.

* MicroPython already implements intra-module constants for example.

--
Best regards,
 Paul mailto:pmis...@gmail.com
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
Ben Hoyt writes:

> Hi Python dev folks,
>
> I've written a PEP proposing a specific os.scandir() API for a
> directory iterator that returns the stat-like info from the OS, *the
> main advantage of which is to speed up os.walk() and similar
> operations between 4-20x, depending on your OS and file system.*
> ...
> http://legacy.python.org/dev/peps/pep-0471/
> ...
> Specifically, this PEP proposes adding a single function to the ``os``
> module in the standard library, ``scandir``, that takes a single,
> optional string as its argument::
>
>     scandir(path='.') -> generator of DirEntry objects
>

Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd

--
akira
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Sat, Jun 28, 2014 at 11:05 PM, Akira Li <4kir4...@gmail.com> wrote:
> Have you considered adding support for paths relative to directory
> descriptors [1] via keyword only dir_fd=None parameter if it may lead to
> more efficient implementations on some platforms?
>
> [1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.

ChrisA
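For context, the dir_fd mechanism mentioned above already exists for several functions in the os module. The following is a minimal sketch of what "paths relative to a directory descriptor" looks like in practice, assuming a POSIX platform where `os.stat` supports `dir_fd` and `os.listdir` accepts a file descriptor. The helper `sizes_via_dir_fd` is hypothetical and only illustrates the idea; it says nothing about how scandir itself would implement such a parameter.

```python
import os
import tempfile


def sizes_via_dir_fd(dirpath):
    """Return {name: size}, resolving each name relative to an open directory fd.

    Hypothetical helper for illustration only; requires platform support.
    """
    if os.stat not in os.supports_dir_fd or os.listdir not in os.supports_fd:
        raise NotImplementedError("dir_fd / fd-based listing not available here")
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        # Names are looked up relative to fd (openat-style), so they keep
        # referring to the same directory even if dirpath is renamed meanwhile.
        return {
            name: os.stat(name, dir_fd=fd, follow_symlinks=False).st_size
            for name in os.listdir(fd)
        }
    finally:
        os.close(fd)


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "wb") as f:
        f.write(b"abc")
    try:
        demo_sizes = sizes_via_dir_fd(tmp)
    except NotImplementedError:
        demo_sizes = None  # platform without dir_fd support
    print(demo_sizes)
```

The safety aspect ChrisA alludes to comes from the lookup being anchored to the descriptor rather than to a path string that could be swapped out between calls.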
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
>> But the underlying system calls -- ``FindFirstFile`` /
>> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
>
> What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.

> It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
> should mimic stat_result recent addition: the new
> stat_result.file_attributes field. Add DirEntry.file_attributes which
> would only be available on Windows.
>
> The Windows structure also contains
>
> FILETIME ftCreationTime;
> FILETIME ftLastAccessTime;
> FILETIME ftLastWriteTime;
> DWORD nFileSizeHigh;
> DWORD nFileSizeLow;
>
> It would be nice to expose them as well. I'm no more surprised that
> the exact API is different depending on the OS for functions of the os
> module.

I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP, but I'm sure I could make it clearer:

DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"

So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.

> Does your implementation uses a free list to avoid the cost of memory
> allocation? A short free list of 10 or maybe just 1 may help. The free
> list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under "possible
improvements". However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.

> Does it support also bytes filenames on UNIX?
> Python now supports undecodable filenames thanks to the PEP 383
> (surrogateescape). I prefer to use the same type for filenames on
> Linux and Windows, so Unicode is better. But some users might prefer
> bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.

> Crazy idea: would it be possible to "convert" a DirEntry object to a
> pathlib.Path object without losing the cache? I guess that
> pathlib.Path expects a full stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.

> I don't understand how you can build a full lstat() result without
> really calling stat. I see that WIN32_FIND_DATA contains the size, but
> here you call lstat().

See above.

> Do you plan to continue to maintain your module for Python < 3.5, but
> upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For integration into the
Python 3.5 stdlib, the implementation will be integrated into
posixmodule.c, of course.

>> Should there be a way to access the full path?
>> --
>>
>> Should ``DirEntry``'s have a way to get the full path without using
>> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
>> and it may be useful to add pathlib-like ``str(entry)`` functionality.
>> This functionality has also been requested in `issue 13`_ on GitHub.
>>
>> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
>
> I think that it would be very convenient to store the directory name
> in the DirEntry. It should be light, it's just a reference.
>
> And provide a fullname() name which would just return
> os.path.join(path, entry.name) without trying to resolve path to get
> an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?

> Would it be hard to implement the wildcard feature on UNIX to compare
> performances of scandir('*.jpg') with and without the wildcard built
> in os.scandir?

It's a good idea, the problem with this is that the Windows wildca
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
Re is_dir etc being properties rather than methods:

>> I find this behaviour a bit misleading: using methods and have them
>> return cached results. How much (implementation and/or performance
>> and/or memory) overhead would incur by using property-like access here?
>> I think this would underline the static nature of the data.
>>
>> This would break the semantics with respect to pathlib, but they're only
>> marginally equal anyways -- and as far as I understand it, pathlib won't
>> cache, so I think this has a fair point here.
>
> Indeed - using properties rather than methods may help emphasise the
> deliberate *difference* from pathlib in this case (i.e. value when the
> result was retrieved from the OS, rather than the value right now). The main
> benefit is that switching from using the DirEntry object to a pathlib Path
> will require touching all the places where the performance characteristics
> switch from "memory access" to "system call". This benefit is also the main
> downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.

> * +1 on a new section in the PEP covering rejected design options (calling
> it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.

> * regarding "why not a 2-tuple", we know from experience that operating
> systems evolve and we end up wanting to add additional info to this kind of
> API. A dedicated DirEntry type lets us adjust the information returned over
> time, without breaking backwards compatibility and without resorting to ugly
> hacks like those in some of the time and stat APIs (or even our own codec
> info APIs)

Fully agreed.

> * it would be nice to see some relative performance numbers for NFS and CIFS
> network shares - the additional network round trips can make excessive stat
> calls absolutely brutal from a speed perspective when using a network drive
> (that's why the stat caching added to the import system in 3.3 dramatically
> sped up the case of having network drives on sys.path, and why I thought AJ
> had a point when he was complaining about the fact we didn't expose the
> dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!

-Ben
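To make the error-handling argument concrete, here is a minimal sketch written against os.scandir() as it later shipped in Python 3.5, where these checks are methods and `DirEntry.stat(follow_symlinks=False)` plays the role of the draft's `lstat()`. The helper name `list_subdirs` is hypothetical. Because `is_dir()` is a call, wrapping it in `try`/`except` reads naturally, which is the point made above about properties "looking free".

```python
import os
import tempfile


def list_subdirs(path):
    """Return sorted names of subdirectories, skipping entries that fail.

    Hypothetical helper: is_dir() may need a system call on some
    platforms or filesystems, so it is treated as something that can raise.
    """
    subdirs = []
    for entry in os.scandir(path):
        try:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
        except OSError:
            # Could not determine the entry type; skip it rather than abort.
            continue
    return sorted(subdirs)


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "file.txt"), "w") as f:
        f.write("x")
    found = list_subdirs(tmp)
    print(found)
```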
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 05:48, Ben Hoyt wrote:
>>> But the underlying system calls -- ``FindFirstFile`` /
>>> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
>>
>> What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide
>> readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

*nix and POSIX-based are the two conventions I use.

>> Crazy idea: would it be possible to "convert" a DirEntry object to a
>> pathlib.Path object without losing the cache? I guess that
>> pathlib.Path expects a full stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.

The key problem with caches on pathlib.Path objects is that you could
end up with two separate path objects that referred to the same
filesystem location but returned different answers about the
filesystem state because their caches might be stale. DirEntry is
different, as the content is generally *assumed* to be stale
(referring to when the directory was scanned, rather than the current
filesystem state). DirEntry.lstat() on POSIX systems will be an
exception to that general rule (referring to the time of first lookup,
rather than when the directory was scanned, so the answer from lstat()
may be inconsistent with other data stored directly on the DirEntry
object), but one we can probably live with.

More generally, as part of the pathlib PEP review, we figured out that
a *per-object* cache of filesystem state would be an inherently bad
idea, but a string based *process global* cache might make sense for
modules like walkdir (not part of the stdlib - it's an iterator
pipeline based approach to file tree scanning I wrote a while back,
that currently suffers badly from the performance impact of repeated
stat calls at different stages of the pipeline). We realised this was
getting into a space where application and library specific concerns
are likely to start affecting the caching design, though, so the
current status of standard library level stat caching is "it's not
clear if there's an available approach that would be sufficiently
general purpose to be appropriate for inclusion in the standard
library".

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
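The staleness described above is easy to observe. This is a small sketch against os.scandir() as it later shipped in Python 3.5 (where the draft's `lstat()` became `stat(follow_symlinks=False)`); the file name and sizes are arbitrary choices for the demonstration. The entry's stat result is cached on first lookup, so a later change to the file is visible through `os.stat()` but not through the entry.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abc")                      # file is 3 bytes when scanned

    # Exhaust the iterator so the directory handle is released promptly.
    (entry,) = list(os.scandir(tmp))
    size_at_first_lookup = entry.stat(follow_symlinks=False).st_size

    with open(path, "ab") as f:
        f.write(b"defghij")                  # file grows to 10 bytes

    # The DirEntry answers from its cache; os.stat() asks the OS again.
    cached_size = entry.stat(follow_symlinks=False).st_size
    fresh_size = os.stat(entry.path).st_size

    print(size_at_first_lookup, cached_size, fresh_size)
```

Code that needs the current state should call `os.stat()` (or scan again) rather than rely on a DirEntry it has been holding on to.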
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On 29 June 2014 05:55, Ben Hoyt wrote:
> Re is_dir etc being properties rather than methods:
>
>>> I find this behaviour a bit misleading: using methods and have them
>>> return cached results. How much (implementation and/or performance
>>> and/or memory) overhead would incur by using property-like access here?
>>> I think this would underline the static nature of the data.
>>>
>>> This would break the semantics with respect to pathlib, but they're only
>>> marginally equal anyways -- and as far as I understand it, pathlib won't
>>> cache, so I think this has a fair point here.
>>
>> Indeed - using properties rather than methods may help emphasise the
>> deliberate *difference* from pathlib in this case (i.e. value when the
>> result was retrieved from the OS, rather than the value right now). The main
>> benefit is that switching from using the DirEntry object to a pathlib Path
>> will require touching all the places where the performance characteristics
>> switch from "memory access" to "system call". This benefit is also the main
>> downside, so I'd actually be OK with either decision on this one.
>
> The problem with this is that properties "look free", they look just
> like attribute access, so you wouldn't normally handle exceptions when
> accessing them. But .lstat() and .is_dir() etc may do an OS call, so
> if you're needing to be careful with error handling, you may want to
> handle errors on them. Hence I think it's best practice to make them
> functions().
>
> Some of us discussed this on python-dev or python-ideas a while back,
> and I think there was general agreement with what I've stated above
> and therefore they should be methods. But I'll dig up the links and
> add to a Rejected ideas section.

Yes, only the stuff that *never* needs a system call (regardless of
OS) would be a candidate for handling as a property rather than a
method call. Consistency of access would likely trump that idea
anyway, but it would still be worth ensuring that the PEP is clear on
which values are guaranteed to reflect the state at the time of the
directory scanning and which may imply an additional stat call.

>> * it would be nice to see some relative performance numbers for NFS and CIFS
>> network shares - the additional network round trips can make excessive stat
>> calls absolutely brutal from a speed perspective when using a network drive
>> (that's why the stat caching added to the import system in 3.3 dramatically
>> sped up the case of having network drives on sys.path, and why I thought AJ
>> had a point when he was complaining about the fact we didn't expose the
>> dirent data from os.listdir)
>
> Don't know if you saw, but there are actually some benchmarks,
> including one over NFS, on the scandir GitHub page:
>
> https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)

> os.walk() was 23 times faster with scandir() than the current
> listdir() + stat() implementation on the Windows NFS file system I
> tried. Pretty good speedup!

Ah, nice!

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator
On Jun 28, 2014 12:49 PM, "Ben Hoyt" wrote:
>
> >> But the underlying system calls -- ``FindFirstFile`` /
> >> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
> >
> > What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

No, Just say POSIX.

>
> > It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
> > should mimic stat_result recent addition: the new
> > stat_result.file_attributes field. Add DirEntry.file_attributes which
> > would only be available on Windows.
> >
> > The Windows structure also contains
> >
> > FILETIME ftCreationTime;
> > FILETIME ftLastAccessTime;
> > FILETIME ftLastWriteTime;
> > DWORD nFileSizeHigh;
> > DWORD nFileSizeLow;
> >
> > It would be nice to expose them as well. I'm no more surprised that
> > the exact API is different depending on the OS for functions of the os
> > module.
>
> I think you've misunderstood how DirEntry.lstat() works on Windows --
> it's basically a no-op, as Windows returns the full stat information
> with the original FindFirst/FindNext OS calls. This is fairly explicit
> in the PEP, but I'm sure I could make it clearer:
>
> DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"
>
> So you can already get the dwFileAttributes for free by saying
> entry.lstat().st_file_attributes. You can also get all the other
> fields you mentioned for free via .lstat() with no additional OS calls
> on Windows, for example: entry.lstat().st_size.
>
> Feel free to suggest changes to the PEP or scandir docs if this isn't
> clear. Note that is_dir()/is_file()/is_symlink() are free on all
> systems, but .lstat() is only free on Windows.
>
> > Does your implementation uses a free list to avoid the cost of memory
> > allocation? A short free list of 10 or maybe just 1 may help. The free
> > list may be stored directly in the generator object.
>
> No, it doesn't. I might add this to the PEP under "possible
> improvements". However, I think the speed increase by removing the
> extra OS call and/or disk seek is going to be way more than memory
> allocation improvements, so I'm not sure this would be worth it.
>
> > Does it support also bytes filenames on UNIX?
>
> > Python now supports undecodable filenames thanks to the PEP 383
> > (surrogateescape). I prefer to use the same type for filenames on
> > Linux and Windows, so Unicode is better. But some users might prefer
> > bytes for other reasons.
>
> I forget exactly now what my scandir module does, but for os.scandir()
> I think this should behave exactly like os.listdir() does for
> Unicode/bytes filenames.
>
> > Crazy idea: would it be possible to "convert" a DirEntry object to a
> > pathlib.Path object without losing the cache? I guess that
> > pathlib.Path expects a full stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.
>
> > I don't understand how you can build a full lstat() result without
> > really calling stat. I see that WIN32_FIND_DATA contains the size, but
> > here you call lstat().
>
> See above.
>
> > Do you plan to continue to maintain your module for Python < 3.5, but
> > upgrade your module for the final PEP?
>
> Yes, I intend to maintain the standalone scandir module for 2.6 <=
> Python < 3.5, at least for a good while. For integration into the
> Python 3.5 stdlib, the implementation will be integrated into
> posixmodule.c, of course.
>
> >> Should there be a way to access the full path?
> >> --
> >>
> >> Should ``DirEntry``'s have a way to get the full path without using
> >> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
> >> and it may be useful to add pathlib-like ``str(entry)`` functionality.
> >> This functionality has also been requested in `issue 13`_ on GitHub.
> >>
> >> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
> >
> > I think that it would be very convenient to store the directory name
> > in the DirEntry. It should be light, it's just a reference.
> >
> > And provide a fullname() name which would just return
> > os.path.join(path, entry.name) without trying to resolve path to get
> > an absolute path.
>
> Yeah, fair suggestion. I'm still slightly on the fence about this, but
> I think an explicit fullname() is a good suggestion. Ideally I think
> it'd be better to mimic pathlib.Path.__str__() which is kind of the
> equivalent of fullname(). But how does pathlib deal with unicode/bytes
> issues if it's the str function which has to return a str object? Or
> at least, it'd be very weird if __str__() returned bytes. But I think
> it'd need to if you passed bytes i