Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 28 June 2014 19:17, Nick Coghlan  wrote:
> Agreed, but walking even a moderately large tree over the network can
> really hammer home the point that this offers a significant
> performance enhancement as the latency of access increases. I've found
> that kind of comparison can be eye-opening for folks that are used to
> only operating on local disks (even spinning disks, let alone SSDs)
> and/or relatively small trees (distro build trees aren't *that* big,
> but they're big enough for this kind of difference in access overhead
> to start getting annoying).

Oops, forgot to add - I agree this isn't a blocking issue for the PEP,
it's definitely only in "nice to have" territory.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 28 June 2014 16:17, Gregory P. Smith  wrote:
> On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan  wrote:
>> * it would be nice to see some relative performance numbers for NFS and
>> CIFS network shares - the additional network round trips can make excessive
>> stat calls absolutely brutal from a speed perspective when using a network
>> drive (that's why the stat caching added to the import system in 3.3
>> dramatically sped up the case of having network drives on sys.path, and why
>> I thought AJ had a point when he was complaining about the fact we didn't
>> expose the dirent data from os.listdir)
>
> fwiw, I wouldn't wait for benchmark numbers.
>
> A needless stat call when you've got the information from an earlier API
> call is already brutal. It is easy to compute from existing ballparks remote
> file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms.
> fetch of stat info cached in memory on file server on the local network:
> ~500us.  You can go down further to local system call overhead which can
> vary wildly but should likely be assumed to be at least 10us.
>
> You don't need a benchmark to tell you that adding needless >= 500us-100ms
> blocking operations to your program is bad. :)

Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).
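
To put rough numbers on the ballparks quoted above, here is a back-of-the-envelope sketch. The tree size of 10,000 entries is a made-up assumption for illustration, not a measurement, and the latencies are simply the figures from Gregory's message.

```python
# Ballpark per-call latencies quoted in the thread, in seconds.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def needless_stat_cost(entries, latency_seconds):
    """Total time spent on one avoidable stat call per directory entry."""
    return entries * latency_seconds


ENTRIES = 10_000  # hypothetical tree size, chosen only for illustration

costs = {name: needless_stat_cost(ENTRIES, lat) for name, lat in LATENCY.items()}
for name, cost in costs.items():
    print(f"{name:32s} {cost:10.2f} s")
```

Under these assumptions the same redundant call costs on the order of a tenth of a second locally and on the order of a thousand seconds against a remote server, which is the gap a network benchmark would make visible.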

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7

2014-06-28 Thread Paul Sokolovsky
Hello,

On Thu, 26 Jun 2014 22:49:40 +1000
Chris Angelico  wrote:

> On Thu, Jun 26, 2014 at 9:04 PM, Antoine Pitrou 
> wrote:
> > For the same reason, I agree with Victor that we should ditch the
> > threading-disabled builds. It's too much of a hassle for no actual,
> > practical benefit. People who want a threadless unicodeless Python
> > can install Python 1.5.2 for all I care.
> 
> Or some other implementation of Python. It's looking like micropython
> will be permanently supporting a non-Unicode build 

Yes.

> (although I stepped
> away from the project after a strong disagreement over what would and
> would not make sense, and haven't been following it since). 

Your patches with my further additions were finally merged. Unicode
strings still cannot be enabled by default due to
https://github.com/micropython/micropython/issues/726 . Any help with
reviewing/testing what's currently available is welcome.

> If someone
> wants a Python that doesn't have stuff that the core CPython devs
> treat as essential, s/he probably wants something like uPy anyway.

I hinted at this during previous discussions of MicroPython, and would
like to say it again: MicroPython has already embraced a lot of ideas
rejected from CPython, like GC-only operation (which alone is not
something to be proud of, but can you start up and do something in a 2K
heap?) or tagged pointers
(https://mail.python.org/pipermail/python-dev/2004-July/046139.html).
So it should be a good vehicle for trying out unorthodox ideas(*) or
implementations.


* MicroPython already implements intra-module constants for example.



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Akira Li
Ben Hoyt  writes:

> Hi Python dev folks,
>
> I've written a PEP proposing a specific os.scandir() API for a
> directory iterator that returns the stat-like info from the OS, *the
> main advantage of which is to speed up os.walk() and similar
> operations between 4-20x, depending on your OS and file system.*
> ...
> http://legacy.python.org/dev/peps/pep-0471/
> ...
> Specifically, this PEP proposes adding a single function to the ``os``
> module in the standard library, ``scandir``, that takes a single,
> optional string as its argument::
>
> scandir(path='.') -> generator of DirEntry objects
>

Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd
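
For readers unfamiliar with dir_fd, this is roughly how the existing os functions use it. It is only a sketch of the current os.stat() behaviour; the helper name sizes_relative_to is made up here, and nothing below is part of the scandir proposal.

```python
import os


def sizes_relative_to(dirpath):
    """Return {name: size}, stat-ing each entry relative to a directory fd."""
    if os.stat not in os.supports_dir_fd:
        # No fstatat()-style support on this platform: use plain paths.
        return {
            name: os.lstat(os.path.join(dirpath, name)).st_size
            for name in os.listdir(dirpath)
        }
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        # Each name is resolved relative to fd, so the directory path is
        # not walked again for every entry.
        return {
            name: os.stat(name, dir_fd=fd, follow_symlinks=False).st_size
            for name in os.listdir(dirpath)
        }
    finally:
        os.close(fd)
```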


--
akira



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Chris Angelico
On Sat, Jun 28, 2014 at 11:05 PM, Akira Li <4kir4...@gmail.com> wrote:
> Have you considered adding support for paths relative to directory
> descriptors [1] via keyword only dir_fd=None parameter if it may lead to
> more efficient implementations on some platforms?
>
> [1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.

ChrisA


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
>> But the underlying system calls -- ``FindFirstFile`` /
>> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
>
> What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.

> It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
> should mimic stat_result recent addition: the new
> stat_result.file_attributes field. Add DirEntry.file_attributes which
> would only be available on Windows.
>
> The Windows structure also contains
>
>   FILETIME ftCreationTime;
>   FILETIME ftLastAccessTime;
>   FILETIME ftLastWriteTime;
>   DWORD    nFileSizeHigh;
>   DWORD    nFileSizeLow;
>
> It would be nice to expose them as well. I'm  no more surprised that
> the exact API is different depending on the OS for functions of the os
> module.

I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP, but I'm sure I could make it clearer:

DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"

So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.
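
A minimal sketch of that usage, written against os.scandir() as it eventually shipped in Python 3.5, where the draft's entry.lstat() is spelled entry.stat(follow_symlinks=False). The helper name file_sizes is only illustrative.

```python
import os


def file_sizes(path="."):
    """Map each regular file's name to its size using the DirEntry data."""
    sizes = {}
    for entry in os.scandir(path):
        # The type check normally comes from the directory listing itself.
        if entry.is_file(follow_symlinks=False):
            # No extra system call on Windows; at most one lstat-style
            # call per entry elsewhere, cached on the DirEntry afterwards.
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes
```

On Windows the same stat result also carries st_file_attributes, so the attributes asked about above need no separate DirEntry field.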

> Does your implementation use a free list to avoid the cost of memory
> allocation? A short free list of 10 or maybe just 1 may help. The free
> list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under "possible
improvements". However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.

> Does it support also bytes filenames on UNIX?

> Python now supports undecodable filenames thanks to the PEP 383
> (surrogateescape). I prefer to use the same type for filenames on
> Linux and Windows, so Unicode is better. But some users might prefer
> bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.

> Crazy idea: would it be possible to "convert" a DirEntry object to a
> pathlib.Path object without losing the cache? I guess that
> pathlib.Path expects a full  stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.

> I don't understand how you can build a full lstat() result without
> really calling stat. I see that WIN32_FIND_DATA contains the size, but
> here you call lstat().

See above.

> Do you plan to continue to maintain your module for Python < 3.5, but
> upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For the Python 3.5 stdlib,
the implementation will be integrated into posixmodule.c, of course.

>> Should there be a way to access the full path?
>> --
>>
>> Should ``DirEntry``'s have a way to get the full path without using
>> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
>> and it may be useful to add pathlib-like ``str(entry)`` functionality.
>> This functionality has also been requested in `issue 13`_ on GitHub.
>>
>> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
>
> I think that it would be very convenient to store the directory name
> in the DirEntry. It should be light, it's just a reference.
>
> And provide a fullname() method which would just return
> os.path.join(path, entry.name) without trying to resolve path to get
> an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?
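
For reference, the version that shipped in Python 3.5 settled on a plain ``path`` attribute rather than fullname() or __str__(), which sidesteps the str/bytes question: the attribute has the same type as the argument passed to scandir(). A small sketch (entry_paths is a made-up helper name):

```python
import os


def entry_paths(path):
    """Return each entry's full path, checking it matches os.path.join()."""
    paths = []
    for entry in os.scandir(path):
        # entry.path is not resolved to an absolute path; it just joins the
        # scandir() argument and the name, keeping the type (str or bytes).
        assert entry.path == os.path.join(path, entry.name)
        paths.append(entry.path)
    return sorted(paths)
```

Passing bytes in, e.g. entry_paths(os.fsencode(some_dir)), gives bytes paths back, mirroring what os.listdir() does for names.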

> Would it be hard to implement the wildcard feature on UNIX to compare
> performances of scandir('*.jpg') with and without the wildcard built
> in os.scandir?

It's a good idea, the problem with this is that the Windows wildca

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
Re is_dir etc being properties rather than methods:

>> I find this behaviour a bit misleading: using methods and have them
>> return cached results. How much (implementation and/or performance
>> and/or memory) overhead would incur by using property-like access here?
>> I think this would underline the static nature of the data.
>>
>> This would break the semantics with respect to pathlib, but they're only
>> marginally equal anyways -- and as far as I understand it, pathlib won't
>> cache, so I think this has a fair point here.
>
> Indeed - using properties rather than methods may help emphasise the
> deliberate *difference* from pathlib in this case (i.e. value when the
> result was retrieved from the OS, rather than the value right now). The main
> benefit is that switching from using the DirEntry object to a pathlib Path
> will require touching all the places where the performance characteristics
> switch from "memory access" to "system call". This benefit is also the main
> downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
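
To illustrate the point, this is the kind of defensive loop that reads naturally with method calls and would look odd around attribute access. It is a sketch against the os.scandir() that shipped in 3.5; the helper name safe_subdirs is hypothetical.

```python
import os


def safe_subdirs(path):
    """Return (subdirectory names, [(name, error), ...]) for one directory."""
    subdirs, errors = [], []
    for entry in os.scandir(path):
        try:
            # May need a stat call (e.g. for a symlink, or when the OS did
            # not report the entry type), so it can raise OSError.
            if entry.is_dir():
                subdirs.append(entry.name)
        except OSError as exc:
            errors.append((entry.name, exc))
    return sorted(subdirs), errors
```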

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.

> * +1 on a new section in the PEP covering rejected design options (calling
> it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.

> * regarding "why not a 2-tuple", we know from experience that operating
> systems evolve and we end up wanting to add additional info to this kind of
> API. A dedicated DirEntry type lets us adjust the information returned over
> time, without breaking backwards compatibility and without resorting to ugly
> hacks like those in some of the time and stat APIs (or even our own codec
> info APIs)

Fully agreed.

> * it would be nice to see some relative performance numbers for NFS and CIFS
> network shares - the additional network round trips can make excessive stat
> calls absolutely brutal from a speed perspective when using a network drive
> (that's why the stat caching added to the import system in 3.3 dramatically
> sped up the case of having network drives on sys.path, and why I thought AJ
> had a point when he was complaining about the fact we didn't expose the
> dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 29 June 2014 05:48, Ben Hoyt  wrote:
>>> But the underlying system calls -- ``FindFirstFile`` /
>>> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
>>
>> What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide 
>> readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

*nix and POSIX-based are the two conventions I use.


>> Crazy idea: would it be possible to "convert" a DirEntry object to a
>> pathlib.Path object without losing the cache? I guess that
>> pathlib.Path expects a full  stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.

The key problem with caches on pathlib.Path objects is that you could
end up with two separate path objects that referred to the same
filesystem location but returned different answers about the
filesystem state because their caches might be stale. DirEntry is
different, as the content is generally *assumed* to be stale
(referring to when the directory was scanned, rather than the current
filesystem state). DirEntry.lstat() on POSIX systems will be an
exception to that general rule (referring to the time of first lookup,
rather than when the directory was scanned, so the answer from lstat()
may be inconsistent with other data stored directly on the DirEntry
object), but one we can probably live with.
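
A small demonstration of that staleness, using the os.scandir() that later shipped in 3.5 (where lstat() became stat(follow_symlinks=False); plain stat() is used here since no symlinks are involved). The function name and the file name are made up for the example.

```python
import os
import tempfile


def demo_stale_entry():
    """Return (first, cached, fresh) sizes for a file that grows after scanning."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.bin")
        with open(target, "wb") as f:
            f.write(b"x")

        entry = list(os.scandir(tmp))[0]
        first = entry.stat().st_size      # looked up (or taken from the scan) once

        with open(target, "ab") as f:     # the file changes behind the entry's back
            f.write(b"yyy")

        cached = entry.stat().st_size          # served from the DirEntry cache
        fresh = os.stat(entry.path).st_size    # asks the filesystem again
        return first, cached, fresh
```

The cached value keeps describing the file as it was at first lookup, while a fresh os.stat() on the same path sees the appended bytes.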

More generally, as part of the pathlib PEP review, we figured out that
a *per-object* cache of filesystem state would be an inherently bad
idea, but a string based *process global* cache might make sense for
modules like walkdir (not part of the stdlib - it's an iterator
pipeline based approach to file tree scanning I wrote a while back,
that currently suffers badly from the performance impact of repeated
stat calls at different stages of the pipeline). We realised this was
getting into a space where application and library specific concerns
are likely to start affecting the caching design, though, so the
current status of standard library level stat caching is "it's not
clear if there's an available approach that would be sufficiently
general purpose to be appropriate for inclusion in the standard
library".

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 29 June 2014 05:55, Ben Hoyt  wrote:
> Re is_dir etc being properties rather than methods:
>
>>> I find this behaviour a bit misleading: using methods and have them
>>> return cached results. How much (implementation and/or performance
>>> and/or memory) overhead would incur by using property-like access here?
>>> I think this would underline the static nature of the data.
>>>
>>> This would break the semantics with respect to pathlib, but they're only
>>> marginally equal anyways -- and as far as I understand it, pathlib won't
>>> cache, so I think this has a fair point here.
>>
>> Indeed - using properties rather than methods may help emphasise the
>> deliberate *difference* from pathlib in this case (i.e. value when the
>> result was retrieved from the OS, rather than the value right now). The main
>> benefit is that switching from using the DirEntry object to a pathlib Path
>> will require touching all the places where the performance characteristics
>> switch from "memory access" to "system call". This benefit is also the main
>> downside, so I'd actually be OK with either decision on this one.
>
> The problem with this is that properties "look free", they look just
> like attribute access, so you wouldn't normally handle exceptions when
> accessing them. But .lstat() and .is_dir() etc may do an OS call, so
> if you're needing to be careful with error handling, you may want to
> handle errors on them. Hence I think it's best practice to make them
> functions().
>
> Some of us discussed this on python-dev or python-ideas a while back,
> and I think there was general agreement with what I've stated above
> and therefore they should be methods. But I'll dig up the links and
> add to a Rejected ideas section.

Yes, only the stuff that *never* needs a system call (regardless of
OS) would be a candidate for handling as a property rather than a
method call. Consistency of access would likely trump that idea
anyway, but it would still be worth ensuring that the PEP is clear on
which values are guaranteed to reflect the state at the time of the
directory scanning and which may imply an additional stat call.

>> * it would be nice to see some relative performance numbers for NFS and CIFS
>> network shares - the additional network round trips can make excessive stat
>> calls absolutely brutal from a speed perspective when using a network drive
>> (that's why the stat caching added to the import system in 3.3 dramatically
>> sped up the case of having network drives on sys.path, and why I thought AJ
>> had a point when he was complaining about the fact we didn't expose the
>> dirent data from os.listdir)
>
> Don't know if you saw, but there are actually some benchmarks,
> including one over NFS, on the scandir GitHub page:
>
> https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)

> os.walk() was 23 times faster with scandir() than the current
> listdir() + stat() implementation on the Windows NFS file system I
> tried. Pretty good speedup!

Ah, nice!

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Gregory P. Smith
On Jun 28, 2014 12:49 PM, "Ben Hoyt"  wrote:
>
> >> But the underlying system calls -- ``FindFirstFile`` /
> >> ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
> >
> > What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

No, just say POSIX.

>
> > It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
> > should mimic stat_result recent addition: the new
> > stat_result.file_attributes field. Add DirEntry.file_attributes which
> > would only be available on Windows.
> >
> > The Windows structure also contains
> >
> >   FILETIME ftCreationTime;
> >   FILETIME ftLastAccessTime;
> >   FILETIME ftLastWriteTime;
> >   DWORD    nFileSizeHigh;
> >   DWORD    nFileSizeLow;
> >
> > It would be nice to expose them as well. I'm  no more surprised that
> > the exact API is different depending on the OS for functions of the os
> > module.
>
> I think you've misunderstood how DirEntry.lstat() works on Windows --
> it's basically a no-op, as Windows returns the full stat information
> with the original FindFirst/FindNext OS calls. This is fairly explicit
> in the PEP, but I'm sure I could make it clearer:
>
> DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"
>
> So you can already get the dwFileAttributes for free by saying
> entry.lstat().st_file_attributes. You can also get all the other
> fields you mentioned for free via .lstat() with no additional OS calls
> on Windows, for example: entry.lstat().st_size.
>
> Feel free to suggest changes to the PEP or scandir docs if this isn't
> clear. Note that is_dir()/is_file()/is_symlink() are free on all
> systems, but .lstat() is only free on Windows.
>
> > Does your implementation use a free list to avoid the cost of memory
> > allocation? A short free list of 10 or maybe just 1 may help. The free
> > list may be stored directly in the generator object.
>
> No, it doesn't. I might add this to the PEP under "possible
> improvements". However, I think the speed increase by removing the
> extra OS call and/or disk seek is going to be way more than memory
> allocation improvements, so I'm not sure this would be worth it.
>
> > Does it support also bytes filenames on UNIX?
>
> > Python now supports undecodable filenames thanks to the PEP 383
> > (surrogateescape). I prefer to use the same type for filenames on
> > Linux and Windows, so Unicode is better. But some users might prefer
> > bytes for other reasons.
>
> I forget exactly now what my scandir module does, but for os.scandir()
> I think this should behave exactly like os.listdir() does for
> Unicode/bytes filenames.
>
> > Crazy idea: would it be possible to "convert" a DirEntry object to a
> > pathlib.Path object without losing the cache? I guess that
> > pathlib.Path expects a full  stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.
>
> > I don't understand how you can build a full lstat() result without
> > really calling stat. I see that WIN32_FIND_DATA contains the size, but
> > here you call lstat().
>
> See above.
>
> > Do you plan to continue to maintain your module for Python < 3.5, but
> > upgrade your module for the final PEP?
>
> Yes, I intend to maintain the standalone scandir module for 2.6 <=
> Python < 3.5, at least for a good while. For integration into the
> Python 3.5 stdlib, the implementation will be integrated into
> posixmodule.c, of course.
>
> >> Should there be a way to access the full path?
> >> --
> >>
> >> Should ``DirEntry``'s have a way to get the full path without using
> >> ``os.path.join(path, entry.name)``? This is a pretty common pattern,
> >> and it may be useful to add pathlib-like ``str(entry)`` functionality.
> >> This functionality has also been requested in `issue 13`_ on GitHub.
> >>
> >> .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
> >
> > I think that it would be very convenient to store the directory name
> > in the DirEntry. It should be light, it's just a reference.
> >
> > And provide a fullname() name which would just return
> > os.path.join(path, entry.name) without trying to resolve path to get
> > an absolute path.
>
> Yeah, fair suggestion. I'm still slightly on the fence about this, but
> I think an explicit fullname() is a good suggestion. Ideally I think
> it'd be better to mimic pathlib.Path.__str__() which is kind of the
> equivalent of fullname(). But how does pathlib deal with unicode/bytes
> issues if it's the str function which has to return a str object? Or
> at least, it'd be very weird if __str__() returned bytes. But I think
> it'd need to if you passed bytes i