Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Feb 9, 2016, at 20:17, Stephen J. Turnbull wrote:

>> It really requires going through all the OS calls and either (a) making
>> them consistently decode bytes to str using the declared FS encoding
>> (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
>
> If it were that easy, it would have been done two decades ago. I'm no
> fan of Windows[1], but it's obvious that Microsoft has devoted
> enormous amounts of brainpower to the problem of encoding
> rationalization since the early 90s. I don't think they would have
> missed this idea.

Microsoft spent a lot of time and effort on the idea that UTF-16 (or, originally, UCS-2) everywhere was the answer. Never call the A functions (or the msvcrt functions that emulate the C and POSIX stdlib), and there's never a problem.

What if you read filenames out of a text file? No problem; text files are UTF-16-BOM. Over a socket? All network protocols are also UTF-16. What if you have to read a file written in Unix? Come on, nobody's ever created a useful file without Windows. What about Windows 3.1? Uh... that's a problem. Also, what happens when Unicode goes over 64k characters? And so on. So their grand project failed.

That doesn't mean the problem can't be solved. Apple solved their equivalent problem, albeit by sacrificing backward compatibility in a way Microsoft can't get away with. I haven't seen a MacRoman or Shift-JIS filename since they broke the last holdout (the low-level AppleEvent interface) in 10.7--and most of the apps I was using back then don't run on 10.10 without an update.

So Python 2 works great on Macs, whether you use bytes or unicode. But that doesn't help us on Windows, where you can't use bytes, or Linux, where you can't use Unicode (without surrogate escape or some other mechanism that Python 2 doesn't have).
___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
Executive summary: Code pages and POSIX locales aren't solutions, they're the Original Sin.

Steve Dower writes:
> On 09Feb2016 2017, Stephen J. Turnbull wrote:
> > > The problem here is the protocol that Python uses to return
> > > bytes paths, and that protocol is inconsistent between APIs
> > > and information is lost.
> >
> > No, the problem is that the necessary information simply isn't always
> > available.
>
> But if we return bytes paths and the user passes them back in unchanged,
> that should be irrelevant.

Yes. That's pretty much exactly the semantics of using the latin-1 codec. UTF-8 can't do that without surrogateescape, which Python 2 lacks.

> The earlier issue was that that doesn't work (e.g. a bytes path
> from os.scandir couldn't be passed back into open()).

My purely-from-the-user-side take is that that's just a bug in os.scandir that should be fixed, and that even though the complexity that occasions such bugs is an undesirable aspect of Python (v2) programming, it's not a bug because it *can't* be fixed -- you have to fix the world, not Python. Or switch to Python 3.

I don't know enough to have an opinion on whether "fixing" os.scandir could cause other problems.

> I meant with Python's calls into the API. Anywhere Python does the
> conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance
> it'll be wrong.

Indeed. That's why converting the bytes is often the wrong thing to do *period*. The reasons that Python might be wrong apply to every agent that might decide the conversion -- except the user; the user is never wrong about these things.

> Microsoft's solution here is the user's active code page, much like
> *nix's solution as I understand it, except that where *nix will convert
> _to_ the encoding as a normalized form, Windows will convert _from_ the
> encoding to its UTF-16 "normalized" form.

Not quite accurate. Unix by original design doesn't *have* a normalized form.[1] Bytez-iz-bytez-R-Us, that's Unix.

Recently everybody (except for a few nationalist lunatics and the unteachables in some legislatures) has learned that some form of Unicode is the way to go internally. But that's "best practice", not POSIX requirement, and tons of software continues to operate[2] based on the assumption that users are monolingual with a canonical one-byte encoding, so it doesn't matter as long as *no conversion is ever done*, and the input methods and fonts are consistent with each other. Code pages just try to *enforce* that constraint (and as I already mentioned, that pissed me off so much in 1990 that I'm still a Windows refusenik today).

> Back-compat concerns have prevented any significant changes being
> made here, otherwise there wouldn't be a 'bytes' interface at
> all.

It's not just back-compat, it's absolutely necessary in a code-page-based world because you just can't be sure what encoding your content is in until the user tells you the crap you've spewed on her screen might be Klingon, but it's not any of the 7 human languages she knows. "Toto! I don't think we're in Kansas any more"

The fact is that code-page-based content continues to be produced in significant quantities, despite the universal availability and absolute superiority (except for workstation reconfiguration costs) of Unicode.

Footnotes:
[1] The POSIX locale selects encodings for console input and output. File I/O is just bytes, both the content and the file name. The code page also defines the file name encoding as I understand it.

[2] I would hope that nobody is *writing* software like that any more, but I live in Japan. That hope is years in the future for me.
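To make the round-trip remark about latin-1 and surrogateescape concrete, here is a minimal Python 3 sketch. The helper name `roundtrips` and the constant `ALL_BYTES` are made up for this illustration and do not come from the thread. It checks whether decoding arbitrary bytes and encoding them again gives back the same bytes under three codec settings.

```python
def roundtrips(raw: bytes, encoding: str, errors: str = "strict") -> bool:
    """Return True if decode followed by encode reproduces the input bytes."""
    try:
        return raw.decode(encoding, errors).encode(encoding, errors) == raw
    except UnicodeError:
        # Strict codecs refuse byte sequences that are invalid in the encoding.
        return False


# Every possible byte value once: not valid UTF-8, but valid latin-1.
ALL_BYTES = bytes(range(256))

print(roundtrips(ALL_BYTES, "latin-1"))                   # latin-1 maps each byte to one code point
print(roundtrips(ALL_BYTES, "utf-8"))                     # strict UTF-8 rejects the input
print(roundtrips(ALL_BYTES, "utf-8", "surrogateescape"))  # invalid bytes are smuggled as lone surrogates
```

The surrogateescape handler exists only in Python 3, which is why the same trick is not available to Python 2 code without a backport.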
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On 10 February 2016 at 08:00, Stephen J. Turnbull wrote:
>> The earlier issue was that that doesn't work (e.g. a bytes path
>> from os.scandir couldn't be passed back into open()).
>
> My purely-from-the-user-side take is that that's just a bug in
> os.scandir that should be fixed, and that even though the complexity
> that occasions such bugs is an undesirable aspect of Python (v2)
> programming, it's not a bug because it *can't* be fixed -- you have to
> fix the world, not Python. Or switch to Python 3.
>
> I don't know enough to have an opinion on whether "fixing" os.scandir
> could cause other problems.

The original os.scandir issue was encountered on Python 3. And I do agree with Victor that the correct answer was to point out to the user that they should be using unicode/surrogateescape. What I disagree with is mandating that (by removing the bytes interface) on anything other than all platforms at once, because that doesn't remove the problem (of coders using the wrong approach on Python 3); it just makes the code such users write non-portable.

Whether removing the bytes interface is feasible, given that there's then no way that works across Python 2 and 3 of writing code that manipulates the sort of bytes-that-use-multiple-encodings data that you mention, is a separate issue.

Paul
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
2016-02-10 9:30 GMT+01:00 Paul Moore:
> Whether removing the bytes interface is feasible, given that there's
> then no way that works across Python 2 and 3 of writing code that
> manipulates the sort of bytes-that-use-multiple-encodings data that
> you mention, is a separate issue.

It's annoying that 8 years after the release of Python 3.0, Python 3 is still stuck by Python 2 :-(

Victor
Re: [Python-Dev] Experiences with Creating PEP 484 Stub Files
> On 9 Feb 2016, at 11:48 pm, Guido van Rossum wrote:
>
> [Phil]
>>>> I found the documentation confusing regarding Optional. Intuitively it
>>>> seems to be the way to specify arguments with default values. However it
>>>> is explained in terms of (for example) Union[str, None] and I (intuitively
>>>> but incorrectly) read that as meaning "a str or None" as opposed to "a str
>>>> or nothing".
> [me]
>>> But it *does* mean 'str or None'. The *type* of an argument doesn't
>>> have any bearing on whether it may be omitted from the argument list
>>> by the caller -- these are orthogonal concepts (though sadly the word
>>> optional might apply to both). It's possible (though unusual) to have
>>> an optional argument that must be a str when given; it's also possible
>>> to have a mandatory argument that may be a str or None.
> [Phil]
>> In the case of Python wrappers around a C++ library then *every* optional
>> argument will have to have a specific type when given.
>
> IIUC you're saying that every argument that may be omitted must still
> have a definite type other than None. Right? In that case just don't
> use Optional[]. If a signature has the form
>
>     def foo(a: str = 'xyz') -> str: ...
>
> then this means that a may be omitted or it may be a str -- you
> cannot call foo(a=None).
>
> You can even (in a stub file) write this as:
>
>     def foo(a: str = ...) -> str: ...
>
> (literal '...' i.e. ellipsis) if you don't want to commit to a
> specific default value (it makes no difference to mypy).
>
>> So you are saying that a mandatory argument that may be a str or None would
>> be specified as Union[str, None]?
>
> Or as Optional[str], which means the same.
>
>> But the docs say that that is the underlying implementation of Option[str] -
>> which (to me) means an optional argument that should be a string when given.
>
> (Assuming you meant Option*al*.) There seems to be an utter confusion
> of the two uses of the term "optional" here. An "optional argument"
> (outside PEP 484) is one that has a default value. The "Optional[T]"
> notation in PEP 484 means "Union[T, None]". They mean different
> things.
>
>>> Can you help improve the wording in the docs (preferably by filing an
>>> issue)?
>>
>> When I eventually understand what it means...

I understand now. The documentation, as it stands, is correct and consistent but (to me) the meaning of Optional is completely counter-intuitive. What you suggest with str = ... is exactly what I need. Adding a section to the docs describing that should clear up the confusion.

Thanks,
Phil
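The two meanings of "optional" discussed above can be shown side by side. This is a minimal sketch; the function names `greet` and `describe` are invented for the illustration and appear nowhere in PEP 484 or in this thread.

```python
from typing import Optional


def greet(name: str = "world") -> str:
    """Optional *argument*: may be omitted, but must be a str when given."""
    return "Hello, " + name


def describe(value: Optional[str]) -> str:
    """Mandatory argument with an Optional *type*: must be passed, may be None."""
    return "nothing" if value is None else "text: " + value


print(greet())           # the argument is omitted, the default is used
print(describe(None))    # None is an allowed value, but it has to be passed
```

Calling `describe()` with no argument raises `TypeError` at runtime, while a type checker such as mypy would be expected to reject `greet(None)`.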
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On 10 February 2016 at 08:45, Victor Stinner wrote:
> 2016-02-10 9:30 GMT+01:00 Paul Moore:
>> Whether removing the bytes interface is feasible, given that there's
>> then no way that works across Python 2 and 3 of writing code that
>> manipulates the sort of bytes-that-use-multiple-encodings data that
>> you mention, is a separate issue.
>
> It's annoying that 8 years after the release of Python 3.0, Python 3
> is still stuck by Python 2 :-(

Agreed. Of course personally, I'm in favour of going Python 3/Unicode everywhere, it's the Unix guys with their legacy distros and Python installations and bytes-based filesystems that get in the way of that :-) And I don't think we're brave enough to force *Unix* users to use the right type for filenames :-)

Paul
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Wednesday, February 10, 2016 12:47 AM, Victor Stinner wrote:
>
> 2016-02-10 9:30 GMT+01:00 Paul Moore:
>> Whether removing the bytes interface is feasible, given that there's
>> then no way that works across Python 2 and 3 of writing code that
>> manipulates the sort of bytes-that-use-multiple-encodings data that
>> you mention, is a separate issue.

Well, there's a surrogate-escape backport on PyPI (I think there's a standalone one, and one in python-future), so you _could_ do everything the same as in 3.x. Depending on what you're doing, you may also need to use the io module instead of file (which may just mean "from io import open", but could mean more work), wrap the stdio streams explicitly, manually decode argv, etc. But someone could write a six-like module (or add it to six) that does all of that.

It may be a little slower and more memory-intensive in 2.7 than in 3.x, but for most apps, that doesn't matter.

The big problem would be third-party libraries (and stdlib modules like csv) that want to use bytes in 2.x; convincing them all to support full-on-unicode in 2.x might be more trouble than it's worth. Still, if I were feeling the pain of maintaining lots of linux-bytes-Windows-unicode-2.7 code, I'd try it and see how far I get.

> It's annoying that 8 years after the release of Python 3.0, Python 3
> is still stuck by Python 2 :-(

I understand the frustration, but... time already goes too fast at my age; don't skip me ahead almost a whole year to December 2016. :)

Also, unless you're the one guy who actually abandoned 2.6 for 3.0, it's probably more useful to count from 2.7, 3.2, or the no-2.8 declaration, which are all about 5 years ago.
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Wed, Feb 10, 2016 at 12:41:08PM +1100, Chris Angelico wrote:
> On Wed, Feb 10, 2016 at 12:37 PM, Steve Dower wrote:
> > I really don't like the idea of not being able to use bytes in cross
> > platform code. Unless it's become feasible to use Unicode for lossless
> > filenames on Linux - last I heard it wasn't.
>
> It has, but only in Python 3 - anyone who needs to support 2.7 and
> arbitrary bytes in filenames can't use Unicode strings.

Are you sure? Unless I'm confused, which I may be, I don't think you can specify file names with arbitrary bytes in Python 3.

Writing, and reading, filenames including odd bytes works in Python 2.7:

    [steve@ando ~]$ python -c 'open("/tmp/abc\xD8\x01", "w").write("Hello World\n")'
    [steve@ando ~]$ ls /tmp/abc*
    /tmp/abc??
    [steve@ando ~]$ python -c 'print open("/tmp/abc\xD8\x01", "r").read()'
    Hello World

    [steve@ando ~]$

And I can read the file using bytes in Python 3:

    [steve@ando ~]$ python3.3 -c 'print(open(b"/tmp/abc\xD8\x01", "r").read())'
    Hello World

    [steve@ando ~]$

But Unicode fails:

    [steve@ando ~]$ python3.3 -c 'print(open("/tmp/abc\xD8\x01", "r").read())'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/abcØ\x01'

What Unicode string does one need to give in order to open file b"/tmp/abc\xD8\x01"?

I think one would need to find a valid unicode string which, when encoded to UTF-8, gives the byte sequence \xD8\x01, but since that's half of a surrogate pair it is an illegal UTF-8 byte sequence. So I don't think it can be done. Am I mistaken?

--
Steve
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
2016-02-10 11:18 GMT+01:00 Steven D'Aprano:
> [steve@ando ~]$ python3.3 -c 'print(open(b"/tmp/abc\xD8\x01", "r").read())'
> Hello World
>
> [steve@ando ~]$ python3.3 -c 'print(open("/tmp/abc\xD8\x01", "r").read())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/abcØ\x01'
>
> What Unicode string does one need to give in order to open file
> b"/tmp/abc\xD8\x01"?

Use os.fsdecode(b"/tmp/abc\xD8\x01") to get the filename as a Unicode string, it will work.

Removing 'b' in front of byte strings is not enough to convert an arbitrary byte string to Unicode :-D Encodings are more complex than that... See http://unicodebook.readthedocs.org/

The problem on Python 2 is that the UTF-8 encoder encodes surrogate characters, which is wrong. You cannot use an error handler to choose how to handle these surrogate characters.

On Python 3, you have a wide choice of builtin error handlers, and you can even write your own error handlers. Example with Python 3.6, including the "namereplace" error handler (added in Python 3.5).

    >>> import os
    >>> def format_filename(filename, encoding='ascii', errors='backslashreplace'):
    ...     return filename.encode(encoding, errors).decode(encoding)
    ...

    >>> print(format_filename(os.fsdecode(b'abc\xff')))
    abc\udcff

    >>> print(format_filename(os.fsdecode(b'abc\xff'), errors='replace'))
    abc?

    >>> print(format_filename(os.fsdecode(b'abc\xff'), errors='ignore'))
    abc

    >>> print(format_filename(os.fsdecode(b'abc\xff') + "é", errors='namereplace'))
    abc\udcff\N{LATIN SMALL LETTER E WITH ACUTE}

My locale encoding is UTF-8.

Victor
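As a sketch of what os.fsdecode does with the filename from the question, assuming a POSIX system whose filesystem encoding is UTF-8: the explicit codec calls below stand in for os.fsdecode and os.fsencode so that the result does not depend on the local configuration. The variable names are chosen for this illustration only.

```python
raw_name = b"/tmp/abc\xD8\x01"

# Decode the way os.fsdecode would under a UTF-8 filesystem encoding:
# the undecodable byte 0xD8 becomes the lone surrogate U+DCD8.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text_name.encode("utf-8", "surrogateescape")

# Merely dropping the b prefix gives a different string: "\xD8" is then the
# character U+00D8, which UTF-8 encodes as two bytes, not as the byte 0xD8.
naive = "/tmp/abc\xD8\x01".encode("utf-8")

print(ascii(text_name))
print(restored == raw_name)
print(naive)
```

This is why the plain str literal in the failing command names a different file than the bytes literal does.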
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On 08.02.16 16:32, Victor Stinner wrote:
> On Python 2, it wasn't possible to use Unicode for filenames, many
> functions fail badly with Unicode, especially when you mix bytes and
> Unicode.

Not even all os functions support Unicode. See http://bugs.python.org/issue18695.
Re: [Python-Dev] Experiences with Creating PEP 484 Stub Files
On 10 February 2016 at 06:54, Guido van Rossum wrote:
> [Just adding to Andrew's response]
>
> On Tue, Feb 9, 2016 at 9:58 AM, Andrew Barnert via Python-Dev wrote:
>> On Feb 9, 2016, at 03:44, Phil Thompson wrote:
>>>
>>> There are a number of things I'd like to express but cannot find a way to
>>> do so...
>>>
>>> - objects that implement the buffer protocol
>>
>> That seems like it should be filed as a bug with the typing repo. Presumably
>> this is just an empty type that registers bytes, bytearray, and memoryview,
>> and third-party classes have to register with it manually?
>
> Hm, there's no way to talk about these in regular Python code either,
> is there? I think that issue should be resolved first. Probably by
> adding something to collections.abc. And then we can add the
> corresponding name to typing.py. This will take time though (have to
> wait for 3.6) so I'd recommend 'Any' for now (and filing those bugs).

Somewhat related, there's actually no way to export PEP 3118 buffers directly from a type implemented in Python: http://bugs.python.org/issue13797

Cython and PyPy each have their own approach to handling that, but there's no language level cross-interpreter convention.

A type (e.g. BytesLike, given the change we made to relevant error messages) could still be added to collections.abc without addressing that problem, it would just need to be empty and used only for explicit registration without any structural typing support.

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
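A minimal sketch of the "empty ABC used only for explicit registration" idea, written with the standard abc module. The class name `BytesLike` is taken from the suggestion above, but this definition and the helper `is_bytes_like` are hypothetical; nothing like it is claimed to exist in collections.abc.

```python
from abc import ABC
import array


class BytesLike(ABC):
    """Hypothetical marker ABC: no methods, membership only by registration."""


# Register the built-in buffer exporters named in the discussion.
for _buffer_type in (bytes, bytearray, memoryview):
    BytesLike.register(_buffer_type)


def is_bytes_like(obj) -> bool:
    """Return True only for instances of explicitly registered types."""
    return isinstance(obj, BytesLike)


print(is_bytes_like(b"abc"))                 # registered above
print(is_bytes_like(array.array("b", [1])))  # supports the buffer protocol, but was never registered
```

The array.array case shows the limitation: without structural support, a type that implements the buffer protocol is not recognised until someone registers it.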
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
Victor Stinner writes:
> It's annoying that 8 years after the release of Python 3.0, Python 3
> is still stuck by Python 2 :-(

I prefer to think of it as the irritant that reminds me that I am very much alive, and so is Python, vibrantly so.
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
Andrew Barnert via Python-Dev writes:
> That doesn't mean the problem can't be solved. Apple solved their
> equivalent problem, albeit by sacrificing backward compatibility in
> a way Microsoft can't get away with. I haven't seen a MacRoman or
> Shift-JIS filename since they broke the last holdout

If you lived where I do, you'd still be seeing both, because you wouldn't be able to escape archival files on CD and removable media (typically written on Windows boxen). They still work, sort of == same as always, and as far as I know, that's because Apple has *not* sacrificed backward compatibility: under the hood, Darwin is still a POSIX kernel which thinks of file names and everything else outside of memory as bytestreams.

One place they *fail very badly* is Shift JIS filenames in zipfiles, which nothing provided by Apple can deal with safely, and InfoZip breaks too (at least in MacPorts). Yes, I know that is specifically disallowed. Feel free to tell 1__ Japanese Windows users. Thank heaven for Python there! A three-line hack and I'm free!

> So Python 2 works great on Macs, whether you use bytes or
> unicode. But that doesn't help us on Windows, where you can't use
> bytes, or Linux, where you can't use Unicode (without surrogate
> escape or some other mechanism that Python 2 doesn't have).

You contradict yourself! ;-)
Re: [Python-Dev] Experiences with Creating PEP 484 Stub Files
On Wed, Feb 10, 2016 at 1:11 AM, Phil Thompson wrote:
> I understand now. The documentation, as it stands, is correct and consistent
> but (to me) the meaning of Optional is completely counter-intuitive. What you
> suggest with str = ... is exactly what I need. Adding a section to the docs
> describing that should clear up the confusion.

I tried to add some clarity to the docs with this paragraph:

    Note that this is not the same concept as an optional argument,
    which is one that has a default. An optional argument with a
    default needn't use the ``Optional`` qualifier on its type
    annotation (although it is inferred if the default is ``None``).
    A mandatory argument may still have an ``Optional`` type if an
    explicit value of ``None`` is allowed.

Should be live on docs.python.org with the next push (I don't recall the delay, at most a day IIRC).

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Experiences with Creating PEP 484 Stub Files
On 10 Feb 2016, at 5:52 pm, Guido van Rossum wrote:
>
> On Wed, Feb 10, 2016 at 1:11 AM, Phil Thompson wrote:
>> I understand now. The documentation, as it stands, is correct and consistent
>> but (to me) the meaning of Optional is completely counter-intuitive. What
>> you suggest with str = ... is exactly what I need. Adding a section to the
>> docs describing that should clear up the confusion.
>
> I tried to add some clarity to the docs with this paragraph:
>
>     Note that this is not the same concept as an optional argument,
>     which is one that has a default. An optional argument with a
>     default needn't use the ``Optional`` qualifier on its type
>     annotation (although it is inferred if the default is ``None``).
>     A mandatory argument may still have an ``Optional`` type if an
>     explicit value of ``None`` is allowed.
>
> Should be live on docs.python.org with the next push (I don't recall
> the delay, at most a day IIRC).

That should do it, thanks. A followup question...

Is...

    def foo(bar: str = Optional[str])

...valid? In other words, bar can be omitted, but if specified must be a str or None?

Thanks,
Phil
Re: [Python-Dev] Experiences with Creating PEP 484 Stub Files
On Wed, Feb 10, 2016 at 10:01 AM, Phil Thompson wrote:
> On 10 Feb 2016, at 5:52 pm, Guido van Rossum wrote:
[...]
> That should do it, thanks. A followup question...
>
> Is...
>
>     def foo(bar: str = Optional[str])
>
> ...valid? In other words, bar can be omitted, but if specified must be a str
> or None?

The syntax you gave makes no sense (the default value shouldn't be a type) but to do what your words describe you can do

    def foo(bar: Optional[str] = ...): ...

That's literally what you would put in the stub file (the ... are literal ellipses).

In a .py file you'd have to specify a concrete default value. If your concrete default is neither str nor None you'd have to use cast(str, default_value), e.g.

    _NO_VALUE = object()  # marker

    def foo(bar: Optional[str] = cast(str, _NO_VALUE)):
        ...implementation...

Now the implementation can distinguish between foo(), foo(None) and foo('').

--
--Guido van Rossum (python.org/~guido)
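A runnable version of the marker pattern sketched above might look like this. The body of `foo` and its return strings are invented for the illustration; only `_NO_VALUE`, `cast` and the signature follow the message.

```python
from typing import Optional, cast

_NO_VALUE = object()  # marker meaning "the caller passed nothing"


def foo(bar: Optional[str] = cast(str, _NO_VALUE)) -> str:
    # cast() returns its argument unchanged at runtime, so an identity
    # check against the marker still works.
    if bar is _NO_VALUE:
        return "omitted"
    if bar is None:
        return "explicit None"
    return "str of length %d" % len(bar)


print(foo())
print(foo(None))
print(foo(""))
```

The cast only silences the type checker about the default; it does not convert the marker into a string.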
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Wednesday, February 10, 2016 6:50 AM, Stephen J. Turnbull wrote:

> Andrew Barnert via Python-Dev writes:
>
>> That doesn't mean the problem can't be solved. Apple solved their
>> equivalent problem, albeit by sacrificing backward compatibility in
>> a way Microsoft can't get away with. I haven't seen a MacRoman or
>> Shift-JIS filename since they broke the last holdout
>
> If you lived where I do, you'd still be seeing both, because you
> wouldn't be able to escape archival files on CD and removable media
> (typically written on Windows boxen). They still work, sort of ==
> same as always, and as far as I know, that's because Apple has *not*
> sacrificed backward compatibility: under the hood, Darwin is still a
> POSIX kernel which thinks of file names and everything else outside of
> memory as bytestreams.

Sure, but the Darwin kernel can't read CDs; that's up to the CD filesystem driver. Anyway, Windows CDs can't cause this problem.

Windows CDs use the Joliet filesystem,[^1] which stores everything in UCS2.[^2] When you call CreateFileA or fopen or _open with bytes, Windows decodes those bytes and stores them as UCS2. The filesystem drivers on POSIX platforms have to encode that UCS2 to _something_ (POSIX APIs make it very hard for you to deal with filename strings like "A\0B\0C\0.\0T\0X\0T\0\0\0"...). The linux driver uses a mount option to decide how to encode; the OS X driver always uses UTF-8. And every valid UCS2 string can be encoded as UTF-8, so you can use unicode everywhere, even in Python 2.

Of course you can have mojibake problems, but that's a different issue,[^3] and no worse with unicode than with bytes.[^4]

The same thing is true with NTFS external drives, VFAT USB drives, etc. Generally, it's usually not Windows media on *nix systems that break Python 2 unicode; it's native *nix filesystems where users mix locales.

> One place they *fail very badly* is Shift JIS filenames in zipfiles,
> which nothing provided by Apple can deal with safely, and InfoZip
> breaks too (at least in MacPorts). Yes, I know that is specifically
> disallowed. Feel free to tell 1__ Japanese Windows users.

The good news is, as far as I can tell, it's not disallowed anymore.[^5] So we just have to tell them that they shouldn't have been doing it in the past. :)

Anyway, zipfiles are data files as far as the OS is concerned; the fact that they contain filenames is no more relevant to the kernel (or filesystem driver or userland) than the fact that "List of PDFs to Read This Weekend.txt" contains filenames.

PS, everything Apple provides is already using Info-ZIP.

>> So Python 2 works great on Macs, whether you use bytes or
>> unicode. But that doesn't help us on Windows, where you can't use
>> bytes, or Linux, where you can't use Unicode (without surrogate
>> escape or some other mechanism that Python 2 doesn't have).
>
> You contradict yourself! ;-)

Yes, as I later realized, sometimes, you _can_ (or at least ought to be able to--I haven't actually tried) use Python 2 with unicode everywhere to write cross-platform software that actually works on linux, by using backports of surrogate-escape and pathlib, and the io module instead of the file type, as long as you only need stdlib and third-party modules that support unicode filenames. If that does work for at least some apps, then I'm perfectly happy to have been wrong earlier. And if catching myself before someone else did makes me a flip-flopper, well, I'm not running for president. :P

[^1]: Except when Vista and 7 mistakenly think your CD is a DVD and use UDF instead of ISO9660--but in that case, the encoding is stored in the filesystem header, so it's also not a problem.

[^2]: Actually, despite Microsoft's spec, later versions of Windows store UTF-16, even if there are surrogate pairs, or BMP-but-post-UCS2 code points. But that doesn't matter here; the linux, Mac, etc. drivers all assume UTF-16, which works either way.

[^3]: Say you write a program that assumes it will only be run on Shift-JIS systems, and you use CreateFileA to create a file named "ハローワールド". The actual bytes you're sending are cp437 for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, "ânâìü[âÅü[âïâh". So of course the Mac driver encodes that to UTF-8 b"ânâìü[âÅü[âïâh". You won't have any problems opening what you readdir, or what you copy from a UTF-8 terminal or a UTF-16 Cocoa app like Finder, etc. But of course you will have trouble getting your user to recognize that name as meaningful, unless you can figure out or guess or prompt the user to guess that it needs to be passed through s.encode('cp437').decode('shift-jis').

[^4]: Your locale is always UTF-8 on Mac. So the only significant difference is that if you're using bytes, you need b.decode('utf-8').encode('cp437').decode('shift-jis') to fix the problem.

[^5]: Zipfiles using the Unicode extension can store a UTF-8 transcoding along with
[Python-Dev] why we have both re.match and re.search?
Hi,

I hope the question is not too silly, but I would like to understand the advantages of having both re.match() and re.search(). Wouldn't it be clearer to have just one function with one additional parameter, like this:

    re.search(regexp, text, from_beginning=True|False)

?

In this way we prevent, as written in the documentation, people writing ".*" in front of the regexp used with re.match().

Thanks.
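For reference, the difference between the two existing functions can be seen in a few lines. This is a small sketch using only the standard re module; the sample text and variable names are arbitrary.

```python
import re

text = "say hello"

# re.match only tries to match at the very start of the string.
at_start = re.match("hello", text)

# re.search scans forward and finds the first position that matches.
anywhere = re.search("hello", text)

# The ".*" prefix mentioned in the question makes re.match succeed,
# at the cost of consuming everything before the word.
padded = re.match(".*hello", text)

print(at_start)
print(anywhere.span())
print(padded.group(0))
```

Note that re.search with a pattern starting with "^" behaves like re.match on single-line input, so each function can already imitate the other.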
[Python-Dev] PEP 515: Underscores in Numeric Literals
This came up in python-ideas, and has met mostly positive comments, although the exact syntax rules are up for discussion.

cheers,
Georg

PEP: 515
Title: Underscores in Numeric Literals
Version: $Revision$
Last-Modified: $Date$
Author: Georg Brandl
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 10-Feb-2016
Python-Version: 3.6

Abstract and Rationale
======================

This PEP proposes to extend Python's syntax so that underscores can be used in integral and floating-point number literals.

This is a common feature of other modern languages, and can aid readability of long literals, or literals whose value should clearly separate into parts, such as bytes or words in hexadecimal notation.

Examples::

    # grouping decimal numbers by thousands
    amount = 10_000_000.0

    # grouping hexadecimal addresses by words
    addr = 0xDEAD_BEEF

    # grouping bits into bytes in a binary literal
    flags = 0b_0011__0100_1110

Specification
=============

The current proposal is to allow underscores anywhere in numeric literals, with these exceptions:

* Leading underscores cannot be allowed, since they already introduce identifiers.
* Trailing underscores are not allowed, because they look confusing and don't contribute much to readability.
* The number base prefixes ``0x``, ``0o``, and ``0b`` cannot be split up, because they are fixed strings and not logically part of the number.
* No underscore allowed after a sign in an exponent (``1e-_5``), because underscores can also not be used after the signs in front of the number (``-1e5``).
* No underscore allowed after a decimal point, because this leads to ambiguity with attribute access (the lexer cannot know that there is no number literal in ``foo._5``).

There appears to be no reason to restrict the use of underscores otherwise.
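As a rough sketch of how the exceptions above play out for integer literals, the draft rules can be written as a Python regular expression. The names ``DRAFT_INTEGER`` and ``is_draft_integer`` are made up for this illustration; they are not part of the proposal or of any patch, and the pattern covers integers only, not floats.

```python
import re

# Integer literals under the draft's liberal rule: underscores may appear
# anywhere except at the start, at the end, or inside a base prefix.
DRAFT_INTEGER = re.compile(
    r"""
    (?: [1-9] (?: [0-9_]* [0-9] )?             # decimal starting with a nonzero digit
      | 0 (?: [0_]* 0 )?                       # decimal zero, possibly repeated
      | 0 [oO] [0-7_]* [0-7]                   # octal
      | 0 [xX] [0-9a-fA-F_]* [0-9a-fA-F]       # hexadecimal
      | 0 [bB] [01_]* [01]                     # binary
    )
    """,
    re.VERBOSE,
)


def is_draft_integer(text: str) -> bool:
    """Return True if text is an integer literal under the draft rules."""
    return DRAFT_INTEGER.fullmatch(text) is not None


for candidate in ("10_000_000", "0xDEAD_BEEF", "0b_0011__0100_1110", "1_", "_1", "0_x10"):
    print(candidate, is_draft_integer(candidate))
```

The last three candidates fail for the reasons given in the list: a trailing underscore, a leading underscore (which makes the token an identifier), and a split base prefix.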
The production list for integer literals would therefore look like this:: integer: decimalinteger | octinteger | hexinteger | bininteger decimalinteger: nonzerodigit [decimalrest] | "0" [("0" | "_")* "0"] nonzerodigit: "1"..."9" decimalrest: (digit | "_")* digit digit: "0"..."9" octinteger: "0" ("o" | "O") (octdigit | "_")* octdigit hexinteger: "0" ("x" | "X") (hexdigit | "_")* hexdigit bininteger: "0" ("b" | "B") (bindigit | "_")* bindigit octdigit: "0"..."7" hexdigit: digit | "a"..."f" | "A"..."F" bindigit: "0" | "1" For floating-point literals:: floatnumber: pointfloat | exponentfloat pointfloat: [intpart] fraction | intpart "." exponentfloat: (intpart | pointfloat) exponent intpart: digit (digit | "_")* fraction: "." intpart exponent: ("e" | "E") "_"* ["+" | "-"] digit [decimalrest] Alternative Syntax == Underscore Placement Rules -- Instead of the liberal rule specified above, the use of underscores could be limited. Common rules are (see the "other languages" section): * Only one consecutive underscore allowed, and only between digits. * Multiple consecutive underscore allowed, but only between digits. Different Separators A proposed alternate syntax was to use whitespace for grouping. Although strings are a precedent for combining adjoining literals, the behavior can lead to unexpected effects which are not possible with underscores. Also, no other language is known to use this rule, except for languages that generally disregard any whitespace. C++14 introduces apostrophes for grouping, which is not considered due to the conflict with Python's string literals. [1]_ Behavior in Other Languages === Those languages that do allow underscore grouping implement a large variety of rules for allowed placement of underscores. This is a listing placing the known rules into three major groups. In cases where the language spec contradicts the actual behavior, the actual behavior is listed. 
**Group 1: liberal (like this PEP)** * D [2]_ * Perl 5 (although docs say it's more restricted) [3]_ * Rust [4]_ * Swift (although textual description says "between digits") [5]_ **Group 2: only between digits, multiple consecutive underscores** * C# (open proposal for 7.0) [6]_ * Java [7]_ **Group 3: only between digits, only one underscore** * Ada [8]_ * Julia (but not in the exponent part of floats) [9]_ * Ruby (docs say "anywhere", in reality only between digits) [10]_ Implementation == A preliminary patch that implements the specification given above has been posted to the issue tracker. [11]_ References == .. [1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3499.html .. [2] http://dlang.org/spec/lex.html#integerliteral .. [3] http://perldoc.perl.org/perldata.html#Scalar-value-constructors .. [4] http://doc.rust-lang.org/reference.html#number-literals .. [5] https://developer.apple.com/library/ios/documentation/Swift/Concep
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Wed, 10 Feb 2016 at 14:21 Georg Brandl wrote:

> This came up in python-ideas, and has met mostly positive comments,
> although the exact syntax rules are up for discussion.
>
> [snip]
>
> Examples::
>
>     # grouping decimal numbers by thousands
>     amount = 10_000_000.0
>
>     # grouping hexadecimal addresses by words
>     addr = 0xDEAD_BEEF
>
>     # grouping bits into bytes in a binary literal
>     flags = 0b_0011__0100_1110

I assume all of these examples are possible in either the liberal or restrictive approaches?

> [snip]
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 2/10/2016 2:20 PM, Georg Brandl wrote:

> This came up in python-ideas, and has met mostly positive comments,
> although the exact syntax rules are up for discussion.
>
> [snip]
>
> Examples::
>
>     # grouping decimal numbers by thousands
>     amount = 10_000_000.0
>
>     # grouping hexadecimal addresses by words
>     addr = 0xDEAD_BEEF
>
>     # grouping bits into bytes in a binary literal
>     flags = 0b_0011__0100_1110

+1

You don't mention potential restrictions that decimal numbers should permit them only every three places, or hex ones only every 2 or 4, and your binary example mentions grouping into bytes, but actually groups into nybbles.

But such restrictions would be annoying: if it is useful to the coder to use them, that is fine. But different situations may find other placements more useful... particularly in binary, as it might want to match widths of various bitfields.

Adding that as a rejected consideration, with justifications, would be helpful.
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 10 February 2016 at 22:20, Georg Brandl wrote: > This came up in python-ideas, and has met mostly positive comments, > although the exact syntax rules are up for discussion. +1 on the PEP. Is there any value in allowing underscores in strings passed to the Decimal constructor as well? The same sorts of justifications would seem to apply. It's perfectly arguable that the change for Decimal would be so rarely used as to not be worth it, though, so I don't mind either way in practice. Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] why we have both re.match and re.string?
Hi,

On 10/02/2016 22:59, Luca Sangiacomo wrote:

> Hi,
> I hope the question is not too silly, but I would like to understand the advantages of having both re.match() and re.search(). Wouldn't it be clearer to have just one function with one additional parameter, like this:
>
> re.search(regexp, text, from_beginning=True|False) ?

Actually you can just do

    re.search("^" + regexp, text)

But with match you express the intent to match the text with something, while with search, you express that you look for something in the text. Maybe that was the idea?

> In this way we prevent, as written in the documentation, people writing ".*" in front of the regexp used with re.match()
>
> Thanks.
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
It looks like the implementation https://bugs.python.org/issue26331 only changes the Python parser. What about other functions converting strings to numbers at runtime like int(str) and float(str)? Paul also asked for Decimal(str). Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 2016-02-10 22:35, Brett Cannon wrote: [snip] Examples:: # grouping decimal numbers by thousands amount = 10_000_000.0 # grouping hexadecimal addresses by words addr = 0xDEAD_BEEF # grouping bits into bytes in a binary literal flags = 0b_0011__0100_1110 I assume all of these examples are possible in either the liberal or restrictive approaches? [snip] Strictly speaking, "0b_0011__0100_1110" wouldn't be valid if an underscore was allowed only between digits because the "b" isn't a digit. Similarly, "0x_FF_FF" wouldn't be valid, but "0xFF_FF" would. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Wed, Feb 10, 2016 at 2:30 PM, Andrew Barnert via Python-Dev wrote: > [^3]: Say you write a program that assumes it will only be run on Shift-JIS > systems, and you use > CreateFileA to create a file named "ハローワールド". The actual bytes you're sending > are cp436 > for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode, > "ânâìü[âÅü[âïâh". Unless the system default was changed or the program called SetFileApisToOEM, CreateFileA would decode using the ANSI codepage 1252, not the OEM codepage 437 (not 436), i.e. "ƒnƒ\x8d\x81[ƒ\x8f\x81[ƒ‹ƒh". Otherwise the example is right. But the transcoding strategy won't work in general. For example, if the tables are turned such that the ANSI codepage is 932 and the program passes a bytes name from codepage 1252, the user on the other end won't be able to transcode without error if the original bytes contained invalid DBCS sequences that were mapped to the default character, U+30FB. This transcodes as the meaningless string "\x81E". The user can replace that string with "--" and enjoy a nice game of hang man. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] why we have both re.match and re.string?
On Wed, Feb 10, 2016 at 10:59:18PM +0100, Luca Sangiacomo wrote:

> Hi,
> I hope the question is not too silly, but I would like to understand
> the advantages of having both re.match() and re.search(). Wouldn't it be
> clearer to have just one function with one additional parameter, like
> this:
>
> re.search(regexp, text, from_beginning=True|False) ?

I guess the most important reason now is backwards compatibility. The oldest Python I have installed here is version 1.5, and it has the brand new "re" module (intended as a replacement for the old "regex" module). Both have search() and match() top-level functions. So my guess is that you would have to track down the author of the original "regex" module.

But a more general answer is the principle, "Functions shouldn't take constant bool arguments". It is an API design principle which (if I remember correctly) Guido has stated a number of times. Functions should not take a boolean argument which (1) exists only to select between two different modes and (2) is nearly always given as a constant.

Do you ever find yourself writing code like this?

    if some_calculation():
        result = re.match(regex, string)
    else:
        result = re.search(regex, string)

If you do, that would be a hint that perhaps match() and search() should be combined so you can write:

    result = re.search(regex, string, some_calculation())

But I expect that you almost never do. I would expect that if we combined the two functions into one, we would nearly always call them with a constant bool:

    # I always forget whether True means match from the start or not,
    # and which is the default...
    result = re.search(regex, string, False)

which suggests that search() is actually two different functions, and should be split into two, just as we have now.

It's a general principle, not a law of nature, so you may find exceptions in the standard library. But if I were designing the re module from scratch, I would either keep the two distinct functions, or just provide search() and let users use ^ to anchor the search to the beginning.

> In this way we prevent, as written in the documentation, people writing
> ".*" in front of the regexp used with re.match()

I only see one example that does that:

https://docs.python.org/3/library/re.html#checking-for-a-pair

Perhaps it should be changed.

--
Steve
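As a small illustration of the difference discussed above, the sketch below compares re.match() with re.search() and an explicit ^ anchor. The sample text and pattern are made up for the example. It also shows the one documented case, re.MULTILINE, where anchoring search() with ^ is not the same as calling match().

```python
import re

text = "spam and eggs"    # made-up sample text
pattern = "eggs"          # made-up pattern

# match() only tries at the very start of the string; search() scans forward.
at_start = re.match(pattern, text)
anywhere = re.search(pattern, text)

# Anchoring search() with ^ behaves like match() for this input.
anchored = re.search("^" + pattern, text)

# With re.MULTILINE, ^ also matches right after each newline, so the two
# spellings differ: match() still only looks at the start of the string.
lines = "spam\neggs"
multiline_search = re.search("^" + pattern, lines, re.MULTILINE)
multiline_match = re.match(pattern, lines, re.MULTILINE)

print(at_start, anywhere.start(), anchored)
print(multiline_search.start(), multiline_match)
```

So "just use ^" covers the common case, but the two functions are not interchangeable once flags come into play.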
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Wed, Feb 10, 2016 at 10:53:09PM +, Paul Moore wrote:

> On 10 February 2016 at 22:20, Georg Brandl wrote:
> > This came up in python-ideas, and has met mostly positive comments,
> > although the exact syntax rules are up for discussion.
>
> +1 on the PEP. Is there any value in allowing underscores in strings
> passed to the Decimal constructor as well? The same sorts of
> justifications would seem to apply. It's perfectly arguable that the
> change for Decimal would be so rarely used as to not be worth it,
> though, so I don't mind either way in practice.

Let's delay making any change to string conversions for now, and that includes Decimal. We can also do this:

    Decimal("123_456_789.0_12345_67890".replace("_", ""))

for those who absolutely must include underscores in their numeric strings. The big win is for numeric literals, not numeric string conversions.

--
Steve
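The workaround above can be wrapped in a helper, sketched below. The name decimal_from_grouped is hypothetical and not part of the decimal module; the helper simply removes every underscore before handing the string to Decimal, so it performs none of the placement checks being debated for literals.

```python
from decimal import Decimal

def decimal_from_grouped(text):
    """Hypothetical helper: drop every underscore, then let Decimal parse the rest."""
    return Decimal(text.replace("_", ""))

value = decimal_from_grouped("123_456_789.0_12345_67890")
print(value)

# Caveat: the helper does not check where the underscores sit,
# so a string such as "_1_" is accepted as well.
sloppy = decimal_from_grouped("_1_")
print(sloppy)
```

Whether that leniency is acceptable is exactly the kind of question that would need answering before string conversions are changed.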
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Feb 10, 2016, at 14:20, Georg Brandl wrote: First, general questions: should the PEP mention the Decimal constructor? What about int and float (I'd assume int(s) continues to work as always, while int(s, 0) gets the new behavior, but if that isn't obviously true, it may be worth saying explicitly). > * Trailing underscores are not allowed, because they look confusing and don't > contribute much to readability. Why is "123_456_" so ugly that we have to catch it, when "1___2_345__6" is just fine, or "123e__+456"? More to the point, if we really need an extra rule, and more complicated BNF, to outlaw this case, I don't think we want a liberal design at all. Also, notice that Swift, Rust, and D all show examples with trailing underscores in their references, and they don't look particularly out of place with the other examples. > There appears to be no reason to restrict the use of underscores otherwise. What other restrictions are there? I think the only place you've left that's not between digits is between the e and the sign. A dead-simple rule like Swift's seems better than five separate rules that I have to learn and remember that make lexing more complicated and that ultimately amount to the conservative rule plus one other place I can put underscores where I'd never want to. > **Group 1: liberal (like this PEP)** > > * D [2]_ > * Perl 5 (although docs say it's more restricted) [3]_ > * Rust [4]_ > * Swift (although textual description says "between digits") [5]_ I don't think any of these are liberal like this PEP. For example, Swift's actual grammar rule allows underscores anywhere but leading in the "digits" part of int literals and all three potential digit parts of float literals. That's the whole rule. 
It's more conservative than this PEP in not allowing them outside of digit parts (like between E and +), more liberal in allowing them to be trailing, but I'm pretty sure the reason behind the design wasn't specifically about how liberal or conservative they wanted to be, but about being as simple as possible. Rust's rule seems to be equivalent to Swift's, except that they forgot to define exponents anywhere. I don't think either of them was trying to be more liberal or more conservative; rather, they were both trying to be as simple as possible. D does go out of its way to be as liberal as possible, e.g., allowing things like "0x_1_" that the others wouldn't (they'd treat the "_1_" as a digit part, which can't have leading underscores), but it's also more conservative than this spec in not allowing underscores between e and the sign. I think Perl is the only language that allows them anywhere but in the digits part. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?
On Feb 10, 2016, at 15:11, eryk sun wrote:

> On Wed, Feb 10, 2016 at 2:30 PM, Andrew Barnert via Python-Dev wrote:
>> [^3]: Say you write a program that assumes it will only be run on Shift-JIS
>> systems, and you use
>> CreateFileA to create a file named "ハローワールド". The actual bytes you're
>> sending are cp436
>> for "ânâìü[âÅü[âïâh", so the file on the CD is named, in Unicode,
>> "ânâìü[âÅü[âïâh".
>
> Unless the system default was changed or the program called
> SetFileApisToOEM, CreateFileA would decode using the ANSI codepage
> 1252, not the OEM codepage 437 (not 436), i.e.
> "ƒnƒ\x8d\x81[ƒ\x8f\x81[ƒ‹ƒh". Otherwise the example is right. But the
> transcoding strategy won't work in general. For example, if the tables
> are turned such that the ANSI codepage is 932 and the program passes a
> bytes name from codepage 1252, the user on the other end won't be able
> to transcode without error if the original bytes contained invalid
> DBCS sequences that were mapped to the default character, U+30FB.
> This
> transcodes as the meaningless string "\x81E". The user can replace
> that string with "--" and enjoy a nice game of hang man.

Of course there's no way to recover the actual intended filenames if that information was thrown out instead of being stored, but that's no different from the situation where the user mashed the keyboard instead of typing what they intended.

The point remains: the Mac strategy (which is also the linux strategy for filesystems that are inherently UTF-16) always generates valid UTF-8, and doesn't try to magically cure mojibake but doesn't get in the way of the user manually curing it.

When the Unicode encoding is lossy, of course the user can't cure that, but UTF-8 isn't making it any harder.
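A minimal sketch of the "manual cure" for the lossless case, using Python's own codecs rather than any Windows API. It assumes the bytes were misread as codepage 437, as in the quoted example; "shift_jis" and "cp437" are the Python codec names. The round trip works here because cp437 assigns a character to every byte value; with a lossy decode, such as the default-character substitution described in the quoted reply, the last step cannot restore the name.

```python
name = "ハローワールド"

# The program hands Shift-JIS bytes to an API that assumes another codepage.
raw = name.encode("shift_jis")
mojibake = raw.decode("cp437")

# Undo it by hand: re-encode with the wrong codepage, decode with the right one.
recovered = mojibake.encode("cp437").decode("shift_jis")

print(mojibake)
print(recovered)
```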
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Wed, Feb 10, 2016 at 11:20:38PM +0100, Georg Brandl wrote: > This came up in python-ideas, and has met mostly positive comments, > although the exact syntax rules are up for discussion. Nicely done. But I would change the restrictions to a simpler version. Instead of five rules to learn: > The current proposal is to allow underscores anywhere in numeric literals, > with > these exceptions: > > * Leading underscores cannot be allowed, since they already introduce > identifiers. > * Trailing underscores are not allowed, because they look confusing and don't > contribute much to readability. > * The number base prefixes ``0x``, ``0o``, and ``0b`` cannot be split up, > because they are fixed strings and not logically part of the number. > * No underscore allowed after a sign in an exponent (``1e-_5``), because > underscores can also not be used after the signs in front of the number > (``-1e5``). > * No underscore allowed after a decimal point, because this leads to ambiguity > with attribute access (the lexer cannot know that there is no number literal > in ``foo._5``). change to a single rule "one or more underscores may appear between two (hex)digits, but otherwise nowhere else". That's much simpler to understand than a series of restrictions as given above. That would be your second restrictive rule: "Multiple consecutive underscore allowed, but only between digits." That forbids leading and trailing underscores, underscores inside or immediately after the leading number base (since x, o and b aren't digits), and immediately before or after the sign, decimal point or e|E exponent symbol. > There appears to be no reason to restrict the use of underscores otherwise. I don't like underscores immediately before the . or e|E in floats either: 123_.000_456 The dot is already visually distinctive enough, as is the e|E, and placing an underscore immediately before them doesn't aid in grouping the digits. 
> Instead of the liberal rule specified above, the use of underscores could be > limited. Common rules are (see the "other languages" section): > > * Only one consecutive underscore allowed, and only between digits. > * Multiple consecutive underscore allowed, but only between digits. I don't think there is any need to restrict it to only a single underscore. There are uses for more than one: Fraction(3__141_592_654, 1_000_000_000) hints that the 3 is somewhat special (for obvious reasons). -- Steve ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
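The single rule suggested above ("one or more underscores may appear between two (hex)digits, but otherwise nowhere else") can be sketched as a small checker. This is only an illustration under stated assumptions: the function name is hypothetical, it looks at underscore placement only and does not validate the rest of the literal, and it treats the characters after a 0x/0o/0b prefix as the digits of that base.

```python
import re
import string

def underscores_only_between_digits(literal):
    """Hypothetical check: every run of underscores sits between two digits."""
    prefix = literal[:2].lower()
    if prefix == "0x":
        digits, body = string.hexdigits, literal[2:]
    elif prefix == "0o":
        digits, body = string.octdigits, literal[2:]
    elif prefix == "0b":
        digits, body = "01", literal[2:]
    else:
        digits, body = string.digits, literal

    for run in re.finditer(r"_+", body):
        before = body[run.start() - 1] if run.start() > 0 else ""
        after = body[run.end()] if run.end() < len(body) else ""
        # A run at either end of the body has no digit on that side.
        if before == "" or after == "":
            return False
        if before not in digits or after not in digits:
            return False
    return True

samples = ["10_000_000.0", "0xDEAD_BEEF", "3__141_592_654",
           "0b_0011__0100_1110", "0x_FF_FF", "123_.000_456",
           "1.234_e-89", "123_456_", "123e__+456"]
for literal in samples:
    print(literal, underscores_only_between_digits(literal))
```

Under this reading the first three samples pass and the rest fail, including the PEP's own binary example, because the "b" of the prefix is not a digit.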
[Python-Dev] Time for a change of random number generator?
The Mersenne Twister is no longer regarded as quite state-of-the-art because it can get into states that produce long sequences that are not very random. There is a variation on MT called WELL that has better properties in this regard. Does anyone think it would be a good idea to replace MT with WELL as Python's default rng?

https://en.wikipedia.org/wiki/Well_equidistributed_long-period_linear

--
Greg
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 02/10/2016 04:04 PM, Steven D'Aprano wrote: > change to a single rule "one or more underscores may appear between > two (hex)digits, but otherwise nowhere else". That's much simpler to > understand than a series of restrictions as given above. I like the simpler rule, but I would also allow for an underscore between the base and the first digit: 0x_1ef9_ab22 is easier (at least, for me ;) to parse than 0x1ef9_ab22 However, since Georg is doing the work, I'm not going to argue too hard. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Wed, Feb 10, 2016 at 03:45:48PM -0800, Andrew Barnert via Python-Dev wrote: > On Feb 10, 2016, at 14:20, Georg Brandl wrote: > > First, general questions: should the PEP mention the Decimal constructor? > What about int and float (I'd assume int(s) continues to work as always, > while int(s, 0) gets the new behavior, but if that isn't obviously true, it > may be worth saying explicitly). > > > * Trailing underscores are not allowed, because they look confusing and > > don't > > contribute much to readability. > > Why is "123_456_" so ugly that we have to catch it, when > "1___2_345__6" is just fine, It's not just fine, it's ugly as sin, but it shouldn't be a matter for the parser to decide a style-issue. Just as we allow people to write ugly tuples: t = ( 1, 2,3 ,4, 5, ) so we should allow people to write ugly ints rather than try to enforce good taste in the parser. There are uses for allowing multiple underscores, and odd groupings, so rather than a blanket ban, we trust that people won't do stupid things. > or "123e__+456"? That I would prohibit. I think that the decimal point and exponent sign provide sufficient visual distinctiveness that putting underscores around them doesn't gain you anything. In some cases it looks like you might have missed a group of digits: 1.234_e-89 hints that perhaps there ought to be more digits after the 4. I'd be okay with a rule "no underscores in the exponent at all", but I don't particularly see the need for it since that's pretty much covered by the style guide saying "don't use underscores unnecessarily". For floats, exponents have a practical limitation of three digits, so there's not much need for grouping them. +1 on allowing underscores between digits +0 on prohibiting underscores in the exponent > More to the point, > if we really need an extra rule, and more complicated BNF, to outlaw > this case, I don't think we want a liberal design at all. 
I think "underscores can occur between any two digits" is pretty liberal, since it allows multiple underscores, and allows grouping in any size group (including mixed sizes, and stupid sizes like 1). To me, the opposite of a liberal rule is something like "underscores may only occur between groups of three digits". > Also, notice that Swift, Rust, and D all show examples with trailing > underscores in their references, and they don't look particularly out > of place with the other examples. That's a matter of opinion. -- Steve ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
I have occasionally wondered about this missing feature. On 10 February 2016 at 22:20, Georg Brandl wrote: > Abstract and Rationale > == > > This PEP proposes to extend Python's syntax so that underscores can be used in > integral and floating-point number literals. This should extend complex or imaginary literals like 10_000j for consistency. > Specification > = > > * Trailing underscores are not allowed, because they look confusing and don't > contribute much to readability. > * No underscore allowed after a sign in an exponent (``1e-_5``), because > underscores can also not be used after the signs in front of the number > (``-1e5``). > [. . .] > > The production list for integer literals would therefore look like this:: > >integer: decimalinteger | octinteger | hexinteger | bininteger >decimalinteger: nonzerodigit [decimalrest] | "0" [("0" | "_")* "0"] >nonzerodigit: "1"..."9" >decimalrest: (digit | "_")* digit >digit: "0"..."9" >octinteger: "0" ("o" | "O") (octdigit | "_")* octdigit >hexinteger: "0" ("x" | "X") (hexdigit | "_")* hexdigit >bininteger: "0" ("b" | "B") (bindigit | "_")* bindigit >octdigit: "0"..."7" >hexdigit: digit | "a"..."f" | "A"..."F" >bindigit: "0" | "1" > > For floating-point literals:: > >floatnumber: pointfloat | exponentfloat >pointfloat: [intpart] fraction | intpart "." >exponentfloat: (intpart | pointfloat) exponent >intpart: digit (digit | "_")* This allows trailing underscores such as 1_.2, 1.2_, 1.2_e-5. Your bullet point above suggests at least some of these are not desired. >fraction: "." intpart >exponent: ("e" | "E") "_"* ["+" | "-"] digit [decimalrest] This allows underscores in the exponent (1e-5_0), contradicting the other bullet point. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
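The observations above can be checked mechanically by transcribing the quoted float productions into a regular expression. The sketch below is such a transcription of the draft grammar as posted, not of any implementation; the constant and function names are made up for the example.

```python
import re

# Direct transcription of the float productions quoted above.
DIGIT = r"[0-9]"
INTPART = DIGIT + r"[0-9_]*"                  # digit (digit | "_")*
DECIMALREST = r"[0-9_]*" + DIGIT              # (digit | "_")* digit
FRACTION = r"\." + INTPART                    # "." intpart
EXPONENT = r"[eE]_*[+-]?" + DIGIT + "(?:" + DECIMALREST + ")?"
POINTFLOAT = "(?:(?:" + INTPART + ")?" + FRACTION + "|" + INTPART + r"\.)"
EXPONENTFLOAT = "(?:" + INTPART + "|" + POINTFLOAT + ")" + EXPONENT
FLOATNUMBER = re.compile("(?:" + EXPONENTFLOAT + "|" + POINTFLOAT + ")")

def accepted_by_draft_grammar(literal):
    """Return True if the whole string is a floatnumber under the draft rules."""
    return FLOATNUMBER.fullmatch(literal) is not None

for literal in ["1_.2", "1.2_", "1.2_e-5", "1e-5_0", "1e_-5", "1e-_5"]:
    print(literal, accepted_by_draft_grammar(literal))
```

Only the last sample, with the underscore directly after the sign, is rejected; the trailing-underscore forms and the underscore inside the exponent digits all match the productions as written.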
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On Feb 10, 2016, at 16:21, Steven D'Aprano wrote:
>
>> On Wed, Feb 10, 2016 at 03:45:48PM -0800, Andrew Barnert via Python-Dev
>> wrote:
>> On Feb 10, 2016, at 14:20, Georg Brandl wrote:
>>
>> First, general questions: should the PEP mention the Decimal constructor?
>> What about int and float (I'd assume int(s) continues to work as always,
>> while int(s, 0) gets the new behavior, but if that isn't obviously true, it
>> may be worth saying explicitly).
>>
>>> * Trailing underscores are not allowed, because they look confusing and
>>> don't contribute much to readability.
>>
>> Why is "123_456_" so ugly that we have to catch it, when
>> "1___2_345__6" is just fine,
>
> It's not just fine, it's ugly as sin, but it shouldn't be a matter for
> the parser to decide a style-issue.

Exactly. So why should it be any more of a matter for the parser to decide
that "123_456_" is illegal? Leave that in the style guide, and keep the
parser, and the reference documentation, as simple as possible.

>> or "123e__+456"?
>
> That I would prohibit.

The PEP allows that. The simpler rule used by Swift and Rust prohibits it.

>> More to the point,
>> if we really need an extra rule, and more complicated BNF, to outlaw
>> this case, I don't think we want a liberal design at all.
>
> I think "underscores can occur between any two digits" is pretty
> liberal, since it allows multiple underscores, and allows grouping in
> any size group (including mixed sizes, and stupid sizes like 1).

The PEP calls that a type-2 conservative proposal, and uses "liberal" to
mean that underscores can appear in places that aren't between digits. I
don't think we want that liberalism, especially if it requires 5 rules
instead of 1 to get it right.

Again, Swift and Rust only allow underscores in the digit part of integers,
and the up to three digit parts of floats, and the only rule they impose is
no leading underscore. (In some cases they lead to ambiguity, in others they
don't, but it's easier to just always ban them.) I don't see anything wrong
with that rule. The fact that it doesn't allow "1.2e_+3" seems fine. The fact
that it doesn't prevent "123_" seems fine also. It's not about being as
liberal as possible, or as restrictive as possible, because those edge cases
just don't matter, so being as simple as possible seems like an obvious win.

>> Also, notice that Swift, Rust, and D all show examples with trailing
>> underscores in their references, and they don't look particularly out
>> of place with the other examples.
>
> That's a matter of opinion.

Sure, but it's apparently the opinion of the people who designed and/or
documented this feature in three out of the four languages I looked at (aka
every language but Perl), not mine.

And honestly, are you really claiming that in your opinion, "123_456_" is
worse than all of their other examples, like "1_23__4"? They're both
presented as something the syntax allows, and neither one looks like
something I'd ever want to write, much less promote in a style guide or
something, but neither one screams out as something that's so heinous we need
to complicate the language to ensure it raises a SyntaxError. Yes, that's my
opinion, but do you really have a different opinion about any part of that?
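The single rule argued for above (underscores anywhere in a digit part except at its start) fits in one pattern. This is only a model of the rule as worded in this message, restricted to decimal literals; it is not taken from the Swift or Rust grammars, and `DIGITS`, `SIMPLE` and `simple_rule_allows` are made-up names.

```python
import re

# A digit part: starts with a digit, then any mix of digits and underscores.
DIGITS = r"[0-9][0-9_]*"

# Integer part, optional fraction, optional exponent: up to three digit parts.
# Underscores are legal nowhere else, so "e_" and "+_" never match.
SIMPLE = re.compile(rf"{DIGITS}(?:\.{DIGITS})?(?:[eE][+-]?{DIGITS})?")


def simple_rule_allows(literal: str) -> bool:
    """Return True if the one-rule model accepts the decimal literal."""
    return SIMPLE.fullmatch(literal) is not None


for literal in ("123_456_", "1___2_345__6", "1_23__4",
                "_123", "123e__+456", "1.2e_+3"):
    print(f"{literal!r}: {simple_rule_allows(literal)}")
```

The first three spellings are accepted and the last three rejected: ugly groupings are left to the style guide, and the only errors are a leading underscore or an underscore outside a digit part.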
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 02/11/2016 02:16 AM, Martin Panter wrote:
> I have occasionally wondered about this missing feature.
>
> On 10 February 2016 at 22:20, Georg Brandl wrote:
>> Abstract and Rationale
>> ======================
>>
>> This PEP proposes to extend Python's syntax so that underscores can be used
>> in integral and floating-point number literals.
>
> This should extend complex or imaginary literals like 10_000j for consistency.

Yes, that was always the case, but I guess it should be explicit.

>> Specification
>> =============
>>
>> * Trailing underscores are not allowed, because they look confusing and don't
>>   contribute much to readability.
>> * No underscore allowed after a sign in an exponent (``1e-_5``), because
>>   underscores can also not be used after the signs in front of the number
>>   (``-1e5``).
>> [. . .]
>>
>> The production list for integer literals would therefore look like this::
>>
>>    integer: decimalinteger | octinteger | hexinteger | bininteger
>>    decimalinteger: nonzerodigit [decimalrest] | "0" [("0" | "_")* "0"]
>>    nonzerodigit: "1"..."9"
>>    decimalrest: (digit | "_")* digit
>>    digit: "0"..."9"
>>    octinteger: "0" ("o" | "O") (octdigit | "_")* octdigit
>>    hexinteger: "0" ("x" | "X") (hexdigit | "_")* hexdigit
>>    bininteger: "0" ("b" | "B") (bindigit | "_")* bindigit
>>    octdigit: "0"..."7"
>>    hexdigit: digit | "a"..."f" | "A"..."F"
>>    bindigit: "0" | "1"
>>
>> For floating-point literals::
>>
>>    floatnumber: pointfloat | exponentfloat
>>    pointfloat: [intpart] fraction | intpart "."
>>    exponentfloat: (intpart | pointfloat) exponent
>>    intpart: digit (digit | "_")*
>
> This allows trailing underscores such as 1_.2, 1.2_, 1.2_e-5. Your
> bullet point above suggests at least some of these are not desired.

The middle one isn't, indeed. I updated the grammar accordingly.

>>    fraction: "." intpart
>>    exponent: ("e" | "E") "_"* ["+" | "-"] digit [decimalrest]
>
> This allows underscores in the exponent (1e-5_0), contradicting the
> other bullet point.

I clarified the bullet points. An "immediately" was missing.

Thanks for the feedback!
Georg
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 02/11/2016 12:45 AM, Andrew Barnert via Python-Dev wrote:
> On Feb 10, 2016, at 14:20, Georg Brandl wrote:
>
> First, general questions: should the PEP mention the Decimal constructor?
> What about int and float (I'd assume int(s) continues to work as always,
> while int(s, 0) gets the new behavior, but if that isn't obviously true, it
> may be worth saying explicitly).
>
>> * Trailing underscores are not allowed, because they look confusing and
>> don't contribute much to readability.
>
> Why is "123_456_" so ugly that we have to catch it, when "1___2_345__6"
> is just fine, or "123e__+456"? More to the point, if we really need an extra
> rule, and more complicated BNF, to outlaw this case, I don't think we want a
> liberal design at all.
>
> Also, notice that Swift, Rust, and D all show examples with trailing
> underscores in their references, and they don't look particularly out of
> place with the other examples.

That's a point. I'll look into the implementation.

>> There appears to be no reason to restrict the use of underscores
>> otherwise.
>
> What other restrictions are there? I think the only place you've left that's
> not between digits is between the e and the sign.

There are other places left:

* between 0x and the digits
* between the digits and "j"
* before and after the decimal point

> A dead-simple rule like
> Swift's seems better than five separate rules that I have to learn and
> remember that make lexing more complicated and that ultimately amount to the
> conservative rule plus one other place I can put underscores where I'd never
> want to.

Not quite, see above.

>> **Group 1: liberal (like this PEP)**
>>
>> * D [2]_
>> * Perl 5 (although docs say it's more restricted) [3]_
>> * Rust [4]_
>> * Swift (although textual description says "between digits") [5]_
>
> I don't think any of these are liberal like this PEP.
>
> For example, Swift's actual grammar rule allows underscores anywhere but
> leading in the "digits" part of int literals and all three potential digit
> parts of float literals. That's the whole rule. It's more conservative than
> this PEP in not allowing them outside of digit parts (like between E and +),
> more liberal in allowing them to be trailing, but I'm pretty sure the reason
> behind the design wasn't specifically about how liberal or conservative they
> wanted to be, but about being as simple as possible. Rust's rule seems to be
> equivalent to Swift's, except that they forgot to define exponents anywhere.
> I don't think either of them was trying to be more liberal or more
> conservative; rather, they were both trying to be as simple as possible.

I actually modelled this PEP closely on Rust. It has restrictions as in this
PEP, except that trailing underscores are allowed, and that "1.0e_+5" is not
allowed (allowed by the PEP), and "1.0e+_5" is (not allowed by the PEP). I
don't think you can argue that it's simpler. (If the PEP and our lexical
reference were as loosely worded as Rust's, one could probably say it's
"simple", too.)

Also, both Swift and Rust don't have the baggage of allowing ".5" style
literals, which makes the grammar simpler in Swift's case.

> D does go out of its way to be as liberal as possible, e.g., allowing things
> like "0x_1_" that the others wouldn't (they'd treat the "_1_" as a digit
> part, which can't have leading underscores), but it's also more conservative
> than this spec in not allowing underscores between e and the sign.
>
> I think Perl is the only language that allows them anywhere but in the digits
> part.

Thanks for the feedback!
Georg
Re: [Python-Dev] PEP 515: Underscores in Numeric Literals
On 02/10/2016 11:35 PM, Brett Cannon wrote:
>> Examples::
>>
>>     # grouping decimal numbers by thousands
>>     amount = 10_000_000.0
>>
>>     # grouping hexadecimal addresses by words
>>     addr = 0xDEAD_BEEF
>>
>>     # grouping bits into bytes in a binary literal
>>     flags = 0b_0011__0100_1110
>>
>
> I assume all of these examples are possible in either the liberal or
> restrictive approaches?

The last one isn't for restrictive -- its first underscore isn't between
digits.

>>
>> Implementation
>> ==============
>>
>> A preliminary patch that implements the specification given above has
>> been posted to the issue tracker. [11]_
>>
>
> Is the implementation made easier or harder if we went with the Group 2 or 3
> approaches? Are there any reasonable examples that the Group 1 approach allows
> that Group 3 doesn't that people have used in other languages?

Group 3 is probably a little more work than group 2, since you have to make
sure only one consecutive underscore is present. I don't see a point to that.

> I'm +1 on the idea, but which approach I prefer is going to be partially
> dependent on the difficulty of implementing (else I say Group 3 to make it
> easier to explain the rules).

Based on the feedback so far, I have an easier rule in mind that I will base
the next PEP revision on. It's basically

"One or more underscores allowed anywhere after a digit or a base specifier."

This preserves my preferred non-restrictive cases (0b__, 1.5_j) and
disallows more controversial versions like "1.5e_+_2".

cheers,
Georg
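As a rough model of the rule just stated, every digit and every base specifier may be followed by any number of underscores, and underscores may appear nowhere else. The sketch below encodes that reading as a single pattern. It is an assumption-laden illustration of the sentence in this message, not the wording of any PEP revision and not the behavior of any Python release; `D`, `EXP`, `NUMBER` and `allowed` are hypothetical names, and details such as leading zeros are ignored.

```python
import re

D = r"(?:[0-9]_*)"            # one decimal digit plus any underscores after it
EXP = rf"(?:[eE][+-]?{D}+)"   # no underscore directly after "e" or the sign

NUMBER = re.compile(
    rf"""(?:
        0[xX]_*(?:[0-9a-fA-F]_*)+           # base specifier, then hex digits
      | 0[oO]_*(?:[0-7]_*)+                 # base specifier, then octal digits
      | 0[bB]_*(?:[01]_*)+                  # base specifier, then binary digits
      | (?:{D}+\.{D}*|\.{D}+){EXP}?[jJ]?    # point floats, optional imaginary suffix
      | {D}+{EXP}?[jJ]?                     # integers and exponent-only floats
    )""",
    re.VERBOSE,
)


def allowed(literal: str) -> bool:
    """Return True if the modelled rule accepts the literal."""
    return NUMBER.fullmatch(literal) is not None


for literal in ("10_000_000.0", "0xDEAD_BEEF", "0b_0011__0100_1110",
                "1.5_j", "1.5e_+_2", "1._5", "_1"):
    print(f"{literal!r}: {allowed(literal)}")
```

In this model the three examples quoted from the PEP and `1.5_j` are accepted, while `1.5e_+_2`, `1._5` and `_1` are rejected because each has an underscore that does not follow a digit or a base specifier. Trailing underscores such as `123_` are also accepted, since they follow a digit.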