Re: ipython2 does not work anymore
Cecil Westerhof writes: > ... >> If you do mean 'pathlib', it was introduced in Python 3.4. > > It is about python2. I remember having seen announcements for enhanced "path" modules on this list. Your previously posted traceback shows that the problem comes from the package "pickleshare", which expects a "path" module (likely one of those enhanced modules) and cannot find one. I would try to find out which "path" module "pickleshare" requires and then install it. -- https://mail.python.org/mailman/listinfo/python-list
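If it helps, the declared dependencies of the installed pickleshare can be inspected from Python itself; run this under the same python2 that ipython2 uses. (If I recall correctly, older pickleshare releases pulled in the external path.py distribution, but check what yours actually asks for.)

    # pkg_resources ships with setuptools and knows each installed package's metadata.
    import pkg_resources

    dist = pkg_resources.get_distribution("pickleshare")
    print(dist.version)
    print(dist.requires())    # the "path" module it wants should be listed here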
Re: python corrupted double-linked list error
Xristos Xristoou writes: > i am a python 2.7 and ubuntu 16.04 user. While analysing a problem upgrading to Ubuntu 16.04 (unrelated to Python) I found messages reporting a problem with Python and Ubuntu 16.04[.0] (leading to memory corruption - something you are seeing). The proposed solution was to upgrade to Ubuntu 16.04.1. -- https://mail.python.org/mailman/listinfo/python-list
RE: Must target be only one bit one such as 0001, 0010, 0100, 1000 In supervised neural learning f(w*p+b) with perceptron rule w = w + e for linear case?
Ho Yeung Lee wrote, on January 19, 2017 12:05 AM > > Must target be only one bit one such as 0001,0010,0100,1000 > In supervised neural learning f(w*p+b) with perceptron rule w > = w + e for linear case? > > with neural network design > > does it means that can not use two or more one as target such > as 0011,0110,1100,1010, 0111,1110,1101, etc when training weight? With all respect, this sounds like a question you should be asking your professor, or one of the teaching assistants (TAs) for the course, or maybe your local computer science tutoring volunteers, if none of the above are available. This list is for questions and discussion about python. Maybe you're trying to code this problem in python, or maybe you're planning to at some stage, but you need to understand the fundamentals of the problem first. It's possible someone on this list would recognize the problem you're having, but it's unlikely. Come back with a clearly stated python problem, and I'm sure lots of people would help you with it. Deborah -- https://mail.python.org/mailman/listinfo/python-list
Re: Using python to start programs after logging in
In <[email protected]> Cecil Westerhof writes: > > I think using your window manager's built-in facilities for starting > > programs would be better. Why are you using Python instead? > Because when you use the window managers builtin facilities then all > programs will be started on the same virtual desktop and I want to > start them on different ones. The window manager doesn't allow you to specify a target desktop? That seems like a pretty heinous feature omission. -- John Gordon A is for Amy, who fell down the stairs [email protected] B is for Basil, assaulted by bears -- Edward Gorey, "The Gashlycrumb Tinies" -- https://mail.python.org/mailman/listinfo/python-list
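One way to do it from Python is to lean on an external pager tool such as wmctrl; a rough sketch (program names and desktop numbers are made up, and this assumes wmctrl is installed and the window manager honours it):

    import subprocess
    import time

    # Hypothetical mapping: virtual desktop number -> commands to launch there.
    programs = {
        0: [["firefox"]],
        1: [["emacs"], ["xterm"]],
    }

    for desktop, commands in sorted(programs.items()):
        subprocess.check_call(["wmctrl", "-s", str(desktop)])   # switch to that desktop
        for command in commands:
            subprocess.Popen(command)                           # launch on it
        time.sleep(2)   # crude: give the windows time to map before moving on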
PEP 393 vs UTF-8 Everywhere
Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronacher and others on the matter. What I have not found is discussion of pros and cons of alternatives to the old narrow or wide implementation of Unicode strings. ISTM that most operations on strings are via iterators and thus agnostic to variable or fixed width encodings. How important is it to be able to get to part of a string with a simple index? Just because old skool strings could be treated as a sequence of characters, is that a reason to shoehorn the subtleties of Unicode into that model? -- Pete Forman -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronacher and others on the matter. What I have not found is discussion of pros and cons of alternatives to the old narrow or wide implementation of Unicode strings.

The PEP itself has the rationale for the problems with the narrow/wide idea, quoting from https://www.python.org/dev/peps/pep-0393/: There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).

Basically, narrow builds had very odd behavior with non-BMP characters, namely that indexing into the string could easily produce mojibake. Wide builds used quite a bit more memory, which generally translates to reduced performance.

> ISTM that most operations on strings are via iterators and thus agnostic to variable or fixed width encodings. How important is it to be able to get to part of a string with a simple index? Just because old skool strings could be treated as a sequence of characters, is that a reason to shoehorn the subtleties of Unicode into that model?

I think you are underestimating the indexing usages of strings. Every operation on a string using UTF8 that contains larger characters must be completed by starting at index 0 - you can never start anywhere else safely. rfind/rsplit/rindex/rstrip and the other related reverse functions would require walking the string from start to end, rather than short-circuiting by reading from right to left. With indexing becoming linear time, many simple algorithms need to be written with that in mind, to avoid n*n time. Such performance regressions can often go unnoticed by developers, who are likely to be testing with small data, and thus may cause (accidental) DOS attacks when used on real data. The exact same problems occur with the old narrow builds (UTF16; note that this was NOT implemented in those builds, however, which caused the mojibake problems) as well - only a UTF32 or PEP393 implementation can avoid those problems.

Note that from a user's perspective (including most developers, if not almost all), PEP393 strings can be treated as if they were UTF32, but with many of the benefits of UTF8. As far as I'm aware, it is only developers writing extension modules that need to care - and only then if they need maximum performance, and thus cannot convert every string they access to UTF32 or UTF8. -- Chris Kaynor -- https://mail.python.org/mailman/listinfo/python-list
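To make the indexing cost concrete, here is a minimal sketch of what "give me code point i" means on a raw UTF-8 buffer; it is not CPython code, and the helper name is made up. A fixed-width (PEP 393 style) representation can instead jump straight to offset i * width.

    def utf8_index(buf, i):
        """Return the i-th code point of a UTF-8 encoded bytes object by scanning."""
        count = -1
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:          # bytes matching 0b10xxxxxx are continuations
                count += 1
                if count == i:
                    end = pos + 1            # find the end of this code point
                    while end < len(buf) and buf[end] & 0xC0 == 0x80:
                        end += 1
                    return buf[pos:end].decode("utf-8")
        raise IndexError(i)

    s = "caf\u00e9 \U0001F40D"               # 'café' plus an astral character
    assert utf8_index(s.encode("utf-8"), 5) == s[5]   # same answer, very different cost

Run in a loop over every i, that scan from byte 0 is exactly what turns innocent-looking algorithms quadratic.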
Re: PEP 393 vs UTF-8 Everywhere
On 01/20/2017 03:06 PM, Chris Kaynor wrote: [...snip...] -- Chris Kaynor I was able to delete my response which was a wholly contained subset of this one. :) But I have one extra question. Is string indexing guaranteed to be constant-time for Python? I thought so, but I couldn't find it documented anywhere. (Not that I think it practically matters, since it couldn't really change, for all the reasons you mentioned.) I found this, which at least details (if not explicitly "guarantees") the complexity properties of other datatypes: https://wiki.python.org/moin/TimeComplexity Cheers, Thomas -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg wrote: > But I have one extra question. Is string indexing guaranteed to be > constant-time for python? I thought so, but I couldn't find it documented > anywhere. (Not that I think it practically matters, since it couldn't really > change if it weren't for all the reasons you mentioned.) I found this which > at details (if not explicitly "guarantees") the complexity properties of > other datatypes: > No, it isn't; this question came up in the context of MicroPython, which chose to go UTF-8 internally instead of PEP 393. But the considerations for uPy are different - it's not designed to handle gobs of data, so constant-time vs linear isn't going to have as much impact. But in normal work, it's important enough to have predictable string performance. You can't afford to deploy a web application, test it, and then have someone send a large amount of data at it, causing massive O(n^2) blowouts. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg wrote: > On 01/20/2017 03:06 PM, Chris Kaynor wrote: >> >> >> [...snip...] >> >> -- >> Chris Kaynor >> > > I was able to delete my response which was a wholly contained subset of this > one. :) > > > But I have one extra question. Is string indexing guaranteed to be > constant-time for python? I thought so, but I couldn't find it documented > anywhere. (Not that I think it practically matters, since it couldn't really > change if it weren't for all the reasons you mentioned.) I found this which > at details (if not explicitly "guarantees") the complexity properties of > other datatypes: > > https://wiki.python.org/moin/TimeComplexity As far as I'm aware, the language does not guarantee it. In fact, I believe it was decided that MicroPython could use UTF8 strings with linear indexing while still calling itself Python. This was very useful for MicroPython due to the platforms it supports (embedded), and needing to keep the memory footprint very small. I believe Guido (on Python-ideas) has stated that constant-time string indexing is a guarantee of CPython, however. The only reference I found in my (very quick) search is the Python-Dev thread at https://groups.google.com/forum/#!msg/dev-python/3lfXwljNLj8/XxO2s0TGYrYJ -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
On 2017-01-20 23:06, Chris Kaynor wrote:
> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronacher and others on the matter. What I have not found is discussion of pros and cons of alternatives to the old narrow or wide implementation of Unicode strings.
>
> The PEP itself has the rationale for the problems with the narrow/wide idea, quoting from https://www.python.org/dev/peps/pep-0393/: There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).
>
> Basically, narrow builds had very odd behavior with non-BMP characters, namely that indexing into the string could easily produce mojibake. Wide builds used quite a bit more memory, which generally translates to reduced performance.
>
>> ISTM that most operations on strings are via iterators and thus agnostic to variable or fixed width encodings. How important is it to be able to get to part of a string with a simple index? Just because old skool strings could be treated as a sequence of characters, is that a reason to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every operation on a string using UTF8 that contains larger characters must be completed by starting at index 0 - you can never start anywhere else safely. rfind/rsplit/rindex/rstrip and the other related reverse functions would require walking the string from start to end, rather than short-circuiting by reading from right to left. With indexing becoming linear time, many simple algorithms need to be written with that in mind, to avoid n*n time. Such performance regressions can often go unnoticed by developers, who are likely to be testing with small data, and thus may cause (accidental) DOS attacks when used on real data. The exact same problems occur with the old narrow builds (UTF16; note that this was NOT implemented in those builds, however, which caused the mojibake problems) as well - only a UTF32 or PEP393 implementation can avoid those problems.

You could implement rsplit and rstrip easily enough, but rfind and rindex return the index, so you'd need to scan the string to return that.

> Note that from a user's perspective (including most developers, if not almost all), PEP393 strings can be treated as if they were UTF32, but with many of the benefits of UTF8. As far as I'm aware, it is only developers writing extension modules that need to care - and only then if they need maximum performance, and thus cannot convert every string they access to UTF32 or UTF8.

As someone who has written an extension, I can tell you that I much prefer dealing with a fixed number of bytes per codepoint than a variable number of bytes per codepoint, especially as I'm also supporting earlier versions of Python where that was the case.
-- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
Chris Kaynor writes: > On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: >> Can anyone point me at a rationale for PEP 393 being incorporated in >> Python 3.3 over using UTF-8 as an internal string representation? >> I've found good articles by Nick Coghlan, Armin Ronacher and others >> on the matter. What I have not found is discussion of pros and cons >> of alternatives to the old narrow or wide implementation of Unicode >> strings. > > The PEP itself has the rational for the problems with the narrow/wide > idea, the quote from https://www.python.org/dev/peps/pep-0393/: There > are two classes of complaints about the current implementation of the > unicode type:on systems only supporting UTF-16, users complain that > non-BMP characters are not properly supported. On systems using UCS-4 > internally (and also sometimes on systems using UCS-2), there is a > complaint that Unicode strings take up too much memory - especially > compared to Python 2.x, where the same code would often use ASCII > strings (i.e. ASCII-encoded byte strings). With the proposed approach, > ASCII-only Unicode strings will again use only one byte per character; > while still allowing efficient indexing of strings containing non-BMP > characters (as strings containing them will use 4 bytes per > character). > > Basically, narrow builds had very odd behavior with non-BMP > characters, namely that indexing into the string could easily produce > mojibake. Wide builds used quite a bit more memory, which generally > translates to reduced performance. I'm taking as a given that the old way was often sub-optimal in many scenarios. My questions were about the alternatives, and why PEP 393 was chosen over other approaches. >> ISTM that most operations on strings are via iterators and thus >> agnostic to variable or fixed width encodings. How important is it to >> be able to get to part of a string with a simple index? Just because >> old skool strings could be treated as a sequence of characters, is >> that a reason to shoehorn the subtleties of Unicode into that model? > > I think you are underestimating the indexing usages of strings. Every > operation on a string using UTF8 that contains larger characters must > be completed by starting at index 0 - you can never start anywhere > else safely. rfind/rsplit/rindex/rstrip and the other related reverse > functions would require walking the string from start to end, rather > than short-circuiting by reading from right to left. With indexing > becoming linear time, many simple algorithms need to be written with > that in mind, to avoid n*n time. Such performance regressions can > often go unnoticed by developers, who are likely to be testing with > small data, and thus may cause (accidental) DOS attacks when used on > real data. The exact same problems occur with the old narrow builds > (UTF16; note that this was NOT implemented in those builds, however, > which caused the mojibake problems) as well - only a UTF32 or PEP393 > implementation can avoid those problems. I was asserting that most useful operations on strings start from index 0. The r* operations would not be slowed down that much as UTF-8 has the useful property that attempting to interpret from a byte that is not at the start of a sequence (in the sense of a code point rather than Python) is invalid and so quick to move over while working backwards from the end. The only significant use of an index dereference that I could come up with was the result of a find() or index(). 
I put out this public question so that I could be clued in to other uses. My personal experience is that in most cases where I might consider find(), I end up using re and using the match groups it returns, which hold copies of the (sub)strings that I want.

> Note that from a user (including most developers, if not almost all), PEP393 strings can be treated as if they were UTF32, but with many of the benefits of UTF8. As far as I'm aware, it is only developers writing extension modules that need to care - and only then if they need maximum performance, and thus cannot convert every string they access to UTF32 or UTF8.

PEP 393 already says that "the specification chooses UTF-8 as the recommended way of exposing strings to C code". -- Pete Forman -- https://mail.python.org/mailman/listinfo/python-list
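To illustrate the property being relied on here: UTF-8 continuation bytes all match the bit pattern 10xxxxxx, so a reverse scan can find where the last code point starts without decoding anything from the front. A small sketch (the helper is made up, not a proposed API):

    def last_codepoint(buf):
        """Return the final code point of a UTF-8 encoded bytes object."""
        pos = len(buf) - 1
        while pos > 0 and buf[pos] & 0xC0 == 0x80:   # skip continuation bytes
            pos -= 1
        return buf[pos:].decode("utf-8")

    assert last_codepoint("caf\u00e9 \U0001F40D".encode("utf-8")) == "\U0001F40D"

rstrip and rsplit can work the same way; as MRAB notes, rfind and rindex still have to report a code point index, which is where a scan creeps back in.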
Re: PEP 393 vs UTF-8 Everywhere
MRAB writes: > As someone who has written an extension, I can tell you that I much > prefer dealing with a fixed number of bytes per codepoint than a > variable number of bytes per codepoint, especially as I'm also > supporting earlier versions of Python where that was the case. At the risk of sounding harsh, if supporting variable bytes per codepoint is a pain you should roll with it for the greater good of supporting users. PEP 393 / Python 3.3 required extension writers to revisit their access to strings. My explicit question was about why PEP 393 was adopted to replace the deficient old implementations rather than another approach. The implicit question is whether a UTF-8 internal representation should replace that of PEP 393. -- Pete Forman -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: > I was asserting that most useful operations on strings start from index > 0. The r* operations would not be slowed down that much as UTF-8 has the > useful property that attempting to interpret from a byte that is not at > the start of a sequence (in the sense of a code point rather than > Python) is invalid and so quick to move over while working backwards > from the end. Let's take one very common example: decoding JSON. A ton of web servers out there will call json.loads() on user-supplied data. The bulk of the work is in the scanner, which steps through the string and does the actual parsing. That function is implemented in Python, so it's a good example. (There is a C accelerator, but we can ignore that and look at the pure Python one.) So, how could you implement this function? The current implementation maintains an index - an integer position through the string. It repeatedly requests the next character as string[idx], and can also slice the string (to check for keywords like "true") or use a regex (to check for numbers). Everything's clean, but it's lots of indexing. Alternatively, it could remove and discard characters as they're consumed. It would maintain a string that consists of all the unparsed characters. All indexing would be at or near zero, but after every tiny piece of parsing, the string would get sliced. With immutable UTF-8 strings, both of these would be O(n^2). Either indexing is linear, so parsing the tail of the string means scanning repeatedly; or slicing is linear, so parsing the head of the string means slicing all the rest away. The only way for it to be fast enough would be to have some sort of retainable string iterator, which means exposing an opaque "position marker" that serves no purpose other than parsing. Every string parse operation would have to be reimplemented this way, lest it perform abysmally on large strings. It'd mean some sort of magic "thing" that probably has a reference to the original string, so you don't get the progressive RAM refunds that slicing gives, and you'd still have to deal with lots of the other consequences. It's probably doable, but it would be a lot of pain. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
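To make the shape of that code concrete, here is a heavily stripped-down sketch in the same index-driven style (loosely modelled on the pure-Python scanner, not the actual json source):

    def skip_whitespace(s, idx):
        # The hot pattern: the parser leans on s[idx] being cheap.
        while idx < len(s) and s[idx] in " \t\n\r":
            idx += 1
        return idx

    def parse_keyword(s, idx):
        # Keywords are checked by slicing, much as the real scanner does.
        for word, value in (("true", True), ("false", False), ("null", None)):
            if s[idx:idx + len(word)] == word:
                return value, idx + len(word)
        raise ValueError("expected a JSON keyword at index %d" % idx)

    idx = skip_whitespace("   true", 0)
    value, idx = parse_keyword("   true", idx)
    assert value is True and idx == 7

With O(1) indexing and short slices both operations stay cheap wherever idx happens to be; with a byte-oriented UTF-8 string, either the indexing or the repeated slicing becomes linear in the distance from one end.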
Re: PEP 393 vs UTF-8 Everywhere
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman wrote: > MRAB writes: > >> As someone who has written an extension, I can tell you that I much >> prefer dealing with a fixed number of bytes per codepoint than a >> variable number of bytes per codepoint, especially as I'm also >> supporting earlier versions of Python where that was the case. > > At the risk of sounding harsh, if supporting variable bytes per > codepoint is a pain you should roll with it for the greater good of > supporting users. That hasn't been demonstrated, though. There's plenty of evidence regarding cache usage that shows that direct indexing is incredibly beneficial on large strings. What are the benefits of variable-sized encodings? AFAIK, the only real benefit is that you can use less memory for strings that contain predominantly ASCII but a small number of astral characters (plus *maybe* a faster encode-to-UTF-8; you wouldn't get a faster decode-from-UTF-8, because you still need to check that the byte sequence is valid). Can you show a use-case that would be materially improved by UTF-8? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
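For the memory side of the trade-off, a rough illustration (sizes are CPython-specific and vary by version and platform, so treat the numbers as indicative only):

    import sys

    ascii_s  = "a" * 1000                  # stored 1 byte per character under PEP 393
    bmp_s    = "\u00e9" * 1000             # 2 bytes per character
    astral_s = "\U0001F40D" * 1000         # 4 bytes per character
    mixed_s  = "a" * 999 + "\U0001F40D"    # one astral character forces 4 bytes/char

    for s in (ascii_s, bmp_s, astral_s, mixed_s):
        print(len(s), sys.getsizeof(s))

The mixed case is the one described above: UTF-8 would store it in roughly 1000 bytes, while PEP 393 pays 4 bytes for every character because of the single astral code point.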
Re: PEP 393 vs UTF-8 Everywhere
On 2017-01-21 00:51, Pete Forman wrote:
> MRAB writes:
>> As someone who has written an extension, I can tell you that I much prefer dealing with a fixed number of bytes per codepoint than a variable number of bytes per codepoint, especially as I'm also supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per codepoint is a pain you should roll with it for the greater good of supporting users.

Or I could decide not to bother and leave it to someone else to continue the project. After all, it's not like I'm getting paid for the work, it's purely voluntary.

> PEP 393 / Python 3.3 required extension writers to revisit their access to strings. My explicit question was about why PEP 393 was adopted to replace the deficient old implementations rather than another approach. The implicit question is whether a UTF-8 internal representation should replace that of PEP 393.

I already had to handle 1-byte bytestrings and 2/4-byte (narrow/wide) Unicode strings, so switching to 1/2/4 strings wasn't too bad. Switching to a completely different, variable-width system would've been a lot more work. -- https://mail.python.org/mailman/listinfo/python-list
Re: PEP 393 vs UTF-8 Everywhere
Chris Angelico writes:
> decoding JSON... the scanner, which steps through the string and does the actual parsing. ... The only way for it to be fast enough would be to have some sort of retainable string iterator, which means exposing an opaque "position marker" that serves no purpose other than parsing.

Python already has that type of iterator:

    x = "foo"
    for c in x:

> It'd mean some sort of magic "thing" that probably has a reference to the original string

It's a regular old string iterator unless I'm missing something. Of course a json parser should use it, though who uses the non-C json parser anyway these days?

[Chris Kaynor writes:]
> rfind/rsplit/rindex/rstrip and the other related reverse functions would require walking the string from start to end, rather than short-circuiting by reading from right to left.

UTF-8 can be read from right to left because you can recognize when a codepoint begins by looking at the top 2 bits of each byte as you scan backwards. Any combination except for 10 is a leading byte, and 10 is always a continuation byte. This "prefix property" of UTF8 is a design feature and not a trick someone noticed after the fact.

Also if you really want O(1) random access, you could put an auxiliary table into long strings, giving the byte offset of every 256th codepoint or something like that. Then you'd go to the nearest table entry and scan from there. This would usually be in-cache scanning so quite fast. Or use the related representation of "ropes" which are also very easy to concatenate if they can be nested. Erlang does something like that with what it calls "binaries". -- https://mail.python.org/mailman/listinfo/python-list
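The auxiliary-table idea is easy to sketch; this is toy code under the assumption of a 256-code-point chunk size, not a proposal for CPython's internals:

    K = 256   # record the byte offset of every K-th code point

    def build_index(buf, k=K):
        """Byte offsets of code points 0, k, 2k, ... in a UTF-8 buffer."""
        offsets, count = [], 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:          # start of a code point
                if count % k == 0:
                    offsets.append(pos)
                count += 1
        return offsets

    def codepoint_at(buf, table, i, k=K):
        pos = table[i // k]                  # jump to the nearest checkpoint
        remaining = i % k
        while remaining:                     # then scan forward a bounded amount
            pos += 1
            if buf[pos] & 0xC0 != 0x80:
                remaining -= 1
        end = pos + 1
        while end < len(buf) and buf[end] & 0xC0 == 0x80:
            end += 1
        return buf[pos:end].decode("utf-8")

    s = "\u03b1\u03b2\u03b3" * 1000          # Greek text, 2 bytes per code point
    buf = s.encode("utf-8")
    assert codepoint_at(buf, build_index(buf), 2500) == s[2500]

Lookup cost is then bounded by the chunk size rather than the string length, at the price of the extra table.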
Re: PEP 393 vs UTF-8 Everywhere
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin wrote: > Chris Angelico writes: >> decoding JSON... the scanner, which steps through the string and >> does the actual parsing. ... >> The only way for it to be fast enough would be to have some sort of >> retainable string iterator, which means exposing an opaque "position >> marker" that serves no purpose other than parsing. > > Python already has that type of iterator: >x = "foo" >for c in x: > >> It'd mean some sort of magic "thing" that probably has a reference to >> the original string > > It's a regular old string iterator unless I'm missing something. Of > course a json parser should use it, though who uses the non-C json > parser anyway these days? You can't do a look-ahead with a vanilla string iterator. That's necessary for a lot of parsers. > Also if you really want O(1) random access, you could put an auxiliary > table into long strings, giving the byte offset of every 256th codepoint > or something like that. Then you'd go to the nearest table entry and > scan from there. This would usually be in-cache scanning so quite fast. > Or use the related representation of "ropes" which are also very easy to > concatenate if they can be nested. Erlang does something like that > with what it calls "binaries". Yes, which gives a two-level indexing (first find the strand, then the character), and that's going to play pretty badly with CPU caches. I'd be curious to know how an alternate Python with that implementation would actually perform. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
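For the look-ahead point: a parser can wrap the plain iterator in something like the made-up helper below; nothing in the stdlib does this for you, though third-party packages offer equivalents.

    class Peekable:
        """Wrap an iterator so a parser can look at the next item without consuming it."""
        _SENTINEL = object()

        def __init__(self, iterable):
            self._it = iter(iterable)
            self._peeked = self._SENTINEL

        def peek(self, default=None):
            if self._peeked is self._SENTINEL:
                self._peeked = next(self._it, self._SENTINEL)
            return default if self._peeked is self._SENTINEL else self._peeked

        def __iter__(self):
            return self

        def __next__(self):
            if self._peeked is not self._SENTINEL:
                value, self._peeked = self._peeked, self._SENTINEL
                return value
            return next(self._it)

    it = Peekable("true")
    assert it.peek() == "t" and next(it) == "t" and next(it) == "r"

Note that it still only moves forward - there is no way to back up or to slice relative to the current position, which is where the "opaque position marker" complaint comes from.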
Re: PEP 393 vs UTF-8 Everywhere
Chris Angelico writes: > On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: >> I was asserting that most useful operations on strings start from >> index 0. The r* operations would not be slowed down that much as >> UTF-8 has the useful property that attempting to interpret from a >> byte that is not at the start of a sequence (in the sense of a code >> point rather than Python) is invalid and so quick to move over while >> working backwards from the end. > > Let's take one very common example: decoding JSON. A ton of web > servers out there will call json.loads() on user-supplied data. The > bulk of the work is in the scanner, which steps through the string and > does the actual parsing. That function is implemented in Python, so > it's a good example. (There is a C accelerator, but we can ignore that > and look at the pure Python one.) > > So, how could you implement this function? The current implementation > maintains an index - an integer position through the string. It > repeatedly requests the next character as string[idx], and can also > slice the string (to check for keywords like "true") or use a regex > (to check for numbers). Everything's clean, but it's lots of indexing. > Alternatively, it could remove and discard characters as they're > consumed. It would maintain a string that consists of all the unparsed > characters. All indexing would be at or near zero, but after every > tiny piece of parsing, the string would get sliced. > > With immutable UTF-8 strings, both of these would be O(n^2). Either > indexing is linear, so parsing the tail of the string means scanning > repeatedly; or slicing is linear, so parsing the head of the string > means slicing all the rest away. > > The only way for it to be fast enough would be to have some sort of > retainable string iterator, which means exposing an opaque "position > marker" that serves no purpose other than parsing. Every string parse > operation would have to be reimplemented this way, lest it perform > abysmally on large strings. It'd mean some sort of magic "thing" that > probably has a reference to the original string, so you don't get the > progressive RAM refunds that slicing gives, and you'd still have to > deal with lots of the other consequences. It's probably doable, but it > would be a lot of pain. Julia does this. It has immutable UTF-8 strings, and there is a JSON parser. The "opaque position marker" is just the byte index. An attempt to use an invalid index throws an error. A substring type points to an underlying string. An iterator, called graphemes, even returns substrings that correspond to what people might consider a character. I offer Julia as evidence. My impression is that Julia's UTF-8-based system works and is not a pain. I wrote a toy function once to access the last line of a large memory-mapped text file, so I have just this little bit of personal experience of it, so far. Incidentally, can Python memory-map a UTF-8 file as a string? http://docs.julialang.org/en/stable/manual/strings/ https://github.com/JuliaIO/JSON.jl -- https://mail.python.org/mailman/listinfo/python-list
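On the memory-mapping question: Python's mmap gives a bytes-like view rather than a str, so the usual pattern is to search at the byte level and decode only the slice you need. A sketch of a last-line function along the lines of the Julia toy (the file name is made up):

    import mmap

    def last_line(path):
        """Return the last line of a large UTF-8 text file without reading it all."""
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # rfind works on bytes; a newline byte can never appear inside a
                # multi-byte UTF-8 sequence, so this is safe on UTF-8 data.
                start = mm.rfind(b"\n", 0, len(mm) - 1) + 1
                return mm[start:].decode("utf-8")

    print(last_line("big.txt"))   # hypothetical file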
