Re: How do I display unicode value stored in a string variable using ord()
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764
>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044
timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757
* I noticed, intuitively and empirically, that this happens for
cp1252 or mac-roman characters, and not for characters which are
elements of the latin-1 coding scheme.
* Bad luck, such characters are usual characters in French text
(and in some other European languages).
* I do not recall the extreme cases I found. Believe me, when
I'm speaking about a few 100%, I do not lie.
My take on the subject.
This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.
One day, I came across a very, very old mail message, dating from
the time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this: "Let's go with ucs-4, and the problems
are solved for ever". He was so right.
I have been watching the dev-list for years; my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non-ascii users" (see the unicode
literal reintroduction).
Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But when I try to take a distant
view, I become more and more sceptical.
PS Py3.3b2 is still crashing, silently exiting, with
cp65001.
jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: Dynamically determine base classes on instantiation
On 18/08/2012 02:44, Steven D'Aprano wrote:
> Makes you think that Google is interested in fixing the bugs in their
> crappy web apps? They have become as arrogant and as obnoxious as
> Microsoft used to be.

Charging off topic again, but I borrowed a book from the local library a
couple of months back about Google Apps as it looked interesting. I
returned it in disgust rather rapidly as it was basically a "let's bash
Microsoft" tome.

--
Cheers. Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On 18/08/2012 06:42, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti wrote:
>> Hi, I'm new to regular expressions. I want to be able to match for
>> tokens with all their properties in the following examples. I would
>> appreciate some direction on how to proceed.
>>
>> @foo1
>> @foo2()
>> @foo3(anything could go here)
>
> You can find regular expression primers all over the internet - fire up
> your favorite search engine and type those three words in. But it may
> be that what you want here is a more flexible parser; have you looked
> at BeautifulSoup (so rich and green)?
>
> ChrisA

Totally agree with the sentiment. There's a comparison of python parsers
here http://nedbatchelder.com/text/python-parsers.html

--
Cheers. Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764
>
sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346
"imeit"?
It is hard to take your results seriously when you have so obviously
edited your timing results, not just copied and pasted them.
Here are my results, on my laptop running Debian Linux. First, testing on
Python 3.2:
steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
1 loops, best of 3: 49.7 usec per loop
As you can see, the timing results are all consistently around 50
microseconds per loop, regardless of which characters I use, whether they
are in Latin-1 or not. The differences between one test and another are
not meaningful.
Now I do them again using Python 3.3:
steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
1 loops, best of 3: 66.9 usec per loop
The results are all consistently around 67 microseconds. So Python's
string handling is about 30% slower in the examples shown here.
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.org/
Don't forget to report your operating system.
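For anyone trying to reproduce timings like the ones above from inside a
script rather than the command line, a sketch using `timeit.repeat` (the
statement is taken from the thread; the repeat/number counts are just
illustrative):

```python
import timeit

# Taking the minimum of several repeats reduces noise from other
# processes; a single timeit run can easily vary between invocations.
stmt = "('ab…' * 1000).replace('…', '……')"
timings = timeit.repeat(stmt, repeat=5, number=1000)
best = min(timings)
print("best of 5: %.2f usec per loop" % (best / 1000 * 1e6))
```

This mirrors what `python -m timeit` does: it reports the best of
several runs, not the average.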
> My take on the subject.
>
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.
The problem with UCS-4 is that every character requires four bytes.
Every. Single. One.
So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but
of course UCS-2 can only represent characters in the BMP. A pure ASCII
string would only take 11 bytes, but we're not going back to pure ASCII.
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two 16-bit code units, a "surrogate pair". This is fragile and
doesn't work very well, because string-handling methods can break the
surrogate pairs apart, leaving you with an invalid unicode string. Not
good.)
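A small sketch of the failure mode described above, using an explicit
UTF-16 encoding on a modern Python (narrow builds did this implicitly
inside the str type):

```python
s = "\U00010348"          # GOTHIC LETTER HWAIR, outside the BMP
units = s.encode("utf-16-le")
print(len(units))         # 4 bytes: two 16-bit code units, a surrogate pair

# Splitting between the two code units leaves a lone surrogate,
# which is not a valid Unicode string:
half = units[:2].decode("utf-16-le", errors="replace")
print(repr(half))         # the lone surrogate becomes U+FFFD
```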
The difference between 44 bytes and 22 bytes for one little string is not
very important, but when you double the memory required for every single
string it becomes huge. Remember that every class, function and method
has a name, which is a string; every attribute and variable has a name,
all strings; functions and classes have doc strings, all strings. Strings
are used everywhere in Python, and doubling the memory needed by Python
means that it will perform worse.
With PEP 393, each Python string will be stored in the most efficient
format possible:
- if it only contains ASCII characters, it will be stored using 1 byte
per character;
- if it only contains characters in the BMP, it will be stored using
UCS-2 (2 bytes per character);
- if it contains non-BMP characters, the string will be stored using
UCS-4 (4 bytes per character).
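On a PEP 393 build (Python 3.3 or later) this can be observed directly
with `sys.getsizeof`; the absolute numbers include per-object overhead
and vary by platform, so only the growth per character is meaningful:

```python
import sys

ascii_s  = "a" * 1000          # ASCII only: 1 byte per character
bmp_s    = "\u20ac" * 1000     # Euro sign, in the BMP: 2 bytes per character
astral_s = "\U00010348" * 1000 # outside the BMP: 4 bytes per character

for name, s in [("ascii", ascii_s), ("bmp", bmp_s), ("astral", astral_s)]:
    print(name, sys.getsizeof(s))
```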
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
In article <[email protected]>, Frank Koshti wrote:
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
>
> @foo1
> @foo2()
> @foo3(anything could go here)

Don't try to parse HTML with regexes. Use a real HTML parser, such as
lxml (http://lxml.de/).

--
http://mail.python.org/mailman/listinfo/python-list
Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)
I am aware of this. I'm just too lazy to use Google Groups!

"Come on Ramchandra, you can switch to Google Groups."

On 17 August 2012 13:09, rusi wrote:
> On Aug 17, 3:36 am, Chris Angelico wrote:
>> On Fri, Aug 17, 2012 at 1:40 AM, Ramchandra Apte wrote:
>>> On 16 August 2012 21:00, Mark Lawrence wrote:
>>>> and "bottom" reads better than "top"
>>>
>>> Look you are the only person complaining about top-posting.
>>> GMail uses top-posting by default.
>>> I can't help it if you feel irritated by it.
>>
>> I post using gmail,
>
> If you register on the mailing list as well as google groups, you can
> then use googlegroups.
> Thereafter appropriately cutting out the unnecessary stuff is easy

--
http://mail.python.org/mailman/listinfo/python-list
Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS
Please don't use all caps.

On 17 August 2012 18:16, coldfire wrote:
> I would like to know where a python script can be stored on-line, from
> where it keeps running and can be called at any time when required
> using the internet.
> I have used the mechanize module, which creates a web browser instance
> to open a website, extract data and email me.
> I have tried PythonAnywhere but they don't support opening of anonymous
> websites.
> What is the current way to do this?
> Can someone point me in the right direction?
> My script has no interaction with the user. It just goes on-line,
> searches for something and emails me.
>
> Thanks

--
http://mail.python.org/mailman/listinfo/python-list
Re: pythonic interface to SAPI5?
A simple workaround is to use:

import subprocess
import time

speak = subprocess.Popen(["espeak"], stdin=subprocess.PIPE)
speak.stdin.write(b"Hello world!")  # bytes, not str, on Python 3
speak.stdin.flush()
time.sleep(1)
speak.terminate()  # end the speaking
On 17 August 2012 21:49, Vojtěch Polášek wrote:
> Hi,
> I am developing audiogame for visually impaired users and I want it to
> be multiplatform. I know, that there is library called accessible_output
> but it is not working when used in Windows for me.
> I tried pyttsx, which should use Espeak on Linux and SAPI5 on Windows.
> It works on Windows, on Linux I decided to use speech dispatcher bindings.
> But it seems that I can't interrupt speech when using pyttsx and this is
> showstopper for me.
> Does anyone has any working solution for using SAPI5 on windows?
> Thank you very much,
> Vojta
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
http://mail.python.org/mailman/listinfo/python-list
Re: remote read eval print loop
Not really. Try modifying ast.literal_eval. This will be quite secure.

On 17 August 2012 19:36, Chris Angelico wrote:
> On Fri, Aug 17, 2012 at 11:28 PM, Eric Frederich wrote:
>> Within the debugging console, after importing all of the bindings,
>> there would be no reason to import anything whatsoever.
>> With just the bindings I created and the Python language we could do
>> meaningful debugging.
>> So if I block the ability to do any imports and calls to eval I should
>> be safe right?
>
> Nope. Python isn't a secured language in that way. I tried the same
> sort of thing a while back, but found it effectively impossible. (And
> this after people told me "It's not possible, don't bother trying". I
> tried anyway. It wasn't possible.)
>
> If you really want to do that, consider it equivalent to putting an
> open SSH session into your debugging console. Would you give that much
> power to your application's users? And if you would, is it worth
> reinventing SSH?
>
> ChrisA

--
http://mail.python.org/mailman/listinfo/python-list
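A sketch of what the `ast.literal_eval` suggestion above buys you: it
evaluates only Python literals, so expressions involving names or calls
are rejected outright (the example strings are illustrative):

```python
import ast

# Literals are fine: numbers, strings, tuples, lists, dicts, sets,
# booleans, None.
value = ast.literal_eval("[1, 2, {'a': 3}]")
print(value)

# Anything involving a name or a call raises ValueError, so
# __import__-style escapes are not expressible:
try:
    ast.literal_eval("__import__('os').system('echo owned')")
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```

Note that literal_eval still isn't a full sandbox (e.g. deeply nested
input can exhaust the parser), so "quite secure" deserves a caveat.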
Re: [ANNC] pybotwar-0.8
On 17 August 2012 18:23, Hans Mulder wrote:
> On 16/08/12 23:34:25, Walter Hurry wrote:
>> On Thu, 16 Aug 2012 17:20:29 -0400, Terry Reedy wrote:
>>> On 8/16/2012 11:40 AM, Ramchandra Apte wrote:
>>>
>>>> Look you are the only person complaining about top-posting.
>>>
>>> No he is not. Recheck all the responses.
>>>
>>>> GMail uses top-posting by default.
>>>
>>> It only works if everyone does it.
>>>
>>>> I can't help it if you feel irritated by it.
>>>
>>> Your out-of-context comments are harder to understand. I mostly do
>>> not read them.
>>
>> It's strange, but I don't even *see* his contributions (I am using a
>> regular newsreader - on comp.lang.python - and I don't have him in the
>> bozo bin). It doesn't sound as though I'm missing much.
>
> I don't see him either. That is to say: my ISP doesn't have his
> posts in comp.lang.python. The group gmane.comp.python.general
> on Gmane has them, so if you're really curious, you can point
> your NNTP client at news.gmane.org.
>
>> But I'm just curious. Any idea why that would be the case?
>
> Maybe there's some kind of filter in the mail->usenet gateway?
>
> HTH,
>
> -- HansM

Let's not go overkill. I'll be using Google Groups (hopefully it won't
top-post by default) to post stuff.

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:
> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes,
I'm going to ignore that advice and show you how to abuse regexes to
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by
any alphanumeric characters, followed by "</h1>". We start by compiling
a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)

First we import the re module. Then we define a pattern string. Note that
I use a "raw string" instead of a regular string -- this is not
compulsory, but it is very common. The difference between a raw string
and a regular string is how they handle backslashes. In Python, some (but
not all!) backslashes are special. For example, the regular string "\n"
is not two characters, backslash-n, but a single character, Newline. The
Python string parser converts backslash combinations into special
characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash

etc. We often call these "backslash escapes". Regular expressions use a
lot of backslashes, and so it is useful to disable the interpretation of
backslash escapes when writing regex patterns. We do that with a "raw
string" -- if you prefix the string with the letter r, the string is raw
and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n

Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters: less-than h 1 greater-than at-sign
backslash w plus-sign less-than slash h 1 greater-than.

Most of the characters shown just match themselves. For example, the @
sign will only match another @ sign. But some have special meaning to
the regex: \w doesn't match "backslash w", but any alphanumeric
character; + doesn't match a plus sign, but tells the regex to match the
previous symbol one or more times. Since it immediately follows \w, this
means "match at least one alphanumeric character".

Now we feed that string into re.compile, to create a pre-compiled regex.
(This step is optional: any function which takes a compiled regex will
also accept a string pattern. But pre-compiling regexes which you are
going to use repeatedly is a good idea.)

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I, which is a special
value that tells the regular expression to ignore case, so "h" will
match both "h" and "H".

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah
spam and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""

And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular
expression finds something that matches your pattern. If nothing
matches, then None is returned instead.

if mo is not None:
    print(mo.group(0))
=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care
about the tags <h1>...</h1>, we only care about the "victory" part.
Here's how to use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory

Hope this helps.

--
Steven

--
http://mail.python.org/mailman/listinfo/python-list
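The tutorial's steps, collected into one runnable sketch (the `<h1>`
tags are reconstructed from the pattern spelled out character by
character in the post: less-than h 1 greater-than at-sign backslash w
plus-sign less-than slash h 1 greater-than):

```python
import re

text = """Now is the time for all good men blah blah blah
spam and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""

# Raw string: backslash escapes are left alone; () captures the name.
pattern = r"<h1>@(\w+)</h1>"
regex = re.compile(pattern, re.I)

mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))   # the whole match
    print(mo.group(1))   # just the captured word
```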
Re: Regex Question
I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes, I've done my share of RTFM.

This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Hey Steven,
Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.
The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as Hello. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.
To do this, I decided to use:
re.compile('\$\w*\(?.*?\)').findall(mystring)
the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.
Thanks,
Frank
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> [...]
> The problem with UCS-4 is that every character requires four bytes.
> [...]

I'm aware of this (and all the blah blah blah you are explaining). It is
always the same song. Memory.

Let me ask. Is Python an "american" product for us-users or is it a tool
for everybody [*]? Is there any reason why non-ascii users are somehow
penalized compared to ascii users?

This flexible string representation is a regression (for ascii users or
not). I recognize that in practice the real impact is close to zero for
many users (including me), but I have shown (I think) that this flexible
representation is, by design, not as optimal as it is supposed to be.
This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency symbol used
all around the world (banking, accounting applications).

jmf

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
(Resending this to the list because I previously sent it only to Steven
by mistake. Also showing off a case where top-posting is reasonable,
since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly wrote:
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano" wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is
>> a byte encoding, not a unicode internal format. Perhaps you mean to
>> say that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be
> equivalent to Latin-1, because Latin-1 is a proper subset of Unicode;
> every character representable in Latin-1 has a byte value equal to its
> Unicode codepoint. This talk of whether it's a byte encoding or a
> 1-byte Unicode representation is then just semantics. Even the PEP
> refers to the 1-byte representation as Latin-1.
>
>>>> I understand the complaint
>>>> to be that while the change is great for strings that happen to fit
>>>> in Latin-1, it is less efficient than previous versions for strings
>>>> that do not.
>>>
>>> That's not the way I interpreted the PEP 393. It takes a pure unicode
>>> string, finds the largest code point in that string, and chooses 1, 2
>>> or 4 bytes for every character, based on how many bits it'd take for
>>> that largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all
> 4 bytes of the code point, you get UCS-4. Truncating to 2 bytes, you
> get UCS-2. Truncating to 1 byte, you get Latin-1.

--
http://mail.python.org/mailman/listinfo/python-list
How to get initial absolute working dir reliably?
What's the most reliable way for "module code" to determine the absolute
path of the working directory at the start of execution?

(By "module code" I mean code that lives in a file that is not meant to
be run as a script, but rather is meant to be loaded as the result of
some import statement. In other words, "module code" is code that must
operate under the assumption that it can be loaded at any time after the
start of execution.)

Functions like os.path.abspath produce wrong results if the working
directory is changed, e.g. through os.chdir, so they are not terribly
reliable for determining the initial working directory.

Basically, I'm looking for a read-only variable (or variables)
initialized by Python at the start of execution, from which the initial
working directory may be read or computed.

Thanks!

--
http://mail.python.org/mailman/listinfo/python-list
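For what it's worth, one common approach (a sketch, not a guaranteed
answer to the question: it only works if the snapshot module is imported
before anything calls `os.chdir`) is to capture the directory once at
import time:

```python
import os

# Snapshot taken when this module is first imported; Python caches
# modules, so re-imports do not re-run this line.
INITIAL_CWD = os.path.abspath(os.getcwd())

def initial_cwd():
    """Return the working directory as it was at (first) import time."""
    return INITIAL_CWD
```

Putting this in a module that your program imports at startup gives
every later importer the same cached value, regardless of subsequent
os.chdir calls.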
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 16:07, [email protected] wrote:
> I'm aware of this (and all the blah blah blah you are explaining). It
> is always the same song. Memory.
> [...]
> jmf

Sorry but you've got me completely baffled. Could you please explain in
words of one syllable or less so I can attempt to grasp what the hell
you're on about?

--
Cheers. Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)
On 2012-08-17, rusi wrote:
> I was in a corporate environment for a while. And carried my
> 'trim&interleave' habits there. And got gently scolded for seeming to
> hide things!!

I have, rarely, gotten the opposite reaction from "corporate e-mailers"
used to top posting. I got one comment something like "That's cool how
you interleaved your responses -- it's like having a real conversation."

--
Grant Edwards               grant.b.edwards@gmail.com
Yow! Somewhere in Tenafly, New Jersey, a chiropractor is viewing
"Leave it to Beaver"!

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 1:07 AM, wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are now going to be space-optimized.

Of course, the performance gains from shortening some of the strings may
be offset by costs when comparing one-byte and multi-byte strings, but
presumably that's all been gone into in great detail elsewhere.

ChrisA

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Frank Koshti wrote:
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>>> s = """
... $foo1
... $foo2()
... $foo3(anything could go here)
... """
>>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']
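Building on Peter's pattern, the "process and replace" half of Frank's
request can be sketched with `re.sub` and a replacement function (the
`process` helper and its bracketed output format are made up for
illustration):

```python
import re

pattern = re.compile(r"(\$\w+)(\(.*?\))?")

def process(m):
    # m.group(1) is the token name, m.group(2) the optional argument
    # list (None when the token appears bare, as in $foo).
    name, args = m.group(1), m.group(2)
    return "[%s%s]" % (name, args or "")

s = "say $foo and $bar(x=3) here"
print(pattern.sub(process, s))   # say [$foo] and [$bar(x=3)] here
```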
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, Aug 18, 2012 at 9:07 AM, wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

The change does not just benefit ASCII users. It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters. Even for narrow build users, there is the benefit that
with approximately the same amount of memory usage in most cases, they
no longer have to worry about non-BMP characters sneaking in and
breaking their code.

There is some additional benefit for Latin-1 users, but this has nothing
to do with Python. If Python is going to have the option of a 1-byte
representation (and as long as we have the flexible representation, I
can see no reason not to), then it is going to be Latin-1 by definition,
because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
an issue with that, take it up with the designers of Unicode.

> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> closed to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.

You've shown nothing of the sort. You've demonstrated only one out of
many possible benchmarks, and other users on this list can't even
reproduce that.

--
http://mail.python.org/mailman/listinfo/python-list
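The claim above that every Latin-1 character has a byte value equal to
its Unicode code point is easy to check directly:

```python
# Every Latin-1 byte decodes to the Unicode character whose code point
# is that same byte value - Latin-1 is a prefix of the code space.
for value in range(256):
    assert bytes([value]).decode("latin-1") == chr(value)
print("all 256 byte values match their code points")
```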
Re: Regex Question
2012/8/18 Frank Koshti :
> Hey Steven,
>
> Thank you for the detailed (and well-written) tutorial on this very
> issue. I actually learned a few things! Though, I still have
> unresolved questions.
>
> The reason I don't want to use an XML parser is because the tokens are
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue is
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>
> Thanks,
> Frank
> --
> http://mail.python.org/mailman/listinfo/python-list
Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g.: r"...", instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.
I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "Hello")
'Hello'
>>>
is probably not the desired output.
For some kind of "processing" the matched text, you can use the
replace function instead of the replace pattern in re.sub too.
see
http://docs.python.org/library/re.html#re.sub
hth,
vbr
--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Aug 18, 11:48 am, Peter Otten <[email protected]> wrote:
> Frank Koshti wrote:
>> I need to match, process and replace $foo(x=3), knowing that (x=3) is
>> optional, and the token might appear simply as $foo.
>>
>> To do this, I decided to use:
>>
>> re.compile('\$\w*\(?.*?\)').findall(mystring)
>>
>> the issue with this is it doesn't match $foo by itself, and requires
>> there to be () at the end.
>
> >>> s = """
> ... $foo1
> ... $foo2()
> ... $foo3(anything could go here)
> ... """
> >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
> ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Frank Koshti writes:
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:
$foo() $foo(bar=3) $$ $foo($) $foo($bar(v=0)) etc
$foo() $foo(bar=3) $$ $foo($) $foo($bar(v=0)) etc

--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Steven,

Well done!!!

Regards,
Malcolm

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 or
Py 3.3 and compare string manipulations. Py 3.3 is always slower.
Period.

Now, the reason. I think it is due to the "flexible representation".

Deeper reason. The "boss" does not wish to hear of a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know how many
times).

jmf

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 2:38 AM, wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is
> always slower. Period.

Ah, but what about all those other operations that use strings under the
covers? As mentioned, namespace lookups do, among other things. And how
is performance in the (very real) case where a C routine wants to return
a value to Python as a string, where the data is currently guaranteed to
be ASCII (previously using PyUnicode_FromString, now able to use
PyUnicode_FromKindAndData)?

Again, I'm sure this has been gone into in great detail before the PEP
was accepted (am I negative-bikeshedding here? "atomic reactoring"???),
and I'm sure that the gains outweigh the costs.

ChrisA

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 17:38, [email protected] wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 or
> Py 3.3 and compare string manipulations. Py 3.3 is always slower. Period.

Proof that is acceptable to everybody please, not just yourself.

> Now, the reason. I think it is due to the "flexible representation".
> Deeper reason. The "boss" does not wish to hear about a (pure)
> ucs-4/utf-32 "engine" (this has been discussed I do not know how many
> times).
> jmf

--
Cheers.
Mark Lawrence.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)
On Aug 18, 8:34 pm, Grant Edwards wrote:
> On 2012-08-17, rusi wrote:
>> I was in a corporate environment for a while. And carried my
>> 'trim&interleave' habits there. And got gently scolded for seeming to
>> hide things!!
>
> I have, rarely, gotten the opposite reaction from "corporate e-mailers"
> used to top posting. I got one comment something like "That's cool
> how you interleaved your responses -- it's like having a real
> conversation."

Well sure. If I could civilize people around me, God (or Darwin?) would give me a medal. Usually though, I find it expedient to remember G.B. Shaw's: "The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man." and then decide exactly how (un)reasonable to be in a given context. [No claims to always succeed in these calibrations :-) ]

And that brings me back to the question of how best to tell new folk about the netiquette around here.

I was once teaching C to a batch of first-year students. I was younger then and, being more passionate and less reasonable, I made a rule that students should indent their programs correctly. [Nothing like python in sight those days!] A few days later there was a commotion. Students appeared in class with black badges, complained to the head of department and what not. Very perplexed, I said: "Why?! I allowed you to indent in any which way you like as long as you have some rules and follow them." I imagined that I had been perfectly reasonable and lenient! Only later did I realize that students did not understand
- how to indent
- why to indent
- what program structure meant

So when people top-post it seems reasonable to assume that they don't know better, before jumping to conclusions of carelessness, rudeness, inattention etc.
For example, my sister recently saw some of my mails and was mystified that I had sent back 'blank mails' until I explained and pointed out that my answers were interleaved into what was originally sent! Clearly she had only ever seen (and therefore expected) top-posted mail-threads. Like your 'corporate-emailer' she found it damn neat, after the initiation. Whether such simple unfamiliarity with culture is the case in the particular case (this thread's discussion) I am not sure. Good to remember Hanlon's razor... -- http://mail.python.org/mailman/listinfo/python-list
Re: Crashes always on Windows 7
On 8/18/2012 2:18 AM, [email protected] wrote:
> Open using File>Open on the Shell

The important question, as I said in my previous post, is *exactly* what you do in the OpenFile dialog. Some things work, others do not. And we (Python) have no control.

--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list
Re: python+libxml2+scrapy AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
Dmitry Arsentiev, 15.08.2012 14:49:
> Has anybody already met a problem like this? -
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
>
> When I run scrapy, I get
>
> File "/usr/local/lib/python2.7/site-packages/scrapy/selector/factories.py",
> line 14, in
> libxml2.HTML_PARSE_NOERROR + \
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
>
> When I run
> python -c 'import libxml2; libxml2.HTML_PARSE_RECOVER'
>
> I get
> Traceback (most recent call last):
> File "", line 1, in
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
>
> How can I cure it?
>
> Python 2.7
> libxml2-python 2.6.9
> 2.6.11-gentoo-r6

That version of libxml2 is way too old and doesn't support parsing real-world HTML. IIRC, that started with 2.6.21 and got improved a bit after that. Get a 2.8.0 installation, as someone pointed out already.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> On Saturday 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are explaining). This
> is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an important song.

> Let me ask. Is Python an "american" product for us-users or is it a
> tool for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so important. PEP 393 means that users who have only a few non-BMP characters don't have to pay the cost of UCS-4 for every single string in their application, only for the ones that actually require it. PEP 393 means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode cheaper for everyone, but to make ASCII strings more expensive so that everyone suffers equally. I reject that idea.

> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason. If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them.

But why should you carry the cost of 4 bytes per character just because someday you *might* need a non-BMP character?

> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode. And it means that now Python can correctly deal with non-BMP characters without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))"  # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))"  # Wrong!
2

without doubling the storage of every string. This is an important step towards making the full range of Unicode available more widely.

> I recognize in practice the real impact is for many users close to zero

Then what's the problem?

> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all. You have shown untrustworthy, edited timing results that don't match what other people are reporting. Even if your timing results are genuine, you haven't shown that they make any difference for real code that does useful work.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
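Steven's `len(chr(1114000))` demonstration can be checked directly; a small sketch, assuming a Python 3.3+ interpreter (where PEP 393 applies):

```python
# On Python 3.3+ (PEP 393), a non-BMP code point is a single character.
c = chr(1114000)
assert len(c) == 1
assert ord(c) == 1114000

# Its UTF-16 encoding still needs a surrogate pair (two 16-bit units),
# which is what narrow 3.2 builds exposed at the Python level as length 2.
assert len(c.encode('utf-16-le')) == 4
print('ok')
```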
Re: pythonic interface to SAPI5?
Thank you very much,
I have found a DLL which is designed exactly for us and I use it through
ctypes.
Vojta
On 18.8.2012 15:44, Ramchandra Apte wrote:
> A simple workaround is to use:
> import subprocess, time
> speak = subprocess.Popen("espeak", stdin=subprocess.PIPE)
> speak.stdin.write(b"Hello world!")  # bytes, not str, on Python 3
> speak.stdin.close()  # flush so espeak receives the text
> time.sleep(1)
> speak.terminate()  # end the speaking
>
>
>
> On 17 August 2012 21:49, Vojtěch Polášek > wrote:
>
> Hi,
> I am developing audiogame for visually impaired users and I want it to
> be multiplatform. I know, that there is library called
> accessible_output
> but it is not working when used in Windows for me.
> I tried pyttsx, which should use Espeak on Linux and SAPI5 on Windows.
> It works on Windows, on Linux I decided to use speech dispatcher
> bindings.
> But it seems that I can't interrupt speech when using pyttsx and
> this is
> showstopper for me.
> Does anyone has any working solution for using SAPI5 on windows?
> Thank you very much,
> Vojta
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>
>
>
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:
> Proof that is acceptable to everybody please, not just yourself.

I can't; I'm only facing the fact that it works slower on my Windows platform. As I understand (I think) the underlying mechanism, I can only say it is not a surprise that it happens.

Imagine an editor. I type an "a"; internally the text is saved as ascii. Then I type an "é"; the text can only be saved in at least latin-1. Then I enter a "€"; the text becomes an internal ucs-4 "string". Then remove the "€", and so on. Intuitively I expect there is some kind of slowdown between all these "string" conversions.

When I tested this flexible representation a few months ago, at the first alpha release, this is precisely what I tested: string manipulations which force this internal change. I concluded the result is not brilliant. Really, a factor 0.n up to 10.

These are simply my conclusions.

Related question. Does anybody know a way to get the size of the internal "string" in bytes? In the narrow or wide build it is easy; I can encode with the "unicode_internal" codec. In Py 3.3, I attempted to toy with sizeof and struct, but without success.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
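One answer to the "size of the internal string" question is `sys.getsizeof`, which reports the whole object including the PEP 393 payload. A minimal sketch; the exact numbers vary by build and version, so compare relative growth rather than absolutes:

```python
import sys

# Sizes include per-object overhead, so compare growth, not absolutes.
ascii_s  = 'a' * 100           # 1 byte per code point under PEP 393
bmp_s    = '\u20ac' * 100      # 2 bytes per code point (euro sign)
astral_s = '\U0001D11E' * 100  # 4 bytes per code point (musical symbol)

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))
```

On CPython 3.3+ the three sizes grow roughly 1x/2x/4x with the widest code point in the string, which is the flexible representation made visible.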
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano writes: > (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters > using two code points. This is fragile and doesn't work very well, > because string-handling methods can break the surrogate pairs apart, > leaving you with invalid unicode string. Not good.) ... > With PEP 393, each Python string will be stored in the most efficient > format possible: Can you explain the issue of "breaking surrogate pairs apart" a little more? Switching between encodings based on the string contents seems silly at first glance. Strings are immutable so I don't understand why not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in Latin-based alphabets and UTF-16 may be more efficient for some other languages. I think even UCS-4 doesn't completely fix the surrogate pair issue if it means the only thing I can think of. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:05, [email protected] wrote:
> Imagine an editor. I type an "a"; internally the text is saved as ascii.
> Then I type an "é"; the text can only be saved in at least latin-1.
> Then I enter a "€"; the text becomes an internal ucs-4 "string". Then
> remove the "€", and so on.
[snip]

"a" will be stored as 1 byte/codepoint. Adding "é", it will still be stored as 1 byte/codepoint. Adding "€", it will then be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python; you'd be building a list and then joining them together in one operation.

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>> I'm aware of this (and all the blah blah blah you are explaining).
>> This is always the same song. Memory.
>
> Exactly. The reason it is always the same song is because it is an
> important song.

No offense here. But this is an *american* answer. The same story as the coding of text files, where "utf-8 == ascii" and the rest of the world doesn't count.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Aug 18, 10:59 pm, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>> Is there any reason why non ascii users are somehow penalized compared
>> to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even
> two bytes. That is a fact of basic mathematics. Supporting 1114111
> characters must be more expensive than supporting 128 of them.
>
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of:
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:
http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:26, Paul Rubin wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate
> pair issue if it means the only thing I can think of.

On a narrow build, codepoints outside the BMP are stored as a surrogate pair (2 codepoints). On a wide build, all codepoints can be represented without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could inadvertently slice the string in the middle of the surrogate pair.

--
http://mail.python.org/mailman/listinfo/python-list
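The slicing hazard MRAB describes can be simulated on any Python 3 by encoding a non-BMP character as UTF-16 and splitting its surrogate pair by hand (a sketch; narrow builds did this at the `str` level, here it is shown at the bytes level):

```python
# A non-BMP character is one code point on a PEP 393 / wide build...
c = '\U0001F600'
assert len(c) == 1

# ...but its UTF-16 form is a surrogate pair: two 16-bit code units.
units = c.encode('utf-16-le')
assert len(units) == 4

# Slicing between the two units leaves a lone surrogate, which is not
# valid Unicode: decoding it fails.
high = units[:2]
try:
    high.decode('utf-16-le')
except UnicodeDecodeError:
    print('lone surrogate: invalid Unicode')
```

This is exactly the breakage narrow builds risked whenever indexing or slicing landed inside a pair.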
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:30, [email protected] wrote:
> No offense here. But this is an *american* answer. The same story as
> the coding of text files, where "utf-8 == ascii" and the rest of the
> world doesn't count.
> jmf

Thinking about it I entirely agree with you. Steven D'Aprano strikes me as typically American, in the same way that I'm typically Brazilian :)

--
Cheers.
Mark Lawrence.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:40, rusi wrote:
> I am reminded of:
> http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

ROFLMAO doesn't adequately sum up how much I laughed.

--
Cheers.
Mark Lawrence.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 12:38 PM, [email protected] wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 or
> Py 3.3 and compare string manipulations. Py 3.3 is always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:

from timeit import timeit

print(timeit(" 'a'*1 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c = '…'; a = 'a'*1"))
3.3: .05 (independent of len(a)!)
3.2: 5.8
100 times slower! Increase len(a) and the ratio can be made as high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3: .26
Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is not a panacea; it has problems of time, space, and system compatibility (Windows and others). Victor Stinner, whatever he may have once thought and said, put a *lot* of effort into making the new implementation both correct and fast.

On your replace example:

>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044

I do not see the point of changing both length and replacement. For me, the time is about the same for either replacement. I do see about the same slowdown ratio for 3.3 versus 3.2. I also see it for pure search without replacement.

print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Aug 18, 12:22 pm, Jussi Piitulainen
wrote:
> Frank Koshti writes:
> > not always placed in HTML, and even in HTML, they may appear in
> > strange places, such as Hello. My specific issue
> > is I need to match, process and replace $foo(x=3), knowing that
> > (x=3) is optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> Adding a ? after the meant-to-be-optional expression would let the
> regex engine know what you want. You can also separate the mandatory
> and the optional part in the regex to receive pairs as matches. The
> test program below prints this:
>
> >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
> ('$foo', '')
> ('$foo', '(bar=3)')
> ('$foo', '($)')
> ('$foo', '')
> ('$bar', '(v=0)')
>
> Here is the program:
>
> import re
>
> def grab(text):
>     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
>     return re.findall(p, text)
>
> def test(html):
>     print(html)
>     for hit in grab(html):
>         print(hit)
>
> if __name__ == '__main__':
>     test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
--
http://mail.python.org/mailman/listinfo/python-list
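One detail of the quoted pattern: `[^()]+` requires at least one character between the parentheses, so an empty argument list like `$foo()` is reported with `''` for the second group. If empty parentheses should be captured too, a `*` variant works. This is a sketch of that assumption about the desired behaviour, not part of the original program:

```python
import re

# '*' instead of '+', so "$foo()" captures '()' rather than ''.
pattern = re.compile(r'([$]\w+)([(][^()]*[)])?')

# With two groups, findall returns (name, args) tuples; an unmatched
# optional group comes back as the empty string.
hits = pattern.findall('$foo()$foo(bar=3)$$$foo($)$foo')
print(hits)
```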
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 20:40:23 UTC+2, rusi wrote:
> I am reminded of:
> http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish: Python 3.3 is "slower" than Python 3.2. If you see the present status as an optimisation, I'm considering this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature, the correct solution. To be extreme, tools using pure utf-16 or utf-32 are, at least, considering all the citizens on this planet in the same way.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: set and dict iteration
On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady wrote:
>> Is there a problem with hacking on the Beta?
>
> Nope. Hack on the beta, then when the release arrives, rebase your
> work onto it. I doubt that anything of this nature will be changed
> between now and then.
>
> ChrisA

Thanks Chris, your post was encouraging.

I have a question involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it's possible to prove that no reference cycle can be composed entirely of tuples.
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient. Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set don't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker help prove or disprove the above?

--
http://mail.python.org/mailman/listinfo/python-list
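The cycle detector being asked about can at least be observed from pure Python: a reference cycle is unreachable by refcounting alone but is reclaimed by `gc.collect()`. A minimal sketch (the `Box` class is mine, just something whose instances support weak references):

```python
import gc
import weakref

class Box:
    """Plain object; instances support weak references."""

gc.disable()             # make the moment of collection explicit
b = Box()
b.self = b               # reference cycle through the instance __dict__
r = weakref.ref(b)

del b                    # refcount alone cannot reclaim the cycle...
assert r() is not None

gc.collect()             # ...the cycle detector breaks and frees it
assert r() is None
gc.enable()
print('cycle collected')
```

Whether the set-iterator scheme in the diagram interacts safely with tp_traverse/tp_clear is the open question; this only shows the collector's baseline behaviour.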
Re: How to get initial absolute working dir reliably?
On Sat, Aug 18, 2012 at 11:19 AM, kj wrote:
>
> Basically, I'm looking for a read-only variable (or variables)
> initialized by Python at the start of execution, and from which
> the initial working directory may be read or computed.
>
This will work for Linux and Mac OS X (and maybe Cygwin, but unlikely for
native Windows): try the PWD environment variable.
>>> import os
>>> os.getcwd()
'/Users/swails'
>>> os.getenv('PWD')
'/Users/swails'
>>> os.chdir('..')
>>> os.getcwd()
'/Users'
>>> os.getenv('PWD')
'/Users/swails'
Of course this environment variable can still be messed with, but there
isn't much reason to do so generally (if I'm mistaken here, someone please
correct me).
Hopefully this is of some help,
Jason
--
http://mail.python.org/mailman/listinfo/python-list
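A portable alternative to the `PWD` environment variable (which native Windows does not set) is simply to record `os.getcwd()` at import time, in a module imported early. A minimal sketch:

```python
import os

# Captured once, when this module is first imported; later os.chdir()
# calls do not affect it.
INITIAL_WD = os.getcwd()

def initial_working_dir():
    """Return the working directory recorded at import time."""
    return INITIAL_WD
```

This sidesteps the "variable can be messed with" caveat, at the cost of requiring the module to be imported before any chdir happens.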
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 21:22, [email protected] wrote:
> I think it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.

> You can take the problem the way you wish, Python 3.3 is "slower" than
> Python 3.2.

I'll ask for the second time. Provide proof that is acceptable to everybody and not just yourself.

> If you see the present status as an optimisation, I'm considering this
> as a regression.

Considering does not equate to proof. Where are the figures which back up your claim?

> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature, the correct
> solution.

I look forward to seeing your patch on the bug tracker. If and only if you can find something that needs patching, which from the course of this thread I think is highly unlikely.

> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizens on this planet in the same way.
> jmf

--
Cheers.
Mark Lawrence.
--
http://mail.python.org/mailman/listinfo/python-list
Re: set and dict iteration
On 18/08/2012 21:29, Aaron Brady wrote:
> Can someone who is more familiar with the cycle detector and cycle
> breaker help prove or disprove the above?

In simple terms, when you create an immutable object it can contain only references to pre-existing objects, but in order to create a cycle you need to make an object refer to another which is created later, so it's not possible to create a cycle out of immutable objects.

However, using Python's C API it _is_ possible to create such a cycle, by mutating an otherwise-immutable tuple (see PyTuple_SetItem and PyTuple_SET_ITEM).

--
http://mail.python.org/mailman/listinfo/python-list
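The "created later" argument in concrete terms: a pure-Python cycle always needs a mutation step after creation, and tuples offer none. A small illustration:

```python
# A list can be made to refer to itself after creation...
a = []
a.append(a)
assert a[0] is a

# ...but a tuple's elements are fixed at creation time, so from pure
# Python a tuple can only reference objects that already exist. There
# is no t.append or t[0] = ...; a cycle made only of tuples can't be
# built without the C API tricks mentioned above.
t = ([],)
t[0].append(t)          # a cycle exists, but only via the mutable list
assert t[0][0] is t
print('cycles require a mutable link')
```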
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate
> pair issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a few thousand bytes, how do you locate the 273rd character? You have to scan from the beginning. The same applies when surrogate pairs are used to represent single characters, unless the representation leaks and a surrogate is indexed as two - which is where the breaking-apart happens.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
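Why indexing UTF-8 is O(n) can be shown in a few lines: finding the nth code point means skipping continuation bytes one at a time. A sketch (the helper name is mine):

```python
def utf8_index(data: bytes, n: int) -> int:
    """Byte offset of the nth code point in UTF-8 data.

    Requires a linear scan: continuation bytes (0b10xxxxxx) must be
    skipped, so random access is O(n), unlike a fixed-width encoding
    where it would be a single multiplication.
    """
    count = 0
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:   # start of a new code point
            if count == n:
                return i
            count += 1
    raise IndexError(n)

data = 'a\u00e9\u20acx'.encode('utf-8')   # 1-, 2-, 3-, 1-byte characters
assert utf8_index(data, 0) == 0   # 'a'
assert utf8_index(data, 1) == 1   # 'é' starts at byte 1
assert utf8_index(data, 2) == 3   # '€' starts at byte 3
assert utf8_index(data, 3) == 6   # 'x' starts at byte 6
```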
Re: [pyxl] xlrd 0.8.0 released!
My compliments to John and Chris and to any others who contributed to the new xlsx capability. This is a most welcome development. Thank you. Brent -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes: > UTF-8 is highly inefficient for indexing. Given a buffer of (say) a > few thousand bytes, how do you locate the 273rd character? How often do you need to do that, as opposed to traversing the string by iteration? Anyway, you could use a rope-like implementation, or an index structure over the string. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin wrote:
> How often do you need to do that, as opposed to traversing the string
> by iteration? Anyway, you could use a rope-like implementation, or an
> index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating data. You'd need to completely rework your algorithms to maintain a position somewhere.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: set and dict iteration
On Saturday, August 18, 2012 5:14:05 PM UTC-5, MRAB wrote:
> In simple terms, when you create an immutable object it can contain
> only references to pre-existing objects, but in order to create a
> cycle you need to make an object refer to another which is created
> later, so it's not possible to create a cycle out of immutable objects.
>
> However, using Python's C API it _is_ possible to create such a cycle,
> by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
> PyTuple_SET_ITEM).

Are there any precedents for storing uncounted references to PyObjects?

One apparent problematic case is creating an iterator to a set, then adding it to the set. However the operation is a modification, and causes the iterator to be removed from the secondary list before the set is examined for collection.

Otherwise, the iterator keeps a counted reference to the set, but the set does not keep a counted reference to the iterator, so the iterator will always be freed first. Therefore, the set's secondary list will be empty when the set is freed. Concurrent addition and deletion of iterators should be disabled, and the iterators should remove themselves from the set's secondary list before they decrement their references to the set.

Please refresh the earlier diagram; counted references are distinguished separately. Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes:
> >>> "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code. Obviously one can concoct hypothetical
examples that would suffer.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin wrote:
> Chris Angelico writes:
>> >>> "asdfqwer"[4:]
>> 'qwer'
>>
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
>
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code. Obviously one can concoct hypothetical
> examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I don't
have a Python example of parsing a huge string, but I've done it in
other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
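For concreteness, the "maintain a position somewhere" style Chris mentions looks like this in Python (a sketch with an invented helper name): the parser carries an integer offset forward instead of repeatedly slicing off the consumed prefix.

```python
def take_token(s, pos):
    """Return (token, new_pos): the run of non-space characters at pos.
    Carrying `pos` forward avoids re-slicing the whole tail of `s`,
    which matters if indexing were O(N) in a variable-width encoding."""
    while pos < len(s) and s[pos] == ' ':
        pos += 1                     # skip leading spaces
    start = pos
    while pos < len(s) and s[pos] != ' ':
        pos += 1                     # consume the token
    return s[start:pos], pos

text = "asdf qwer zxcv"
tok1, pos = take_token(text, 0)
tok2, pos = take_token(text, pos)
print(tok1, tok2)  # prints: asdf qwer
```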
Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 4:09 PM, Terry Reedy wrote:
print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0
This does not make sense to me and I will ask about it.
I did ask on pydef list and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 100 repetitions in a loop, the reported times
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'
I believe there are also whole-application benchmarks that try to mimic
real-world mixtures of operations.
People making improvements must consider performance on multiple systems
and multiple benchmarks. If someone wants to work on search speed, they
cannot just optimize that one operation on one system.
--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list
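The microsecond scale Terry's respondents mention is easy to see from the timeit module itself; a quick sketch (the absolute numbers will vary by system, which is rather the point of response 1):

```python
import timeit

# timeit.timeit() returns the *total* seconds for `number` executions,
# so dividing by `number` gives the per-operation cost.
n = 100000
total = timeit.timeit("c in a", setup="c = 'x'; a = 'a'*1000 + c", number=n)
per_op_us = total / n * 1e6
print("%.3f microseconds per substring search" % per_op_us)
```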
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few hundred
chars, so you can immediately get to the correct block, then scan into
the block to get to the desired char offset.

> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed a big
unicode string in a way that required random access to the nth
character in the string.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin wrote:
> Chris Angelico writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out of
it. I'll retain and/or calculate indices (when I'm not using pointers,
but that's a different kettle of fish). Generally, I'm working with
pure ASCII, but port those same algorithms to Python and you'll easily
be able to read in a file in some known encoding and manipulate it as
Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
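The "jump forward n characters" pattern reads like this in Python (a sketch using an invented toy record format, not Adobe's): with O(1) indexing, skipping a known-length payload is a single addition to the position.

```python
def parse_records(s):
    """Parse records of the form <length digit><payload>, e.g. '3abc4defg'.
    With O(1) string indexing, skipping a payload is one index addition;
    with O(N) indexing, every jump would cost a scan."""
    pos = 0
    records = []
    while pos < len(s):
        n = int(s[pos])                   # 1-digit length prefix (toy format)
        records.append(s[pos + 1 : pos + 1 + n])
        pos += 1 + n                      # jump over the payload in O(1)
    return records

print(parse_records("3abc4defg"))  # prints: ['abc', 'defg']
```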
Re: Encapsulation, inheritance and polymorphism
On 7/23/2012 11:18 AM, Albert van der Horst wrote:
> In article <[email protected]>,
> Steven D'Aprano wrote:
>> Even with a break, why bother continuing through the body of the
>> function when you already have the result? When your calculation is
>> done, it's done, just return for goodness sake. You wouldn't write a
>> search that keeps going after you've found the value that you want,
>> out of some misplaced sense that you have to look at every value. Why
>> write code with unnecessary guard values and temporary variables out
>> of a misplaced sense that functions must only have one exit?
>
> Example from recipes:
> Stir until the egg white is stiff.
>
> Alternative:
> Stir egg white for half an hour, but if the egg white is stiff keep
> your spoon still.
>
> (Cooking is not my field of expertise, so the wording may not be quite
> appropriate.)
>
> Groetjes Albert

Note that you forgot applying enough heat to do the cooking.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work. You're going to read a
large unicode text file (let's say it's UTF-8) into a single big
string? So the runtime library has to scan the encoded contents to find
the highest numbered codepoint (let's say it's mostly ascii but has a
few characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?
Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
--
http://mail.python.org/mailman/listinfo/python-list
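The decode-then-index trade-off Paul describes is visible directly in a small sketch: for UTF-8 data, byte offsets and character offsets diverge as soon as a multi-byte character appears, so one or the other view has to pay.

```python
text = "abc€"                 # '€' is U+20AC, a 3-byte sequence in UTF-8
data = text.encode("utf-8")

print(len(text))  # prints: 4   (characters in the decoded str)
print(len(data))  # prints: 6   (bytes: 3 ASCII + 3 for the euro sign)

# Indexing the decoded str is by character; indexing the bytes is by byte.
print(text[3])           # prints: €
print(data[3:] == "€".encode("utf-8"))  # prints: True
```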
Re: Encapsulation, inheritance and polymorphism
On Tuesday, July 17, 2012 12:39:53 PM UTC-7, Mark Lawrence wrote:
> I would like to spend more time on this thread, but unfortunately the
> 44 ton artic carrying "Java in a Nutshell Volume 1 Part 1 Chapter 1
> Paragraph 1 Sentence 1" has just arrived outside my abode and needs
> unloading :-)

That reminds me of a remark I made nearly 10 years ago:

"Well, I followed one friend's advice and investigated Java, perhaps a
little too quickly. I purchased Ivor Horton's _Beginning_Java_2_ book.
It is reasonably well-written. But how many pages did I have to read
before I got through everything I needed to know, in order to read and
write files? Four hundred! I need to keep straight detailed information
about objects, inheritance, exceptions, buffers, and streams, just to
read data from a text file???

I haven't actually sat down to program in Java yet. But at first glance,
it would seem to be a step backwards even from the procedural C
programming that I was doing a decade ago. I was willing to accept the
complexity of the Windows GUI, and program with manuals open on my lap.
It is a lot harder for me to accept that I will need to do this in order
to process plain old text, perhaps without even any screen output."

https://groups.google.com/d/topic/bionet.software/kk-EGGTHN1M/discussion

Some things never change! :^)
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".
On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.
Forget encodings! We're not talking about encodings. Encodings are used
for converting text as bytes for transmission over the wire or storage on
disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.
In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two-bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as
006c 0065 006e
(or possibly 6c00 6500 6e00, depending on whether your system is
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.
Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (roughly
the first 64000 Unicode code points). I'll come back to that.
In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:
0000006c 00000065 0000006e
or
6c000000 65000000 6e000000
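Those byte patterns can be checked from Python by encoding with the matching codecs, since for BMP text the UTF-16 and UTF-32 wire formats coincide with the narrow- and wide-build in-memory layouts (a quick sketch; bytes.hex() needs Python 3.5+):

```python
# Narrow-build layout: two little-endian bytes per character.
print("len".encode("utf-16-le").hex())  # prints: 6c0065006e00

# Wide-build layout: four bytes per character, shown big-endian here.
print("len".encode("utf-32-be").hex())  # prints: 0000006c000000650000006e
```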
Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.
This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more than four
bytes' worth of characters, it won't be called Unicode.)
Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.
UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 uses those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".
Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
Except this comes at a big cost: you can no longer tell how long a string
is by counting the number of bytes, which is fast, because sometimes four
bytes is two characters and sometimes it's one and you can't tell which
it will be until you actually inspect all four bytes.
Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 10th character in a string. The fast way using UCS-2 is to
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we
start counting at zero) and you're done. Fast and safe if you're willing
to give up the non-BMP characters.
It's also fast and safe if you use UCS-4, but then everything takes twice
as much space, so you probably end up spending so much time copying null
bytes that you're probably slower anyway. Especially when your OS starts
paging memory like mad.
But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8
and 9 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does: it fails to
handle surrogate pairs properly:
py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'
I've just split a single valid Unicode character into two invalid
characters. Python 3.2 will (probably) mindlessly process those two
non-characters, and the only sign I have that I did something wrong is that
my data is now junk.
Since any character can be a surrogate pair, you have to scan every pair
of bytes in order to index a string, or work out its length, or copy a
substring. It's not enough to just check if the last pair is a surrogate.
When you don't, you have bugs like this from Python 3.2:
py> s = "01234" + chr(0x + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)
Which is now fixed in Python 3.3.
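On Python 3.3 or later the same session comes out right; a quick check:

```python
# chr(0xFFFF + 1) is U+10000, the first code point outside the BMP.
s = "01234" + chr(0xFFFF + 1) + "6789"

# With PEP 393 the astral character is a single code point, not a
# surrogate pair, so lengths and indexes behave correctly:
print(len(s))       # prints: 10
print(s[9] == '9')  # prints: True
```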
So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.
But UCS-2 is sub-optimal, because it can only handle the BMP, and
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:
>>> I'm aware of this (and all the blah blah blah you are explaining).
>>> This always the same song. Memory.
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>
> No offense here. But this is an *american* answer.

I am not American. I am not aware that computers outside of the USA, and
Australia, have unlimited amounts of memory. You must be very lucky.

> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two
non-BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.

> Even for narrow build users, there is the benefit that with
> approximately the same amount of memory usage in most cases, they no
> longer have to worry about non-BMP characters sneaking in and breaking
> their code.

Yes! +1000 on that.

> There is some additional benefit for Latin-1 users, but this has
> nothing to do with Python. If Python is going to have the option of a
> 1-byte representation (and as long as we have the flexible
> representation, I can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."

> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings, Latin-1 is hardly the only one.

> because that's what 1-byte Unicode (UCS-1, if you will) is. If you
> have an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine. The Unicode standard refers to some
multiple million code points, far too many to fit in a single byte.
There is some historical justification for using "Unicode" to mean
UCS-2, but with the standard being extended beyond the BMP, that is no
longer valid. See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more
details.

I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
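The per-string storage choice that PEP 393 makes can be observed with sys.getsizeof (a sketch; the exact byte counts and the ASCII-vs-Latin-1 boundary vary across CPython versions, so only the ordering is asserted here):

```python
import sys

# Storage per character depends on the widest character in the string;
# the fixed overheads differ by version, but the growth is visible:
ascii_s = "a" * 1000
bmp_s   = "€" * 1000            # U+20AC fits in 2 bytes per character
astral  = chr(0x10000) * 1000   # outside the BMP, needs 4 bytes each

print(sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral))
# prints: True
```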
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:
> As I understand (I think) the undelying mechanism, I can only say, it is
> not a surprise that it happens.
>
> Imagine an editor, I type an "a", internally the text is saved as ascii,
> then I type en "é", the text can only be saved in at least latin-1. Then
> I enter an "€", the text become an internal ucs-4 "string". The remove
> the "€" and so on.
Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are
Asian or a historian using some obscure ancient script. NONE of the
examples you have shown in your emails have included 4-byte characters,
they have all been ASCII or UCS-2.
You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen.
In *both* Python 3.2 and 3.3, both é and € are represented by two bytes.
That will not change. There is a tiny amount of fixed overhead for
strings, and that overhead is slightly different between the versions,
but you'll never notice the difference.
Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
will be slow slow SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.
> Intuitively I expect there is some kind slow down between all these
> "strings" conversion.
Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to
UCS-4 on the fly, they are converted once, when the string is created.
The tests we ran earlier, e.g.:
('ab…' * 1000).replace('…', 'œ…')
show the *worst possible case* for the new string handling, because all
we do is create new strings. First we create a string 'ab…', then we
create another string 'ab…'*1000, then we create two new strings '…' and
'œ…', and finally we call replace and create yet another new string.
But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:
steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
10 loops, best of 3: 2.41 usec per loop
steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
10 loops, best of 3: 2.29 usec per loop
Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.
> When I tested this flexible representation, a few months ago, at the
> first alpha release. This is precisely what, I tested. String
> manipulations which are forcing this internal change and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.
Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.
> Does any body know a way to get the size of the internal "string" in
> bytes?
sys.getsizeof(some_string)
steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10038
As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:
steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
(size.__name__))"
34
You can check the object overhead by looking at the size of the empty
string.
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
> "a" will be stored as 1 byte/codepoint.
>
> Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2. I
don't know where people are getting this myth that PEP 393 uses Latin-1
internally, it does not. Read the PEP, it explicitly states that 1-byte
formats are only used for ASCII strings.

> Adding "€", it will still be stored as 2 bytes/codepoint.

That is correct.
--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
