Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit 
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044

timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I noticed, intuitively and empirically, that this happens for
cp1252 or mac-roman characters and not for characters which are
elements of the latin-1 coding scheme.

* Bad luck: such characters are common in French text
(and in some other European languages).

* I do not recall the extreme cases I found. Believe me, when
I speak about a slowdown of a few hundred percent, I do not lie.

My take on the subject.

This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from the
time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this: "Let's go with ucs-4, and the problems
are solved forever". He was so right.

I have been watching the dev-list for years; my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non ascii users" (see the unicode
literal reintroduction).

Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But if I try to take a more distant
view, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with
cp65001.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Dynamically determine base classes on instantiation

2012-08-18 Thread Mark Lawrence

On 18/08/2012 02:44, Steven D'Aprano wrote:


Makes you think that Google is interested in fixing the bugs in their
crappy web apps? They have become as arrogant and as obnoxious as
Microsoft used to be.



Charging off topic again, but I borrowed a book from the local library a 
couple of months back about Google Apps as it looked interesting.  I 
returned it in disgust rather rapidly as it was basically a "let's bash 
Microsoft" tome.


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Mark Lawrence

On 18/08/2012 06:42, Chris Angelico wrote:

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti  wrote:

Hi,

I'm new to regular expressions. I want to be able to match for tokens
with all their properties in the following examples. I would
appreciate some direction on how to proceed.


@foo1
@foo2()
@foo3(anything could go here)


You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA



Totally agree with the sentiment.  There's a comparison of python 
parsers here http://nedbatchelder.com/text/python-parsers.html


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

 sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
 timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764
> 
 sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
 imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346

"imeit"?

It is hard to take your results seriously when you have so obviously 
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on 
Python 3.2:

steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
1 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50 
microseconds per loop, regardless of which characters I use, whether they 
are in Latin-1 or not. The differences between one test and another are 
not meaningful.


Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
1 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python 3.3's
string handling is about 30% slower in the examples shown here.

If you can consistently replicate a 100% to 1000% slowdown in string 
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.



> My take of the subject.
> 
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes. 
Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus 
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but 
of course UCS-2 can only represent characters in the BMP. A pure ASCII 
string would only take 11 bytes, but we're not going back to pure ASCII.

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
using two code points. This is fragile and doesn't work very well, 
because string-handling methods can break the surrogate pairs apart, 
leaving you with invalid unicode string. Not good.)

The difference between 44 bytes and 22 bytes for one little string is not 
very important, but when you double the memory required for every single 
string it becomes huge. Remember that every class, function and method 
has a name, which is a string; every attribute and variable has a name, 
all strings; functions and classes have doc strings, all strings. Strings 
are used everywhere in Python, and doubling the memory needed by Python 
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient 
format possible:

- if it only contains ASCII characters, it will be stored using 1 byte 
per character;

- if it only contains characters in the BMP, it will be stored using 
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using 
UCS-4 (4 bytes per character).
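
As a quick illustration of those three widths (a minimal sketch;
sys.getsizeof includes a fixed per-object header, so only the growth per
extra character is meaningful):

import sys

for ch in 'a', 'é', '€', '\U00010000':
    base = sys.getsizeof(ch * 1000)
    # one extra character adds 1 byte (ASCII/Latin-1), 2 bytes (other BMP)
    # or 4 bytes (non-BMP) under PEP 393
    print(repr(ch), sys.getsizeof(ch * 1001) - base)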



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Roy Smith
In article 
<[email protected]>,
 Frank Koshti  wrote:

> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
> 
> 
> @foo1
> @foo2()
> @foo3(anything could go here)

Don't try to parse HTML with regexes.  Use a real HTML parser, such as 
lxml (http://lxml.de/).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)

2012-08-18 Thread Ramchandra Apte
I am aware of this. I'm just too lazy to use Google Groups! "Come on
Ramchandra, you can switch to Google Groups."

On 17 August 2012 13:09, rusi  wrote:

> On Aug 17, 3:36 am, Chris Angelico  wrote:
> > On Fri, Aug 17, 2012 at 1:40 AM, Ramchandra Apte 
> wrote:
> > > On 16 August 2012 21:00, Mark Lawrence 
> wrote:
> > >> and "bottom" reads better than "top"
> >
> > > Look you are the only person complaining about top-posting.
> > > GMail uses top-posting by default.
> > > I can't help it if you feel irritated by it.
> >
> > I post using gmail,
>
> If you register on the mailing list as well as google groups, you can
> then use googlegroups.
> Thereafter appropriately cutting out the unnecessary stuff is easy
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS

2012-08-18 Thread Ramchandra Apte
Please don't use all caps.

On 17 August 2012 18:16, coldfire  wrote:

> I would like to know where a python script can be stored on-line so that
> it keeps running and can be called any time when required using the
> internet.
> I have used the mechanize module, which creates a webbrowser instance to open
> a website and extract data and email me.
> I have tried PythonAnywhere but they don't support opening of anonymous
> websites.
> What is the current way to do this?
> Can someone point me in the right direction?
> My script has no interaction with the user. It just goes on-line, searches for
> something and emails me.
>
> Thanks
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pythonic interface to SAPI5?

2012-08-18 Thread Ramchandra Apte
A simple workaround is to use:
import subprocess
import time

speak = subprocess.Popen("espeak", stdin=subprocess.PIPE)
speak.stdin.write(b"Hello world!")  # bytes on Python 3
speak.stdin.close()                 # EOF lets espeak start speaking
time.sleep(1)
speak.terminate()  # end the speaking


On 17 August 2012 21:49, Vojtěch Polášek  wrote:

> Hi,
> I am developing audiogame for visually impaired users and I want it to
> be multiplatform. I know, that there is library called accessible_output
> but it is not working when used in Windows for me.
> I tried pyttsx, which should use Espeak on Linux and SAPI5 on Windows.
> It works on Windows, on Linux I decided to use speech dispatcher bindings.
> But it seems that I can't interrupt speech when using pyttsx and this is
> showstopper for me.
> Does anyone has any working solution for using SAPI5 on windows?
> Thank you very much,
> Vojta
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: remote read eval print loop

2012-08-18 Thread Ramchandra Apte
Not really. Try modifying ast.literal_eval. This will be quite secure.
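
For context, a small sketch of what ast.literal_eval accepts and rejects: it
only parses literal syntax and raises ValueError for names, calls or
attribute access.

import ast

print(ast.literal_eval("[1, 2, {'a': 3}]"))        # literals are fine
try:
    ast.literal_eval("__import__('os').system('echo hi')")
except ValueError:
    print("rejected: not a literal")               # names and calls are refused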

On 17 August 2012 19:36, Chris Angelico  wrote:

> On Fri, Aug 17, 2012 at 11:28 PM, Eric Frederich
>  wrote:
> > Within the debugging console, after importing all of the bindings, there
> > would be no reason to import anything whatsoever.
> > With just the bindings I created and the Python language we could do
> > meaningful debugging.
> > So if I block the ability to do any imports and calls to eval I should be
> > safe right?
>
> Nope. Python isn't a secured language in that way. I tried the same
> sort of thing a while back, but found it effectively impossible. (And
> this after people told me "It's not possible, don't bother trying". I
> tried anyway. It wasn't possible.)
>
> If you really want to do that, consider it equivalent to putting an
> open SSH session into your debugging console. Would you give that much
> power to your application's users? And if you would, is it worth
> reinventing SSH?
>
> ChrisA
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: [ANNC] pybotwar-0.8

2012-08-18 Thread Ramchandra Apte
On 17 August 2012 18:23, Hans Mulder  wrote:

> On 16/08/12 23:34:25, Walter Hurry wrote:
> > On Thu, 16 Aug 2012 17:20:29 -0400, Terry Reedy wrote:
> >
> >> On 8/16/2012 11:40 AM, Ramchandra Apte wrote:
> >>
> >>> Look you are the only person complaining about top-posting.
> >>
> >> No he is not. Recheck all the responses.
> >>
> >>> GMail uses top-posting by default.
> >>
> >> It only works if everyone does it.
> >>
> >>> I can't help it if you feel irritated by it.
> >>
> >> Your out-of-context comments are harder to understand. I mostly do not
> >> read them.
> >
> > It's strange, but I don't even *see* his contributions (I am using a
> > regular newsreader - on comp.lang.python - and I don't have him in the
> > bozo bin). It doesn't sound as though I'm missing much.
>
> I don't see him either.  That is to say: my ISP doesn't have his
> posts in comp.lang.python.  The group gmane.comp.python.general
> on Gmane has them, so if you're really curious, you can point
> your NNTP client at news.gmane.org.
>
> > But I'm just curious. Any idea why that would be the case?
>
> Maybe there's some kind of filter in the mail->usenet gateway?
>
> HTH,
>
> -- HansM
>

Let's not go overkill.
I'll be using Google Groups (hopefully it won't top-post by default) to
post stuff.

>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Steven D'Aprano
On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
> 
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular 
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, 
I'm going to ignore that advice and show you how to abuse regexes to 
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by 
any alphanumeric characters, followed by "</h1>".


We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)


First we import the re module. Then we define a pattern string. Note that 
I use a "raw string" instead of a regular string -- this is not 
compulsory, but it is very common.

The difference between a raw string and a regular string is how they 
handle backslashes. In Python, some (but not all!) backslashes are 
special. For example, the regular string "\n" is not two characters, 
backslash-n, but a single character, Newline. The Python string parser 
converts backslash combinations as special characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash
etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to 
disable the interpretation of backlash escapes when writing regex 
patterns. We do that with a "raw string" -- if you prefix the string with 
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n


Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash 
h 1 greater-than

Most of the characters shown just match themselves. For example, the @ 
sign will only match another @ sign. But some have special meaning to the 
regex:

\w doesn't match "backslash w", but any alphanumeric character;

+ doesn't match a plus sign, but tells the regex to match the previous 
symbol one or more times. Since it immediately follows \w, this means 
"match at least one alphanumeric character".

Now we feed that string into the re.compile, to create a pre-compiled 
regex. (This step is optional: any function which takes a compiled regex 
will also accept a string pattern. But pre-compiling regexes which you 
are going to use repeatedly is a good idea.)

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I which is a special 
value that tells the regular expression to ignore case, so "h" will match 
both "h" and "H".

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah spam
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""


And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular 
expression finds something that matches your pattern. If nothing matches, 
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care 
about the tags <h1>...</h1>, we only care about the "victory" part. Here's how to 
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory


Hope this helps.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Frank Koshti
I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes I've done my share of RTFM.
This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Frank Koshti
Hey Steven,

Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.

The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as Hello. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.

To do this, I decided to use:

re.compile('\$\w*\(?.*?\)').findall(mystring)

the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.

Thanks,
Frank
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
> [...]
> The problem with UCS-4 is that every character requires four bytes. 
> [...]

I'm aware of this (and all the blah blah blah you are
explaining). This always the same song. Memory.

Let me ask. Is Python an 'american" product for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize in practice the real impact is for many users
closed to zero (including me) but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Ian Kelly
(Resending this to the list because I previously sent it only to
Steven by mistake.  Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly  wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
>  wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted the PEP 393.  It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
-- 
http://mail.python.org/mailman/listinfo/python-list


How to get initial absolute working dir reliably?

2012-08-18 Thread kj


What's the most reliable way for "module code" to determine the
absolute path of the working directory at the start of execution?

(By "module code" I mean code that lives in a file that is not
meant to be run as a script, but rather it is meant to be loaded
as the result of some import statement.  In other words, "module
code" is code that must operate under the assumption that it can
be loaded at any time after the start of execution.)

Functions like os.path.abspath produce wrong results if the working
directory is changed, e.g. through os.chdir, so it is not terribly
reliable for determining the initial working directory.
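
For example, a minimal sketch of the failure mode (Unix paths and a
hypothetical data.txt assumed):

import os

os.chdir('/tmp')                     # something, somewhere, changes directory
print(os.getcwd())                   # '/tmp' -- the startup directory is gone
print(os.path.abspath('data.txt'))   # '/tmp/data.txt', resolved against the
                                     # current directory, not the initial one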

Basically, I'm looking for a read-only variable (or variables)
initialized by Python at the start of execution, and from which
the initial working directory may be read or computed.


Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 16:07, [email protected] wrote:

Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :

[...]
The problem with UCS-4 is that every character requires four bytes.
[...]


I'm aware of this (and all the blah blah blah you are
explaining). This always the same song. Memory.

Let me ask. Is Python an 'american" product for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize in practice the real impact is for many users
closed to zero (including me) but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf



Sorry but you've got me completely baffled.  Could you please explain in 
words of one syllable or less so I can attempt to grasp what the hell 
you're on about?


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)

2012-08-18 Thread Grant Edwards
On 2012-08-17, rusi  wrote:

> I was in a corporate environment for a while.  And carried my
> 'trim&interleave' habits there. And got gently scolded for seeming to
> hide things!!

I have, rarely, gotten the opposite reaction from "corporate e-mailers"
used to top posting.  I got one comment something like "That's cool
how you interleaved your responses -- it's like having a real
conversation."

-- 
Grant Edwards   grant.b.edwardsYow! Somewhere in Tenafly,
  at   New Jersey, a chiropractor
  gmail.comis viewing "Leave it to
   Beaver"!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 1:07 AM,   wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are going now to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Peter Otten
Frank Koshti wrote:

> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

>>> s = """
... $foo1
... $foo2()
... $foo3(anything could go here)
... """
>>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Ian Kelly
On Sat, Aug 18, 2012 at 9:07 AM,   wrote:
> Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

The change does not just benefit ASCII users.  It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters.  Even for narrow build users, there is the benefit
that with approximately the same amount of memory usage in most cases,
they no longer have to worry about non-BMP characters sneaking in and
breaking their code.

There is some additional benefit for Latin-1 users, but this has
nothing to do with Python.  If Python is going to have the option of a
1-byte representation (and as long as we have the flexible
representation, I can see no reason not to), then it is going to be
Latin-1 by definition, because that's what 1-byte Unicode (UCS-1, if
you will) is.  If you have an issue with that, take it up with the
designers of Unicode.
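
Concretely, every character representable in Latin-1 has a byte value equal
to its Unicode code point, for example (a small illustration):

ch = 'é'
print(hex(ord(ch)))             # 0xe9 -- the Unicode code point
print(ch.encode('latin-1'))     # b'\xe9' -- the Latin-1 byte has the same value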

>
> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> closed to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.

You've shown nothing of the sort.  You've demonstrated only one out of
many possible benchmarks, and other users on this list can't even
reproduce that.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Vlastimil Brom
2012/8/18 Frank Koshti :
> Hey Steven,
>
> Thank you for the detailed (and well-written) tutorial on this very
> issue. I actually learned a few things! Though, I still have
> unresolved questions.
>
> The reason I don't want to use an XML parser is because the tokens are
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue is
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>
> Thanks,
> Frank
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g.: r"...", instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.

I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "Hello")
'Hello'
>>>
is probably not the desired output.

For some kind of "processing" the matched text, you can use the
replace function instead of the replace pattern in re.sub too.
see
http://docs.python.org/library/re.html#re.sub

hth,
  vbr
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Frank Koshti
On Aug 18, 11:48 am, Peter Otten <[email protected]> wrote:
> Frank Koshti wrote:
> > I need to match, process and replace $foo(x=3), knowing that (x=3) is
> > optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
> >>> s = """
>
> ... $foo1
> ... $foo2()
> ... $foo3(anything could go here)
> ... """
> >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
>
> ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT-
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Jussi Piitulainen
Frank Koshti writes:

> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:

>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
('$foo', '')
('$foo', '(bar=3)')
('$foo', '($)')
('$foo', '')
('$bar', '(v=0)')

Here is the program:

import re

def grab(text):
    p = re.compile(r'([$]\w+)([(][^()]+[)])?')
    return re.findall(p, text)

def test(html):
    print(html)
    for hit in grab(html):
        print(hit)

if __name__ == '__main__':
    test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread python
Steven,

Well done!!!

Regards,
Malcolm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due the "flexible represention".

Deeper reason. The "boss" do not wish to hear from a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 2:38 AM,   wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.

Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace lookups do, among other things.
And how is performance in the (very real) case where a C routine wants
to return a value to Python as a string, where the data is currently
guaranteed to be ASCII (previously using PyUnicode_FromString, now
able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been
gone into in great detail before the PEP was accepted (am I
negative-bikeshedding here? "atomic reactoring"???), and I'm sure that
the gains outweigh the costs.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 17:38, [email protected] wrote:

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.


Proof that is acceptable to everybody please, not just yourself.



Now, the reason. I think it is due the "flexible represention".

Deeper reason. The "boss" do not wish to hear from a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf



--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting &c. (was Re: [ANNC] pybotwar-0.8)

2012-08-18 Thread rusi
On Aug 18, 8:34 pm, Grant Edwards  wrote:
> On 2012-08-17, rusi  wrote:
>
> > I was in a corporate environment for a while.  And carried my
> > 'trim&interleave' habits there. And got gently scolded for seeming to
> > hide things!!
>
> I have, rarely, gotten the opposite reaction from "corporate e-mailers"
> used to top posting.  I got one comment something like "That's cool
> how you interleaved your responses -- it's like having a real
> conversation."

Well sure. If I could civilize people around me, God (or Darwin?)
would give me a medal.
Usually though, I find it expedient to remember G.B. Shaw's:

"The reasonable man adapts himself to the world: the unreasonable one
persists to adapt the world to himself. Therefore all progress depends
on the unreasonable man."

and then decide exactly how (un)reasonable to be in a given context.

[No claims to always succeed in these calibrations :-) ]
And which brings me back to the question of how best to tell new folk
about the netiquette around here.

I was once teaching C to a batch of first year students.  I was
younger then and being more passionate and less reasonable, I made a
rule that students should indent their programs correctly.
[Nothing like python in sight those days!]

A few days later there was a commotion.  Students appeared in class
with black-badges, complained to the head-of-department and what not.
Very perplexed I said: "Why?! I allowed you to indent in any which way
you like as long as you have some rules and follow them." I imagined
that I had been perfectly reasonable and lenient!

Only later did I realize that students did not understand
- how to indent
- why to indent
- what program structure meant

So when people top-post it seems reasonable to assume that they don't
know better before jumping to conclusions of carelessness, rudeness,
inattention etc.

For example, my sister recently saw some of my mails and was mystified
that I had sent back 'blank mails' until I explained and pointed out
that my answers were interleaved into what was originally sent!
Clearly she had only ever seen (and therefore expected) top-posted
mail-threads.  Like your 'corporate-emailer' she found it damn neat,
after the initiation.

Whether such simple unfamiliarity with culture is the case in the
particular case (this thread's discussion) I am not sure.  Good to
remember Hanlon's razor...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Crashes always on Windows 7

2012-08-18 Thread Terry Reedy

On 8/18/2012 2:18 AM, [email protected] wrote:


Open using File>Open on the Shell


The important question, as I said in my previous post, is *exactly* what 
you do in the OpenFile dialog. Some things work, others do not.

And we (Python) have no control.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: python+libxml2+scrapy AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'

2012-08-18 Thread Stefan Behnel
Dmitry Arsentiev, 15.08.2012 14:49:
> Has anybody already meet the problem like this? -
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
> 
> When I run scrapy, I get
> 
>   File "/usr/local/lib/python2.7/site-packages/scrapy/selector/factories.py",
> line 14, in 
> libxml2.HTML_PARSE_NOERROR + \
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
> 
> 
> When I run
>  python -c 'import libxml2; libxml2.HTML_PARSE_RECOVER'
> 
> I get
> Traceback (most recent call last):
>   File "", line 1, in 
> AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
> 
> How can I cure it?
> 
> Python 2.7
> libxml2-python 2.6.9
> 2.6.11-gentoo-r6

That version of libxml2 is way too old and doesn't support parsing
real-world HTML. IIRC, that started with 2.6.21 and got improved a bit
after that.

Get a 2.8.0 installation, as someone pointed out already.

Stefan


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

> Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
> 
> I'm aware of this (and all the blah blah blah you are explaining). This
> always the same song. Memory.

Exactly. The reason it is always the same song is because it is an 
important song.

 
> Let me ask. Is Python an 'american" product for us-users or is it a tool
> for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so 
important. PEP 393 means that users who have only a few non-BMP 
characters don't have to pay the cost of UCS-4 for every single string in 
their application, only for the ones that actually require it. PEP 393 
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode 
cheaper for everyone, but to make ASCII strings more expensive so that 
everyone suffers equally. I reject that idea.


> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as 
Unicode supports, you can't use a single byte per character, or even two 
bytes. That is a fact of basic mathematics. Supporting 1114111 characters 
must be more expensive than supporting 128 of them.
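
In concrete terms: two bytes give 2**16 = 65,536 distinct values, while
Unicode defines 17 * 2**16 = 1,114,112 code points (U+0000 through U+10FFFF),
which needs 21 bits -- so a fixed-width encoding needs at least three, in
practice four, bytes per character.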

But why should you carry the cost of 4-bytes per character just because 
someday you *might* need a non-BMP character?



> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters 
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))"  # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))"  # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode 
available more widely.

 
> I recognize in practice the real impact is for many users closed to zero

Then what's the problem?


> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all. 

You have shown untrustworthy, edited timing results that don't match what 
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make 
any difference for real code that does useful work.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pythonic interface to SAPI5?

2012-08-18 Thread Vojtěch Polášek
Thank you very much,
I have found a DLL which is designed exactly for us and I use it through
ctypes.
Vojta
On 18.8.2012 15:44, Ramchandra Apte wrote:
> A simple workaround is to use:
> speak = subprocess.Popen("espeak",stdin = subprocess.PIPE)
> speak.stdin.write("Hello world!")
> time.sleep(1)
> speak.terminate() #end the speaking
>
>
>
> On 17 August 2012 21:49, Vojtěch Polášek  > wrote:
>
> Hi,
> I am developing audiogame for visually impaired users and I want it to
> be multiplatform. I know, that there is library called
> accessible_output
> but it is not working when used in Windows for me.
> I tried pyttsx, which should use Espeak on Linux and SAPI5 on Windows.
> It works on Windows, on Linux I decided to use speech dispatcher
> bindings.
> But it seems that I can't interrupt speech when using pyttsx and
> this is
> showstopper for me.
> Does anyone has any working solution for using SAPI5 on windows?
> Thank you very much,
> Vojta
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>
>
>

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
> 
> Proof that is acceptable to everybody please, not just yourself.
> 
> 
I can't; I'm only facing the fact that it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then I remove the "€" and so
on.

Intuitively I expect there is some kind of slowdown between
all these "string" conversions.

When I tested this flexible representation a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change. I concluded the result is not brilliant. Really,
a factor of 0.n up to 10.

These are simply my conclusions.

Related question.

Does anybody know a way to get the size of the internal
"string" in bytes? In a narrow or wide build it is easy:
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and struct, but without
success.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Steven D'Aprano  writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
> using two code points. This is fragile and doesn't work very well, 
> because string-handling methods can break the surrogate pairs apart, 
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient 
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread MRAB

On 18/08/2012 19:05, [email protected] wrote:

Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :


Proof that is acceptable to everybody please, not just yourself.



I can't; I'm only facing the fact that it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then I remove the "€" and so
on.


[snip]

"a" will be stored as 1 byte/codepoint.

Adding "é", it will still be stored as 1 byte/codepoint.

Adding "€", it will still be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
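
For instance, a minimal sketch of that build-then-join idiom:

parts = []
for ch in ('a', 'é', '€'):
    parts.append(ch)           # collect the pieces cheaply
text = ''.join(parts)          # one final string, sized once for its widest character
print(text)                    # aé€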
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> 
> 
> 
> > Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
> 
> >> [...]
> 
> >> The problem with UCS-4 is that every character requires four bytes.
> 
> >> [...]
> 
> > 
> 
> > I'm aware of this (and all the blah blah blah you are explaining). This
> 
> > always the same song. Memory.
> 
> 
> 
> Exactly. The reason it is always the same song is because it is an 
> 
> important song.
> 
> 
No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread rusi
On Aug 18, 10:59 pm, Steven D'Aprano  wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.
>
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of: 
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread MRAB

On 18/08/2012 19:26, Paul Rubin wrote:

Steven D'Aprano  writes:

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with invalid unicode string. Not good.)

...

With PEP 393, each Python string will be stored in the most efficient
format possible:


Can you explain the issue of "breaking surrogate pairs apart" a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.


On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
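
A concrete sketch of that failure, assuming a narrow (UTF-16) build such as
the 3.2 Windows binaries:

s = chr(0x1F600)       # one non-BMP character
print(len(s))          # 2 on a narrow build (1 on a wide build or 3.3+)
print(repr(s[0]))      # '\ud83d' -- a lone surrogate: the pair has been split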
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 19:30, [email protected] wrote:

Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:




Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :



[...]



The problem with UCS-4 is that every character requires four bytes.



[...]







I'm aware of this (and all the blah blah blah you are explaining). This



always the same song. Memory.




Exactly. The reason it is always the same song is because it is an

important song.



No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf



Thinking about it I entirely agree with you.  Steven D'Aprano strikes me 
as typically American, in the same way that I'm typically Brazilian :)


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 19:40, rusi wrote:

On Aug 18, 10:59 pm, Steven D'Aprano  wrote:

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

Is there any reason why non ascii users are somehow penalized compared
to ascii users?


Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?


I am reminded of: 
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html



ROFLMAO doesn't adequately sum up how much I laughed.

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Terry Reedy

On 8/18/2012 12:38 PM, [email protected] wrote:

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.


You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*1 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c  = '…'; a = 'a'*1"))
3.3: .05 (independent of len(a)!)
3.2: 5.8  100 times slower! Increase len(a) and the ratio can be made as 
high as one wants!


print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3:  .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is 
not a panacea; it has problems of time, space, and system compatibility 
(Windows and others). Victor Stinner, whatever he may have once thought 
and said, put a *lot* of effort into making the new implementation both 
correct and fast.


On your replace example
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> 1.2918679017971044

I do not see the point of changing both length and replacement. For me, 
the time is about the same for either replacement. I do see about the 
same slowdown ratio for 3.3 versus 3.2. I also see it for pure search 
without replacement.


print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: Regex Question

2012-08-18 Thread Frank Koshti
On Aug 18, 12:22 pm, Jussi Piitulainen 
wrote:
> Frank Koshti writes:
> > not always placed in HTML, and even in HTML, they may appear in
> > strange places, such as Hello. My specific issue
> > is I need to match, process and replace $foo(x=3), knowing that
> > (x=3) is optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> Adding a ? after the meant-to-be-optional expression would let the
> regex engine know what you want. You can also separate the mandatory
> and the optional part in the regex to receive pairs as matches. The
> test program below prints this:
>
> >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
> ('$foo', '')
> ('$foo', '(bar=3)')
> ('$foo', '($)')
> ('$foo', '')
> ('$bar', '(v=0)')
>
> Here is the program:
>
> import re
>
> def grab(text):
>     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
>     return re.findall(p, text)
>
> def test(html):
>     print(html)
>     for hit in grab(html):
>         print(hit)
>
> if __name__ == '__main__':
>     test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :
> On Aug 18, 10:59 pm, Steven D'Aprano  
> [email protected]> wrote:
> 
> > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> 
> > > Is there any reason why non ascii users are somehow penalized compared
> 
> > > to ascii users?
> 
> >
> 
> > Of course there is a reason.
> 
> >
> 
> > If you want to represent 1114111 different characters in a string, as
> 
> > Unicode supports, you can't use a single byte per character, or even two
> 
> > bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> 
> > must be more expensive than supporting 128 of them.
> 
> >
> 
> > But why should you carry the cost of 4-bytes per character just because
> 
> > someday you *might* need a non-BMP character?
> 
> 
> 
> I am reminded of: 
> http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
> 
> 
> 
> Original above does not open for me but here's a copy that does:
> 
> 
> 
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I'm considering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizens on this planet in the same way.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: set and dict iteration

2012-08-18 Thread Aaron Brady
On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady  wrote:
> 
> > Is there a problem with hacking on the Beta?
> 
> 
> 
> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> work onto it. I doubt that anything of this nature will be changed
> 
> between now and then.
> 
> 
> 
> ChrisA

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it’s possible 
to prove that no reference cycle can be composed entirely of tuples. 
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO 
not obviously sufficient.  Nevertheless, the earlier diagram contains an overt 
homogeneous reference cycle.

Reposting: 
http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set don't need 
to visit the auxiliary collection; the same fields of the iterators don't need 
to visit the primary set or other iterators; and references in the linked list 
don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker, 
help prove or disprove the above?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get initial absolute working dir reliably?

2012-08-18 Thread Jason Swails
On Sat, Aug 18, 2012 at 11:19 AM, kj  wrote:

>
> Basically, I'm looking for a read-only variable (or variables)
> initialized by Python at the start of execution, and from which
> the initial working directory may be read or computed.
>

This will work for Linux and Mac OS X (and maybe Cygwin, but unlikely for
native Windows): try the PWD environment variable.

>>> import os
>>> os.getcwd()
'/Users/swails'
>>> os.getenv('PWD')
'/Users/swails'
>>> os.chdir('..')
>>> os.getcwd()
'/Users'
>>> os.getenv('PWD')
'/Users/swails'

Of course this environment variable can still be messed with, but there
isn't much reason to do so generally (if I'm mistaken here, someone please
correct me).
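If PWD cannot be trusted, a minimal alternative (my sketch, using a
hypothetical helper module name, not something from the standard library) is
to capture os.getcwd() once in a module that is imported before anything
calls os.chdir(), and treat that value as the initial directory:

# initialcwd.py -- hypothetical helper; import it before any os.chdir() happens.
import os

# Captured exactly once, at import time, while the process is still in its
# original working directory.
INITIAL_CWD = os.getcwd()

Any code imported later can read initialcwd.INITIAL_CWD regardless of
subsequent os.chdir() calls, because module-level code runs only on first
import.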

Hopefully this is of some help,
Jason
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 21:22, [email protected] wrote:

On Saturday, August 18, 2012, 20:40:23 UTC+2, rusi wrote:

On Aug 18, 10:59 pm, Steven D'Aprano wrote:

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

Is there any reason why non ascii users are somehow penalized compared
to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

I am reminded of:
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:
http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.


In plain English, duck out cos I'm losing.



You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.


I'll ask for the second time.  Provide proof that is acceptable to 
everybody and not just yourself.




If you see the present status as an optimisation, I'm considering
this as a regression.


Considering does not equate to proof.  Where are the figures which back 
up your claim?




I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.


I look forward to seeing your patch on the bug tracker.  If and only if 
you can find something that needs patching, which from the course of 
this thread I think is highly unlikely.





To be extreme, tools using pure utf-16 or utf-32 are, at least,
treating all the citizens on this planet in the same way.

jmf




--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: set and dict iteration

2012-08-18 Thread MRAB

On 18/08/2012 21:29, Aaron Brady wrote:

On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:

On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady  wrote:

> Is there a problem with hacking on the Beta?



Nope. Hack on the beta, then when the release arrives, rebase your

work onto it. I doubt that anything of this nature will be changed

between now and then.



ChrisA


Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it’s possible 
to prove that no reference cycle can be composed entirely of tuples.
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO 
not obviously sufficient.  Nevertheless, the earlier diagram contains an overt 
homogeneous reference cycle.

Reposting: 
http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set don't need 
to visit the auxiliary collection; the same fields of the iterators don't need 
to visit the primary set or other iterators; and references in the linked list 
don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker, 
help prove or disprove the above?


In simple terms, when you create an immutable object it can contain
only references to pre-existing objects, but in order to create a cycle
you need to make an object refer to another which is created later, so
it's not possible to create a cycle out of immutable objects.

However, using Python's C API it _is_ possible to create such a cycle,
by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
PyTuple_SET_ITEM).
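
A small pure-Python illustration of that argument (added here; it is not part
of MRAB's post): a mutable container can be made to refer to itself after
creation, while a tuple can only capture objects that already exist, so no
cycle forms without C-level mutation.

import gc

# A list is mutable, so it can be made to refer to itself after creation,
# producing a reference cycle that only the cycle detector can reclaim.
cycle = []
cycle.append(cycle)
del cycle                   # the refcount never reaches zero on its own...
print(gc.collect() > 0)     # ...typically True: the collector found the cycle

# A tuple is populated once, at creation time, from pre-existing objects,
# so pure Python code cannot make a tuple that directly contains itself.
t = (1, 2, 3)
try:
    t[0] = t
except TypeError as e:
    print(e)                # 'tuple' object does not support item assignment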

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin  wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
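
As a rough sketch (mine, not from Chris's post) of why that scan is O(n):
locating the nth character in a UTF-8 buffer means counting character-start
bytes from the beginning.

def utf8_index(buf, n):
    """Return the byte offset of the n-th character (0-based) in UTF-8 bytes."""
    count = -1
    for offset, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a character starts here
            count += 1
            if count == n:
                return offset
    raise IndexError("string index out of range")

buf = "aé€\U00010348z".encode("utf-8")   # 1-, 2-, 3- and 4-byte characters
assert buf[utf8_index(buf, 3):].decode("utf-8") == "\U00010348z"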

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: [pyxl] xlrd 0.8.0 released!

2012-08-18 Thread Brent Marshall
My compliments to John and Chris and to any others who contributed to the 
new xlsx capability. This is a most welcome development. Thank you.

Brent
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico  writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character? 

How often do you need to do that, as opposed to traversing the string by
iteration?  Anyway, you could use a rope-like implementation, or an
index structure over the string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin  wrote:
> Chris Angelico  writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?
>
> How often do you need to do that, as opposed to traversing the string by
> iteration?  Anyway, you could use a rope-like implementation, or an
> index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.
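
To make that concrete (an illustration added here, not Chris's): with a UTF-8
byte buffer the position you carry around has to be a byte offset, and it
stops matching the character index as soon as a multi-byte character appears.

s = "café: asdfqwer"
buf = s.encode("utf-8")

print(s[6:])                    # character indexing: 'asdfqwer'
print(buf[6:].decode("utf-8"))  # the same number as a byte offset: ' asdfqwer'
# Character 6 actually starts at byte 7, because 'é' occupies two bytes.
print(buf[7:].decode("utf-8"))  # 'asdfqwer'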

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: set and dict iteration

2012-08-18 Thread Aaron Brady
On Saturday, August 18, 2012 5:14:05 PM UTC-5, MRAB wrote:
> On 18/08/2012 21:29, Aaron Brady wrote:
> 
> > On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> 
> >> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady  wrote:
> 
> >>
> 
> >> > Is there a problem with hacking on the Beta?
> 
> >>
> 
> >>
> 
> >>
> 
> >> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> >>
> 
> >> work onto it. I doubt that anything of this nature will be changed
> 
> >>
> 
> >> between now and then.
> 
> >>
> 
> >>
> 
> >>
> 
> >> ChrisA
> 
> >
> 
> > Thanks Chris, your post was encouraging.
> 
> >
> 
> > I have a question about involving the 'tp_clear' field of the types.
> 
> >
> 
> > http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear
> 
> >
> 
> > '''
> 
> > ...The tuple type does not implement a tp_clear function, because it’s 
> > possible to prove that no reference cycle can be composed entirely of 
> > tuples.
> 
> > '''
> 
> >
> 
> > I didn't follow the reasoning in the proof; the premise is necessary but 
> > IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains 
> > an overt homogeneous reference cycle.
> 
> >
> 
> > Reposting: 
> > http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png
> 
> >
> 
> > In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set don't 
> > need to visit the auxiliary collection; the same fields of the iterators 
> > don't need to visit the primary set or other iterators; and references in 
> > the linked list don't need to be included in the iterators' reference 
> > counts.
> 
> >
> 
> > Can someone who is more familiar with the cycle detector and cycle breaker, 
> > help prove or disprove the above?
> 
> >
> 
> In simple terms, when you create an immutable object it can contain
> 
> only references to pre-existing objects, but in order to create a cycle
> 
> you need to make an object refer to another which is created later, so
> 
> it's not possible to create a cycle out of immutable objects.
> 
> 
> 
> However, using Python's C API it _is_ possible to create such a cycle,
> 
> by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
> 
> PyTuple_SET_ITEM).

Are there any precedents for storing uncounted references to PyObjects?

One apparent problematic case is creating an iterator to a set, then adding it 
to the set.  However the operation is a modification, and causes the iterator 
to be removed from the secondary list before the set is examined for collection.

Otherwise, the iterator keeps a counted reference to the set, but the set does 
not keep a counted reference to the iterator, so the iterator will always be 
freed first.  Therefore, the set's secondary list will be empty when the set is 
freed.

Concurrent addition and deletion of iterators should be disabled, and the 
iterators should remove themselves from the set's secondary list before they 
decrement their references to the set.

Please refresh the earlier diagram; counted references are distinguished 
separately.  Reposting: 
http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico  writes:
 "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal.  It gets more expensive if you
want to index far more deeply into the string.  I'm asking how often
that is done in real code.  Obviously one can concoct hypothetical
examples that would suffer.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin  wrote:
> Chris Angelico  writes:
> "asdfqwer"[4:]
>> 'qwer'
>>
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
>
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal.  It gets more expensive if you
> want to index far more deeply into the string.  I'm asking how often
> that is done in real code.  Obviously one can concoct hypothetical
> examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Terry Reedy

On 8/18/2012 4:09 PM, Terry Reedy wrote:


print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.


I did ask on the pydev list, and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1,000,000 repetitions in a loop, the reported times 
are microseconds per operation and thus not practically significant.'

3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic 
real-world mixtures of operations.


People making improvements must consider performance on multiple systems 
and multiple benchmarks. If someone wants to work on search speed, they 
cannot just optimize that one operation on one system.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico  writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.
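
A minimal sketch of such a block index (my construction, not Paul's): record
the byte offset of every BLOCK-th character once, then any lookup scans at
most one block.

BLOCK = 256                               # characters per block (arbitrary)

def build_index(buf):
    """Byte offsets of characters 0, BLOCK, 2*BLOCK, ... in a UTF-8 buffer."""
    offsets, count = [], 0
    for pos, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:         # a character starts at this byte
            if count % BLOCK == 0:
                offsets.append(pos)
            count += 1
    return offsets

def char_offset(buf, index, n):
    """Byte offset of the n-th character, scanning at most one block."""
    pos = index[n // BLOCK]
    for _ in range(n % BLOCK):
        pos += 1
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1                      # skip continuation bytes
    return pos

buf = ("abc€" * 1000).encode("utf-8")
idx = build_index(buf)
assert buf[char_offset(buf, idx, 1234):].decode("utf-8").startswith("c€")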

> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin  wrote:
> Chris Angelico  writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-18 Thread Robert Miles

On 7/23/2012 11:18 AM, Albert van der Horst wrote:

In article <[email protected]>,
Steven D'Aprano   wrote:

Even with a break, why bother continuing through the body of the function
when you already have the result? When your calculation is done, it's
done, just return for goodness sake. You wouldn't write a search that
keeps going after you've found the value that you want, out of some
misplaced sense that you have to look at every value. Why write code with
unnecessary guard values and temporary variables out of a misplaced sense
that functions must only have one exit?


Example from recipes:

Stir until the egg white is stiff.

Alternative:
Stir egg white for half an hour,
but if the egg white is stiff keep your spoon still.

(Cooking is not my field of expertise, so the wording may
not be quite appropriate. )


--
Steven


Groetjes Albert


Note that you forgot to apply enough heat to do the cooking.


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico  writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.  

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work.  You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-18 Thread John Ladasky
On Tuesday, July 17, 2012 12:39:53 PM UTC-7, Mark Lawrence wrote:

> I would like to spend more time on this thread, but unfortunately the 44 
> ton artic carrying "Java in a Nutshell Volume 1 Part 1 Chapter 1 
> Paragraph 1 Sentence 1" has just arrived outside my abode and needs 
> unloading :-)

That reminds me of a remark I made nearly 10 years ago:

"Well, I followed one friend's advice and investigated Java, perhaps a little 
too quickly.  I purchased Ivor Horton's _Beginning_Java_2_ book.  It is 
reasonably well-written.  But how many pages did I have to read before I got 
through everything I needed to know, in order to read and write files?  Four 
hundred!  I need to keep straight detailed information about objects, 
inheritance, exceptions, buffers, and streams, just to read data from a text 
file???

I haven't actually sat down to program in Java yet.  But at first glance, it 
would seem to be a step backwards even from the procedural C programming that I 
was doing a decade ago.  I was willing to accept the complexity of the Windows 
GUI, and program with manuals open on my lap.  It is a lot harder for me to 
accept that I will need to do this in order to process plain old text, perhaps 
without even any screen output."

https://groups.google.com/d/topic/bionet.software/kk-EGGTHN1M/discussion

Some things never change!  :^)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
This is a long post. If you don't feel like reading an essay, skip to the 
very bottom and read my last few paragraphs, starting with "To recap".


On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

> Steven D'Aprano  writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
> 
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used 
for converting text as bytes for transmission over the wire or storage on 
disk. PEP 393 talks about the internal representation of text within 
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a 
"narrow build", text is stored using two-bytes per character, so the 
string "len" (as in the name of the built-in function) will be stored as 

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is 
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much 
memory as needed. This standard data structure is called UCS-2, and it 
only handles characters in the Basic Multilingual Plane, the BMP (roughly 
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len" 
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000
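
An example added here (not in the original post): the fixed-width utf-16 and
utf-32 codecs happen to produce the same per-character layout, so they are a
convenient way to see those byte patterns, even though the point above is
about internal storage rather than encodings.

import binascii

print(binascii.hexlify("len".encode("utf-16-be")))  # b'006c0065006e'  (2 bytes per character)
print(binascii.hexlify("len".encode("utf-16-le")))  # b'6c0065006e00'
print(binascii.hexlify("len".encode("utf-32-be")))  # b'0000006c000000650000006e'  (4 bytes per character)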

Now memory is cheap, but it's not *that* cheap, and no matter how much 
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode 
character set, for now and forever. (If we ever need more that four-bytes 
worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters 
[technically: code points] in the Basic Multilingual Plane? There's an 
extension to UCS-2 called UTF-16 which extends it to the entire Unicode 
range. Yes, that's the same name as the UTF-16 encoding, because it's 
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but 
characters outside the BMP by four bytes." There's a neat trick to this: 
the BMP doesn't use the entire two-byte range, so there are some byte 
pairs which are illegal in UCS-2 -- they don't correspond to *any* 
character. UTF-16 used those byte pairs to signal "this is half a 
character, you need to look at the next pair for the rest of the 
character".

Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
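
To see the mechanism directly (an example added here, not part of the
original post), encode the first non-BMP code point with a fixed-width
16-bit codec and inspect the two code units it produces:

ch = chr(0x10000)                  # first code point outside the BMP
data = ch.encode("utf-16-le")
units = [int.from_bytes(data[i:i+2], "little") for i in range(0, len(data), 2)]
print([hex(u) for u in units])     # ['0xd800', '0xdc00'] -- a surrogate pair
print(len(ch), len(data))          # 1 character, 4 bytes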

Except this comes at a big cost: you can no longer tell how long a string 
is by counting the number of bytes, which is fast, because sometimes four 
bytes is two characters and sometimes it's one and you can't tell which 
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow or buggy. Say you want to 
grab the 10th character in a string. The fast way using UCS-2 is to 
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we 
start counting at zero) and you're done. Fast and safe if you're willing 
to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice 
as much space, so you probably end up spending so much time copying null 
bytes that you're slower anyway. Especially when your OS starts 
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8 
and 9 are half of a surrogate pair, and you've now split the pair and 
ended up with an invalid string. That's what Python 3.2 does, it fails to 
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid 
characters. Python 3.2 will (probably) mindlessly process those two non-
characters, and the only sign I have that I did something wrong is that 
my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair 
of bytes in order to index a string, or work out its length, or copy a 
substring. It's not enough to just check if the last pair is a surrogate. 

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0x + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the 
internal representation of strings -- they are either fast or correct but 
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 is 
sub-optimal because it wastes memory. PEP 393 resolves that dilemma by 
storing each string in the most efficient fixed-width format able to hold 
its contents.

Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This always the same song. Memory.
>> 
>> 
>> 
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>> 
>> 
> No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have 
unlimited amounts of memory. You must be very lucky.


> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users.  It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings 
containing non-BMP characters, then you will see a big benefit.
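
A quick way to see that on a 3.3 build (an illustration added here; exact
byte counts vary by platform and version): a single astral character forces
the whole string into the 4-byte format.

import sys

bmp = "x" * 1000                # every character fits in the BMP (here, ASCII)
astral = bmp + "\U00010000"     # one non-BMP character added at the end
print(sys.getsizeof(bmp))       # roughly 1000 bytes of payload (1 byte/char)
print(sys.getsizeof(astral))    # roughly 4004 bytes of payload (4 bytes/char)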


> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.


> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python.  If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to), 

The PEP explicitly states that it only uses a 1-byte format for ASCII 
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject 
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient)."


> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number 
of 1-byte encodings, Latin-1 is hardly the only one.


> because that's what 1-byte Unicode (UCS-1, if you will) is.  If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode" 
or UCS-1, as far as I can determine.

The Unicode standard defines more than a million code points, far too 
many to fit in a single byte. There is some historical justification for 
using "Unicode" to mean UCS-2, but with the standard being extended 
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers 
deliberately matched the Latin-1 standard for Unicode's first 256 code 
points. That's not the same thing though: there is no Unicode standard 
mapping to a single byte format.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

> As I understand (I think) the underlying mechanism, I can only say it is
> not a surprise that it happens.
> 
> Imagine an editor: I type an "a", internally the text is saved as ascii,
> then I type an "é", the text can only be saved in at least latin-1. Then
> I enter a "€", the text becomes an internal ucs-4 "string". Then remove
> the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and 
so is nearly every character you're ever going to use unless you are 
Asian or a historian using some obscure ancient script. NONE of the 
examples you have shown in your emails have included 4-byte characters, 
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and 
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. 
That will not change. There is a tiny amount of fixed overhead for 
strings, and that overhead is slightly different between the versions, 
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text 
that you type is not the same as how Python does it. A text editor is not 
going to be creating a new immutable string after every key press. That 
will be slow slow SLOW. The usual way is to keep a buffer for each 
paragraph, and add and subtract characters from the buffer.


> Intuitively I expect there is some kind slow down between all these
> "strings" conversion.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to 
UCS-4 on the fly, they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all 
we do is create new strings. First we create a string 'ab…', then we 
create another string 'ab…'*1000, then we create two new strings '…' and 
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just 
immediately create a new one and throw the old one away. You likely do 
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
10 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
10 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of 
deciding whether they should be stored using 1, 2 or 4 bytes begins to 
fade into the noise.


> When I tested this flexible representation, a few months ago, at the
> first alpha release, this is precisely what I tested: string
> manipulations which force this internal change, and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable 
slow-down on Windows, report it as a bug.


> Does anybody know a way to get the size of the internal "string" in
> bytes? 

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will 
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty 
string.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:

> "a" will be stored as 1 byte/codepoint.
> 
> Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1 
internally, it does not. Read the PEP, it explicitly states that 1-byte 
formats are only used for ASCII strings.


> Adding "€", it will still be stored as 2 bytes/codepoint.

That is correct.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list