On 10/06/13 22:55, Oscar Benjamin wrote:
> Yes, but I thought it was using a different unambiguous and easier
> (for me) to understand definition of decimal digits.
It's no easier. You have a list (in your head) of characters that are decimal
digits. In your head, you have ten of them, because you are an English speaker
and probably have never learned any language that uses other digits, or used
DOS codepages or Windows charsets with alternate versions of digits (such as
the East Asian full width and narrow width forms). You probably *have* used
charsets like Latin-1 containing ¹²³ but probably not often enough to think
about them as potentially digits. But either way, in your head you have a list
of decimal digits.
Python also has a list of decimal digits, except it is longer.
(Strictly speaking, it probably doesn't keep an explicit list "these chars are
digits" in memory, but possibly looks them up in a Unicode property database as
needed. But then, who knows how memories and facts are stored in the human brain?
Strictly speaking, there's probably no list in your head either.)
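If you want to see that longer list in action, the standard library's unicodedata module exposes the same Unicode property that int() consults (a quick interactive sketch; the \N escapes just name non-ASCII digits):

>>> import unicodedata
>>> unicodedata.decimal('5')                            # ASCII digit
5
>>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT FIVE}')  # Arabic-Indic five
5
>>> unicodedata.decimal('\N{FULLWIDTH DIGIT FIVE}')     # East Asian fullwidth five
5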
> I guess that I'm
> just coming to realise exactly what Python 3's unicode support really
> means and in many cases it means that the interpreter is doing things
> that I don't want or need.
With respect, it's not that you don't want or need them, but that you don't
*know* that you actually do want and need them. (I assume you are releasing
software for others to use. If all your software is private, for your own use
and nobody else, then you may not care.) If your software accepts numeric
strings from the user -- perhaps it reads a file, perhaps it does something
like this:
number = int(input("Please enter a number: "))
-- you want it to do the right thing when the user enters a number. Thanks to
the Internet, your program is available to people all over the world. Well, in
probably half the world, those digits are not necessarily the same as ASCII
0-9. Somebody downloads your app in Japan, points it at a data file containing
fullwidth or halfwidth digits, and in Python 3 it just works. (Provided, of
course, that you don't sabotage its ability to do so with inappropriate
ASCII-only data validation.)
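To make that concrete (an interactive sketch; the escapes spell out fullwidth and Thai digits so this stays copy-and-pasteable):

>>> int('\N{FULLWIDTH DIGIT ONE}\N{FULLWIDTH DIGIT TWO}\N{FULLWIDTH DIGIT THREE}')
123
>>> int('\N{THAI DIGIT FOUR}\N{THAI DIGIT TWO}')
42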
> For example I very often pipe streams of ascii numeric text from one
> program to another.
No you don't. Never. Not once in the history of computers has anyone ever piped
streams of text from one program to another. They just *think* they have.
They pipe *bytes* from one program to another.
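In Python 3 that split is explicit: the text layer sits on top of the binary stream, and you can see it, or step around it, directly. A small sketch using nothing but the standard sys module:

import sys

# sys.stdout is a text wrapper (TextIOWrapper); the pipe itself is the
# binary stream underneath, exposed as sys.stdout.buffer.
sys.stdout.write('42\n')            # str in; encoded to bytes for you
sys.stdout.flush()                  # flush the text layer before going under it
sys.stdout.buffer.write(b'42\n')    # bytes straight to the pipe, no encode step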
> In some cases the cost of converting to/from
> decimal is actually significant and Python 3 will add to this both
> with a more complex conversion
Let your mind be at rest on this account. Python 3.3 int() is nearly twice as
fast as Python 2.7 for short strings:
[steve@ando ~]$ python2.7 -m timeit "int('12345')"
1000000 loops, best of 3: 0.924 usec per loop
[steve@ando ~]$ python3.3 -m timeit "int('12345')"
1000000 loops, best of 3: 0.485 usec per loop
and about 25% faster for long strings:
[steve@ando ~]$ python2.7 -m timeit "int('1234567890'*5)"
100000 loops, best of 3: 2.06 usec per loop
[steve@ando ~]$ python3.3 -m timeit "int('1234567890'*5)"
1000000 loops, best of 3: 1.45 usec per loop
It's a little slower when converting the other way:
[steve@ando ~]$ python2.7 -m timeit "str(12345)"
1000000 loops, best of 3: 0.333 usec per loop
[steve@ando ~]$ python3.3 -m timeit "str(12345)"
1000000 loops, best of 3: 0.5 usec per loop
but for big numbers, the difference is negligible:
[steve@ando ~]$ python2.7 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.12 usec per loop
[steve@ando ~]$ python3.3 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.16 usec per loop
and in any case, the time taken to convert to a string is trivial.
> and with its encoding/decoding part of
> the io stack. I'm wondering whether I should really just be using
> binary mode for this kind of thing in Python 3 since this at least
> removes an unnecessary part of the stack.
I'm thinking that you're engaging in premature optimization. Have you profiled
your code to confirm that the bottlenecks are where you think they are?
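If in doubt, measure. Something along these lines (parse_file and data.txt are made-up names standing in for whatever your pipeline actually does) will show whether int() conversion or the io layer is anywhere near the top of the profile:

import cProfile
import pstats

def parse_file(path):
    # Hypothetical stand-in for the real work: read lines, convert, sum.
    total = 0
    with open(path) as f:
        for line in f:
            total += int(line)
    return total

cProfile.run('parse_file("data.txt")', 'parse.stats')
pstats.Stats('parse.stats').sort_stats('cumulative').print_stats(10)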
> In a previous thread where I moaned about the behaviour of the int()
> function Eryksun suggested that it would be better if int() wasn't used
> for parsing strings at all. Since then I've thought about that and I
> agree. There should be separate functions for each kind of string to
> number conversion with one just for ascii decimal only.
I think that is a terrible, terrible idea. It moves responsibility for something
absolutely trivial ("convert a string to a number") from the language to the
programmer, *who will get it wrong*.
# The right way:
number = int(string)

# The wrong, horrible, terrible way (and buggy too):
try:
    number = ascii_int(string)
except ValueError:
    try:
        number = fullwidth_int(string)
    except ValueError:
        try:
            number = halfwidth_int(string)
        except ValueError:
            try:
                number = thai_int(string)
            except ...
                # and so on, for a dozen or so other scripts...
                # oh gods, think of the indentation!!!
            except ValueError:
                # Maybe it's a mixed script number?
                # Fall back to char by char conversion.
                n = 0
                for c in string:
                    if c in ascii_digits:
                        n = 10*n + ord(c) - ord('0')
                    elif c in halfwidth_digits:
                        n = 10*n + ord(c) - ord('\N{HALFWIDTH DIGIT ZERO}')
                    elif ...  # and so forth
                    else:
                        raise ValueError
Of course, there are less stupid ways to do this. But you don't have to,
because it already works.
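(For the record, the "less stupid way" is only a few lines, because the Unicode database already knows the decimal value of every digit in every script -- but this is just a sketch of what int() already does for you, not something you should need to write:)

import unicodedata

def any_script_int(string):
    # Sketch only: no sign handling, no whitespace, no nice error messages.
    n = 0
    for c in string:
        n = 10*n + unicodedata.decimal(c)   # raises ValueError for non-digits
    return n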
> An alternative method depending on where your strings are actually
> coming from would be to use byte-strings or the ascii codec. I may
> consider doing this in future; in my own applications if I pass a
> non-ascii digit to int() then I definitely have data corruption.
It's not up to built-ins like int() to protect you from data corruption.
Would you consider it reasonable for me to say "in my own applications, if I
pass a number bigger than 100, I definitely have data corruption, therefore
int() should not support numbers bigger than 100"?
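If your application really does want to treat non-ASCII digits as corruption, that check belongs in your own code, and it is short. Something like this (ascii_int is just an illustrative name; in Python 3, int() happily accepts ASCII byte strings):

def ascii_int(string):
    # encode() raises UnicodeEncodeError (a ValueError subclass) for any
    # non-ASCII character; int() then rejects anything that isn't an ASCII
    # numeral with an optional sign and surrounding whitespace.
    return int(string.encode('ascii'))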
> I expect the int() function to reject invalid input.
It does. What makes you think it doesn't?
> I thought that its definition of invalid matched up with my own.
If your definition is something other than "a string containing non-digits, apart
from a leading plus or minus sign", then it is your definition that is wrong.
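Concretely, a quick interactive sketch of that definition in action:

>>> int('+42'), int('-42')
(42, -42)
>>> int('12a')
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: '12a'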
--
Steven