On 10/06/13 22:55, Oscar Benjamin wrote:

> Yes, but I thought it was using a different, unambiguous and easier
> (for me) to understand definition of decimal digits.

It's no easier. You have a list (in your head) of characters that count as 
decimal digits. There are ten of them, because you are an English speaker and 
have probably never learned a language that uses other digits, or used DOS 
codepages or Windows charsets with alternate versions of digits (such as the 
East Asian fullwidth and halfwidth forms). You probably *have* used charsets 
like Latin-1, which contains ¹²³, but probably not often enough to think of 
those characters as potential digits. Either way, in your head you have a list 
of decimal digits.

Python also has a list of decimal digits, except it is longer.

(Strictly speaking, it probably doesn't keep an explicit list "these chars are 
digits" in memory, but possibly looks them up in a Unicode property database as 
needed. But then, who knows how memories and facts are stored in the human brain? 
Strictly speaking, there's probably no list in your head either.)
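
You can see Python's longer list in action at the interactive prompt (a quick 
demonstration; the particular scripts shown are just examples):

>>> int('123')      # ASCII digits
123
>>> int('１２３')    # fullwidth digits
123
>>> int('١٢٣')      # Arabic-Indic digits
123
>>> import unicodedata
>>> unicodedata.decimal('٣')   # the same character database, via unicodedata
3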


> I guess that I'm just coming to realise exactly what Python 3's
> unicode support really means and in many cases it means that the
> interpreter is doing things that I don't want or need.

With respect, it's not that you don't want or need them, but that you don't 
*know* that you actually do want and need them. (I assume you are releasing 
software for others to use. If all your software is private, for your own use 
and nobody else, then you may not care.) If your software accepts numeric 
strings from the user -- perhaps it reads a file, perhaps it does something 
like this:

number = int(input("Please enter a number: "))

-- you want it to do the right thing when the user enters a number. Thanks to 
the Internet, your program is available to people all over the world. In 
probably half the world, the digits people use are not necessarily ASCII 0-9. 
Somebody downloads your app in Japan, points it at a data file containing 
fullwidth or halfwidth digits, and in Python 3 it just works. (Provided, of 
course, that you don't sabotage its ability to do so with inappropriate 
ASCII-only data validation.)
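
To make that concrete (a minimal sketch; the regex stands in for whatever 
ASCII-only check you might be tempted to write):

import re

s = "１２３"   # fullwidth digits, e.g. from a Japanese data file
print(int(s))                    # just works: prints 123
print(re.match(r'[0-9]+$', s))   # None -- this "validation" rejects valid data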


> For example I very often pipe streams of ascii numeric text from one
> program to another.

No you don't. Never. Not once in the history of computers has anyone ever piped 
streams of text from one program to another. They just *think* they have.

They pipe *bytes* from one program to another.



> In some cases the cost of converting to/from decimal is actually
> significant and Python 3 will add to this both with a more complex
> conversion

Let your mind be at rest on this account. Python 3.3 int() is nearly twice as 
fast as Python 2.7 for short strings:

[steve@ando ~]$ python2.7 -m timeit "int('12345')"
1000000 loops, best of 3: 0.924 usec per loop
[steve@ando ~]$ python3.3 -m timeit "int('12345')"
1000000 loops, best of 3: 0.485 usec per loop


and about 30% faster for long strings:

[steve@ando ~]$ python2.7 -m timeit "int('1234567890'*5)"
100000 loops, best of 3: 2.06 usec per loop
[steve@ando ~]$ python3.3 -m timeit "int('1234567890'*5)"
1000000 loops, best of 3: 1.45 usec per loop


It's a little slower when converting the other way:

[steve@ando ~]$ python2.7 -m timeit "str(12345)"
1000000 loops, best of 3: 0.333 usec per loop
[steve@ando ~]$ python3.3 -m timeit "str(12345)"
1000000 loops, best of 3: 0.5 usec per loop

but for big numbers, the difference is negligible:

[steve@ando ~]$ python2.7 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.12 usec per loop
[steve@ando ~]$ python3.3 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.16 usec per loop

and in any case, the time taken to convert to a string is trivial.


> and with its encoding/decoding part of the io stack. I'm wondering
> whether I should really just be using binary mode for this kind of
> thing in Python 3 since this at least removes an unnecessary part of
> the stack.

I'm thinking that you're engaging in premature optimization. Have you profiled 
your code to confirm that the bottlenecks are where you think they are?
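
(For what it's worth, if profiling ever did show the text layer to be the 
bottleneck, the binary-mode version is short -- a minimal sketch, assuming one 
ASCII number per line on stdin:)

import sys

total = 0
for line in sys.stdin.buffer:   # the underlying binary stream: no decoding
    total += int(line)          # int() accepts ASCII digits in bytes
print(total)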


> In a previous thread where I moaned about the behaviour of the int()
> function Eryksun suggested that it would be better if int() wasn't
> used for parsing strings at all. Since then I've thought about that
> and I agree. There should be separate functions for each kind of
> string to number conversion, with one just for ASCII decimal only.

I think that is a terrible, terrible idea. It moves responsibility for something 
absolutely trivial ("convert a string to a number") from the language to the 
programmer, *who will get it wrong*.

# The right way:
number = int(string)


# The wrong, horrible, terrible way (and buggy too):
try:
    number = ascii_int(string)
except ValueError:
    try:
        number = fullwidth_int(string)
    except ValueError:
        try:
            number = halfwidth_int(string)
        except ValueError:
            try:
                number = thai_int(string)
            except ...
            # and so on, for a dozen or so other scripts...
            # oh gods, think of the indentation!!!

            except ValueError:
                 # Maybe it's a mixed script number?
                 # Fall back to char by char conversion.
                 n = 0
                 for c in string:
                     if c in ascii_digits:
                         n += ord(c) - ord('0')
                     elif c in fullwidth_digits:
                         n += ord(c) - ord('\N{FULLWIDTH DIGIT ZERO}')
                     elif ... # and so forth
                     else:
                         raise ValueError
                 # (and yes, this forgets to multiply n by 10 on each
                 # pass through the loop -- hence the "buggy too")


Of course, there are less stupid ways to do this. But you don't have to, 
because it already works.
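
It even copes with digits from several scripts mixed in one string, which the 
hand-rolled version above is straining to support:

>>> int('1２٣')   # ASCII one, fullwidth two, Arabic-Indic three
123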



> An alternative method depending on where your strings are actually
> coming from would be to use byte-strings or the ascii codec. I may
> consider doing this in future; in my own applications if I pass a
> non-ascii digit to int() then I definitely have data corruption.

It's not up to built-ins like int() to protect you from data corruption.
Would you consider it reasonable for me to say "in my own applications, if I
pass a number bigger than 100, I definitely have data corruption, therefore
int() should not support numbers bigger than 100"?
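
(And if you genuinely want ASCII-only behaviour, it is a two-line wrapper on 
top of int(), not a change to the builtin. A minimal sketch, with ascii_int as 
my hypothetical name:)

def ascii_int(string):
    # Reject anything outside ASCII up front, then convert as normal.
    string.encode('ascii')   # raises UnicodeEncodeError on non-ASCII input
    return int(string)

Since UnicodeEncodeError is a subclass of ValueError, code that catches 
ValueError around the call keeps working unchanged.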

> I expect the int() function to reject invalid input.

It does. What makes you think it doesn't?


> I thought that its definition of invalid matched up with my own.

If your definition is something other than "a string containing non-digits, apart 
from a leading plus or minus sign", then it is your definition that is wrong.
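
A few cases, for the record:

>>> int('+12')    # leading sign: fine
12
>>> int('١٢')     # Arabic-Indic digits: fine, they are decimal digits
12
>>> int('12.5')   # anything else is rejected
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: '12.5'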



--
Steven