[Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Hi all,

My first post to the list. In fact, first time Python hacker, long-time
Python user though. (Melbourne, Australia).

Some of you may have seen for the past week or so my bug report on Roundup,
http://bugs.python.org/issue3300

I've spent a heap of effort on this patch now so I'd really like to get some
more opinions and have this patch considered for Python 3.0.

Basically, urllib.quote and unquote seem not to have been updated since
Python 2.5, and because of this they implicitly perform Latin-1 encoding and
decoding (with respect to percent-encoded characters). I think they should
default to UTF-8 for a number of reasons, including that this is what other
software such as web browsers use.
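
To make the difference concrete, here is a minimal sketch of how one non-ASCII character is percent-encoded under each encoding. It assumes a current Python 3 release, where urllib.parse.quote takes an optional encoding argument and defaults to UTF-8; it is not the code from the patch.

```python
from urllib.parse import quote

text = "h\u00fcllo"  # 'hüllo', one character outside ASCII

# Default behaviour: encode the string as UTF-8, then percent-encode the octets.
utf8_quoted = quote(text)

# Explicit Latin-1: U+00FC becomes the single octet 0xFC.
latin1_quoted = quote(text, encoding="latin-1")

print(utf8_quoted)    # h%C3%BCllo
print(latin1_quoted)  # h%FCllo
```

The same character produces two different URIs, which is why the choice of default matters.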

I've submitted a patch which fixes quote and unquote to use UTF-8 by
default. I also added extra arguments allowing the caller to choose the
encoding (after discussion, there was some consensus that this would be
beneficial). I have now completed updating the documentation, writing
extensive test cases, and testing the rest of the standard library for code
breakage. The result is that there wasn't really any; everything seems
to just work nicely with UTF-8. You can read the sordid details of my
investigation in the tracker.

Firstly, it'd be nice to hear if people think this is desirable behaviour.
Secondly, if it's feasible to get this patch in Python 3.0. (I think if it
were delayed to Python 3.1, the code breakage wouldn't justify it). And
thirdly, if the first two are positive, if anyone would like to review this
patch and check it in.

I have extensively tested it, and am now pretty confident that it won't
cause any grief if it's checked in.

Thanks very much,
Matt Giuca
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Thanks for all the replies, and making me feel welcome :)

>
> If what you are saying is true, then it can probably go in as a bug
> fix (unless someone else knows something about Latin-1 on the Net that
> makes this not true).
>

Well from what I've seen, the only time Latin-1 naturally appears on the net
is when you have a web page in Latin-1 (either explicit or inferred; and
note that a browser like Firefox will infer Latin-1 if it sees only ASCII
characters) with a form in it. Submitting the form, the browser will use
Latin-1 to percent-encode the query string.

So if you write a web app and you don't have any non-ASCII characters or
mention the charset, chances are you'll get Latin-1. But I would argue
you're leaving things to chance and you deserve to get funny behaviour. If
you do any of the following:

   - Use a non-ASCII character, encoded as UTF-8 on the page.
   - Send a Content-Type: ; charset=utf-8.
   - In HTML, set a .
   - In the form itself, set .

then the browser will encode the form data as UTF-8. And most "proper" web
pages should get themselves explicitly served as UTF-8.

> That I can't say I can necessarily do; have my own bug reports to
> work through this weekend. =)


OK well I'm busy for the next few days; after that I can do a patch trade
with someone. (That is if I am allowed to do reviews; not sure since I don't
have developer privileges).


On Sun, Jul 13, 2008 at 5:58 AM, Mark Hammond <[EMAIL PROTECTED]>
wrote:

> > My first post to the list. In fact, first time Python hacker,
> > long-time Python user though. (Melbourne, Australia).
>
> Cool - where exactly?  I'm in Wantirna (although not at this very moment -
> I'm in Lithuania, but home again in a couple of days)


Cool :) Balwyn.


> * Please take Martin with a grain of salt (I would say "ignore him", but
> that is too strong ;)


Lol, he is a hard man to please, but he's given some good feedback.


On Sun, Jul 13, 2008 at 7:07 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:

>
> The standard here is RFC 3986, from Jan 2005, which says,
>
>  ``When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be
> percent-encoded.''


Ah yes, I was originally hung up on the idea that "URLs had to be encoded in
UTF-8", till Martin pointed out that it only says "new URI scheme" there.
It's perfectly valid to have non-UTF-8-encoded URIs. However in practice
they're almost always UTF-8. So I think introducing the new encoding
argument and having it default to "utf-8" is quite reasonable.

> I'd say, treat the incoming data as either Unicode (if it's a Unicode
> string), or some unknown superset of ASCII (which includes both
> Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
> encoding), and apply the appropriate transformation.
>

Ah there may be some confusion here. We're only dealing with str->str
transformations (which in Python 3 means Unicode strings). You can't put a
bytes in or get a bytes out of either of these functions. I suggested a
"quote_raw" and "unquote_raw" function which would let you do this.

The issue is with the percent-encoded characters in the URI string, which
must be interpreted as bytes, not code points. How then do you convert these
into a Unicode string? (Python 2 did not have this problem, since you simply
output a byte string without caring about the encoding).
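
A small sketch of the problem, assuming a current Python 3 release where urllib.parse.unquote_to_bytes exists: the percent-escapes only determine octets, and the resulting string depends entirely on which encoding is then assumed.

```python
from urllib.parse import unquote_to_bytes

# The percent-escapes determine octets, nothing more.
raw = unquote_to_bytes("h%C3%BCllo")
print(raw)  # b'h\xc3\xbcllo'

# Turning those octets into a str requires choosing an encoding.
as_utf8 = raw.decode("utf-8")      # two octets become one character
as_latin1 = raw.decode("latin-1")  # each octet becomes its own character

print(len(as_utf8), len(as_latin1))  # 5 6
```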

On Sun, Jul 13, 2008 at 9:10 AM, "Martin v. Löwis" <[EMAIL PROTECTED]>
wrote:

> > Very nice, I had this somewhere on my todo list to work on. I'm very much
> > in favour, especially since it synchronizes us with the RFCs (for all I
> > remember reading about it last time).
>
> I still think that it doesn't. The RFCs haven't changed, and can't
> change for compatibility reasons. The encoding of non-ASCII characters
> in URLs remains as underspecified as it always was.


Correct. But my patch brings us in-line with that unspecification. The
unpatched version forces you to use Latin-1. My patch lets you specify the
encoding to use.


> Now, with IRIs, the situation is different, but I don't think the patch
> claims to implement IRIs (and if so, it perhaps shouldn't change URL
> processing in doing so).


True. I don't claim to have implemented IRIs or even know enough about them
to do that. I'll read up on these things in the next few days.

However, this is a URI library, not IRI. From what I've seen, it

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
> This POV is way too browser-centric...
>

This is but one example. Note that I found web forms to be the least
clear-cut example of choosing an encoding. Most of the time applications
seem to be using UTF-8, and all the standards I have read are moving towards
specifying UTF-8 (from being unspecified). I've never seen a standard
specify or even recommend Latin-1.

Where web forms are concerned, basically setting the form accept-charset or
the page charset is the *maximum amount* of control you have over the
encoding. As you say, it can be encoded by another page or the user can
override their settings. Then what can you do as the server? Nothing ...
there's no way to predict the encoding. So you just handle the cases you
have control over.

> 5) Different cultures do not choose necessarily between latin-1 and utf-8.
> They deal more with things like, say KOI8-R or Big5.


Exactly. This is exactly my point - Latin-1 is arbitrary from a standards
point of view. It's just one of the many legacy encodings we'd like to
forget. The UTFs are the only options which support all languages, and UTF-8
is the only ASCII-compatible (and therefore URI-compatible) encoding. So we
should aim to support that as the default.

> Besides all that and without any offense: "most proper" and "should do" and
> the implication that all web browsers behave the same way are not a good
> location to argue from when talking about implementing a standard ;)


I agree. However if there *was* a proper standard we wouldn't have to argue!
"Most proper" and "should do" is the most confident we can be when dealing
with this standard, as there is no correct encoding.

Does anyone have a suggestion which will be more compatible with the rest of
the world than allowing the user to select an encoding, and defaulting to
"utf-8"?


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-13 Thread Matt Giuca
On Mon, Jul 14, 2008 at 4:54 AM, André Malo <[EMAIL PROTECTED]> wrote:

>
> Ahem. The HTTP standard does ;-)
>

Really? Can you include a quotation please? The HTTP standard talks a lot
about ISO-8859-1 (Latin-1) in terms of actually raw encoded bytes, but not
in terms of URI percent-encoding (a different issue) as far as I can tell.


>
> > Where web forms are concerned, basically setting the form accept-charset
> > or the page charset is the *maximum amount* of control you have over the
> > encoding. As you say, it can be encoded by another page or the user can
> > override their settings. Then what can you do as the server? Nothing ...
>
> Guessing works pretty well in most of the cases.
>

Are you suggesting that urllib.unquote guess the encoding? It could do that
but it would make things rather unpredictable. I think if this was an
application (such as a web browser), then guessing is OK. But this is a
library function. Library functions should not make arbitrary decisions;
they should be well-specified.

> Latin-1 is not exactly arbitrary. Besides being a charset - it maps
> one-to-one to octet values, hence it's commonly used to encode octets and
> is therefore a better fallback than every other encoding.
>

True. So the only advantage I see to the current implementation is that if
you really want to, you can take the Latin-1-decoded URI (from unquote) and
explicitly encode it as Latin-1 and then decode it again as whatever
encoding you want. But that would be a hack, would it not? I'd prefer if the
library didn't require a hack just to get the extremely common use case
(UTF-8).
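
For reference, the hack described above looks roughly like this. It is a sketch that assumes a current Python 3 release, where unquote accepts an encoding argument; passing encoding="latin-1" stands in for the unpatched behaviour.

```python
from urllib.parse import unquote

quoted = "%E6%BC%A2%E5%AD%97"  # UTF-8 octets, percent-encoded

# Step 1: decode as Latin-1, which maps every octet to exactly one code point.
latin1_view = unquote(quoted, encoding="latin-1")

# Step 2 (the hack): turn the code points back into octets, then decode
# those octets with the encoding that was actually intended.
recovered = latin1_view.encode("latin-1").decode("utf-8")

print(ascii(latin1_view))  # '\xe6\xbc\xa2\xe5\xad\x97'
print(ascii(recovered))    # '\u6f22\u5b57'
```

It works only because Latin-1 is a one-to-one mapping between octets and the first 256 code points; the intermediate string is never meaningful text.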


>
> > I agree. However if there *was* a proper standard we wouldn't have to
> > argue! "Most proper" and "should do" is the most confident we can be when
> > dealing with this standard, as there is no correct encoding.
>
> Well, the standard says, there are octets to be encoded. I find that proper
> enough.


Yes but unfortunately we aren't talking about octets any more in Python 3,
but characters. If we're going to follow the standard and encode octets,
then we should be accepting (for quote) and returning (for unquote) bytes
objects, not strings. But as that's going to break most existing code and be
extremely confusing, I think it's best we try and solve this problem for
Unicode strings.


> > Does anyone have a suggestion which will be more compatible with the rest
> > of the world than allowing the user to select an encoding, and defaulting
> > to "utf-8"?
>
> Default to latin-1 for decoding and utf-8 for encoding. This might be
> confusing though, so maybe you've asked the wrong question ;)
>

:o that would break so so much existing code, not to mention being horribly
inconsistent and confusing. Having said that, that's almost what the current
behaviour is (quote uses Latin-1 for characters < 256, and UTF-8 for
characters above; unquote uses Latin-1).

Again I bring up the http server example. If you go to a directory, create a
file with a name such as '漢字', and then run this code in Python 3.0 from
that directory:

import http.server
s = http.server.HTTPServer(('', 8000),
                           http.server.SimpleHTTPRequestHandler)
s.serve_forever()

You'll see the file in the directory listing - its HTML will be 漢字. But if you click it, you get a 404 because
the server will look for the file named unquote("%E6%BC%A2%E5%AD%97") =
'æ¼¢å\xad\x97'.

If you apply my patch (patch5) *everything* *just* *works*.
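
The failure comes down to whether unquote undoes quote. A minimal sketch of that round-trip check, assuming a current Python 3 release; round_trips is a hypothetical helper, and decoding with "latin-1" stands in for the unpatched unquote.

```python
from urllib.parse import quote, unquote

def round_trips(name, decode_with):
    """Return True if unquoting quote(name) with decode_with gives name back.

    quote() is left at its default here (UTF-8); only the decoding side varies.
    """
    return unquote(quote(name), encoding=decode_with) == name

kanji = "\u6f22\u5b57"  # the file name from the example above

print(round_trips(kanji, "utf-8"))          # True
print(round_trips(kanji, "latin-1"))        # False: this is the 404 case
print(round_trips("plain.txt", "latin-1"))  # True: ASCII is unaffected
```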


On Mon, Jul 14, 2008 at 6:36 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:

> > Ah there may be some confusion here. We're only dealing with str->str
> > transformations (which in Python 3 means Unicode strings). You can't put
> a
> > bytes in or get a bytes out of either of these functions. I suggested a
> > "quote_raw" and "unquote_raw" function which would let you do this.
>
> Ah, well, that's a problem.  Clearly the unquote is str->bytes, while
> the quote is (bytes OR str)->str.
>

OK so for quote, you're suggesting that we accept either a bytes or a str
object. That sounds quite reasonable (though neither the unpatched nor the
patched version accepts a bytes at the moment). I'd simply change the code
in quote (from patch5) to do this:

if isinstance(s, str):
    s = s.encode(encoding, errors)

res = map(quoter, s)

Now you get this behaviour by default (which may appear confusing but I'd
argue correct given the different semantics of 'h\xfcllo' and b'h\xfcllo'):

>>> urllib.parse.quote(b'h\xfcllo')
'h%FCllo'   # Directly-encoded octets
>>> urllib.parse.quote('h\xfcllo')
'h%C3%BCllo' # UTF-8 encoded string, then encoded octets

> Clearly the unquote is str->bytes, You can't pass a Unicode string back
> as the result of unquote *without* passing in an encoding specifier,
> because the character set is application-specific.
>

So for unquote you're suggesting that it always return a bytes object UNLESS
an encoding is specified? As in:

>>> urllib.parse.unquote('h%C3%BCllo')
b'h\xc3\xbcllo'

I would object to that on tw

Re: [Python-Dev] str(container) should call str(item), not repr(item)

2008-07-28 Thread Matt Giuca
Another disadvantage of calling str recursively rather than repr is that it
places an onus on anyone writing a class to write both a repr and a str
method (or be inconsistent with the newly-accepted standard for container
types).

I personally write a repr method for most classes, which aids debugging.
This means all my classes behave like containers currently do - their str
will call repr on the items. This proposal will make all of my classes
behave inconsistently with the standard container types.

- Matt Giuca


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Hi folks,

This issue got some attention a few weeks back but it seems to have
fallen quiet, and I haven't had a good chance to sit down and reply
again till now.

As I've said before this is a serious issue which will affect a great
deal of code. However it's obviously not as clear-cut as I originally
believed, since there are lots of conflicting opinions. Let us see if
we can come to a consensus.

(For those who haven't seen the discussion, the thread starts here:
http://mail.python.org/pipermail/python-dev/2008-July/081013.html
continues here for some reason:
http://mail.python.org/pipermail/python-dev/2008-July/081066.html
and I've got a bug report with a fully tested and documented patch here:
http://bugs.python.org/issue3300)

Firstly, it looks like most of the people agree we should add an
optional "encoding" argument which lets the caller customize which
encoding to use. What we tend to disagree about is what the default
encoding should be.

Here I present the various options as I see it (and I'm trying to be
impartial), and the people who've indicated support for that option
(apologies if I've misrepresented anybody's opinion, feel free to
correct):

1. Leave it as it is. quote uses Latin-1 for characters in range(0,256) and
falls back to UTF-8 for anything above. unquote uses Latin-1.
In favour: Anybody who doesn't reply to this thread
Pros: Already implemented; some existing code depends upon ord values
of string being the same as they were for byte strings; possible to
hack around it.
Cons: unquote is not inverse of quote; quote behaviour
internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.

2. Default to UTF-8.
In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
Pros: Fully working and tested solution is implemented; recommended by
RFC 3986 for all future schemes; recommended by W3C for use with HTML;
UTF-8 used by all major browsers; supports all characters; most
existing code compatible by default; unquote is inverse of quote.
Cons: By default, URIs may have invalid octet sequences (not possible
to reverse).

3. quote default to UTF-8, unquote default to Latin-1.
In favour: André Malo
Pros: quote able to handle all characters; unquote able to handle all sequences.
Cons: unquote is not inverse of quote; totally inconsistent.

4. quote accepts either bytes or str, unquote default to outputting
bytes unless given an encoding argument.
In favour: Bill Janssen
Pros: Technically does what the spec says, which is treat it as an
octet encoding.
Cons: unquote will break most existing code; almost 100% of the time
people will want it as a string.
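
The con listed under option 2 (invalid octet sequences that cannot be reversed) can be sketched as follows. This assumes a current Python 3 release, where unquote takes encoding and errors arguments and defaults to errors="replace"; it is an illustration, not the patch itself.

```python
from urllib.parse import quote, unquote

# A lone 0xFC octet is valid Latin-1 but not a valid UTF-8 sequence.
lenient = unquote("h%FCllo")  # the bad octet becomes U+FFFD
requoted = quote(lenient)     # the original octet cannot be recovered

try:
    unquote("h%FCllo", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Latin-1 never fails, because every octet maps to some code point.
latin1 = unquote("h%FCllo", encoding="latin-1")

print(ascii(lenient))  # 'h\ufffdllo'
print(requoted)        # h%EF%BF%BDllo
print(strict_failed)   # True
print(ascii(latin1))   # 'h\xfcllo'
```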



I'll just comment on #4 since I haven't already. Let's talk about
quote and unquote separately. For quote, I'm all for letting it accept
a bytes as well as a str. That doesn't break anything or surprise
anyone.

For unquote, I think it will break a lot and surprise everyone. I
think that while this may be "purely" the best option, it's pretty
silly. I reckon the vast majority of users will be surprised when they
see it spitting out a bytes object, and all that most people will do
is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
specify a method for encoding octet sequences", I'm reading them as
"URLs specify a method for encoding strings, and leave the character
encoding unspecified." The second reading supports the idea that
unquote outputs a str.

I'm also recommending we add unquote_to_bytes to do what you suggest
unquote should do. (So either way we'll get both versions of unquote;
I'm just suggesting the one called "unquote" do the thing everybody
expects). But that's less of a priority so I want to commit these
urgent fixes first.

I'm basically saying just two things: 1. The standards are undefined;
2. Therefore we should pick the most useful and/or intuitive default.
IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be
more so in the future when more technologies are hard-coded as UTF-8
(which this RFC recommends they do in the future).

I am also quite adamant that unquote be the inverse of quote.

Are there any more opinions on this matter? It would be good to reach
a consensus. If anyone seriously wants to push a different alternative
to mine, please write a working implementation and attach it to issue
3300.

On the technical side of things, does anybody have time to review my
patch for this issue?
http://bugs.python.org/issue3300
Patch 5.
It's just a patch for unquote, quote, and small related functions, as
well as numerous changes to test cases and documentation.

Cheers
Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Arg! Damnit, why do my replies get split off from the main thread?
Sorry about any confusion this may be causing.


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
> Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes
it does. This is not an encoding for binary data, it's an encoding for
character data, but it's unspecified how the strings map to octets before
being percent-encoded. From RFC 3986, section 1.2.1:

> Percent-encoded octets (Section 2.1) may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set if this
> representation is allowed by the scheme or by the protocol element in which
> the URI is referenced.  Such a definition should specify the character
> encoding used to map those characters to octets prior to being
> percent-encoded for the URI.


So the string->string proposal is actually correct behaviour. I'm all in
favour of a bytes->string version as well, just not with the names "quote"
and "unquote".

I'll prepare a new patch shortly which has bytes->string and string->bytes
versions of the functions as well. (quote will accept either type, while
unquote will output a str, there will be a new function unquote_to_bytes
which outputs a bytes - is everyone happy with that?)

Guido says:

> Actually, we'd need to look at the various other APIs in Py3k before we can
> decide whether these should be considered taking or returning bytes or text.
> It looks like all other APIs in the Py3k version of urllib treat URLs as
> text.


Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
see how my patch affects the behaviour of dependent code. Aside from a few
minor bits which assumed octets (and did their own encoding/decoding) (which
I fixed), all the code assumes strings and is very happy to go on assuming
this, as long as the URIs are encoded with UTF-8, which they almost
certainly are.

Guido says:

> I think the only change is to remove the encoding arguments and ...


You really want me to remove the encoding= named argument? And hard-code
UTF-8 into these functions? It seems like we may as well have the optional
encoding argument, as it does no harm and could be of significant benefit.
I'll post a patch with the unquote_to_bytes function, but leave the encoding
arguments in until this point is clarified.

Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Alright, I've uploaded the new patch which adds the two requested
bytes-oriented functions, as well as accompanying docs and tests.
http://bugs.python.org/issue3300
http://bugs.python.org/file11009/parse.py.patch6

> I'd rather have two pairs of functions, so that those who want to give
> the readers of their code a clue can do so. I'm not opposed to having
> redundant functions that accept either string or bytes though, unless
> others prefer not to.
>

Yes, I was in a similar mindset. So the way I've implemented it, quote
accepts either a bytes or a str. Then there's a new function
quote_from_bytes, which is defined precisely like this:

quote_from_bytes = quote

So either name can be used on either input type, with the idea being that
you should use quote on a str, and quote_from_bytes on a bytes. Is this a
good idea or should it be rewritten so each function permits only one input
type?

> Sorry, I have yet to look at the tracker (only so many minutes in a day...).


Ah, I didn't mean offense. Just that one could read the sordid details of my
investigation on the tracker if one so desired ;)

> I don't mind an encoding argument, as long as it isn't used to change
> the return type (as Bill was proposing).


Yeah, my unquote always outputs a str, and unquote_to_bytes always outputs a
bytes.
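
A quick sketch of that split between return types, using the function names as they exist in current Python 3 releases (which may differ in detail from patch 6):

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

as_text = unquote("h%C3%BCllo")             # always a str
as_octets = unquote_to_bytes("h%C3%BCllo")  # always a bytes

print(type(as_text).__name__)    # str
print(type(as_octets).__name__)  # bytes

# Quoting either form gets back to the same URI text.
print(quote(as_text))               # h%C3%BCllo
print(quote_from_bytes(as_octets))  # h%C3%BCllo
```

The encoding argument changes which characters come out of unquote, never the type of the result.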

Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Bill wrote:

> I'm not sure that's sufficient review, though I agree it's necessary.
> The major consumers of quote/unquote are not in the Python standard
> library.


I figured that Python 3.0 is designed to fix things, with breaking
third-party code being an acceptable side-effect of that. So the most
important thing when 3.0 is released is that the stdlib is internally
consistent. All other code is "allowed" to be broken. So I've investigated
all the code necessary.

Having said this, my patch breaks almost no code. Your suggestion breaks a
hell of a lot.

> Sure.  All I was asking was that we not break the existing usage of
> the standard library "unquote" by producing a string by *assuming* a
> UTF-8 encoded string is what's in those percent-encoded bytes (instead
> of, say, ISO 2022-JP).  Let the "new" function produce a string:
> "unquote_as_string".


You're assuming that a Python 2.x "str" is the same thing as a Python 3.0
"bytes". It isn't. (If it was, this transition would be trivial). A Python 2
"str" is a non-Unicode string. It can be printed, concatenated with Unicode
strings, etc etc. It has the semantics of a string. The Python 3.0 "bytes"
is not a string at all.

What you're saying is "the old behaviour was to output a bytes, so the new
behaviour should be consistent". But that isn't true - the old behaviour was
to output a string (a non-Unicode one). People, and code, expect it to
output something with string semantics. So making unquote output a bytes is
just as big a change as making it output a (unicode) str. Python 3.0 doesn't
have a type which is like Python 2's "str" type (which is good - that type
was very messy). So the argument that "Python 2 unquote outputs a bytes, so
we should too" is not legitimate.
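
To illustrate why a Python 3 bytes does not have string semantics, here is a short sketch (plain Python 3, no assumptions beyond the built-in types):

```python
octets = b"h\xc3\xbcllo"
text = "h\u00fcllo"

# Indexing a bytes yields an int, not a one-character string.
first_octet = octets[0]
first_char = text[0]

# Mixing bytes and str in concatenation is an error, not an implicit conversion.
try:
    octets + text
    mixed_failed = False
except TypeError:
    mixed_failed = True

print(first_octet)   # 104
print(first_char)    # h
print(mixed_failed)  # True
```

Code written against the Python 2 unquote relies on exactly these string operations.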



If you want to keep pushing this, please install my new patch (patch 6).
Then rename "unquote" to "unquote_to_string" and rename "unquote_to_bytes"
to "unquote", and witness the havoc that ensues. Firstly, you break most
Internet-related modules in the standard library.

10 tests failed:
    test_SimpleHTTPServer test_cgi test_email test_http_cookiejar
    test_httpservers test_robotparser test_urllib test_urllib2
    test_urllib2_localnet test_wsgiref

Fixing these isn't a matter of changing test cases (which all but one of my
fixes were). It would require changes to all the modules, to get them to
deal with bytes instead of strings (which would generally mean spraying
.decode("utf-8") all over the place). My code, on the other hand, "tends to
be" compatible with 2.x code.

Here I'm seeing:
BytesWarning: Comparison between bytes and string.
TypeError: expected an object with the buffer interface
http.client.BadStatusLine

For another example, try this:

>>> import http.server
>>> s = http.server.HTTPServer(('', 8000),
...                            http.server.SimpleHTTPRequestHandler)
>>> s.serve_forever()

The current (unpatched) build works, but links to files with non-ASCII
filenames (eg. '漢字') break, because of the URL. This is one example of my
patch directly fixing a bug in real code. With my patch applied, the links
work fine *because URL quoting and unquoting are consistent, and work on all
Unicode characters*.

If you change unquote to output a bytes, it breaks completely. You get a
"TypeError: expected an object with the buffer interface" as soon as the
user visits the page.

Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
> so you can use quote_from_bytes on strings?

Yes, currently.

> I assumed Guido meant it was okay to have quote accept string/byte input and 
> have a function that was redundant but limited in what it accepted (i.e. 
> quote_from_bytes accepts only bytes)
>
> I suppose your implementation doesn't break anything... it just strikes me as 
> "odd"

Yeah. I get exactly what you mean. Worse is it takes an
encoding/replace argument.

I'm in two minds about whether it should allow this or not. On one
hand, it kind of goes with the Python philosophy of not artificially
restricting the allowed types. And it avoids redundancy in the code.

But I'd be quite happy to let quote_from_bytes restrict its input to
just bytes, to avoid confusion. It would basically be a
slightly-modified version of quote:

def quote_from_bytes(s, safe = '/'):
    if isinstance(safe, str):
        safe = safe.encode('ascii', 'ignore')
    cachekey = (safe, always_safe)
    if not isinstance(s, bytes) or isinstance(s, bytearray):
        raise TypeError("quote_from_bytes() expected a bytes")
    try:
        quoter = _safe_quoters[cachekey]
    except KeyError:
        quoter = Quoter(safe)
        _safe_quoters[cachekey] = quoter
    res = map(quoter, s)
    return ''.join(res)

(Passes test suite).

I think I'm happier with this option. But the "if not isinstance(s,
bytes) or isinstance(s, bytearray)" is not very nice.
(The only difference to quote besides the missing arguments is the two
lines beginning "if not isinstance". Maybe we can generalise the rest
of the function).
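
On that isinstance test: "not" binds more tightly than "or", so as written the condition rejects a bytearray along with everything that is not a bytes. If the intent is to accept both types, one possible tidier form passes a tuple to isinstance. The sketch below compares the two; both helper names are hypothetical and not part of the patch.

```python
def rejects_as_written(s):
    # Parsed as: (not isinstance(s, bytes)) or isinstance(s, bytearray)
    return not isinstance(s, bytes) or isinstance(s, bytearray)

def rejects_with_tuple(s):
    # Rejects only values that are neither bytes nor bytearray.
    return not isinstance(s, (bytes, bytearray))

for value in (b"abc", bytearray(b"abc"), "abc"):
    print(type(value).__name__,
          rejects_as_written(value),
          rejects_with_tuple(value))
# bytes False False
# bytearray True False
# str True True
```

Which behaviour is wanted depends on whether quote_from_bytes should take a bytearray at all; the error message in the function above suggests bytes only.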


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Matt Giuca
Has anyone had time to look at the patch for this issue? It got a lot of
support about a week ago, but nobody has replied since then, and the patch
still hasn't been assigned to anybody or given a priority.

I hope I've complied with all the patch submission procedures. Please let me
know if there is anything I can do to speed this along.

Also I'd be interested in hearing anyone's opinion on the "quote_from_bytes"
issue as raised in the previous email. I posted a suggested implementation
of a more restrictive quote_from_bytes in that email, but I haven't included
it in the patch yet.

Matt Giuca


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Matt Giuca
> After the most recent flurry of discussion I've lost track of what's
> the right thing to do. I also believe it was said it should wait until
> 2.7/3.0, so there's no hurry (in fact there's no way to check it -- we
> don't have branches for those versions yet).
>

I assume you mean 2.7/3.1.

I've always been concerned with the suggestion that this wait till 3.1. I
figure this patch is going to change the documented behaviour of these
functions, so it might be unacceptable to change it after 3.0 is released.
It seems logical that this patch be part of the
"incompatible-for-the-sake-of-fixing-things" set of changes in 3.0.

The current behaviour is broken. Any code which uses quote to produce a URL,
then unquotes the same URL later will simply break for characters outside
the Latin-1 range. This is evident in the SimpleHTTPServer class as I said
above (which presents users with URLs for the files in a directory using
quote, then gives 404 when they click on them, because unquote can't handle
it). And it will break any user's code which also assumes unquote is the
inverse of quote.

We could hack a fix into SimpleHTTPServer and expect other users to do the
same (along the lines of .encode('utf-8').decode('latin-1')), but then those
hacks will break when we apply the patch in 3.1 because they abuse Unicode
strings, and we'll have to have another debate about how to be backwards
compatible with them. (The patched version is largely compatible with the
2.x version, but the unpatched version isn't compatible with either the 2.x
version or the patched version).

Surely the sane option is to get this UTF-8 patch into version 3.0 so we
don't have to support this bug into the future? I'm far less concerned about
the decision with regards to unquote_to_bytes/quote_from_bytes, as those are
new features which can wait.

Matt Giuca


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Matt Giuca
> This whole discussion circles too much, I think. Maybe it should be pepped?


The issue isn't circular. It's been patched and tested, then a whole lot of
people agreed including Guido. Then you and Bill wanted the bytes
functionality back. So I wrote that in there too, and Bill at least said
that was sufficient.

On Thu, Jul 31, 2008, Bill Janssen wrote:
>
> But:  OK, OK, I yield.  Though I still think this is a bad idea, I'll shut
> up if we can also add "unquote_as_bytes" which returns a byte sequence
> instead of a string.  I'll just change my code to use that.
>

We've reached, to quote Guido, "as close as consensus as we can get on this
issue".

There is a bug in Python. I've proposed a working fix, and nobody else has.
Guido okayed it. I made all the changes the community suggested. What more
needs to be discussed here?


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Matt Giuca
> There are a lot of quotes around. Including "After the most recent flurry
> of discussion I've lost track of what's the right thing to do."
> But I don't talk for other people.
>

OK .. let me compose myself a little. Sorry I went ahead and assumed this
was closed.

It's just frustrating to me that I've now spent a month trying to push this
through, and while it seems everybody has an opinion, nobody seems to have
bothered trying my code. (I've even implemented your suggestions and posted
my feedback, and nobody replied to that). Nobody's been assigned to look at
it and it hasn't been given a priority, even though we all agree it's a bug
(though we disagree on how to fix it).


>
> > There is a bug in Python. I've proposed a working fix, and nobody else has.
>
> Well, you proposed a patch ;-)
> It may fix things, it will break a lot. While this was denied over and over
> again, it's still gonna happen, because the axioms are still not accounting
> for the reality.


Well all you're getting from me is "it works". And all I'm getting from you
is "it might not". Please .. I've been asking for weeks now for someone to
review the patch. I've already spent hours (like ... days worth of hours)
testing this patch against the whole library. I've written reams of reports
on the tracker to try and convince people it works. There isn't any more *I*
can do. If you think it's going to break code, go ahead and try it out.

The claims I am making are based on my experience working with a) Python 2,
b) Python 3 as it stands, c) Python 3 with my patch, and d) Python 3 with
quote/unquote using bytes. In my experience, (c) is the only version of
Python 3 which works properly.

> > I made all the changes the community suggested.
>
> I don't think so.
>

?


> > What more needs to be discussed here?
>
> Huh? You feel, the discussion is over? Then why are there still open
> questions? I admit, a lot of discussion is triggered by the assessments
> you're stating in your posts. Don't take it as a personal offense, it's a
> simple observation. There were made a lot of statements and nobody even
> bothered to substantiate them.


If you read the bug tracker all the way
to the beginning, you'll see I use a good many examples, and I also went
through the entire standard library to try
and substantiate my claims. (Admittedly, I didn't finish investigating the
request module, but that shouldn't prevent the patch from being reviewed).
As I've said all along, yes, it will break code, but then *all solutions
possible* will break code, including leaving it in. Mine *seems* to break
the least existing code. If there is ever a time to break code, Python 3.0
is it.


> A PEP could fix that.
>

I could write a PEP. But as you've read above, I'm concerned this won't get
into Python 3.0, and then we'll be locked into the existing functionality
and it'll never get accepted; hence I'd rather this be resolved as quickly
as possible. If you think it's worth writing a PEP, let's do it.

Apologies again for my antagonistic reply earlier. Not trying to step on
toes, just get stuff done.

Matt


Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Wow .. a lot of replies today!

On Thu, Aug 7, 2008 at 2:09 AM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> It hasn't been given priority: There are currently 606 patches in the
> tracker, many fixing bugs of some sort. It's not clear (to me, at least)
> why this should be given priority over all the other things such as
> interpreter crashes.


Sorry ... when I said "it hasn't been given priority" I mean "it hasn't been
given *a* priority" - as in, nobody's assigned a priority to it, whatever
that priority should rightfully be.


> We all agree it's a bug: no, I don't. I think it's a missing feature,
> at best, but I'm staying out of the discussion. As-is, urllib only
> supports ASCII in URLs, and that is fine for most purposes.


Seriously, Mr. L%C3%B6wis, that's a tremendously na%C3%AFve statement.


> URLs are just not made for non-ASCII characters. Implement IRIs if you
> want non-ASCII characters; the rules are much clearer for these.


Python 3.0 fully supports Unicode. URIs support *encoding* of arbitrary
characters (as of more recent revisions). The difference is that URIs *may
only consist* of ASCII characters (even though they can encode Unicode
characters), while IRIs may also consist of Unicode characters. It's our
responsibility to implement URIs here ... IRIs are a separate issue.
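
As a small illustration of the distinction, the sketch below decodes the two percent-encoded words used earlier in this message and shows that quoting produces an ASCII-only string. It assumes the UTF-8 default of urllib.parse in later Python 3 releases.

```python
from urllib.parse import quote, unquote

# The percent-encoded words from earlier in this message, decoded as UTF-8.
surname = unquote("L%C3%B6wis")
adjective = unquote("na%C3%AFve")
print(surname, adjective)

# Going the other way, the URI form contains only ASCII characters,
# even though it encodes a non-ASCII one.
encoded = quote("L\u00f6wis")
print(encoded)
```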

Having said this, I'm pretty sure Martin can't be convinced, so I'll leave
that alone.

On Thu, Aug 7, 2008 at 3:34 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:

> So unquote() should probably try to decode using UTF-8 first
> and then fall back to Latin-1 if that doesn't work.


That's an interesting proposal. I think I don't like it - for a user
application that's a good policy. But for a programming language library, I
think it should not do guesswork. It should use the encoding supplied, and
have a single default. But I'd be interested to hear if anyone else wants
this.

As-is, it passes 'replace' to the errors argument, so encoding errors get
replaced by '�' characters.
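
The two policies can be compared in a short sketch. It assumes the urllib.parse API of later Python 3 releases (unquote with encoding and errors arguments, plus unquote_to_bytes). The function unquote_with_fallback is a hypothetical helper showing the guessing approach, not something in the patch.

```python
from urllib.parse import unquote, unquote_to_bytes

def unquote_with_fallback(text):
    """Hypothetical helper: try UTF-8 first, then fall back to Latin-1."""
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")

latin1_url = "caf%E9"  # 'é' percent-encoded as a single Latin-1 byte

# Single default with errors='replace': the invalid byte becomes U+FFFD.
with_default = unquote(latin1_url)

# The caller states the encoding explicitly.
explicit = unquote(latin1_url, encoding="latin-1")

# The guessing policy recovers the text, but only by trial and error.
guessed = unquote_with_fallback(latin1_url)

print(repr(with_default), repr(explicit), repr(guessed))
```

The guess works here, but a byte sequence that happens to be valid UTF-8 would be decoded as UTF-8 even if the sender meant Latin-1, which is why a library default that does no guessing seems preferable.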

OK I haven't looked at the review yet .. guess it's off to the tracker :)

Matt


Re: [Python-Dev] String concatenation

2008-08-09 Thread Matt Giuca
Is the only issue with this feature that you might accidentally miss a comma
after a string in a sequence of strings? That seems like a significantly
obscure scenario compared to the usefulness of the current syntax, for
exactly the purpose Barry points out (which most people use all the time).

I think the runtime concatenation costs are less important than the
handiness of being able to break strings across lines without having to
figure out where to put that '+' operator.
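
For reference, here is a short sketch of both the convenient use and the pitfall being discussed. The list contents are made up. In CPython, as far as I know, adjacent literals are joined when the code is compiled, so no concatenation happens at run time.

```python
# Adjacent string literals are joined into a single string.
message = ("This sentence is long enough that it is "
           "split across two source lines.")

# The pitfall: the missing comma after "beta" silently merges two items.
intended = ["alpha", "beta", "gamma"]
accidental = ["alpha", "beta" "gamma"]

print(message)
print(len(intended), len(accidental))
```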


[Python-Dev] bytes.tohex in Python 3

2008-08-09 Thread Matt Giuca
Hi,

I'm confused as to how you represent a bytes object in hexadecimal in Python
3. Of course in Python 2, you use str.encode('hex') to go to hex, and
hexstr.decode('hex') to go from hex.

In Python 3, they removed "hex" as a codec (which was a good move, I think).
Now there's the static method bytes.fromhex(hexbytes) to go from hex. But I
haven't figured out any (easy) way to convert a byte string to hex. Is there
some way I haven't noticed, or is this an oversight?

The easiest thing I can think of currently is this:
''.join(hex(b)[2:] for b in mybytes)

I think there should be a bytes.tohex() method. I'll add this as a bug
report if it indeed is an oversight, but I thought I'd check here first to
make sure I'm not just missing something.

Matt


Re: [Python-Dev] bytes.tohex in Python 3

2008-08-09 Thread Matt Giuca
Well, whether there's community support for this or not, I thought I'd have
a go at implementing this, so I did. I've submitted a feature request +
working patch to the bug tracker:

http://bugs.python.org/issue3532

Matt

PS. I mean
''.join("%02x" % b for b in mybytes)
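
The reason for the correction: hex(b)[2:] drops the leading zero for byte values below 0x10, so the result can no longer be parsed back. A minimal sketch comparing the two expressions (the function names are made up for illustration):

```python
def to_hex_naive(data):
    """First attempt: loses the leading zero of small byte values."""
    return "".join(hex(b)[2:] for b in data)

def to_hex(data):
    """Corrected version: always two hex digits per byte."""
    return "".join("%02x" % b for b in data)

sample = bytes([0x00, 0x0F, 0xA0, 0xFF])

print(to_hex_naive(sample))  # too short: the bytes 0x00 and 0x0f lose a digit
print(to_hex(sample))        # two digits per byte

# Only the padded form round-trips through bytes.fromhex.
print(bytes.fromhex(to_hex(sample)) == sample)
```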


Re: [Python-Dev] bytes.tohex in Python 3

2008-08-09 Thread Matt Giuca
Hi Guido,

Ah yes Martin just pointed this out on the tracker. I think it would still
be worthwhile having the tohex method, if not just to counter the obscurity
of the binascii.hexlify function (but for other reasons too).

Also there's an issue with all the functions in binascii - they return
bytes, not strings. Is this an oversight? (My version of tohex returns a
str).
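
A short sketch of the difference in return types. binascii.hexlify is the existing function; the last line uses bytes.hex(), which to my knowledge was only added in later releases (Python 3.5), so treat that part as an assumption about the interpreter version.

```python
import binascii

data = b"\x01\xab\xff"

# hexlify returns a bytes object, not a str.
hex_bytes = binascii.hexlify(data)
print(type(hex_bytes).__name__, hex_bytes)

# An explicit decode is needed to get text.
hex_text = hex_bytes.decode("ascii")
print(type(hex_text).__name__, hex_text)

# Later Python 3 releases offer a method that returns a str directly.
print(data.hex() == hex_text)
```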

See tracker:
http://bugs.python.org/issue3532

Matt


[Python-Dev] uuid test fails

2008-08-14 Thread Matt Giuca
Hi,

I thought I'd bring this up on both the tracker and mailing list, since it's
important. It seems the test suite breaks as of r65661. I've posted details
to the bug tracker and a patch which fixes the module in question (uuid.py).

http://bugs.python.org/issue3552

Cheers
Matt


Re: [Python-Dev] ImportError message suggestion

2008-08-19 Thread Matt Giuca
> ImportError: cannot import name annotate from /usr//image.pyc


That could be handy.

Not sure it's necessary, however, and it exposes some system information in the
error message. I can imagine a web app which prints out exception messages,
and this would mean the server file structure is printed to the world.
(Arguably you should not be doing this on your web app, but I work on an
open source web app and we do dump tracebacks to our users sometimes.
Because it's open source we don't mind them seeing the code, but we'd
rather not have them see our server details.)

If you do get this issue (as a developer), I find the built-in help()
function very handy -- you can import a module then go help(that_module) and
it tells you the absolute path to the module. That might be a sufficient
alternative to this patch (though requiring a bit more manual labour).
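
Besides help(), the same path is available programmatically. A minimal sketch, using the json module purely as an example and a made-up helper name:

```python
import json

def module_path(module):
    """Return the file a module was loaded from, or None if it has no file."""
    # Built-in modules such as sys have no __file__ attribute.
    return getattr(module, "__file__", None)

print(module_path(json))
```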

So I am neither for nor against this suggestion.

> "I think the acceptance for this wouldn't be that hard since there is
> no real issue for regression (the only one I could think of is for
> doctest module, although I'm not sure there are any reason to test for
> failed import in doctest)"


I agree. (I'm more familiar with unittest than doctest, where you'd just use
assertRaises(ImportError, ...) and not care what the exception message is --
is there any way in doctest to check for the exception type without caring
about the message?)
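
For what it's worth, doctest does have an option flag for this, IGNORE_EXCEPTION_DETAIL, which compares only the exception type. The sketch below is a self-contained illustration; broken_import and run_example are made-up names, and the failing import is deliberately contrived.

```python
import doctest

def broken_import():
    """Trigger an ImportError (hypothetical helper for this example).

    >>> broken_import()  # doctest: +IGNORE_EXCEPTION_DETAIL
    Traceback (most recent call last):
    ImportError: this text is not compared
    """
    from os import name_that_does_not_exist  # noqa: F401

def run_example():
    """Run the docstring above as a doctest and return the failure count."""
    parser = doctest.DocTestParser()
    globs = {"broken_import": broken_import}
    test = parser.get_doctest(broken_import.__doc__, globs,
                              "broken_import", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test).failed

print(run_example())
```

With the flag set, a change to the wording of the ImportError message would not make such a doctest fail.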

> I can't write the C code myself, or evaluate the patch.
>

Go to http://bugs.python.org/ and add a new issue. Upload the patch as an
attachment when you enter the issue description. I think you'll have to put
it down as a feature request for 2.7/3.1, since the beta tomorrow will mean
no more features in 2.6/3.0.

Matt


Re: [Python-Dev] ImportError message suggestion

2008-08-19 Thread Matt Giuca
I think this is not the place to be discussing the patch (the tracker is),
but while I think of it, I'll just say:

You need to DECREF the fn variable (both PyObject_GetAttrString and
PyString_FromString return new references). If this makes no sense, read up
on reference counting (http://docs.python.org/ext/refcounts.html,
http://www.python.org/doc/api/countingRefs.html).


+PyString_AsString(name),
+ PyString_AsString(fn));
+   Py_DECREF(fn);
}

Also:

   - Do you really want "?" if you can't get the filename for some reason --
   why not just not say anything?
   - Perhaps don't create a new variable "fn", use one of the many defined
   at the top of the eval function.

Otherwise, looks like it will do the job.

But I haven't tested it, just eyeballed it.

Matt


Re: [Python-Dev] Things to Know About Super

2008-08-24 Thread Matt Giuca
Hi Michele,

Do you have a URL for this blog?

Matt


[Python-Dev] Fwd: Things to Know About Super

2008-08-24 Thread Matt Giuca
Had a brief offline discussion with Michele - forwarding.

-- Forwarded message --
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: Mon, Aug 25, 2008 at 12:13 AM

On Aug 24, 3:43 pm, "Matt Giuca" <[EMAIL PROTECTED]> wrote:
> Hi Michele,
>
> Do you have a URL for this blog?

Sorry, here it is:
http://www.artima.com/weblogs/index.jsp?blogger=micheles

-- Forwarded message --
From: Matt Giuca <[EMAIL PROTECTED]>
Date: Mon, Aug 25, 2008 at 1:15 AM

I skimmed (will read in detail later). As an "intermediate" (I'll describe
myself as) Python developer, I tend not to use/understand super (I just call
baseclassname.methodname(self,...) directly), so I guess I'm the target
audience of this article. It's good - very informative and thorough.

It's a bit too informal, personal, and opinionative to be used as
"documentation" IMHO but it could certainly be cleaned up without being
rewritten.

Of interest though, is this:
"The first sentence is just plain wrong: super does not return the
superclass."

From what I remember of using super, this statement is true, and the
documentation is wrong (or at least over-simplifies things).
http://docs.python.org/dev/library/functions.html#super
http://docs.python.org/dev/3.0/library/functions.html#super
Perhaps this should be amended? (A brief statement to the effect of super
creating a proxy object which can call the methods of any base class). It
actually mentions the "super object" later, even though it claims to be
returning the superclass.

Also Michele, looks as if super in Python 3 works about the same but has the
additional feature of supporting 0 arguments, in which case it defaults to
super(this_class, first_arg). (Does not create unbound super objects).
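
A minimal sketch of both points, with made-up class names: super returns a proxy object rather than the superclass, and in a diamond hierarchy the proxy can dispatch to a sibling class rather than the parent. It uses the Python 3 zero-argument form.

```python
class Base:
    def describe(self):
        return ["Base"]

class Left(Base):
    def describe(self):
        # Zero-argument form, equivalent to super(Left, self).
        return ["Left"] + super().describe()

class Right(Base):
    def describe(self):
        return ["Right"] + super().describe()

class Both(Left, Right):
    def describe(self):
        return ["Both"] + super().describe()

obj = Both()

# From Left, super() reaches Right (a sibling), not Base, because the
# lookup follows the method resolution order of type(obj).
print(obj.describe())

# The object returned by super is a proxy of type 'super', not a class.
proxy = super(Both, obj)
print(type(proxy))
```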

Matt


Re: [Python-Dev] Documentation Error for __hash__

2008-08-28 Thread Matt Giuca
> This may have been true for old style classes, but as new style classes
> inherit a default __hash__ from object - mutable objects *will* be usable as
> dictionary keys (hashed on identity) *unless* they implement a __hash__
> method that raises a type error.
>

I always thought this was a bug in new-style classes (due to the fact that,
as you say, they inherit __hash__ from object whether it's wanted or not).
However, it seems to be fixed in Python 3.0. So this documentation is only
"wrong" for Python 2.x branch.

See:

Python 2.6b3+ (trunk:66055, Aug 29 2008, 07:50:39)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class X(object):
...     def __eq__(self, other):
...         return True
...
>>> x = X()
>>> hash(x)
-1211564180

versus

Python 3.0b3+ (py3k:66055M, Aug 29 2008, 07:52:23)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class X(object):
...     def __eq__(self, other):
...         return True
...
>>> x = X()
>>> hash(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'X'

Matt


Re: [Python-Dev] Documentation Error for __hash__

2008-08-29 Thread Matt Giuca
> Being hashable is a different from being usable as dictionary key.
>
> Dictionaries perform the lookup based on the hash value, but will
> then have to check for hash collisions based on an equal comparison.
>
> If an object does not define an equal comparison, then it is not
> usable as dictionary key.
>

But if an object defines *neither* __eq__ nor __hash__, then by default it is
usable as a dictionary key (using the id() of the object for both default
equality and hashing, which is fine, and works for all user-defined types by
default).

An object defining __hash__ but not __eq__ is not problematic, since it
still uses id() default for equality. (It just has potentially bad
dictionary performance, if lots of things hash the same even though they
aren't equal). This is not a problem by definition because *it is officially
"okay" for two objects to hash the same, even if they aren't equal, though
undesirable*.

So all hashable objects are usable as dictionary keys, are they not? (As far
as I know it isn't possible to have an object that does not have an equality
comparison, unless you explicitly override __eq__ and have it raise a
TypeError for some reason).

> It's probably a good idea to implement __hash__ for objects that
> implement comparisons, but it won't always work and it is certainly
> not needed, unless you intend to use them as dictionary keys.
>

But from what I know, it is a *bad* idea to implement __hash__ for any
mutable object with non-reference equality (where equality depends on the
mutable state), as an unbreakable rule. This is because if they are inserted
into a dictionary, then mutated, they may suddenly be in the wrong bucket.
This is why all the mutable types in Python with non-reference equality (eg.
list, set, dict) are explicitly not hashable, while the immutable types (eg.
tuple, frozenset, str) are hashable, and so are the mutable types with
reference equality (eg. functions, user-defined classes by default).
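
The "wrong bucket" problem can be shown with a small sketch. The Point class is hypothetical, and the exact lookup results described in the comments are what I would expect on CPython.

```python
class Point:
    """Mutable object whose equality and hash both depend on its state."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x

    def __hash__(self):
        return hash(self.x)

p = Point(1)
d = {p: "value"}

# Mutating the key changes its hash after it was stored.
p.x = 2

# The entry is still in the dictionary...
print(any(key is p for key in d))

# ...but lookups fail: the new hash points at a different bucket, and a
# fresh Point(1) finds the old bucket but no longer compares equal.
print(p in d)
print(Point(1) in d)
```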


>
> > and that mutable objects should raise a TypeError in __hash__.
>
> That's a good idea, even though it's not needed either ;-)
>

So I think my above "axioms" are a better (less restrictive, and still
correct) rule than this one. It's OK for a mutable object to define
__hash__, as long as its __eq__ doesn't depend upon its mutable state. For
example, you can insert a function object into a dictionary, and mutate its
closure, and it won't matter, because neither the hash nor the equality of
the object is changing. It's only types like list and dict, with deep
equality, where you run into this hash table problem.
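
A minimal sketch of the function example, with made-up names: the closure state changes, but since functions hash and compare by identity, the dictionary entry is unaffected.

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count
        count += 1
        return count

    return increment

counter = make_counter()
registry = {counter: "my counter"}
hash_before = hash(counter)

# Mutate the closure state by calling the function a few times.
counter()
counter()

# Neither the hash nor the equality of the function object has changed.
print(hash(counter) == hash_before)
print(registry[counter])
```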


Re: [Python-Dev] Documentation Error for __hash__

2008-08-29 Thread Matt Giuca
>
>> It's probably a good idea to implement __hash__ for objects that
>> implement comparisons, but it won't always work and it is certainly
>> not needed, unless you intend to use them as dictionary keys.
>>
>>
>>
>
>
> So you're suggesting that we document something like.
>
> Classes that represent mutable values and define equality methods are free
> to define __hash__ so long as you don't mind them being used incorrectly if
> treated as dictionary keys...
>
> Technically true, but not very helpful in my opinion... :-)


No, I think he was suggesting we document that if a class overrides __eq__,
it's a good idea to also implement __hash__, so it can be used as a
dictionary key.

However I have issues with this. First, he said:

"It's probably a good idea to implement __hash__ for objects that
implement comparisons, but it won't always work and it is certainly
not needed, unless you intend to use them as dictionary keys."

You can't say "certainly not needed unless you intend to use them as
dictionary keys", since if you are defining an object, you never know when
someone else will want to use them as a dict key (or in a set, mind!) So *if
possible*, it is a good idea to implement __hash__ if you are implementing
__eq__.

But also, it needs to be very clear that you *should not* implement
__hash__ on a mutable object -- and it already is. So basically the docs
should suggest that it is a good idea to implement __hash__ if you are
implementing __eq__ on an immutable object.
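
As an illustration of that recommendation, here is a sketch of an immutable-by-convention value class that defines both methods consistently. The Version class is hypothetical.

```python
class Version:
    """Value object that is never mutated after construction."""

    def __init__(self, major, minor):
        self._key = (major, minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key == other._key

    def __hash__(self):
        # Hash the same data that __eq__ compares, so equal objects
        # always hash the same.
        return hash(self._key)

# Equal values collapse to one set member and work as dictionary keys.
seen = {Version(3, 0), Version(3, 0), Version(2, 6)}
print(len(seen))
```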

HOWEVER,

There are two contradictory pieces of information in the docs.

a) "if it defines __cmp__() or __eq__() but not __hash__(), its instances
will not be usable as dictionary keys."
versus
b) "User-defined classes have __cmp__() and __hash__() methods by default;
with them, all objects compare unequal and x.__hash__() returns id(x)."

Note that these statements are somewhat contradictory: if a class has a
__hash__ method by default (as b suggests), then it isn't possible to "not
have a __hash__" (as suggested by a).

In Python 2, statement (a) is true for old-style classes only, while
statement (b) is true for new style classes only. This distinction needs to
be made. (For old-style classes, it isn't the case that it has a __hash__
method by default - rather that the hash() function knows how to deal with
objects without a __hash__ method, by calling id()).

In Python 3, statement (a) is true always, while statement (b) is not (in
fact just the same as old-style classes are in Python 2). So the Python 3
docs can get away with being simpler (without having to handle that weird
case).
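
The Python 3 behaviour can be checked with a short sketch. The class names and the is_hashable helper are made up for illustration.

```python
class Plain:
    """Defines neither __eq__ nor __hash__: identity-based defaults apply."""

class EqOnly:
    """Defines __eq__ only: Python 3 makes instances unhashable."""

    def __eq__(self, other):
        return True

def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable(Plain()))   # usable as a dictionary key
print(is_hashable(EqOnly()))  # not usable as a dictionary key

# Python 3 records the missing hash by setting __hash__ to None.
print(EqOnly.__hash__ is None)
```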

I just saw Marc-Andre's new email come in; I'll look at that now.

Matt


Re: [Python-Dev] Documentation Error for __hash__

2008-08-29 Thread Matt Giuca
> Note that only instances have the default hash value id(obj). This
> is not true in general. Most types don't implement the tp_hash
> slot and thus are not hashable. Indeed, mutable types should not
> implement that slot unless they know what they're doing :-)


By "instances" you mean "instances of user-defined classes"?
(I carefully avoid the term "instance" on its own, since I believe that was
phased out when they merged types and classes; it probably still exists in
the C API but shouldn't be mentioned in Python-facing documentation).

But anyway, yes, we should make that distinction.

> Sorry, I wasn't clear enough: with "not defining an equal comparison"
> I meant that an equal comparison does not succeed, ie. raises an
> exception or returns Py_NotImplemented (at the C level).


Oh OK. I didn't even realise it was "valid" or "usual" to explicitly block
__eq__ like that.


> Again, the situation is better at the C level, since types
> don't have a default tp_hash implementation, so have to explicitly
> code such a slot in order for hash(obj) to work.


Yes but I gather that this "data model" document we are talking about is not
designed for C authors, but Python authors, so it should be written for the
point of view of people coding only in Python. (Only the "Extending and
Embedding" and the "C API" documents are for C authors).

> The documentation should probably say:
>
> "If you implement __cmp__ or
> __eq__ on a class, also implement a __hash__ method (and either
> have it raise an exception or return a valid non-changing hash
> value for the object)."
>

I agree, except maybe not for the Python 3 docs. As long as the behaviour I
am observing is well-defined and not just a side-effect which could go away
-- that is, if you define __eq__/__cmp__ but not __hash__, in a user-defined
class, it raises a TypeError -- then I think it isn't necessary to recommend
implementing a __hash__ method and raising a TypeError. Better just to leave
as-is ("if it defines __cmp__() or __eq__() but not __hash__(), its
instances will not be usable as dictionary keys") and clarify the later
statement.


>
> "If you implement __hash__ on classes, you should consider implementing
> __eq__ and/or __cmp__ as well, in order to control how dictionaries use
> your objects."


I don't think I agree with that. I'm not sure why you'd implement __hash__
without __eq__ and/or __cmp__, but it doesn't cause issues so we may as well
not address it.


> In general, it's probably best to always implement both methods
> on classes, even if the application will just use one of them.
>

Well it certainly is for new-style classes in the 2.x branch. I don't think
you should implement __hash__ in Python 3 if you just want a non-hashable
object (since this is the default behaviour anyway).

A lot of my opinion here, though, which doesn't count very much -- so I'm
just making suggestions.

Matt