[Python-Dev] More on Py3K urllib -- urlencode()

2009-02-28 Thread Dan Mahn
Hi.  I've been using Py3K successfully for a while now, and have some 
questions about urlencode().


1) The docs mention that items sent to urlencode() are quoted using 
quote_plus().  However, instances of type "bytes" are not handled like 
they are with quote_plus() because urlencode() converts the parameters 
to strings first (which then puts a small "b" and single quotes around a 
textual representation of the bytes).  It just seems to me that 
instances of type "bytes" should be passed directly to quote_plus().  
That would complicate the code just a bit, but would end up being much 
more intuitive and useful.


2) If urlencode() relies so heavily on quote_plus(), then why doesn't it 
include the extra encoding-related parameters that quote_plus() takes?


3) Regarding the following code fragment in urlencode():

   k = quote_plus(str(k))
   if isinstance(v, str):
       v = quote_plus(v)
       l.append(k + '=' + v)
   elif isinstance(v, str):
       # is there a reasonable way to convert to ASCII?
       # encode generates a string, but "replace" or "ignore"
       # lose information and "strict" can raise UnicodeError
       v = quote_plus(v.encode("ASCII","replace"))
       l.append(k + '=' + v)

I don't understand how the "elif" section is invoked, as it uses the 
same condition as the "if" section.
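
A minimal sketch of the three points above, assuming the quote_plus() from urllib.parse in a current Python 3 release rather than the 3.0-era code under discussion. The function which_branch() is a hypothetical stand-in that only mirrors the if/elif structure of the fragment.

```python
from urllib.parse import quote_plus

raw = b"\xa0$"

# Point 1: stringifying bytes first quotes their repr(), not the octets.
via_str = quote_plus(str(raw))   # quotes the text b'\xa0$'
direct = quote_plus(raw)         # quotes the two raw bytes

# Point 2: quote_plus() accepts encoding/errors for str input.
utf8 = quote_plus("\u00a0")
latin1 = quote_plus("\u00a0", encoding="latin-1")
replaced = quote_plus("\u00a0", encoding="ascii", errors="replace")


# Point 3: hypothetical stand-in for the fragment's branch structure.
def which_branch(v):
    if isinstance(v, str):
        return "if"
    elif isinstance(v, str):   # same test again, so never reached
        return "elif"
    return "neither"


print(via_str)                                # b%27%5Cxa0%24%27
print(direct)                                 # %A0%24
print(utf8, latin1, replaced)                 # %C2%A0 %A0 %3F
print(which_branch("x"), which_branch(b"x"))  # if neither
```

With bytes input, neither branch of the fragment is taken, which is consistent with the observation that the "elif" can never run.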


Thanks in advance for any thoughts on this issue.  I could submit a 
patch for urlencode() to better explain my ideas if that would be useful.


- Dan


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] More on Py3K urllib -- urlencode()

2009-02-28 Thread Dan Mahn






Bill Janssen wrote:

Bill Janssen wrote:

Dan Mahn wrote:

  3) Regarding the following code fragment in urlencode():

   k = quote_plus(str(k))
   if isinstance(v, str):
       v = quote_plus(v)
       l.append(k + '=' + v)
   elif isinstance(v, str):
       # is there a reasonable way to convert to ASCII?
       # encode generates a string, but "replace" or "ignore"
       # lose information and "strict" can raise UnicodeError
       v = quote_plus(v.encode("ASCII","replace"))
       l.append(k + '=' + v)

I don't understand how the "elif" section is invoked, as it uses the
same condition as the "if" section.
  

This looks like a 2->3 bug; clearly only the second branch should be
used in Py3K.  And that "replace" is also a bug; it should signal an
error on encoding failures.  It should probably catch UnicodeError and
explain the problem, which is that only Latin-1 values can be passed in
the query string.  So the encode() to "ASCII" is also a mistake; it
should be "ISO-8859-1", and the "replace" should be a "strict", I think.

  
  
Sorry!  In 3.0.1, this whole thing boils down to

   l.append(quote_plus(k) + '=' + quote_plus(v))

Bill
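
The difference between "replace" and "strict" mentioned above can be sketched with plain str.encode() calls. The sample string is an assumption chosen so that one character fits in ISO-8859-1 and the other does not.

```python
text = "\u00c1\u0141"   # the first character fits in Latin-1; the second does not

# "replace" silently substitutes "?" and loses the original characters.
lossy = text.encode("ascii", "replace")

# "strict" raises instead, so the caller finds out about the problem.
try:
    text.encode("iso-8859-1", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A value that is representable in Latin-1 encodes without error.
latin1_ok = "\u00c1".encode("iso-8859-1", "strict")

print(lossy)          # b'??'
print(strict_failed)  # True
print(latin1_ok)      # b'\xc1'
```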
  


Thanks.  Generally, I would tend to agree.  I actually tried something
like that, but I found that I had inadvertently been sending numeric
values, in which case the str() was saving me.  Considering that, I
would rather just see something like ...

k = quote_plus(k) if isinstance(k,(str,bytes)) else quote_plus(str(k))
if isinstance(v,(str,bytes)):
    l.append(k + "=" + quote_plus(v))
else:
    # just keep what's in the else

I think it would be more compatible with existing code calling
urlencode().  Additionally, I think it would be nice to allow selection
of the quote_plus() string encoding parameters, but that was one of the
other points I listed.

A similar thing could be done when "not doseq", but the handling of "v"
would be exactly like that of "k".
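
A rough sketch of that idea for the "not doseq" case, assuming a current Python 3. The names quote_item() and encode_pairs() are hypothetical and not part of urllib.

```python
from urllib.parse import quote_plus


def quote_item(item):
    """Hypothetical helper: quote str/bytes as-is, stringify anything else."""
    if isinstance(item, (str, bytes)):
        return quote_plus(item)
    return quote_plus(str(item))


def encode_pairs(pairs):
    """Hypothetical non-doseq encoder that treats keys and values alike."""
    return "&".join(quote_item(k) + "=" + quote_item(v) for k, v in pairs)


print(encode_pairs([("a:", "b$"), (1, 2), (b"\xa0$", b"\xc1$")]))
# a%3A=b%24&1=2&%A0%24=%C1%24
```

Numeric values still go through str(), so existing callers that pass integers keep working, while bytes reach quote_plus() untouched.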

- Dan






Re: [Python-Dev] More on Py3K urllib -- urlencode()

2009-03-07 Thread Dan Mahn
After a harder look, I concluded there was a bit more work to be done, 
but the modifications are still very basic.


Attached is a version of urlencode() which seems to make the most sense 
to me.


I wonder how I could officially propose at least some of these 
modifications.


- Dan



def urlencode(query, doseq=0, safe='', encoding=None, errors=None):
    """Encode a sequence of two-element tuples or dictionary into a URL query
    string.

    If any values in the query arg are sequences and doseq is true, each
    sequence element is converted to a separate parameter.

    If the query arg is a sequence of two-element tuples, the order of the
    parameters in the output will match the order of parameters in the
    input.
    """

    if hasattr(query,"items"):
        # mapping objects
        query = query.items()
    else:
        # it's a bother at times that strings and string-like objects are
        # sequences...
        try:
            # non-sequence items should not work with len()
            # non-empty strings will fail this
            if len(query) and not isinstance(query[0], tuple):
                raise TypeError
            # zero-length sequences of all types will get here and succeed,
            # but that's a minor nit - since the original implementation
            # allowed empty dicts that type of behavior probably should be
            # preserved for consistency
        except TypeError:
            ty,va,tb = sys.exc_info()
            raise TypeError("not a valid non-string sequence or mapping "
                            "object").with_traceback(tb)

    l = []
    if not doseq:
        # preserve old behavior
        for k, v in query:
            k = quote_plus(k if isinstance(k, (str,bytes)) else str(k),
                           safe, encoding, errors)
            v = quote_plus(v if isinstance(v, (str,bytes)) else str(v),
                           safe, encoding, errors)
            l.append(k + '=' + v)
    else:
        for k, v in query:
            k = quote_plus(k if isinstance(k, (str,bytes)) else str(k),
                           safe, encoding, errors)
            if isinstance(v, str):
                v = quote_plus(v if isinstance(v, (str,bytes)) else str(v),
                               safe, encoding, errors)
                l.append(k + '=' + v)
            else:
                try:
                    # is this a sufficient test for sequence-ness?
                    x = len(v)
                except TypeError:
                    # not a sequence
                    v = quote_plus(str(v))
                    l.append(k + '=' + v)
                else:
                    # loop over the sequence
                    for elt in v:
                        elt = quote_plus(elt if isinstance(elt, (str,bytes))
                                         else str(elt), safe, encoding, errors)
                        l.append(k + '=' + elt)
    return '&'.join(l)
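
For comparison, urllib.parse.urlencode() in current Python 3 releases accepts safe, encoding and errors keywords of the same names, so the intended behavior can be tried without the attachment. This is a sketch against the standard library only; that the attached code produces identical output is assumed, not verified.

```python
from urllib.parse import urlencode

pairs = (("\u00a0", "\u00c1"), (b"\xa0$", b"\xc1$"), (1, 2), ("a:", "b$"))

# Default: str values are encoded as UTF-8, bytes are quoted octet by octet.
default = urlencode(pairs)

# Characters listed in "safe" are left unquoted.
keep = urlencode(pairs, safe=":$")

# "encoding" applies to str values only; bytes are unaffected.
latin1 = urlencode(pairs, encoding="latin-1")

# With doseq true, each element of a sequence value becomes its own parameter.
seq = urlencode((("d:", 0xe), (1, ("b", b"\x0c$", 0xd, "e$"))), doseq=True)

print(default)  # %C2%A0=%C3%81&%A0%24=%C1%24&1=2&a%3A=b%24
print(keep)     # %C2%A0=%C3%81&%A0$=%C1$&1=2&a:=b$
print(latin1)   # %A0=%C1&%A0%24=%C1%24&1=2&a%3A=b%24
print(seq)      # d%3A=14&1=b&1=%0C%24&1=13&1=e%24
```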


Re: [Python-Dev] More on Py3K urllib -- urlencode()

2009-03-09 Thread Dan Mahn
Yes, that was a good idea.  I found some problems, and attached a new 
version.  It looks more complicated than I wanted, but it is a very 
regular repetition, so I hope it is generally readable.


I used "doctest" to include the test scenarios.  I was not familiar with 
it before, but it seems to work quite well.  The main snag I hit was 
that I had to jazz around with the escape sequences (backslashes) in 
order to get the doc string to go in properly.  That is, the lines in 
the string are not the lines I typed at the command prompt, as Python is 
interpreting the escapes in the strings when the file is imported.
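
The escaping issue can be sketched as follows. In an ordinary docstring the backslashes have to be doubled, because otherwise a literal such as b'\xa0' reaches doctest with a decoded non-ASCII character inside a bytes literal, which does not compile. A raw docstring is one alternative. The function names and the run_examples() helper are hypothetical, for illustration only.

```python
import doctest


def first_byte_escaped(data):
    """Return the first byte of data as an int.

    Ordinary docstring: every backslash in the example is doubled.

    >>> first_byte_escaped(b'\\xa0\\x24')
    160
    """
    return data[0]


def first_byte_raw(data):
    r"""Return the first byte of data as an int.

    Raw docstring: the example can be pasted as typed at the prompt.

    >>> first_byte_raw(b'\xa0\x24')
    160
    """
    return data[0]


def run_examples(func):
    """Run func's doctests and return doctest's (failed, attempted) summary."""
    runner = doctest.DocTestRunner(verbose=False)
    finder = doctest.DocTestFinder()
    for test in finder.find(func, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


for func in (first_byte_escaped, first_byte_raw):
    print(func.__name__, run_examples(func))
```

Both variants hand doctest the same example text, so either style works; the raw form keeps the docstring identical to what was typed interactively.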


In an effort to make fewer tests, the lines of the test strings grew 
pretty long.  I'm not sure if I should try to cut the lengths down or not.


Any suggestions would be welcome before I try to submit this as a patch.

- Dan


Bill Janssen wrote:

Aahz  wrote:

  

On Sat, Mar 07, 2009, Dan Mahn wrote:

After a harder look, I concluded there was a bit more work to be done,  
but still very basic modifications.


Attached is a version of urlencode() which seems to make the most sense  
to me.


I wonder how I could officially propose at least some of these  
modifications.
  

Submit a patch to bugs.python.org



And it would help if it included a lot of test cases.

Bill
  
from urllib.parse import quote_plus
import sys


def urlencode(query, doseq=0, safe='', encoding=None, errors=None):
    """Encode a sequence of two-element tuples or dictionary into a URL query
    string.

    If any values in the query arg are sequences and doseq is true, each
    sequence element is converted to a separate parameter.

    If the query arg is a sequence of two-element tuples, the order of the
    parameters in the output will match the order of parameters in the
    input.

    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), (1, 2), ("a:", "b$")))
    '%C2%A0=%C3%81&%A0%24=%C1%24&1=2&a%3A=b%24'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), (1, 2), ("a:", "b$")), safe=":$")
    '%C2%A0=%C3%81&%A0$=%C1$&1=2&a:=b$'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), (1, 2), ("a:", "b$")), encoding="latin-1")
    '%A0=%C1&%A0%24=%C1%24&1=2&a%3A=b%24'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), (1, 2), ("a:", "b$")), safe="$:", encoding="latin-1")
    '%A0=%C1&%A0$=%C1$&1=2&a:=b$'

    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), ("d:", 0xe), (1, ("b", b'\\x0c\\x24', 0xd, "e$"))), 1)
    '%C2%A0=%C3%81&%A0%24=%C1%24&d%3A=14&1=b&1=%0C%24&1=13&1=e%24'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), ("d:", 0xe), (1, ("b", b'\\x0c\\x24', 0xd, "e$"))), 1, safe=":$")
    '%C2%A0=%C3%81&%A0$=%C1$&d:=14&1=b&1=%0C$&1=13&1=e$'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), ("d:", 0xe), (1, ("b", b'\\x0c\\x24', 0xd, "e$"))), 1, encoding="latin-1")
    '%A0=%C1&%A0%24=%C1%24&d%3A=14&1=b&1=%0C%24&1=13&1=e%24'
    >>> urlencode((("\\u00a0","\\u00c1"), (b'\\xa0\\x24', b'\\xc1\\x24'), ("d:", 0xe), (1, ("b", b'\\x0c\\x24', 0xd, "e$"))), 1, safe=":$", encoding="latin-1")
    '%A0=%C1&%A0$=%C1$&d:=14&1=b&1=%0C$&1=13&1=e$'

    >>> urlencode((("\\u00a0", "\\u00c1"),), encoding="ASCII", errors="replace")
    '%3F=%3F'
    >>> urlencode((("\\u00a0", (1, "\\u00c1")),), 1, encoding="ASCII", errors="replace")
    '%3F=1&%3F=%3F'

    """

    if hasattr(query,"items"):
        # mapping objects
        query = query.items()
    else:
        # it's a bother at times that strings and string-like objects are
        # sequences...
        try:
            # non-sequence items should not work with len()
            # non-empty strings will fail this
            if len(query) and not isinstance(query[0], tuple):
   

Re: [Python-Dev] More on Py3K urllib -- urlencode()

2009-03-10 Thread Dan Mahn
I submitted an explanation of this and my proposed modification as issue 
5468.


http://bugs.python.org/issue5468

- Dan






Re: [Python-Dev] More on Py3K urllib -- urlencode()

2009-03-10 Thread Dan Mahn
Ahh ... I see.  I should have done a bit more digging to find where the 
standard tests were.


I created a few new tests that could be included in that test suite -- 
see the attached file.  Do you think that this would be sufficient?


- Dan


Bill Janssen wrote:

Dan Mahn  wrote:


Yes, that was a good idea.  I found some problems, and attached a new
version.  It looks more complicated than I wanted, but it is a very
regular repetition, so I hope it is generally readable.


That's great, but I was hoping for more tests in lib/test/test_urllib.py.

Bill

def test_encoding(self):
    # Test for special character encoding
    given = (("\u00a0", "\u00c1"),)
    expect = '%3F=%3F'
    result = urllib.parse.urlencode(given, encoding="ASCII",
                                    errors="replace")
    self.assertEqual(expect, result)
    result = urllib.parse.urlencode(given, True, encoding="ASCII",
                                    errors="replace")
    self.assertEqual(expect, result)
    # ... now with default utf-8 ...
    given = (("\u00a0", "\u00c1"),)
    expect = '%C2%A0=%C3%81'
    result = urllib.parse.urlencode(given)
    self.assertEqual(expect, result)
    # ... now with latin-1 ...
    expect = '%A0=%C1'
    result = urllib.parse.urlencode(given, encoding="latin-1")
    self.assertEqual(expect, result)
    # ... now with sequence ...
    given = (("\u00a0", (1, "\u00c1")),)
    expect = '%3F=1&%3F=%3F'
    result = urllib.parse.urlencode(given, True, encoding="ASCII",
                                    errors="replace")
    self.assertEqual(expect, result)
    # ... again with default utf-8 ...
    given = (("\u00a0", "\u00c1"),)
    expect = '%C2%A0=%C3%81'
    result = urllib.parse.urlencode(given, True)
    self.assertEqual(expect, result)
    # ... again with latin-1 ...
    expect = '%A0=%C1'
    result = urllib.parse.urlencode(given, True, encoding="latin-1")
    self.assertEqual(expect, result)

def test_bytes(self):
    # Test for encoding bytes
    given = ((b'\xa0\x24', b'\xc1\x24'),)
    expect = '%A0%24=%C1%24'
    result = urllib.parse.urlencode(given)
    self.assertEqual(expect, result)
    # ... now with sequence ...
    result = urllib.parse.urlencode(given, True)
    self.assertEqual(expect, result)
    # ... now with safe and encoding ...
    expect = '%A0$=%C1$'
    result = urllib.parse.urlencode(given, safe=":$")
    self.assertEqual(expect, result)
    result = urllib.parse.urlencode(given, safe=":$", encoding="latin-1")
    self.assertEqual(expect, result)
    # ... again with sequence ...
    result = urllib.parse.urlencode(given, True, safe=":$")
    self.assertEqual(expect, result)
    result = urllib.parse.urlencode(given, True, safe=":$",
                                    encoding="latin-1")
    self.assertEqual(expect, result)
    # ... now with an actual sequence ...
    given = ((b'\xa0\x24', (b'\xc1\x24', 0xd)),)
    result = urllib.parse.urlencode(given, True, safe=":$")
    self.assertIn(expect, result)
    # 0xd is rendered as decimal 13
    expect2 = '%A0$=13'
    self.assertIn(expect2, result)