[issue40980] group names of bytes regexes are strings

2020-06-14 Thread Quentin Wenger


New submission from Quentin Wenger :

I noticed that match.groupdict() returns string keys, even for a bytes regex:

```
>>> import re
>>> re.match(b"(?P<a>)", b"").groupdict()
{'a': b''}
```

This seems somewhat strange, because string and bytes matching in re are kind 
of two separate parts, cf. doc:

> Both patterns and strings to be searched can be Unicode strings (str) as well 
> as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot 
> be mixed: that is, you cannot match a Unicode string with a byte pattern or 
> vice-versa; similarly, when asking for a substitution, the replacement string 
> must be of the same type as both the pattern and the search string.

--
components: Regular Expressions
messages: 371516
nosy: ezio.melotti, matpi, mrabarnett
priority: normal
severity: normal
status: open
title: group names of bytes regexes are strings
type: behavior
versions: Python 3.8

___
Python tracker 
<https://bugs.python.org/issue40980>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40984] re.compile's repr truncates patterns at 199 characters

2020-06-15 Thread Quentin Wenger


New submission from Quentin Wenger :

This seems somewhat arbitrary and yields unusable results, going against the 
doc:

> repr(object)
> Return a string containing a printable representation of an object. For many 
> types, this function makes an attempt to return a string that would yield an 
> object with the same value when passed to eval(), otherwise the 
> representation is a string enclosed in angle brackets that contains the name 
> of the type of the object together with additional information often 
> including the name and address of the object. A class can control what this 
> function returns for its instances by defining a __repr__() method.

The truncated representation neither "yields an object with the same value" (it 
raises a SyntaxError, of course, due to the missing quote and closing 
parenthesis), nor is "enclosed in angle brackets".


```
>>> import re
>>> re.compile("()"*99)
re.compile('()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()')
>>> re.compile("()"*100)
re.compile('()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()()
```
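
A minimal sketch of how the round-trip property could be checked, assuming a hypothetical helper named `repr_round_trips` (not part of `re`). The result for the long pattern depends on the interpreter version, so no output is claimed for it.

```python
import re

def repr_round_trips(pattern):
    """Return True if eval(repr(compiled)) rebuilds an equal pattern object."""
    compiled = re.compile(pattern)
    try:
        rebuilt = eval(repr(compiled), {"re": re})
    except SyntaxError:
        # A truncated repr lacks its closing quote, so it cannot be parsed.
        return False
    return rebuilt == compiled

print(repr_round_trips("()" * 10))    # short pattern
print(repr_round_trips("()" * 100))   # long pattern; depends on the interpreter version
```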

--
components: Regular Expressions
messages: 371541
nosy: ezio.melotti, matpi, mrabarnett
priority: normal
severity: normal
status: open
title: re.compile's repr truncates patterns at 199 characters
type: behavior
versions: Python 3.8

___
Python tracker 
<https://bugs.python.org/issue40984>
___



[issue40984] re.compile's repr truncates patterns at 199 characters

2020-06-15 Thread Quentin Wenger


Quentin Wenger  added the comment:

Note: it actually truncates at 200 characters, counting the initial quote of 
the argument's repr.

--




[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-15 Thread Quentin Wenger


Change by Quentin Wenger :


--
title: re.compile's repr truncates patterns at 199 characters -> re.compile's 
repr truncates patterns at 200 characters




[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Quentin Wenger


Quentin Wenger  added the comment:

This also affects functions/methods expecting a group name as parameter (e.g. 
match.group), the group name has to be passed as string.
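
A short sketch of the lookup behaviour described above, using an ASCII group name so that it does not depend on how non-ASCII names are handled. The exception type shown is what I would expect from CPython; treat it as an assumption for other implementations.

```python
import re

m = re.match(b"(?P<a>x)", b"x")

# The group name embedded in the bytes pattern is looked up as a str.
print(m.group("a"))      # the matched bytes
print(m.groupdict())     # keys are str, values are bytes

# Passing the same name as bytes is not accepted.
try:
    m.group(b"a")
    bytes_name_ok = True
except IndexError:
    bytes_name_ok = False
print(bytes_name_ok)
```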

--




[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

Pardon me, but I see an important difference with the other bug report: that 
one is about a repr in angle brackets, and as such does not require an exact 
output, so an ellipsis is good enough.

In this bug, the output of repr gives a string that can, at least for small 
enough patterns, be passed to eval() to reconstruct an object. So there is no 
good reason that this can be done for patterns up to 200 characters but not 
above; furthermore it is undocumented and goes against the doc on repr.

Compare with a complexly-nested structure of, say, lists, dicts and strings: 
The repr will always be "reconstructible", even if it is well above 200 
characters.

Also, a common way to write repr is to draw the outer "container" as a string, 
and fill it with the (full!) repr of the object's parameters. E.g. the repr of 
a list containing a 1000-character string will simply write square brackets 
around the 1002-character repr of the string. re.compile doesn't conform to 
this "rule".
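
The container example can be checked directly; this is only an illustration of the "rule" with a one-element list.

```python
long_string = "x" * 1000
container = [long_string]

text = repr(container)
print(len(repr(long_string)))    # 1002: the 1000 characters plus two quotes
print(len(text))                 # 1004: the string repr plus two brackets
print(eval(text) == container)   # True: the full repr is reconstructible
```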

--
resolution: duplicate -> 
status: closed -> open




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

Agreed to some extent, but there is the difference that group names are 
embedded in the pattern, which has to be bytes if the target is bytes.

My use case is in an all-bytes, no-string project where I construct a large 
regular expression at startup, with semi-dynamical group names.

So it seems natural to have everything in bytes to concatenate the regular 
expression, incl. the group names.

But then group names that I receive back are strings, so I cannot look them up 
directly into the set of group names that I used to create the expression in 
the first place.

Of course I can live with it by storing them as strings in the first place and 
encode()'ing them during concatenation, but it does not feel "natural".
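
A sketch of that workaround, with made-up group names (`alpha`, `beta`) and a made-up sub-pattern. The names are kept as str and encoded only while the bytes pattern is assembled.

```python
import re

# Hypothetical group names, kept as str so they compare equal to what re returns.
names = ["alpha", "beta"]

# Encode each name only while assembling the bytes pattern.
pattern = b"|".join(
    b"(?P<" + name.encode("ascii") + b">" + name.encode("ascii") + b"\\d+)"
    for name in names
)
regex = re.compile(pattern)

m = regex.match(b"beta42")
print(m.lastgroup)            # name of the group that matched, as str
print(m.lastgroup in names)   # direct lookup works because both sides are str
```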

Furthermore, even if it is "just a name", a non-ascii group name will raise an 
error in bytes, even if encoded...:

```
>>> re.compile("(?P<" + "é" + ">)")
re.compile('(?P<é>)')
>>> re.compile(b"(?P<" + "é".encode() + b">)")
Traceback (most recent call last):
  File "", line 1, in 
re.compile(b"(?P<" + "é".encode() + b">)")
  File "/usr/lib/python3.8/re.py", line 252, in compile
return _compile(pattern, flags)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'é' at position 4
```

So no, it's not really "just a name", considering that in Python "é" should is 
a valid name.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

should *be a valid name

--




[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

All in all, it is simply a matter of compliance. The doc of repr says that a 
repr is either

- a string that can be eval()'ed back to (an equivalent of) the original object
- or a "more loose" angle-bracket representation.

re.compile with small patterns falls in the first category. The other bug 
report corresponds to the second one, no problem.

However, re.compile with large patterns doesn't fall in either category, nor 
would it if changed to use an ellipsis.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

Of course an inconvenience in my program is not per se the reason to change the 
language. I just wanted to motivate that the current situation gives unexpected 
results.

"\xe9" doesn't look like proper utf-8 to me:

```
>>> "é".encode("latin-1")
b'\xe9'
>>> "é".encode()
b'\xc3\xa9'
```

Let's try another one: how would you go for Δ ("\u0394") as a group name?


```
>>> "Δ".encode()
b'\xce\x94'
>>> "Δ".encode("latin-1")
Traceback (most recent call last):
  File "", line 1, in 
"Δ".encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0394' in position 
0: ordinal not in range(256)
>>> re.match(b'(?P<\xce\x94>)', b'').groupdict()
Traceback (most recent call last):
  File "", line 1, in 
re.match(b'(?P<\xce\x94>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
>>> re.match(b'(?P<\u0394>)', b'').groupdict()
Traceback (most recent call last):
  File "", line 1, in 
re.match(b'(?P<\u0394>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name '\\u0394' at position 4
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

> So b'\xe9' is mapped to \u00e9, it is `é`.

Yes but \xe9 is not strictly valid utf-8, or say not the canonical 
representation of "é". So there is no way to get \xe9 starting from é without 
leaving utf-8. So starting with é as group name, I cannot programmatically 
encode it into a bytes pattern.

> Of course, characters with Unicode code point greater than 0xff are 
> impossible to appear in `bytes`.

But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) 
in a group name fails.

According to the doc, the sole constraint on group names is that they have to 
be valid and unique Python identifiers. So this should work:

```
# Δ is a valid identifier
>>> "Δ".isidentifier()
True
>>> Δ = 1
>>> Δ
1
>>> import re
>>> name = "Δ"
>>> re.match(b"(?P<" + name.encode() + b">)", b"")
Traceback (most recent call last):
  File "", line 1, in 
re.match(b"(?P<" + name.encode() + b">)", b"")
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
re.match(b'(?P<\xce\x94>)', b'').groupdict()
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

But Δ has no latin-1 representation. So Δ currently cannot be used as a group 
name in bytes regex, although it is a valid Python identifier. So that's a bug.

I mean, if you insist on having group names as strings even for bytes regexes, 
then it is not reasonable to prevent them from going _in_.

b"(??<\xce\x94>)" is a valid utf-8-encoded bytestring, why wouldn't you accept 
it as a valid re pattern?

IMHO, either

- group names from byte regexes should be returned as bytes
- or any utf-8-encoded representation of a valid Python identifier should be 
accepted as a group name of a bytes regex pattern.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

Sorry, b"(?P<\xce\x94>)"

--




[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

For a bit of background, the other issue is about the repr of compiled 
patterns, not match objects.
Please see my argument there about the conformance to repr's doc - merely 
adding an ellipsis would _not_ solve this case.

I have however nothing against the pattern being truncated/ellipsed when inside 
the repr of a match object.

--
nosy: +matpi

___
Python tracker 
<https://bugs.python.org/issue39949>
___



[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

The issue with the second variant is that utf-8 is an arbitrary (although 
default) choice.

But: re is doing that same arbitrary choice already in decoding the group names 
into a string, which is my original complaint!

--




[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

@eric.smith thanks, no problem.

If I can give any advice on this present issue, I would suggest to have the 
ellipsis _inside_ the quote, to make clear that the pattern is being truncated, 
not the match. So instead of

```
<_sre.SRE_Match object; span=(0, 49), 
match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
```

as suggested by @Seth.Troisi, I'd suggest

```
<_sre.SRE_Match object; span=(0, 49), 
match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS...'>
```
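
The second variant could be produced along these lines. `truncated_repr` and its `limit` parameter are hypothetical names for illustration, not the code used by `re`.

```python
def truncated_repr(text, limit=45):
    """Hypothetical helper: repr of text cut to `limit` characters, with the
    ellipsis placed inside the closing quote."""
    if len(text) <= limit:
        return repr(text)
    shown = repr(text[:limit])
    # `shown` ends with its closing quote; insert the dots just before it.
    return shown[:-1] + "..." + shown[-1]

alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(truncated_repr(alphabet))
```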

--




[issue40984] re.compile's repr truncates patterns at 200 characters

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

I welcome any counter-example to the eval()'able property in the stdlib.

I do believe in this rule as hard and fast, because it works for small 
patterns, only biting you when you grow, probably programmatically (so exactly 
when you actually could need the repr).

Furthermore, 200 seems very low anyway by today's standards. I mean, if you want 
a repr in the first place, then chances are that you want it full if 
(reasonably) possible.

If a string repr's itself fully no matter what, why should re.compile 
arbitrarily decide to truncate its argument?

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

> It seems you don't know some knowledge of encoding yet.

I don't have to be ashamed of my knowledge of encoding. Yet you are right that 
I was missing a subtlety, which is that latin-1 is a strict subset of Unicode 
rather than a completely arbitrary encoding. Thank you for that.

So what you are saying is that group names in bytes regexes can only be 
specified directly (without -explicit- encoding), so de facto they are limited 
to the latin-1 subset.

Very well.

But then, once again:

1) why convert them to string when spitting them out? bytes they were when 
going in, bytes they should remain... **By converting them you are choosing an 
arbitrary encoding, even if it is the "natural" one.**
2) this limitation to the latin-1 subset is not compatible with the 
documentation, which says that valid Python identifiers are valid group names. 
If this was really the case, then I would expect to be able to use any string 
for which .isidentifier() is true as a group name, programmatically.
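
To make point 2 concrete, here is a small check of which identifiers have a latin-1 byte representation at all. `fits_latin1` is a hypothetical helper name.

```python
def fits_latin1(name):
    """Return True if every character of `name` has a latin-1 byte."""
    try:
        name.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True

# Each name is a valid identifier, but only some fit into latin-1.
for name in ["a", "é", "Δ"]:
    print(name, name.isidentifier(), fits_latin1(name))
```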

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

I prove my point that the decoding to string is arbitrary:

```
>>> import re
>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>>> name == orig_name
False
>>> name
'Ø'
>>> name.encode("latin-1") == orig_ch
True
```

For any dynamically-constructed bytes regex pattern, a string group name as 
output is unusable. Only after latin-1-reencoding can it be safely compared. 
This latin-1 choice is arbitrary.
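
A sketch of the re-encoding step, assuming the latin-1 decoding observed above. `groupdict_bytes` is a hypothetical helper, and the example uses ASCII names so that it does not rely on non-ASCII names being accepted.

```python
import re

def groupdict_bytes(match):
    """Hypothetical helper: re-encode str group names to bytes with latin-1."""
    return {name.encode("latin-1"): value
            for name, value in match.groupdict().items()}

m = re.match(b"(?P<key>\\w+)=(?P<value>\\w+)", b"k=v")
print(groupdict_bytes(m))
```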

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

> > this limitation to the latin-1 subset is not compatible with the 
> > documentation, which says that valid Python identifiers are valid group 
> > names.
> 
> Not all latin-1 characters are valid identifier, for example:
> 
> >>> '\x94'.encode('latin1')
> b'\x94'
> >>> '\x94'.isidentifier()
> False

True but that's not the point. Δ is a valid Python identifier but not a valid 
group name in bytes regexes, because it is not in the latin-1 plane. The 
documentation does not mention this.


> There is a workaround, you can convert `bytes` to `str` with "latin-1" 
> decoder before processing, IIRC there will be no extra overhead 
> (memory/speed) during processing, then the name and content are the same 
> type. :)

I am not searching for a workaround for my current code.

And the simplest workaround is to latin-1-convert back to bytes, because re 
should not latin-1-convert to string in the first place.

Are you saying that the proper way to use bytes regexes is to use string 
regexes instead?


> Please look at these:
> 
> >>> orig_name = "Ř"
> >>> orig_ch = orig_name.encode("cp1250") # Because why not?
> >>> orig_ch
> b'\xd8'
> >>> name = list(re.match(b"(?P<" + orig_ch + b">)", 
> b"").groupdict().keys())[0]
> >>> name
> 'Ø'  # '\xd8'
> >>> name == orig_name
> False
> >>> name.encode("latin-1")
> b'\xd8'
> >>> name.encode("latin-1") == orig_ch
> True
> 
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise, I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be 
valid Python identifiers) can have the same single-byte representation, simply 
by means of different encodings (duh).

So why convert group names to strings when outputting them from matches, when 
you don't know where the bytes come from, or even whether they ever were 
strings? That should be left to the programmer.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

And there's no need for a cryptic encoding like cp1250 for this problem to 
arise. Here is a simple example with Python's default encoding utf-8:

```
>>> a = "ú"
>>> b = list(re.match(b"(?P<" + a.encode() + b">)", b"").groupdict())[0]
>>> a.isidentifier()
True
>>> b.isidentifier()
True
>>> b
'Ãº'
>>> a.encode() == b.encode("latin1")
True
```

For reference, here is the very source of the issue: 
https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

The problem can also be played in reverse, maybe it is more telling:

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# maybe we can try to infer it by decoding the bytestring?
# let's try to do it with the default encoding... that's natural, right?
>>> p.decode()
'(?P<ú>)'

# so we can reasonably expect the group name to be ú, right?
>>> list(re.compile(p).groupindex.keys()).pop()
'Ãº'

# Fail.
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

You questioned my knowledge of encodings. Let's quote from one of the most 
famous introductory articles on the subject 
(https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

> It does not make sense to have a string without knowing what encoding it uses

So I have that bytestring that comes from somewhere, maybe it was originally 
utf-8 or cp1250 or ... encoded, but I won't tell or don't know, the only thing 
I swear is that it originally was a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so 
that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my 
bytestring originally represented, resp. what it originally was encoded with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring 
to have been latin-1 encoded!

So: latin-1 is an arbitrary choice that is no better than any other, and the 
fact that it "naturally" converts bytes to unicode code points is an 
implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in 
the docs that group names come out as latin-1-encoded strings, with all the 
restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding 
altogether.

--




[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

Oh ok, I was misled by the example in your first message, where you did have 
both the quote and ellipsis.

I don't have a strong opinion.
- having the quote is a bit more "clean"
- but not having it makes clear that the pattern is truncated (per se, three 
dots is a valid pattern)

The best would be to find a precedent in the stdlib, but I currently cannot 
think of any either.

--




[issue39949] truncating match in regular expression match objects repr

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

File objects are an example of an angle-bracket repr with string parameters in 
the repr, but no truncation is performed (see 
https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912).

Various truncations with the same (lack of?) clarity are done in the stdlib, 
see eg. 
https://github.com/python/cpython/blob/04fc4f2a46b2fd083639deb872c3a3037fdb47d6/Objects/longobject.c#L2475.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

I just had an "aha moment": What re claims is that, rather than doing as I 
suggested:

> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the 
(unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```

This way of going from bytes to strings _naively_ (which happens to be called 
latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be 
the same value, just because they "look the same" in the source code.

This is like throwing away everything we ever learned about Unicode and how a 
code point is fundamentally different from what is stored in memory.
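
The analogy can be spelled out in a few lines. This only illustrates the comparison and says nothing about how `re` is implemented.

```python
# Same digits, different prefixes, different values.
print(0x10, 0b10, 0o10)        # 16 2 8

# Same bytes, different codecs, different text.
raw = b"\xc3\xba"
print(raw.decode("utf-8"))     # ú
print(raw.decode("latin-1"))   # Ãº
```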

--




[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

Because utf-8 is Python's default encoding, e.g. in source files, decode() and 
encode(). Literally everywhere.

If you ask around "I have a bytestring, I need a string, what do I do?", using 
latin-1 will not be the first answer (and moreover, the correct answer should 
be "it depends on the encoding", which re happily ignores by just asserting 
one).

Saying "just strip that b prefix, it's fine" cannot be taken seriously.

Yes latin-1 will never give an error on converting a bytestring, because it has 
full coverage of the 256 byte values, but saying that this is the reason why it 
should be used instead of another is forgetting why we have Unicode in the 
first place. **It is just pretending that Unicode never was a thing**. It is 
not because it can decode any bytestring that it will not return garbage _when 
the bytestring is not latin-1-encoded in the first place_.
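
A small demonstration of both halves of that claim: latin-1 decoding cannot fail, and for that very reason it cannot signal that the bytes were produced with another encoding.

```python
all_bytes = bytes(range(256))

# latin-1 maps byte value n to code point n, so decoding cannot fail.
text = all_bytes.decode("latin-1")
print([ord(ch) for ch in text] == list(all_bytes))   # True

# utf-8 rejects invalid byte sequences instead of guessing.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                                       # False

# No error does not mean a correct result: utf-8 bytes read as latin-1.
garbled = "Δ".encode("utf-8").decode("latin-1")
print(garbled == "Δ")                                # False
```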

Take a look at the documentation: https://docs.python.org/3/howto/unicode.html
7 references to latin-1, none saying that latin-1 is the way to go because it 
is so much better than anything else.

latin-1 used to be prominent in the 2.x world, it should slowly be time to 
recognize that this is over, and we cannot ignore anymore that encoding is a 
thing.

--




[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

If I don't have to think about the str -> bytes direction, re should first stop 
going in the other direction.

When I have bytes regexes I actually don't care about strings and would happily 
receive group names as bytes. But no, re decides that latin-1 is the way to go, 
and this way it 1) reduces my freedom in the choice of the group names, 2) 
makes me need to go read the internals to understand that the encoding it 
arbitrarily chose is latin-1, so that I can undo it properly and get back what 
I always wanted - a bytes group name.

--




[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

bytes are _not_ Unicode code points, not even in the 256 range. End of the 
story.

--




[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger


Quentin Wenger  added the comment:

A further difficulty exists for bytes regexes, because non-ascii bytes are 
repr'ed there using escape sequences. So there's a risk of cutting one in the 
middle.

```
>>> import re
>>> re.match(b".*", b"\xce")
<re.Match object; span=(0, 1), match=b'\xce'>
```
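
A sketch of the risk, independent of `re`: cutting the repr text at a fixed width can split an escape sequence, whereas cutting the bytes first keeps every escape whole. The cut positions are arbitrary choices for illustration.

```python
data = b"\xce\x94"
full = repr(data)        # each non-ascii byte is shown as a 4-character escape
print(full)              # b'\xce\x94'

# Cutting the repr text at a fixed width can split an escape sequence.
print(full[:8])          # b'\xce\x

# Cutting the bytes first and taking the repr afterwards keeps escapes whole.
print(repr(data[:1]))    # b'\xce'
```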

--




[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger


Quentin Wenger  added the comment:

And ascii escapes should also not be forgotten.

```
>>> re.match(b".*", b"\t")
<re.Match object; span=(0, 1), match=b'\t'>
>>> re.match(".*", "\t")
<re.Match object; span=(0, 1), match='\t'>
```

--




[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger


Quentin Wenger  added the comment:

(but those are one-character escapes, so that should be fine - either the 
escape is complete or the backslash is trailing and can be "peeled of")

--




[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger


Quentin Wenger  added the comment:

*off

--




[issue39949] truncating match in regular expression match objects repr

2020-06-19 Thread Quentin Wenger


Quentin Wenger  added the comment:

Other pathological case: literal backslashes

```
>>> re.match(".*", r"\\")
<re.Match object; span=(0, 2), match='\\\\'>
```

--
