#36483: IntegerField will accept non-ASCII digits, which leads to the same page
appearing at many URLs
-------------------------------+--------------------------------------
Reporter: Morgan Wahl | Owner: (none)
Type: Bug | Status: new
Component: Uncategorized | Version: 5.2
Severity: Normal | Resolution:
Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+--------------------------------------
Old description:
> Hello,
>
> I was recently surprised to find that a simple detail view URL with a
> model ID in it was also accessible at a URL using "full width" digit
> characters. For example the page at "/pizza/123" could also be returned
> from "/pizza/１２３". That's the Unicode characters U+FF11 U+FF12 U+FF13.
> It turns out this is ultimately because the model `IntegerField` is using
> `int` to get an integer from the string that was originally in the URL.
> And I was surprised to find Python's `int` constructor uses
> `unicodedata.decimal` (or some equivalent) to translate from characters
> in a string to decimal digits.
>
> That was a cool accidental feature to discover, however now I'm concerned
> about URL canonicalization. Python 3.13.3 accepts _68_ different
> characters for each digit. This means the same content is hypothetically
> accessible from many, many URLs. I've heard that can make a site look
> spammy to search engines. And maybe this could be an element of a
> security hole if something is assuming there is only one URL for a given
> page.
>
> The SEO problem could be addressed by setting a `<link rel=canonical>` in
> the page to point to `Pizza.objects.get(pk=id).get_absolute_url()` or
> some similar logic, or you could address the problem as a whole by
> setting up redirects or 404 responses, but all those approaches require a
> separate implementation for every view, since the view code ultimately
> doesn't know which parts of the URL are going to be treated as values of
> an `IntegerField`.
>
> Possible solutions I can think of are either:
>
> 1. make some mechanism to very easily canonicalize URLs, by allowing
> users to somehow mark this situation explicitly in the URL conf, and then
> Django can set a property on the request object with the "canonicalized"
> URL. Then redirects or 404s or <link> tags could be implemented just once
> for all such URLs. (Redirects and 404s in a middleware, <link> tags in a
> base template.)
> 2. Don't just pass strings to `int` in the model `IntegerField`. Instead
> only allow strings with ASCII digits to be used.
New description:
Hello,

I was recently surprised to find that a simple detail view URL with a
model ID in it was also accessible at a URL using "full width" digit
characters. For example the page at "/pizza/123" could also be returned
from "/pizza/１２３". That's the Unicode characters U+FF11 U+FF12 U+FF13.
It turns out this is ultimately because the model `IntegerField` is using
`int` to get an integer from the string that was originally in the URL.
And I was surprised to find Python's `int` constructor uses
`unicodedata.decimal` (or some equivalent) to translate from characters in
a string to decimal digits.
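
The behaviour described above can be reproduced in a plain Python session, without Django. This is only a minimal sketch; the variable names are illustrative and do not come from the ticket:

```python
import unicodedata

# "/pizza/１２３" carries U+FF11 U+FF12 U+FF13 (FULLWIDTH DIGIT ONE..THREE).
fullwidth = "\uff11\uff12\uff13"

# int() accepts any character with a Unicode decimal value, not only ASCII.
assert int(fullwidth) == 123
assert int(fullwidth) == int("123")

# unicodedata.decimal() reports the value of each individual character.
digits = [unicodedata.decimal(ch) for ch in fullwidth]
print(digits)  # [1, 2, 3]

# Digits from different scripts can even be mixed within one string.
arabic_indic_four = "\u0664"  # ARABIC-INDIC DIGIT FOUR
mixed = "1" + arabic_indic_four + fullwidth[2]
assert int(mixed) == 143
```

Any lookup that ends in `int(value_from_url)` therefore treats all of these spellings as the same ID.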

That was a cool accidental feature to discover; however, now I'm concerned
about URL canonicalization. Python 3.13.3 accepts _68_ different
characters for each digit. This means the same content is hypothetically
accessible from many, many URLs. I've heard that can make a site look
spammy to search engines. And maybe this could be an element of a security
hole if something is assuming there is only one URL for a given page.
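
One way to check the per-digit count on a given interpreter is to scan every code point. This is a sketch only: the function name `decimal_digit_variants` is made up here, and the numbers it prints depend on the Unicode database bundled with the running Python, so they may differ from the figure quoted above.

```python
import sys
import unicodedata
from collections import defaultdict


def decimal_digit_variants():
    """Map each value 0-9 to every character that has that decimal value."""
    variants = defaultdict(list)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        # The second argument is returned for characters with no decimal value.
        value = unicodedata.decimal(ch, None)
        if value is not None:
            variants[value].append(ch)
    return dict(variants)


variants = decimal_digit_variants()
for value in sorted(variants):
    print(value, len(variants[value]))
```

Taking the ticket's figure of 68 spellings per digit at face value, a three-digit ID such as 123 would be reachable under 68 ** 3 = 314,432 different digit strings.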

The SEO problem could be addressed by setting a `<link rel=canonical>` in
the page to point to `Pizza.objects.get(pk=id).get_absolute_url()` or some
similar logic, or you could address the problem as a whole by setting up
redirects or 404 responses, but all those approaches require a separate
implementation for every view, since Django's code ultimately doesn't know
which parts of the URL are going to be treated as values of an
`IntegerField`. (Django's `DetailView` could implement some logic for
this; however, in reality there are lots of other situations where people
take a string from the URL and use it to look up a record with an
`IntegerField`.)
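
The per-view workaround described above might look roughly like the following. The helper names `canonical_pk_segment` and `needs_redirect` are hypothetical, and the Django wiring is only indicated in a comment because it depends on the view and on a URL name (`"pizza-detail"`) that is assumed here:

```python
from typing import Optional


def canonical_pk_segment(raw: str) -> Optional[str]:
    """Return the canonical ASCII spelling of an integer URL segment.

    Returns None when int() rejects the segment altogether.
    """
    try:
        return str(int(raw))
    except ValueError:
        return None


def needs_redirect(raw: str) -> bool:
    """True when the segment parses as an integer but is not canonical."""
    canonical = canonical_pk_segment(raw)
    return canonical is not None and canonical != raw


# Inside a view this could be used along these lines (assumed URL name):
#
#     if needs_redirect(pk):
#         return redirect("pizza-detail", pk=canonical_pk_segment(pk),
#                         permanent=True)

print(needs_redirect("\uff11\uff12\uff13"))  # True
print(needs_redirect("123"))                 # False
```

Comparing against `str(int(raw))` also flags other non-canonical spellings that `int` accepts, such as leading zeros or a leading `+`. Each view that takes the ID as a string would still need to call something like this itself, which is the duplication the ticket is concerned about.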

Possible solutions I can think of are either:

1. Make some mechanism to very easily canonicalize URLs, by allowing users
to somehow mark this situation explicitly in the URL conf, and then Django
can set a property on the request object with the "canonicalized" URL.
Then redirects or 404s or <link> tags could be implemented just once for
all such URLs. (Redirects and 404s in a middleware, <link> tags in a base
template.)
2. Don't just pass strings to `int` in the model `IntegerField`. Instead
only allow strings with ASCII digits to be used.
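
As an illustration of what option 2 could mean in practice, here is a minimal sketch of an ASCII-only parser. It is not Django's implementation; `parse_ascii_int` and the exact pattern are assumptions made for this example.

```python
import re

# [0-9] is used on purpose: in str patterns, \d also matches non-ASCII
# decimal digits unless re.ASCII is set.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def parse_ascii_int(value: str) -> int:
    """Convert a string to int, accepting ASCII digits only."""
    if not _ASCII_INT.fullmatch(value):
        raise ValueError(f"not an ASCII integer literal: {value!r}")
    return int(value)


print(parse_ascii_int("123"))  # 123

try:
    parse_ascii_int("\uff11\uff12\uff13")
except ValueError as exc:
    print("rejected:", exc)
```

This pattern is stricter than `int` in other ways too: it rejects surrounding whitespace and underscore separators such as `"1_000"`. Whether a real fix should keep accepting those is a separate design question.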
--
Comment (by Morgan Wahl):
(Clarified possible workarounds.)
--
Ticket URL: <https://code.djangoproject.com/ticket/36483#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.