#36483: IntegerField will accept non-ASCII digits, which leads to the same page
appearing at many URLs
-------------------------------+--------------------------------------
     Reporter:  Morgan Wahl    |                    Owner:  (none)
         Type:  Bug            |                   Status:  new
    Component:  Uncategorized  |                  Version:  5.2
     Severity:  Normal         |               Resolution:
     Keywords:                 |             Triage Stage:  Unreviewed
    Has patch:  0              |      Needs documentation:  0
  Needs tests:  0              |  Patch needs improvement:  0
Easy pickings:  0              |                    UI/UX:  0
-------------------------------+--------------------------------------

Old description:

> Hello,
>
> I was recently surprised to find that a simple detail view URL with a
> model ID in it was also accessible at a URL using "full width" digit
> characters. For example the page at "/pizza/123" could also be returned
> from "/pizza/１２３". That's the Unicode characters U+FF11 U+FF12 U+FF13.
> It turns out this is ultimately because the model `IntegerField` is using
> `int` to get an integer from the string that was originally in the URL.
> And I was surprised to find Python's `int` constructor uses
> `unicodedata.decimal` (or some equivalent) to translate from characters
> in a string to decimal digits.
>
> That was a cool accidental feature to discovery, however now I'm
> concerned about URL canonicalization. Python 3.13.3 accepts _68_
> different characters for each digit. This means the same content is
> hypothetically accessible from many, many URLs. I've heard that can make
> a site look spammy to search engines. And maybe this could be an element
> of a security hole if something is assuming there is only one URL for a
> given page.
>
> The SEO problem could be addressed by setting a `<link rel=canonical>` in
> the page to point to `Pizza.objects.get(pk=id).get_absolute_url()` or
> some similar logic, or you could address the problem as a whole by
> setting up redirects or 404 responses, but all those approaches require a
> separate implementation for every view, since the view code ultimately
> doesn't know which parts of the URL are going to be treated as values of
> a `IntegerField`.
>
> Possible solutions I can think of are either:
>
> 1. make some mechanism to very easily canonicalize URLs, by allowing
> users to somehow mark this situation explicitly in the URL conf, and then
> Django can set a property on the request object with the "canonicalized"
> URL. Then redirects or 404s or <link> tags could be implemented just once
> for all such URLs. (Redirects and 404s in a middleware, <link> tags in a
> base template.)
> 2. Don't just pass strings to `int` in the model `IntegerField`. Instead
> only allow strings with ASCII digits to be used.

New description:

 Hello,

 I was recently surprised to find that a simple detail view URL with a
 model ID in it was also accessible at a URL using "full width" digit
 characters. For example the page at "/pizza/123" could also be returned
 from "/pizza/１２３". That's the Unicode characters U+FF11 U+FF12 U+FF13.
 It turns out this is ultimately because the model `IntegerField` is using
 `int` to get an integer from the string that was originally in the URL.
 And I was surprised to find Python's `int` constructor uses
 `unicodedata.decimal` (or some equivalent) to translate from characters in
 a string to decimal digits.
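
 A minimal sketch of the behaviour described above, using only the standard
 library. The variable names are illustrative, and the full-width digits are
 written as escapes so they survive copy and paste:

```python
import unicodedata

# The characters U+FF11 U+FF12 U+FF13 from the example URL.
fullwidth = "\uff11\uff12\uff13"

# int() accepts any character that has a Unicode decimal-digit value,
# so both spellings produce the same integer.
assert int(fullwidth) == 123
assert int(fullwidth) == int("123")

# unicodedata.decimal() exposes the per-character mapping.
digits = [unicodedata.decimal(ch) for ch in fullwidth]
print(digits)  # [1, 2, 3]

# str.isascii() is one way to tell the two spellings apart.
print(fullwidth.isascii(), "123".isascii())  # False True
```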

 That was a cool accidental feature to discover, but now I'm concerned
 about URL canonicalization. Python 3.13.3 accepts _68_ different
 characters for each digit. This means the same content is hypothetically
 accessible from many, many URLs. I've heard that can make a site look
 spammy to search engines. And maybe this could be an element of a security
 hole if something is assuming there is only one URL for a given page.
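
 The per-digit count can be checked with a rough sketch like the one below.
 It assumes that `unicodedata.decimal` reflects the same set of characters
 that `int` accepts; the exact numbers depend on the Unicode database
 bundled with the running Python, so no output is shown here:

```python
import sys
import unicodedata
from collections import defaultdict

# Group every code point that has a decimal-digit value by that value.
by_value = defaultdict(list)
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    value = unicodedata.decimal(ch, None)  # None for non-digit characters
    if value is not None:
        by_value[value].append(ch)

# Print how many distinct characters stand for each digit 0-9.
for value in range(10):
    print(value, len(by_value[value]))
```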

 The SEO problem could be addressed by setting a `<link rel=canonical>` in
 the page to point to `Pizza.objects.get(pk=id).get_absolute_url()` or some
 similar logic, or you could address the problem as a whole by setting up
 redirects or 404 responses, but all those approaches require a separate
 implementation for every view, since the view code ultimately doesn't know
 which parts of the URL are going to be treated as values of an
 `IntegerField`.

 Possible solutions I can think of are either:

 1. make some mechanism to very easily canonicalize URLs, by allowing users
 to somehow mark this situation explicitly in the URL conf, and then Django
 can set a property on the request object with the "canonicalized" URL.
 Then redirects or 404s or <link> tags could be implemented just once for
 all such URLs. (Redirects and 404s in a middleware, <link> tags in a base
 template.)
 2. Don't just pass strings to `int` in the model `IntegerField`. Instead
 only allow strings with ASCII digits to be used.
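
 Both ideas can be sketched in plain Python. This is only an illustration
 under stated assumptions, not a proposed patch: `parse_ascii_int` and
 `canonicalize_digits` are hypothetical helper names, not Django APIs, and
 allowing an optional sign is an assumption.

```python
import re
import unicodedata

# [0-9] rather than \d: for str patterns, \d also matches non-ASCII digits.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def parse_ascii_int(text: str) -> int:
    """Hypothetical strict parser for option 2: ASCII digits only."""
    if not _ASCII_INT.fullmatch(text):
        raise ValueError(f"not an ASCII integer: {text!r}")
    return int(text)


def canonicalize_digits(path: str) -> str:
    """Hypothetical helper for option 1: map every decimal-digit
    character in an already-decoded path to its ASCII equivalent."""
    out = []
    for ch in path:
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


print(parse_ascii_int("123"))                            # 123
print(canonicalize_digits("/pizza/\uff11\uff12\uff13"))  # /pizza/123
```

 A middleware could compare the requested path with its canonicalized form
 and redirect or return a 404 when they differ. Note that the strict parser
 also rejects surrounding whitespace and underscores such as "1_000", both
 of which plain `int` accepts.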

--
Comment (by Morgan Wahl):

 (Fix typo in description.)
-- 
Ticket URL: <https://code.djangoproject.com/ticket/36483#comment:1>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
