#28628: Audit for and abolish all use of '\d' in regexes
-------------------------------------+-------------------------------------
     Reporter:  James Bennett        |                    Owner:  Ad
         Type:                       |  Timmering
  Cleanup/optimization               |                   Status:  assigned
    Component:  Core (Other)         |                  Version:  dev
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:  Ready for
                                     |  checkin
    Has patch:  1                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Description changed by Mariusz Felisiak:

Old description:

> Now that we're in the 2.0 release cycle and Python 3 only, any examples
> or code in Django using a `\d` in a regex should be replaced with
> `[0-9]`, as `\d` on Python 3 matches any character with Unicode category
> [Nd], which is almost certainly not what people expect. Changing to
> explicit `[0-9]`, perhaps with a note about why it should be preferred,
> would be better.
>
> Inventory of current uses & recommendations (with recommendation &
> updating table reflecting feedback):
>
> ||= **Module** =||= **Description** =||
> || django.contrib.admin.options || In internal function
> `_get_edited_object_pks`. \\Obtains PKs for edited objects through the
> ModelAdmin. POST should be machine-generated (through the admin
> interface) and not based on human input, so typically will be ASCII
> input. \\Changing to more restrictive regex unlikely to hurt any users,
> but doesn't add value either. ||
> || django.contrib.gis.geometry || Used to process GIS geometry
> data.\\Changing to more restrictive regex unlikely to hurt any users, but
> doesn't add value either. ||
> || django.contrib.gis.db.backends.postgis.operations || Used to obtain
> individual decimal components from the gettext version, eg. 1.2.3.
> \\Changing to a more restrictive regex is unlikely to hurt any users, but
> doesn't add value either. ||
> || django.contrib.gis.gdal.libgdal || Used to obtain individual decimal
> components from the gettext version, eg. 1.2.3.\\\\Changing to a more
> restrictive regex is unlikely to hurt any users, but doesn't add value
> either. ||
> || django.contrib.gis.geos.geometry || Used to process GIS geometry EWKT
> data. Resulting string is cast with `int()`.\\\\Changing to more
> restrictive regex unlikely to hurt any users, but doesn't add value
> either. ||
> || django.contrib.humanize.templatetags.humanize || Used in intcomma
> filter, that formats ints like 3000 as 3,000. \\\\Tested in
> `tests.humanize_tests.tests.HumanizeTests.test_intcomma`.\\\\Using a more
> restrictive regex could possibly hurt users. Instead added tests to
> ensure this works with non-ASCII unicode. ||
> || django.core.validators || (1) Used in URLValidator for `ipv4_re` port
> component in regex. Being more restrictive to capture valid IPv4 and
> ports seems useful, and not likely to be a breaking change. **Change** to
> `[0-9]` for consistency with other parts of the regex.
> || django.core.validators ||  (2) Used in `integer_validator` and
> `validate_integer`. Any number captured by d can be cast to an int and
> therefore no value is added (and likely users might be hurt) by
> restricting this to [0-9]). **Do not change**. ||
> || django.core.validators || (3) Used in `int_list_validator`, which does
> not use `int()` → **Change** to `[0-9]`. ||
> || django.core.management.commands.makemessages || Used to obtain
> individual decimal components from the gettext version, eg.
> 1.2.3.\\Changing to a more restrictive regex is unlikely to hurt any
> users, but doesn't add value either. ||
> || django.core.management.commands.runserver || Used to determine valid
> IPv4 addresses and valid port numbers. \\WSGI server seems to work fine
> even if for example 127.0.0.1 (incl a non-ASCII 0) is provided, and it
> seems the integers are properly cast -- so there would be little
> advantage to users to change this to a more restrictive regex. ||
> || django.db.backends.mysql.base || Used to obtain individual decimal
> components from a version, eg. 1.2.3. Changing to a more restrictive
> regex is unlikely to hurt any users, but doesn't add value either. ||
> || django.db.backends.sqlite3.introspection || Extracts the size number
> from a `varchar(11)` type name. \\\\Changing to a more restrictive regex
> is unlikely to hurt any users, but doesn't add value either. ||
> || django.db.migrations.autodetector || Used to extract the number from a
> migration filename. Unlikely to contain non-ASCII unless intentionally
> changed by a dev.\\Changing to a more restrictive regex is unlikely to
> hurt any users, but doesn't add value either. ||
> || django.db.migrations.writer || Used to sort imports in migrations,
> tested in tests.migrations.test_writer, see test_sorted_imports. Unclear
> to me what the decimal portion of the regex represents, however I don't
> see value in restricting to more restrictive regex. ||
> || django.forms.widgets || Used by SelectDateWidget to match a "YYYY-MM-
> DD" date pattern. When matched integers are cast with `int(val)` and
> therefore correctly processed if input was in non-ASCII Unicode decimals.
> \\\\Changing to a more restrictive regex does not seem to be of much
> value, because values that don't much `date_re` are passed to the
> `datetime.datetime.strptime()` and will be accepted anyway. ||
> || django.http.request || Used in split_domain_port to split a domain
> name and port ("www.example.com:8080"). \\The port number should be cast
> to an int (in which case, no problem!) - however in reality isn't
> currently being used in any code path, and both domain and port are
> returned as str. **Changed** to match other components of the regex. ||
> || django.template.base || Used in FilterExpression regex to match
> filters in templates. Arguments are passed on to individual filters.
> Where an int is expected, filters will cast using `int(x)` regardless so
> would process correctly. Using a more restrictive regex could possibly
> hurt users inadvertently using non-ASCII decimals in filter parameters.
> **Do not change**. ||
> || django.template.defaultfilters || Used in default title filter to
> change a letter after a decimal back back to lower case. Eg. `53RD ⇒ 53Rd
> ⇒ 53rd`. Changing this could have unexpected results for users, and no
> value is gained. **Do not change**. ||
> || django.test.client || Parses `charset` within the `Content-Type`
> header of `test.client.RequestFactory`. Strictly spoken the HTTP spec
> requires ASCII characters. **Change** for consistency as it reflects HTTP
> specs, however does not add much value. ||
> || django.utils.dateparse || Used for parsing date and time fields, in
> parsing string representations in the to_python implementations on
> DateField and DateTimeField in db.models. Decimals are extracted with
> regex, and downcast with `int(v)`. In normal usage these are unlikely to
> contain non-ASCII numbers, but if using legacy DBs it is quite possible
> data might have non-ASCII versions of numbers as well. Using a more
> restrictive regex could possibly hurt users. **Do not change**.
> || django.utils.http || Used in data parsing according to RFC1123, RFC850
> and asctime. Requires ASCII decimals as defined in RFC822:
> https://datatracker.ietf.org/doc/html/rfc822#appendix-D with "DIGIT =
> <any ASCII decimal digit> ; ( 60- 71, 48.- 57.)". The additional value of
> a more restrictive regex is not 100% evident, but it doesn't hurt to
> change the implementation to be closer to the specifications. ||
> || django.utils.version || Used to obtain individual decimal components
> from a version, eg. 1.2.3. Changing to a more restrictive regex is
> unlikely to hurt any users, but doesn't add value either. ||
> || django.utils.translation.trans_real || Used in regex to parse HTTP
> Accept-Language header. HTTP requires ASCII decimals. Example: `Accept-
> Language: pt,en-US;q=0.8,en;q=0.6,ru;q=0.4`. **Change** to more
> restrictive regex to match specs in
> https://datatracker.ietf.org/doc/html/rfc2616#section-3.9 . ||
> || django.views.i18n || `JavaScriptCatalog._num_plurals()`. Used to parse
> nplurals setting (see
> [https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html
> doc]). Regex match is cast with `int(x)`.  Intent is likely an ASCII
> decimal number ("nplurals value must be a decimal number which …"),
> however restricting the regex does not add any value. **Do not change**.
> ||

New description:

 Now that we're in the 2.0 release cycle and Python 3 only, any examples or
 code in Django using a `\d` in a regex should be replaced with `[0-9]`, as
 `\d` on Python 3 matches any character with Unicode category [Nd], which
 is almost certainly not what people expect. Changing to explicit `[0-9]`,
 perhaps with a note about why it should be preferred, would be better.

 Inventory of current uses & recommendations (with recommendation &
 updating table reflecting feedback):

 ||= **Module** =||= **Description** =||
 || django.contrib.admin.options || In internal function
 `_get_edited_object_pks`. \\Obtains PKs for edited objects through the
 ModelAdmin. POST should be machine-generated (through the admin interface)
 and not based on human input, so typically will be ASCII input. \\Changing
 to more restrictive regex unlikely to hurt any users, but doesn't add
 value either. ||
 || django.contrib.gis.geometry || Used to process GIS geometry
 data.\\Changing to more restrictive regex unlikely to hurt any users, but
 doesn't add value either. ||
 || django.contrib.gis.db.backends.postgis.operations || Used to obtain
 individual decimal components from the gettext version, eg. 1.2.3.
 \\Changing to a more restrictive regex is unlikely to hurt any users, but
 doesn't add value either. ||
 || django.contrib.gis.gdal.libgdal || Used to obtain individual decimal
 components from the gettext version, eg. 1.2.3.\\\\Changing to a more
 restrictive regex is unlikely to hurt any users, but doesn't add value
 either. ||
 || django.contrib.gis.geos.geometry || Used to process GIS geometry EWKT
 data. Resulting string is cast with `int()`.\\\\Changing to more
 restrictive regex unlikely to hurt any users, but doesn't add value
 either. ||
 || django.contrib.humanize.templatetags.humanize || Used in intcomma
 filter, that formats ints like 3000 as 3,000. \\\\Tested in
 `tests.humanize_tests.tests.HumanizeTests.test_intcomma`.\\\\Using a more
 restrictive regex could possibly hurt users. Instead added tests to ensure
 this works with non-ASCII unicode. ||
 || django.core.validators || (1) Used in URLValidator for `ipv4_re` port
 component in regex. Being more restrictive to capture valid IPv4 and ports
 seems useful, and not likely to be a breaking change. **Change** to
 `[0-9]` for consistency with other parts of the regex.
 || django.core.validators ||  (2) Used in `integer_validator`,
 `validate_integer`, and `int_list_validator`. Any number captured by `\d`
 can be cast to an int and therefore no value is added (and likely users
 might be hurt) by restricting this to [0-9]). **Do not change**. ||
 || django.core.management.commands.makemessages || Used to obtain
 individual decimal components from the gettext version, eg.
 1.2.3.\\Changing to a more restrictive regex is unlikely to hurt any
 users, but doesn't add value either. ||
 || django.core.management.commands.runserver || Used to determine valid
 IPv4 addresses and valid port numbers. \\WSGI server seems to work fine
 even if for example 127.0.0.1 (incl a non-ASCII 0) is provided, and it
 seems the integers are properly cast -- so there would be little advantage
 to users to change this to a more restrictive regex. ||
 || django.db.backends.mysql.base || Used to obtain individual decimal
 components from a version, eg. 1.2.3. Changing to a more restrictive regex
 is unlikely to hurt any users, but doesn't add value either. ||
 || django.db.backends.sqlite3.introspection || Extracts the size number
 from a `varchar(11)` type name. \\\\Changing to a more restrictive regex
 is unlikely to hurt any users, but doesn't add value either. ||
 || django.db.migrations.autodetector || Used to extract the number from a
 migration filename. Unlikely to contain non-ASCII unless intentionally
 changed by a dev.\\Changing to a more restrictive regex is unlikely to
 hurt any users, but doesn't add value either. ||
 || django.db.migrations.writer || Used to sort imports in migrations,
 tested in tests.migrations.test_writer, see test_sorted_imports. Unclear
 to me what the decimal portion of the regex represents, however I don't
 see value in restricting to more restrictive regex. ||
 || django.forms.widgets || Used by SelectDateWidget to match a "YYYY-MM-
 DD" date pattern. When matched integers are cast with `int(val)` and
 therefore correctly processed if input was in non-ASCII Unicode decimals.
 \\\\Changing to a more restrictive regex does not seem to be of much
 value, because values that don't much `date_re` are passed to the
 `datetime.datetime.strptime()` and will be accepted anyway. ||
 || django.http.request || Used in split_domain_port to split a domain name
 and port ("www.example.com:8080"). \\The port number should be cast to an
 int (in which case, no problem!) - however in reality isn't currently
 being used in any code path, and both domain and port are returned as str.
 **Changed** to match other components of the regex. ||
 || django.template.base || Used in FilterExpression regex to match filters
 in templates. Arguments are passed on to individual filters. Where an int
 is expected, filters will cast using `int(x)` regardless so would process
 correctly. Using a more restrictive regex could possibly hurt users
 inadvertently using non-ASCII decimals in filter parameters. **Do not
 change**. ||
 || django.template.defaultfilters || Used in default title filter to
 change a letter after a decimal back back to lower case. Eg. `53RD ⇒ 53Rd
 ⇒ 53rd`. Changing this could have unexpected results for users, and no
 value is gained. **Do not change**. ||
 || django.test.client || Parses `charset` within the `Content-Type` header
 of `test.client.RequestFactory`. Strictly spoken the HTTP spec requires
 ASCII characters. **Change** for consistency as it reflects HTTP specs,
 however does not add much value. ||
 || django.utils.dateparse || Used for parsing date and time fields, in
 parsing string representations in the to_python implementations on
 DateField and DateTimeField in db.models. Decimals are extracted with
 regex, and downcast with `int(v)`. In normal usage these are unlikely to
 contain non-ASCII numbers, but if using legacy DBs it is quite possible
 data might have non-ASCII versions of numbers as well. Using a more
 restrictive regex could possibly hurt users. **Do not change**.
 || django.utils.http || Used in data parsing according to RFC1123, RFC850
 and asctime. Requires ASCII decimals as defined in RFC822:
 https://datatracker.ietf.org/doc/html/rfc822#appendix-D with "DIGIT = <any
 ASCII decimal digit> ; ( 60- 71, 48.- 57.)". The additional value of a
 more restrictive regex is not 100% evident, but it doesn't hurt to change
 the implementation to be closer to the specifications. ||
 || django.utils.version || Used to obtain individual decimal components
 from a version, eg. 1.2.3. Changing to a more restrictive regex is
 unlikely to hurt any users, but doesn't add value either. ||
 || django.utils.translation.trans_real || Used in regex to parse HTTP
 Accept-Language header. HTTP requires ASCII decimals. Example: `Accept-
 Language: pt,en-US;q=0.8,en;q=0.6,ru;q=0.4`. **Change** to more
 restrictive regex to match specs in
 https://datatracker.ietf.org/doc/html/rfc2616#section-3.9 . ||
 || django.views.i18n || `JavaScriptCatalog._num_plurals()`. Used to parse
 nplurals setting (see
 [https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html
 doc]). Regex match is cast with `int(x)`.  Intent is likely an ASCII
 decimal number ("nplurals value must be a decimal number which …"),
 however restricting the regex does not add any value. **Do not change**.
 ||

--

-- 
Ticket URL: <https://code.djangoproject.com/ticket/28628#comment:25>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/069.71e2f4afa457932fbc4d4b29ac8c3da3%40djangoproject.com.

Reply via email to