https://bugs.kde.org/show_bug.cgi?id=477533

--- Comment #3 from Tobias Leupold <t...@stonemx.de> ---
Meanwhile, I know what's going on here, and I also think I know how to fix this
:-)

When the "compressed" file format is used, category names are used as XML
attribute names. To make that possible, they are escaped. Our current escaping
algorithm can produce invalid XML attribute names, depending on the input:
among other flaws, it allows a digit to be the first character of the escaped
output.

This violates the XML spec (cf. https://www.w3.org/TR/xml/ ), which states that
the first character of an XML attribute name must be a NameStartChar. In the
ASCII range, that is "a-z", "A-Z", ":" or "_". Digits are allowed later in the
attribute name, but not as the first character.
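
To illustrate the rule, here is a minimal sketch of a name check. It only covers
the ASCII subset of NameStartChar and NameChar (the spec also allows many
non-ASCII ranges), and the function name is made up for this example; it is not
code from KPhotoAlbum.

```python
import string

# ASCII subset of NameStartChar / NameChar from the XML 1.0 spec.
# The full productions also admit many non-ASCII ranges, omitted here.
ASCII_NAME_START = set(string.ascii_letters) | {":", "_"}
ASCII_NAME_CHARS = ASCII_NAME_START | set(string.digits) | {"-", "."}


def is_valid_ascii_attribute_name(name: str) -> bool:
    """Return True if name is a well-formed XML name using ASCII only."""
    if not name or name[0] not in ASCII_NAME_START:
        return False
    return all(ch in ASCII_NAME_CHARS for ch in name[1:])


print(is_valid_ascii_attribute_name("year_2023"))  # True
print(is_valid_ascii_attribute_name("2023"))       # False
```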

When writing the XML file, the non-compliant attribute name is written
nevertheless. When re-opening the database later, the data can't be read
anymore though, because the parser finds a digit where it expects either the
end of the tag ("/>" or ">") or a new attribute (a NameStartChar), cf. the
posted error message: "Expected '>' or '/', but got '[0-9]'". It thus fails
on the invalid XML.
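
The same kind of failure can be reproduced with any conforming parser. The
following sketch uses Python's standard xml.etree.ElementTree rather than the
Qt parser KPhotoAlbum uses, so the error text will differ; the element and
attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Same element twice; only the first character of the second attribute differs.
valid_xml = '<image file="a.jpg" _2023="1"/>'
invalid_xml = '<image file="a.jpg" 2023="1"/>'


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(valid_xml))    # True
print(parses(invalid_xml))  # False
```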

Just as a side note: The algorithm also can't escape non-Latin-1 characters
correctly (they become "?"). We also have problems with category names
containing spaces and underscores when using the "readable" format: they
aren't unescaped to what they initially were (all underscores are replaced by
spaces, so the original underscores are lost on the next reading).
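
Both effects are easy to show in isolation. The two helper functions below are
a deliberately simplified assumption about what a space/underscore mapping of
this kind looks like; they are not the actual KPhotoAlbum code.

```python
# Characters outside Latin-1 are replaced by "?" when forced into Latin-1.
lossy_bytes = "日本".encode("latin-1", errors="replace")


# Hypothetical, simplified write/read pair for the "readable" format.
def write_readable(category: str) -> str:
    return category.replace(" ", "_")


def read_readable(name: str) -> str:
    return name.replace("_", " ")


original = "my_holiday 2023"
round_tripped = read_readable(write_readable(original))

print(lossy_bytes)
print(round_tripped == original)  # False: the original underscore is gone
```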

The only way to fix the root cause is to implement a new algorithm for escaping
category names used as XML attribute names, one that respects the XML spec.

My proposal for a compliant implementation can be found at
https://invent.kde.org/graphics/kphotoalbum/-/tree/safe_xml_escaping?ref_type=heads
– I use a modified URL-style percent encoding using QByteArray's integrated
functionality. With this approach, not only is the leading-digit issue fixed,
but one can also use the whole Unicode range in a category name. Also, the
spaces and underscores issue is gone for the "readable" format.
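
To give an idea of what such a scheme can look like, here is a rough sketch in
Python. It is not the code from the branch: the choice of "_" as the escape
character, the set of unescaped characters, and the function names are all
assumptions made for this example. The idea is to percent-encode the UTF-8
bytes, use "_" instead of "%" (which is not allowed in XML names), and also
escape a leading digit.

```python
import string

# Assumption for this sketch: only ASCII letters and digits stay unescaped,
# and "_" plays the role "%" has in URL percent encoding.
SAFE = set((string.ascii_letters + string.digits).encode("ascii"))
DIGITS = set(string.digits.encode("ascii"))
ESCAPE = "_"


def escape_attribute_name(category: str) -> str:
    """Encode a non-empty category name as an ASCII-only XML attribute name."""
    out = []
    for index, byte in enumerate(category.encode("utf-8")):
        leading_digit = index == 0 and byte in DIGITS
        if byte in SAFE and not leading_digit:
            out.append(chr(byte))
        else:
            # Escaped bytes start with "_", which is a valid NameStartChar.
            out.append(f"{ESCAPE}{byte:02X}")
    return "".join(out)


def unescape_attribute_name(name: str) -> str:
    """Reverse escape_attribute_name()."""
    raw = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESCAPE:
            raw.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(name[i]))
            i += 1
    return raw.decode("utf-8")


print(escape_attribute_name("2023 trips"))  # _32023_20trips
print(escape_attribute_name("my_cat"))      # my_5Fcat
```

Because the underscore itself is escaped, spaces and underscores stay
distinguishable, and any Unicode string survives the round trip. An empty
category name would still need separate handling.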

Needs testing though.
