All,

TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
ideally) for our XML validation needs. It is more expressive and more
readable.

Most people who know anything about XML stuff know that DTDs are not
that great a solution for validation. Their expressive power is very
limited; there are a few examples of this in our metadata.dtd [1].
For a few years now, I've wanted to see if we could replace
metadata.dtd with something in RELAX NG, which is a more modern XML
schema language; it's an ISO standard with an emphasis on readability
both for humans and for tools (by using a rigorous formalism). Some
arguments in favor of RELAX NG (and some counter-arguments) are
enumerated on Tim Bray's weblog [2]. I've created a compact syntax
schema for metadata that can validate all metadata.xml files currently
in the tree, as an example [3].

Some arguments against:

- Not enough tool support for RELAX NG: I'd be curious to hear what
tools you want to use. At least libxml2 supports RELAX NG natively.
The Python lxml library uses that support to provide pretty simple
RELAX NG validation. libxml2 does not have native compact syntax
support, but I maintain a simple library called rnc2rng [4] that is
used transparently by lxml if installed. rnc2rng also comes with a
rnc2rng command-line script to do the conversion.

- Performance: in a quick test with lxml (backed by libxml2), RELAX NG
validation takes about as long as DTD validation (roughly 7% more
wall-clock time). Testing with ~19000 metadata.xml files in the tree,
with DTD (best of 3):

real    0m2.861s
user    0m2.560s
sys    0m0.296s

With RNC (best of 3):

real    0m3.058s
user    0m2.688s
sys    0m0.364s
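
For anyone who wants to try this, here is a minimal sketch of what such
a check could look like with lxml. It is not the script behind the
numbers above. The toy schema (elements "pkg" and "note") and the helper
names load_schema, is_valid and best_of are made up for illustration.
The schema is written in RELAX NG XML syntax so the sketch does not
depend on rnc2rng being installed.

```python
import time
from io import BytesIO

from lxml import etree

# Hypothetical toy schema in RELAX NG XML syntax. This is NOT the real
# metadata schema; it only allows <pkg> with zero or more <note> children.
RNG = b"""\
<element name="pkg" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="note"><text/></element>
  </zeroOrMore>
</element>
"""


def load_schema(rng_bytes):
    """Compile a RELAX NG schema (XML syntax) into an lxml validator."""
    return etree.RelaxNG(etree.parse(BytesIO(rng_bytes)))


def is_valid(schema, xml_bytes):
    """Parse one document and report whether it matches the schema."""
    return schema.validate(etree.fromstring(xml_bytes))


def best_of(n, func):
    """Run func n times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    schema = load_schema(RNG)
    docs = [b"<pkg><note>hello</note></pkg>", b"<pkg><bogus/></pkg>"]
    for doc in docs:
        print(doc, is_valid(schema, doc))
    elapsed = best_of(3, lambda: [is_valid(schema, d) for d in docs])
    print("best of 3: %.6fs" % elapsed)
```

To compare against a DTD, the same is_valid helper should work with an
etree.DTD object in place of the RelaxNG one, since both expose a
validate() method.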

We could probably easily maintain an XML Schema shadow schema if
that's really desired, but I would be in favor of making RELAX NG our
main schema language. I can easily do the work to update repoman for
this (I've already refactored the metadata code in repoman). What
other stuff would need to be updated?

Comments?

Cheers,

Dirkjan

[1] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.dtd
[2] https://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
[3] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.rnc
[4] https://github.com/djc/rnc2rng
