All,

TL;DR: I think we should switch from DTD to RELAX NG (ideally in the compact syntax) for our XML validation needs. It is more expressive and more readable.

Most people who know anything about XML know that DTDs are not a great solution for validation. Their expressive power is very limited; there are a few examples of this in our metadata.dtd [1]. For a few years now, I've wanted to see if we could replace metadata.dtd with a RELAX NG schema. RELAX NG is a more modern XML schema language; it's an ISO standard with an emphasis on readability for both humans and tools (by using a rigorous formalism). Some arguments in favor of RELAX NG (and some counter-arguments) are enumerated on Tim Bray's weblog [2].

As an example, I've created a compact syntax schema for metadata that can validate all metadata.xml files currently in the tree [3].

Some arguments against:

- Not enough tool support for RELAX NG: I'd be curious to hear what tools you want to use. At least libxml2 supports RELAX NG natively. The Python lxml library uses that support to provide pretty simple RELAX NG validation. libxml2 does not have native compact syntax support, but I maintain a simple library called rnc2rng [4] that is used transparently by lxml if installed. rnc2rng also comes with an rnc2rng command-line script to do the conversion.

- Performance: in a quick test with lxml (backed by libxml2), RELAX NG validation takes about as long as DTD validation (roughly 7% more wall-clock time in this run). Testing with ~19000 metadata.xml files in the tree:

  With DTD (best of 3):

  real 0m2.861s
  user 0m2.560s
  sys 0m0.296s

  With RNC (best of 3):

  real 0m3.058s
  user 0m2.688s
  sys 0m0.364s

We could probably maintain a shadow schema in XML Schema fairly easily if that's really desired, but I would be in favor of making RELAX NG our main schema language.

I can easily do the work to update repoman for this (I've already refactored the metadata code in repoman). What else would need to be updated?

Comments?

Cheers,

Dirkjan

[1] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.dtd
[2] https://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
[3] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.rnc
[4] https://github.com/djc/rnc2rng
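
P.S. To give an idea of what "pretty simple RELAX NG validation" with lxml looks like, here is a minimal sketch. The toy schema below is a made-up stand-in, not the actual metadata schema from [3], and it is written in the XML syntax so that the example does not depend on rnc2rng being installed. The helper names (make_validator, is_valid) are mine, purely for illustration.

```python
from lxml import etree

# Hypothetical toy schema in RELAX NG XML syntax. This is NOT the real
# metadata schema; it only requires that every <maintainer> has an <email>.
TOY_SCHEMA = b"""\
<element name="pkgmetadata" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="maintainer">
      <element name="email"><text/></element>
    </element>
  </zeroOrMore>
</element>
"""


def make_validator(schema_bytes):
    """Parse a RELAX NG schema (XML syntax) and build a reusable validator."""
    return etree.RelaxNG(etree.fromstring(schema_bytes))


def is_valid(validator, xml_bytes):
    """Return True if the document conforms to the schema, False otherwise."""
    return validator.validate(etree.fromstring(xml_bytes))


validator = make_validator(TOY_SCHEMA)

good = (b"<pkgmetadata><maintainer>"
        b"<email>dev@example.org</email>"
        b"</maintainer></pkgmetadata>")
bad = b"<pkgmetadata><maintainer/></pkgmetadata>"  # <email> is missing

for name, doc in (("good", good), ("bad", bad)):
    print(name, is_valid(validator, doc))

# After a failed validation, validator.error_log holds the details.
print(validator.error_log)
```

Building the validator once and reusing it for every file is what I would expect a tool like repoman to do, since the schema only needs to be parsed a single time.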