Hi all,

I finally pushed an initial draft of the Commons XML Factory project I
proposed back in December [1]:

https://github.com/copernik-eu/commons-xml-factory

The library is a single `XmlFactories` class with factory methods that
return hardened JAXP factories for:

- DocumentBuilderFactory
- SAXParserFactory
- XMLInputFactory
- TransformerFactory
- SchemaFactory
- XPathFactory

Internally, each factory method dispatches to a per-implementation
`XmlProvider` that applies the correct hardening for that
implementation. The SPI is open via `ServiceLoader`, but providers for
the JDK, Xerces, Woodstox and Saxon are bundled.
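
To illustrate the dispatch idea, here is a minimal sketch, not the draft's actual code. The names `HardeningDispatchSketch`, `HardeningProvider` and `XercesLikeProvider` are hypothetical, and detecting the implementation by package prefix is an assumption made for the example. It assumes classpath (unnamed module) deployment, so no `uses` declaration is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

final class HardeningDispatchSketch {

    static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    /** Hypothetical SPI: one provider per JAXP implementation. */
    interface HardeningProvider {
        boolean supports(Class<?> factoryClass);

        void harden(DocumentBuilderFactory factory) throws ParserConfigurationException;
    }

    /** Hypothetical bundled provider for the JDK parser and external Xerces. */
    static final class XercesLikeProvider implements HardeningProvider {
        @Override
        public boolean supports(Class<?> factoryClass) {
            String name = factoryClass.getName();
            return name.startsWith("com.sun.org.apache.xerces.internal.")
                    || name.startsWith("org.apache.xerces.");
        }

        @Override
        public void harden(DocumentBuilderFactory factory) throws ParserConfigurationException {
            factory.setFeature(DISALLOW_DOCTYPE, true);
        }
    }

    static DocumentBuilderFactory newDocumentBuilderFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        List<HardeningProvider> providers = new ArrayList<>();
        providers.add(new XercesLikeProvider()); // bundled provider
        // Third-party providers registered under META-INF/services, if any.
        ServiceLoader.load(HardeningProvider.class).forEach(providers::add);
        for (HardeningProvider provider : providers) {
            if (provider.supports(factory.getClass())) {
                try {
                    provider.harden(factory);
                } catch (ParserConfigurationException e) {
                    throw new IllegalStateException("Hardening failed", e);
                }
                return factory;
            }
        }
        // Fail fast rather than return a factory of unknown safety.
        throw new IllegalStateException(
                "No hardening provider for " + factory.getClass().getName());
    }

    /** Helper for callers that want to check the result without checked exceptions. */
    static boolean isDoctypeDisallowed(DocumentBuilderFactory factory) {
        try {
            return factory.getFeature(DISALLOW_DOCTYPE);
        } catch (ParserConfigurationException e) {
            return false;
        }
    }
}
```

The point of the sketch is the last branch: an unrecognized implementation produces an exception, never a silently unhardened factory.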

It's fair to ask whether this is worth a library at all: a per-factory
hardening recipe is only a handful of lines, and most projects wrote
their own years ago. Two observations:

First, that handful of lines is exactly what people forget or get
subtly wrong. The 2025 Java XXE CVEs bear this out: Apache Tika
(CVE-2025-54988, CVE-2025-66516), WebDriverManager (CVE-2025-4641),
CycloneDX (CVE-2025-64518), GeoServer (CVE-2025-58360).

Second, the correct recipe depends on which JAXP implementation is
actually on the classpath, and that's often not what the developer
thinks. A library author tests against the JDK, observes that
FEATURE_SECURE_PROCESSING transitively restricts ACCESS_EXTERNAL_*
(JEP 185), and writes a minimal hardening block. The library is then
deployed in an application that pulls in external Xerces transitively:
JEP 185 no longer applies, ACCESS_EXTERNAL_* is not honored, and the
minimal block is no longer sufficient.
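
As a rough illustration of that scenario, here is a sketch of a hardening block that does not depend on JEP 185 behavior. It assumes a Xerces-derived parser (the JDK's built-in one or external Xerces), both of which recognize the `disallow-doctype-decl` feature. The class and method names are hypothetical and this is not the draft's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

final class NoDoctypeExample {

    // Xerces feature name; an assumption that the parser in use is Xerces-derived.
    static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    static DocumentBuilderFactory newHardenedFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // This line alone is the "minimal block" from the scenario above.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Explicit rejection of any DOCTYPE. An implementation that does not
        // recognize the feature throws ParserConfigurationException here,
        // which gives fail-fast behavior instead of a silent no-op.
        factory.setFeature(DISALLOW_DOCTYPE, true);
        factory.setExpandEntityReferences(false);
        return factory;
    }

    /** Returns true if the hardened parser refuses the document. */
    static boolean rejects(String xml) {
        try {
            newHardenedFactory()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return false;
        } catch (SAXException e) {
            return true;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this block, `rejects(...)` is expected to return true for any document that carries a DOCTYPE, whether or not JEP 185 applies to the parser in use.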

The draft intentionally offers no configuration: it hardens at one
level and fails fast if it encounters an implementation it doesn't
recognize. Before extending it, I'd like feedback on whether the
proposed direction makes sense.

I see three plausible hardening levels worth supporting:

1. No DOCTYPE allowed. Eliminates the entire class of DTD-based
   attacks. This is what the draft implements.

2. DOCTYPE allowed, no external resources loaded. Internal entities
   work (for users who need HTML-style named entities, for example),
   entity expansion limits are enforced, but nothing is fetched from
   outside the document.

3. DOCTYPE allowed, user-supplied resolver. The caller provides an
   EntityResolver; we wrap it so that if the resolver returns null for
   an unknown reference, we throw rather than falling through to the
   parser's default URL-fetching behavior. This closes SAX's most
   common footgun while letting integrators implement classpath-scoped
   loading, XML catalogs, and similar.
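
A minimal sketch of the level 3 wrapper, assuming the standard SAX `EntityResolver` contract, under which a null return tells the parser to open the system identifier itself. `StrictEntityResolver` and the `parses` helper are hypothetical names, not part of the draft.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class StrictEntityResolver implements EntityResolver {

    private final EntityResolver delegate;

    StrictEntityResolver(EntityResolver delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        InputSource source = delegate.resolveEntity(publicId, systemId);
        if (source == null) {
            // A null return would let the parser fetch systemId on its own.
            throw new SAXException("Unresolved external reference: publicId="
                    + publicId + ", systemId=" + systemId);
        }
        return source;
    }

    /** Returns true if the document parses with the wrapped user resolver. */
    static boolean parses(String xml, EntityResolver userResolver) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(new StrictEntityResolver(userResolver));
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would pass a resolver that serves known identifiers from the classpath or an XML catalog. Anything the resolver does not recognize then fails the parse instead of triggering a network or file fetch.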

The draft also addresses the secondary-source problem for
TransformerFactory (stylesheet loading) and SchemaFactory (schema
imports). Currently both are locked down as tightly as primary input,
but this is probably a place where two distinct levels make sense:
users often have trusted stylesheets or schemas they want to load via
xsl:import or xs:include, separate from the question of what to allow
in the document being transformed or validated.
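
To make the separate-axis idea concrete, here is one possible shape for the secondary-source side: a `URIResolver` that serves only an explicit set of trusted stylesheets and rejects everything else. This is a sketch under assumptions, not the draft's behavior. `TrustedStylesheetResolver`, the in-memory map, and the `trusted:` system ID prefix are all hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

final class TrustedStylesheetResolver implements URIResolver {

    // href as written in xsl:import / xsl:include -> stylesheet text
    private final Map<String, String> trusted;

    TrustedStylesheetResolver(Map<String, String> trusted) {
        this.trusted = new HashMap<>(trusted);
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        String content = trusted.get(href);
        if (content == null) {
            // Throw instead of returning null, which would let the
            // processor try to resolve the URI itself.
            throw new TransformerException("Untrusted stylesheet reference: " + href);
        }
        // Hypothetical system ID, set so the processor has an identifier for the source.
        return new StreamSource(new StringReader(content), "trusted:" + href);
    }

    /** Factory whose stylesheet imports and includes go through the allow-list. */
    static TransformerFactory newFactory(Map<String, String> trusted) {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setURIResolver(new TrustedStylesheetResolver(trusted));
        return factory;
    }
}
```

The hardening applied to the document being transformed is a separate decision. This resolver only governs what stylesheets may pull in.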

Two things I'd particularly appreciate feedback on:

- Does the three-level model above cover the use cases you'd want to
  bring to this library?

- For the secondary-source question, is there appetite for a separate
  axis, or should primary and secondary be tied together under a
  single level?

Piotr

[1] https://lists.apache.org/thread/b2tjc15vjkgsrxxkc8phlnt6801hx4xz
