Source: apache-opennlp
Version: 2.5.3-1
Severity: important
Tags: security upstream
X-Debbugs-Cc: [email protected], Debian Security Team <[email protected]>

Hi,

The following vulnerabilities were published for apache-opennlp.

CVE-2026-40682[0]:
| XML External Entity (XXE) via Unsanitized Dictionary Parsing in
| Apache OpenNLP DictionaryEntryPersistor
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The DictionaryEntryPersistor class initializes a static
| SAXParserFactory at class-load time without enabling
| FEATURE_SECURE_PROCESSING or disabling DTD processing. When
| create(InputStream, EntryInserter) is invoked, the only feature set
| on the XMLReader is namespace support; external entity resolution
| and DOCTYPE declarations remain fully enabled. An attacker who can
| supply a crafted dictionary file (e.g., a stop-word list or domain
| dictionary) containing a malicious DOCTYPE declaration can trigger
| local file disclosure via file:// entity references or server-side
| request forgery via http:// entity references during SAX parsing,
| before the application processes a single dictionary entry. This is
| inconsistent with the project's own XmlUtil.createSaxParser()
| helper, which correctly sets FEATURE_SECURE_PROCESSING and
| disallow-doctype-decl and is used by all other XML parsing paths in
| the codebase. The public Dictionary(InputStream) constructor
| delegates directly to this method and is the documented API for
| loading user-supplied dictionaries, making untrusted input a
| realistic scenario.
|
| Mitigation: 2.x users should upgrade to 2.5.9. 3.x users should
| upgrade to 3.0.0-M3. Users who cannot upgrade immediately should
| ensure that all dictionary files are sourced from trusted origins
| and should consider wrapping the Dictionary(InputStream) constructor
| with input validation that rejects any XML containing a DOCTYPE
| declaration before it reaches the parser.
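
For anyone who needs the interim workaround before a fixed package is
available, the DOCTYPE rejection suggested in the mitigation might look
roughly like the sketch below. It is not taken from the upstream fix:
the class name DoctypeGuard and the method rejectDoctype are
hypothetical, and the approach (buffer the input, pre-parse it with a
hardened JDK SAX parser, then hand back a fresh stream) is only one way
to do it. It assumes the JDK's built-in parser, which recognizes the
disallow-doctype-decl feature, and Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of OpenNLP.
final class DoctypeGuard {

    private static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    // Buffers the input, parses it once with DOCTYPE declarations
    // forbidden, and returns a fresh stream over the same bytes.
    // Throws IOException if the document is malformed or has a DOCTYPE.
    static InputStream rejectDoctype(InputStream in) throws IOException {
        byte[] data = in.readAllBytes();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature(DISALLOW_DOCTYPE, true);
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler rethrows fatal errors, including the one
            // raised when a DOCTYPE declaration is encountered.
            parser.parse(new ByteArrayInputStream(data), new DefaultHandler());
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("dictionary XML rejected", e);
        }
        return new ByteArrayInputStream(data);
    }
}
```

A caller would then pass the returned stream to the constructor named
in the advisory, e.g. new Dictionary(DoctypeGuard.rejectDoctype(in)).
The whole file is held in memory, so a size cap on the input may be
worth adding for large or untrusted uploads.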


CVE-2026-42027[1]:
| Arbitrary Class Instantiation via Model Manifest in Apache OpenNLP
| ExtensionLoader
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The ExtensionLoader.instantiateExtension(Class, String)
| method loads a class by its fully-qualified name via Class.forName()
| and invokes its no-arg constructor, with the class name sourced from
| the manifest.properties entry of a model archive. The existing
| isAssignableFrom check correctly rejects classes that are not
| subtypes of the expected extension interface (BaseToolFactory for
| factory=, ArtifactSerializer for serializer-class-*), but the check
| runs after Class.forName() has already loaded and initialized the
| named class.
|
| Class.forName() with default initialization semantics executes the
| target class's static initializer before returning, so an attacker
| who can supply a crafted model archive can cause the static
| initializer of any class on the classpath to run during model
| loading, regardless of whether that class passes the subsequent type
| check.
|
| Exploitation requires a class with attacker-useful side effects in
| its static initializer (for example, JNDI lookup, outbound network
| I/O, or filesystem access) to be present on the classpath, so this
| is not a drop-in remote code execution; however, the attack surface
| grows as third-party model distribution becomes more common
| (community model repositories, Hugging Face-style sharing), where
| users routinely load model files from origins they do not control.
| A secondary, narrower vector affects deployments that ship
| legitimate BaseToolFactory or ArtifactSerializer subclasses with
| side-effecting no-arg constructors: a malicious manifest can name
| such a class and force its constructor to run during model load.
|
| Mitigation:
|
|  * 2.x users should upgrade to 2.5.9.
|  * 3.x users should upgrade to 3.0.0-M3.
|
| Note: The fix introduces a package-prefix allowlist that is
| consulted before Class.forName() is invoked, so the static
| initializer of a disallowed class is never executed. Classes under
| the opennlp. prefix remain permitted by default. Deployments that
| load models referencing factories or serializers outside opennlp.*
| must opt those packages in, either programmatically via
| ExtensionLoader.registerAllowedPackage(String) before the first
| model load, or by setting the OPENNLP_EXT_ALLOWED_PACKAGES system
| property to a comma-separated list of allowed package prefixes.
|
| Users who cannot upgrade immediately should ensure that all model
| files are sourced from trusted origins and should audit their
| classpath for classes with side-effecting static initializers or
| constructors, particularly any that perform JNDI lookups, network
| requests, or filesystem operations during class initialization.
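
To illustrate why the ordering of the checks matters, here is a minimal
sketch of the pattern the note describes: consult a package-prefix
allowlist first, and only then touch the class. This is not the
upstream patch. AllowlistedLoader, allowPackage and instantiate are
hypothetical names, and loading with initialize=false (so the static
initializer is deferred until after the type check) is an extra
precaution assumed here, not something the advisory says the fix does.

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical illustration, not OpenNLP's ExtensionLoader.
final class AllowlistedLoader {

    // Package prefixes permitted by default, mirroring "opennlp.".
    private static final Set<String> ALLOWED =
            new CopyOnWriteArraySet<>(Set.of("opennlp."));

    // Opts an additional package prefix in (trailing dot is enforced
    // so that "com.acme" does not also match "com.acmeevil").
    static void allowPackage(String prefix) {
        ALLOWED.add(prefix.endsWith(".") ? prefix : prefix + ".");
    }

    static <T> T instantiate(Class<T> expected, String className)
            throws ReflectiveOperationException {
        // 1. Allowlist check on the name alone; nothing is loaded yet.
        boolean permitted = ALLOWED.stream().anyMatch(className::startsWith);
        if (!permitted) {
            throw new SecurityException("package not allowed: " + className);
        }
        // 2. Load without running the static initializer.
        Class<?> loaded = Class.forName(
                className, false, AllowlistedLoader.class.getClassLoader());
        // 3. Type check before any code of the class has executed.
        if (!expected.isAssignableFrom(loaded)) {
            throw new ClassCastException(
                    className + " is not a " + expected.getName());
        }
        // 4. Initialization and construction happen only now.
        return expected.cast(loaded.getDeclaredConstructor().newInstance());
    }
}
```

With this ordering, a manifest naming a class outside the allowed
prefixes fails at step 1 without the class ever being loaded, and a
class of the wrong type fails at step 3 without being initialized.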


CVE-2026-42440[2]:
| OOM Denial of Service via Unbounded Array Allocation in Apache
| OpenNLP AbstractModelReader
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The AbstractModelReader methods getOutcomes(),
| getOutcomePatterns(), and getPredicates() each read a 32-bit signed
| integer count field from a binary model stream and pass that value
| directly to an array allocation (new String[numOutcomes], new
| int[numOCTypes][], new String[NUM_PREDS]) without validating that
| the value is non-negative or within a reasonable bound. The count is
| therefore fully attacker-controlled when the model file originates
| from an untrusted source.
|
| A crafted .bin model file in which any of these count fields is set
| to Integer.MAX_VALUE (or any value large enough to exhaust the
| available heap) triggers an OutOfMemoryError at the array allocation
| itself, before the corresponding label or pattern data is consumed
| from the stream. The error occurs very early in deserialization: for
| a GIS model, getOutcomes() is reached after only the model-type
| string, the correction constant, and the correction parameter have
| been read, so the attacker pays no meaningful size cost to weaponize
| a payload, and a single small file can crash a JVM that loads it.
| Any code path that deserializes a .bin model is affected, including
| direct use of GenericModelReader and any higher-level component that
| delegates to it during model load.
|
| The practical impact is denial of service against processes that
| load model files from untrusted or semi-trusted origins.
|
| Mitigation:
|
|  * 2.x users should upgrade to 2.5.9.
|  * 3.x users should upgrade to 3.0.0-M3.
|
| Note: The fix introduces an upper bound on each of the three count
| fields, checked before array allocation; counts that are negative or
| exceed the bound cause an IllegalArgumentException to be thrown and
| the read to fail fast with no large allocation. The default bound is
| 10,000,000, which is well above the entry counts of legitimate
| OpenNLP models but far below any value that would threaten heap
| exhaustion. Deployments that legitimately need to load models with
| more entries than the default can raise the limit at JVM startup by
| setting the OPENNLP_MAX_ENTRIES system property to the desired
| positive integer (e.g. -DOPENNLP_MAX_ENTRIES=50000000); invalid or
| non-positive values fall back to the default.
|
| Users who cannot upgrade immediately should treat all .bin model
| files as untrusted input unless their provenance is verified, and
| should avoid loading models supplied by end users or fetched from
| third-party repositories without integrity checks.
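
The bounds check described in the note can be sketched as follows. This
is an illustration of the general technique only, not the upstream
code: BoundedCounts, readCount and readLabels are hypothetical names,
and the stream layout (a 32-bit count followed by that many
DataInput-encoded UTF strings) is a simplifying assumption rather than
the exact OpenNLP model format. The 10,000,000 default is the figure
quoted in the advisory.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical illustration, not OpenNLP's AbstractModelReader.
final class BoundedCounts {

    // Default upper bound quoted in the advisory.
    static final int DEFAULT_MAX_ENTRIES = 10_000_000;

    // Reads a 32-bit count and validates it BEFORE any allocation.
    static int readCount(DataInputStream in, int maxEntries)
            throws IOException {
        int count = in.readInt();
        if (count < 0 || count > maxEntries) {
            throw new IllegalArgumentException(
                    "entry count out of range: " + count);
        }
        return count;
    }

    // Allocates the array only after the count has been accepted.
    static String[] readLabels(DataInputStream in) throws IOException {
        int n = readCount(in, DEFAULT_MAX_ENTRIES);
        String[] labels = new String[n];
        for (int i = 0; i < n; i++) {
            labels[i] = in.readUTF();
        }
        return labels;
    }
}
```

A stream whose count field is Integer.MAX_VALUE or negative is rejected
with an IllegalArgumentException after reading four bytes, instead of
reaching the array allocation.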


If you fix the vulnerabilities please also make sure to include the
CVE (Common Vulnerabilities & Exposures) ids in your changelog entry.

For further information see:

[0] https://security-tracker.debian.org/tracker/CVE-2026-40682
    https://www.cve.org/CVERecord?id=CVE-2026-40682
[1] https://security-tracker.debian.org/tracker/CVE-2026-42027
    https://www.cve.org/CVERecord?id=CVE-2026-42027
[2] https://security-tracker.debian.org/tracker/CVE-2026-42440
    https://www.cve.org/CVERecord?id=CVE-2026-42440

Please adjust the affected versions in the BTS as needed.

Regards,
Salvatore
