Source: apache-opennlp
Version: 2.5.3-1
Severity: important
Tags: security upstream
X-Debbugs-Cc: [email protected], Debian Security Team <[email protected]>
Hi,

The following vulnerabilities were published for apache-opennlp.

CVE-2026-40682[0]:
| XML External Entity (XXE) via Unsanitized Dictionary Parsing in
| Apache OpenNLP DictionaryEntryPersistor
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The DictionaryEntryPersistor class initializes a static
| SAXParserFactory at class-load time without enabling
| FEATURE_SECURE_PROCESSING or disabling DTD processing. When
| create(InputStream, EntryInserter) is invoked, the only feature set
| on the XMLReader is namespace support; external entity resolution
| and DOCTYPE declarations remain fully enabled. An attacker who can
| supply a crafted dictionary file (e.g., a stop-word list or domain
| dictionary) containing a malicious DOCTYPE declaration can trigger
| local file disclosure via file:// entity references or server-side
| request forgery via http:// entity references during SAX parsing,
| before the application processes a single dictionary entry. This is
| inconsistent with the project's own XmlUtil.createSaxParser()
| helper, which correctly sets FEATURE_SECURE_PROCESSING and
| disallow-doctype-decl and is used by all other XML parsing paths in
| the codebase. The public Dictionary(InputStream) constructor
| delegates directly to this method and is the documented API for
| loading user-supplied dictionaries, making untrusted input a
| realistic scenario.
|
| Mitigation: 2.x users should upgrade to 2.5.9. 3.x users should
| upgrade to 3.0.0-M3. Users who cannot upgrade immediately should
| ensure that all dictionary files are sourced from trusted origins
| and should consider wrapping the Dictionary(InputStream) constructor
| with input validation that rejects any XML containing a DOCTYPE
| declaration before it reaches the parser.
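For maintainers wanting to see the interim mitigation concretely, here is a minimal sketch (class and method names are mine, not OpenNLP API): a SAXParserFactory hardened with the two features the advisory attributes to XmlUtil.createSaxParser(), plus a simple pre-parse guard that rejects any payload containing a DOCTYPE declaration. The guard reads the whole stream into memory for simplicity; a real wrapper would buffer and re-stream.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public final class SafeDictionaryXml {

    // Hardened factory configuration mirroring what the advisory says
    // XmlUtil.createSaxParser() does: secure processing on, DTDs off.
    public static SAXParser createHardenedSaxParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature(
            "http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newSAXParser();
    }

    // Cheap pre-parse guard for the interim mitigation: reject any
    // payload containing a DOCTYPE declaration before it reaches the
    // parser. (The XML keyword DOCTYPE is case-sensitive, so a plain
    // substring check suffices for this simplified sketch.)
    public static boolean containsDoctype(InputStream in) throws Exception {
        String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        return xml.contains("<!DOCTYPE");
    }
}
```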
CVE-2026-42027[1]:
| Arbitrary Class Instantiation via Model Manifest in Apache OpenNLP
| ExtensionLoader
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The ExtensionLoader.instantiateExtension(Class, String)
| method loads a class by its fully-qualified name via Class.forName()
| and invokes its no-arg constructor, with the class name sourced from
| the manifest.properties entry of a model archive. The existing
| isAssignableFrom check correctly rejects classes that are not
| subtypes of the expected extension interface (BaseToolFactory for
| factory=, ArtifactSerializer for serializer-class-*), but the check
| runs after Class.forName() has already loaded and initialized the
| named class. Class.forName() with default initialization semantics
| executes the target class's static initializer before returning, so
| an attacker who can supply a crafted model archive can cause the
| static initializer of any class on the classpath to run during model
| loading, regardless of whether that class passes the subsequent type
| check. Exploitation requires a class with attacker-useful side
| effects in its static initializer (for example, JNDI lookup,
| outbound network I/O, or filesystem access) to be present on the
| classpath, so this is not a drop-in remote code execution; however,
| the attack surface grows as third-party model distribution becomes
| more common (community model repositories, Hugging Face-style
| sharing), where users routinely load model files from origins they
| do not control. A secondary, narrower vector affects deployments
| that ship legitimate BaseToolFactory or ArtifactSerializer
| subclasses with side-effecting no-arg constructors: a malicious
| manifest can name such a class and force its constructor to run
| during model load.
|
| Mitigation:
| * 2.x users should upgrade to 2.5.9.
| * 3.x users should upgrade to 3.0.0-M3.
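The load-before-check ordering described above can be demonstrated in a few lines; this is an illustrative sketch of the general JVM behavior, not OpenNLP code. The nested Gadget class stands in for an attacker-useful class on the classpath: its static initializer runs as soon as Class.forName() loads it, even though the subsequent type check rejects it.

```java
public final class StaticInitDemo {

    // Visible side effect standing in for a JNDI lookup, network
    // request, or filesystem write in a real gadget class.
    static boolean gadgetRan = false;

    public static class Gadget {
        static { gadgetRan = true; } // runs at class initialization
    }

    // Mirrors the vulnerable ordering: Class.forName() first, type
    // check second. Returns true when the static initializer ran even
    // though the class failed the check.
    public static boolean demonstrate() throws Exception {
        Class<?> cls = Class.forName(
            StaticInitDemo.class.getName() + "$Gadget"); // init happens here
        boolean passesCheck = Runnable.class.isAssignableFrom(cls); // too late
        return gadgetRan && !passesCheck;
    }
}
```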
| Note: The fix introduces a package-prefix allowlist that is
| consulted before Class.forName() is invoked, so the static
| initializer of a disallowed class is never executed. Classes under
| the opennlp. prefix remain permitted by default. Deployments that
| load models referencing factories or serializers outside opennlp.*
| must opt those packages in, either programmatically via
| ExtensionLoader.registerAllowedPackage(String) before the first
| model load, or by setting the OPENNLP_EXT_ALLOWED_PACKAGES system
| property to a comma-separated list of allowed package prefixes.
| Users who cannot upgrade immediately should ensure that all model
| files are sourced from trusted origins and should audit their
| classpath for classes with side-effecting static initializers or
| constructors, particularly any that perform JNDI lookups, network
| requests, or filesystem operations during class initialization.

CVE-2026-42440[2]:
| OOM Denial of Service via Unbounded Array Allocation in Apache
| OpenNLP AbstractModelReader
|
| Versions Affected: before 2.5.9, before 3.0.0-M3
|
| Description: The AbstractModelReader methods getOutcomes(),
| getOutcomePatterns(), and getPredicates() each read a 32-bit signed
| integer count field from a binary model stream and pass that value
| directly to an array allocation (new String[numOutcomes], new
| int[numOCTypes][], new String[NUM_PREDS]) without validating that
| the value is non-negative or within a reasonable bound. The count is
| therefore fully attacker-controlled when the model file originates
| from an untrusted source. A crafted .bin model file in which any of
| these count fields is set to Integer.MAX_VALUE (or any value large
| enough to exhaust the available heap) triggers an OutOfMemoryError
| at the array allocation itself, before the corresponding label or
| pattern data is consumed from the stream.
| The error occurs very early in deserialization: for a GIS model,
| getOutcomes() is reached after only the model-type string, the
| correction constant, and the correction parameter have been read,
| so the attacker pays no meaningful size cost to weaponize a
| payload, and a single small file can crash a JVM that loads it. Any
| code path that deserializes a .bin model is affected, including
| direct use of GenericModelReader and any higher-level component
| that delegates to it during model load. The practical impact is
| denial of service against processes that load model files from
| untrusted or semi-trusted origins.
|
| Mitigation:
| * 2.x users should upgrade to 2.5.9.
| * 3.x users should upgrade to 3.0.0-M3.
|
| Note: The fix introduces an upper bound on each of the three count
| fields, checked before array allocation; counts that are negative
| or exceed the bound cause an IllegalArgumentException to be thrown
| and the read to fail fast with no large allocation. The default
| bound is 10,000,000, which is well above the entry counts of
| legitimate OpenNLP models but far below any value that would
| threaten heap exhaustion. Deployments that legitimately need to
| load models with more entries than the default can raise the limit
| at JVM startup by setting the OPENNLP_MAX_ENTRIES system property
| to the desired positive integer
| (e.g. -DOPENNLP_MAX_ENTRIES=50000000); invalid or non-positive
| values fall back to the default. Users who cannot upgrade
| immediately should treat all .bin model files as untrusted input
| unless their provenance is verified, and should avoid loading
| models supplied by end users or fetched from third-party
| repositories without integrity checks.

If you fix the vulnerabilities please also make sure to include the
CVE (Common Vulnerabilities & Exposures) ids in your changelog entry.
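For reference when reviewing the patch, the fail-fast bound for CVE-2026-42440 can be sketched as follows. This is an illustrative reconstruction from the advisory text, not the actual OpenNLP code; the class and method names are mine, but the default bound, the OPENNLP_MAX_ENTRIES property, and the fallback behavior for invalid values match what the advisory describes.

```java
public final class BoundedModelCounts {

    // Default bound per the advisory: above legitimate model entry
    // counts, far below anything that threatens heap exhaustion.
    static final int DEFAULT_MAX_ENTRIES = 10_000_000;

    // Resolve the effective limit; invalid or non-positive values of
    // the system property fall back to the default.
    public static int maxEntries() {
        String raw = System.getProperty("OPENNLP_MAX_ENTRIES");
        if (raw != null) {
            try {
                int v = Integer.parseInt(raw.trim());
                if (v > 0) {
                    return v;
                }
            } catch (NumberFormatException ignored) {
                // fall through to the default
            }
        }
        return DEFAULT_MAX_ENTRIES;
    }

    // Validate a count read from the model stream BEFORE any array
    // allocation, so a crafted count can never trigger an OOM.
    public static int checkCount(int count) {
        if (count < 0 || count > maxEntries()) {
            throw new IllegalArgumentException("Invalid entry count: " + count);
        }
        return count;
    }
}
```

A reader would then allocate with, e.g., `new String[BoundedModelCounts.checkCount(numOutcomes)]`, so a hostile Integer.MAX_VALUE count fails fast instead of allocating.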
For further information see:

[0] https://security-tracker.debian.org/tracker/CVE-2026-40682
    https://www.cve.org/CVERecord?id=CVE-2026-40682
[1] https://security-tracker.debian.org/tracker/CVE-2026-42027
    https://www.cve.org/CVERecord?id=CVE-2026-42027
[2] https://security-tracker.debian.org/tracker/CVE-2026-42440
    https://www.cve.org/CVERecord?id=CVE-2026-42440

Please adjust the affected versions in the BTS as needed.

Regards,
Salvatore

