[PR] Replace XSoup with JSoup built-in XPath support [stormcrawler]

via GitHub Mon, 30 Mar 2026 08:30:29 -0700


jnioche opened a new pull request, #1856:
URL: https://github.com/apache/stormcrawler/pull/1856


   ## Summary
   
   Fixes #1407
   
   - Removes the `us.codecraft:xsoup` dependency entirely
   - Replaces XSoup's `Xsoup.compile(xpath).evaluate(doc)` with JSoup's native `doc.selectXpath()` API (available since JSoup 1.14.3; we use 1.22.1)
   - Non-standard XSoup functions and trailing attribute selectors are handled by stripping the suffix from the expression and mapping it to the equivalent JSoup `Element` method:
     - `/tidyText()` → `Element.text()`
     - `/allText()` → `Element.text()`
     - `/html()` → `Element.html()`
     - `/@attr` → `Element.attr(name)`
   - Element names in XPath expressions are automatically lowercased (e.g. `//SPAN` → `//span`), because JSoup normalizes tag names to lowercase whereas XSoup matched them case-insensitively
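
For illustration, the suffix handling listed above could look roughly like the sketch below. It is not the code in this PR: the class `XsoupSuffixSketch`, the `Extract` enum and the `parse` method are hypothetical names, and the sketch only covers the four suffixes named in the list. It uses the standard library alone; the resulting `xpath` would then be passed to `doc.selectXpath(...)` and the matching `Element` method applied to each result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split an XSoup-style expression into plain XPath plus an extraction step. */
final class XsoupSuffixSketch {

    /** Which JSoup Element method the stripped suffix corresponds to. */
    enum Extract { TEXT, HTML, ATTR, ELEMENT }

    static final class Parsed {
        final String xpath;     // expression to hand to doc.selectXpath(...)
        final Extract extract;  // TEXT -> Element.text(), HTML -> Element.html(), ATTR -> Element.attr(name)
        final String attrName;  // only set when extract == ATTR

        Parsed(String xpath, Extract extract, String attrName) {
            this.xpath = xpath;
            this.extract = extract;
            this.attrName = attrName;
        }
    }

    // A trailing "/@name" step; predicates such as [@name="x"] are not affected
    // because they are introduced by "[@", not "/@".
    private static final Pattern TRAILING_ATTR = Pattern.compile("^(.+)/@([\\w:.-]+)$");

    static Parsed parse(String expression) {
        // Both XSoup text functions map to Element.text().
        for (String suffix : new String[] {"/tidyText()", "/allText()"}) {
            if (expression.endsWith(suffix)) {
                return new Parsed(cut(expression, suffix), Extract.TEXT, null);
            }
        }
        if (expression.endsWith("/html()")) {
            return new Parsed(cut(expression, "/html()"), Extract.HTML, null);
        }
        Matcher m = TRAILING_ATTR.matcher(expression);
        if (m.matches()) {
            return new Parsed(m.group(1), Extract.ATTR, m.group(2));
        }
        // No recognised suffix: the expression is used as-is.
        return new Parsed(expression, Extract.ELEMENT, null);
    }

    private static String cut(String expression, String suffix) {
        return expression.substring(0, expression.length() - suffix.length());
    }
}
```

With this sketch, `parse("//TITLE/tidyText()")` yields the XPath `//TITLE` with a `TEXT` extraction, and `parse("//META[@name=\"keywords\"]/@content")` yields `//META[@name="keywords"]` with attribute name `content`.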
   
   Existing user configurations with expressions like `//TITLE/tidyText()` or 
`//META[@name="keywords"]/@content` continue to work unchanged.
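
As a rough sketch of why such expressions keep working, element names can be lowercased with a regular expression before the XPath is evaluated. This is an assumption about one possible approach, not the implementation in this PR: the class `XPathLowercaseSketch` and the method `lowercaseElementNames` are hypothetical, and the sketch only rewrites names that directly follow a `/`. Quoted string literals, attribute names, function calls such as `text()` and axis names are left untouched.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: lowercase element names in an XPath expression. */
final class XPathLowercaseSketch {

    // Alternatives, tried in order at each position:
    //   1. a double-quoted literal (kept verbatim)
    //   2. a single-quoted literal (kept verbatim)
    //   3. a name directly after "/", captured in group 1, unless it is
    //      followed by "(" (function call) or "::" (axis specifier).
    // The possessive quantifier (*+) stops the engine from shortening the
    // name to sidestep the lookahead.
    private static final Pattern STEP = Pattern.compile(
            "\"[^\"]*\"|'[^']*'|(?<=/)([A-Za-z_][\\w.-]*+)(?!\\s*\\(|::)");

    static String lowercaseElementNames(String xpath) {
        Matcher m = STEP.matcher(xpath);
        return m.replaceAll(r -> Matcher.quoteReplacement(
                r.group(1) == null
                        ? r.group()                             // quoted literal: unchanged
                        : r.group().toLowerCase(Locale.ROOT))); // element name: lowercased
    }
}
```

Under these assumptions, `//META[@name="keywords"]/@content` becomes `//meta[@name="keywords"]/@content`, while a literal such as `"/Foo"` inside a predicate keeps its case.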
   
   ## Test plan
   
   - [x] All existing `JSoupFiltersTest` tests pass (concept extraction, script 
extraction, LD-JSON extraction, extra links)
   - [x] All existing `XPathFilterTest` tests pass
   - [x] Full core test suite passes (215 tests, 0 failures)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

