dpol1 commented on PR #1834:
URL: https://github.com/apache/stormcrawler/pull/1834#issuecomment-4104392285
@rzo1 thanks for tagging me, appreciate that - the changes LGTM!
However I was going through the documentation and noticed:
`URLUtil.toURL()` uses `new URI(string).toURL()` which is strict [RFC
3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.2),
while the old `new URL(string)` was more permissive with characters like
pipes (`|`) or unencoded spaces.
`sanitizeForURI` handles this correctly inside `BasicURLNormalizer`.
However, my concern is about legacy data already present in the status store,
or URLs injected directly into the backend (e.g. Solr/OpenSearch via external
scripts) that bypass `URLFilterBolt` entirely. Since `FetcherBolt` reads them
raw, something like `http://example.com?q=a|b` would have been fetched fine
before and would now silently fail as ERROR.
Might be worth a follow-up issue, or documenting it as a known behavior
change. What do you think?
Not a blocker for this PR though! Just a silent functional regression.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]