dpol1 commented on PR #1834:
URL: https://github.com/apache/stormcrawler/pull/1834#issuecomment-4104392285

   @rzo1 thanks for tagging me, appreciate that - the changes LGTM!
   
   However I was going through the documentation and noticed: 
    `URLUtil.toURL()` uses `new URI(string).toURL()` which is strict [RFC 
3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.2), 
   while the old `new URL(string)` was more permissive with characters like 
pipes (`|`) or unencoded spaces.
   
   `sanitizeForURI` handles this correctly inside `BasicURLNormalizer`. 
However, my concern is about legacy data already present in the status store, 
or URLs injected directly into the backend (e.g. Solr/OpenSearch via external 
scripts) that bypass `URLFilterBolt` entirely. Since `FetcherBolt` reads them 
raw, something like `http://example.com?q=a|b` would have been fetched fine 
before and would now silently fail as ERROR.
   
   Might be worth a follow-up issue, or documenting it as a known behavior 
change. What do you think?
   Not a blocker for this PR though! Just a silent functional regression.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to