sebastian-nagel commented on code in PR #1843:
URL: https://github.com/apache/stormcrawler/pull/1843#discussion_r2999183668
########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -180,6 +179,7 @@ is defined.
 | fetcher.server.min.delay | 0 | Delay between crawls for queues with >1 thread. Ignores robots.txt.
 | fetcher.threads.number | 10 | Total concurrent threads fetching pages. Adjust carefully based on system capacity.
 | fetcher.threads.per.queue | 1 | Default number of threads per queue. Can be overridden.
+| fetcher.threads.start.delay | 10 | Delay (seconds) before starting fetcher threads.

Review Comment:
(in milliseconds)

> Delay (milliseconds) between starting the next fetcher thread. Avoids overloading DNS or network resources during fetcher startup, when all threads would otherwise start requesting their first pages simultaneously.

########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 403.
+| http.retry.on.connection.failure | true | Retry fetching on connection failure.

Review Comment:
`http.retry.on.connection.failure` is supported only by the OkHttp protocol. Maybe link to <https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/retry-on-connection-failure.html>?

########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 403.
+| http.retry.on.connection.failure | true | Retry fetching on connection failure.
+| http.robots.403.allow | true | Allow crawling when robots.txt returns HTTP 403.
+| http.robots.5xx.allow | false | Allow crawling when robots.txt returns a server error (5xx).
 | http.robots.agents | '' | Additional user-agent strings for interpreting robots.txt.
-| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.robots.content.limit | -1 | Maximum bytes to fetch for robots.txt. -1 uses http.content.limit.
+| http.robots.file.skip | false | Ignore robots.txt rules entirely.
+| http.robots.headers.skip | false | Ignore robots directives from HTTP headers.
+| http.robots.meta.skip | false | Ignore robots directives from HTML meta tags.
 | http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| robots.noFollow.strict | true | If true, remove all outlinks from pages marked as noFollow.
 | http.store.headers | false | Whether to store response headers.
-| http.store.responsetime | true | Not yet implemented — store response time in Metadata.
 | http.timeout | 10000 | Connection timeout (ms).
 | http.use.cookies | false | Use cookies in subsequent requests.
 | https.protocol.implementation | org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol implementation.
 | partition.url.mode | byHost | Defines how URLs are partitioned: byHost, byDomain, or byIP.
-| protocols | http,https | Supported protocols.
-| redirections.allowed | true | Allow URL redirects.
+| protocols | http,https,file | Supported protocols.
+| http.allow.redirects | false | Allow URL redirects.

Review Comment:
Maybe we should add a dedicated subsection to "protocols" which explains the behavior regarding redirects:

### Following Redirects

When following HTTP redirects you have three options:

1. By default, StormCrawler emits the redirect target URL to the status stream. URL filter and normalization rules are applied to the target URLs, the crawler verifies that the target URL is allowed per robots.txt, and it is ensured that the redirect target is not fetched multiple times (URLs are deduplicated in the status index).
2. If `redirections.allowed` is false, the redirect target URLs are not sent to the status stream; that is, redirects are ignored.
3. Redirects are followed immediately in the HTTP client and the target URLs are not emitted to the status stream. This is the default behavior for the browser-based protocols (Selenium and Playwright), but it is also supported by the OkHttp protocol if `http.allow.redirects` is set to true.

########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -196,20 +196,26 @@ implementation.
+| http.allow.redirects | false | Allow URL redirects.

Review Comment:
> (OkHttp only) Follow HTTP redirects immediately in the HTTP protocol client. Note: if followed immediately, redirect target URLs are not emitted to the status stream, are not filtered, not deduplicated, and not checked whether they are allowed per robots.txt, etc.

########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -196,20 +196,26 @@ implementation.
-| protocols | http,https | Supported protocols.

Review Comment:
Defined in StatusEmitterBolt and used by derived classes (FetcherBolt, etc.):

```
| redirections.allowed | true | If true, emit redirect target URLs as "outlinks" to the status stream. If false, do not follow redirects. See also `http.allow.redirects`.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
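The three redirect-handling options discussed in the review could be summarized in a configuration sketch. This is illustrative only: the property names come from the review comments above, the `config:` top-level key follows the usual StormCrawler `crawler-conf.yaml` layout, and the values shown are the defaults as stated in the thread.

```yaml
# Sketch of the redirect-related settings discussed in the review,
# in the style of a StormCrawler crawler-conf.yaml (values illustrative).
config:
  # Option 1 (default): redirect target URLs are emitted to the status
  # stream, where they are filtered, normalized, checked against
  # robots.txt, and deduplicated before being fetched.
  # Option 2: set this to false to ignore redirects entirely; target
  # URLs are then not sent to the status stream.
  redirections.allowed: true

  # Option 3 (OkHttp protocol only): set to true to follow redirects
  # immediately inside the HTTP client. Targets then bypass the status
  # stream, so they are not filtered, deduplicated, or checked against
  # robots.txt.
  http.allow.redirects: false

  # Per the first review comment: delay in milliseconds between starting
  # successive fetcher threads, to avoid overloading DNS or the network
  # when all threads would otherwise request their first pages at once.
  fetcher.threads.start.delay: 10
```

Note that options 1 and 2 are mutually exclusive values of the same key, so option 2 is shown as a comment.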
