NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 by sebastian-nagel · Pull Request #779 · apache/nutch · GitHub

NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 #779

Merged
sebastian-nagel merged 3 commits into apache:master from sebastian-nagel:NUTCH-2990-robotstxt-redirects on Oct 21, 2023

@sebastian-nagel

Contributor
  • follow multiple redirects when fetching robots.txt
  • number of followed redirects is configurable by the property http.robots.redirect.max (default: 5)
  • improvements in RobotRulesParser's robots.txt test utility
    • bug fix: the passed agent names need to be transferred to the property http.robots.agents earlier, before the protocol plugins are configured
    • more verbose debug logging
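
The redirect handling described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the class `RobotsRedirectSketch`, its `Response` type, and the `fetch` function are hypothetical stand-ins for the protocol plugin, and the limit is passed in as a plain parameter rather than read from `http.robots.redirect.max`. Giving up once the limit is exceeded is one possible policy and is an assumption here.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of bounded redirect following; not Nutch's HttpRobotRulesParser. */
final class RobotsRedirectSketch {

  /** Minimal response model (assumed): status code and optional Location header value. */
  static final class Response {
    final int status;
    final String location;

    Response(int status, String location) {
      this.status = status;
      this.location = location;
    }

    boolean isRedirect() {
      return status >= 300 && status < 400 && location != null;
    }
  }

  /**
   * Follows at most maxRedirects redirects starting from start.
   * Returns the URL that finally answered with a non-redirect response,
   * or empty if the redirect limit was exceeded or nothing was fetched.
   */
  static Optional<URI> resolveFinalUrl(URI start, int maxRedirects,
      Function<URI, Response> fetch) {
    URI current = start;
    int followed = 0;
    while (true) {
      Response response = fetch.apply(current);
      if (response == null) {
        return Optional.empty(); // nothing fetched for this URL
      }
      if (!response.isRedirect()) {
        return Optional.of(current);
      }
      if (followed >= maxRedirects) {
        return Optional.empty(); // too many consecutive redirects
      }
      // Location may be relative, so resolve it against the current URL.
      current = current.resolve(response.location);
      followed++;
    }
  }

  public static void main(String[] args) {
    // Canned responses instead of real HTTP requests (example.org is a placeholder).
    Map<String, Response> canned = Map.of(
        "http://example.org/robots.txt",
        new Response(301, "https://example.org/robots.txt"),
        "https://example.org/robots.txt",
        new Response(301, "https://www.example.org/robots.txt"),
        "https://www.example.org/robots.txt",
        new Response(200, null));
    Function<URI, Response> fetch = url -> canned.get(url.toString());
    URI start = URI.create("http://example.org/robots.txt");

    // prints "Optional[https://www.example.org/robots.txt]"
    System.out.println(resolveFinalUrl(start, 5, fetch));
    // prints "Optional.empty": two redirects are needed, only one is allowed
    System.out.println(resolveFinalUrl(start, 1, fetch));
  }
}
```

With a limit of 1, the sketch reproduces the situation this PR addresses: a robots.txt behind two redirects is not reached.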

… RFC 9309

- follow multiple redirects when fetching robots.txt
- number of followed redirects is configurable by the property
  http.robots.redirect.max (default: 5)
- bug fix: the passed agent names need to be transferred
  to the property http.robots.agents earlier, before the
  protocol plugins are configured
- more verbose debug logging
@lewismc

lewismc commented Sep 26, 2023

Member

@sebastian-nagel

Contributor Author

> an example on hand of a robots.txt which can be fetched with >1 redirects?

http://wikipedia.org/robots.txt

Note: this works with protocol-http; for protocol-okhttp, the fix for NUTCH-3002 also needs to be applied.

An additional note: this PR removes the secondary lookup for a lower-cased "location" header. Case-insensitive lookup of protocol metadata should be implemented at the protocol level.
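
As a rough illustration of what case-insensitive header lookup could look like, here is a small sketch based on a `TreeMap` with `String.CASE_INSENSITIVE_ORDER`. The class `CaseInsensitiveHeadersSketch` is hypothetical and does not reflect how Nutch's protocol plugins or metadata classes actually store headers.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a header map with case-insensitive keys; not Nutch code. */
final class CaseInsensitiveHeadersSketch {

  /** Creates a map whose keys are compared without regard to case. */
  static Map<String, String> newHeaderMap() {
    return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
  }

  public static void main(String[] args) {
    Map<String, String> headers = newHeaderMap();
    // A server may send the header name in lower case.
    headers.put("location", "https://www.example.org/robots.txt");

    // A single lookup is enough, whatever the case used by the caller.
    System.out.println(headers.get("Location"));
    System.out.println(headers.containsKey("LOCATION"));
  }
}
```

With a structure of this kind, callers need only one lookup for "Location" and no fallback for the lower-cased name.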

@jnioche

jnioche commented Oct 6, 2023

Contributor

@sebastian-nagel sebastian-nagel merged commit ecdd19d into apache:master Oct 21, 2023
@sebastian-nagel sebastian-nagel deleted the NUTCH-2990-robotstxt-redirects branch October 29, 2023 07:51
3 participants