sebastian-nagel · 2023-10-02T08:36:57Z

- Remove the unsafe check for absolute URLs on redirects (HTTP Location header): a string starting with "http" is not necessarily a valid, resolvable URL. For safety, always resolve the redirect location using java.net.URL(URL context, String spec).
- Resolve relative redirects using the redirect source as the base URL. Resolving against the original URL as base/context may yield the wrong redirect target in a chain of redirects that spans different hosts/authorities.
- Cache the robots rules for every /robots.txt (if at the default location) in a chain of redirects.
- Add a unit test covering the three points above.
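The first two points can be illustrated with a minimal sketch. It is not the code from this PR; the class name `RedirectResolution`, the helper `resolve`, and the example hosts are hypothetical. It only shows how `java.net.URL(URL context, String spec)` behaves when each Location value is resolved against the URL that issued the redirect.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectResolution {

    /**
     * Resolves a Location header value against the URL that sent the redirect.
     * Hypothetical helper: new URL(context, spec) accepts absolute and
     * relative specs alike, so no "starts with http" check is needed.
     */
    static URL resolve(URL redirectSource, String location) throws MalformedURLException {
        return new URL(redirectSource, location);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL original = new URL("https://example.com/robots.txt");

        // Hop 1: absolute redirect to a different host.
        URL hop1 = resolve(original, "https://www.example.org/robots.txt");

        // Hop 2: relative redirect, resolved against hop 1 (the redirect source).
        URL hop2 = resolve(hop1, "/en/robots.txt");
        System.out.println(hop2); // https://www.example.org/en/robots.txt

        // Resolving the same value against the original URL gives the wrong host.
        System.out.println(resolve(original, "/en/robots.txt")); // https://example.com/en/robots.txt

        // A value that starts with "http" but is actually a relative path.
        System.out.println(resolve(original, "http-status/robots.txt")); // https://example.com/http-status/robots.txt
    }
}
```

The last call shows why a prefix check is unsafe: the value begins with "http" yet carries no scheme, so it has to be treated as relative to the redirect source.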

Note: this PR is a result of implementing the RFC 9309 redirect rules in Nutch, see NUTCH-2990 and apache/nutch#779. I deliberately took the implementation in StormCrawler (#1058/#1074) as a starting point.
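The caching point from the list above can be sketched as follows. This is an illustration under stated assumptions, not the StormCrawler implementation: the class `RobotsRedirectCache`, the `protocol:host:port` key format, and the example hosts are all hypothetical. The idea is that every URL in the redirect chain that sits at the default /robots.txt location gets the final rules cached under its own key, while redirect targets at other paths are skipped.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsRedirectCache {

    /** Hypothetical cache key: protocol, host and effective port, lower-cased. */
    static String cacheKey(URL url) {
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        return (url.getProtocol() + ":" + url.getHost() + ":" + port).toLowerCase(Locale.ROOT);
    }

    /** Returns the cache keys of all URLs in the chain located at the default /robots.txt path. */
    static List<String> keysToCache(List<URL> chain) {
        List<String> keys = new ArrayList<>();
        for (URL url : chain) {
            // Only the default location counts; other paths do not define rules for their host.
            if ("/robots.txt".equals(url.getPath()) && url.getQuery() == null) {
                keys.add(cacheKey(url));
            }
        }
        return keys;
    }

    public static void main(String[] args) throws MalformedURLException {
        List<URL> chain = List.of(
                new URL("https://example.com/robots.txt"),
                new URL("https://www.example.com/robots.txt"),
                new URL("https://cdn.example.net/static/robots-www.txt"));

        // The rules fetched from the last URL would be stored under each of these keys.
        System.out.println(keysToCache(chain)); // [https:example.com:443, https:www.example.com:443]
    }
}
```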

- remove unsafe check for absolute URLs on redirects (location header)
- resolve relative redirects using the redirect source as base URL
- cache robot rules for all /robots.txt (if on default location) in a chain of redirects

Signed-off-by: Sebastian Nagel <sebastian@commoncrawl.org>

jnioche · 2023-10-02T14:25:56Z

jnioche added this to the 2.10 milestone Oct 2, 2023

jnioche added enhancement core labels Oct 2, 2023

jnioche merged commit d6f1377 into apache:master Oct 2, 2023

Improvements and fixes to HttpRobotRulesParser when following redirects #1103

jnioche merged 1 commit into apache:master from sebastian-nagel:robotstxt-redirects-fixes
