Increase the number of redirects to 5 for Robots.txt fetching #1074
Merged
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
jnioche
approved these changes
May 19, 2023
jnioche
left a comment
Contributor
Thanks for adding the tests. Just a minor suggestion
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
Contributor
michaeldinzinger
added a commit
to michaeldinzinger/storm-crawler
that referenced
this pull request
May 22, 2023
…#1074)
* Issue apache#1058: Allow 5 redirects for Robots.txt fetching
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Minor variable renaming
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
---------
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
jnioche
added a commit
that referenced
this pull request
May 23, 2023
* Remove injection from crawl topologies in *Search archetypes, fixes #1065
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 (#1062)
* Fix unmangleQueryString filter. Fix unmangleQueryString filter. Do not analyze full URL path, just last child,
* formatting
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Removed remaining references to ES in OPenSearch module
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Dependency upgrades.fixes #1066 (#1067)
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Automatic creation of index definitions should use the bolt type (#1069)
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Maven plugin upgrades + better handling of plugin versions
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* bgufix test jar not attached
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Update maven.yml v3 version of actions
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* mechanism to retrieve more generic value of configuration (#1071)
* mechanism to retrieve more generic value of configuration if a specific one is not found, fixes #1070
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
* minor javadoc fix
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
---------
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Batch requests in DeleterBolt, fixes #1072
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Update README.md link to docker project
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Create DeletionBolt.java for Solr. #1050 (#1073)
* Create DeletionBolt.java storm-crawler-solr bug. Missing DeletionBolt bolt code. #1050
* Update DeletionBolt.java License header added
* Update DeletionBolt.java formatting
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* SOLR: suppress warnings + minor changes and Javadoc + added deletion to default topology
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Tika 2.8.0, fixes 1066
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Increase the number of redirects to 5 for Robots.txt fetching (#1074)
* Issue #1058: Allow 5 redirects for Robots.txt fetching
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Minor variable renaming
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
---------
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Add test coverage reports with JaCoCo and Coveralls, fixes #1075
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* #1075 - Add test coverage reports with JaCoCo
Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* #1075 - Update GH workflow to reduce log spam by adding -B and --no-transfer-progess maven options
Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Rebase - Issue #1042: Forbid all rules by default
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Modify Robots.txt parsing logic and add test cases
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Parse robots txt rules only for status code 200
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Trying to resolve merge conflicts
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Modify Robots.txt parsing logic and add test cases
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Parse robots txt rules only for status code 200
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
* Merge HttpRobotRulesParserTest
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
---------
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
Co-authored-by: syefimov <syefimov@ptfs.com>
Co-authored-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>

Described in Issue #1058.
IETF RFC 9309 (Robots Exclusion Protocol) states that crawlers should follow at least five consecutive redirects when fetching a robots.txt file. Until now, StormCrawler (SC) followed only one level of redirects. This change raises the limit to 5, which might slightly improve the politeness of the crawler.
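For illustration only, the redirect-following behaviour could be sketched roughly as below. This is not the code from this PR: the class `RobotsRedirectSketch`, the `Response` type, the `fetchRobots` method and the `fetcher` function are all hypothetical names, and the fetcher is a stand-in for whatever HTTP protocol implementation the crawler actually uses.

```java
import java.net.URI;
import java.util.function.Function;

/** Hypothetical sketch: follow a bounded number of redirects for robots.txt. */
public class RobotsRedirectSketch {

    /** Minimal stand-in for an HTTP response: status code and Location header. */
    public static final class Response {
        public final int status;
        public final String location; // null when the response is not a redirect

        public Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Assumed limit, matching the 5 redirects mentioned in the description. */
    public static final int MAX_REDIRECTS = 5;

    /**
     * Fetches robotsUrl with the given fetcher and follows at most
     * MAX_REDIRECTS consecutive redirects, including across hosts.
     * Returns the last response seen, which is still a redirect when
     * the chain is longer than the limit.
     */
    public static Response fetchRobots(String robotsUrl, Function<String, Response> fetcher) {
        String current = robotsUrl;
        Response response = fetcher.apply(current);
        int followed = 0;
        while (isRedirect(response.status)
                && response.location != null
                && followed < MAX_REDIRECTS) {
            // The Location header may be relative, so resolve it against the current URL
            current = URI.create(current).resolve(response.location).toString();
            response = fetcher.apply(current);
            followed++;
        }
        return response;
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }
}
```

In this sketch the caller decides what to do when the returned response is still a redirect, or has a status other than 200; how StormCrawler handles those cases is outside what the description above states.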