#1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off by tballison · Pull Request #1477 · apache/stormcrawler · GitHub

#1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off #1477

Merged: jnioche merged 4 commits into main from issue-1475 on Feb 28, 2025

Conversation

tballison (Contributor) commented Feb 21, 2025

I tested this offline with https://www.cdc.gov and https://www.fda.gov.

I confirmed that when sitemap.discovery=false globally, I could set it to true for one seed in the seed file, and the behavior was as expected.

I also tested the opposite: the default was true and one seed was set to false, and the behavior was as expected.
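
The two tests above amount to "a per-seed value wins over the global default". A minimal sketch of that resolution rule follows; it is not the code from this PR. The class and method names are hypothetical, and it assumes the per-URL metadata is a plain string map rather than StormCrawler's own Metadata type.

```java
import java.util.Map;

// Hypothetical helper, not StormCrawler code: resolves the effective
// sitemap.discovery setting for a single URL.
final class SitemapDiscoverySetting {
    // Key name taken from the PR description.
    static final String KEY = "sitemap.discovery";

    // A per-URL value (for example one set in the seed file) takes precedence;
    // if the URL carries no value, fall back to the global default.
    static boolean isEnabled(Map<String, String> urlMetadata, boolean globalDefault) {
        String override = urlMetadata.get(KEY);
        return override == null ? globalDefault : Boolean.parseBoolean(override);
    }

    public static void main(String[] args) {
        // Global default off, one seed opts in.
        System.out.println(isEnabled(Map.of(KEY, "true"), false));
        // Global default on, one seed opts out.
        System.out.println(isEnabled(Map.of(KEY, "false"), true));
    }
}
```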

I'm not sure this is the best solution. I don't like tightly coupling logic for the SitemapFilter in the FetcherBolts, but so it goes.

And, as usual, unit tests are, well, hard.

Let me know what you think.

Thank you for contributing to Apache StormCrawler.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

tballison requested a review from jnioche February 21, 2025 17:35
tballison changed the title from "#1475" to "#1475 -- Regular crawling should work when autodiscovery of sitemaps is turned off" Feb 21, 2025

jnioche (Contributor) left a comment

Instead of passing a configuration value through the metadata, I think we should make use of the configure method which all filters inherit.
We can read the configuration value there and, if sitemap detection is off, have a simple check at the beginning of the filter method and exit.
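
The suggestion above could look roughly like the following sketch. It is a self-contained stand-in, not the merged change: the class does not extend any StormCrawler type, the method signatures are simplified, the `sitemap.found` metadata key and the default of `false` are assumptions for illustration, and "return the URL to keep it, null to drop it" is assumed as the filter contract.

```java
import java.util.Map;

// Hypothetical stand-in for a URL filter; names and signatures are simplified
// and are not taken from the StormCrawler code base.
final class SitemapFilterSketch {
    // Key name taken from the PR description.
    static final String DISCOVERY_KEY = "sitemap.discovery";
    // Assumed metadata key marking a source page whose host has a known sitemap.
    static final String SITEMAP_FOUND_KEY = "sitemap.found";

    // Assumed default when the key is absent from the configuration.
    private boolean discoveryEnabled = false;

    // Called once at start-up: read the global setting from the configuration.
    void configure(Map<String, Object> conf) {
        Object value = conf.get(DISCOVERY_KEY);
        discoveryEnabled = value != null && Boolean.parseBoolean(value.toString());
    }

    // Returns the URL to keep it, or null to drop it.
    String filter(Map<String, String> sourceMetadata, String url) {
        // Early exit: with discovery off, the filter leaves every URL alone.
        if (!discoveryEnabled) {
            return url;
        }
        // Discovery on: drop outlinks found on pages whose host has a sitemap.
        boolean sitemapFound = Boolean.parseBoolean(sourceMetadata.get(SITEMAP_FOUND_KEY));
        return sitemapFound ? null : url;
    }
}
```

Reading the setting once in configure keeps the filter self-contained, so the fetcher does not need to copy the configuration value into each URL's metadata. This sketch only covers the global setting; a per-seed override would still have to reach the filter through metadata.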

tballison (Contributor, Author) commented Feb 24, 2025

tballison requested a review from jnioche February 27, 2025 16:38
jnioche added this to the 3.3.0 milestone Feb 28, 2025
jnioche added the bug label Feb 28, 2025
jnioche merged commit 9b39f50 into main Feb 28, 2025
jnioche deleted the issue-1475 branch February 28, 2025 10:03
jnioche (Contributor) commented Feb 28, 2025
