Crawling a whole blog #592
-
Assuming that a WordPress blog (or equivalent) does not have much protection, what is the fastest way to do an operation similar to HTTrack (GUI full-site scraper), Heritrix (the Internet Archive's archiving crawler), or grab-site (ArchiveTeam's crawler)? And if the blog gets updated regularly, how does one avoid duplicate pages?
Replies: 1 comment
-
You'd need to use the `--explore` function on the command line with `--backup-dir html/` to replicate that functionality, but Trafilatura would also extract the content. In some cases it makes more sense to get the data first and then use Trafilatura locally on the downloaded content (which also answers Q2, avoiding duplicates).

Q1: this is not the same, requires a lot of training, or is not as efficient (depending on the use case and the package).
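
If you go the download-first route, here is a minimal sketch of what that could look like with Trafilatura's Python API (`sitemap_search` and `fetch_url` are part of the library; the blog URL, backup directory, and file naming are placeholders, and it assumes the blog exposes a sitemap, as most WordPress sites do):

```python
# Download-first sketch: discover post URLs via the sitemap and keep a raw
# HTML copy of every downloaded page so extraction can happen offline later.
from pathlib import Path

from trafilatura import fetch_url
from trafilatura.sitemaps import sitemap_search

backup_dir = Path("html")          # local copy of every downloaded page
backup_dir.mkdir(exist_ok=True)

for i, url in enumerate(sitemap_search("https://blog.example.org")):
    html = fetch_url(url)          # returns the page as a string, or None
    if html:
        (backup_dir / f"{i:05d}.html").write_text(html, encoding="utf-8")
```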
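To avoid processing duplicate pages when re-crawling a regularly updated blog (Q2), one simple approach — an assumption on my part, not something built into Trafilatura — is to hash the extracted text and skip anything already seen:

```python
# Extract locally from the backed-up HTML and skip exact duplicates by
# hashing the extracted text of each page.
import hashlib
from pathlib import Path

from trafilatura import extract

seen = set()
for html_file in sorted(Path("html").glob("*.html")):
    text = extract(html_file.read_text(encoding="utf-8", errors="ignore"))
    if not text:
        continue                   # nothing extractable on this page
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:             # identical content already processed
        continue
    seen.add(digest)
    html_file.with_suffix(".txt").write_text(text, encoding="utf-8")
```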