Crawling a whole blog #592
-
Assuming that a WordPress blog (or equivalent) does not have much protection, what is the fastest way to do an operation similar to HTTrack (GUI full-site scraper), Heritrix (the Internet Archive's archiving crawler), or grab-site (ArchiveTeam's crawler)? And if the blog gets updated regularly, how does one avoid duplicate pages?
Replies: 1 comment
-
You'd need to use the `--explore` function on the command line with `--backup-dir html/` to replicate that functionality, but Trafilatura would also extract the content. In some cases it makes more sense to get the data first and then use Trafilatura locally on the downloaded content (which also answers Q2, avoiding duplicates).

Q1: this is not the same, requires a lot of training, or is not as efficient (depending on the use case and the package).
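
If you go the download-first route, here is a minimal sketch of what that could look like with Trafilatura's Python API (`sitemap_search` and `fetch_url` are part of the library; the blog URL, backup directory, and file naming are placeholders, and it assumes the blog exposes a sitemap, as most WordPress sites do):

```python
# Download-first sketch: discover post URLs via the sitemap and keep a raw
# HTML copy of every downloaded page so extraction can happen offline later.
from pathlib import Path

from trafilatura import fetch_url
from trafilatura.sitemaps import sitemap_search

backup_dir = Path("html")          # local copy of every downloaded page
backup_dir.mkdir(exist_ok=True)

for i, url in enumerate(sitemap_search("https://blog.example.org")):
    html = fetch_url(url)          # returns the page as a string, or None
    if html:
        (backup_dir / f"{i:05d}.html").write_text(html, encoding="utf-8")
```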
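To avoid processing duplicate pages when re-crawling a regularly updated blog (Q2), one simple approach — an assumption on my part, not something built into Trafilatura — is to hash the extracted text and skip anything already seen:

```python
# Extract locally from the backed-up HTML and skip exact duplicates by
# hashing the extracted text of each page.
import hashlib
from pathlib import Path

from trafilatura import extract

seen = set()
for html_file in sorted(Path("html").glob("*.html")):
    text = extract(html_file.read_text(encoding="utf-8", errors="ignore"))
    if not text:
        continue                   # nothing extractable on this page
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:             # identical content already processed
        continue
    seen.add(digest)
    html_file.with_suffix(".txt").write_text(text, encoding="utf-8")
```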