Releases: adbar/trafilatura

trafilatura-1.6.3

29 Nov 13:42
e7b5723

Extraction:

Metadata:

  • more precise date extraction (see htmldate)
  • new htmldate extensive search parameter in config (#434)
  • changes in URLs: normalization, trackers removed (see courlan)
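
To give a rough idea of what "normalization, trackers removed" can mean for a URL, here is a minimal standard-library sketch. It is not courlan's implementation: the function name strip_trackers and the prefix list are hypothetical, chosen only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes (illustration only,
# not the list used by courlan).
TRACKER_PREFIXES = ("utm_", "fbclid", "gclid")


def strip_trackers(url: str) -> str:
    """Lowercase scheme and host, drop tracking parameters and the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKER_PREFIXES)
    ]
    # The fragment is deliberately discarded (last tuple element is empty).
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


print(strip_trackers("HTTPS://Example.org/page?id=3&utm_source=feed#top"))
```

In practice the courlan package should be used directly, since it covers many more cases than this sketch.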

Navigation:

  • reviewed code for feeds (#443)
  • new config option: external URLs for feeds/sitemaps (#441)

Documentation:

trafilatura-1.6.2

06 Sep 15:45
5ce31d9

Extraction:

  • more lenient HTML parsing (#370)
  • improved code block support with @idoshamun (#372, #401)
  • conversion of relative links to absolute by @feltcat (#377)
  • remove use of signal from core functions (#384)
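
The relative-to-absolute link conversion listed above can be pictured with the standard library. This is only a sketch of the general technique, assuming the page URL is known. It does not reproduce the code from #377, and make_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin


def make_absolute(page_url: str, link: str) -> str:
    """Resolve a possibly relative link against the URL of the page it came from."""
    return urljoin(page_url, link)


# A relative link is resolved against the page URL...
print(make_absolute("https://example.org/blog/post.html", "../img/a.png"))
# ...while an already absolute link is returned unchanged.
print(make_absolute("https://example.org/blog/post.html", "https://other.org/x"))
```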

Metadata:

Command-line interface:

  • more robust batch processing (#381)
  • added --probe option to CLI to check for extractable content (#378, #392)

Maintenance:

  • simplified code (#408)
  • support for Python 3.12
  • pinned LXML version for MacOS (#393)
  • updated dependencies and parameters (notably htmldate and courlan)
  • code cleaning by @marksmayo (#406)

trafilatura-1.6.1

15 Jun 12:59
d85d584

Extraction:

  • minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:

Navigation:

  • reviewed link processing in feeds and sitemaps (#340, #350)
  • more robust spider (#359)
  • updated underlying courlan package (#360)

Full Changelog: v1.6.0...v1.6.1

trafilatura-1.6.0

11 May 11:00
0bce218

Extraction:

  • new content hashes and default file names (#314)
  • fix deprecation warning with @sdondley in #321
  • fix for metadata image by @andremacola in #328
  • fix potential unicode issue in third-party extraction with @Korben00 in #331
  • review logging levels (#347)

Command-line interface:

  • more efficient sitemap processing (#326)
  • more efficient downloads (#338)
  • fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:

  • additional safety check on domain similarity for feeds and sitemaps
  • new function is_live test() using HTTP HEAD request (#327)
  • code parts supported by new courlan version
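
A liveness test based on an HTTP HEAD request, as mentioned in the list above, could look roughly like the following. This is a standard-library sketch under stated assumptions, not trafilatura's own function: the name responds_to_head, the timeout default and the accepted status range are all hypothetical.

```python
import urllib.error
import urllib.request


def responds_to_head(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers a HEAD request with a 2xx or 3xx status."""
    try:
        # HEAD asks for headers only, so no response body is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, malformed URLs, refused connections and timeouts.
        return False


# A string without a URL scheme is rejected without any network access.
print(responds_to_head("not-a-url"))
```

Some servers do not handle HEAD correctly, so a negative result from a check like this is a hint, not proof that the page is unavailable.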

Maintenance:

  • allow urllib3 version 2.0+
  • minor code simplification and fixes

Full Changelog: v1.5.0...v1.6.0

trafilatura-1.5.0

30 Mar 16:11
2639b24

Extraction:

Navigation:

  • transfer URL management to courlan.UrlStore (#232, #312)
  • fixes for spider module

Maintenance:

  • simplify code and extend tests
  • updated underlying packages htmldate and courlan, setup and docs

Full Changelog: v1.4.1...v1.5.0

v1.4.1

19 Jan 17:02

Extraction:

  • extraction bugs fixed (#263, #266), more robust HTML doctype parsing
  • XML output improvements by @knit-bee (#273, #274)
  • adjust thresholds for link density in paragraphs

Metadata:

  • improved title and sitename detection (#284)
  • faster author, categories, domain name, and tags extraction
  • fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

  • review argument consistency and add deprecation warnings (#261)

Setup:

  • make download timeout configurable (#263)
  • updated dependencies, use of faust-cchardet for Python 3.11

Full Changelog: v1.4.0...v1.4.1

trafilatura-1.4.0

18 Oct 13:59

Impact on extraction and output format:

Smaller changes in convenience functions:

  • add function to clear caches (#219)
  • CLI: change exit code if download fails (#223)
  • settings: use "\n" for multiple user agents by @k-sareen (#241)
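
The newline-separated user agents setting can be illustrated with a short configparser sketch. The section and option names ([DEFAULT], USER_AGENTS) and the agent strings are assumptions made for this example, and read_user_agents is a hypothetical helper, not part of trafilatura.

```python
import configparser

# Assumed layout of a settings file: one user agent per indented line.
SAMPLE_SETTINGS = """
[DEFAULT]
USER_AGENTS =
    Agent-One/1.0
    Agent-Two/2.0
"""


def read_user_agents(text: str) -> list:
    """Parse the settings text and split the multi-line value on newlines."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("DEFAULT", "USER_AGENTS", fallback="")
    return [line.strip() for line in raw.split("\n") if line.strip()]


print(read_user_agents(SAMPLE_SETTINGS))
```

Splitting on "\n" avoids ambiguity with commas, which can legitimately occur inside a user agent string.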

Updates:

Full Changelog: v1.3.0...v1.4.0

trafilatura-1.3.0

29 Jul 14:42
  • fast and robust html2txt() function added (#221)
  • more robust parsing (#228)
  • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
  • extraction about 10-20% faster, slightly better recall
  • partial fixes for memory leaks (#216)
  • docs extended and updated (#217, #225)
  • prepared deprecation of old process_record() function
  • more stable processing with updated dependencies

Full Changelog: v1.2.2...v1.3.0

trafilatura-1.2.2

18 May 15:55
  • more efficient rules for extraction
  • metadata: further attributes used (with @felipehertzer)
  • better baseline extraction
  • issues fixed: #202, #204, #205
  • evaluation updated

Full Changelog: v1.2.1...v1.2.2

trafilatura-1.2.1

02 May 10:24

What's Changed

Full Changelog: v1.2.0...v1.2.1