Releases: adbar/trafilatura
trafilatura-1.6.3
Extraction:
- preserve space in certain elements with @idoshamun (#429)
- optional list of XPaths to prune by @HeLehm (#414)
Metadata:
- more precise date extraction (see htmldate)
- new htmldate extensive search parameter in config (#434)
- changes in URLs: normalization, trackers removed (see courlan)
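As a rough illustration of what URL normalization with tracker removal can look like, here is a minimal standard-library sketch. It is not courlan's implementation: the function name `normalize_url` and the small tracker list are assumptions made for this example only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracker list for illustration; real-world rules are broader.
TRACKER_PREFIXES = ("utm_",)
TRACKER_NAMES = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment, strip tracker parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKER_NAMES and not key.startswith(TRACKER_PREFIXES)
    ]
    # The empty last element discards the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


print(normalize_url("HTTPS://Example.org/post?id=7&utm_source=feed&fbclid=abc#top"))
# prints https://example.org/post?id=7
```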
Navigation:
Documentation:
trafilatura-1.6.2
Extraction:
- more lenient HTML parsing (#370)
- improved code block support with @idoshamun (#372, #401)
- conversion of relative links to absolute by @feltcat (#377)
- remove use of signal from core functions (#384)
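The conversion of relative links to absolute ones can be pictured with the standard library alone. This is only a sketch of the general idea, not the code from #377; the helper name `absolutize` and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base_url, link) for link in links]


base = "https://example.org/blog/post.html"
for link in absolutize(base, ["../about", "img/pic.png", "/feed", "https://other.example/x"]):
    print(link)
```

Links that are already absolute pass through unchanged, so the helper can be applied to every link on a page.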
Metadata:
- JSON-LD fix for sitenames by @felipehertzer (#383)
Command-line interface:
- more robust batch processing (#381)
- added --probe option to CLI to check for extractable content (#378, #392)
Maintenance:
- simplified code (#408)
- support for Python 3.12
- pinned LXML version for MacOS (#393)
- updated dependencies and parameters (notably htmldate and courlan)
- code cleaning by @marksmayo (#406)
trafilatura-1.6.1
Extraction:
Metadata:
- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
- authors, JSON and unicode fixes by @felipehertzer in #365
- fix for authors without additionalName by @awwitecki in #363
Navigation:
- reviewed link processing in feeds and sitemaps (#340, #350)
- more robust spider (#359)
- updated underlying courlan package (#360)
Full Changelog: v1.6.0...v1.6.1
trafilatura-1.6.0
Extraction:
- new content hashes and default file names (#314)
- fix deprecation warning with @sdondley in #321
- fix for metadata image by @andremacola in #328
- fix potential unicode issue in third-party extraction with @Korben00 in #331
- review logging levels (#347)
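To show how a content hash can double as a default file name, here is a small sketch. The choice of BLAKE2b with URL-safe Base64 encoding and the name `hash_filename` are assumptions for this example; the notes above do not specify the scheme introduced in #314.

```python
import hashlib
from base64 import urlsafe_b64encode


def hash_filename(content: str, length: int = 12) -> str:
    """Derive a short, filesystem-safe name from the text content."""
    digest = hashlib.blake2b(content.encode("utf-8"), digest_size=length).digest()
    # URL-safe Base64 avoids "/" and "+", which are awkward in file names.
    return urlsafe_b64encode(digest).decode("ascii")


print(hash_filename("Some extracted text"))
```

Identical content always yields the same name, which makes duplicate outputs easy to spot.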
Command-line interface:
- more efficient sitemap processing (#326)
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)
Navigation:
- additional safety check on domain similarity for feeds and sitemaps
- new function is_live test() using HTTP HEAD request (#327)
- code parts supported by new courlan version
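A liveness check based on an HTTP HEAD request might look roughly like the following. This is a standard-library sketch under stated assumptions, not the function added in #327: the name `is_reachable`, the timeout, and the status range are all illustrative.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds, without downloading the body."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, ValueError):
        # HTTP errors (4xx/5xx), network failures, and malformed URLs all count as not live.
        return False


print(is_reachable("not a url"))
```

Some servers reject or mishandle HEAD requests, so a negative result is a hint and not proof that a page is gone.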
Maintenance:
- allow urllib3 version 2.0+
- minor code simplification and fixes
Full Changelog: v1.5.0...v1.6.0
trafilatura-1.5.0
Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
- add as_dict method to Document class with @edkrueger in #306
- XML output fix with @knit-bee in #315
- various smaller fixes: lists (#309), XPaths, metadata hardening
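The idea behind an as_dict method on a metadata container can be sketched as follows. The class below is a hypothetical stand-in with illustrative field names, not the library's actual Document class.

```python
class Document:
    """Hypothetical stand-in for a metadata container; field names are illustrative."""

    __slots__ = ("title", "author", "url", "pagetype", "image")

    def __init__(self, **fields):
        # Fields that are not supplied default to None.
        for name in self.__slots__:
            setattr(self, name, fields.get(name))

    def as_dict(self):
        """Expose all fields as a plain dictionary, e.g. for JSON serialization."""
        return {name: getattr(self, name) for name in self.__slots__}


doc = Document(title="A title", url="https://example.org/post")
print(doc.as_dict())
```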
Navigation:
Maintenance:
- simplify code and extend tests
- update underlying packages htmldate and courlan, setup, and docs
Full Changelog: v1.4.1...v1.5.0
v1.4.1
Extraction:
- extraction bugs fixed (#263, #266), more robust HTML doctype parsing
- XML output improvements by @knit-bee (#273, #274)
- adjust thresholds for link density in paragraphs
Metadata:
- improved title and sitename detection (#284)
- faster author, categories, domain name, and tags extraction
- fixes to author emoji regexes by @felipehertzer (#269)
Command-line interface:
- review argument consistency and add deprecation warnings (#261)
Setup:
- make download timeout configurable (#263)
- updated dependencies, use of faust-cchardet for Python 3.11
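A configurable download timeout is typically read from a settings file. The sketch below uses the standard configparser module. The option name DOWNLOAD_TIMEOUT, the [DEFAULT] section, and the fallback value are assumptions for this example, so check the package's own settings file for the actual names.

```python
from configparser import ConfigParser

# Assumed settings layout, for illustration only.
SETTINGS = """
[DEFAULT]
DOWNLOAD_TIMEOUT = 15
"""


def read_timeout(text: str, fallback: int = 30) -> int:
    """Return the timeout in seconds, or the fallback if the option is missing."""
    config = ConfigParser()
    config.read_string(text)
    return config.getint("DEFAULT", "DOWNLOAD_TIMEOUT", fallback=fallback)


print(read_timeout(SETTINGS))
# prints 15
```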
Full Changelog: v1.4.0...v1.4.1
trafilatura-1.4.0
Impact on extraction and output format:
- better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
- XML: preserve list type as attribute (#229)
- XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
- faster text cleaning and shorter code (#237 with @deedy5, #245)
- metadata: add language when detector is activated (#224)
- metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
- TXT: change markdown formatting of headers by @LaundroMat (#257)
Smaller changes in convenience functions:
- add function to clear caches (#219)
- CLI: change exit code if download fails (#223)
- settings: use "\n" for multiple user agents by @k-sareen (#241)
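Newline-separated user agents fit naturally into a multi-line configuration value. Here is a minimal sketch with configparser, assuming an option named USER_AGENTS in the [DEFAULT] section. The agent strings and the helper name `read_user_agents` are invented for this example.

```python
from configparser import ConfigParser

# Assumed settings layout; continuation lines must be indented.
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    ExampleBot/1.0
    ExampleBot/2.0
"""


def read_user_agents(text: str) -> list:
    """Split a multi-line option into one user agent per line, skipping blanks."""
    config = ConfigParser()
    config.read_string(text)
    raw = config.get("DEFAULT", "USER_AGENTS", fallback="")
    return [line.strip() for line in raw.split("\n") if line.strip()]


print(read_user_agents(SETTINGS))
# prints ['ExampleBot/1.0', 'ExampleBot/2.0']
```

Splitting on newlines instead of commas keeps user agent strings that themselves contain commas intact.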
Updates:
- docs updated (and #244 by @dsgibbons)
- package dependencies updated
Full Changelog: v1.3.0...v1.4.0
trafilatura-1.3.0
- fast and robust html2txt() function added (#221)
- more robust parsing (#228)
- fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
- extraction about 10-20% faster, slightly better recall
- partial fixes for memory leaks (#216)
- docs extended and updated (#217, #225)
- prepared deprecation of old process_record() function
- more stable processing with updated dependencies
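For readers unfamiliar with baseline HTML-to-text conversion, the general idea can be sketched with the standard library. This is not the html2txt() implementation from #221: the class and function names below are hypothetical, and real-world HTML needs more care.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes while ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def plain_text(html: str) -> str:
    """Return all visible text with whitespace collapsed to single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with spaces keeps block elements apart, at the cost of
    # occasionally splitting a word that spans inline tags.
    return " ".join(" ".join(collector.chunks).split())


print(plain_text("<body><h1>Title</h1><p>Some <b>bold</b> text.</p><script>var x = 1;</script></body>"))
# prints Title Some bold text.
```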
Full Changelog: v1.2.2...v1.3.0
trafilatura-1.2.2
- more efficient rules for extraction
- metadata: further attributes used (with @felipehertzer)
- better baseline extraction
- issues fixed: #202, #204, #205
- evaluation updated
Full Changelog: v1.2.1...v1.2.2
trafilatura-1.2.1
What's Changed
- --precision and --recall arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
Full Changelog: v1.2.0...v1.2.1