Releases · adbar/trafilatura

29 Nov 13:42

adbar

v1.6.3

e7b5723

trafilatura-1.6.3

Extraction:

preserve space in certain elements with @idoshamun (#429)
optional list of XPaths to prune by @HeLehm (#414)

Metadata:

more precise date extraction (see htmldate)
new htmldate extensive search parameter in config (#434)
changes in URLs: normalization, trackers removed (see courlan)
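
To illustrate what URL normalization and tracker removal mean in practice, here is a minimal standard-library sketch. It is not courlan's implementation: the function name `strip_trackers` and the small blocklist are hypothetical and chosen only for illustration, and courlan applies its own, more extensive rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blocklist for illustration only; courlan maintains its own rules.
TRACKER_PREFIXES = ("utm_",)
TRACKER_NAMES = {"fbclid", "gclid"}


def strip_trackers(url: str) -> str:
    """Return the URL without known tracking parameters and without fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKER_PREFIXES)
        and key.lower() not in TRACKER_NAMES
    ]
    # Lowercase scheme and network location, keep the path, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


cleaned = strip_trackers(
    "HTTPS://Example.org/post?id=7&utm_source=feed&fbclid=abc#top"
)
print(cleaned)
```

Two URLs that differ only in tracking parameters or letter case of the host then map to the same string, which is what makes deduplication of crawled links feasible.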

Navigation:

reviewed code for feeds (#443)
new config option: external URLs for feeds/sitemaps (#441)

Documentation:

update, add page on text embeddings with @tonyyanga (#428, #435, #447)
fix quickstart by @sashkab (#419)

Contributors

sashkab, idoshamun, and 2 other contributors

06 Sep 15:45

adbar

v1.6.2

5ce31d9

trafilatura-1.6.2

Extraction:

more lenient HTML parsing (#370)
improved code block support with @idoshamun (#372, #401)
conversion of relative links to absolute by @feltcat (#377)
remove use of signal from core functions (#384)
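
The conversion of relative links to absolute ones can be pictured with the standard library alone. The sketch below is not the code from #377; the helper name `absolutize` and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    "https://example.org/blog/post.html",
    ["../about", "img/a.png", "/feed.xml", "https://other.example/x"],
)
for link in links:
    print(link)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `href` found in a document.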

Metadata:

JSON-LD fix for sitenames by @felipehertzer (#383)

Command-line interface:

more robust batch processing (#381)
added --probe option to CLI to check for extractable content (#378, #392)

Maintenance:

simplified code (#408)
support for Python 3.12
pinned LXML version for macOS (#393)
updated dependencies and parameters (notably htmldate and courlan)
code cleaning by @marksmayo (#406)

Contributors

idoshamun, felipehertzer, and 2 other contributors

15 Jun 12:59

adbar

v1.6.1

d85d584

trafilatura-1.6.1

Extraction:

minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:

simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
authors, JSON and unicode fixes by @felipehertzer in #365
fix for authors without additionalName by @awwitecki in #363

Navigation:

reviewed link processing in feeds and sitemaps (#340, #350)
more robust spider (#359)
updated underlying courlan package (#360)

Full Changelog: v1.6.0...v1.6.1

Contributors

awwitecki and felipehertzer

11 May 11:00

adbar

v1.6.0

0bce218

trafilatura-1.6.0

Extraction:

new content hashes and default file names (#314)
fix deprecation warning with @sdondley in #321
fix for metadata image by @andremacola in #328
fix potential unicode issue in third-party extraction with @Korben00 in #331
review logging levels (#347)

Command-line interface:

more efficient sitemap processing (#326)
more efficient downloads (#338)
fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:

additional safety check on domain similarity for feeds and sitemaps
new function is_live test() using HTTP HEAD request (#327)
code parts supported by new courlan version

Maintenance:

allow urllib3 version 2.0+
minor code simplification and fixes

Full Changelog: v1.5.0...v1.6.0

Contributors

Korben00, sdondley, and andremacola

30 Mar 16:11

adbar

v1.5.0

2639b24

trafilatura-1.5.0

Extraction:

fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
pagetype and image urls added to metadata by @andremacola (#282, #310)
add as_dict method to Document class with @edkrueger in #306
XML output fix with @knit-bee in #315
various smaller fixes: lists (#309), XPaths, metadata hardening

Navigation:

transfer URL management to courlan.UrlStore (#232, #312)
fixes for spider module

Maintenance:

simplify code and extend tests
update underlying packages htmldate and courlan, setup and docs

Full Changelog: v1.4.1...v1.5.0

Contributors

andremacola, felipehertzer, and 2 other contributors

19 Jan 17:02

adbar

v1.4.1

14d9782

v1.4.1

Extraction:

extraction bugs fixed (#263, #266), more robust HTML doctype parsing
XML output improvements by @knit-bee (#273, #274)
adjust thresholds for link density in paragraphs

Metadata:

improved title and sitename detection (#284)
faster author, categories, domain name, and tags extraction
fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

review argument consistency and add deprecation warnings (#261)

Setup:

make download timeout configurable (#263)
updated dependencies, use of faust-cchardet for Python 3.11

Full Changelog: v1.4.0...v1.4.1

Contributors

felipehertzer and knit-bee

18 Oct 13:59

adbar

v1.4.0

f9e35aa

trafilatura-1.4.0

Impact on extraction and output format:

better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
XML: preserve list type as attribute (#229)
XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
faster text cleaning and shorter code (#237 with @deedy5, #245)
metadata: add language when detector is activated (#224)
metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
TXT: change markdown formatting of headers by @LaundroMat (#257)

Smaller changes in convenience functions:

add function to clear caches (#219)
CLI: change exit code if download fails (#223)
settings: use "\n" for multiple user agents by @k-sareen (#241)
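
As a rough illustration of newline-separated user agents, the following sketch reads a multi-line value with the standard `configparser` module. The `USER_AGENTS` key in a `[DEFAULT]` section and the agent strings are assumptions for the example; check the settings file shipped with your version for the actual layout.

```python
from configparser import ConfigParser

# Assumed layout: one user agent per indented continuation line.
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0
    Mozilla/5.0 (Macintosh) ExampleBot/1.0
"""

config = ConfigParser()
config.read_string(SETTINGS)

# The raw value contains "\n" between entries; split and drop empty lines.
raw_value = config.get("DEFAULT", "USER_AGENTS")
user_agents = [line.strip() for line in raw_value.splitlines() if line.strip()]
print(user_agents)
```

Splitting on newlines rather than commas keeps user-agent strings intact even when they contain commas or semicolons.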

Updates:

docs updated (and #244 by @dsgibbons)
package dependencies updated

Full Changelog: v1.3.0...v1.4.0

Contributors

LaundroMat, mrienstra, and 5 other contributors

29 Jul 14:42

adbar

v1.3.0

c3f9a9f

trafilatura-1.3.0

fast and robust html2txt() function added (#221)
more robust parsing (#228)
fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
extraction about 10-20% faster, slightly better recall
partial fixes for memory leaks (#216)
docs extended and updated (#217, #225)
prepared deprecation of old process_record() function
more stable processing with updated dependencies

Full Changelog: v1.2.2...v1.3.0

Contributors

felipehertzer

18 May 15:55

adbar

v1.2.2

168e660

trafilatura-1.2.2

more efficient rules for extraction
metadata: further attributes used (with @felipehertzer)
better baseline extraction
issues fixed: #202, #204, #205
evaluation updated

Full Changelog: v1.2.1...v1.2.2

Contributors

felipehertzer

02 May 10:24

adbar

v1.2.1

1bb5fee

trafilatura-1.2.1

What's Changed

--precision and --recall arguments added to the CLI
better text cleaning: paywalls and comments
improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
further bugs fixed: #189, #192 (with @felipehertzer), #200
efficiency: faster module loading and improved RAM footprint

Full Changelog: v1.2.0...v1.2.1

Contributors

felipehertzer, glacierck, and immortal-autumn
