WARC file support? #27

anarcat · 2022-08-23T00:34:55Z

Hi!

When I found out about this project, its name made me think it was a tool to read WARC files, which stands for... Web ARChives!

Is WARC support planned? It would be pretty interesting because it would let the reader use archive.org extracts, which are typically in the WARC file format. Other web crawlers (e.g. wget, but also web browsers) can also output WARC files...

birros · 2022-08-23T11:55:57Z

Hi @anarcat, thanks for your interest.

Unfortunately, as mentioned in #12 (comment), I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve personal data management for this application (and others).

But your request is relevant. I discovered WARC after the ZIM format, and both are relevant in different cases. I plan to implement WARC / WACZ support when I restart this project, but don't expect it in the next few months; it won't happen for several years (my private project is really complex).

In the meantime, you can try to convert your WARC file to a ZIM file using this tool from the openzim team (I haven't tested it): openzim/warc2zim. Also check out kiwix/kiwix-desktop, from the same team, as an actively maintained ZIM reader.

I will keep this issue open for the future.

You can also open WARC files with webrecorder/replayweb.page, which is free, self-hostable software that works offline.

I recommend converting the WARC file to WACZ for use with replayweb.page, since the WACZ format adds page indexing. Use this tool for that: webrecorder/py-wacz

Example:

$ wget "https://en.wikipedia.org/wiki/Linux" \
    --page-requisites \
    --execute robots=off \
    --no-warc-keep-log \
    --span-hosts \
    --no-warc-compression \
    --delete-after \
    --domains en.wikipedia.org,upload.wikimedia.org \
    --warc-file="wikipedia-linux"
$ wacz create wikipedia-linux.warc \
    --detect-pages \
    --output wikipedia-linux.wacz
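
Before converting, it can help to check which URLs wget actually captured. The following is only a rough sketch, assuming the WARC is uncompressed (as with --no-warc-compression above) and that each record stores its URL in a WARC-Target-URI header line. The helper name list_warc_urls is made up for illustration and is not part of wget or py-wacz.

```shell
# Hypothetical helper: list the distinct URLs recorded in an
# uncompressed WARC file (path given as the first argument).
list_warc_urls() {
    # -a makes grep treat the partly binary WARC as text.
    # WARC header lines end in CRLF, so drop the carriage returns,
    # then strip the header name and any surrounding <...> brackets.
    # This is an approximation: it matches header-looking lines
    # anywhere in the file, not only inside record headers.
    grep -a '^WARC-Target-URI:' "$1" \
        | tr -d '\r' \
        | sed -e 's/^WARC-Target-URI:[[:space:]]*//' -e 's/^<//' -e 's/>$//' \
        | sort -u
}
```

Running list_warc_urls wikipedia-linux.warc after the wget step should print one URL per line. Entries that are not http(s) URLs may also show up, since wget can add its own metadata records to the file.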

anarcat · 2022-08-23T13:40:31Z

On 2022-08-23 04:56:07, Julien Muret wrote:

> Unfortunately, as mentioned in this #12 (comment) I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve the personal data management for this application (and others).

That, of course, makes perfect sense. :) Take all the time you need!

> But your request is relevant, I discovered warc after zim format, both are relevant in different cases. I plan to implement warc / wacz support when I restart this project, but don't expect it to happen in the next few months, it won't happen for several years (my private project is really complex).
>
> In the meantime, you can try to convert your warc file to a zim file using this tool from the openzim team (I haven't tested it): [openzim/warc2zim](https://github.com/openzim/warc2zim). Also check [kiwix/kiwix-desktop](https://github.com/kiwix/kiwix-desktop) from the same team as an actively maintained zim reader.

Oh that's really neat, thanks! I was aware of kiwix, but not warc2zim, that makes a lot of sense...

> I will keep this issue opened for the future.

Thanks!

…

-- People in glass houses shouldn't throw stones. People in glass cities shouldn't fire missiles. - Banksy

birros pinned this issue Aug 23, 2022
