WARC file support? #27

Open
anarcat opened this issue Aug 23, 2022 · 2 comments


anarcat commented Aug 23, 2022

Hi!

When I found out about this project, its name made me think it was a tool to read WARC files, which stands for... Web ARChives!

Is WARC support planned? It would be pretty interesting, because it would allow the reader to open archive.org extracts, which are typically in the WARC file format. Other web crawlers (e.g. wget, but also web browsers) can also output WARC files.

Owner

birros commented Aug 23, 2022

Hi @anarcat, thanks for your interest.

Unfortunately, as mentioned in #12 (comment), I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve personal data management for this application (and others).

But your request is relevant. I discovered WARC after the ZIM format, and both are relevant in different cases. I plan to implement WARC / WACZ support when I restart this project, but don't expect it in the next few months; it won't happen for several years (my private project is really complex).

In the meantime, you can try converting your WARC file to a ZIM file using this tool from the openzim team (I haven't tested it): openzim/warc2zim. Also check out kiwix/kiwix-desktop, from the same team, as an actively maintained ZIM reader.

I will keep this issue open for the future.


You can also open WARC files with webrecorder/replayweb.page, which is free, self-hostable software that works offline.

I recommend converting the WARC file to WACZ before using it with replayweb.page, since the conversion adds page indexing. Use this tool for that: webrecorder/py-wacz

Example:

$ wget "https://en.wikipedia.org/wiki/Linux" \
    --page-requisites \
    --execute robots=off \
    --no-warc-keep-log \
    --span-hosts \
    --no-warc-compression \
    --delete-after \
    --domains en.wikipedia.org,upload.wikimedia.org \
    --warc-file="wikipedia-linux"
$ wacz create wikipedia-linux.warc \
    --detect-pages \
    --output wikipedia-linux.wacz
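
Before converting, it can help to check what wget actually captured. The sketch below is not part of either tool: it assumes the WARC was written with --no-warc-compression (as in the example above), so the record headers are plain text, and `warc_record_types` is a hypothetical helper name. It is a rough check only, since a payload line that happens to start with "WARC-Type:" would also be counted.

```shell
# Count WARC records by type in an UNCOMPRESSED WARC file.
# Assumption: record headers are plain text (wget --no-warc-compression).
# "warc_record_types" is a hypothetical helper, not part of wget or py-wacz.
warc_record_types() {
    # -a makes grep treat the file as text even if payloads are binary.
    # WARC header lines end in CRLF, so strip the carriage returns.
    grep -a '^WARC-Type:' "$1" \
        | tr -d '\r' \
        | awk '{ count[$2]++ } END { for (t in count) print t, count[t] }' \
        | sort
}
```

Running `warc_record_types wikipedia-linux.warc` should print one line per record type (for example request, response, warcinfo) with its count. A file with no response records is probably not worth converting.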

birros pinned this issue Aug 23, 2022
Author

anarcat commented Aug 23, 2022 via email
