WARC file support? #27
Comments
Hi @anarcat, thanks for your interest.
Unfortunately, as mentioned in this #12 (comment), I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve the personal data management for this application (and others).
But your request is relevant: I discovered warc after the zim format, and both are relevant in different cases. I plan to implement warc / wacz support when I restart this project, but don't expect it to happen in the next few months; it won't happen for several years (my private project is really complex).
In the meantime, you can try to convert your warc file to a zim file using this tool from the openzim team (I haven't tested it): openzim/warc2zim. Also check kiwix/kiwix-desktop from the same team as an actively maintained zim reader.
I will keep this issue open for the future.
You can also open a warc file with webrecorder/replayweb.page, which is free, self-hosted software that works offline. I recommend converting the warc file to wacz for use with replayweb, which adds page indexing. Use this tool for that: webrecorder/py-wacz
Example:
$ wget "https://en.wikipedia.org/wiki/Linux" \
--page-requisites \
--execute robots=off \
--no-warc-keep-log \
--span-hosts \
--no-warc-compression \
--delete-after \
--domains en.wikipedia.org,upload.wikimedia.org \
--warc-file="wikipedia-linux"
$ wacz create wikipedia-linux.warc \
--detect-pages \
--output wikipedia-linux.wacz
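Before converting, it can be useful to sanity-check the uncompressed WARC that wget wrote. The following is only a rough sketch: the helper names `warc_record_types` and `warc_target_uris` are hypothetical (they are not part of wget or py-wacz), and the approach assumes the file is uncompressed (as with `--no-warc-compression` above) and that your grep supports `-a`, as GNU and BSD grep do. It simply scans for the `WARC-Type` and `WARC-Target-URI` header lines, so a captured page that happens to contain the same text at the start of a line would be counted too.

```shell
#!/bin/sh
# Hypothetical helpers for a quick look at an UNCOMPRESSED WARC file.
# These are a sketch, not a real WARC parser: they only grep header lines.

# Print a count of records per WARC-Type (warcinfo, request, response, ...).
warc_record_types() {
    # -a: treat the file as text even if payloads contain binary data.
    # WARC headers end in CRLF, so strip the carriage returns.
    grep -a '^WARC-Type: ' "$1" \
        | tr -d '\r' \
        | sort \
        | uniq -c
}

# Print the distinct target URIs recorded in the file.
warc_target_uris() {
    grep -a '^WARC-Target-URI: ' "$1" \
        | tr -d '\r' \
        | cut -d' ' -f2- \
        | sort -u
}

# Example usage (assuming the file from the wget command above exists):
#   warc_record_types wikipedia-linux.warc
#   warc_target_uris wikipedia-linux.warc
```

If the counts show no `response` records, the capture probably failed and there is little point in running `wacz create` on it.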
On 2022-08-23 04:56:07, Julien Muret wrote:
> Unfortunately, as mentioned in this #12 (comment) I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve the personal data management for this application (and others).

That, of course, makes perfect sense. :) Take all the time you need!

> But your request is relevant, I discovered warc after zim format, both are relevant in different cases. I plan to implement warc / wacz support when I restart this project, but don't expect it to happen in the next few months, it won't happen for several years (my private project is really complex).
> In the meantime, you can try to convert your warc file to a zim file using this tool from the openzim team (I haven't tested it): [openzim/warc2zim](https://github.com/openzim/warc2zim). Also check [kiwix/kiwix-desktop](https://github.com/kiwix/kiwix-desktop) from the same team as an actively maintained zim reader.

Oh that's really neat, thanks! I was aware of kiwix, but not warc2zim, that makes a lot of sense...

> I will keep this issue opened for the future.

Thanks!
--
People in glass houses shouldn't throw stones.
People in glass cities shouldn't fire missiles.
- Banksy
Hi!
When I found out about this project, its name made me think it was a tool to read WARC files, which stands for... Web ARChives!
Is WARC support planned? It would be pretty interesting because it would allow the reader to use archive.org extracts, which are typically in the WARC file format. Other web crawlers (e.g. wget, but also web browsers) can also output WARC files...