The more ubiquitous format in modern times is the Web ARChive (WARC[0]) format, which is supported by tools like wget and Apache Nutch, and by organizations like the Internet Archive and most national libraries[1].
WARC is also standardized by ISO and has a nice spec that's pretty easy to understand[2].
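The spec really is simple: a WARC file is just a sequence of records, each made of a version line, named headers, a blank line, the record block, and two trailing CRLFs. A minimal hand-rolled sketch (real tools like wget fill in more headers; the URI and date here are placeholders):

```python
import uuid

# Build one minimal "resource" record by hand, following the WARC
# record layout: version line, headers, blank line, block, two CRLFs.
body = b"hello, archive"
headers = [
    b"WARC/1.0",
    b"WARC-Type: resource",
    b"WARC-Target-URI: http://example.com/",
    b"WARC-Date: 2020-01-01T00:00:00Z",
    b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
    b"Content-Length: " + str(len(body)).encode(),
]
record = b"\r\n".join(headers) + b"\r\n\r\n" + body + b"\r\n\r\n"
print(record.decode())
```

Concatenating such records (optionally gzipping each one) yields a valid .warc or .warc.gz file.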
@gildas I saw there was a comparison table[3] but it seems to be missing WARC. Could you shed some light on why?
I included in the table only formats that are at least supported natively in one modern browser. That's why the WARC format is missing. I'm not against adding it in the table though.
I see. Thanks for your reply! Yes, to open a WARC file you would need something like https://github.com/webrecorder/webrecorder-player or another viewer[0], but the benefit is that you can now contribute to web archiving efforts and upload the result directly to the Internet Archive!
This is completely unrelated, but I want to ask this question because people that know about this are probably following the thread:
Assuming I download webpages via SSL/TLS, would there be a way to also save their cryptographic signature so that the resulting file, along with the website certificate, could be verifiable, possibly in court?
I've seen a number of situations where a malicious public clerk does not update an official public website with information about upcoming events, and then updates it once it's too late.
I'm wondering whether I could use HTTPS features to bring such actors to court.
This is interesting. I've been using https://github.com/danny0838/webscrapbook since Firefox changed their add-ons. It also saves web pages in a zip file but names them with a .htz extension.
I saved this page with both webscrapbook and singlefilez. Both archives looked the same. Webscrapbook's was 22.4kb while singlefilez was 65.9kb. I unzipped singlefilez and rezipped it with higher compression and got it to 20kb but it wouldn't open in the browser.
While the size doesn't really matter, what I don't like is that singlefilez renamed the images to sequentially numbered files (1.gif, 2.gif, etc.) and the CSS to stylesheet_0.css, while webscrapbook kept the original names of the files. I would much rather it kept the original file names.
The additional 40KB corresponds to the part which self-extracts the zip file (in order to view the page without installing any extension). Note that the original URLs can be found in the comments of each entry in the zip file.
Actually, what is supposed to be interesting and innovative in SingleFileZ is the fact that it produces self-extracting valid zip files in the form of HTML files. Thus, they can be read natively by any modern browser that supports JavaScript.
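The polyglot trick works because zip readers locate the central directory from the end of a file, so arbitrary data can be prepended without breaking the archive. A minimal sketch of the idea (the HTML stub below is a stand-in, not SingleFileZ's actual extraction script):

```python
import io
import zipfile

# Stand-in for the HTML/JavaScript extractor that SingleFileZ prepends.
html_stub = b"<!DOCTYPE html><script>/* extraction script here */</script>\n"

# Build a small zip archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("index.html", "<p>page content</p>")

# Prepend the HTML stub: the result is both an HTML document and,
# because zip readers scan for the central directory from the end,
# still a readable zip archive.
polyglot = html_stub + buf.getvalue()

with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    print(zf.namelist())  # the archive is still intact
```

This is the same reason self-extracting .exe archives unzip with ordinary tools.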
SingleFileZ creates its own paths to reference all the resources of the page in the zip. This prevents any invalid path issues. The URL of the resource is stored as a comment on each entry in the zip though (I'm not sure I'm really answering your question).
If there is a bug in the extraction script, there's a good chance you can still unzip the file.
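Those per-entry comments are readable with any zip library. A sketch with Python's zipfile, using a toy in-memory archive in place of a real saved page (file names and URL below are made up):

```python
import io
import zipfile

# Build a toy archive with a comment on the entry, mimicking how the
# original resource URL is stored per entry in a SingleFileZ page.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/1.gif", b"GIF89a...")
    # Attach the original URL as the entry's comment; comments set on
    # the ZipInfo are written to the central directory on close.
    zf.infolist()[-1].comment = b"https://example.com/logo.gif"

# Reading it back: list each entry with its stored URL.
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, info.comment.decode())
```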
Can somebody explain the benefit of this over just SingleFile? I’ve seen this fork before, but never quite understood how it differs and how it might benefit me more.
The main benefit is the size of the saved page. The file will be smaller because binary resources (e.g. images) are not encoded in base 64 [1]. Moreover, the page and these resources are also compressed. The other benefit is that you can unzip the saved page and edit it more easily than a page saved with SingleFile because the saved page won't contain data URIs [2].
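The base64 overhead is easy to quantify: every 3 bytes of binary data become 4 ASCII characters, roughly 33% larger before any compression. A quick illustration:

```python
import base64

# Sample "binary" data: 3072 bytes covering all byte values.
raw = bytes(range(256)) * 12
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))  # 3072 -> 4096, a 4/3 size ratio
```

Storing the raw bytes as zip entries avoids this inflation entirely, and deflate compression then shrinks them further.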
Mostly because I used to use the older “Scrapbook” add-on (before it stopped working in Firefox 60) and I still have a number of pages saved in that format – ScrapbookQ is, with some effort, compatible with those saved pages.
I'm the author of the extension. Browsers don't offer a way to register a (browser) extension with a given mime-type or filename extension. That's why SingleFileZ uses the HTML format to wrap the zip content.
Yes you can. See this page for example: https://gildas-lormeau.github.io/ (check the source code of the page if you're curious). It was saved with SingleFileZ and is served via an HTTP server. Note that you don't need any extension to view it in modern browsers.
On paper, you could even store an entire website in a SingleFileZ file. It just needs to be implemented...
Thank you. I could not reproduce your issue on the latest versions of Chrome and Firefox. I'll try to do more tests to see what went wrong on your end.
Completely personal opinion here: I love PDF for saving web pages. They preserve the formatting, don't include javascript, and can be viewed across most OSes and browsers.
But PDF does a terrible job of preserving formatting. In most cases it breaks a site up into pages based on paper size without regard to content. Many elements do not render correctly, or the site uses a print media stylesheet and drops its formatting. Finally, like most PDFs, the result is frozen into a print-based size, which makes subsequent viewing clumsy.
It's a strange format, but I think it falls in the "good enough" category. Still, I never see it in the wild.
https://en.m.wikipedia.org/wiki/MHTML