SingleFileZ, a web extension for saving pages as HTML/ZIP hybrid files (github.com/gildas-lormeau)
121 points by ivank on Nov 3, 2019 | hide | past | favorite | 36 comments


MHTML seems to have the same purpose, is supported by all browsers (except Firefox, but including IE5), and has been standardized since 1999.

It's a strange format, but I think it falls in the "good enough" category. Still, I never see it in the wild.

https://en.m.wikipedia.org/wiki/MHTML
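
The resemblance to email is easy to see with Python's stdlib email package; this sketch (content and Content-ID made up) builds the same multipart/related structure MHTML uses:

```python
from email.message import EmailMessage

# MHTML is essentially MIME multipart/related: an HTML root part plus
# embedded resources referenced by Content-ID, just like rich email.
msg = EmailMessage()
msg.set_content("<html><body><img src='cid:img1'></body></html>", subtype="html")
msg.add_related(b"GIF89a...", maintype="image", subtype="gif", cid="<img1>")

print(msg.get_content_type())  # multipart/related
```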


The more ubiquitous format in modern times is the Web ARChive (WARC[0]) format, which is supported by tools like wget and Apache Nutch, and by organizations like the Internet Archive and most national libraries[1].

WARC is also standardized by ISO and has a nice spec that's pretty easy to understand[2].

@gildas I saw there was a comparison table[3] but it seems to be missing WARC. Could you shed some light on why?

- [0] https://en.wikipedia.org/wiki/Web_ARChive

- [1] http://digitalia.sbn.it/article/view/1473

- [2] https://iipc.github.io/warc-specifications/specifications/wa...

- [3] https://github.com/gildas-lormeau/SingleFile#file-format-com...
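
To give a feel for how simple the spec is: a WARC file is just a sequence of records with text headers, which you can assemble by hand (the URI, date, and record ID below are made up):

```python
# A minimal WARC 1.1 "resource" record, assembled by hand as a sketch.
# WARC-Type, WARC-Record-ID, WARC-Date, and Content-Length are mandatory fields.
payload = b"<html><body>hello</body></html>"
record = (
    b"WARC/1.1\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"WARC-Date: 2019-11-03T00:00:00Z\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n"
    + payload
    + b"\r\n\r\n"  # records are separated by two CRLFs
)
print(record.split(b"\r\n")[0])  # b'WARC/1.1'
```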


I included in the table only formats that are at least supported natively in one modern browser. That's why the WARC format is missing. I'm not against adding it in the table though.


I see. Thanks for your reply! Yes, to open a WARC file you would need something like https://github.com/webrecorder/webrecorder-player or another viewer[0], but the benefit is that you can then contribute to web archiving efforts and upload the result directly to the Internet Archive!

With that said, I do understand your motivation.

- [0] https://www.archiveteam.org/index.php?title=The_WARC_Ecosyst...



It's not supported by Safari either. Today, the only modern browsers supporting MHTML are Chromium-based ones.


MHTML is basically equivalent to the MIME format used for email, especially when using rich text or embedding images.

So you've definitely seen it in the wild, perhaps without knowing it: almost every email is one.


I used to save a lot of things for offline reading in mhtml years ago when I was on dial up, I remember this fondly.


This is completely unrelated, but I want to ask this question because people that know about this are probably following the thread:

Assuming I download web pages via SSL/TLS, would there be a way to also save their cryptographic signature so that the resulting file, along with the website's certificate, could be verifiable, possibly in court?

I've seen a number of situations where malicious public clerks don't update an official public website with information about upcoming events, and then update it once it's too late.

I'm wondering whether I could use HTTPS features to bring such actors to court.


https://tlsnotary.org/ might do what you want.


Thanks!


This is interesting. I've been using https://github.com/danny0838/webscrapbook since Firefox changed its add-on system. It also saves web pages in a zip file, but names them with a .htz extension.

I saved this page with both WebScrapBook and SingleFileZ. Both archives looked the same. WebScrapBook's was 22.4kb while SingleFileZ's was 65.9kb. I unzipped the SingleFileZ file, rezipped it with higher compression, and got it to 20kb, but it wouldn't open in the browser.

While the size doesn't really matter, what I don't like is that SingleFileZ renamed the images to sequentially numbered files (1.gif, 2.gif, etc.) and the CSS to stylesheet_0.css, while WebScrapBook kept the original names. I would much rather it kept the original file names.


Thank you for your feedback.

The additional 40KB corresponds to the part which self-extracts the zip file (in order to view the page without installing any extension). Note that the original URLs can be found in the comments of each entry in the zip file.
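
Per-entry comments are part of the zip format itself; this sketch with Python's stdlib zipfile shows the idea (the entry name and URL are made up for illustration):

```python
import io
import zipfile

# Attach the resource's original URL as a per-entry comment,
# then read it back from the archive's central directory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    info = zipfile.ZipInfo("images/1.gif")
    info.comment = b"https://example.com/logo.gif"  # illustrative URL
    z.writestr(info, b"GIF89a...")

with zipfile.ZipFile(buf) as z:
    for info in z.infolist():
        print(info.filename, info.comment.decode())
```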


I wrote a similar feature within Polar:

https://getpolarized.io

It can take full HTML files and internalize the CSS + HTML and compiles them into a .zip file.

The major difference is that it's an Electron app, not a web extension, though we're about 80% done porting it to a web extension.

In retrospect, I would have done this as an EPUB.

Our users have asked for other features, like combining multiple pages into one 'book', and EPUB supports this.

Also, it would mean Polar would support EPUB natively anyway which is another big feature we need as we only support PDFs right now.

A lot of people here mention the MHTML and WebArchive formats.

I think my main criticism of these is that EPUB has more universal 'reader' support and EPUB 3.0 is basically just HTML in an enclosure anyway.


Actually, what is supposed to be interesting and innovative in SingleFileZ is the fact that it produces self-extracting valid zip files in the form of HTML files. Thus, they can be read natively by any modern browser that supports JavaScript.


Ah. Interesting. That's a good innovation. How do you determine the domain/URL? I guess it's just the URL of the source?

What would you do if your HTML extraction script had a bug in the pre-compiled form? I guess you're just stuck?

I guess that's not the end of the world.

The EPUB form in a future version of Polar would at least require an EPUB reader which makes it a bit heavier.


SingleFileZ creates its own paths to reference all the resources of the page in the zip. This prevents any invalid path issues. The URL of each resource is stored as a comment on its entry in the zip, though (I'm not sure I'm really answering your question). If there is a bug in the extraction script, there's a good chance you can still unzip the file.


Having a component that can save a web page in a very compact format is great! It's far better than MHTML.

In addition, it's not the usual naive backup of files but a real backup of the page as interpreted by the browser.

For me it is by far the best tool for saving web pages.


Can somebody explain the benefit of this over just SingleFile? I’ve seen this fork before, but never quite understood how it differs and how it might benefit me more.


The main benefit is the size of the saved page. The file will be smaller because binary resources (e.g. images) are not encoded in base64 [1]. Moreover, the page and its resources are also compressed. The other benefit is that you can unzip the saved page and edit it more easily than a page saved with SingleFile, because it won't contain data URIs [2].

[1] https://en.wikipedia.org/wiki/Base64

[2] https://en.wikipedia.org/wiki/Data_URI_scheme
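
The base64 overhead alone is measurable with the stdlib (the payload size here is made up):

```python
import base64

data = bytes(3000)  # stand-in for a 3 kB binary image
encoded = base64.b64encode(data)
# base64 emits 4 output bytes per 3 input bytes, i.e. ~33% inflation
print(len(encoded) / len(data))
```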


In fact the SingleFileZ format is quite magical: it's SingleFile that saves the content of the page, and the Z stands for zip.

The content saved by SingleFile is stored in zip format inside the resulting HTML page.

The final SingleFileZ HTML page has an HTML header, the zip file in the body, and a bunch of JavaScript to decompress the content when the browser opens it.

It's magic!
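
This trick works because zip readers locate the archive from the end of the file, so arbitrary data (here, HTML) can safely precede it. A minimal sketch with Python's stdlib zipfile (the file name and stub content are made up):

```python
import zipfile

# Write an HTML prefix, then append a zip archive after it.
# Zip readers find the central directory from the end of the file,
# so the HTML prefix doesn't break the archive.
with open("page.html", "wb") as f:
    f.write(b"<!doctype html><html><body>self-extracting stub goes here</body></html>\n")

# Mode "a" on a non-zip file appends a fresh archive after the existing bytes.
with zipfile.ZipFile("page.html", "a") as z:
    z.writestr("index.html", "<p>archived page content</p>")

print(zipfile.is_zipfile("page.html"))  # True
```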


I use the Firefox “ScrapbookQ” addon:

https://addons.mozilla.org/en-US/firefox/addon/scrapbookq/

Mostly because I used to use the older “Scrapbook” add-on (before it stopped working in Firefox 60) and I still have a number of pages saved in that format – ScrapbookQ is, with some effort, compatible with those saved pages.


Why not just use the existing WebArchive format?


I'm the author of the extension. Browsers don't offer a way to register a (browser) extension with a given mime-type or filename extension. That's why SingleFileZ uses the HTML format to wrap the zip content.


Quick question, can I also use this to serve a webpage?

As in, take a website I am developing and have my server serve a SingleFileZ file instead of what it usually would.


Yes you can. See this page for example: https://gildas-lormeau.github.io/ (check the source code of the page if you're curious). It was saved with SingleFileZ and is served via an HTTP server. Note that you don't need any extension to view it in modern browsers.

On paper, you could even store an entire website in a SingleFileZ file. It just needs to be implemented...


That would be great! I know of a few scenarios where HTTrack and SingleFileZ would save a lot of space and effort.


It wasn’t very clear from the page why, but on the SingleFile repo page there’s a comparison table between different formats that may shed some light on this: https://github.com/gildas-lormeau/SingleFile#file-format-com...


Very nice alternative to MHT.

FYI, when I saved the GitHub readme page, the animated GIF's location was blank.

Thanks.


Thank you. I could not reproduce your issue on the latest versions of Chrome and Firefox. I'll try to do more tests to see what went wrong on your end.


What's the benefit over SingleFile, which this is forked from?

I use it regularly, and it creates a single pure HTML file with inlined media. It's easy to read anywhere.

So apart from the compression gain, why the need for SingleFileZ?


Completely personal opinion here: I love PDF for saving web pages. They preserve the formatting, don't include JavaScript, and can be viewed across most OSes and browsers.


But PDF does a terrible job of preserving formatting. In most cases it breaks a site up into pages based on paper size, without regard to content. Many elements don't render correctly, or use a print media query and lose their formatting. Finally, like most PDFs, the result is frozen into a print-based size, which makes subsequent viewing clumsy.


Some tools allow for a full-page PDF (the whole page as one long, continuous PDF).


Same. The Chrome integration should get better at saving to Google Drive... That would be great for me, personally.


I wish EPUB were a viable format for web page archives. Too much overhead for a single page?



