
> SQLite databases have a binary format which is known to change periodically. An attempt to read a SQLite file in 100 years would have to know the exact version of the file, and be able to reach back through history to find that version of SQLite (or a compatible version), and find a way to compile it for whatever CPU architecture exists in 100 years.

I contend this is not as big a problem as you make it sound. First of all, a SQLite file identifies itself: its 100-byte header begins with the magic string "SQLite format 3", and it also records the version number of the SQLite library that last wrote the file, so determining the version is trivial even without any special tools. Second, the SQLite source code is extremely widely distributed. Between Linux distros and the many projects which, for better or worse, embed SQLite in their source trees, there are easily over a million copies of sqlite3.c floating around. The chances of all of these copies being lost, even in a catastrophic event, are negligible. And even if we did somehow lose every copy of sqlite3.c, reverse engineering a binary format is far from impossible; the old binary MS Office file formats, for instance, were reverse engineered by multiple groups.

The chances of us losing the ability to compile C code are even smaller. If human civilization has fallen that far, we will probably be too busy killing each other with sharp sticks over drinkable water to care about reading some hundred-year-old documents.



Not that I disagree that SQLite is fairly ubiquitous today, but remember that not so long ago, dBase II, dbm, and Berkeley DB were fairly ubiquitous too.

I give you "the relevant XKCD":

https://xkcd.com/1909/

Somewhat more practically, a relevant D-Lib Magazine [0] quote: "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

And from the Library of Congress [1], in the context of format preferences for text with structural markup: "XML or SGML using standard or well-known DTD or schema appropriate to a particular textual genre."

[0] http://www.dlib.org/dlib/july16/houghton/07houghton.html

[1] http://www.digitalpreservation.gov/series/challenge/


> "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

Wouldn't this issue apply more to OpenDocument, which is compressed (it's a ZIP archive), than to SQLite, which (at least by default) is not?
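The fragility that quote describes is easy to demonstrate. A sketch using zlib directly rather than an actual ZIP container (the underlying DEFLATE stream, and hence the failure mode, is the same): one flipped bit in plain text damages exactly one character, while one flipped bit in the compressed stream typically corrupts or destroys everything after it:

```python
import zlib

text = b"The quick brown fox jumps over the lazy dog. " * 20
packed = zlib.compress(text)

def flip_bit(data: bytes, index: int) -> bytes:
    """Simulate a single-bit storage error in byte `index`."""
    out = bytearray(data)
    out[index] ^= 0x01
    return bytes(out)

# One flipped bit in the plain text: exactly one character differs.
damaged_plain = flip_bit(text, 100)
mismatches = sum(a != b for a, b in zip(text, damaged_plain))

# One flipped bit in the middle of the compressed stream: decompression
# either fails outright or yields garbage from that point on.
damaged_packed = flip_bit(packed, len(packed) // 2)
try:
    recovered = zlib.decompress(damaged_packed)
except zlib.error:
    recovered = None  # stream undecodable, or checksum mismatch
```

The uncompressed copy loses one character; the compressed copy loses some or all of the document, which is the D-Lib point in miniature.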

But then, I question the appropriateness of the advice. If you're serious about archiving, you should be using error-correcting codes in some form, so that the archived data remains recoverable bit-for-bit even with a large number of bit errors in the underlying medium. To be honest, I'm not that familiar with long-term archiving practices, but a RAID setup with regular scrubbing gives you drive redundancy (for loss of entire drives) and some protection against bit errors. Alternatively, you could use a dedicated ECC tool like par2, which generates Reed-Solomon recovery data alongside your files.
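To make the checksum/ECC distinction concrete: real tools like par2 use Reed-Solomon codes, but a toy repetition code (purely my illustration, not how any archival tool actually works) is the simplest way to show the difference between merely detecting corruption and repairing it:

```python
def encode(data: bytes, copies: int = 3) -> bytes:
    """Toy repetition code: store each byte `copies` times in a row."""
    return bytes(b for b in data for _ in range(copies))

def decode(coded: bytes, copies: int = 3) -> bytes:
    """Recover each byte by a bitwise majority vote across its copies,
    silently repairing any single corrupted copy per group."""
    out = bytearray()
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        byte = 0
        for bit in range(8):
            ones = sum((g >> bit) & 1 for g in group)
            if 2 * ones > copies:
                byte |= 1 << bit
        out.append(byte)
    return bytes(out)

original = b"worth archiving"
stored = bytearray(encode(original))
stored[7] ^= 0xFF  # corrupt one stored byte completely
repaired = decode(bytes(stored))
```

A checksum over `stored` would only tell you it was damaged; the code recovers the original. The catch is cost: repetition triples the storage, whereas Reed-Solomon gets comparable protection for a few percent of overhead, which is why par2 and RAID parity use it instead.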

True, most data that gets preserved will probably be preserved by chance, by people who are not serious about archiving, and may not take sufficient steps to prevent errors. But they're also not going to choose a format for optimal archiving either, so you're kind of stuck with the fact that many modern file formats have built-in compression and/or checksums, and thus don't hold up well when corrupted. We could keep the issue in mind when designing new formats, but is resilience to corruption really worth the additional storage cost of leaving data uncompressed? Or perhaps we could design formats to have built-in ECC instead of just checksums, but that would also waste space...



