
> SQLite databases have a binary format which is known to change periodically. An attempt to read a SQLite file in 100 years would have to know the exact version of the file, and be able to reach back through history to find that version of SQLite (or a compatible version), and find a way to compile it for whatever CPU architecture exists in 100 years.

I contend this is not as big a problem as you make it sound. First of all, a SQLite file identifies itself: its 100-byte header begins with the magic string "SQLite format 3", and it also records the version number of the SQLite library that last wrote the file, so determining the version is trivial even without any special tools. Second, the SQLite source code is extremely widely distributed. Between Linux distros and the many projects which, for better or worse, embed SQLite in their source trees, there are easily over a million copies of sqlite3.c floating around. The chances of all of these copies being lost, even in a catastrophic event, are negligible. And even if we did somehow lose every copy of sqlite3.c, reverse engineering a binary format is far from impossible; the old binary MS Office file formats, for instance, were reverse engineered by multiple groups.

The chances of us losing the ability to compile C code are even smaller. If human civilization has fallen that far, we will probably be too busy killing each other with sharp sticks over drinkable water to care about reading some hundred-year-old documents.



Not that I disagree that SQLite is fairly ubiquitous today, but remember that not so long ago, dBase II, dbm, and Berkeley DB were fairly ubiquitous too.

I give you "the relevant XKCD":

https://xkcd.com/1909/

Somewhat more practically, a relevant D-Lib Magazine [0] quote: "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

And from the Library of Congress [1], in the context of format preferences for text with structural markup: "XML or SGML using standard or well-known DTD or schema appropriate to a particular textual genre."

[0] http://www.dlib.org/dlib/july16/houghton/07houghton.html

[1] http://www.digitalpreservation.gov/series/challenge/


> "not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost."

Wouldn't this issue apply more to OpenDocument, which is compressed (it's a ZIP archive), than to SQLite, which (at least by default) is not?
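The fragility that quote describes is easy to demonstrate. A sketch using zlib directly rather than an actual ZIP container (the underlying DEFLATE stream, and hence the failure mode, is the same): one flipped bit in plain text damages exactly one character, while one flipped bit in the compressed stream typically corrupts or destroys everything after it:

```python
import zlib

text = b"The quick brown fox jumps over the lazy dog. " * 20
packed = zlib.compress(text)

def flip_bit(data: bytes, index: int) -> bytes:
    """Simulate a single-bit storage error in byte `index`."""
    out = bytearray(data)
    out[index] ^= 0x01
    return bytes(out)

# One flipped bit in the plain text: exactly one character differs.
damaged_plain = flip_bit(text, 100)
mismatches = sum(a != b for a, b in zip(text, damaged_plain))

# One flipped bit in the middle of the compressed stream: decompression
# either fails outright or yields garbage from that point on.
damaged_packed = flip_bit(packed, len(packed) // 2)
try:
    recovered = zlib.decompress(damaged_packed)
except zlib.error:
    recovered = None  # stream undecodable, or checksum mismatch
```

The uncompressed copy loses one character; the compressed copy loses some or all of the document, which is the D-Lib point in miniature.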

But then, I question the appropriateness of the advice. If you're serious about archiving, you should be using error-correcting codes in some form, so that the archived data remains recoverable bit-for-bit even with a large number of bit errors in the underlying medium. To be honest, I'm not that familiar with long-term archiving practices, but a RAID setup with regular scrubbing gives you drive redundancy (for loss of entire drives) and some protection against bit errors. Alternatively, you could use a dedicated ECC tool like par2, which generates Reed-Solomon recovery data alongside your files.
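To make the checksum/ECC distinction concrete: real tools like par2 use Reed-Solomon codes, but a toy repetition code (purely my illustration, not how any archival tool actually works) is the simplest way to show the difference between merely detecting corruption and repairing it:

```python
def encode(data: bytes, copies: int = 3) -> bytes:
    """Toy repetition code: store each byte `copies` times in a row."""
    return bytes(b for b in data for _ in range(copies))

def decode(coded: bytes, copies: int = 3) -> bytes:
    """Recover each byte by a bitwise majority vote across its copies,
    silently repairing any single corrupted copy per group."""
    out = bytearray()
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        byte = 0
        for bit in range(8):
            ones = sum((g >> bit) & 1 for g in group)
            if 2 * ones > copies:
                byte |= 1 << bit
        out.append(byte)
    return bytes(out)

original = b"worth archiving"
stored = bytearray(encode(original))
stored[7] ^= 0xFF  # corrupt one stored byte completely
repaired = decode(bytes(stored))
```

A checksum over `stored` would only tell you it was damaged; the code recovers the original. The catch is cost: repetition triples the storage, whereas Reed-Solomon gets comparable protection for a few percent of overhead, which is why par2 and RAID parity use it instead.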

True, most data that gets preserved will probably be preserved by chance, by people who are not serious about archiving, and may not take sufficient steps to prevent errors. But they're also not going to choose a format for optimal archiving either, so you're kind of stuck with the fact that many modern file formats have built-in compression and/or checksums, and thus don't hold up well when corrupted. We could keep the issue in mind when designing new formats, but is resilience to corruption really worth the additional storage cost of leaving data uncompressed? Or perhaps we could design formats to have built-in ECC instead of just checksums, but that would also waste space...



