This sort of practice is not limited to Instagram. Plenty of places do soft deletes when they should be doing hard deletes. Data life-cycles are about the most poorly understood subject in startup land. Ingestion is usually top notch, friction-free and heavily automated. Deletion - assuming it even exists - is semi-automatic or even manual, full of friction and usually incomplete or broken.
You see a similar pattern with respect to signups vs account cancellation.
The weird thing to me is that it is usually the marketing department, and not the legal or compliance department, that has the upper hand in these data retention discussions. Fortunately, thanks to the GDPR, this is now changing and companies are slowly coming around on this.
I think just about everything should be a soft delete, but you need a time limit after which you sweep those. Ideally you would even give the user an option to accelerate that (as much as technically possible) if they really want something gone.
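Roughly what I mean, as a minimal sketch in Python (the users table, the deleted_at column and the 30-day window are just illustrative placeholders):

    # Soft delete now, hard delete once the retention window has passed.
    import sqlite3
    from datetime import datetime, timedelta, timezone

    RETENTION = timedelta(days=30)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT)")

    def soft_delete(user_id, purge_now=False):
        # Mark the row; backdating it is the 'accelerate on request' case,
        # so the very next sweep removes it.
        when = datetime.now(timezone.utc) - (RETENTION if purge_now else timedelta(0))
        conn.execute("UPDATE users SET deleted_at = ? WHERE id = ?",
                     (when.isoformat(), user_id))

    def sweep():
        # Run this daily: physically remove everything past the retention window.
        cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
        conn.execute("DELETE FROM users WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
                     (cutoff,))
        conn.commit()

The real work is of course making sure the sweep also reaches replicas, search indexes and backups, which is where most implementations fall down.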
That would be one way to implement a data life cycle that would probably meet with regulators' approval, but it depends on lots of little details and the length of that 'time limit'.
In startup groups I have seen people give the advice to retain trial, inactive and deleted accounts for a little while so they can analyze them and train their ML models.
Yes, I've seen this too. Too often to be happy about. On the plus side, more and more companies we look at really get it and do their utmost to do it right. The GDPR has definitely woken people up.
Yes, backups and log files especially are hard. The databases are relatively easy if they allow for in-place overwrites or compaction of tables. Even then you have to be careful.
Data is funny that way. It is easy to acquire, easy to lose if you want to keep it and devilishly hard to get rid of for real if that is what you want to do.
You have several options, none of them pretty. The first is to load, update and rewrite the backup. Time consuming, error prone and you may fuck up the backup that you need.
The second starts way further back: encrypt sensitive fields at rest, drop the keys when the user requests a deletion.
That way you never even have to touch the backups in order to have the right end result.
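A minimal sketch of that approach, using the Python 'cryptography' package; the key store and record store here are just dictionaries standing in for whatever you actually use:

    from cryptography.fernet import Fernet

    keys = {}        # per-user keys, kept in a small, separately managed store
    records = {}     # stands in for rows that also end up in backups and logs

    def store_profile(user_id, email):
        key = keys.setdefault(user_id, Fernet.generate_key())
        records[user_id] = Fernet(key).encrypt(email.encode())

    def read_profile(user_id):
        return Fernet(keys[user_id]).decrypt(records[user_id]).decode()

    def forget_user(user_id):
        # The only thing that has to be securely erased is the key; the
        # ciphertext sitting in cold backups can stay where it is.
        del keys[user_id]

    store_profile(42, "alice@example.com")
    print(read_profile(42))    # alice@example.com
    forget_user(42)
    # read_profile(42) now fails: the data still exists everywhere it was
    # copied to, but nobody can read it anymore.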
Not currently running a user visible service. Fortunately. The one I did run was started in 1998 and had turned into a very large pile of super insecure PHP and was shut down. If I were to start a user visible service today I would think really long and hard about whether or not the kind of superstructure required would be offset by the potential gains. The web is no longer a place for amateur projects with a focus on the happy path, you need to do it all 'just so' or you can expect to be compromised. And doing it right isn't simple nor is it cheap.
Key management is indeed just another - hopefully simpler - version of the same problem. The reason why it simplifies things is because a single key can invalidate a lot of data stored in places that are out of reach such as cold backups.
The vast majority of companies I've seen consider issuing an "SQL DELETE" to be sufficient for GDPR compliance.
One day, they will be burned when some investigation finds out that those rows are still sitting in uncompacted tables, or on now-unallocated disk blocks, or on now-remapped SSD sectors.
The only way to be sure the data is deleted is to copy all the data you want to keep to a new drive and burn the old one. Anything less and there is a reasonable chance an expert could recover at least some of the deleted info.
All it takes is one data leak[+] to blow that wide open. Secure erase is hard. But what is much less hard is to encrypt data and then to limit the problem to getting rid of the decryption key. This reduces the problem in scope to one single datum rather than a whole chain of possible plaintext copies.
Secure erase is a contractual requirement in many relationships; it is interesting that, as far as I know, none of the major DB vendors have a secure delete option.
There are lots of tricks and attempts to work around it but no official support for such functionality afaik. And that's before we get into VM snapshots, database snapshots, backup copies, copies that were made by developers or data scientists for test purposes (of course, nobody ever does that) and so on. Nasty little problem.
[+] or some employee shooting their mouth off in an online forum.
> But what is much less hard is to encrypt data and then to limit the problem to getting rid of the decryption key. This reduces the problem in scope to one single datum rather than a whole chain of possible plaintext copies.
Key issue: you can't do any operations on encrypted data; essentially you're killing off your database.
Homomorphic encryption is academic research, not something that is widely available and supported in common open and closed source databases. The best you can get is a database that encrypts the on disk data (Oracle TDE), but that only protects against a server being stolen or hacked on the OS level.
I fully expect that we will see application level and system level tools that are compliant with the law pop up any day now, the need is certainly there.
Also, for purposes of operations on data it all depends on what the column holds. For instance, you don't actually need access to fields such as names and dates of birth if you have a client ID, and that ID is yours, not the customer's, so you could leave that field in plain text. Any operation would then need that client ID, but that's workable.
You could even say that if you need that decryption key for anything other than user- or controller-directed computation, you are probably doing something you shouldn't be doing. In all other cases the context is clear, consent has been obtained and the data can be decrypted if required.
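To make that concrete, a rough sketch of what such a split could look like (table and column names are made up, and the per-client keys follow the same idea as above):

    import sqlite3
    from cryptography.fernet import Fernet

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (client_id INTEGER PRIMARY KEY, name_enc BLOB, dob_enc BLOB)")
    conn.execute("CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, client_id INTEGER, amount_cents INTEGER)")

    client_keys = {}   # dropped on an erasure request

    def add_client(client_id, name, dob):
        f = Fernet(client_keys.setdefault(client_id, Fernet.generate_key()))
        conn.execute("INSERT INTO clients VALUES (?, ?, ?)",
                     (client_id, f.encrypt(name.encode()), f.encrypt(dob.encode())))

    def revenue_per_client():
        # Reporting and joins only ever touch the plain text client_id,
        # so they keep working whether or not the key still exists.
        return conn.execute("SELECT client_id, SUM(amount_cents) FROM invoices GROUP BY client_id").fetchall()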
Secure delete is not implemented in databases because it is extraordinarily expensive, and destroys the performance of all other non-delete operations. When you say “encrypt data”, you are ignoring the fact that it can’t be implemented as “same data structures, just encrypted”. Encryption puts constraints on data structures and data representation that are fundamentally incompatible with and adverse to the design and functioning of most databases in existence, even ignoring the other ugly operational issues with that approach which no one thinks about. It would require radically redesigning the internals of most database engines, which is not really a practical option.
I’ve thought a lot about what it would take to design a “delete optimized” database kernel ever since GDPR became a thing. In principle it is possible, but I seriously doubt anyone would use a database that is literally orders of magnitude slower and less scalable for everything except delete operations. Would it be acceptable to increase the resource intensity of databases by 10x (and environmental footprint implied) to get “real” deletes? That is the tradeoff here.
This has precedent in SQL databases designed for high-assurance applications with ultra-fine access and visibility controls. When the average software engineer understands how they work, it sounds like a great idea for having more secure data and they wonder why it doesn’t seem to exist. The reality is that they do exist but they are so abysmally slow for even elementary things that no one would ever dream of using one unless there is a narrow government requirement.
> When you say “encrypt data”, you are ignoring the fact that it can’t be implemented as “same data structures, just encrypted”. Encryption puts constraints on data structures and data representation that are fundamentally incompatible with and adverse to the design and functioning of most databases in existence, even ignoring the other ugly operational issues with that approach which no one thinks about. It would require radically redesigning the internals of most database engines, which is not really a practical option.
You can't do effective per-user encryption on columns the database software needs to read (things you'll query or join on), but the database rarely needs message/post content or image content (the latter is often not even stored in the database). So encrypting those could help privacy, if you can make the per-user encrypted store better at deletes than the database in general.
For person-to-person messages, if you have a separate record for the sender and the receiver, you can do some per-user transformation on the other correspondent, but that field might be indexed, so the data would still show messaging patterns even without the keys.
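A rough sketch of that two-record idea, with the correspondent stored as a keyed pseudonym and the body encrypted under the owner's key (all names here are made up):

    import hmac, hashlib
    from cryptography.fernet import Fernet

    user_keys = {}    # per-user keys, deleted on an erasure request
    mailboxes = {}    # owner_id -> list of (correspondent_pseudonym, ciphertext)

    def _key(user_id):
        return user_keys.setdefault(user_id, Fernet.generate_key())

    def _pseudonym(owner_id, other_id):
        # Per-owner transformation of the other correspondent; without the
        # owner's key the pseudonym cannot be mapped back to an account.
        return hmac.new(_key(owner_id), str(other_id).encode(), hashlib.sha256).hexdigest()

    def send(sender, receiver, text):
        for owner, other in ((sender, receiver), (receiver, sender)):
            token = Fernet(_key(owner)).encrypt(text.encode())
            mailboxes.setdefault(owner, []).append((_pseudonym(owner, other), token))

    def forget(user_id):
        del user_keys[user_id]    # their copies become unreadable and unlinkable

It isn't perfect: the same pseudonym repeating in a mailbox still shows how often two parties talked, even after the keys are gone.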
The law doesn't really care about your notion of what can be done efficiently; if it can be done then you should probably do it or risk being found in violation of the law. There is some very specific language to that effect in the GDPR to make sure that it is clear that 'whatever you could reasonably do' needs to be done. You can then go and argue that you felt it wasn't reasonable, but I doubt that will fly.
But you're totally right that this is a tricky problem and hard to do properly. On the plus side, field level encryption is something that we've been doing for ages and that already works quite well. If you design carefully you can even get some processing done on those fields, though it will take some major breakthroughs in DB engine design before you can have your cake and eat it too, in the sense that you can be both legally compliant and do all the kinds of processing you can do today. I even doubt whether that is desirable; lots of those examples of processing should probably not be done in the first place.
Encryption at record granularity introduces two technical problems that no one has come up with a tractable solution for, and a solution likely doesn't exist. It is actually a discussion of fundamental tractability, not "efficiency". The law can declare that we should be able to break AES encryption too but that doesn't manufacture plausibility.
First, encryption has a block size. Data field storage in databases is typically measured in bits, as close to the information theoretic limit as practical. Storing an 11-bit datum for some column as a 256-bit AES block represents a 20x expansion in storage cost. We could go back to the old row storage model, which would allow the record to be encrypted as contiguous memory, but that would both bloat storage (for different reasons) and we get to relive the golden age of very poor query performance. Modern databases are built on succinct representations because throughput is memory-bandwidth bound. Even in conventional databases, ignoring this design detail will cost you 100x in throughput.
Second, keys and key schedules thoroughly thrash the CPU cache for scan operators. In a typical data model, a single row will be much smaller than the key schedule for decrypting that row. Every single row, several thousand per page, will require an unpredictable cache line fill to access the required key. Then you have two choices: compute a new key schedule for each row, which will be computationally expensive, or precompute the key schedule and take the even larger RAM hit. In large scale-out databases, many gigabytes of key infrastructure will need to be locally cached in RAM on each server -- you can't afford a network hop or page fault -- to decrypt each row in a page scan. Key management state consumes most of your runtime resources and crowds out the data model. I've worked on the design of such schemes in real systems, you end up devoting almost all of your cache/RAM to key state to the exclusion of the actual data model. Also burns up quite a bit of precious memory bandwidth without doing any real work.
Any database built on encryption for record-level physical deletion will be unusable for almost any modern application for well-studied technical reasons. It will work, in theory, if you can run your business on a database that performs and scales like it is 1995.
The best technical solution for physical deletion today is to rewrite cold storage, which is still extremely expensive and has extremely low delete operation throughput if you do it synchronously but at least it doesn't break database computer science. The only high-throughput and economical way to implement rewriting is asynchronously with very long deferrals e.g. over 30 days. Which is how databases have always worked, but with the deferral being indefinite.
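In sketch form, such a deferred rewrite looks something like this (the newline-delimited JSON segment format and the 30-day deferral are just illustrative assumptions):

    import json, os, time

    DEFERRAL_SECONDS = 30 * 24 * 3600

    def compact_segment(path):
        # Deletes only set a 'tombstoned_at' timestamp on the row. This job runs
        # asynchronously and rewrites the segment without the rows whose
        # tombstones are older than the deferral window.
        now = time.time()
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        kept = [r for r in rows
                if not (r.get("tombstoned_at") and now - r["tombstoned_at"] > DEFERRAL_SECONDS)]
        with open(path + ".new", "w") as f:
            for r in kept:
                f.write(json.dumps(r) + "\n")
        # Swapping in the fresh file is what frees the old blocks to be
        # overwritten; until then the 'deleted' rows are still on disk.
        os.replace(path + ".new", path)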
The tricky bit to me is that fundamentally deletion should not be harder than insertion, but because we have historically focused only on the happy path nobody cares at all about the deletion mechanism. A secure erase-in-place option for a single field would already take you 80% or more of the way to a workable solution.
Perfection, the way you describe it, is for now unattainable. But the bigger problem, as far as my practice tells me, is that people simply don't care: they set the 'deleted' bit and leave the plain text records + backups + log files all untouched.
The low hanging fruit is pretty much dragging the ground.
Sure, you can trivially design a data structure where insert and delete have the same cost. This works if you don’t care about query performance; most people care about query performance a great deal. This was litigated in the marketplace decades ago. Even every open source database rejects designs that produce rubbish query performance. It also does not reflect the real-world distribution of operations between insert and delete — we do a lot more inserts.
The “erase-in-place” operations you mention were common in databases a few decades ago and abandoned because the typical performance was terrible compared to the alternative. I am old enough I even implemented a few. It isn’t like these designs didn’t exist, they were deeply flawed for reasons that apply today.
This has nothing to do with perfection. Database engineers would love it if deletes were inexpensive even in the absence of hard delete requirements, as that would make update operations, which people want, dramatically cheaper. But that isn’t the reality. You are essentially re-litigating settled database kernel engineering without understanding why databases are designed the way they are.
If you think there is “low hanging fruit dragging on the ground” then I encourage you to prove it by designing a useful database kernel that can scale deletes while preserving insert and query performance, the two operations that drive all economics in databases. You’ll be instantly famous as a computer scientist because there are some nasty theoretical computer science problems in there.
They probably know the odds of an "investigation" actually occurring are very low. The odds of one occurring with that level, or probably any level, of technical sophistication are near zero.
From regulators: yes, they typically work reactively; until there is some kind of breach you'll never hear from them. From investors, customers and potential acquirers, audits are pretty common and becoming more so every day.