Data Retention is Two Different Problems
A common artifact in security programs is something called a data retention policy. Like other policies, it contains a lot of jargon, but the centerpiece is typically a big table that maps categories of data to specific timeframes - for example:
| Data Category | Retention Period |
|---|---|
| Customer data | 30 days |
| Employee records | 2 years |
| Security event logs | 5 years |
| Financial records | 7 years |
To me, these policies are confusing because they conflate two different goals:
- Data preservation: the minimum amount of time you need to keep important data. This applies to stuff like audit logs you need if you get hacked, or financial records you need to investigate fraud.
- Data deletion: the maximum amount of time you’re allowed to keep personal data. This is driven by contracts and privacy laws which require you to delete personal data when you don’t need it anymore.
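One way to make the split concrete is to stop storing a single "retention period" and record a floor and a ceiling separately. Here's a minimal sketch in Python. The `RetentionRule` class, its field names, and my reading of which table rows are floors versus ceilings are all assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    """One policy row, with the two goals kept in separate fields."""
    category: str
    keep_at_least_days: Optional[int] = None  # preservation floor (minimum)
    delete_within_days: Optional[int] = None  # deletion ceiling (maximum)

    def __post_init__(self) -> None:
        # A floor above the ceiling can never be satisfied.
        if (
            self.keep_at_least_days is not None
            and self.delete_within_days is not None
            and self.keep_at_least_days > self.delete_within_days
        ):
            raise ValueError(f"{self.category}: floor exceeds ceiling")

    def status(self, age_days: int) -> str:
        """Say what the policy requires for a record of the given age."""
        if self.delete_within_days is not None and age_days >= self.delete_within_days:
            return "must delete"
        if self.keep_at_least_days is not None and age_days < self.keep_at_least_days:
            return "must keep"
        return "may keep"


# Assumed interpretation of three rows from the example table.
RULES = [
    RetentionRule("Security event logs", keep_at_least_days=5 * 365),
    RetentionRule("Financial records", keep_at_least_days=7 * 365),
    RetentionRule("Customer data", delete_within_days=30),
]
```

With this shape, a 45-day-old security log comes back as "must keep" while a 45-day-old customer record comes back as "must delete", which is exactly the distinction a single "Retention Period" column hides.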
It’s a good idea to write clear data retention policies that untangle these two goals. But you should also give advice to the people who are storing data, at design time, to help make sure you actually do the thing you said you’d do. Two simple and valuable ideas I’ve brought to many of these conversations are immutability (for preservation) and ephemerality (for deletion).
Can it be immutable?
Remember that for data you need to preserve - audit logs, security events, compliance records - the numeric duration specified in the policy is a minimum. In the typical case, I’d argue that any duration other than “indefinitely” is cargo-cult copy-paste with no explicit rationale. Assuming the data isn’t personal data that carries privacy risk, the other traditional reasons to delete old data - storage costs, database performance - matter far less than they used to. Cloud storage is cheap. Modern data warehouses handle large datasets well. The operational complexity of managing data lifecycle policies often costs more than just keeping the data.
Beyond keeping archival data indefinitely, consider immutability - write once, never modify or delete. For data in cloud-based SaaS, indefinite retention is usually the default, but immutability is more rigorous. The difference is that the data is not just retained - it also can’t be deleted. The controls for immutability are built into cloud storage systems like S3: Object Lock or versioning. A more indirect route to immutability is restricting who can delete, with service control policies or MFA delete. Combined with resilient, isolated backups, these controls let you uphold retention policies even in black swan events like data breaches or accidental outages.
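To show what "built in" looks like, here's a rough sketch of a default Object Lock retention rule for an S3 bucket. The helper name `object_lock_config`, the bucket name, and the five-year figure are illustrative assumptions. The dictionary shape is the one boto3's `put_object_lock_configuration` accepts, and the call itself is left in a comment so the snippet runs without AWS credentials.

```python
from typing import Any, Dict


def object_lock_config(retention_days: int, mode: str = "COMPLIANCE") -> Dict[str, Any]:
    """Build a default-retention Object Lock configuration for a bucket.

    COMPLIANCE mode blocks deletion of a locked object version by any user
    until retention expires; GOVERNANCE mode lets specially permissioned
    users bypass the lock.
    """
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown Object Lock mode: {mode}")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": retention_days}},
    }


# Treat the policy number as a floor: five years for security event logs.
config = object_lock_config(retention_days=5 * 365)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_object_lock_configuration(
#       Bucket="example-audit-logs",  # hypothetical bucket name
#       ObjectLockConfiguration=config,
#   )
```

Object Lock depends on bucket versioning being enabled, so plan for it when the bucket is created. Note that a default retention rule still needs a finite number, so the lock covers the minimum from your policy. Keeping the data after the lock expires is a separate choice: just don't add an expiration rule.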
Can it be ephemeral?
If you store personal data, you should delete it when you lose the rights to it, lest you suffer the wrath of privacy legislators. You may lose your rights when a contract ends, or when someone sends you a deletion request.
The traditional approach requires building deletion machinery to plumb these requests through every system where personal data lives: your production database, analytics warehouse, ML training datasets, etc. Each system needs its own deletion endpoint, and you need orchestration to coordinate all of it. Yuck.
Ephemeral data solves this by design. If you automatically expire old data with something like S3 lifecycle management, your retention period becomes your deletion mechanism. You retain the data for a little while, and then it naturally disappears. This doesn’t work for your main durable stores of personal data (e.g. your production database), but is worth considering everywhere else.
Like, you probably don’t want to selectively delete data from database backups. Same with logs that capture personal identifiers or detailed analytics that include user-level information. Aging this data out means no-touch compliance with your data retention policies.
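As a sketch of what aging data out can look like, here's an S3 lifecycle rule that expires objects under a prefix after a fixed number of days. The helper name `expiration_rule`, the rule ID, the prefix, and the bucket name are hypothetical. The dictionary matches what boto3's `put_bucket_lifecycle_configuration` expects, and the call is shown in a comment so the snippet runs offline.

```python
from typing import Any, Dict


def expiration_rule(rule_id: str, prefix: str, days: int) -> Dict[str, Any]:
    """Build one lifecycle rule that expires objects under `prefix` after `days`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical example: request logs containing user identifiers age out at 30 days.
lifecycle = {
    "Rules": [expiration_rule("expire-request-logs", "logs/requests/", 30)]
}

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-app-logs",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle,
#   )
```

One caveat worth checking for your own setup: on a versioned bucket, `Expiration` only adds a delete marker to the current version, so older versions stick around unless the rule also sets `NoncurrentVersionExpiration`.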
For data lakes and warehouses with personal data AND long-term analytics requirements, enforce a lifecycle retention on fact tables with personal data, then apply pseudonymization or anonymization on transformed tables as part of your data pipeline. I’ll note that validating the effective removal of personal information during these data transformations is an exercise left to the not-faint-of-heart reader.
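Here's a minimal sketch of the pseudonymization step, using a keyed hash (HMAC-SHA256) from the Python standard library. The column names, the `DIRECT_IDENTIFIERS` set, and the `transform` function are assumptions for illustration. This is one common technique, not a claim about what any particular pipeline does.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable, List

# Hypothetical columns to drop outright in the transformed table.
DIRECT_IDENTIFIERS = {"email", "name", "ip_address"}


def pseudonymize_id(user_id: str, secret_key: bytes) -> str:
    """Map a user ID to a stable pseudonym with a keyed hash."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def transform(rows: Iterable[Dict[str, Any]], secret_key: bytes) -> List[Dict[str, Any]]:
    """Drop direct identifiers and swap user_id for a pseudonym."""
    out = []
    for row in rows:
        cleaned = {
            k: v
            for k, v in row.items()
            if k not in DIRECT_IDENTIFIERS and k != "user_id"
        }
        cleaned["user_pseudonym"] = pseudonymize_id(str(row["user_id"]), secret_key)
        out.append(cleaned)
    return out


# Hypothetical rows from an ephemeral fact table.
events = [
    {"user_id": 42, "email": "a@example.com", "event": "login"},
    {"user_id": 42, "email": "a@example.com", "event": "purchase"},
]
long_term = transform(events, secret_key=b"example-key-not-for-production")
```

The same user gets the same pseudonym across rows, so long-term analytics still join up. Keep in mind this is pseudonymization, not anonymization: anyone holding the key can recompute the mapping, and the remaining columns may still identify someone. Destroying or rotating the key is one way to cut the link later.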
Design for immutability or ephemerality from the start
So, when you’re doing security design reviews or threat modeling for systems that will store data, ask these questions:
- For archival data: Can we make this immutable?
- For personal data: Can we make this ephemeral?
If the answer is yes, use storage systems with built-in immutability or expiration. Make the data lifecycle automatic. If the answer is no, then you need to build custom machinery - either deletion controls for durable personal data, or integrity and availability controls for archival data. Pointing people in the right direction early - during design, not after the system is built - is what keeps operational complexity at bay.
Remember - the hard part of security isn’t writing a policy. It’s the work that goes into making it reality.