“Anonymous Data” That Still Knows Your Client: The Classic Lie
We strip the names and call it “anonymous.” But data has a way of finding its owner. We expose the myth of anonymization and the reality of re-identification.
The Illusion of the Blank Face
We have a dataset. It contains sensitive financial habits of high-net-worth individuals. We want to analyze it. So, we delete the column labeled “Name.” We delete the column labeled “SSN.”
“There,” we say, wiping our hands. “It is anonymous. No one can get hurt.”
Alas, if only it were that simple.
Data is not a collection of isolated islands. It is a web. And in that web, it is shockingly easy to trace a line back to the source.
We tell ourselves the lie of anonymity because it is convenient. It allows us to trade, sell, and analyze data without the guilt of surveillance. But in the era of Big Data, “Anonymous” is a fairy tale we tell to compliance officers to make them sleep at night.
The Tacky Habit: Lazy Redaction
The habit of “Lazy Redaction”—simply removing direct identifiers—is dangerous.
Consider this: You have a dataset of “anonymous” medical records. It lists a 45-year-old male, living in Zip Code 90210, who visited a cardiologist on Tuesday.
How many 45-year-old males in that zip code visited a heart specialist on that specific Tuesday? Perhaps three? Perhaps one?
You have not protected this man. You have simply given the spy a puzzle to solve. And computers are excellent at solving puzzles.
This is what we call the Mosaic Effect. By joining your “anonymous” dataset to a public voter registration list or a LinkedIn scrape on the fields they share (age, sex, zip code), an attacker can re-attach the names to the records in milliseconds.
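To make the Mosaic Effect concrete, here is a minimal sketch of such a join in plain Python. Everything in it is an assumption made for illustration: the records are invented, and the names `anonymous_records`, `public_register`, `QUASI_IDENTIFIERS`, and `link` are hypothetical. A real attack would use far larger and messier sources, but the mechanics are roughly this simple.

```python
# Hypothetical toy data: every record below is invented for illustration.
anonymous_records = [
    {"age": 45, "sex": "M", "zip": "90210", "visit": "cardiology"},
    {"age": 31, "sex": "F", "zip": "90210", "visit": "dermatology"},
]

# Stand-in for a public source such as a voter list (also invented).
public_register = [
    {"name": "Person A", "age": 45, "sex": "M", "zip": "90210"},
    {"name": "Person B", "age": 31, "sex": "F", "zip": "90210"},
    {"name": "Person C", "age": 31, "sex": "F", "zip": "90210"},
]

# Fields that are not identifiers on their own but appear in both datasets.
QUASI_IDENTIFIERS = ("age", "sex", "zip")


def link(records, register, keys=QUASI_IDENTIFIERS):
    """Pair each 'anonymous' record with the names that share its quasi-identifiers."""
    index = {}
    for person in register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    return [(record, index.get(tuple(record[k] for k in keys), [])) for record in records]


matches = link(anonymous_records, public_register)

# A record is re-identified when exactly one named person fits it.
reidentified = [(names[0], record["visit"]) for record, names in matches if len(names) == 1]
print(reidentified)
```

In this toy data the 45-year-old male matches exactly one name, so his cardiology visit is exposed. The second record matches two people and stays ambiguous, which is the only thing protecting it.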
Promising anonymity when you are only offering redaction is not just technically flawed; it is a breach of contract. You are selling a safety you cannot guarantee.
The Professional Standard: Aggregation or Nothing
If you cannot guarantee that a human cannot be re-identified, you must treat the data as personal. Period.
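One rough way to test that guarantee is to count how many rows share each combination of quasi-identifiers, an idea usually called k-anonymity. The sketch below assumes invented rows and hypothetical names (`smallest_group`, `is_k_anonymous`). Passing such a check does not prove the data is safe; failing it is a strong sign that it is not.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest set of rows sharing the same quasi-identifier values.

    Returns 0 for an empty dataset, which this sketch treats as failing the check.
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values covers at least k rows."""
    return smallest_group(rows, quasi_identifiers) >= k


# Invented example rows; no names, no SSNs.
rows = [
    {"age_band": "40-50", "zip": "90210", "visit_day": "Tue"},
    {"age_band": "40-50", "zip": "90210", "visit_day": "Mon"},
    {"age_band": "40-50", "zip": "90210", "visit_day": "Mon"},
]

print(smallest_group(rows, ("age_band", "zip")))               # three rows share these values
print(smallest_group(rows, ("age_band", "zip", "visit_day")))  # the Tuesday visitor stands alone
```

Note how adding a single extra field (the day of the visit) shrinks the smallest group from three to one. Every column you keep is another piece of the puzzle.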
The sophisticated approach is not to strip fields, but to Aggregate.
Do not store rows of people. Store buckets of trends.
- Bad: “User A: 40-50, London, spent £500.”
- Good: “Segment London-Mid-Spenders: 500 individuals, Avg Spend £500.”
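In the second example, the individual has dissolved. They are safe because they no longer exist as a discrete entity, provided the bucket is large enough that no single member stands out. A “segment” of one is just a row with a fancier name.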
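The bucket approach above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a compliance recipe: the rows are invented, `aggregate` is a hypothetical helper, and the `MIN_GROUP_SIZE` threshold of 10 is an arbitrary placeholder that you would need to set for your own data.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 10  # hypothetical threshold; segments smaller than this are not published


def aggregate(rows, segment_keys, value_key, min_group_size=MIN_GROUP_SIZE):
    """Collapse individual rows into per-segment counts and averages.

    Segments with fewer than min_group_size members are suppressed entirely,
    because a tiny bucket still describes identifiable people.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in segment_keys)].append(row[value_key])

    report = {}
    for segment, values in buckets.items():
        if len(values) < min_group_size:
            continue  # suppress the small group rather than expose it
        report[segment] = {
            "individuals": len(values),
            "avg_spend": sum(values) / len(values),
        }
    return report


# Invented rows: twelve London customers and one lone Paris customer.
rows = [{"city": "London", "age_band": "40-50", "spend": 500} for _ in range(12)]
rows.append({"city": "Paris", "age_band": "40-50", "spend": 9000})

report = aggregate(rows, ("city", "age_band"), "spend")
print(report)
```

The London segment survives as a count and an average. The single Paris customer is dropped from the report, since publishing an "average" of one person would simply be publishing that person.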
We must stop playing games with definitions. If you are holding a row of data that corresponds to a single human being, it is personal data. Treat it with the liability it demands.
Do not hide behind the “Anonymous” label. It is thin, it is transparent, and frankly, it is beneath us.
FAQs
If we remove the email address, isn't it anonymous?
No. At best, that is “pseudonymized.” It is like wearing a mask but keeping your name tag on: the age, postcode, and dates left behind can still point to one person. It is not enough.
What is the Mosaic Effect?
It is when harmless pieces of data are combined to reveal a complete picture. Like solving a puzzle you thought was destroyed.
So we shouldn't share data at all?
We should share *insights*, not raw rows. Aggregate the data until the individual dissolves completely.