Synthetic Data for GDPR Compliance: Anonymization Done Right
How synthetic data, anonymization, and pseudonymization differ under GDPR, what Recital 26 means for test data, and when each removes compliance scope.
By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read
Fully synthetic data generated from scratch is not personal data under GDPR, so it sits outside the regulation's scope and keeps real people out of your test environments. The legal pivot is simple: if a record does not relate to an identified or identifiable living person, GDPR does not apply to it (Recital 26). This guide explains where that line sits, why anonymization and pseudonymization are not the same thing, and how to choose between masking, pseudonymization, and synthetic data.
What is the difference between masking, pseudonymization, and synthetic data?
Masking obscures fields in real records, pseudonymization swaps identifiers for tokens while keeping a re-link key, and synthetic data is generated from scratch with no source record at all. GDPR scope turns on one question: can the data be tied back to a real person? Only synthetic-from-scratch reliably answers no.
GDPR applies to personal data, defined in Article 4(1) as any information relating to an identified or identifiable natural person [gdpr-art4]. The Article 29 Working Party, the predecessor of the European Data Protection Board, frames re-identification in Opinion 05/2014 around three failure modes you must rule out before calling data anonymous: singling out (isolating one person's record), linkability (connecting two records about the same person), and inference (deducing an attribute with high probability) [edpb-anon].
| Failure mode (Art. 29 WP / EDPB) | Question it asks | Example that fails the test | Why synthetic-from-scratch passes |
|---|---|---|---|
| Singling out | Can one individual be isolated in the set? | A masked row with rare ZIP + birth date still uniquely identifies one person | No row maps to a real person, so isolation reveals nobody |
| Linkability | Can two records about the same person be linked? | Pseudonym token re-links across two leaked tables | Generated values share no key with any production record |
| Inference | Can an attribute be deduced with high confidence? | Model-trained synthetic data memorizes a rare salary outlier | Rule-based generation never observes a real attribute to leak |
Data masking: real data, obscured
Masking overwrites or shuffles sensitive fields in real production records, for example replacing the middle digits of a phone number or swapping last names between rows. The original structure and statistics survive, which is why teams reach for it in testing. But the dataset is still derived from real people. Static masking that only blanks obvious identifiers leaves quasi-identifiers behind: Latanya Sweeney's analysis of 1990 US Census data found that about 87 percent of the US population was likely to be uniquely identifiable from 5-digit ZIP code, date of birth, and sex alone [sweeney]. Unless masking is irreversible and rigorous, the output stays personal data.
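The quasi-identifier problem can be measured directly. The sketch below is a minimal illustration, assuming a hypothetical masked extract (`masked_rows`) in which names are blanked but ZIP code, birth date, and sex are left intact; the helper name `unique_fraction` is ours, not part of any library. It counts how many rows have a quasi-identifier combination that no other row shares, which is the singling-out condition.

```python
from collections import Counter

# Hypothetical masked extract: names are blanked, but the three
# quasi-identifiers from Sweeney's study survive untouched.
masked_rows = [
    {"name": "****", "zip": "02139", "dob": "1961-07-14", "sex": "F"},
    {"name": "****", "zip": "02139", "dob": "1984-03-02", "sex": "M"},
    {"name": "****", "zip": "02139", "dob": "1984-03-02", "sex": "M"},
    {"name": "****", "zip": "94110", "dob": "1990-11-23", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the share of rows whose quasi-identifier combination is unique."""
    if not rows:
        return 0.0
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singled_out = sum(
        1 for row in rows if combos[tuple(row[k] for k in keys)] == 1
    )
    return singled_out / len(rows)


fraction = unique_fraction(masked_rows)
print(f"{fraction:.0%} of masked rows can be singled out")
```

A result well above zero on a masked extract suggests the masking left enough signal to isolate individuals once an attacker joins it with auxiliary data.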
Pseudonymization: reversible by design
GDPR Article 4(5) defines pseudonymization as processing personal data so it can no longer be attributed to a specific person without additional information that is kept separately. The operative word is separately: a mapping key still exists. Because the link can be restored, pseudonymized data stays fully in scope as personal data [gdpr-art4]. Article 32 encourages pseudonymization as a security measure, but it reduces risk rather than removing obligations. NIST draws the same line in SP 800-122 (April 2010), treating de-identification as effective only when the re-identification mapping is destroyed or held under controls that make linkage infeasible [nist-800122].
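To see why a retained key keeps data in scope, consider a minimal sketch of keyed tokenization. It assumes HMAC-SHA256 as the tokenization scheme and a hypothetical `SECRET_KEY`; real deployments may use a lookup table or a vault service instead. The point it illustrates is that the same key produces the same token everywhere, so records stay linkable and the key holder can map tokens back to people.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be held in a separate key store.
SECRET_KEY = b"kept-separately-from-the-dataset"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Two tables pseudonymized with the same key.
orders = [{"customer": pseudonymize("alice@example.com"), "total": 40}]
tickets = [{"customer": pseudonymize("alice@example.com"), "issue": "refund"}]

# Linkability: identical tokens join the two tables.
linked = orders[0]["customer"] == tickets[0]["customer"]

# Re-identification: whoever holds the key can recompute tokens for
# known identifiers and look the person up again.
candidates = ["alice@example.com", "bob@example.com"]
lookup = {pseudonymize(candidate): candidate for candidate in candidates}
recovered = lookup[orders[0]["customer"]]

print(linked, recovered)
```

Destroying the key removes the direct re-link path, but linkability across tables remains, so the Recital 26 assessment would still be needed before treating the output as anonymous.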
Synthetic data: generated, not derived
Fully synthetic data is produced by an algorithm to mimic the format and distribution of real data without copying any individual record. A synthetic customer table has names, addresses, and identifiers that look valid and pass validation logic, yet each row corresponds to no actual person. With no link to an identifiable individual and re-identification not reasonably likely, the data is anonymous and falls outside GDPR. The distinction that decides everything is derived versus generated: a value computed from a real person's record inherits that person's identifiability; a value drawn from independent rules and reserved ranges does not.
What does GDPR Recital 26 actually say about anonymous data?
Recital 26 states that the principles of data protection do not apply to anonymous information, defined as data that does not relate to an identified or identifiable person, or to personal data rendered anonymous so the data subject is no longer identifiable [gdpr-rec26]. Anonymous data is out of GDPR scope entirely. Synthetic-from-scratch data qualifies because no original subject ever existed to re-identify.
The same recital sets a demanding test. To decide whether a person is identifiable, you must account for all the means reasonably likely to be used, by the controller or by another party, to identify the person directly or indirectly, weighing objective factors including cost, time, and available technology. Deleting a name column is not enough. The UK ICO calls this the spectrum of identifiability and warns that re-identification risk must be judged against motivated third parties and auxiliary data, not the dataset in isolation [ico-anon].
Anonymisation is the process of rendering personal data anonymous so that an individual is not (or is no longer) identifiable. Where data has been successfully anonymised, it is no longer personal data and data protection law does not apply.
Generated-from-scratch synthetic data turns the Recital 26 analysis into a one-liner: the data never related to a real person, so there is no subject to re-identify. The Court of Justice of the EU confirmed in *Breyer* (Case C-582/14, 19 October 2016) that identifiability hinges on whether the means to re-link exist and are legally and practically available, not on whether you personally hold the key [cjeu-breyer]. Remove every path back to a real subject and the dataset is no longer personal data.
How do masking, pseudonymization, and synthetic data compare?
Masking keeps real data with medium-to-high re-identification risk; pseudonymization stays fully in GDPR scope because its key can re-link records; fully synthetic data sits outside scope when truly anonymous and exposes no real people in a breach. Pick by one question: do you ever need to re-link to a real person?
| Dimension | Real-data masking | Pseudonymization | Fully synthetic (generated) |
|---|---|---|---|
| Source of data | Real records, fields obscured | Real records, IDs tokenized | Generated from rules/models |
| Re-identification risk | Medium-high (quasi-identifiers persist) | High (key can re-link) | Very low when generated from scratch |
| Reversible? | Sometimes (shuffling, format-preserving) | Yes, by design (separate key) | No original to reverse to |
| GDPR status | Usually still personal data | Personal data (Art. 4(5)) | Outside scope if truly anonymous |
| Data utility for testing | High (real distributions) | High (structure intact) | High to medium (depends on generator) |
| Compliance burden | Lawful basis, DPIA likely | Full GDPR obligations apply | Minimal once anonymity verified |
| Breach impact | Real people exposed | Real people exposed if key leaks | No real people exposed |
| Best fit | One-off analytics on masked extracts | Internal processing that needs re-linking | Dev, QA, staging, demos, load tests |
How do you generate test identifiers that pass validation but match nobody?
Use reserved or sandbox ranges that satisfy each standard's format and check-digit rules while being defined as never-issued or test-only. A Stripe test card like 4242 4242 4242 4242 passes the Luhn check; SSN area numbers 900-999, 000, and 666 are never allocated; phone numbers 555-0100 through 555-0199 are reserved for fiction. These values clear validation without colliding with a real person or account.
| Identifier | Governing standard | Checksum / format rule | Reserved test space we use |
|---|---|---|---|
| Credit card number | ISO/IEC 7812 (issuer ID + Luhn) | Luhn mod-10 check digit | Stripe / processor test BINs (e.g. 4242 4242 4242 4242) |
| US Social Security Number | SSA numbering scheme | Area-group-serial structure | Never-allocated ranges (900-999, 000, 666 area) |
| IBAN | ISO 13616 | Mod-97 check digits | Documentation-only country/bank codes |
| Phone number (US) | NANP | Valid area code + exchange | 555-0100 to 555-0199 fictional block |
| Email / domain | RFC 2606 | Syntactically valid address | example.com reserved domain / .test reserved TLD |
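The rules in the table can be sketched in a few lines. This is an illustrative example rather than the generator's actual implementation: the function names are hypothetical, the SSN helper draws only from the 900-999 area range named above, and the IBAN shown in the usage lines is the widely published documentation example rather than a reserved range. Verify every reserved range against the issuing authority before relying on it.

```python
import random


def luhn_valid(number: str) -> bool:
    """Check a card number against the Luhn mod-10 rule."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def iban_numeric(text: str) -> int:
    """Map letters to numbers (A=10 ... Z=35) as the mod-97 rule requires."""
    return int("".join(str(int(ch, 36)) for ch in text))


def iban_check_digits(country: str, bban: str) -> str:
    """Compute the two mod-97 check digits for a country code and BBAN."""
    remainder = iban_numeric(bban + country + "00") % 97
    return f"{98 - remainder:02d}"


def iban_valid(iban: str) -> bool:
    """Validate an IBAN: move the first four characters to the end, mod 97 == 1."""
    compact = iban.replace(" ", "").upper()
    return iban_numeric(compact[4:] + compact[:4]) % 97 == 1


def make_ssn(rng: random.Random) -> str:
    """Build an SSN-shaped value in the 900-999 area range."""
    return f"{rng.randint(900, 999)}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def make_phone(rng: random.Random) -> str:
    """Build a number in the 555-0100 to 555-0199 fictional block."""
    return f"555-01{rng.randint(0, 99):02d}"


def make_email(rng: random.Random) -> str:
    """Build an address on the reserved example.com domain."""
    return f"user{rng.randint(1, 999999):06d}@example.com"


rng = random.Random(42)  # seeded so the output is reproducible
print(make_ssn(rng), make_phone(rng), make_email(rng))
print(luhn_valid("4242 4242 4242 4242"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Keeping validators next to generators lets a test assert that every generated value both passes the format rule and sits inside a reserved range.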
How should you use synthetic data in non-production environments?
Stop copying production data into lower environments and seed them with synthetic data instead. Replicating personal data into staging or test is itself processing under GDPR, expands your breach surface, and often triggers a Data Protection Impact Assessment under Article 35. Synthetic data removes the personal-data problem at the source, across four common workloads:
- Local and CI test fixtures: seed deterministic synthetic users so tests are reproducible and contain zero real PII.
- Load and performance testing: generate millions of synthetic accounts to size infrastructure without exporting customer records.
- Demos and sales sandboxes: populate environments with believable but fictional profiles, addresses, and identifiers.
- Schema and integration testing: confirm downstream systems accept correctly formatted identifiers using structurally valid test values.
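For the fixture workload in particular, determinism comes from seeding the random source. The sketch below is one possible shape, assuming a hypothetical `make_users` helper and small illustrative name pools; none of the values are derived from a real record, and the same seed always yields the same users.

```python
import random

# Illustrative name pools; any fictional lists would do.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan", "Casey"]
LAST_NAMES = ["Sample", "Testcase", "Fixture", "Placeholder", "Mockley"]


def make_users(count: int, seed: int = 1234) -> list:
    """Generate `count` synthetic users; identical seeds give identical output."""
    rng = random.Random(seed)  # local generator, no global state touched
    users = []
    for index in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append(
            {
                "id": index,
                "name": f"{first} {last}",
                # Reserved domain and fictional phone block only.
                "email": f"{first}.{last}.{index}@example.com".lower(),
                "phone": f"555-01{rng.randint(0, 99):02d}",
            }
        )
    return users


fixtures = make_users(3, seed=7)
for user in fixtures:
    print(user)
```

Using a local `random.Random(seed)` instead of the module-level functions keeps fixtures stable even when other tests consume random numbers in a different order.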
How do you migrate off cloned production data?
Cut compliance scope environment by environment rather than in one risky switch. This five-step sequence moves a team from production dumps to synthetic fixtures without breaking pipelines mid-flight:
- Inventory the PII flowing into each non-prod environment and tag every personal-data field by category (direct identifier, quasi-identifier, special category).
- Pin seed data in CI to deterministic synthetic fixtures so build pipelines stop touching real records.
- Generate volume data for load and staging environments from reserved ranges, matching production cardinality and distribution shape.
- Cut over staging, revoke its access to production replicas, and document the Recital 26 re-identification assessment for the synthetic set.
- Audit and lock: add a CI check that fails the build if a known production value pattern appears in a fixture.
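The final audit step could be approached as an allowlist check. The sketch below is an assumption about how such a check might look, not a prescribed tool: rather than matching known production patterns, it flags any SSN-shaped value outside the never-allocated areas and any email address outside reserved domains. The names `find_violations` and `check_files` are hypothetical.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
ALLOWED_DOMAINS = {"example.com", "example.org", "example.net"}


def ssn_is_reserved(area: str) -> bool:
    """True for the never-allocated areas: 000, 666, and 900-999."""
    return area in ("000", "666") or area.startswith("9")


def find_violations(text: str) -> list:
    """Return a message for each value that falls outside the reserved ranges."""
    violations = []
    for match in SSN_RE.finditer(text):
        if not ssn_is_reserved(match.group(1)):
            violations.append(f"SSN outside reserved ranges: {match.group(0)}")
    for match in EMAIL_RE.finditer(text):
        domain = match.group(1).lower()
        if domain not in ALLOWED_DOMAINS and not domain.endswith(".test"):
            violations.append(f"Email outside reserved domains: {match.group(0)}")
    return violations


def check_files(paths) -> int:
    """Scan fixture files; return 1 if any violation is found, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for message in find_violations(handle.read()):
                print(f"{path}: {message}")
                failed = True
    return 1 if failed else 0
```

A CI job could pass the fixture paths to `check_files` and use the return value as the process exit code, so a single non-reserved value fails the build.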
A quick decision rule for GDPR and CCPA
If you ever need to re-link records to real customers, use pseudonymization and keep full GDPR obligations. If you only need realistic data and never need to identify a real person, generate synthetic data, aim for true anonymity, and document the Recital 26 re-identification assessment. CCPA/CPRA reaches the same result: de-identified and synthetic data the business cannot reasonably re-identify is excluded from personal information, provided you commit to keeping it de-identified [ccpa-deid].
| Concept | GDPR (EU/UK) | CCPA / CPRA (California) | Effect on synthetic-from-scratch data |
|---|---|---|---|
| In-scope data term | Personal data (Art. 4(1)) | Personal information (Cal. Civ. Code 1798.140) | Out of scope: no link to a natural person / consumer |
| Out-of-scope category | Anonymous information (Recital 26) | De-identified / aggregate information | Synthetic data fits this category when truly anonymous |
| Reversible-token term | Pseudonymisation (Art. 4(5)) - still in scope | Pseudonymized info - still personal information | Not relevant: no token maps back to a real subject |
| Standing obligation | Re-ID assessment of all reasonable means | Commit and act to keep data de-identified | Document that values are generated, not derived |
| Trigger for impact assessment | DPIA (Art. 35) for high-risk processing | Risk assessment under CPRA regulations | Avoided when no personal data is processed |
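The decision rule is small enough to encode in a data-request review script. The following is a hypothetical sketch: the `Recommendation` type and `recommend` function are illustrative names, and the returned text simply restates the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    technique: str
    stays_personal_data: bool
    follow_up: str


def recommend(needs_relink: bool) -> Recommendation:
    """Apply the article's rule: re-linking decides the technique."""
    if needs_relink:
        return Recommendation(
            technique="pseudonymization",
            stays_personal_data=True,
            follow_up="Keep full GDPR obligations and hold the key separately.",
        )
    return Recommendation(
        technique="synthetic",
        stays_personal_data=False,
        follow_up="Document the Recital 26 re-identification assessment.",
    )


print(recommend(needs_relink=False).technique)
```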
Start from our synthetic data generator to produce datasets for your non-production environments, browse country-specific formats, and treat the comparison table above as your default policy: synthetic by default, pseudonymization only when re-linking is a genuine requirement, and raw production data confined to production.
References & sources
- [gdpr-rec26] Recital 26 - Not Applicable to Anonymous Data, gdpr-info.eu
- [gdpr-art4] Article 4 - Definitions (personal data, pseudonymisation), gdpr-info.eu
- [edpb-anon] Opinion 05/2014 on Anonymisation Techniques, Article 29 Working Party / EDPB
- [ico-anon] Anonymisation, pseudonymisation and privacy enhancing technologies guidance, UK Information Commissioner's Office
- [ccpa-deid] California Consumer Privacy Act (CCPA), California Attorney General
- [nist-800122] SP 800-122: Guide to Protecting the Confidentiality of PII, NIST
- [sweeney] Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, L. Sweeney
- [cjeu-breyer] Case C-582/14 Patrick Breyer v Bundesrepublik Deutschland, Court of Justice of the European Union
Frequently asked questions
Is synthetic data considered personal data under GDPR?
Synthetic data generated independently, with no way to link it to an identifiable living person, is not personal data, so GDPR does not apply. The caveat: synthetic datasets trained on real records can leak attributes, and if individuals can still be singled out or re-identified, the data may remain personal. Values generated from scratch, like those from our generator, carry no link to a real person.
What is the difference between anonymization and pseudonymization under GDPR?
Pseudonymized data replaces identifiers with tokens but keeps a key that can re-link records to people, so GDPR Article 4(5) still treats it as personal data. Anonymized data is irreversibly stripped of any link to an individual, and under Recital 26 it falls outside GDPR. The test is whether re-identification is possible using all the means reasonably likely to be used.
Does GDPR Recital 26 mean anonymous data is exempt?
Yes. Recital 26 states that data protection principles do not apply to anonymous information or to personal data rendered anonymous so the data subject is no longer identifiable. The bar is high: you must weigh all means reasonably likely to be used to re-identify, including by third parties, accounting for cost, time, and available technology.
Can I use production data in my staging or test environment?
Copying production personal data into non-production environments is processing under GDPR and usually requires a lawful basis, access controls, and often a DPIA under Article 35. It also widens your breach surface. Most teams avoid this by generating synthetic data for dev, QA, and staging, so no real personal data ever leaves production.
Does CCPA treat de-identified and synthetic data differently from GDPR?
The CCPA/CPRA excludes de-identified and aggregate consumer information from personal information, provided the business cannot reasonably re-identify it and commits to keeping it de-identified. Synthetic data with no link to a real consumer sits outside CCPA scope, mirroring the GDPR position on anonymous data.
Is generating fake names and SSNs for testing legal?
Generating fictional names, addresses, and structurally valid test identifiers for software testing, QA, and privacy work is legal and common. It becomes illegal when used to impersonate a real person or commit fraud. Reputable generators use never-issued or reserved ranges and sandbox test numbers, so values pass format checks without matching a real person.