Synthetic Data for GDPR Compliance: Anonymization Done Right

How synthetic data, anonymization, and pseudonymization differ under GDPR, what Recital 26 means for test data, and when each removes compliance scope.

By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read

Fully synthetic data generated from scratch is not personal data under GDPR, so it sits outside the regulation's scope and keeps real people out of your test environments. The legal pivot is simple: if a record cannot be tied back to a living person by any means reasonably likely to be used, GDPR does not apply to it (Recital 26). This guide explains where that line sits, why anonymization and pseudonymization are not the same thing, and how to choose between masking, pseudonymization, and synthetic data.

What is the difference between masking, pseudonymization, and synthetic data?

Masking obscures fields in real records, pseudonymization swaps identifiers for tokens while keeping a re-link key, and synthetic data is generated from scratch with no source record at all. GDPR scope turns on one question: can the data be tied back to a real person? Only synthetic-from-scratch reliably answers no.

GDPR applies to personal data, defined in Article 4(1) as any information relating to an identified or identifiable natural person [gdpr-art4]. The Article 29 Working Party, predecessor of the European Data Protection Board, framed re-identification in Opinion 05/2014 around three failure modes you must rule out before calling data anonymous: singling out (isolating one person's record), linkability (connecting two records about the same person), and inference (deducing an attribute with high probability) [edpb-anon].

| Failure mode (Art. 29 WP / EDPB) | Question it asks | Example that fails the test | Why synthetic-from-scratch passes |
| --- | --- | --- | --- |
| Singling out | Can one individual be isolated in the set? | A masked row with rare ZIP + birth date still uniquely identifies one person | No row maps to a real person, so isolation reveals nobody |
| Linkability | Can two records about the same person be linked? | Pseudonym token re-links across two leaked tables | Generated values share no key with any production record |
| Inference | Can an attribute be deduced with high confidence? | Model-trained synthetic data memorizes a rare salary outlier | Rule-based generation never observes a real attribute to leak |

The three re-identification risks the EDPB requires you to eliminate before calling data anonymous.

Data masking: real data, obscured

Masking overwrites or shuffles sensitive fields in real production records, for example replacing the middle digits of a phone number or swapping last names between rows. The original structure and statistics survive, which is why teams reach for it in testing. But the dataset is still derived from real people. Static masking that only blanks obvious identifiers leaves quasi-identifiers behind: Latanya Sweeney's analysis of 1990 US Census data found that 87.1 percent of the US population was likely to be uniquely identifiable from 5-digit ZIP code, date of birth, and sex alone [sweeney]. Unless masking is irreversible and rigorous, the output stays personal data.
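To make the quasi-identifier problem concrete, here is a minimal Python sketch that counts how many rows in a masked extract are still unique on ZIP code, date of birth, and sex. The rows, field names, and the `singled_out_rows` helper are hypothetical and invented for illustration; a real assessment would run against your own schema and consider auxiliary data as well.

```python
from collections import Counter

# Hypothetical masked extract: names removed, quasi-identifiers left intact.
masked_rows = [
    {"zip": "02139", "dob": "1961-07-31", "sex": "F"},
    {"zip": "02139", "dob": "1961-07-31", "sex": "F"},
    {"zip": "02139", "dob": "1984-02-12", "sex": "M"},
    {"zip": "94107", "dob": "1990-11-03", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def singled_out_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose quasi-identifier combination appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if combos[tuple(row[k] for k in keys)] == 1]


unique = singled_out_rows(masked_rows)
print(f"{len(unique)} of {len(masked_rows)} rows are unique on {QUASI_IDENTIFIERS}")
```

Any row this check returns can be isolated without a name column, which is the singling-out risk described earlier. A count of zero does not prove anonymity on its own; it only rules out this one combination of fields.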

Pseudonymization: reversible by design

GDPR Article 4(5) defines pseudonymization as processing personal data so it can no longer be attributed to a specific person without additional information that is kept separately. The operative word is separately: a mapping key still exists. Because the link can be restored, pseudonymized data stays fully in scope as personal data [gdpr-art4]. Article 32 encourages pseudonymization as a security measure, but it reduces risk rather than removing obligations. NIST draws the same line in SP 800-122 (April 2010), treating de-identification as effective only when the re-identification mapping is destroyed or held under controls that make linkage infeasible [nist-800122].
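The reversibility point can be illustrated with a short Python sketch. It uses a keyed HMAC as one common way to derive tokens; this is an assumption for illustration, not a method prescribed by GDPR or NIST. The `pseudonymize` helper, the key handling, and the sample record are hypothetical.

```python
import hashlib
import hmac
import secrets

# Secret held separately from the dataset: the "additional information" of Art. 4(5).
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable token from a real identifier using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical source record (the address uses a reserved example domain).
record = {"email": "jane@example.com", "plan": "pro"}
token = pseudonymize(record["email"])
pseudonymized = {"customer_token": token, "plan": record["plan"]}

# Anyone holding the key can recompute the token for a known person and re-link.
relinked = pseudonymize("jane@example.com") == pseudonymized["customer_token"]
print(relinked)  # True
```

Because the same key always yields the same token for the same person, the link can be restored on demand. That property is what makes pseudonymization useful for internal processing and also what keeps the output in scope as personal data.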

Synthetic data: generated, not derived

Fully synthetic data is produced by an algorithm to mimic the format and distribution of real data without copying any individual record. A synthetic customer table has names, addresses, and identifiers that look valid and pass validation logic, yet each row corresponds to no actual person. With no link to an identifiable individual and re-identification not reasonably likely, the data is anonymous and falls outside GDPR. The distinction that decides everything is derived versus generated: a value computed from a real person's record inherits that person's identifiability; a value drawn from independent rules and reserved ranges does not.

What does GDPR Recital 26 actually say about anonymous data?

Recital 26 states that the principles of data protection do not apply to anonymous information, defined as data that does not relate to an identified or identifiable person, or to personal data rendered anonymous so the data subject is no longer identifiable [gdpr-rec26]. Anonymous data is out of GDPR scope entirely. Synthetic-from-scratch data qualifies because no original subject ever existed to re-identify.

The same recital sets a demanding test. To decide whether a person is identifiable, you must account for all the means reasonably likely to be used, by the controller or by another party, to identify the person directly or indirectly, weighing objective factors including cost, time, and available technology. Deleting a name column is not enough. The UK ICO calls this the spectrum of identifiability and warns that re-identification risk must be judged against motivated third parties and auxiliary data, not the dataset in isolation [ico-anon].

Anonymisation is the process of rendering personal data anonymous so that an individual is not (or is no longer) identifiable. Where data has been successfully anonymised, it is no longer personal data and data protection law does not apply.
UK Information Commissioner's Office, Anonymisation guidance

Generated-from-scratch synthetic data turns the Recital 26 analysis into a one-liner: the data never related to a real person, so there is no subject to re-identify. The Court of Justice of the EU confirmed in *Breyer* (Case C-582/14, 19 October 2016) that identifiability hinges on whether the means to re-link exist and are legally and practically available, not on whether you personally hold the key [cjeu-breyer]. Remove every path back to a real subject and the dataset qualifies as anonymous information, outside the regulation's scope.

How do masking, pseudonymization, and synthetic data compare?

Masking keeps real data with medium-to-high re-identification risk; pseudonymization stays fully in GDPR scope because its key can re-link records; fully synthetic data sits outside scope when truly anonymous and exposes no real people in a breach. Pick by one question: do you ever need to re-link to a real person?

| Dimension | Real-data masking | Pseudonymization | Fully synthetic (generated) |
| --- | --- | --- | --- |
| Source of data | Real records, fields obscured | Real records, IDs tokenized | Generated from rules/models |
| Re-identification risk | Medium-high (quasi-identifiers persist) | High (key can re-link) | Very low when generated from scratch |
| Reversible? | Sometimes (shuffling, format-preserving) | Yes, by design (separate key) | No original to reverse to |
| GDPR status | Usually still personal data | Personal data (Art. 4(5)) | Outside scope if truly anonymous |
| Data utility for testing | High (real distributions) | High (structure intact) | High to medium (depends on generator) |
| Compliance burden | Lawful basis, DPIA likely | Full GDPR obligations apply | Minimal once anonymity verified |
| Breach impact | Real people exposed | Real people exposed if key leaks | No real people exposed |
| Best fit | One-off analytics on masked extracts | Internal processing needing relink | Dev, QA, staging, demos, load tests |

Decision matrix: choose by whether you ever need to relink to a real person.

How do you generate test identifiers that pass validation but match nobody?

Use reserved or sandbox ranges that satisfy each standard's format and check-digit rules while being defined as never-issued or test-only. A Stripe test card like 4242 4242 4242 4242 passes the Luhn check; SSN area numbers 900-999, 000, and 666 are never allocated; phone numbers 555-0100 through 555-0199 are reserved for fiction. These values clear validation without colliding with a real person or account.

| Identifier | Governing standard | Checksum / format rule | Reserved test space we use |
| --- | --- | --- | --- |
| Credit card number | ISO/IEC 7812 (issuer ID + Luhn) | Luhn mod-10 check digit | Stripe / processor test BINs (e.g. 4242 4242 4242 4242) |
| US Social Security Number | SSA numbering scheme | Area-group-serial structure | Never-allocated ranges (900-999, 000, 666 area) |
| IBAN | ISO 13616 | Mod-97 check digits | Documentation-only country/bank codes |
| Phone number (US) | NANP | Valid area code + exchange | 555-0100 to 555-0199 fictional block |
| Email / domain | RFC 2606 | Syntactically valid address | example.com reserved domain / .test reserved TLD |

Reserved ranges let synthetic identifiers pass format and checksum checks while matching no real allocation.
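The checksum and range rules in the table above can be sketched in a few lines of Python. The function names are hypothetical, and the checks cover only the check-digit or range logic, not every rule a production validator enforces (card length, IBAN country-specific lengths, and so on).

```python
def luhn_valid(number: str) -> bool:
    """Luhn mod-10 check: double every second digit from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def ssn_area_never_issued(ssn: str) -> bool:
    """True when the area number (first three digits) is 000, 666, or 900-999."""
    area = int(ssn.split("-")[0])
    return area == 0 or area == 666 or 900 <= area <= 999


print(luhn_valid("4242 4242 4242 4242"))     # True
print(ssn_area_never_issued("900-12-3456"))  # True
print(ssn_area_never_issued("123-45-6789"))  # False
```

A test value is useful only if it passes the first kind of check (format and checksum) and, where a reserved range exists, also falls inside it.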

How should you use synthetic data in non-production environments?

Stop copying production data into lower environments and seed them with synthetic data instead. Replicating personal data into staging or test is itself processing under GDPR, expands your breach surface, and often triggers a Data Protection Impact Assessment under Article 35. Synthetic data removes the personal-data problem at the source, across four common workloads:

  • Local and CI test fixtures: seed deterministic synthetic users so tests are reproducible and contain zero real PII.
  • Load and performance testing: generate millions of synthetic accounts to size infrastructure without exporting customer records.
  • Demos and sales sandboxes: populate environments with believable but fictional profiles, addresses, and identifiers.
  • Schema and integration testing: confirm downstream systems accept correctly formatted identifiers using structurally valid test values.
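For the fixture use case, a minimal Python sketch of deterministic seeding might look like the following. The name lists, field names, and the `make_fixtures` helper are hypothetical; the reserved ranges are the ones listed earlier in this guide.

```python
import random

# Hypothetical name pools; any fictional lists would do.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan", "Quinn"]
LAST_NAMES = ["Testcase", "Sample", "Fixture", "Placeholder", "Mockford"]


def make_user(rng: random.Random, index: int) -> dict:
    """Build one synthetic user from reserved ranges only."""
    area = rng.randint(900, 999)   # SSN area numbers that are never allocated
    # Assumption: group kept to 01-49 to stay clear of IRS ITINs,
    # which also begin with 9 but use higher middle digits.
    group = rng.randint(1, 49)
    serial = rng.randint(1, 9999)
    line = rng.randint(100, 199)   # 555-0100 to 555-0199 fictional block
    return {
        "id": index,
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "email": f"user{index}@example.com",  # RFC 2606 reserved domain
        "ssn": f"{area:03d}-{group:02d}-{serial:04d}",
        "phone": f"555-{line:04d}",
    }


def make_fixtures(count: int, seed: int = 42) -> list:
    """Same seed in, same users out (for a given Python version)."""
    rng = random.Random(seed)
    return [make_user(rng, i) for i in range(count)]


users = make_fixtures(3)
print(users[0]["email"])  # user0@example.com
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps the fixtures reproducible even when other test code also draws random numbers.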

How do you migrate off cloned production data?

Cut compliance scope environment by environment rather than in one risky switch. This five-step sequence moves a team from production dumps to synthetic fixtures without breaking pipelines mid-flight:

  1. Inventory the PII flowing into each non-prod environment and tag every personal-data field by category (direct identifier, quasi-identifier, special category).
  2. Pin seed data in CI to deterministic synthetic fixtures so build pipelines stop touching real records.
  3. Generate volume data for load and staging environments from reserved ranges, matching production cardinality and distribution shape.
  4. Cut over staging, revoke its access to production replicas, and document the Recital 26 re-identification assessment for the synthetic set.
  5. Audit and lock: add a CI check that fails the build if a known production value pattern appears in a fixture.
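Step 5 can be sketched as a small Python check that scans fixture files for values that look like real allocations. The patterns, the `find_violations` and `main` names, and the choice of what counts as suspicious are assumptions for illustration; a real check would be tuned to the value patterns found in your own production data.

```python
import re
from pathlib import Path

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@([\w-]+(?:\.[\w-]+)+)\b")

# RFC 2606 reserved second-level domains and top-level domains.
RESERVED_DOMAINS = ("example.com", "example.net", "example.org")
RESERVED_SUFFIXES = (".test", ".example", ".invalid",
                     ".example.com", ".example.net", ".example.org")


def find_violations(text: str) -> list:
    """Return a message for every value that falls outside the reserved ranges."""
    problems = []
    for match in SSN_RE.finditer(text):
        area = int(match.group(1))
        if not (area == 0 or area == 666 or area >= 900):
            problems.append(f"SSN-shaped value in an issuable range: {match.group(0)}")
    for match in EMAIL_RE.finditer(text):
        domain = match.group(1).lower()
        if domain not in RESERVED_DOMAINS and not domain.endswith(RESERVED_SUFFIXES):
            problems.append(f"email on a non-reserved domain: {match.group(0)}")
    return problems


def main(paths) -> int:
    """Scan each fixture file; return 1 if anything suspicious was found."""
    failures = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for problem in find_violations(text):
            print(f"{path}: {problem}")
            failures += 1
    return 1 if failures else 0


print(find_violations('{"ssn": "900-12-3456", "email": "user1@example.com"}'))  # []
```

In a pipeline, a small entry point could call `sys.exit(main(sys.argv[1:]))` so that a non-zero return value fails the build.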

A quick decision rule for GDPR and CCPA

If you ever need to re-link records to real customers, use pseudonymization and keep full GDPR obligations. If you only need realistic data and never need to identify a real person, generate synthetic data, aim for true anonymity, and document the Recital 26 re-identification assessment. CCPA/CPRA reaches the same result: de-identified and synthetic data the business cannot reasonably re-identify is excluded from personal information, provided you commit to keeping it de-identified [ccpa-deid].

| Concept | GDPR (EU/UK) | CCPA / CPRA (California) | Effect on synthetic-from-scratch data |
| --- | --- | --- | --- |
| In-scope data term | Personal data (Art. 4(1)) | Personal information (Cal. Civ. Code 1798.140) | Out of scope: no link to a natural person / consumer |
| Out-of-scope category | Anonymous information (Recital 26) | De-identified / aggregate information | Synthetic data fits this category when truly anonymous |
| Reversible-token term | Pseudonymisation (Art. 4(5)) - still in scope | Pseudonymized info - still personal information | Not relevant: no token maps back to a real subject |
| Standing obligation | Re-ID assessment of all reasonable means | Commit and act to keep data de-identified | Document that values are generated, not derived |
| Trigger for impact assessment | DPIA (Art. 35) for high-risk processing | Risk assessment under CPRA regulations | Avoided when no personal data is processed |

Mapping GDPR and CCPA/CPRA terms: synthetic data generated from scratch stays out of scope under both.

Start from our synthetic data generator to produce datasets for your non-production environments, browse country-specific formats, and treat the comparison table above as your default policy: synthetic by default, pseudonymization only when re-linking is a genuine requirement, and raw production data confined to production.

References & sources

  1. [gdpr-rec26] Recital 26 - Not Applicable to Anonymous Data. gdpr-info.eu
  2. [gdpr-art4] Article 4 - Definitions (personal data, pseudonymisation). gdpr-info.eu
  3. [edpb-anon] Opinion 05/2014 on Anonymisation Techniques. Article 29 Working Party / EDPB
  4. [ico-anon] Anonymisation, pseudonymisation and privacy enhancing technologies guidance. UK Information Commissioner's Office
  5. [ccpa-deid] California Consumer Privacy Act (CCPA). California Attorney General
  6. [nist-800122] SP 800-122: Guide to Protecting the Confidentiality of PII. NIST
  7. [sweeney] Simple Demographics Often Identify People Uniquely. L. Sweeney, Carnegie Mellon University
  8. [cjeu-breyer] Case C-582/14 Patrick Breyer v Bundesrepublik Deutschland. Court of Justice of the European Union

Frequently asked questions

Is synthetic data considered personal data under GDPR?

Synthetic data generated independently, with no way to link it to an identifiable living person, is not personal data, so GDPR does not apply. The caveat: synthetic datasets trained on real records can leak attributes, and if individuals can still be singled out or re-identified, the data may remain personal. Values generated from scratch, like those from our generator, carry no link to a real person.

What is the difference between anonymization and pseudonymization under GDPR?

Pseudonymized data replaces identifiers with tokens but keeps a key that can re-link records to people, so GDPR Article 4(5) still treats it as personal data. Anonymized data is irreversibly stripped of any link to an individual, and under Recital 26 it falls outside GDPR. The test is whether re-identification is possible using all the means reasonably likely to be used.

Does GDPR Recital 26 mean anonymous data is exempt?

Yes. Recital 26 states that data protection principles do not apply to anonymous information or to personal data rendered anonymous so the data subject is no longer identifiable. The bar is high: you must weigh all means reasonably likely to be used to re-identify, including by third parties, accounting for cost, time, and available technology.

Can I use production data in my staging or test environment?

Copying production personal data into non-production environments is processing under GDPR and usually requires a lawful basis, access controls, and often a DPIA under Article 35. It also widens your breach surface. Most teams avoid this by generating synthetic data for dev, QA, and staging, so no real personal data ever leaves production.

Does CCPA treat de-identified and synthetic data differently from GDPR?

The CCPA/CPRA excludes de-identified and aggregate consumer information from personal information, provided the business cannot reasonably re-identify it and commits to keeping it de-identified. Synthetic data with no link to a real consumer sits outside CCPA scope, mirroring the GDPR position on anonymous data.

Is generating fake names and SSNs for testing legal?

Generating fictional names, addresses, and structurally valid test identifiers for software testing, QA, and privacy work is legal and common. It becomes illegal when used to impersonate a real person or commit fraud. Reputable generators use never-issued or reserved ranges and sandbox test numbers, so values pass format checks without matching a real person.
