Synthetic Data for GDPR Compliance: Anonymization Done Right

How synthetic data, anonymization, and pseudonymization differ under GDPR, what Recital 26 means for test data, and when each removes compliance scope.

By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read

Fully synthetic data generated from scratch is not personal data under GDPR, so it sits outside the regulation's scope and keeps real people out of your test environments. The legal pivot is simple: if a record cannot be tied back to a living person by any means reasonably likely to be used, GDPR does not apply to it (Recital 26). This guide explains where that line sits, why anonymization and pseudonymization are not the same thing, and how to choose between masking, pseudonymization, and synthetic data.

What is the difference between masking, pseudonymization, and synthetic data?

Masking obscures fields in real records, pseudonymization swaps identifiers for tokens while keeping a re-link key, and synthetic data is generated from scratch with no source record at all. GDPR scope turns on one question: can the data be tied back to a real person? Only synthetic-from-scratch reliably answers no.

GDPR applies to personal data, defined in Article 4(1) as any information relating to an identified or identifiable natural person ^[gdpr-art4]. The Article 29 Working Party, the predecessor of the European Data Protection Board, framed re-identification in Opinion 05/2014 around three failure modes you must rule out before calling data anonymous: singling out (isolating one person's record), linkability (connecting two records about the same person), and inference (deducing an attribute with high probability) ^[edpb-anon].

Failure mode (Art. 29 WP / EDPB)	Question it asks	Example that fails the test	Why synthetic-from-scratch passes
Singling out	Can one individual be isolated in the set?	A masked row with rare ZIP + birth date still uniquely identifies one person	No row maps to a real person, so isolation reveals nobody
Linkability	Can two records about the same person be linked?	Pseudonym token re-links across two leaked tables	Generated values share no key with any production record
Inference	Can an attribute be deduced with high confidence?	Model-trained synthetic data memorizes a rare salary outlier	Rule-based generation never observes a real attribute to leak

The three re-identification risks the EDPB requires you to eliminate before calling data anonymous.

Data masking: real data, obscured

Masking overwrites or shuffles sensitive fields in real production records, for example replacing the middle digits of a phone number or swapping last names between rows. The original structure and statistics survive, which is why teams reach for it in testing. But the dataset is still derived from real people. Static masking that only blanks obvious identifiers leaves quasi-identifiers behind: Latanya Sweeney's analysis of 1990 US Census data found that 87.1 percent of the US population was likely to be uniquely identifiable from five-digit ZIP code, date of birth, and sex alone ^[sweeney]. Unless masking is irreversible and also removes or generalizes those quasi-identifiers, the output stays personal data.
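A quick way to see the quasi-identifier problem is to count how many rows in a masked extract are unique on ZIP code, date of birth, and sex. The sketch below is illustrative only: the rows are invented, the field names (zip, dob, sex) and the helper unique_fraction are hypothetical, and a low score does not by itself prove anonymity.

```python
from collections import Counter


def unique_fraction(rows, quasi_identifiers):
    """Return the share of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[field] for field in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    singled_out = sum(1 for key in keys if counts[key] == 1)
    return singled_out / len(rows)


# Invented rows standing in for a "masked" extract: names are blanked,
# but the quasi-identifiers are left untouched.
masked_rows = [
    {"name": "XXXX", "zip": "02139", "dob": "1961-07-14", "sex": "F"},
    {"name": "XXXX", "zip": "02139", "dob": "1961-07-14", "sex": "F"},
    {"name": "XXXX", "zip": "02139", "dob": "1984-02-02", "sex": "M"},
    {"name": "XXXX", "zip": "94110", "dob": "1990-11-30", "sex": "F"},
]

fraction = unique_fraction(masked_rows, ["zip", "dob", "sex"])
print(fraction)  # 0.5: two of the four rows can be singled out
```

Any row counted here is a candidate for singling out once an attacker holds an auxiliary dataset with the same three fields.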

Pseudonymization: reversible by design

GDPR Article 4(5) defines pseudonymization as processing personal data so it can no longer be attributed to a specific person without additional information that is kept separately. The operative word is separately: a mapping key still exists. Because the link can be restored, pseudonymized data stays fully in scope as personal data ^[gdpr-art4]. Article 32 encourages pseudonymization as a security measure, but it reduces risk rather than removing obligations. NIST draws the same line in SP 800-122 (April 2010), treating de-identification as effective only when the re-identification mapping is destroyed or held under controls that make linkage infeasible ^[nist-800122].
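To make "reversible by design" concrete, here is a minimal sketch of keyed pseudonymization using Python's standard hmac module. The function names, the 16-character token length, and the sample addresses are assumptions for illustration, not a recommended production scheme.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed token (HMAC-SHA256, truncated for readability)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def relink(token: str, candidates, key: bytes):
    """Anyone holding the key can recompute tokens and match them back to identifiers."""
    for candidate in candidates:
        if hmac.compare_digest(pseudonymize(candidate, key), token):
            return candidate
    return None


# The key plays the role of the "additional information kept separately".
key = secrets.token_bytes(32)
emails = ["alice@example.com", "bob@example.com"]
tokens = {email: pseudonymize(email, key) for email in emails}

# With the key, the token resolves back to the original identifier.
recovered = relink(tokens["alice@example.com"], emails, key)
print(recovered)
```

The token table on its own looks anonymous, but the key restores the link, which is the property that keeps pseudonymized data inside GDPR scope.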

Synthetic data: generated, not derived

Fully synthetic data is produced by an algorithm to mimic the format and distribution of real data without copying any individual record. A synthetic customer table has names, addresses, and identifiers that look valid and pass validation logic, yet each row corresponds to no actual person. With no link to an identifiable individual and re-identification not reasonably likely, the data is anonymous and falls outside GDPR. The distinction that decides everything is derived versus generated: a value computed from a real person's record inherits that person's identifiability; a value drawn from independent rules and reserved ranges does not.

What does GDPR Recital 26 actually say about anonymous data?

Recital 26 states that the principles of data protection do not apply to anonymous information, defined as data that does not relate to an identified or identifiable person, or to personal data rendered anonymous so the data subject is no longer identifiable ^[gdpr-rec26]. Anonymous data is out of GDPR scope entirely. Synthetic-from-scratch data qualifies because no original subject ever existed to re-identify.

The same recital sets a demanding test. To decide whether a person is identifiable, you must account for all the means reasonably likely to be used, by the controller or by another party, to identify the person directly or indirectly, weighing objective factors including cost, time, and available technology. Deleting a name column is not enough. The UK ICO calls this the spectrum of identifiability and warns that re-identification risk must be judged against motivated third parties and auxiliary data, not the dataset in isolation ^[ico-anon].

Anonymisation is the process of rendering personal data anonymous so that an individual is not (or is no longer) identifiable. Where data has been successfully anonymised, it is no longer personal data and data protection law does not apply.
— UK Information Commissioner's Office, Anonymisation guidance

Generated-from-scratch synthetic data turns the Recital 26 analysis into a one-liner: the data never related to a real person, so there is no subject to re-identify. The Court of Justice of the EU confirmed in Breyer (Case C-582/14, 19 October 2016) that identifiability hinges on whether the means to re-link exist and are legally and practically available, not on whether you personally hold the key ^[cjeu-breyer]. Remove every path back to a real subject and the dataset stops being personal data, so GDPR obligations no longer attach to it.

How do masking, pseudonymization, and synthetic data compare?

Masking keeps real data with medium-to-high re-identification risk; pseudonymization stays fully in GDPR scope because its key can re-link records; fully synthetic data sits outside scope when truly anonymous and exposes no real people in a breach. Pick by one question: do you ever need to re-link to a real person?

Dimension	Real-data masking	Pseudonymization	Fully synthetic (generated)
Source of data	Real records, fields obscured	Real records, IDs tokenized	Generated from rules/models
Re-identification risk	Medium-high (quasi-identifiers persist)	High (key can re-link)	Very low when generated from scratch
Reversible?	Sometimes (shuffling, format-preserving)	Yes, by design (separate key)	No original to reverse to
GDPR status	Usually still personal data	Personal data (Art. 4(5))	Outside scope if truly anonymous
Data utility for testing	High (real distributions)	High (structure intact)	High to medium (depends on generator)
Compliance burden	Lawful basis, DPIA likely	Full GDPR obligations apply	Minimal once anonymity verified
Breach impact	Real people exposed	Real people exposed if key leaks	No real people exposed
Best fit	One-off analytics on masked extracts	Internal processing needing relink	Dev, QA, staging, demos, load tests

Decision matrix: choose by whether you ever need to relink to a real person.

How do you generate test identifiers that pass validation but match nobody?

Use reserved or sandbox ranges that satisfy each standard's format and check-digit rules while being defined as never-issued or test-only. A Stripe test card like 4242 4242 4242 4242 passes the Luhn check; SSN area numbers 900-999, 000, and 666 are never allocated; phone numbers 555-0100 through 555-0199 are reserved for fiction. These values clear validation without colliding with a real person or account.
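Both checksum rules named in this section are short enough to sketch. The Python below is an illustrative implementation of the Luhn mod-10 check and the ISO 13616 mod-97 check, with hypothetical function names. It verifies format only and says nothing about whether a number has ever been issued. The sample IBAN is the widely published documentation example, not an account we know anything about.

```python
def luhn_valid(number: str) -> bool:
    """Luhn mod-10 check: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map A-Z to 10-35, test mod 97."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4242 4242 4242 4242"))          # True
print(luhn_valid("4242 4242 4242 4241"))          # False
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
```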

Identifier	Governing standard	Checksum / format rule	Reserved test space we use
Credit card number	ISO/IEC 7812 (issuer ID + Luhn)	Luhn mod-10 check digit	Stripe / processor test BINs (e.g. 4242 4242 4242 4242)
US Social Security Number	SSA numbering scheme	Area-group-serial structure	Never-allocated ranges (900-999, 000, 666 area)
IBAN	ISO 13616	Mod-97 check digits	Documentation-only country/bank codes
Phone number (US)	NANP	Valid area code + exchange	555-0100 to 555-0199 fictional block
Email / domain	RFC 2606	Syntactically valid address	example.com reserved domain / .test reserved TLD

Reserved ranges let synthetic identifiers pass format and checksum checks while matching no real allocation.
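As a rough sketch of how the reserved spaces in the table translate into a generator, the Python below draws an SSN-shaped value from the never-allocated area numbers, a phone number from the 555-0100 to 555-0199 block, and an address on example.com. All function names are hypothetical, and the ranges are taken from this article rather than re-verified against each standard.

```python
import random

# Area numbers this article lists as never allocated: 000, 666, and 900-999.
RESERVED_SSN_AREAS = [0, 666] + list(range(900, 1000))


def reserved_ssn(rng: random.Random) -> str:
    """SSN-shaped string (AAA-GG-SSSS) whose area number sits in a never-allocated range."""
    area = rng.choice(RESERVED_SSN_AREAS)
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


def fictional_phone(rng: random.Random) -> str:
    """Seven-digit number inside the 555-0100 to 555-0199 fictional block."""
    return f"555-01{rng.randint(0, 99):02d}"


def reserved_email(rng: random.Random) -> str:
    """Syntactically valid address on the example.com reserved domain."""
    return f"user{rng.randint(1, 99999)}@example.com"


rng = random.Random(7)  # fixed seed only so the demo is repeatable
record = {
    "ssn": reserved_ssn(rng),
    "phone": fictional_phone(rng),
    "email": reserved_email(rng),
}
print(record)
```

Every value is drawn from rules and reserved ranges, never from a production row, which is the "generated, not derived" property the legal analysis depends on.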

How should you use synthetic data in non-production environments?

Stop copying production data into lower environments and seed them with synthetic data instead. Replicating personal data into staging or test is itself processing under GDPR, expands your breach surface, and often triggers a Data Protection Impact Assessment under Article 35. Synthetic data removes the personal-data problem at the source, across four common workloads:

Local and CI test fixtures: seed deterministic synthetic users so tests are reproducible and contain zero real PII.
Load and performance testing: generate millions of synthetic accounts to size infrastructure without exporting customer records.
Demos and sales sandboxes: populate environments with believable but fictional profiles, addresses, and identifiers.
Schema and integration testing: confirm downstream systems accept correctly formatted identifiers using structurally valid test values.
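For the first workload, deterministic fixtures can be as simple as a seeded random generator. This is a minimal sketch with hypothetical names (make_users, the name lists, the field layout), and it assumes the same Python version across machines, since the standard library does not promise identical sequences for every method across releases.

```python
import random

# Obviously fictional name parts, chosen for illustration.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan"]
LAST_NAMES = ["Testcase", "Sample", "Fixture", "Placeholder"]


def make_users(seed: int, count: int) -> list:
    """Build `count` synthetic users; the same seed yields the same list."""
    rng = random.Random(seed)  # private generator, independent of global random state
    users = []
    for user_id in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append({
            "id": user_id,
            "name": f"{first} {last}",
            "email": f"{first}.{last}.{user_id}@example.com".lower(),
            "phone": f"555-01{rng.randint(0, 99):02d}",
        })
    return users


fixtures = make_users(seed=42, count=3)
for user in fixtures:
    print(user)
```

Committing the seed rather than the generated rows keeps fixtures small, and raising the count turns the same function into a volume generator for load tests.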

How do you migrate off cloned production data?

Cut compliance scope environment by environment rather than in one risky switch. This five-step sequence moves a team from production dumps to synthetic fixtures without breaking pipelines mid-flight:

1. Inventory the PII flowing into each non-prod environment and tag every personal-data field by category (direct identifier, quasi-identifier, special category).
2. Pin seed data in CI to deterministic synthetic fixtures so build pipelines stop touching real records.
3. Generate volume data for load and staging environments from reserved ranges, matching production cardinality and distribution shape.
4. Cut over staging, revoke its access to production replicas, and document the Recital 26 re-identification assessment for the synthetic set.
5. Audit and lock: add a CI check that fails the build if a known production value pattern appears in a fixture.
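The final audit step could look something like the sketch below, which scans fixture text for SSN-shaped values outside the reserved area numbers and for email addresses on non-reserved domains. The patterns, the allow-list, and the function names are assumptions for illustration. A real check would need patterns tuned to your own production data.

```python
import re

SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@([\w-]+(?:\.[\w-]+)+)\b")
ALLOWED_DOMAINS = {"example.com", "example.org", "example.net"}


def ssn_area_is_reserved(area: str) -> bool:
    """True for the never-allocated area numbers this article lists: 000, 666, 900-999."""
    value = int(area)
    return value == 0 or value == 666 or 900 <= value <= 999


def find_violations(text: str) -> list:
    """Return a description of every value that looks like it came from production."""
    violations = []
    for match in SSN_PATTERN.finditer(text):
        if not ssn_area_is_reserved(match.group(1)):
            violations.append(f"SSN outside reserved ranges: {match.group(0)}")
    for match in EMAIL_PATTERN.finditer(text):
        domain = match.group(1).lower()
        if domain not in ALLOWED_DOMAINS and not domain.endswith(".test"):
            violations.append(f"email on a non-reserved domain: {match.group(0)}")
    return violations


# Made-up fixture contents: the first uses reserved values, the second does not.
clean_fixture = '{"ssn": "900-12-3456", "email": "qa.user@example.com"}'
dirty_fixture = '{"ssn": "123-45-6789", "email": "ops@crm.prod.internal"}'

print(find_violations(clean_fixture))  # []
print(len(find_violations(dirty_fixture)))  # 2
```

In a pipeline, a small wrapper would read each fixture file, call find_violations, and exit with a non-zero status when the list is non-empty so the build fails.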

A quick decision rule for GDPR and CCPA

If you ever need to re-link records to real customers, use pseudonymization and keep full GDPR obligations. If you only need realistic data and never need to identify a real person, generate synthetic data, aim for true anonymity, and document the Recital 26 re-identification assessment. CCPA/CPRA reaches the same result: de-identified and synthetic data the business cannot reasonably re-identify is excluded from personal information, provided you commit to keeping it de-identified ^[ccpa-deid].

Concept	GDPR (EU/UK)	CCPA / CPRA (California)	Effect on synthetic-from-scratch data
In-scope data term	Personal data (Art. 4(1))	Personal information (Cal. Civ. Code 1798.140)	Out of scope: no link to a natural person / consumer
Out-of-scope category	Anonymous information (Recital 26)	De-identified / aggregate information	Synthetic data fits this category when truly anonymous
Reversible-token term	Pseudonymisation (Art. 4(5)) - still in scope	Pseudonymized info - still personal information	Not relevant: no token maps back to a real subject
Standing obligation	Re-ID assessment of all reasonable means	Commit and act to keep data de-identified	Document that values are generated, not derived
Trigger for impact assessment	DPIA (Art. 35) for high-risk processing	Risk assessment under CPRA regulations	Avoided when no personal data is processed

Mapping GDPR and CCPA/CPRA terms: synthetic data generated from scratch stays out of scope under both.

Start from our synthetic data generator to produce datasets for your non-production environments, browse country-specific formats, and treat the comparison table above as your default policy: synthetic by default, pseudonymization only when re-linking is a genuine requirement, and raw production data confined to production.

References & sources

Recital 26 - Not Applicable to Anonymous Data — gdpr-info.eu
Article 4 - Definitions (personal data, pseudonymisation) — gdpr-info.eu
Opinion 05/2014 on Anonymisation Techniques — Article 29 Working Party / EDPB
Anonymisation, pseudonymisation and privacy enhancing technologies guidance — UK Information Commissioner's Office
California Consumer Privacy Act (CCPA) — California Attorney General
SP 800-122: Guide to Protecting the Confidentiality of PII — NIST
Simple Demographics Often Identify People Uniquely — Carnegie Mellon University, L. Sweeney
Case C-582/14 Patrick Breyer v Bundesrepublik Deutschland — Court of Justice of the European Union

Frequently asked questions

Is synthetic data considered personal data under GDPR?

Synthetic data generated independently, with no way to link it to an identifiable living person, is not personal data, so GDPR does not apply. The caveat: synthetic datasets trained on real records can leak attributes, and if individuals can still be singled out or re-identified, the data may remain personal. Values generated from scratch, like those from our generator, carry no link to a real person.

What is the difference between anonymization and pseudonymization under GDPR?

Pseudonymized data replaces identifiers with tokens but keeps a key that can re-link records to people, so it remains personal data even though it meets the Article 4(5) definition of pseudonymization. Anonymized data is irreversibly stripped of any link to an individual, and under Recital 26 it falls outside GDPR. The test is whether a person can still be identified by any means reasonably likely to be used, by the controller or by anyone else.

Does GDPR Recital 26 mean anonymous data is exempt?

Yes. Recital 26 states that data protection principles do not apply to anonymous information or to personal data rendered anonymous so the data subject is no longer identifiable. The bar is high: you must weigh all means reasonably likely to be used to re-identify, including by third parties, accounting for cost, time, and available technology.

Can I use production data in my staging or test environment?

Copying production personal data into non-production environments is processing under GDPR and usually requires a lawful basis, access controls, and often a DPIA under Article 35. It also widens your breach surface. Most teams avoid this by generating synthetic data for dev, QA, and staging, so no real personal data ever leaves production.

Does CCPA treat de-identified and synthetic data differently from GDPR?

The CCPA/CPRA excludes de-identified and aggregate consumer information from personal information, provided the business cannot reasonably re-identify it and commits to keeping it de-identified. Synthetic data with no link to a real consumer sits outside CCPA scope, mirroring the GDPR position on anonymous data.

Is generating fake names and SSNs for testing legal?

Generating fictional names, addresses, and structurally valid test identifiers for software testing, QA, and privacy work is legal and common. It becomes illegal when used to impersonate a real person or commit fraud. Reputable generators use never-issued or reserved ranges and sandbox test numbers, so values pass format checks without matching a real person.