Data Masking vs Synthetic Data vs Tokenization for Non-Prod
Data masking vs synthetic data: comparing static/dynamic masking, tokenization, and format-preserving encryption on reversibility and compliance.
By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read
Lower environments leak. Test databases get cloned to laptops, shared with contractors, and snapshotted into CI logs. The question is not whether to protect data before it reaches non-production, but which technique to use. This guide compares the four practical options engineering and compliance teams weigh: static and dynamic data masking, tokenization, format-preserving encryption (FPE), and fully synthetic data generation built from never-issued and reserved ranges.
What is the difference between data masking, tokenization, and synthetic data?
Data masking rewrites real values in a copy so the result looks plausible but no longer reveals the original. Tokenization replaces a sensitive value with a surrogate token and stores the mapping in a vault, keeping it reversible. Synthetic data is generated from scratch and contains no real record at all, so there is nothing to reverse.
The dividing line that matters most for non-prod is reversibility: whether the link to a real person is broken by a one-way transform or merely hidden behind a key or vault you could in principle undo. The other line is origin. Masking, tokenization, and FPE all start from real production data and transform it; synthetic data never touches a real record, which is why it changes the compliance conversation.
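To make the vault idea concrete, here is a minimal Python sketch of vaulted tokenization. `TokenVault` and its method names are hypothetical, chosen for illustration, and the in-memory dictionaries stand in for what would be a hardened, access-controlled store in a real deployment.

```python
import secrets


class TokenVault:
    """Toy in-memory token vault (illustrative only, not production-grade)."""

    def __init__(self) -> None:
        self._value_to_token: dict[str, str] = {}
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the stored token so the same input always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random surrogate, not derived from the value
        while token in self._token_to_value:   # retry on the (unlikely) collision
            token = "tok_" + secrets.token_hex(8)
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Reversal works only for a caller that can read the vault mapping.
        return self._token_to_value[token]


vault = TokenVault()
t1 = vault.tokenize("jane@example.com")
t2 = vault.tokenize("jane@example.com")
original = vault.detokenize(t1)
```

The token itself carries no information about the input; everything reversible lives in the mapping, which is why the vault becomes the asset to protect.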
Static vs dynamic masking
Static data masking (SDM) transforms data once, at copy time, and writes the masked result to the non-production target. The originals never land downstream. Dynamic data masking (DDM) leaves the source intact and masks values on the fly in query results, deciding per user or role. SDM is the safer default for shared test environments because the real values are physically absent; DDM is a production access control, not a data-distribution technique.
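The contrast can be sketched in a few lines of Python. This is a simplified illustration, not how any particular database product implements masking: `mask_email`, `static_mask`, and `dynamic_read` are hypothetical helpers, and the role check is an assumption standing in for real access policy.

```python
import copy

PROD = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example.com"},
]


def mask_email(email: str) -> str:
    """One-way mask: keep the domain, hide the local part."""
    _, _, domain = email.partition("@")
    return "****@" + domain


def static_mask(rows):
    """Static masking: build a masked copy once; originals never reach the target."""
    masked = copy.deepcopy(rows)
    for row in masked:
        row["email"] = mask_email(row["email"])
    return masked


def dynamic_read(rows, role):
    """Dynamic masking: the source is untouched; masking is decided per query by role."""
    if role == "admin":
        return [dict(r) for r in rows]
    return [{**r, "email": mask_email(r["email"])} for r in rows]


nonprod_copy = static_mask(PROD)                   # safe to ship downstream
analyst_view = dynamic_read(PROD, role="analyst")  # masked at read time
admin_view = dynamic_read(PROD, role="admin")      # real values still reachable
```

Note that `nonprod_copy` contains no original local parts at all, while `PROD` still holds every real value regardless of what `dynamic_read` returns. This simple mask also collapses distinct emails on the same domain into one value, so it would not suit a join key.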
De-identification is not a single technique, but a collection of approaches, tools, and algorithms applied to different kinds of data with differing levels of effectiveness.
Which non-prod data protection technique should I use? (comparison table)
Pick by the properties you cannot give up. If you must reverse a value later, you need tokenization or FPE. If you want the link to real people gone, you need irreversible static masking or synthetic data. The table below maps each technique against the dimensions that decide non-prod fit, from reversibility to re-identification risk.
| Technique | Reversible? | Referential integrity | Realism of output | Re-identification risk | Performance cost | Best use |
|---|---|---|---|---|---|---|
| Static data masking (irreversible) | No (one-way) | Yes, if consistent masking is configured | High (looks like real records) | Low to medium (depends on residual quasi-identifiers) | One-time batch cost at copy | Shared dev/test/QA from a prod copy |
| Dynamic data masking | N/A (stored data is not transformed; source intact) | N/A (source unchanged) | High at query layer | High (real data still present underneath) | Per-query runtime overhead | Role-based prod read access, not data distribution |
| Tokenization (deterministic) | Yes (via vault) | Yes (same input maps to same token) | Medium (format may differ unless preserved) | Medium to high (vault compromise reverses it) | Vault lookup per tokenize/detokenize | PCI scope reduction where reversal is required |
| Format-preserving encryption (FF1/FF3-1) | Yes (via key) | Yes (deterministic for a fixed key/tweak) | High (same length and charset) | Medium to high (key compromise reverses it) | Per-value cipher computation | Legacy schemas needing same format and reversibility |
| Fully synthetic data | No original to reverse | Yes, if generated with relational constraints | High (statistically plausible, fabricated) | Lowest (no real individual represented) | Generation cost, no prod copy needed | Net-new test data, demos, open sharing, AI/ML test sets |
A worked sizing example: masking a 50-million-row customer table where 8 columns are sensitive means 50,000,000 x 8 = 400,000,000 cell transforms in a single batch job. Deterministic tokenization of one email column across 3 tables holding 12M, 4M, and 1M rows means 12M + 4M + 1M = 17,000,000 vault lookups, and every distinct email must resolve to the same token in all three tables to keep joins intact. Synthetic generation avoids both the copy and the vault entirely because it starts from zero rows and produces exactly what the test plan asks for.
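The same arithmetic as a short Python check, using only the figures from the example above (the variable names are illustrative):

```python
rows = 50_000_000
sensitive_columns = 8
cell_transforms = rows * sensitive_columns  # one transform per sensitive cell

email_rows_per_table = [12_000_000, 4_000_000, 1_000_000]
vault_lookups = sum(email_rows_per_table)   # one lookup per email occurrence

print(f"{cell_transforms:,} cell transforms")  # 400,000,000 cell transforms
print(f"{vault_lookups:,} vault lookups")      # 17,000,000 vault lookups
```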
How does tokenization vs masking differ for referential integrity?
Both can preserve referential integrity, but only when applied consistently. Deterministic tokenization and consistent masking guarantee the same input always produces the same output, so foreign keys and cross-table joins survive. Random, per-row substitution breaks joins because the same source value yields different outputs in different rows and tables.
- Deterministic tokenization: input X always maps to token T. A customer ID joined across orders, invoices, and shipments stays joinable after tokenizing because every occurrence becomes the same T.
- Consistent (deterministic) masking: the masking function is seeded so the same name or SSN maps to the same masked value everywhere, preserving distinct-count cardinality and join keys.
- Random masking/tokenization: each occurrence is independent, which destroys joins and changes cardinality. Acceptable only for free-text or non-key columns.
- FPE: deterministic for a fixed key and tweak, so it preserves joins like consistent masking while keeping the original length and character set.
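The difference between consistent and random substitution shows up as soon as two tables are masked independently and then joined. The sketch below uses a keyed HMAC as one possible consistent masking function; that choice, the `cust_` prefix, and the sample tables are assumptions for illustration, not a method prescribed by any standard cited here.

```python
import hashlib
import hmac
import secrets

# Assumption: a demo key. A real key would live in a KMS and be kept out of non-prod.
SECRET_KEY = b"demo-key-not-for-production"


def consistent_mask(value: str) -> str:
    """Keyed one-way pseudonym: same input and same key give the same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "cust_" + digest[:16]


def random_mask(_value: str) -> str:
    """Independent random surrogate per occurrence; ignores the input entirely."""
    return "cust_" + secrets.token_hex(8)


customers = [{"customer_id": "C-1001"}, {"customer_id": "C-1002"}]
orders = [
    {"customer_id": "C-1001", "total": 40},
    {"customer_id": "C-1001", "total": 15},
    {"customer_id": "C-1002", "total": 99},
]


def join_count(mask) -> int:
    """Mask both tables independently, then count orders that still match a customer."""
    masked_customers = {mask(c["customer_id"]) for c in customers}
    return sum(1 for o in orders if mask(o["customer_id"]) in masked_customers)


consistent_matches = join_count(consistent_mask)  # all three orders still join
random_matches = join_count(random_mask)          # joins are lost
```

One caveat on this design: anyone holding the key can recompute pseudonyms for guessed inputs, so for low-entropy fields the key should be treated as sensitive as the data itself.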
Which technique fits GDPR, HIPAA, and PCI DSS? (compliance mapping table)
Compliance fit turns on reversibility. Under GDPR, reversible techniques are pseudonymization and the data stays personal data; only genuinely irreversible de-identification can reach anonymization and fall outside the Regulation. HIPAA recognizes Safe Harbor and Expert Determination de-identification. PCI DSS lets tokenization and synthetic test data reduce or remove cardholder-data scope.
| Technique | GDPR classification | HIPAA fit | PCI DSS fit | Stays in regulatory scope? |
|---|---|---|---|---|
| Irreversible static masking | Anonymization if re-identification not reasonably likely (Recital 26); otherwise pseudonymization | Supports Safe Harbor / Expert Determination de-identification | Removes PAN from test data if no reversal path | No, if true anonymization is achieved |
| Dynamic masking | Access control, not anonymization; underlying data still personal | Access safeguard; data still PHI | Does not remove PAN from the store | Yes |
| Tokenization (vaulted) | Pseudonymization, Article 4(5); remains personal data | Limited Data Set / safeguard; PHI unless de-identified | Recognized for scope reduction (PCI tokenization guidance) | Yes (reversible) |
| Format-preserving encryption | Pseudonymization (encryption is reversible with the key) | Encryption safeguard; PHI unless de-identified | Strong cryptography per Req. 3; reduces but rarely removes scope | Yes (reversible) |
| Fully synthetic data | Not personal data if no real individual is represented | No PHI when generated from non-real values | No CHD when built from reserved test BINs | No |
GDPR Recital 26 is the anchor: the principles of data protection do not apply to anonymous information, and to decide whether someone is identifiable you must account for all the means reasonably likely to be used to identify them [gdpr-recital-26]. The word "reasonably" is why irreversible masking can qualify as anonymization while tokenization and FPE cannot: the vault or key is itself a reasonably likely means of reversal [gdpr-art4].
PCI and synthetic card numbers
PCI DSS v4.0 Requirement 6.5.6 directs teams to remove test data and test accounts from system components before they go into production, and the companion Requirement 6.5.5 bars live PANs from pre-production environments unless those environments are part of the cardholder data environment and protected accordingly [pci-dss-v4]. For payment-flow testing, synthetic numbers built from reserved test BINs and validated with the Luhn algorithm [luhn] let you exercise authorization, settlement, and decline paths without bringing real cardholder data into scope. Card brands and payment processors publish sandbox test card numbers for exactly this purpose.
The table below lists widely documented sandbox test cards. These numbers are reserved for testing, pass the Luhn checksum, and never correspond to a real cardholder account, so they are safe to commit to fixtures and CI [stripe-test-cards].
| Card brand | Test card number | Digits | CVC length | Notes |
|---|---|---|---|---|
| Visa | 4242 4242 4242 4242 | 16 | 3 | De facto standard sandbox Visa; any future expiry |
| Visa (debit) | 4000 0566 5566 5556 | 16 | 3 | Exercises debit-specific routing in test |
| Mastercard | 5555 5555 5555 4444 | 16 | 3 | Common approved-payment Mastercard test number |
| American Express | 3782 822463 10005 | 15 | 4 | 15-digit Amex format; 4-digit CID |
| Discover | 6011 1111 1111 1117 | 16 | 3 | Discover sandbox number |
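The Luhn check mentioned above is short enough to write out. The following Python sketch validates the five test numbers from the table; `luhn_valid` is an illustrative helper name, and passing the checksum says nothing about whether a number is issued, only that it is well-formed.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn (mod 10) checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


TEST_CARDS = {
    "Visa": "4242 4242 4242 4242",
    "Visa (debit)": "4000 0566 5566 5556",
    "Mastercard": "5555 5555 5555 4444",
    "American Express": "3782 822463 10005",
    "Discover": "6011 1111 1111 1117",
}

results = {brand: luhn_valid(number) for brand, number in TEST_CARDS.items()}
```

A fixture test can assert that every entry in `results` is true, which catches typos when card numbers are copied into test data by hand.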
When is synthetic data generation the better choice over masking?
Choose synthetic data when there is no clean production copy to mask, when you need volume or edge cases that production lacks, or when the dataset will be shared widely or used to train models. Because synthetic records represent no real person, the re-identification surface that masking leaves behind through residual quasi-identifiers largely disappears.
- No safe source copy exists. Greenfield products and pre-launch features have no production data to mask; generation is the only option.
- You need controlled edge cases. Leap-year dates, maximum-length names, rare locales, and high-cardinality stress sets are easier to fabricate than to find in production.
- Wide distribution. Open demos, partner sandboxes, and public datasets are safer with fabricated data than with a masked prod extract that may carry residual identifiers.
- ML and analytics test sets. Synthetic generators can reproduce statistical structure without copying individuals, though you should still test the output for memorized real values.
- Referential complexity. When you need dozens of linked tables to stay consistent, generating them together is often less error-prone than masking a live schema.
Open-source libraries such as Faker generate names, addresses, and identifiers across locales and are a common starting point for synthetic test data [faker]. For browser-based generation and linked multi-record exports, start at the identity generator and use the bulk export to produce consistent related rows. For the privacy-law angle on generated data, see our companion guide on synthetic data and GDPR anonymization.
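As a rough illustration of generating linked rows together, here is a standard-library Python sketch. It deliberately does not use Faker, so nothing in it reflects Faker's API; the name lists, field names, and `generate_dataset` function are all hypothetical. It draws contact details from reserved ranges: the `example.com` domain and the 555-0100 to 555-0199 phone block.

```python
import random

FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Testcase", "Sample", "Fixture", "Placeholder"]


def generate_dataset(n_customers: int, max_orders: int, seed: int = 42):
    """Generate customers and orders together so every foreign key resolves."""
    rng = random.Random(seed)  # seeded so fixtures are reproducible
    customers, orders = [], []
    for cid in range(1, n_customers + 1):
        first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
        customers.append({
            "customer_id": cid,
            "name": f"{first} {last}",
            # example.com is reserved for documentation use (RFC 2606).
            "email": f"{first}.{last}.{cid}@example.com".lower(),
            # 555-0100 through 555-0199 are reserved for fictional use in North America.
            "phone": f"555-01{rng.randint(0, 99):02d}",
        })
        for _ in range(rng.randint(0, max_orders)):
            orders.append({
                "order_id": len(orders) + 1,
                "customer_id": cid,  # always points at a customer generated above
                "amount_cents": rng.randint(100, 50_000),
            })
    return customers, orders


customers, orders = generate_dataset(n_customers=5, max_orders=3)
```

Because child rows are created inside the loop that creates their parent, referential integrity holds by construction rather than by a post-hoc repair step.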
What are the residual risks of each test data anonymization technique?
Every technique leaves something behind. Masking can leak through quasi-identifiers and unmasked free-text fields; tokenization and FPE are reversible and therefore only as strong as their vault or key; dynamic masking keeps the real data present. Synthetic data's main risk is leakage: a generator trained on real source records can memorize and reproduce them.
| Technique | Primary residual risk | Mitigating control |
|---|---|---|
| Static masking | Re-identification via residual quasi-identifiers (ZIP + DOB + gender) | Generalize or suppress quasi-identifiers; test against linkage attacks |
| Dynamic masking | Real data still present; bypass via direct DB access or privilege escalation | Lock down source access; treat as access control, not distribution |
| Tokenization | Vault compromise reverses all tokens at once | Isolate and harden the token vault; rotate; strict access logging |
| Format-preserving encryption | Key compromise reverses data; the original FF3 mode has published attacks | Use FF1 or FF3-1 per NIST SP 800-38G; manage keys in an HSM/KMS |
| Synthetic data | Generator memorizes and emits real source records (model leakage) | Scan output for verbatim source values; prefer reserved/never-issued ranges |
The k-anonymity model formalizes the quasi-identifier risk: a release is k-anonymous when every record is indistinguishable from at least k-1 others on its quasi-identifiers, which makes it a common yardstick for whether masked output resists linkage [k-anonymity]. NIST SP 800-188 walks through these de-identification trade-offs for real datasets and is a detailed engineering reference [nist-800-188].
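Measuring k for a masked extract is a simple grouping exercise. The sketch below is a minimal Python version under the definition just given; the sample rows, column names, and `k_anonymity` helper are made up for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


masked = [
    {"zip": "021**", "birth_year": 1980, "gender": "F"},
    {"zip": "021**", "birth_year": 1980, "gender": "F"},
    {"zip": "021**", "birth_year": 1975, "gender": "M"},
    {"zip": "021**", "birth_year": 1975, "gender": "M"},
    {"zip": "100**", "birth_year": 1990, "gender": "F"},  # unique combination
]

qi = ["zip", "birth_year", "gender"]
k = k_anonymity(masked, qi)                    # 1: the last row stands alone
k_without_unique = k_anonymity(masked[:4], qi)  # 2 once the unique row is suppressed
```

A result of 1 means at least one record is unique on its quasi-identifiers and is a candidate for further generalization or suppression before the extract is shared.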
References & sources
- [gdpr-recital-26] GDPR Recital 26, Not Applicable to Anonymous Data (gdpr-info.eu)
- [gdpr-art4] GDPR Article 4, Definitions, incl. pseudonymisation (gdpr-info.eu)
- [nist-800-188] NIST SP 800-188, De-Identifying Government Data Sets (NIST)
- [pci-dss-v4] PCI DSS v4.0.1 Requirements and Testing Procedures, Document Library (PCI Security Standards Council)
- [luhn] Luhn algorithm, mod 10 checksum (Wikipedia)
- [k-anonymity] k-anonymity privacy model (Wikipedia)
- [faker] Faker documentation (Faker)
- [stripe-test-cards] Stripe Documentation, Test card numbers (Stripe)
Frequently asked questions
Is data masking the same as anonymization under GDPR?
Not automatically. Irreversible masking can reach GDPR anonymization, which puts data outside the Regulation's scope, but only if re-identification is no longer reasonably likely. Reversible techniques such as tokenization and format-preserving encryption are pseudonymization under GDPR Article 4(5) and remain personal data.
Which technique has the lowest re-identification risk?
Fully synthetic data has the lowest risk because every record is generated and no original individual is represented. Among techniques that transform real data, tokenization and format-preserving encryption carry more residual risk than irreversible masking because they are reversible by design and the mapping or key can be compromised. Dynamic masking also sits at the high end, since the real data remains in place underneath.
Does tokenization preserve referential integrity?
Yes, when tokenization is deterministic. The same input value always maps to the same token, so foreign keys and joins across tables stay consistent. Random per-row tokenization breaks joins unless you store and reuse the mapping for every occurrence of a value.
Can I use synthetic data for PCI DSS test environments?
Yes. PCI DSS v4.0 Requirement 6.5.6 directs teams to remove test data and test accounts from system components before they go into production, and Requirement 6.5.5 bars live PANs from pre-production environments unless those environments are part of the cardholder data environment and protected accordingly. Synthetic card numbers built from reserved test BINs and validated with the Luhn check let you exercise payment flows without bringing real cardholder data into scope.
What is the difference between static and dynamic data masking?
Static data masking permanently rewrites values in a copied dataset before it lands in non-production, so the masked database never contains the originals. Dynamic data masking leaves the source unchanged and obscures values at query time based on the requesting user, so the real data still exists underneath.
Is format-preserving encryption approved by NIST?
Yes, with caveats. NIST Special Publication 800-38G (2016) specified the FF1 and FF3 modes for format-preserving encryption. After cryptanalysis exposed weaknesses in FF3, NIST's draft Revision 1 (2019) replaced it with the revised FF3-1 variant alongside FF1.