Data Masking vs Synthetic Data vs Tokenization for Non-Prod

Data masking vs synthetic data: comparing static/dynamic masking, tokenization, and format-preserving encryption on reversibility and compliance.

By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read

Lower environments leak. Test databases get cloned to laptops, shared with contractors, and snapshotted into CI logs. The question is not whether to protect data before it reaches non-production, but which technique to use. This guide compares the four practical options engineering and compliance teams weigh: static and dynamic data masking, tokenization, format-preserving encryption (FPE), and fully synthetic data generation built from never-issued and reserved ranges.

What is the difference between data masking, tokenization, and synthetic data?

Data masking rewrites real values in a copy so the result looks plausible but no longer reveals the original. Tokenization replaces a sensitive value with a surrogate token and stores the mapping in a vault, keeping it reversible. Synthetic data is generated from scratch and contains no real record at all, so there is nothing to reverse.

Two dividing lines matter most for non-prod. The first is the starting point: masking, tokenization, and FPE all start from real production data and transform it, while synthetic data never touches a real record, which is why it changes the compliance conversation. The second is reversibility, meaning how the link to a real person is broken: a one-way transform versus a key or vault that could in principle undo it.
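To make the reversibility point concrete, here is a minimal sketch of vaulted tokenization. The `TokenVault` class, its method names, and the `tok_` prefix are hypothetical and for illustration only; a real vault would be a hardened, access-controlled service rather than two in-memory dictionaries.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault: deterministic and reversible by design."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same input always maps to the same token.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        # Anyone with vault access can undo the transform.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("alice@example.com")
same_token = vault.tokenize("alice@example.com")
original = vault.detokenize(token)
```

The token itself reveals nothing, but the stored mapping is exactly the "vault you could in principle undo" described above. An irreversible mask or a synthetic record has no equivalent of `detokenize`.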

Static vs dynamic masking

Static data masking (SDM) transforms data once, at copy time, and writes the masked result to the non-production target. The originals never land downstream. Dynamic data masking (DDM) leaves the source intact and masks values on the fly in query results, deciding per user or role. SDM is the safer default for shared test environments because the real values are physically absent; DDM is a production access control, not a data-distribution technique.
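The difference can be sketched in a few lines. The row layout, the `redact` helper, and the role names below are assumptions for illustration; real SDM tools and database-native DDM features are configured rather than hand-written. The sample SSNs use area number 000, which the SSA does not issue.

```python
import copy

# Hypothetical source rows standing in for a production table.
SOURCE = [
    {"id": 1, "name": "Alice Example", "ssn": "000-12-3456"},
    {"id": 2, "name": "Bob Example", "ssn": "000-65-4321"},
]


def redact(_ssn: str) -> str:
    """Simplest possible mask: replace the whole value."""
    return "***-**-****"


def static_mask(rows):
    """SDM: build a masked copy once; only this copy is shipped to non-prod."""
    masked = copy.deepcopy(rows)
    for row in masked:
        row["ssn"] = redact(row["ssn"])
    return masked


def dynamic_read(rows, role):
    """DDM: the source is unchanged; masking is decided per role at read time."""
    if role == "admin":
        return [dict(row) for row in rows]
    return [{**row, "ssn": redact(row["ssn"])} for row in rows]


test_copy = static_mask(SOURCE)                      # originals are absent here
analyst_view = dynamic_read(SOURCE, role="analyst")  # masked view of real data
admin_view = dynamic_read(SOURCE, role="admin")      # real values still reachable
```

In the static case the real SSNs do not exist anywhere in `test_copy`. In the dynamic case they remain in `SOURCE` and are one role change away, which is why DDM is treated here as access control.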

De-identification is not a single technique, but a collection of approaches, tools, and algorithms applied to different kinds of data with differing levels of effectiveness.
NIST SP 800-188, De-Identifying Government Data Sets

Which non-prod data protection technique should I use? (comparison table)

Pick by the properties you cannot give up. If you must reverse a value later, you need tokenization or FPE. If you want the link to real people gone, you need irreversible static masking or synthetic data. The table below maps each technique against the dimensions that decide non-prod fit, from reversibility to re-identification risk.

| Technique | Reversible? | Referential integrity | Realism of output | Re-identification risk | Performance cost | Best use |
|---|---|---|---|---|---|---|
| Static data masking (irreversible) | No (one-way) | Yes, if consistent masking is configured | High (looks like real records) | Low to medium (depends on residual quasi-identifiers) | One-time batch cost at copy | Shared dev/test/QA from a prod copy |
| Dynamic data masking | No transform, source intact | N/A (source unchanged) | High at query layer | High (real data still present underneath) | Per-query runtime overhead | Role-based prod read access, not data distribution |
| Tokenization (deterministic) | Yes (via vault) | Yes (same input maps to same token) | Medium (format may differ unless preserved) | Medium to high (vault compromise reverses it) | Vault lookup per tokenize/detokenize | PCI scope reduction where reversal is required |
| Format-preserving encryption (FF1/FF3-1) | Yes (via key) | Yes (deterministic for a fixed key/tweak) | High (same length and charset) | Medium to high (key compromise reverses it) | Per-value cipher computation | Legacy schemas needing same format and reversibility |
| Fully synthetic data | No original to reverse | Yes, if generated with relational constraints | High (statistically plausible, fabricated) | Lowest (no real individual represented) | Generation cost, no prod copy needed | Net-new test data, demos, open sharing, AI/ML test sets |
Non-production data protection techniques compared across the properties that matter for engineering and compliance.

A worked sizing example: masking a 50-million-row customer table where 8 columns are sensitive means 50,000,000 x 8 = 400,000,000 cell transforms in a single batch job. Deterministic tokenization of one email column across 3 tables holding 12M, 4M, and 1M rows means 12M + 4M + 1M = 17,000,000 vault lookups, and every distinct email must resolve to the same token in all three tables to keep joins intact. Synthetic generation avoids both the copy and the vault entirely because it starts from zero rows and produces exactly what the test plan asks for.
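The arithmetic in the sizing example can be checked with a short script. The table names are placeholders; only the row and column counts come from the example above.

```python
# Static masking: every sensitive cell in the table is transformed once.
rows = 50_000_000
sensitive_columns = 8
cell_transforms = rows * sensitive_columns

# Deterministic tokenization: one vault lookup per email cell across three tables.
# Table names are hypothetical.
email_rows_per_table = {
    "table_a": 12_000_000,
    "table_b": 4_000_000,
    "table_c": 1_000_000,
}
vault_lookups = sum(email_rows_per_table.values())

print(f"{cell_transforms:,} cell transforms")  # 400,000,000 cell transforms
print(f"{vault_lookups:,} vault lookups")      # 17,000,000 vault lookups
```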

How does tokenization vs masking differ for referential integrity?

Both can preserve referential integrity, but only when applied consistently. Deterministic tokenization and consistent masking guarantee the same input always produces the same output, so foreign keys and cross-table joins survive. Random, per-row substitution breaks joins because the same source value yields different outputs in different rows and tables.

  • Deterministic tokenization: input X always maps to token T. A customer ID joined across orders, invoices, and shipments stays joinable after tokenizing because every occurrence becomes the same T.
  • Consistent (deterministic) masking: the masking function is seeded so the same name or SSN maps to the same masked value everywhere, preserving distinct-count cardinality and join keys.
  • Random masking/tokenization: each occurrence is independent, which destroys joins and changes cardinality. Acceptable only for free-text or non-key columns.
  • FPE: deterministic for a fixed key and tweak, so it preserves joins like consistent masking while keeping the original length and character set.
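The effect on joins is easy to demonstrate. The sketch below uses a keyed HMAC as the consistent masking function, which is one common choice and an assumption here, not something this guide prescribes. The key, the `cust_` prefix, and the sample tables are all hypothetical. Note that whoever holds the key can recompute the mapping for guessed inputs, so the key needs the same care as a tokenization vault.

```python
import hashlib
import hmac
import secrets

MASKING_KEY = b"demo-only-key"  # assumption: a real key would live in a KMS


def consistent_mask(value: str) -> str:
    """Keyed one-way pseudonym: the same input yields the same output everywhere."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "cust_" + digest[:12]


def random_mask(_value: str) -> str:
    """Independent value per occurrence: nothing links two rows together."""
    return "cust_" + secrets.token_hex(6)


customers = [{"customer_id": "C-1001"}, {"customer_id": "C-1002"}]
orders = [
    {"customer_id": "C-1001"},
    {"customer_id": "C-1001"},
    {"customer_id": "C-1002"},
]


def joined_count(mask) -> int:
    """Mask both tables, then count orders that still match a customer."""
    masked_customers = {mask(c["customer_id"]) for c in customers}
    return sum(mask(o["customer_id"]) in masked_customers for o in orders)


consistent_joins = joined_count(consistent_mask)  # all 3 orders still join
random_joins = joined_count(random_mask)          # joins are lost
```

With the consistent function every order still finds its customer. With random substitution a match would require a chance collision between independently drawn values, so the join is effectively destroyed.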

Which technique fits GDPR, HIPAA, and PCI DSS? (compliance mapping table)

Compliance fit turns on reversibility. Under GDPR, reversible techniques are pseudonymization and the data stays personal data; only genuinely irreversible de-identification can reach anonymization and fall outside the Regulation. HIPAA recognizes Safe Harbor and Expert Determination de-identification. PCI DSS lets tokenization and synthetic test data reduce or remove cardholder-data scope.

| Technique | GDPR classification | HIPAA fit | PCI DSS fit | Stays in regulatory scope? |
|---|---|---|---|---|
| Irreversible static masking | Anonymization if re-identification not reasonably likely (Recital 26); otherwise pseudonymization | Supports Safe Harbor / Expert Determination de-identification | Removes PAN from test data if no reversal path | No, if true anonymization is achieved |
| Dynamic masking | Access control, not anonymization; underlying data still personal | Access safeguard; data still PHI | Does not remove PAN from the store | Yes |
| Tokenization (vaulted) | Pseudonymization, Article 4(5); remains personal data | Limited Data Set / safeguard; PHI unless de-identified | Recognized for scope reduction (PCI tokenization guidance) | Yes (reversible) |
| Format-preserving encryption | Pseudonymization (encryption is reversible with the key) | Encryption safeguard; PHI unless de-identified | Strong cryptography per Req. 3; reduces but rarely removes scope | Yes (reversible) |
| Fully synthetic data | Not personal data if no real individual is represented | No PHI when generated from non-real values | No CHD when built from reserved test BINs | No |
How each technique maps to GDPR, HIPAA, and PCI DSS obligations for non-production data.

GDPR Recital 26 is the anchor: the principles of data protection do not apply to anonymous information, and to decide whether someone is identifiable you must account for all the means reasonably likely to be used to identify them [gdpr-recital-26]. The word "reasonably" is why irreversible masking can qualify as anonymization while tokenization and FPE cannot: whoever holds the vault or key has a reasonably likely means of reversal, so the output stays pseudonymized personal data under Article 4(5) [gdpr-art4].

PCI and synthetic card numbers

PCI DSS v4.0 Requirement 6.5.6 requires test data and test accounts to be removed from system components before the system goes into production, and Requirement 6.5.5 bars live PANs from pre-production environments unless those environments are part of the cardholder data environment and protected accordingly [pci-dss-v4]. For payment-flow testing, synthetic numbers built from reserved test BINs and validated with the Luhn algorithm [luhn] let you exercise authorization, settlement, and decline paths without bringing real cardholder data into scope. Card brands and payment processors publish sandbox test card numbers for exactly this purpose.

The table below lists widely documented sandbox test cards. These numbers are reserved for testing, pass the Luhn checksum, and never correspond to a real cardholder account, so they are safe to commit to fixtures and CI [stripe-test-cards].

| Card brand | Test card number | Digits | CVC length | Notes |
|---|---|---|---|---|
| Visa | 4242 4242 4242 4242 | 16 | 3 | De facto standard sandbox Visa; any future expiry |
| Visa (debit) | 4000 0566 5566 5556 | 16 | 3 | Exercises debit-specific routing in test |
| Mastercard | 5555 5555 5555 4444 | 16 | 3 | Common approved-payment Mastercard test number |
| American Express | 3782 822463 10005 | 15 | 4 | 15-digit Amex format; 4-digit CID |
| Discover | 6011 1111 1111 1117 | 16 | 3 | Discover sandbox number |
Published sandbox test card numbers that pass the Luhn check and never map to a real account, suitable for non-production payment testing.
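The Luhn check itself is short enough to keep next to your fixtures. This is a minimal sketch of the standard mod 10 checksum; the function name `luhn_valid` and the `TEST_CARDS` dictionary are illustrative, and the numbers are the ones listed in the table above.

```python
def luhn_valid(card_number: str) -> bool:
    """Return True if the digits pass the Luhn (mod 10) checksum."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


TEST_CARDS = {
    "Visa": "4242 4242 4242 4242",
    "Visa (debit)": "4000 0566 5566 5556",
    "Mastercard": "5555 5555 5555 4444",
    "American Express": "3782 822463 10005",
    "Discover": "6011 1111 1111 1117",
}

results = {brand: luhn_valid(number) for brand, number in TEST_CARDS.items()}
```

Every entry in `results` is `True`. A passing checksum only shows that a number is well-formed; whether it is safe to use depends on it coming from a published test range, not on the Luhn result.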

When is synthetic data generation the better choice over masking?

Choose synthetic data when there is no clean production copy to mask, when you need volume or edge cases that production lacks, or when the dataset will be shared widely or used to train models. Because synthetic records represent no real person, the re-identification surface that masking leaves behind through residual quasi-identifiers largely disappears.

  1. No safe source copy exists. Greenfield products and pre-launch features have no production data to mask; generation is the only option.
  2. You need controlled edge cases. Leap-year dates, maximum-length names, rare locales, and high-cardinality stress sets are easier to fabricate than to find in production.
  3. Wide distribution. Open demos, partner sandboxes, and public datasets are safer with fabricated data than with a masked prod extract that may carry residual identifiers.
  4. ML and analytics test sets. Synthetic generators can reproduce statistical structure without copying individuals, though you should still test the output for memorized real values.
  5. Referential complexity. When you need dozens of linked tables to stay consistent, generating them together is often less error-prone than masking a live schema.
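As a rough illustration of the referential-complexity point, the sketch below generates two linked tables together using only the Python standard library. The schema, field names, and `generate_dataset` function are hypothetical. The value ranges are reserved ones: `example.com` is set aside for documentation by RFC 2606, 555-0100 through 555-0199 are reserved for fictional use in the North American Numbering Plan, and the SSA does not issue SSNs with area number 000.

```python
import random


def generate_dataset(customer_count: int, seed: int = 42):
    """Build customers and orders together so every foreign key resolves."""
    rng = random.Random(seed)  # seeded so fixtures are reproducible
    customers = []
    orders = []
    for customer_id in range(1, customer_count + 1):
        customers.append({
            "customer_id": customer_id,
            "email": f"user{customer_id}@example.com",        # reserved domain
            "phone": f"202-555-01{rng.randint(0, 99):02d}",   # fictional-use range
            "ssn": f"000-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",  # never-issued area
        })
        # Orders are created alongside their customer, so no orphan keys exist.
        for _ in range(rng.randint(0, 3)):
            orders.append({
                "order_id": len(orders) + 1,
                "customer_id": customer_id,
                "amount_cents": rng.randint(100, 50_000),
            })
    return customers, orders


customers, orders = generate_dataset(customer_count=5)
customer_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in customer_ids]
```

Because child rows are emitted in the same pass as their parent, `orphans` is always empty, and the fixed seed means two runs produce identical fixtures.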

Open-source libraries such as Faker generate names, addresses, and identifiers across locales and are a common starting point for synthetic test data [faker]. For browser-based generation and linked multi-record exports, start at the identity generator and use the bulk export to produce consistent related rows. For the privacy-law angle on generated data, see our companion guide on synthetic data and GDPR anonymization.

What are the residual risks of each test data anonymization technique?

Every technique leaves something behind. Masking can leak through quasi-identifiers and unmasked free-text fields; tokenization and FPE are reversible and therefore only as strong as their vault or key; dynamic masking keeps the real data present. Synthetic data's main risk is leakage, where a generator memorizes and reproduces real source records.

| Technique | Primary residual risk | Mitigating control |
|---|---|---|
| Static masking | Re-identification via residual quasi-identifiers (ZIP + DOB + gender) | Generalize or suppress quasi-identifiers; test against linkage attacks |
| Dynamic masking | Real data still present; bypass via direct DB access or privilege escalation | Lock down source access; treat as access control, not distribution |
| Tokenization | Vault compromise reverses all tokens at once | Isolate and harden the token vault; rotate; strict access logging |
| Format-preserving encryption | Key compromise reverses data; weak modes (original FF3) attacked | Use FF1 or FF3-1 per NIST SP 800-38G; manage keys in an HSM/KMS |
| Synthetic data | Generator memorizes and emits real source records (model leakage) | Scan output for verbatim source values; prefer reserved/never-issued ranges |
Primary residual risk and the control that addresses it, by technique.
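The "scan output for verbatim source values" control can start as a simple set intersection. The sketch below is an assumption about how such a scan might look, with hypothetical column names and sample rows. It catches exact matches only, so near-duplicates and reformatted values would need fuzzier comparison.

```python
def find_leaks(synthetic_rows, source_rows, columns):
    """Return (column, value) pairs in synthetic rows that occur verbatim in the source."""
    leaks = set()
    for column in columns:
        source_values = {row[column] for row in source_rows}
        for row in synthetic_rows:
            if row[column] in source_values:
                leaks.add((column, row[column]))
    return sorted(leaks)


# Hypothetical data: the second synthetic row simulates a memorized email.
source_rows = [{"email": "real.person@corp.test", "name": "Real Person"}]
synthetic_rows = [
    {"email": "user1@example.com", "name": "Sample One"},
    {"email": "real.person@corp.test", "name": "Sample Two"},
]

leaks = find_leaks(synthetic_rows, source_rows, columns=["email", "name"])
# leaks == [("email", "real.person@corp.test")]
```

A check like this fits naturally in the pipeline that publishes a generated dataset, failing the job when `leaks` is non-empty.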

The k-anonymity model formalizes the quasi-identifier risk: a release is k-anonymous when every record is indistinguishable from at least k-1 others on its quasi-identifiers, which is the standard yardstick for whether masked output resists linkage [k-anonymity]. NIST SP 800-188 walks through these de-identification trade-offs for real datasets and is the authoritative engineering reference [nist-800-188].
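Measuring k for a masked extract takes only a group-by. This sketch assumes the quasi-identifiers are already known and generalized; the column names and sample rows are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


rows = [
    {"zip": "021**", "birth_year": 1980, "gender": "F"},
    {"zip": "021**", "birth_year": 1980, "gender": "F"},
    {"zip": "021**", "birth_year": 1975, "gender": "M"},
    {"zip": "021**", "birth_year": 1975, "gender": "M"},
    {"zip": "946**", "birth_year": 1990, "gender": "F"},  # unique combination
]
quasi_identifiers = ["zip", "birth_year", "gender"]

k_before = k_anonymity(rows, quasi_identifiers)       # 1: the last row stands alone
k_after = k_anonymity(rows[:-1], quasi_identifiers)   # 2: after suppressing that row
```

A result of 1 means at least one record is unique on its quasi-identifiers and is a candidate for linkage. Suppressing or further generalizing such rows is how the value is raised.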

References & sources

  1. [gdpr-recital-26] GDPR Recital 26 - Not Applicable to Anonymous Data. gdpr-info.eu
  2. [gdpr-art4] GDPR Article 4 - Definitions (incl. pseudonymisation). gdpr-info.eu
  3. [nist-800-188] NIST SP 800-188, De-Identifying Government Data Sets. NIST
  4. [pci-dss-v4] PCI DSS v4.0.1 Requirements and Testing Procedures (Document Library). PCI Security Standards Council
  5. [luhn] Luhn algorithm (mod 10 checksum). Wikipedia
  6. [k-anonymity] k-anonymity privacy model. Wikipedia
  7. [faker] Faker documentation. Faker
  8. [stripe-test-cards] Stripe Documentation - Test card numbers. Stripe

Frequently asked questions

Is data masking the same as anonymization under GDPR?

Not automatically. Irreversible masking can reach GDPR anonymization, which puts data outside the Regulation's scope, but only if re-identification is no longer reasonably likely. Reversible techniques such as tokenization and format-preserving encryption are pseudonymization under GDPR Article 4(5) and remain personal data.

Which technique has the lowest re-identification risk?

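Fully synthetic data has the lowest risk because every record is generated and no original individual is represented. Among the techniques that ship transformed data to non-production, tokenization and format-preserving encryption carry the highest residual risk because they are reversible by design and the mapping or key can be compromised. Dynamic masking is riskier still as a distribution method, since the real data remains in place underneath.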

Does tokenization preserve referential integrity?

Yes, when tokenization is deterministic. The same input value always maps to the same token, so foreign keys and joins across tables stay consistent. Random per-row tokenization breaks joins unless you store and reuse the mapping for every occurrence of a value.

Can I use synthetic data for PCI DSS test environments?

Yes. PCI DSS v4.0 Requirement 6.5.6 requires test data and test accounts to be removed from system components before the system goes into production, and Requirement 6.5.5 bars live PANs from pre-production environments unless those environments are part of the cardholder data environment. Synthetic card numbers built from reserved test BINs and validated with the Luhn check let you exercise payment flows without bringing real cardholder data into scope.

What is the difference between static and dynamic data masking?

Static data masking permanently rewrites values in a copied dataset before it lands in non-production, so the masked database never contains the originals. Dynamic data masking leaves the source unchanged and obscures values at query time based on the requesting user, so the real data still exists underneath.

Is format-preserving encryption approved by NIST?

Yes, partially. NIST Special Publication 800-38G originally specified two format-preserving encryption modes, FF1 and FF3. After cryptanalysis exposed weaknesses in FF3, NIST proposed the revised FF3-1 variant in the draft Revision 1 of the publication. FF1 has been specified throughout; check the current status of SP 800-38G before committing to FF3-1.
