Deterministic Test Data: Why Seeding Beats Random Generation
Deterministic seeded test data: how a seed yields the same dataset every run, why it kills flaky tests, and how to seed faker.js, Python Faker, and Bogus.
By FakeName Editorial Team | Published June 25, 2026 | Last updated June 25, 2026 | 9 min read
A test that passes on your laptop and fails in CI is the most expensive kind of bug because nobody can reproduce it on demand. When the input is fresh random data on every run, the failing record is gone the moment the job ends. Deterministic test data removes that variable: seed the generator once and every run produces the same fictional dataset, so a red build becomes a bug you can replay, bisect, and fix. This guide explains seeding versus unseeded randomness, shows a worked example of a seed reproducing a dataset, and gives the exact seeding call for faker.js, Python Faker, and Bogus.
What is deterministic test data and how does seeding work?
Deterministic test data is fictional data generated from a fixed seed so the same code returns identical records on every run and every machine. Generators do not produce true randomness; they run a pseudorandom number generator (PRNG) that emits a fixed sequence from its starting state. Setting that state, the seed, pins the whole sequence and makes the output reproducible. [wiki-prng]
A PRNG is a deterministic function: given the same internal state, it returns the same next number, forever. Fake-data libraries layer field logic on top, using each drawn number to pick a first name, a street, or a digit. So when you fix the seed, you fix every downstream selection. The data still spans the full catalog of names and addresses; you have only chosen a known entry point into the stream instead of a clock- or entropy-derived one.
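The layering described above can be sketched with Python's standard `random` module standing in for a fake-data library. The catalogs and the `make_person` and `make_dataset` helpers are hypothetical and exist only for illustration; a real library has far larger catalogs and more field logic, but the principle is the same: each field is chosen by the next draw from a seeded stream.

```python
import random

# Hypothetical catalogs; a real library ships far more entries per locale.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara", "Donald"]
STREETS = ["Oak Street", "Elm Avenue", "Pine Road", "Maple Lane"]


def make_person(rng: random.Random) -> dict:
    """Build one fictional record by drawing from the PRNG stream."""
    return {
        "first_name": rng.choice(FIRST_NAMES),  # one draw picks the name
        "street": rng.choice(STREETS),          # the next draw picks the street
        "house_number": rng.randint(1, 9999),   # another draw picks the number
    }


def make_dataset(seed: int, count: int) -> list[dict]:
    """Generate `count` records from a generator seeded with `seed`."""
    rng = random.Random(seed)  # fixing the seed fixes every later selection
    return [make_person(rng) for _ in range(count)]


run_a = make_dataset(seed=42, count=5)
run_b = make_dataset(seed=42, count=5)
print(run_a == run_b)  # True: same seed, same records
```

Passing the generator object into `make_person` rather than relying on shared state keeps the entry point into the stream explicit.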
Seeded vs unseeded test data: which is better for CI?
Seeded data is better for automated testing because reproducibility is a hard requirement there: a failure you cannot rerun is a failure you cannot fix with confidence. Unseeded data has one niche, exploratory fuzzing, where you deliberately want new inputs each run. The table below compares the two across the properties that matter for a test suite and CI pipeline.
| Property | Seeded (deterministic) | Unseeded (non-deterministic) |
|---|---|---|
| Reproducibility | Same dataset every run on every machine | New dataset each run; failing case is lost |
| Debuggability | Replay the exact failing record locally | Cannot reliably reproduce the failure |
| Flakiness from data | Eliminated; data is constant across runs | A source of intermittent, hard-to-trace failures |
| Fixture commit | Seed (one integer) committed; data regenerated | Must commit full data dumps to share a case |
| Parallel/order safety | Stable with per-test seeding | Output depends on scheduling and timing |
| Best use | Unit, integration, snapshot, regression tests | Exploratory fuzzing for new edge cases |
The fixture-commit row is the quiet win. Instead of checking a multi-kilobyte JSON dump of generated users into version control, you commit a single integer seed and the generation code. Reviewers read the intent rather than a wall of data, and the fixture can never drift out of sync with the generator because it is computed, not stored.
Faker is a tool to generate fake data, but you may not always want it to be random. Sometimes you want to keep the same generated data between two runs. To achieve this, set the same seed value before you call any Faker generator.
How do flaky tests come from random data?
A flaky test is one that passes and fails without any code change, and unseeded data is a direct cause: a record that only sometimes appears, an empty string a generator picks one run in fifty, or an unusually long name that overflows a column. Because the input differs every run, the failure surfaces at random and resists reproduction. Google reported that almost 16% of its 4.2 million tests showed some level of flakiness over a one-month sample, making it a leading drag on developer productivity. [google-flaky]
With a fixed seed, the generator emits the same edge case on every run. If a seeded name of length 64 breaks a varchar(50) column, it breaks on every run until you fix it, and it stops breaking once you do. The test becomes a true signal of code state rather than a coin flip. Seeding does not find new edge cases on its own; it makes the ones you do encounter permanent and addressable. To explore new cases deliberately, run an unseeded or property-based pass separately so the discovery step never contaminates your deterministic suite.
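To make the varchar(50) scenario concrete, here is a minimal sketch that uses a hypothetical `make_names` generator producing names of 1 to 64 letters. It is not how any particular library builds names; it only shows that with a fixed seed the same records overflow the column on every run.

```python
import random
import string

COLUMN_LIMIT = 50  # mirrors the varchar(50) column in the example above


def make_names(seed: int, count: int) -> list[str]:
    """Hypothetical generator: names of 1 to 64 letters from a seeded stream."""
    rng = random.Random(seed)
    names = []
    for _ in range(count):
        length = rng.randint(1, 64)
        names.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return names


def oversized(names: list[str]) -> list[int]:
    """Return the indexes of names that would overflow the column."""
    return [i for i, name in enumerate(names) if len(name) > COLUMN_LIMIT]


first_run = oversized(make_names(seed=7, count=200))
second_run = oversized(make_names(seed=7, count=200))
print(first_run == second_run)  # True: the same records fail on every run
```

Because the failing indexes are identical between runs, a fix can be verified against exactly the records that broke.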
When you still want randomness
Determinism and discovery are different jobs. Property-based and fuzz testing deliberately vary inputs to surface unknown bugs, but mature frameworks reconcile this with reproducibility: on failure they print the seed or a shrunk minimal case so you can replay it. The pattern is to randomize to find a bug, then pin the seed to fix it. Keep your standard regression suite seeded and confine free randomness to an explicit fuzzing job.
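The randomize-then-pin pattern can be sketched in a few lines. Everything here is hypothetical: `average` is a deliberately buggy function under test, and `fuzz` is a toy loop, not the API of any property-based testing framework. The point is only that the seed, not the generated data, is what gets recorded and replayed.

```python
import random
from typing import Optional


def average(values: list[int]) -> float:
    """Deliberately buggy function under test: it crashes on an empty list."""
    return sum(values) / len(values)


def run_case(seed: int) -> None:
    """Generate one input from the seed and exercise the function."""
    rng = random.Random(seed)
    size = rng.randint(0, 3)  # a size of 0 triggers the bug
    values = [rng.randint(0, 100) for _ in range(size)]
    average(values)


def fuzz(attempts: int, seed_source: random.Random) -> Optional[int]:
    """Try fresh seeds; return the first seed that triggers a failure."""
    for _ in range(attempts):
        seed = seed_source.getrandbits(32)
        try:
            run_case(seed)
        except ZeroDivisionError:
            return seed  # a real job would print this seed in its output
    return None


# Discovery step: unseeded, so each fuzzing run explores different inputs.
failing_seed = fuzz(attempts=1000, seed_source=random.Random())

# Fix step: replaying the pinned seed reproduces the same failure on demand.
if failing_seed is not None:
    try:
        run_case(failing_seed)
    except ZeroDivisionError:
        print(f"seed {failing_seed} reproduces the failure")
```

Once the bug is fixed, the recorded seed can stay in the regression suite as a permanent, deterministic test case.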
How does one seed produce the same dataset every time?
A seed produces the same dataset because it sets the PRNG's starting state, and every later draw is a pure function of that state. The first generated record consumes the first numbers in the sequence, the second record consumes the next, and so on. Re-seed with the same value and the sequence rewinds to the start, so record one is identical, record two is identical, and the whole batch matches. The worked example below shows the shape of that guarantee using faker.js with seed 42.
| Call order | Run A (seed 42) | Run B (seed 42) | Match? |
|---|---|---|---|
| faker.seed(42) | PRNG state set | PRNG state set | n/a |
| 1st person.firstName() | Vincenza | Vincenza | yes |
| 2nd person.firstName() | Cleta | Cleta | yes |
| 3rd person.firstName() | Era | Era | yes |
| Re-seed faker.seed(42), 1st call | Vincenza | Vincenza | yes |
Two properties matter here. First, order dependence: the seed fixes the sequence, so the values you get depend on how many times you call the generator and in what order. Insert one extra call near the top and every record after it shifts. Second, re-seeding rewinds: calling the seed function again resets the stream, which is why per-test seeding gives each test a clean, independent starting point regardless of what ran before it.
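Both properties can be observed directly on a bare PRNG. The sketch below uses Python's standard `random.Random` as a stand-in for the generator inside a fake-data library; it draws raw numbers instead of names, and the variable names are illustrative only.

```python
import random

rng = random.Random(42)
baseline = [rng.random() for _ in range(3)]

# Re-seeding rewinds the stream: the same three values come back.
rng.seed(42)
replay = [rng.random() for _ in range(3)]

# Order dependence: one extra draw near the top shifts every value after it.
rng.seed(42)
rng.random()  # the inserted call consumes the first number in the stream
shifted = [rng.random() for _ in range(3)]

print(replay == baseline)           # True
print(shifted == baseline)          # False
print(shifted[:2] == baseline[1:])  # True: the stream moved by exactly one draw
```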
How do you seed faker.js, Python Faker, and Bogus?
Each major fake-data library exposes one call to fix its PRNG. Set the seed before you generate, and prefer per-instance seeding so tests stay isolated. The table maps the exact API for the three most common libraries across JavaScript, Python, and .NET.
| Library | Language | Global seed | Per-instance seed |
|---|---|---|---|
| faker.js (@faker-js/faker v8+) | JavaScript/TypeScript | faker.seed(123) | const f = new Faker({ locale }); f.seed(123) |
| Faker | Python | Faker.seed(123) | fake = Faker(); fake.seed_instance(123) |
| Bogus | C# / .NET | Randomizer.Seed = new Random(123) | new Faker<T>().UseSeed(123) |
In faker.js, faker.seed(123) sets the seed for the shared instance and returns it; calling faker.seed() with no argument returns a fresh random seed you can log and reuse. [fakerjs-seed] In Python Faker, the class method Faker.seed(123) seeds the shared random instance, while fake.seed_instance(123) seeds a single Faker object without touching others, which keeps parallel tests independent. [pyfaker-seed] In Bogus, Randomizer.Seed = new Random(123) sets a global seed, and .UseSeed(123) on a Faker<T> overrides it locally for one generator. [bogus-readme]
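The difference between a global seed and a per-instance seed can be illustrated without installing any of these libraries. The sketch below is an analogy that uses Python's standard `random` module: its module-level functions share one hidden generator (comparable to a global seed), while each `random.Random` object owns its state (comparable to a per-instance seed). The helper names are hypothetical.

```python
import random


def draw_shared() -> list[float]:
    """Draw three values from the module-level (shared) generator."""
    return [random.random() for _ in range(3)]


def draw_from(rng: random.Random) -> list[float]:
    """Draw three values from a generator the caller owns."""
    return [rng.random() for _ in range(3)]


# Global style: every caller in the process consumes the same stream.
random.seed(123)
shared_first = draw_shared()
random.seed(123)
random.random()                # unrelated code consumes one value...
shared_second = draw_shared()  # ...and this caller's data shifts

# Per-instance style: each generator has its own independent stream.
mine = random.Random(123)
other = random.Random(999)
other.random()                 # activity on another instance has no effect
owned_first = draw_from(mine)
owned_second = draw_from(random.Random(123))

print(shared_first == shared_second)  # False: shared state was disturbed
print(owned_first == owned_second)    # True: isolated state is reproducible
```

This is why per-instance seeding is the safer default when tests run in parallel or in an unpredictable order.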
A reproducible-fixture recipe
- Pick a stable integer seed per test or per fixture (for example, the issue number).
- Seed the generator at the start of the test, before any data is produced.
- Pin the generator library version in your lockfile so the seed maps to the same output over time.
- Commit the seed and the generation code, not a dump of the generated rows.
- On a failure, copy the seed into a focused repro test and debug the exact dataset.
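The recipe above might look like this in a test module. This is a sketch under stated assumptions: `build_users` is a hypothetical fixture builder backed by the standard library rather than a real fake-data library, `SEED` is an arbitrary illustrative value, and the test functions follow the `test_` naming convention that runners such as pytest collect.

```python
import random
import string

SEED = 1042  # hypothetical value, e.g. the issue number behind this fixture


def build_users(seed: int, count: int) -> list[dict]:
    """Regenerate the fixture from its seed instead of loading a stored dump."""
    rng = random.Random(seed)  # seed first, before any data is produced
    users = []
    for user_id in range(1, count + 1):
        local_part = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        users.append({
            "id": user_id,
            "email": f"{local_part}@example.com",  # reserved documentation domain
            "age": rng.randint(18, 90),
        })
    return users


def test_fixture_has_expected_shape():
    users = build_users(SEED, count=25)
    assert len(users) == 25
    assert all(user["email"].endswith("@example.com") for user in users)


def test_fixture_is_reproducible():
    # Focused repro: paste the seed from a failing run here to get its dataset.
    assert build_users(SEED, count=25) == build_users(SEED, count=25)
```

Only the seed and `build_users` live in version control; the rows themselves are recomputed on every run.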
How does our generator handle determinism?
Our engine is deterministic under the hood: identity records are assembled from a seedable pseudorandom stream, so a shared seed yields the same fictional dataset for every developer and every CI run. You get the reproducibility benefits above without wiring a seed through a library by hand. Generate a single seeded identity on the homepage at /, or produce a seeded batch of thousands for fixtures and load tests at /bulk. For end-to-end setup including schemas and CSV/JSON export, see the test-data generation guide.
All output is strictly fictional and meant for testing, QA, demos, and privacy-preserving development. Generated identities use reserved and documentation-style ranges so they cannot collide with a real person: names are synthetic, phone numbers fall in reserved fictional blocks such as the North American 555-01xx range set aside for fictional use, and example domains follow the RFC 2606 reserved names (example.com and the like). [rfc2606] Using fabricated data to impersonate, defraud, or bypass identity verification is outside the supported use and not permitted.
Each field type has an official reserved range that standards bodies set aside specifically so test and documentation values cannot collide with a real, in-service entity. A deterministic generator should draw from these ranges, which keeps a seeded dataset both reproducible and safe to share. The table lists the reserved blocks behind the most common identity fields.
| Field | Reserved range | Authority | Reference |
|---|---|---|---|
| Domain name | example.com, example.net, example.org (and .test, .example, .invalid) | IANA / IETF | RFC 2606 [rfc2606] |
| Phone number (North America) | 555-0100 through 555-0199 | NANPA | 555 Line Numbers [nanpa-555] |
| IPv4 address | 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24 | IETF | RFC 5737 [rfc5737] |
| MAC address | 00-00-5E-00-53-00 through 00-00-5E-00-53-FF | IANA / IETF | RFC 7042 [rfc7042] |
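A seeded generator that stays inside the reserved blocks in the table could be sketched as follows. The `make_contact` helper and its field names are hypothetical, and the sketch covers only the four field types listed above; it uses Python's standard `ipaddress` module to index into the documentation networks.

```python
import ipaddress
import random

# Reserved ranges taken from the table above.
DOC_DOMAINS = ["example.com", "example.net", "example.org"]
DOC_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]


def make_contact(rng: random.Random) -> dict:
    """Build one fictional contact using only reserved ranges."""
    network = rng.choice(DOC_NETWORKS)
    offset = rng.randrange(network.num_addresses)
    return {
        "email": f"user{rng.randint(1, 9999)}@{rng.choice(DOC_DOMAINS)}",
        "phone": f"555-01{rng.randint(0, 99):02d}",          # 555-0100 to 555-0199
        "ipv4": str(network[offset]),                        # RFC 5737 block
        "mac": f"00-00-5E-00-53-{rng.randint(0, 255):02X}",  # RFC 7042 range
    }


contact_a = make_contact(random.Random(2024))
contact_b = make_contact(random.Random(2024))
print(contact_a == contact_b)  # True: reproducible and confined to reserved ranges
```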
Key takeaways for deterministic test data
- Seeding fixes the PRNG so the same seed yields the same fictional dataset on every run and machine.
- Seeded data makes failures reproducible and removes data randomness as a source of flaky CI.
- Commit the seed (one integer) plus the generation code, not large data dumps.
- Use per-test or per-instance seeding for isolation and parallel safety; pin the library version too.
- Seed with faker.seed(123) (faker.js), Faker.seed(123) or fake.seed_instance(123) (Python), and .UseSeed(123) (Bogus).
- Keep generated identities fictional and drawn from reserved ranges; use them only for testing, QA, demos, and privacy-preserving development.
References & sources
- Pseudorandom number generator — Wikipedia
- Faker.seed() API reference — Faker (faker.js)
- Randomizer and reproducible results — Faker (faker.js)
- Seeding the Generator — Python Faker documentation
- Bogus: UseSeed and Randomizer.Seed — Bogus (GitHub)
- Test Flakiness - One of the main challenges of automated testing — Google Testing Blog
- RFC 2606: Reserved Top Level DNS Names — IETF
- 555 Line Numbers — NANPA
- RFC 5737: IPv4 Address Blocks Reserved for Documentation — IETF
- RFC 7042: IANA Considerations and IETF Protocol and Documentation Usage for IEEE 802 Parameters — IETF
Frequently asked questions
What is deterministic test data?
Deterministic test data is fictional test data produced from a fixed seed value so that the same generator code yields byte-for-byte identical records on every run and every machine. Determinism comes from seeding the pseudorandom number generator that drives field selection, so re-running a test reproduces the exact dataset that triggered a failure.
Does seeding make my data less random or less realistic?
No. A seeded pseudorandom generator still draws from the full distribution of names, addresses, and other fields; it just starts from a known point in the sequence. The output looks exactly as varied as unseeded output. The only change is that the sequence is repeatable, which is what makes a failure reproducible.
How do I seed faker.js, Python Faker, and Bogus?
In faker.js (Faker v8+) call faker.seed(123) before generating. In Python Faker call Faker.seed(123) for the shared instance or fake.seed_instance(123) for one instance. In Bogus (.NET) call .UseSeed(123) on a Faker<T> or set Randomizer.Seed globally. Each fixes the PRNG so the next calls return the same values.
Will the same seed give the same data across library versions?
Not guaranteed. A seed pins the random sequence, but the mapping from sequence to output depends on the locale data and generation algorithm, which can change between major versions. faker.js documents that seeded results are stable within a major version, so pin your generator version alongside the seed for full reproducibility.
Should I seed globally or per test?
Prefer per-test or per-instance seeding so tests stay isolated and can run in any order or in parallel without sharing PRNG state. A global seed set once in a shared setup is fine for small suites, but it couples tests together: one test consuming extra random values shifts every later test's data.
Is generating fake data for testing legal and ethical?
Yes, when the data is fictional and used for testing, QA, demos, or privacy-preserving development. Reputable generators draw from reserved, never-issued, and documentation ranges so records cannot match a real person, card, or phone line. Using fabricated identities to impersonate, defraud, or evade verification is out of scope and prohibited.