Open source · Tooling

Synthetic data generation

A configurable generator for realistic, referentially consistent ERP/retail datasets (millions of rows with seasonality, customer cohorts, promotions, and returns), so you can demo, test, and prototype without touching real data.

Python · pandas · numpy · Faker

My role: Sole author — designed the generator and its verification suite.

Executive summary

Good demos, tests, and prototypes need realistic data — but real customer and sales data usually can't be shared, and public datasets are either tiny or behaviourally flat. This generator fills the gap: it produces large, believable ERP/retail datasets that look and behave like the real thing, with none of the privacy or legal baggage.

It's built for analytics and BI teams, data scientists, and ERP consultants who need production-like data on demand — to build churn models, cohort analyses, dashboards, or sandbox environments.

Technical implementation

The generator emits six relational CSV tables — dimensions (items, customers, stores, promotions) and facts (invoice headers, sales lines) — on an AdventureWorks-style schema with full money decomposition and line-item margins.
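To make "money decomposition" concrete, here is a minimal sketch of how a single sales line could be broken into its components. The column names, the rounding rule (half-up to the cent), and the order of discount and tax are assumptions for illustration, not the generator's actual schema.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount):
    """Round a Decimal to two places, half-up (assumed rounding rule)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def price_line(qty, unit_price, unit_cost, discount_rate, tax_rate):
    """Decompose one sales line into money components (hypothetical column names)."""
    gross = to_cents(Decimal(qty) * unit_price)
    discount = to_cents(gross * discount_rate)
    net = gross - discount
    tax = to_cents(net * tax_rate)
    cost = to_cents(Decimal(qty) * unit_cost)
    return {
        "gross_amount": gross,
        "discount_amount": discount,
        "net_amount": net,
        "tax_amount": tax,
        "line_total": net + tax,
        "cost_amount": cost,
        "margin_amount": net - cost,  # margin is measured before tax
    }


line = price_line(3, Decimal("19.99"), Decimal("12.50"), Decimal("0.10"), Decimal("0.05"))
```

Using `Decimal` rather than floats keeps every component exact to the cent, which is what makes a later "totals reconcile" check meaningful.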

  • Behavioural realism: Seasonality (Black Friday, Ramadan, Eid → 2–3× spikes), six sticky customer cohorts, inflation-driven price drift, demographics, promotions, and ~3% linked returns.
  • Referential integrity & reconciliation: Foreign keys validated, returns linked to their originals, and every invoice's totals reconciled to the cent.
  • Fully reproducible: A single --seed drives all randomness, so the same seed produces byte-identical CSVs every run.
  • Multi-market: US / GCC / EU locales swap currency, VAT, payment methods, weekends, and holiday calendars.
  • Built for scale: Streaming CSV writes keep memory flat; 1,000 customers across 11 years (100k+ transactions) take about 70 seconds.
Figure: Generated output as a line chart of 11 years of synthetic monthly revenue (seed 42, 1,000 customers). Note the seasonal spikes.
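The reproducibility, seasonality, and streaming ideas above can be sketched together in a few lines. This is a toy illustration, not the generator's code: the peak dates, multipliers, column names, and function names are all hypothetical, and it uses the standard-library `random` module to stay dependency-free even though the project itself lists numpy.

```python
import csv
import hashlib
import io
import random
from datetime import date, timedelta

# Hypothetical peak days and multipliers; real calendars would vary per market.
PEAKS = {(11, 29): 3.0, (12, 24): 2.0}


def daily_rows(seed, start, days, base=100.0):
    """Lazily yield (iso_date, revenue) rows driven by a single seeded RNG."""
    rng = random.Random(seed)  # the only source of randomness
    for offset in range(days):
        day = start + timedelta(days=offset)
        multiplier = PEAKS.get((day.month, day.day), 1.0)
        noise = rng.uniform(0.8, 1.2)
        yield day.isoformat(), f"{base * multiplier * noise:.2f}"


def write_csv(seed, out, start=date(2024, 11, 1), days=60):
    """Write rows to `out` one at a time, without collecting them in a list."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "revenue"])
    for row in daily_rows(seed, start, days):
        writer.writerow(row)


def digest(seed):
    """SHA-256 of the generated CSV bytes, for comparing two runs."""
    buf = io.StringIO()
    write_csv(seed, buf)
    return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()
```

Calling `digest(42)` twice should return the same hash, which is one simple way to check a byte-identical claim. Two details matter for that property in this sketch: the line terminator is pinned, and amounts are formatted explicitly rather than left to default float printing.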

Using it

It runs as a CLI: pick a seed, market, customer count, and date range, and it streams out the CSVs. A verification script validates integrity and a chart script renders showcase visuals; a pre-built sample dataset is included so you can explore the output immediately — no generation required.
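As a rough idea of what an integrity check of this kind can look like, the sketch below validates foreign keys, return links, and invoice totals over rows shaped like `csv.DictReader` output. The column names (`customer_id`, `invoice_id`, `original_invoice_id`, `invoice_total`, `line_total`) and the `verify` function are assumptions for illustration; the bundled verification script may work quite differently.

```python
from decimal import Decimal


def verify(customers, headers, lines):
    """Return a list of integrity problems; an empty list means the data is clean."""
    problems = []
    customer_ids = {c["customer_id"] for c in customers}
    invoice_ids = {h["invoice_id"] for h in headers}

    # Foreign key: every sales line must point at an existing invoice.
    # Accumulate line totals per invoice for the reconciliation step.
    line_sums = {}
    for ln in lines:
        inv = ln["invoice_id"]
        if inv not in invoice_ids:
            problems.append(f"line references unknown invoice {inv}")
        line_sums[inv] = line_sums.get(inv, Decimal("0")) + Decimal(ln["line_total"])

    for h in headers:
        inv = h["invoice_id"]
        # Foreign key: every invoice must belong to a known customer.
        if h["customer_id"] not in customer_ids:
            problems.append(f"invoice {inv} references unknown customer")
        # Returns must link back to an existing original invoice.
        original = h.get("original_invoice_id")
        if original and original not in invoice_ids:
            problems.append(f"return {inv} references unknown original {original}")
        # Reconciliation: header total must equal the sum of its lines, exactly.
        if line_sums.get(inv, Decimal("0")) != Decimal(h["invoice_total"]):
            problems.append(f"invoice {inv} total does not match its lines")
    return problems


customers = [{"customer_id": "C1"}]
headers = [
    {"invoice_id": "I1", "customer_id": "C1", "original_invoice_id": "", "invoice_total": "30.00"},
    {"invoice_id": "I2", "customer_id": "C1", "original_invoice_id": "I1", "invoice_total": "-10.00"},
]
lines = [
    {"invoice_id": "I1", "line_total": "10.00"},
    {"invoice_id": "I1", "line_total": "20.00"},
    {"invoice_id": "I2", "line_total": "-10.00"},
]
problems = verify(customers, headers, lines)
```

Amounts are parsed as `Decimal` so that "to the cent" means exact equality rather than a float tolerance. The same checks translate naturally to pandas merges and group-bys on larger outputs.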

Tech stack

Python 3.10+ · pandas · numpy · Faker · CSV · CLI

Outcome

The result is production-like retail data on tap: statistically rich across seasonality, cohorts, demographics, promotions, and returns, yet fully synthetic and reproducible.

Its behaviour model is grounded in established academic frameworks (buy-till-you-die models, discrete-choice theory), and its verification checklist mirrors published synthetic-data evaluation standards — so the realism is principled, not guesswork.