Synthetic data generation
A configurable generator for realistic, referentially consistent ERP/retail datasets (millions of rows with seasonality, customer cohorts, promotions, and returns), so you can demo, test, and prototype without touching real data.
My role: Sole author — designed the generator and its verification suite.
Executive summary
Good demos, tests, and prototypes need realistic data — but real customer and sales data usually can't be shared, and public datasets are either tiny or behaviourally flat. This generator fills the gap: it produces large, believable ERP/retail datasets that look and behave like the real thing, with none of the privacy or legal baggage.
It's built for analytics and BI teams, data scientists, and ERP consultants who need production-like data on demand — to build churn models, cohort analyses, dashboards, or sandbox environments.
Technical implementation
The generator emits six relational CSV tables — dimensions (items, customers, stores, promotions) and facts (invoice headers, sales lines) — on an AdventureWorks-style schema with full money decomposition and line-item margins.
- Behavioural realism: Seasonality (Black Friday, Ramadan, Eid → 2–3× spikes), six sticky customer cohorts, inflation-driven price drift, demographics, promotions, and ~3% linked returns.
- Referential integrity & reconciliation: Foreign keys validated, returns linked to their originals, and every invoice's totals reconciled to the cent.
- Fully reproducible: A single --seed drives all randomness, so the same seed produces byte-identical CSVs every run.
- Multi-market: US / GCC / EU locales swap currency, VAT, payment methods, weekends, and holiday calendars.
- Built for scale: Streaming CSV writes keep memory flat; 1,000 customers across 11 years (100k+ transactions) generate in about 70 seconds.
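The reproducibility and streaming points above can be illustrated with a minimal sketch. It is not the generator's actual code: the column names, the value ranges, and the helper names (generate_sales_lines, write_csv, csv_digest) are assumptions made for illustration. The idea is that one private random number generator, created from the seed, feeds a Python generator function that yields rows one at a time, so nothing accumulates in memory and the same seed reproduces the same bytes.

```python
import csv
import hashlib
import io
import random
from datetime import date, timedelta


def generate_sales_lines(seed, n_rows, start=date(2024, 1, 1)):
    """Yield one synthetic sales line at a time (hypothetical columns)."""
    rng = random.Random(seed)  # one private RNG; no global random state is used
    for line_id in range(1, n_rows + 1):
        day = start + timedelta(days=rng.randrange(365))
        qty = rng.randint(1, 5)
        unit_price_cents = rng.randrange(199, 9999)
        yield [line_id, day.isoformat(), qty, unit_price_cents, qty * unit_price_cents]


def write_csv(stream, seed, n_rows):
    """Write rows to the stream as they are produced; no list of rows is built."""
    # A fixed line terminator keeps the output identical across platforms.
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["line_id", "date", "qty", "unit_price_cents", "line_total_cents"])
    for row in generate_sales_lines(seed, n_rows):
        writer.writerow(row)


def csv_digest(seed, n_rows):
    """Hash the generated CSV text so two runs can be compared byte for byte."""
    buf = io.StringIO()
    write_csv(buf, seed, n_rows)
    return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Same seed, same digest; a different seed gives different data.
    print(csv_digest(42, 1000) == csv_digest(42, 1000))
```

Keeping money in integer cents, as this sketch does, is one way to avoid floating-point drift in totals. In real use the stream would be a file opened with newline="" rather than an in-memory buffer.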

Using it
It runs as a CLI: pick a seed, market, customer count, and date range, and it streams out the CSVs. A verification script validates integrity and a chart script renders showcase visuals; a pre-built sample dataset is included so you can explore the output immediately — no generation required.
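To give a feel for what the verification step checks, here is a small sketch of two of the checks described earlier: foreign keys and invoice reconciliation. It is an illustration under assumptions, not the shipped script. The column names (invoice_id, line_total, invoice_total) and function names are hypothetical, and the rows are hand-written stand-ins for what csv.DictReader would return.

```python
from decimal import Decimal


def find_orphans(lines, headers):
    """Return sales lines whose invoice_id has no matching invoice header."""
    header_ids = {h["invoice_id"] for h in headers}
    return [line for line in lines if line["invoice_id"] not in header_ids]


def find_unreconciled(lines, headers):
    """Return invoice ids whose header total differs from the sum of their lines."""
    sums = {}
    for line in lines:
        key = line["invoice_id"]
        # Decimal parses the CSV text exactly, so comparisons hold to the cent.
        sums[key] = sums.get(key, Decimal("0")) + Decimal(line["line_total"])
    return [
        h["invoice_id"]
        for h in headers
        if sums.get(h["invoice_id"], Decimal("0")) != Decimal(h["invoice_total"])
    ]


# Hand-written sample rows (values are strings, as they would be when read from CSV).
headers = [
    {"invoice_id": "INV-1", "invoice_total": "10.30"},
    {"invoice_id": "INV-2", "invoice_total": "5.00"},
]
lines = [
    {"invoice_id": "INV-1", "line_total": "10.10"},
    {"invoice_id": "INV-1", "line_total": "0.20"},
    {"invoice_id": "INV-2", "line_total": "4.99"},  # one cent short of the header
    {"invoice_id": "INV-9", "line_total": "1.00"},  # no matching header
]

if __name__ == "__main__":
    print([line["invoice_id"] for line in find_orphans(lines, headers)])
    print(find_unreconciled(lines, headers))
```

Decimal is used instead of float so that a one-cent discrepancy is detected exactly and is never hidden or caused by binary rounding.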
Tech stack
Outcome
The result is production-like retail data on tap: statistically rich across seasonality, cohorts, demographics, promotions, and returns, yet fully synthetic and reproducible.
Its behaviour model is grounded in established academic frameworks (buy-till-you-die models, discrete-choice theory), and its verification checklist mirrors published synthetic-data evaluation standards — so the realism is principled, not guesswork.
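For readers unfamiliar with buy-till-you-die models, the sketch below shows the general idea in its simplest form. It is not the generator's behaviour model. It is a textbook-style illustration in which a customer buys at random intervals and, after each purchase, has a fixed chance of dropping out for good. The function name and all parameters are assumptions chosen for this example.

```python
import random


def simulate_purchase_days(rng, purchase_rate, dropout_prob, horizon_days):
    """Simulate one customer's purchase times under a simple buy-till-you-die process.

    purchase_rate: expected purchases per day while the customer is active.
    dropout_prob:  chance the customer becomes permanently inactive after a purchase.
    """
    days = []
    t = 0.0
    while True:
        t += rng.expovariate(purchase_rate)  # exponential wait until the next purchase
        if t > horizon_days:
            break  # observation window ended while the customer was still active
        days.append(t)
        if rng.random() < dropout_prob:
            break  # the customer "dies" and never buys again
    return days


if __name__ == "__main__":
    rng = random.Random(2024)
    # Three hypothetical customers with different purchase rates and dropout chances.
    for rate, dropout in [(0.10, 0.02), (0.03, 0.10), (0.01, 0.30)]:
        history = simulate_purchase_days(rng, rate, dropout, horizon_days=365)
        print(rate, dropout, len(history))
```

Varying the two parameters by customer segment is one plausible way to obtain cohorts that keep behaving consistently over time, with frequent loyal buyers at one end and one-off shoppers at the other.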