Building an in-house data platform that runs the reporting cycle by itself

Most reporting doesn't fail because the analysis is hard. It fails because the plumbing is manual. Someone downloads an export, reconciles it against last month's spreadsheet, refreshes a dashboard, attaches a workbook, and emails the same five people the same report they got yesterday. Multiply that by dozens of reports and you have a team that spends its mornings assembling numbers instead of acting on them.

So I built the thing that does the plumbing. This is a write-up of an automated data analytics platform I designed, built and operate end to end as Data Analytics Manager at a fast-moving consumer goods (FMCG) distributor. It covers what the platform does, how it's put together, and the few decisions that made it reliable enough to trust without watching it.

The problem, stated plainly

The business produced data in every shape a real company produces: manual Excel uploads, vendor CSV drops, REST APIs, and operational databases. Reporting meant a human stitching those together on a schedule. That's slow, it doesn't scale, and — the part people underrate — it's fragile in a way nobody notices until it breaks. A late file or a renamed column quietly corrupts a number, and the first person to spot it is a stakeholder, not the team.

The goal wasn't a prettier dashboard. It was to take the entire loop — collect, validate, model, deliver — and make it run on its own, on a schedule, and tell us the moment something looked wrong.

The shape of the solution

The platform is an orchestrated pipeline with three stages. Every stage is scheduled, retried on failure, and observable from one console. That is the same operating model that production orchestrators such as Apache Airflow use to run workflows reliably.

Ingest — pull from files, APIs, line-of-business systems and operational databases through one ingestion layer.
Consolidate — validate and clean, then stage and load into a governed warehouse as modelled fact and dimension tables.
Deliver — generate reports from those governed datasets and ship them to wherever each audience lives.
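
As a rough illustration of the three-stage shape, here is a minimal sketch of a runner that chains stages and retries each one on failure. It is not the platform's actual code: the names run_stage and run_pipeline, the retry count and the delay are all assumptions made for the example, and a real orchestrator would add scheduling and logging on top.

```python
import time
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]


def run_stage(stage: Stage, payload: Any, retries: int = 3, delay_s: float = 0.0) -> Any:
    """Run one stage, retrying on any exception up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return stage(payload)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay_s)
    name = getattr(stage, "__name__", repr(stage))
    raise RuntimeError(f"{name} failed after {retries} attempts") from last_error


def run_pipeline(stages: Sequence[Stage], payload: Any = None, retries: int = 3) -> Any:
    """Feed each stage's output into the next: ingest -> consolidate -> deliver."""
    for stage in stages:
        payload = run_stage(stage, payload, retries=retries)
    return payload
```

Calling run_pipeline([ingest, consolidate, deliver]) with three ordinary functions runs them in order. A stage that fails once and then succeeds is simply retried, and the downstream stages never see the failure.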

Ingestion and the warehouse

The hardest part of ingestion is that real sources are heterogeneous and badly behaved. The answer was a single ingestion layer that treats files, APIs and databases uniformly, so a new source plugs in through configuration rather than a bespoke script each time.
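
One way to get "configuration rather than a bespoke script" is a small registry that maps a source kind to a reader function. The sketch below assumes that approach. The kind names (csv_file, json_file), the config keys and the ingest function are hypothetical, and API or database readers would register in the same way.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]
Reader = Callable[[dict], List[Row]]

# Maps a source "kind" from configuration to the function that reads it.
READERS: Dict[str, Reader] = {}


def reader(kind: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader function under a source kind."""
    def register(fn: Reader) -> Reader:
        READERS[kind] = fn
        return fn
    return register


@reader("csv_file")
def read_csv_file(cfg: dict) -> List[Row]:
    with Path(cfg["path"]).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


@reader("json_file")
def read_json_file(cfg: dict) -> List[Row]:
    # Assumes the file holds a JSON array of flat objects.
    return json.loads(Path(cfg["path"]).read_text(encoding="utf-8"))


def ingest(source: dict) -> List[Row]:
    """Dispatch on the configured kind; every reader returns a list of rows."""
    kind = source["kind"]
    if kind not in READERS:
        raise ValueError(f"no reader registered for kind {kind!r}")
    return READERS[kind](source)
```

With this shape, onboarding a new vendor file is a new config entry such as {"kind": "csv_file", "path": "..."}, and only a genuinely new kind of source needs a new reader.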

Before anything touches the warehouse it passes data-quality gates: type, range and completeness checks. Rows that fail are quarantined, not silently dropped. That matters: a dropped row is an invisible error, while a quarantined row is a visible decision. Design for the second one. Clean data lands in versioned warehouse tables that become the single, auditable source of truth for every downstream report. Shape the fact and dimension tables once and reuse them everywhere.
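
To make the quarantine idea concrete, here is a minimal sketch of a gate that applies completeness, type and range checks and keeps every failing row together with the reasons it failed. The column names (sku, qty) and the range limit are invented for the example and are not the platform's real rules.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

REQUIRED = ("sku", "qty")   # hypothetical required columns
MAX_QTY = 100_000           # hypothetical upper bound for the range check


def check_row(row: Row) -> List[str]:
    """Return the reasons a row fails; an empty list means it passes."""
    reasons = []
    # Completeness: required columns must be present and non-empty.
    for col in REQUIRED:
        if row.get(col) in (None, ""):
            reasons.append(f"missing {col}")
    qty = row.get("qty")
    if qty not in (None, ""):
        # Type: qty must parse as an integer.
        try:
            qty_int = int(qty)
        except (TypeError, ValueError):
            reasons.append("qty is not an integer")
        else:
            # Range: only checked once the type check has passed.
            if not 0 <= qty_int <= MAX_QTY:
                reasons.append("qty out of range")
    return reasons


def gate(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (clean, quarantined); nothing is dropped."""
    clean, quarantined = [], []
    for row in rows:
        reasons = check_row(row)
        if reasons:
            quarantined.append({**row, "_reasons": reasons})
        else:
            clean.append(row)
    return clean, quarantined
```

The useful invariant is that len(clean) + len(quarantined) always equals the number of input rows, so a row can only leave the pipeline through an explicit decision.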

Delivery: meet people where they already are

A report nobody opens isn't a report. So rather than force one format, the platform generates from one set of governed datasets and delivers through whatever channel the audience prefers. A single run can email a summary, attach an Excel and a PDF, and refresh a Power BI dataset — no analyst assembling packs by hand.

Email — scheduled summaries that land before the working day starts.
Power BI — refreshed datasets for the people who live in dashboards.
Excel — workbooks for the people who, reasonably, still want to pivot it themselves.
PDF — fixed documents for records and distribution.
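
A minimal sketch of that fan-out, assuming each channel is just a function that renders the same dataset: the dataset is computed once, each channel is rendered once, and audiences subscribe to channels by name. The channel functions here are stand-ins (a text summary and a CSV export in place of a real email, Excel, PDF or Power BI integration), and every name is hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

Dataset = List[dict]
Channel = Callable[[Dataset], str]


def email_summary(dataset: Dataset) -> str:
    """Stand-in for an email body: a one-line summary."""
    total = sum(row["qty"] for row in dataset)
    return f"{len(dataset)} rows, total qty {total}"


def csv_workbook(dataset: Dataset) -> str:
    """Stand-in for a workbook export: the dataset as CSV text."""
    if not dataset:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(dataset[0].keys()))
    writer.writeheader()
    writer.writerows(dataset)
    return buffer.getvalue()


# Adding a channel means adding one entry here, not touching the datasets.
CHANNELS: Dict[str, Channel] = {"email": email_summary, "workbook": csv_workbook}


def deliver(dataset: Dataset, subscriptions: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Render each needed channel once, then hand it to every subscriber."""
    needed = {name for names in subscriptions.values() for name in names}
    rendered = {name: CHANNELS[name](dataset) for name in needed}
    return {
        audience: {name: rendered[name] for name in names}
        for audience, names in subscriptions.items()
    }
```

For example, deliver(dataset, {"sales": ["email"], "finance": ["email", "workbook"]}) renders the email summary once and gives both audiences the same text.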

Orchestration and monitoring — the part that earns trust

Automation you can't see is automation you can't trust. Everything runs on a schedule and reports its own health to a built-in web console that shows every pipeline: when it last ran, how long it took, and whether it succeeded. The whole operation is observable at a glance, the way an orchestrator's run grid surfaces a failure the moment it happens.
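
The three facts the console shows (when a pipeline last ran, how long it took, and whether it succeeded) can be captured by wrapping every run in a small recorder. The sketch below is an assumption about how that could look, not the console's real implementation. RunRecord, observed_run and the in-memory RUN_LOG are illustrative names, and a real system would persist the log.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    pipeline: str
    started_at: float   # wall-clock start, seconds since the epoch
    duration_s: float
    succeeded: bool
    error: str = ""


RUN_LOG: List[RunRecord] = []   # in-memory stand-in for a runs table


def observed_run(pipeline: str, job: Callable[[], None]) -> RunRecord:
    """Run a job and record its outcome whether it succeeds or fails."""
    started_at = time.time()
    t0 = time.perf_counter()
    try:
        job()
        record = RunRecord(pipeline, started_at, time.perf_counter() - t0, True)
    except Exception as exc:
        record = RunRecord(pipeline, started_at, time.perf_counter() - t0, False, repr(exc))
    RUN_LOG.append(record)
    return record


def latest_status(log: List[RunRecord]) -> Dict[str, RunRecord]:
    """Most recent record per pipeline, which is what a status page would show."""
    latest: Dict[str, RunRecord] = {}
    for record in log:   # later entries overwrite earlier ones
        latest[record.pipeline] = record
    return latest
```

Because failures are recorded rather than raised, one broken pipeline shows up as a red entry without stopping the others from reporting.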

Two design choices did most of the work here:

Idempotent, retryable stages. A run can fail halfway, retry, and not double-count or corrupt anything. Failure becomes routine and recoverable instead of a fire drill.
Graceful degradation. If a source is late — say an inventory API doesn't respond — the run serves the last good load and flags it, so reports still ship while the team is alerted. Stale-but-flagged beats missing.
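
Both choices can be sketched in a few lines. The example below assumes a SQLite table keyed on a natural key, so that re-running a load overwrites rows instead of duplicating them, and a simple in-memory cache of the last good load per source. The table name fact_sales, the function names and the cache are hypothetical, and the platform's real warehouse and storage are not specified in this write-up.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

SCHEMA = "CREATE TABLE IF NOT EXISTS fact_sales (order_id INTEGER PRIMARY KEY, qty INTEGER)"


def load_idempotent(conn: sqlite3.Connection, rows: List[Tuple[int, int]]) -> None:
    """Upsert on the primary key, so replaying the same rows changes nothing."""
    conn.executemany(
        "INSERT OR REPLACE INTO fact_sales (order_id, qty) VALUES (?, ?)", rows
    )
    conn.commit()


LAST_GOOD: Dict[str, list] = {}   # stand-in for a persisted last-good snapshot


def fetch_or_last_good(name: str, fetch: Callable[[], list]) -> Tuple[list, bool]:
    """Return (data, is_stale). Fall back to the last good load if the fetch fails."""
    try:
        data = fetch()
    except Exception:
        if name not in LAST_GOOD:
            raise   # nothing to fall back to, so surface the failure
        return LAST_GOOD[name], True
    LAST_GOOD[name] = data
    return data, False
```

The is_stale flag is what lets a report ship with a visible "data as of last good load" note while the alert goes to the team.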

The mindset shift that made this reliable: treat reporting as a product, not a task. Products have SLAs, health checks, and alerting. Tasks have a person who remembers to do them. Only one of those scales.

What it runs at

In production the platform ingests around 100 data sources and executes roughly 500 pipeline runs per day, feeding tens of reports across four delivery channels, monitored 24/7. Because every run is validated, scheduled and observed, reporting became both faster and more trustworthy, two goals that usually pull in opposite directions.

If you're attempting the same thing

Three things I'd tell my earlier self:

Validate at the door. The cheapest place to catch bad data is before it enters the warehouse. Quarantine, don't drop.
Make it observable on day one. A run grid that shows red the instant something breaks is worth more than a marginally cleverer transformation.
Decouple delivery from computation. Compute the dataset once; let email, Excel, PDF and BI all read from it. New channels become config, not code.

The payoff isn't really the time saved, though there's plenty of it. It's that the team's attention moves up the stack — from assembling reports to interpreting them. That's the whole point of analytics, and the plumbing is what frees you to do it.