When 125 Million Item-Locations Lied To Us: What Retail Forecasting Taught Me About Data Truth

Written by aithabalakrishna | Published 2025/12/24
Tech Story Tags: ecommerce | retail-forecasting | big-data | big-data-analytics | big-data-analytics-tools | big-data-pitfalls | dynamic-pricing | static-pricing

TL;DR: Retail forecasting breaks when item-location data disagrees across systems, even when models look accurate. Rebuilding a single, lineage-driven item-location spine restores consistency, improves accuracy, and shortens reconciliation cycles. Once the data stops contradicting itself, the forecasts finally start to behave.

There is a number that still makes me uncomfortable when I say it out loud.

One hundred and twenty-five million.

That is roughly how many item-location combinations a large national retailer must track across physical stores, fulfillment centers, and digital channels. Each pair has its own rhythm, demand curve, lead-time quirks, and personal definition of “too much” or “not nearly enough.”

For years, we pretended this entire universe could be managed with legacy forecasting tools, mainframe workflows, and an unspoken dependency on buyers patching gaps manually. The surface looked stable: the forecasts looked reasonable, the dashboards loaded, and the pipelines ran perfectly fine.

Until the numbers stopped agreeing with each other.

Buyers would call with contradictions: “The system says I need to buy more of this item, but it has barely moved.” Or the opposite: “It says I am overstocked. What do I tell the customers?” The models were not malfunctioning; the data was. And once we finally examined things closely, the data felt like it was staring back and saying, “You have no idea what I actually mean.”

This is the story of how we rebuilt truth at scale.

The Illusion of Accuracy

When I stepped into a major forecasting and replenishment modernisation program as a Lead Data Engineer, the setup looked fine on paper: a commercial forecasting suite handled predictions, a legacy platform stitched together historical demand, promotions, returns, and inventory positions, mainframe feeds delivered nightly snapshots that had worked for years, and buyers filled in the rest with experience.

But the accuracy numbers told a different story. Forecast accuracy hovered around 68%, with predictable distortions depending on category. Inventory counts drifted across systems, and reconciliation cycles stretched into five-day loops of correction and backtracking. Buyers routinely spent 4–6 hours a week massaging inventory positions that simply did not match what they saw on the floor.

The easy conclusion would have been: We just need a better model. The honest conclusion was simpler and far more painful: we did not have one definition of reality. Ask three systems for the on-hand quantity of a single SKU in a single store and you could easily receive three different answers. Depending on which one you trusted, the forecast could look either perfectly acceptable or catastrophically wrong.

Tuning a model on top of conflicting truths is just a faster way of amplifying contradictions.

The Moment Everything Broke Open

The turning point was not a dramatic incident or a leadership escalation. It was a quiet inventory audit.

We picked one SKU in one location and traced a full year of its activity. The item master claimed continuous activity for fifty-two weeks. The store ledger disagreed on multiple intervals. Transfers appeared in one system and evaporated in another. Digital demand was recorded in one place but omitted elsewhere. On-hand counts varied depending on whether you trusted the store system, the warehouse system, or the nightly aggregated snapshot.

At a high level of aggregation, the numbers looked “reasonable.” At the item-location level, the story fragmented instantly. That was the moment the smallest grain stopped being a technical detail and became the foundation. If we could not reconstruct what happened to one item at one location on one date, then everything above that grain (forecasts, replenishment rules, safety stock, buy plans) was just a layered guess.

Rebuilding the Spine

We made a decision that felt simple when stated, but brutal in practice:

The system only gets to speak about what it can express truthfully at item-location granularity with full lineage.

Everything else would derive from that.

Rebuilding the spine meant starting fresh on cloud infrastructure using analytical warehousing, streaming ingestion, and orchestrated transformation pipelines. BigQuery became the storage layer for unified item-location facts. Dataflow handled large-scale correction, transformation, and replay of historical feeds. Pub/Sub carried sales, returns, transfers, and shipment events as near-real-time signals. Composer enforced dependency order and stitched the entire cycle together.
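
To make that concrete, here is a minimal sketch of what the ingestion edge of such a spine can look like as a streaming Dataflow (Apache Beam) job in Python: raw events arrive on Pub/Sub and are appended as item-location facts in BigQuery. The topic, table, schema, and field names below are illustrative assumptions, not the production design.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names, for illustration only.
TOPIC = "projects/my-project/topics/item-location-events"
TABLE = "my-project:inventory.item_location_events"
SCHEMA = (
    "item_id:STRING, location_id:STRING, event_type:STRING, "
    "quantity:INT64, event_ts:TIMESTAMP, source_system:STRING"
)

def parse_event(raw: bytes) -> dict:
    """Decode a Pub/Sub message into a flat item-location event row."""
    event = json.loads(raw.decode("utf-8"))
    return {
        "item_id": event["item_id"],
        "location_id": event["location_id"],
        "event_type": event["event_type"],  # sale, return, transfer, shipment
        "quantity": int(event["quantity"]),
        "event_ts": event["event_ts"],
        "source_system": event["source_system"],
    }

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteFacts" >> beam.io.WriteToBigQuery(
                TABLE,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```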

The goal was not to forecast. The goal was to narrate.

We created a master dataset that could retell the full story of every item-location pair: sales, returns, price changes, promotions, vendor behaviour, store attributes, and inventory flows. When two upstream systems disagreed on a quantity, the pipeline had to decide which source would win, or surface the contradiction explicitly instead of burying it. Late-arriving data became a first-class design concern. Historical replays were built into the architecture, not bolted on.
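
As an illustration of the “which source wins” rule, here is a minimal Python sketch of source-precedence resolution for a single on-hand quantity. The source names, ranking, and tolerance are assumptions; the point is that losing readings come back as explicit conflicts to surface, not values to silently discard.

```python
from dataclasses import dataclass

# Illustrative precedence only: lower rank wins. The real order is a business decision.
SOURCE_RANK = {"store_ledger": 0, "warehouse_system": 1, "nightly_snapshot": 2}

@dataclass
class Reading:
    source: str
    on_hand: int

def resolve_on_hand(readings: list[Reading], tolerance: int = 0):
    """Pick the highest-precedence reading, and return the readings that
    contradict it so the conflict can be surfaced instead of buried."""
    if not readings:
        return None, []
    ranked = sorted(readings, key=lambda r: SOURCE_RANK.get(r.source, 99))
    winner = ranked[0]
    conflicts = [r for r in ranked[1:] if abs(r.on_hand - winner.on_hand) > tolerance]
    return winner, conflicts

# Example: three systems, three answers for one SKU in one store.
winner, conflicts = resolve_on_hand([
    Reading("nightly_snapshot", 42),
    Reading("store_ledger", 37),
    Reading("warehouse_system", 44),
])
print(winner)     # Reading(source='store_ledger', on_hand=37)
print(conflicts)  # the disagreeing readings, surfaced for review
```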

What surprised us was how normal late-arriving corrections were. The original system treated first-seen data as final. Retail does not work that way. Inventory adjusts. Shipments slip. Returns arrive out of sequence. A pipeline that assumes the first version of truth is permanent quietly trains the forecast to trust misinformation.
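
Here is a minimal sketch of that idea in Python, assuming each fact carries a stable event_id and an ingestion timestamp (both assumptions for illustration): the latest-arriving version of a fact replaces the first-seen one, so corrections are applied rather than ignored.

```python
from typing import Iterable

def latest_versions(events: Iterable[dict]) -> dict:
    """Collapse an event stream so that a late-arriving correction replaces
    the first-seen version of the same fact, rather than being dropped
    because 'we already have that record'."""
    current: dict[tuple, dict] = {}
    for e in events:
        key = (e["item_id"], e["location_id"], e["event_id"])
        seen = current.get(key)
        # Arrival time decides which version of the same fact is current.
        if seen is None or e["ingested_at"] > seen["ingested_at"]:
            current[key] = e
    return current

events = [
    {"item_id": "SKU-12345", "location_id": "S0417", "event_id": "shp-9",
     "quantity": 120, "ingested_at": "2025-01-05T02:00:00Z"},
    # Correction arrives three days later: the shipment was short.
    {"item_id": "SKU-12345", "location_id": "S0417", "event_id": "shp-9",
     "quantity": 96, "ingested_at": "2025-01-08T02:00:00Z"},
]
print(latest_versions(events))  # only the corrected quantity (96) survives
```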

The First Time the System Said “Not Today”

One of the earliest tests of the new platform was not a success case. It was a disagreement.

A buyer flagged what they believed was a bad recommendation: an unexpectedly high buy quantity for a cluster of regional stores. Previously, they would have overridden the number and moved on. This time, the platform could show its work.

We walked through the item’s timeline: fifty-two weeks of demand patterns, a few outlier spikes tied to promotions, revised lead times after a vendor delay, and safety stock adjustments triggered by repeated stockouts. The system also surfaced exact on-hand, in-transit, and on-order quantities from reconciled sources.
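
For illustration only, the context the platform surfaced can be pictured as a single structured record attached to the recommendation. The field names and values below are hypothetical, but the shape mirrors what the buyer walked through.

```python
# Hypothetical "show your work" payload for one recommendation.
# None of these field names or values are the real schema.
explanation = {
    "item_id": "SKU-12345",
    "location_cluster": "regional-stores-ne",
    "demand_history_weeks": 52,
    "promo_outlier_weeks": ["2025-W14", "2025-W27"],        # spikes tied to promotions
    "lead_time_days": {"previous": 14, "revised": 21},      # after a vendor delay
    "safety_stock_units": {"previous": 40, "revised": 65},  # after repeated stockouts
    "positions": {"on_hand": 310, "in_transit": 120, "on_order": 200},
    "recommended_buy_qty": 480,
}
```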

For years, that context had lived across a jumble of spreadsheets and internal tools. Now it lived in one narrative thread.

The buyer did not immediately agree, but the nature of the disagreement changed. Instead of fighting the output, we were now debating the replenishment rule behind it, and the disagreements shifted from “your number is wrong” to “should this logic exist?” That was new. In fact, that was progress.

Measuring Impact Without Hiding Behind Models

Once the spine stabilised, the forecasting layer became almost boring. We did not need exotic algorithms. We needed reliable inputs.

Over extended operational cycles, the improvements became clearly measurable:

Forecast accuracy rose from 68% to roughly 85%, bringing us into parity with peer retailers operating at enterprise scale. Inventory-related savings reached approximately $6M annually, with projected benefits of $30M over a five-year horizon. Manual reconciliation shrank from five days to two, eliminating nearly 60% of the cycle time. Buyers reclaimed 2–4 hours per week previously spent on correcting mismatched inventory positions. And in some flows, reconciliation across 16 million products and more than 600 locations compressed from multi-day loops to roughly a single day.

None of these gains came from a new model. They came from eliminating contradictory histories.

What We Underestimated

The hardest part was not the pipeline. It was trust.

When we surfaced discrepancies, the initial reaction from buyers was predictable: “Your new system is broken.” From their perspective, their spreadsheets had worked. What they could not see were the silent costs: markdowns, lost sales, extra freight, and weeks spent stitching together mismatched numbers.

If I could redo one thing, I would introduce comparative truth views earlier. When we finally built dashboards showing the old view, the new view, where they differed, and why one source outranked another, conversations became collaborative instead of adversarial. It stopped feeling like a verdict and started feeling like debugging.

That should have been part of Month One, not Year Two.
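
A comparative truth view does not need to be sophisticated. Here is a minimal Python sketch, assuming keyed rows from the legacy system and the new spine (field names are assumptions): it emits one row per item-location where the two disagree, alongside the winning source and the reason it outranks the other.

```python
def truth_diff(legacy_rows: dict, spine_rows: dict) -> list[dict]:
    """Comparative truth view: for each item-location where the legacy
    system and the new spine disagree, show both values, the winning
    source, and why it wins. Field names here are illustrative."""
    diffs = []
    for key, spine in spine_rows.items():
        legacy_on_hand = legacy_rows.get(key, {}).get("on_hand")
        if legacy_on_hand != spine["on_hand"]:
            item_id, location_id = key
            diffs.append({
                "item_id": item_id,
                "location_id": location_id,
                "legacy_on_hand": legacy_on_hand,
                "spine_on_hand": spine["on_hand"],
                "winning_source": spine["source"],
                "reason": spine["reason"],
            })
    return diffs

# Example: the legacy snapshot and the reconciled spine disagree on one store.
legacy = {("SKU-12345", "S0417"): {"on_hand": 42}}
spine = {("SKU-12345", "S0417"): {"on_hand": 37, "source": "store_ledger",
                                  "reason": "store ledger outranks nightly snapshot"}}
print(truth_diff(legacy, spine))
```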

What This Changed About How I See Data Engineering

This project rewired how I think about data platforms, especially those feeding AI systems in retail.

Models are never the main event; truth is. If item-location consistency is broken, the model becomes a loudspeaker for errors. Granularity is a design philosophy, not an implementation detail. Late-arriving facts are normal in retail, and architects must expect them. And trust is not an output metric; it is a prerequisite for the system to matter.

Data engineering is less about pipelines and more about truth maintenance. As AI layers enter forecasting and replenishment workflows, that responsibility becomes heavier, not lighter. You can attach a model to almost any dataset. You cannot attach trust to a system that contradicts itself.

Putting It Together

Every dataset ultimately flows to a human who must make a decision. If the underlying truth is fractured, that decision becomes guesswork dressed up as analytics.

For us, modernising forecasting was not about predicting the future better. It was about stopping 125 million item-locations from telling different stories. The cloud mattered. The models mattered. But the discipline that changed everything was older and simpler:

Decide what is true, enforce it everywhere, then build everything else on top of it.


