Proving training works, in dollars

This whitepaper describes methodology and patterns from PeopleAnalytics.AI's engagement work on matched-cohort ROI analysis for learning and development programs. Engagement details are anonymised by design; no specific client outcomes are claimed. Published figures (ATD, SHRM, Kirkpatrick Model literature) are cited where load-bearing. Numbers drawn from the demo environment are labelled as demo measurements against synthetic data.


Summary

L&D budgets get cut first because nobody can prove they work. Kirkpatrick Levels 1 and 2 (reaction, learning) are easy and don't address the question Finance is actually asking. Level 4 (business results) is the question, and most organisations don't attempt it because the methodology is harder than completion-rate tracking. This whitepaper describes a matched-cohort approach that gets from completion data to defensible program-level ROI, the trade-offs involved, and what the analysis can and can't claim.


The problem

The annual L&D budget review tends to run the same way. The head of L&D presents completion rates in the high eighties and satisfaction scores above four out of five. The CFO asks what the business impact is. L&D points to a Kirkpatrick Level 1 number. The CFO says that wasn't what she asked. The budget gets flat-funded — which, for a function whose costs grow with headcount, is effectively a cut.

L&D leaders often have strong intuitions about which of their programs work. They've watched graduates of a leadership accelerator advance faster than their peers. They've seen a manager-effectiveness program change how certain teams run meetings. They just can't prove any of it with the analytics their LMS vendors give them. Vendor-provided analytics typically stop at completion, assessment score, and self-reported behaviour change — Kirkpatrick Levels 1 and 2, the easy levels. Level 3 (behaviour in the work environment) is hand-waved. Level 4 (business results) is not attempted.

The uncomfortable truth: L&D has been a faith-based budget item at many organisations because measuring it properly is hard, and everyone involved has an incentive to keep the measurement vague.

Why this is hard

The right answer to "does this training program work?" is a randomised controlled trial. You can't run an RCT on most corporate training programs — the organisation won't hold high-performing employees back from development opportunities, and even if it would, the treatment group would know they were the treatment group, which breaks the design.

The approximate answer is matched-cohort analysis. Pair trained employees with similar untrained employees on the observable variables — department, role, tenure, performance rating, manager — and compare their outcomes twelve or eighteen months later. Matched cohorts aren't causal, but they're a lot closer to causal than a completion-rate chart.

Two traps recur in the matched-cohort approach.

Selection bias. The employees who get into an elite leadership accelerator are often the ones their managers already think will advance. Comparing them to a random untrained cohort overstates the program's impact. The matching has to be on propensity to be selected, not just demographics.

Attribution. If a leadership accelerator graduate stays with the company two years longer than her match, is that the program — or is that the fact that the company invested in her and she felt seen? Probably some of both, and the decomposition isn't recoverable from the data alone. The honest move is to say so in the output, not to claim precision you don't have.

The approach

The system is implemented in src/app/demos/ld-roi/ with five tabs — Portfolio, Kirkpatrick, Cohort, Pipeline, Business Impact — and a data-flow diagram rendered in DataPipeline.tsx. LMS data lives in BigQuery under the meridian_lms dataset (courses and enrollments). Employee, performance, compensation, and goal-completion data live in DynamoDB. The two are joined server-side via src/app/demos/ld-roi/data/bigquery-lms.ts with a 60-minute cache on the BigQuery side to control query cost.
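The 60-minute cache on the BigQuery side can be sketched as a simple TTL wrapper. The names here (`fetchEnrollments`, `queryBigQuery`, the row shape) are illustrative, not the actual exports of `bigquery-lms.ts`:

```typescript
// Minimal sketch of the server-side fetch with a TTL cache in front of
// BigQuery, to control query cost. Within the TTL, repeat calls are served
// from memory and BigQuery is never hit.

type Enrollment = { employeeId: string; courseId: string; completedAt: string };

const CACHE_TTL_MS = 60 * 60 * 1000; // 60 minutes

let cache: { rows: Enrollment[]; fetchedAt: number } | null = null;

async function fetchEnrollments(
  queryBigQuery: () => Promise<Enrollment[]>
): Promise<Enrollment[]> {
  const now = Date.now();
  if (cache && now - cache.fetchedAt < CACHE_TTL_MS) {
    return cache.rows; // cache hit: no BigQuery cost
  }
  const rows = await queryBigQuery();
  cache = { rows, fetchedAt: now };
  return rows;
}
```

The DynamoDB side is cheap per-read and is queried fresh; only the BigQuery half needs the cache.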

The architecture is medallion: bronze (raw enrollment rows from the LMS), silver (deduped, enriched with employee attributes), gold (cohort-level aggregates and program ROI estimates). Clients can switch the UI between layers (?layer=bronze|silver|gold) because different audiences want different resolutions. Data engineers want bronze. L&D leads want gold.
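The bronze-to-silver step is dedupe plus enrichment. A hedged sketch, with assumed field names and the assumption that duplicate enrollments keep the latest completion and rows without a matching employee record are dropped:

```typescript
// Bronze→silver sketch: dedupe raw LMS enrollment rows on
// (employeeId, courseId), keeping the most recent completion, then enrich
// each surviving row with employee attributes. Shapes are illustrative.

type BronzeRow = { employeeId: string; courseId: string; completedAt: string };
type Employee = { employeeId: string; department: string; tenureYears: number };
type SilverRow = BronzeRow & { department: string; tenureYears: number };

function toSilver(
  bronze: BronzeRow[],
  employees: Map<string, Employee>
): SilverRow[] {
  // Dedupe: keep the latest row per (employee, course) pair.
  const latest = new Map<string, BronzeRow>();
  for (const row of bronze) {
    const key = `${row.employeeId}:${row.courseId}`;
    const prev = latest.get(key);
    if (!prev || row.completedAt > prev.completedAt) latest.set(key, row);
  }
  // Enrich: join employee attributes; drop rows with no employee match.
  const out: SilverRow[] = [];
  for (const row of latest.values()) {
    const emp = employees.get(row.employeeId);
    if (emp) out.push({ ...row, department: emp.department, tenureYears: emp.tenureYears });
  }
  return out;
}
```

Gold is then an aggregation over silver, so the layer switch is a choice of which materialisation the UI reads, not a different pipeline.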

Kirkpatrick levels are modelled explicitly. Levels 1 (reaction) and 2 (learning) come from LMS data. Level 3 (behaviour) is proxied by manager-effectiveness scores (peopleanalytics-manager-effectiveness) and observed changes in performance rating (peopleanalytics-performance-reviews). Level 4 (results) is the cohort analysis: retention delta, performance-rating delta, compensation-trajectory delta, and goal-completion-rate delta for trained versus matched-untrained populations, measured over a twelve-month window.
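Modelling the levels explicitly can be as simple as a typed mapping from level to data source, with the proxy status of Level 3 carried in the structure rather than in a footnote. The shape below is illustrative:

```typescript
// Level-to-source mapping for Kirkpatrick reporting. The `direct` flag marks
// whether the level is measured directly or via proxy; Level 3 is the proxy.

type KirkpatrickLevel = 1 | 2 | 3 | 4;

const levelSources: Record<
  KirkpatrickLevel,
  { measures: string; source: string; direct: boolean }
> = {
  1: { measures: "reaction", source: "LMS satisfaction scores", direct: true },
  2: { measures: "learning", source: "LMS assessment scores", direct: true },
  3: {
    measures: "behaviour",
    source: "manager-effectiveness scores + performance-rating changes",
    direct: false, // proxy-based, not direct observation
  },
  4: {
    measures: "results",
    source: "matched-cohort deltas over a 12-month window",
    direct: true,
  },
};
```

Carrying `direct` in the data means the UI can flag proxy-based levels mechanically instead of relying on report authors to remember the caveat.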

We don't train a model. The analysis is deterministic: propensity-score matching on department, role, tenure band, baseline performance, and manager effectiveness, followed by a comparison of outcomes over a defined window. The trade-off is worth stating. A gradient-boosted estimator would reduce variance and give tighter confidence intervals. It would also add complexity and obscure the arithmetic Finance wants to see. At the sample sizes most in-house L&D programs run — hundreds of participants per program — the gains from a model are outweighed by the loss of legibility. Legibility wins.
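The matching step can be sketched in a few dozen lines: a small logistic regression estimates each employee's propensity to be selected for training, then each trained employee is greedily paired with the nearest untrained employee within a caliper. This is a simplified stand-in under assumed shapes, not the production matcher:

```typescript
// Deterministic propensity-score matching sketch. Features are assumed to be
// numeric encodings of department, role, tenure band, baseline performance,
// and manager effectiveness.

type Candidate = { features: number[]; trained: boolean; id: string };

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

// Fit logistic regression by batch gradient descent; the label is
// "was selected for training", so the fitted probability is the propensity.
function fitPropensity(rows: Candidate[], steps = 2000, lr = 0.1): number[] {
  const d = rows[0].features.length;
  const w = new Array(d + 1).fill(0); // last entry is the intercept
  for (let s = 0; s < steps; s++) {
    const grad = new Array(d + 1).fill(0);
    for (const r of rows) {
      const z = r.features.reduce((acc, x, j) => acc + w[j] * x, w[d]);
      const err = sigmoid(z) - (r.trained ? 1 : 0);
      for (let j = 0; j < d; j++) grad[j] += err * r.features[j];
      grad[d] += err;
    }
    for (let j = 0; j <= d; j++) w[j] -= (lr / rows.length) * grad[j];
  }
  return w;
}

function propensity(w: number[], features: number[]): number {
  return sigmoid(features.reduce((acc, x, j) => acc + w[j] * x, w[features.length]));
}

// Greedy 1:1 nearest-neighbour matching on propensity, without replacement,
// within a caliper; trained employees with no close match go unmatched.
function matchCohorts(rows: Candidate[], caliper = 0.1): Array<[string, string]> {
  const w = fitPropensity(rows);
  const scored = rows.map((r) => ({ ...r, p: propensity(w, r.features) }));
  const treated = scored.filter((r) => r.trained);
  const pool = scored.filter((r) => !r.trained);
  const pairs: Array<[string, string]> = [];
  for (const t of treated) {
    let best = -1;
    let bestGap = caliper;
    for (let i = 0; i < pool.length; i++) {
      const gap = Math.abs(pool[i].p - t.p);
      if (gap < bestGap) { best = i; bestGap = gap; }
    }
    if (best >= 0) { pairs.push([t.id, pool[best].id]); pool.splice(best, 1); }
  }
  return pairs;
}
```

Every step here is arithmetic a reviewer can re-derive by hand, which is the legibility property the paragraph above argues for.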

Contributive ROI (src/app/demos/ld-roi/lib/roiCalculations.ts) takes the cohort deltas and converts them to dollars using a configurable revenue-per-employee assumption, set by default to a benchmark the client can override. The word contributive is doing work: the calculation estimates the program's contribution to the observed outcome, not its causal effect. That distinction appears on every dollar figure in the UI, because if it doesn't, someone will quote the number without the caveat.
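The arithmetic reduces to cohort deltas times outcome-to-dollar conversions, netted against program cost. The conversion factors below are illustrative assumptions exposed as inputs, not the defaults in `roiCalculations.ts`:

```typescript
// Contributive-ROI sketch: every dollar figure is a cohort delta multiplied
// by an editable assumption, so a sceptical CFO can change the assumption
// and watch the estimate move.

type CohortDeltas = {
  retentionDelta: number;      // e.g. +0.05 = 5pp better retention vs matched cohort
  performanceDelta: number;    // rating-point improvement vs matched cohort
  goalCompletionDelta: number; // pp improvement in goal-completion rate
};

type Assumptions = {
  revenuePerEmployee: number;    // client-overridable benchmark
  replacementCostFactor: number; // cost of one departure, as a fraction of revenue/employee
  performancePointValue: number; // assumed $ value of one rating point
  goalPointValue: number;        // assumed $ value of 1pp of goal completion
};

function contributiveROI(
  deltas: CohortDeltas,
  cohortSize: number,
  programCost: number,
  a: Assumptions
): { contributionDollars: number; roi: number } {
  const retentionDollars =
    deltas.retentionDelta * cohortSize * a.revenuePerEmployee * a.replacementCostFactor;
  const performanceDollars = deltas.performanceDelta * cohortSize * a.performancePointValue;
  const goalDollars = deltas.goalCompletionDelta * cohortSize * a.goalPointValue;
  const contributionDollars = retentionDollars + performanceDollars + goalDollars;
  return { contributionDollars, roi: (contributionDollars - programCost) / programCost };
}
```

Note the output is a contribution estimate, not a causal effect; the label travels with the number.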

What the system produces

  • Program-level contributive-ROI estimates, with the Level-4 delta broken out by outcome (retention, performance, compensation, goal completion).
  • Matched-cohort detail views: which employees matched, on which features, with what propensity-score distributions.
  • Kirkpatrick-level reporting across all four levels, with Level 3 explicitly flagged as proxy-based rather than direct-observation.
  • Dollar figures that expose their assumptions (revenue per employee, outcome-to-dollar conversions) as editable inputs, so a CFO who disagrees can set her own and see the analysis update.

What the system does not produce:

  • Causal ROI. Matched-cohort analysis is quasi-experimental; it reduces selection bias but doesn't eliminate it.
  • Attribution across overlapping programs. An employee who took a leadership accelerator and a manager bootcamp gets counted in both, and the decomposition has to be handled with care.
  • Short-window results. The twelve-month window is baseline; shorter windows miss the effects the programs are designed to produce.

Patterns from engagement work

Invest in baseline measurement from day one of every program. Post-hoc cohort analysis is possible; pre-post with a baseline is much tighter, and the incremental cost at program enrolment is small. Engagements that skip baseline measurement have a worse Level-4 story and a harder time defending the budget when it's challenged.

Run matching on a broader feature set than you think you need. Picking features up front biases the result toward what the analyst assumed mattered. The defensible move is to run propensity matching on a broad set and let the data decide which features are doing work.

Do the crass thing: quantify in dollars. L&D leaders are often uncomfortable quantifying their own impact in dollars. It feels crass. Without the dollar figure, though, the budget conversation defaults to feelings, and feelings lose to spreadsheets. The whole point of contributive ROI is to enable the uncomfortable conversation.

Tiny cohorts are a measurement problem, not an analytics problem. Programs with fewer than about thirty participants per matched side don't produce stable estimates. The analysis flags this, and the right move is to combine cohorts across years, or to acknowledge that the program is below the measurement threshold and evaluate it some other way.
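The guard described above is a one-line check worth making explicit in code rather than leaving to analyst judgment. The threshold and return shape here are illustrative:

```typescript
// Small-cohort guard: refuse to report an unstable estimate when either
// matched side falls below the assumed 30-per-side threshold.

const MIN_PER_SIDE = 30;

function cohortStatus(
  trainedN: number,
  matchedN: number
): { ok: true } | { ok: false; reason: string } {
  const n = Math.min(trainedN, matchedN);
  if (n < MIN_PER_SIDE) {
    return {
      ok: false,
      reason: `Only ${n} per matched side (minimum ${MIN_PER_SIDE}); combine cohorts across years or evaluate this program another way.`,
    };
  }
  return { ok: true };
}
```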

Cultural programs are real and invisible to this tool. A leadership program that changes how the C-suite operates may be extremely valuable and not captured by the outcome variables in the HRIS. The system should be configured to say so rather than to pretend the tool covers everything.

Level 3 is the weak link. Manager effectiveness is a proxy, and a noisy one. A future version of this pattern adds a structured manager-observation instrument — a light 360 — at baseline and twelve months. Without that, Level 3 is the least defensible of the four levels, and we say so in the report.

Where this applies

This pattern works for organisations with a real LMS holding at least two years of participation history, an HRIS capturing performance and compensation, and programs large enough to produce matched cohorts of at least thirty per side. It works especially well for leadership development, manager effectiveness, and technical-certification programs whose outcomes are observable in the HRIS over a twelve-month horizon.

It does not work for programs with tiny cohorts, for organisations without a unified HRIS, or for programs whose value is primarily cultural and therefore not captured in the outcome variables the analysis measures.