Proving training works, in dollars

This whitepaper describes methodology and patterns from PeopleAnalytics.AI's engagement work on matched-cohort ROI analysis for learning and development programs. Engagement details are anonymised by design; no specific client outcomes are claimed. Published figures (ATD, SHRM, Kirkpatrick Model literature) are cited where load-bearing. Numbers drawn from the demo environment are labelled as demo measurements against synthetic data.


Summary

L&D budgets get cut first because nobody can prove they work. Kirkpatrick Levels 1 and 2 (reaction, learning) are easy and don't address the question Finance is actually asking. Level 4 (business results) is the question, and most organisations don't attempt it because the methodology is harder than completion-rate tracking. This whitepaper describes a matched-cohort approach that gets from completion data to defensible program-level ROI, the trade-offs involved, and what the analysis can and can't claim.


The problem

The annual L&D budget review tends to run the same way. The head of L&D presents completion rates in the high eighties and satisfaction scores above four out of five. Finance asks what the business impact is. L&D points to a Kirkpatrick Level 1 number. Finance says that wasn't what she asked. The budget gets flat-funded — which, for a function whose costs grow with headcount, is effectively a cut.

L&D leaders often have strong intuitions about which of their programs work. They've watched graduates of a leadership accelerator advance faster than their peers. They've seen a manager-effectiveness program change how certain teams run meetings. They just can't prove any of it with the analytics their LMS vendors give them. Vendor-provided analytics typically stop at completion, assessment score, and self-reported behaviour change — Kirkpatrick Levels 1 and 2, the easy levels. Level 3 (behaviour in the work environment) is hand-waved. Level 4 (business results) is not attempted.

The uncomfortable truth: L&D has been a faith-based budget item at many organisations because measuring it properly is hard, and everyone involved has an incentive to keep the measurement vague.

Why this is hard

The right answer to "does this training program work" is a randomised controlled trial. You can't run an RCT on most corporate training programs — the organisation won't hold high-performing employees back from development opportunities, and even if it would, the treatment group would know they were the treatment group, which breaks the design.

The approximate answer is matched-cohort analysis. Pair trained employees with similar untrained employees on the observable variables (department, role, tenure, performance rating, manager) and compare their outcomes twelve or eighteen months later. A matched-cohort comparison doesn't establish causation, but it gets a lot closer than a completion-rate chart.

Two traps recur in the matched-cohort approach.

Selection bias. The employees who get into an elite leadership accelerator are often the ones their managers already think will advance. Comparing them to a random untrained cohort overstates the program's impact. The matching has to be on propensity to be selected, not just demographics.

Attribution. If a leadership accelerator graduate stays with the company two years longer than her match, is that the program — or is that the fact that the company invested in her and she felt seen? Partly both, and the decomposition isn't recoverable from the data alone. The honest move is to say so in the output, not to claim precision you don't have.

The approach

The system is implemented in src/app/demos/ld-roi/ with five tabs — Portfolio, Kirkpatrick, Cohort, Pipeline, Business Impact — and a data-flow diagram rendered in DataPipeline.tsx. LMS data lives in BigQuery under the meridian_lms dataset (courses and enrollments). Employee, performance, compensation, and goal-completion data live in DynamoDB. The two are joined server-side via src/app/demos/ld-roi/data/bigquery-lms.ts with a 60-minute cache on the BigQuery side to control query cost.
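The 60-minute cache can be pictured as a small time-to-live wrapper around the query call. The sketch below is illustrative only: the demo's join lives in TypeScript, and the `TTLCache` class, the `loader` callable, and the cache key are hypothetical names, not taken from bigquery-lms.ts.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 60 * 60  # the 60-minute window described above


class TTLCache:
    """Tiny time-based cache: the loader re-runs only after the TTL expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh: no query is issued
        value = loader()             # missing or expired: run the costly query
        self._store[key] = (now, value)
        return value
```

A caller would wrap each warehouse query as `cache.get("enrollments", run_query)`, so repeated page loads within the hour reuse the stored rows instead of issuing a new query.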

The architecture is medallion: bronze (raw enrollment rows from the LMS), silver (deduped, enriched with employee attributes), gold (cohort-level aggregates and program ROI estimates). Clients can switch the UI between layers (?layer=bronze|silver|gold) because different audiences want different resolutions. Data engineers want bronze. L&D leads want gold.
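As a rough illustration of the three layers, the pandas sketch below builds a toy bronze table, derives silver by deduplicating and joining employee attributes, and derives gold by aggregating per program. The column names, sample rows, and the `select_layer` helper are assumptions made for the example; they are not the demo's actual schema.

```python
import pandas as pd

# Bronze: raw enrollment rows as exported, including a duplicate event.
bronze = pd.DataFrame({
    "employee_id": ["e1", "e1", "e2", "e3"],
    "course_id": ["lead-101", "lead-101", "lead-101", "mgr-201"],
    "completed": [True, True, False, True],
})

# Hypothetical employee attributes (held in DynamoDB in the described system).
employees = pd.DataFrame({
    "employee_id": ["e1", "e2", "e3"],
    "department": ["Sales", "Sales", "Engineering"],
})

# Silver: deduplicated and enriched with employee attributes.
silver = (
    bronze.drop_duplicates(subset=["employee_id", "course_id"])
    .merge(employees, on="employee_id", how="left")
)

# Gold: program-level aggregates.
gold = silver.groupby("course_id", as_index=False).agg(
    enrolled=("employee_id", "nunique"),
    completion_rate=("completed", "mean"),
)

LAYERS = {"bronze": bronze, "silver": silver, "gold": gold}


def select_layer(name: str) -> pd.DataFrame:
    """Mimics the ?layer=bronze|silver|gold switch; unknown names raise KeyError."""
    return LAYERS[name]
```

Keeping all three layers addressable is what lets a data engineer inspect the duplicate row in bronze while an L&D lead sees only the aggregated gold view.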

Kirkpatrick levels are modelled explicitly. Levels 1 (reaction) and 2 (learning) come from LMS data. Level 3 (behaviour) is proxied by manager-effectiveness scores (peopleanalytics-manager-effectiveness) and observed changes in performance rating (peopleanalytics-performance-reviews). Level 4 (results) is the cohort analysis: retention delta, performance-rating delta, compensation-trajectory delta, and goal-completion-rate delta for trained versus matched-untrained populations, measured over a twelve-month window.
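A minimal sketch of the Level 4 comparison, assuming the matched pairs are already known: each outcome delta is the mean trained-minus-control difference within pairs. The table layout, column names, and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical outcomes at twelve months; one row per employee, two per pair.
outcomes = pd.DataFrame({
    "pair_id":         [1, 1, 2, 2, 3, 3],
    "trained":         [True, False, True, False, True, False],
    "retained":        [1, 1, 1, 0, 1, 1],
    "perf_rating":     [4.0, 3.5, 3.5, 3.5, 4.5, 4.0],
    "comp_growth_pct": [5.0, 3.0, 4.0, 4.0, 6.0, 3.0],
    "goal_completion": [0.90, 0.80, 0.75, 0.70, 0.85, 0.80],
})

OUTCOME_COLS = ["retained", "perf_rating", "comp_growth_pct", "goal_completion"]


def level4_deltas(df: pd.DataFrame) -> pd.Series:
    """Mean trained-minus-control difference per outcome, computed within pairs."""
    trained = df[df["trained"]].set_index("pair_id")[OUTCOME_COLS]
    control = df[~df["trained"]].set_index("pair_id")[OUTCOME_COLS]
    # Subtraction aligns on pair_id, so each trained row meets its own match.
    return (trained - control).mean()


deltas = level4_deltas(outcomes)
print(deltas.round(3))
```

Computing the difference within pairs, rather than comparing two group means, keeps each delta tied to the matching and makes the pair-level values available for resampling later.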

We don't train a predictive model of outcomes. The only fitted component is a logistic regression that estimates each employee's propensity to be selected, from department, role, tenure band, baseline performance, and manager effectiveness. Given that score, the rest is deterministic: match on the propensity, then compare outcomes over a defined window. The trade-off is worth stating. A gradient-boosted estimator could reduce variance and tighten the confidence intervals. It would also add complexity and obscure the arithmetic Finance wants to see. At the sample sizes most in-house L&D programs run (hundreds of participants per program), the gains from a model are outweighed by the loss of legibility. Legibility wins.

Contributive ROI (src/app/demos/ld-roi/lib/roiCalculations.ts) takes the cohort deltas and converts them to dollars using a configurable revenue-per-employee assumption, set by default to a benchmark the client can override. The word contributive is doing work: the calculation estimates the program's contribution to the observed outcome, not its causal effect. That distinction appears on every dollar figure in the UI, because if it doesn't, someone will quote the number without the caveat.

The formula is:

cROI = (Outcome $ × Contribution %) / Training Cost

Where Outcome $ is the dollar value of the business outcome (retention lift, productivity lift, promotion-readiness lift) measured on the treated cohort; Contribution % is the share of that outcome attributable to training, estimated via the matched-cohort comparison against the propensity-scored control group; and Training Cost is fully loaded: program fee, plus participant time at the loaded compensation rate, plus facilitation. The UI surfaces a cROI band rather than a point estimate, using 0.4 and 0.7 as the sensitivity coefficients on Contribution %. Confidence intervals are bootstrapped (2,000 resamples) over the matched-cohort outcome delta; the full implementation ships in src/app/demos/ld-roi/analysis/ld_roi_matching.py and can be run independently to reproduce every number in the UI.
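A worked example may make the band concrete. The sketch below assumes the 0.4 and 0.7 coefficients are used directly as the low and high values of Contribution %; that reading is an assumption, and the dollar inputs, the `CroiInputs` type, and the function names are hypothetical. The demo's own calculation is in roiCalculations.ts, which this does not reproduce.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CroiInputs:
    outcome_dollars: float  # dollar value of the outcome on the treated cohort
    training_cost: float    # fully loaded: fee + participant time + facilitation


def croi(inputs: CroiInputs, contribution: float) -> float:
    """cROI = (Outcome $ x Contribution %) / Training Cost."""
    if inputs.training_cost <= 0:
        raise ValueError("training_cost must be positive")
    if not 0.0 <= contribution <= 1.0:
        raise ValueError("contribution must be a share between 0 and 1")
    return inputs.outcome_dollars * contribution / inputs.training_cost


def croi_band(inputs: CroiInputs, low: float = 0.4,
              high: float = 0.7) -> Tuple[float, float]:
    """Low and high cROI under the two assumed contribution shares."""
    return croi(inputs, low), croi(inputs, high)


# Illustrative inputs only: $500k outcome value against $200k fully loaded cost.
band = croi_band(CroiInputs(outcome_dollars=500_000, training_cost=200_000))
print(f"cROI band: {band[0]:.2f} to {band[1]:.2f}")
```

With these inputs the band runs from about 1.0 to about 1.75: at the low assumption the program roughly pays for itself, at the high one it returns about 1.75 dollars per dollar spent. A reviewer who disputes either contribution share can pass different `low` and `high` values and see the band move.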

What the system produces

  • Program-level contributive-ROI estimates, with the Level-4 delta broken out by outcome (retention, performance, compensation, goal completion).
  • Matched-cohort detail views: which employees matched, on which features, with what propensity-score distributions.
  • Kirkpatrick-level reporting across all four levels, with Level 3 explicitly flagged as proxy-based rather than direct-observation.
  • Dollar figures that expose their assumptions (revenue per employee, outcome-to-dollar conversions) as editable inputs, so a CFO who disagrees can set her own and see the analysis update.

What the system does not produce:

  • Causal ROI. Matched-cohort analysis is quasi-experimental; it reduces selection bias but doesn't eliminate it.
  • Attribution across overlapping programs. An employee who took a leadership accelerator and a manager bootcamp gets counted in both, and the decomposition has to be handled with care.
  • Short-window results. Twelve months is the baseline window; shorter windows miss the effects the programs are designed to produce.

Patterns from engagement work

Invest in baseline measurement from day one of every program. Post-hoc cohort analysis is possible; pre-post with a baseline is much tighter, and the incremental cost at program enrolment is small. Engagements that skip baseline measurement have a worse Level-4 story and a harder time defending the budget when it's challenged.

Run matching on a broader feature set than you think you need. Picking features up front biases the result toward what the analyst assumed mattered. The defensible move is to run propensity matching on a broad set of pre-enrolment features and let the data show which ones are doing work.

Do the crass thing: quantify in dollars. L&D leaders are often uncomfortable quantifying their own impact in dollars. It feels crass. Without the dollar figure, though, the budget conversation defaults to feelings, and feelings lose to spreadsheets. The whole point of contributive ROI is to enable the uncomfortable conversation.

Tiny cohorts are a measurement problem, not an analytics problem. Programs with fewer than about thirty participants per matched side don't produce stable estimates. The analysis flags this, and the right move is to combine cohorts across years, or to acknowledge that the program is below the measurement threshold and evaluate it some other way.
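The threshold check and the pooling option can be sketched in a few lines. The function names and yearly counts are hypothetical, and the cut-off of thirty is the approximate figure from the text, kept as a tunable parameter.

```python
from typing import Dict, Tuple

MIN_PER_SIDE = 30  # "about thirty" per matched side; treat as a tunable assumption


def is_measurable(n_trained: int, n_control: int,
                  min_per_side: int = MIN_PER_SIDE) -> bool:
    """True when both matched sides reach the minimum size."""
    return min(n_trained, n_control) >= min_per_side


def pool_years(counts_by_year: Dict[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum matched (trained, control) counts across program years."""
    trained = sum(t for t, _ in counts_by_year.values())
    control = sum(c for _, c in counts_by_year.values())
    return trained, control


# Hypothetical program: every yearly cohort is too small on its own.
yearly = {2022: (12, 12), 2023: (14, 14), 2024: (11, 11)}
flags = {year: is_measurable(t, c) for year, (t, c) in yearly.items()}
pooled = pool_years(yearly)
print(flags, pooled, is_measurable(*pooled))
```

Here no single year clears the threshold, while the pooled cohort of 37 per side does. Pooling only makes sense when the program's content and selection process were broadly stable across those years; otherwise the flag should stand.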

Cultural programs are real and invisible to this tool. A leadership program that changes how the C-suite operates may be extremely valuable and not captured by the outcome variables in the HRIS. The system should be configured to say so rather than to pretend the tool covers everything.

Level 3 is the weak link. Manager effectiveness is a proxy, and a noisy one. A future version of this pattern adds a structured manager-observation instrument — a light 360 — at baseline and twelve months. Without that, Level 3 is the least defensible of the four levels, and we say so in the report.

Where this applies

This pattern works for organisations with a real LMS holding at least two years of participation history, an HRIS capturing performance and compensation, and programs large enough to produce matched cohorts of at least thirty per side. It works especially well for leadership development, manager effectiveness, and technical-certification programs whose outcomes are observable in the HRIS over a twelve-month horizon.

It does not work for programs with tiny cohorts, for organisations without a unified HRIS, or for programs whose value is primarily cultural and therefore not captured in the outcome variables the analysis measures.

Methodology appendix: Python propensity-score matching

The cohort tab in the demo renders precomputed comparisons. The Python companion at src/app/demos/ld-roi/analysis/ld_roi_matching.py documents the statistical method behind those comparisons — the part Finance asks about when the dollar figures show up. It runs end-to-end on a deterministic synthetic Meridian-style cohort with a fixed seed, so the committed PNG and HTML artifacts under public/demos/ld-roi/ are reproducible by any reviewer with Python installed.

The pipeline is four steps:

  1. Generate an 1,800-employee cohort with a known per-trained-employee treatment effect of $2,200, intentionally confounded by the same covariates that drive enrolment so a naive comparison overstates the effect — that's the selection-bias story the matching is supposed to fix.
  2. Estimate propensities with statsmodels.api.Logit over department, tenure band, baseline performance, and manager effectiveness. The full regression summary (coefficients, z-stats, p-values, pseudo-R²) is printed at runtime so the model is auditable.
  3. Match 1:1 on the logit of the propensity with a 0.05 calliper, nearest-neighbour and without replacement. The implementation uses pandas and numpy directly so the matching loop is readable rather than hidden inside a third-party causal-inference library.
  4. Bootstrap the matched-pair outcome delta 2,000 times and report the 2.5th and 97.5th percentile range as a 95% confidence interval. The bootstrap distribution and CI lines are committed as roi_bootstrap_ci.png.
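Steps 1 to 3 can be sketched in plain numpy. This is not the companion script: it uses two continuous covariates instead of the full set, invented coefficients, and a hand-written Newton-Raphson logit fit in place of statsmodels.api.Logit so that the example has no dependency beyond numpy. The $2,200 effect, the 1,800-row cohort, and the 0.05 caliper follow the text; everything else is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
N, TRUE_EFFECT, CALIPER = 1800, 2200.0, 0.05

# Step 1: synthetic cohort; enrolment and outcome share drivers (confounding).
X = rng.normal(size=(N, 2))  # columns: baseline performance, manager effectiveness
enrol_logit = -0.5 + 0.8 * X[:, 0] + 0.5 * X[:, 1]
treated = rng.random(N) < 1.0 / (1.0 + np.exp(-enrol_logit))
outcome = (50_000 + 3_000 * X[:, 0] + 2_000 * X[:, 1]
           + TRUE_EFFECT * treated + rng.normal(scale=2_000, size=N))


def fit_logit(features: np.ndarray, y: np.ndarray, n_iter: int = 25) -> np.ndarray:
    """Logistic regression by Newton-Raphson; returns coefficients, intercept first."""
    A = np.column_stack([np.ones(len(features)), features])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (y - p)
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Step 2: estimated propensity, kept on the logit scale for matching.
beta = fit_logit(X, treated.astype(float))
score = np.column_stack([np.ones(N), X]) @ beta


def match_pairs(score: np.ndarray, treated: np.ndarray, caliper: float) -> np.ndarray:
    """Step 3: greedy 1:1 nearest-neighbour matching, without replacement."""
    t_idx, c_idx = np.flatnonzero(treated), np.flatnonzero(~treated)
    available = np.ones(len(c_idx), dtype=bool)
    pairs = []
    for t in t_idx:
        dist = np.abs(score[c_idx] - score[t])
        dist[~available] = np.inf       # each control can be used only once
        j = int(np.argmin(dist))
        if dist[j] <= caliper:          # drop trained rows with no close match
            pairs.append((t, c_idx[j]))
            available[j] = False
    return np.array(pairs)


pairs = match_pairs(score, treated, CALIPER)
naive = outcome[treated].mean() - outcome[~treated].mean()
matched = (outcome[pairs[:, 0]] - outcome[pairs[:, 1]]).mean()
print(f"pairs={len(pairs)}  naive={naive:,.0f}  matched={matched:,.0f}")
```

Because enrolment and outcome depend on the same covariates, the naive difference in means should land well above the planted $2,200, while the matched-pair delta should sit much closer to it. Trained employees with no control inside the caliper are dropped, which is why the pair count is lower than the number of trained employees.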

The matching is a non-parametric version of the design Finance is implicitly asking for when she asks "did this training cause the outcome?" — pair every trained employee with the most similar untrained employee on observables, look at the gap, and put a confidence interval on it. The committed matching_report.html documents the cohort, the propensity-model fit, the matched-pair count, the point delta, and the bootstrap CI; the same PNGs are surfaced in the L&D ROI demo's matched-cohort tab alongside a tool-stack badge that explicitly credits Python.
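The confidence-interval step is short enough to show in full. The sketch below is a plain percentile bootstrap over matched-pair deltas, assuming the deltas are already computed; the function name, the seed, and the synthetic deltas are placeholders, not output from the companion script.

```python
import numpy as np


def bootstrap_ci(pair_deltas, n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0):
    """Percentile bootstrap CI for the mean matched-pair outcome delta."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(pair_deltas, dtype=float)
    n = len(deltas)
    # Resample whole pairs with replacement; one row of indices per replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(lo), float(hi)


# Placeholder deltas standing in for trained-minus-control outcome gaps.
demo_deltas = np.random.default_rng(1).normal(loc=2200, scale=3000, size=500)
point, lo, hi = bootstrap_ci(demo_deltas)
print(f"delta={point:,.0f}  95% CI=({lo:,.0f}, {hi:,.0f})")
```

Resampling pairs rather than individual employees preserves the matching structure, so the interval reflects uncertainty in the paired comparison and not in two independent group means.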

The known limitations are flagged deliberately in the script's README: 1:1 matching without replacement is the simplest defensible strategy; a richer propensity specification (interactions, splines, or an XGBoost estimator) would be a reasonable extension; and bias-corrected and accelerated (BCa) bootstrap intervals would be a reasonable upgrade over the percentile method used here.

Why Python for this slice rather than extending the TypeScript demo: statsmodels.api.Logit produces a regression summary out of the box, pandas.pivot_table makes matched-pair deltas trivial, and numpy.quantile on a bootstrap array gives a defensible 95% CI in a few lines. Re-implementing the logit MLE in TypeScript to keep everything in one language is not a good use of time when this companion script is the one the data scientist actually reads.

The dependency pinning in requirements.txt mirrors the existing Python service in src/app/demos/workforce-forecast/forecast-api/requirements.txt. Source: ld_roi_matching.py · README · requirements.txt.