Proving the AI systems touching employee data are actually sound
This whitepaper describes methodology and patterns from PeopleAnalytics.AI's engagement work on AI-governance consoles for regulated environments. Engagement details are anonymised by design; no specific client outcomes are claimed. Compliance-score and cost figures in this document are measurements from THIS site's demo against its own synthetic operating state — they are labelled as such and are not client results.
Summary
Deploying LLM-backed HR systems into a regulated enterprise introduces a governance problem classical infrastructure doesn't have. Model behaviour is not deterministic. Guardrails can fail in ways that classical access controls cannot. Vendor dashboards report on the vendor's own model, not on how it behaves under the organisation's retrieval layer, policy library, and employee data. This whitepaper describes the methodology for a single console covering platform health, data quality, HIPAA/SOX/PII/GDPR posture, audit-event logs, LLM operations, and a live model-vs-model benchmark.
The problem
The CHRO wants to deploy an attrition model, a policy assistant, and an executive rollup. The CIO has a different set of questions: how do we know the model isn't leaking PHI into logs? If Legal subpoenas us for a specific HR decision made six months ago, can we reconstruct the conversation that produced it? What happens when the vendor changes the model version? What do we pay per query, and is that budget line predictable?
HR's people analytics team often has good answers for each question individually. Nobody has put them in one place. When the CIO asks for the health of the whole stack on one screen, the answer tends to be a combination of three dashboards, one spreadsheet, and an email thread with the vendor. That's the gap this pattern fills.
The underlying problem is specific to AI. You can run classical infrastructure for decades without a compliance dashboard and be fine, because the behaviour is deterministic and the risks are well understood. LLM-backed systems are not deterministic and the risks are not all well understood. A model that hallucinates a policy citation, a guardrail that lets a PHI request through, a prompt-injection attack via an employee field — these are risks that traditional IT governance was not designed to monitor.
Why this is hard
Vendors report on vendors. Anthropic reports Anthropic. OpenAI reports OpenAI. Neither reports on how your system — your retrieval layer, your policy library, your employee data — behaves under their model. A governance dashboard that isn't built on your own telemetry is theatre.
Multi-framework scoping. A real compliance console has to cover multiple frameworks: HIPAA for health data, SOX for financial controls, PII for privacy, GDPR for EU operations. Most dashboards pick one. Framework-by-framework reporting is what auditors actually review; if the system doesn't support it, the reporting gets rebuilt in shadow spreadsheets and the whole integrity argument collapses.
Fair benchmarking. CIOs increasingly ask whether they picked the right model. That is a legitimate question, and answering it requires running the same queries through multiple models on the same guardrails with the same retrieval — not cherry-picked examples from a vendor deck. Fair benchmarks are work, and most organisations don't do the work.
The approach
The console is implemented in src/app/demos/azure-planning/ with six tabs: Platform Health, Data Health, Compliance, Audit Trail, LLM Operations, and Architecture & Cost. (The directory name is a legacy artefact from an earlier cloud-planning prototype; the demo ships as "Enterprise Audit & Compliance.") The architecture and cost breakdown is rendered in ArchitectureCostTab.tsx.
Platform health (PlatformHealthTab.tsx) monitors the DynamoDB tables, BigQuery datasets, Supabase instance, and Lambda jobs that each demo depends on, with status, uptime, and alarm surfacing. Data health (DataHealthTab.tsx) reports scan completeness and data-quality scores across the underlying tables. These two together are the boring-but-essential base of the pyramid.
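The rollup behind a tab like this can be sketched as a worst-status-wins fold across monitored resources, so a single degraded Lambda is enough to change the tab's headline state. This is an illustrative sketch, not the demo's actual code; the type and field names are assumptions.

```typescript
// Illustrative sketch of a worst-status-wins health rollup.
// Names and statuses are assumptions, not PlatformHealthTab.tsx internals.
type Status = "healthy" | "degraded" | "down";

interface Resource {
  name: string;
  kind: "dynamodb" | "bigquery" | "supabase" | "lambda";
  status: Status;
  uptimePct: number;
}

// Severity ranking: a higher rank always wins the rollup.
const rank: Record<Status, number> = { healthy: 0, degraded: 1, down: 2 };

function rollup(resources: Resource[]): Status {
  return resources.reduce<Status>(
    (worst, r) => (rank[r.status] > rank[worst] ? r.status : worst),
    "healthy"
  );
}

const overall = rollup([
  { name: "audit-log-table", kind: "dynamodb", status: "healthy", uptimePct: 99.99 },
  { name: "nightly-scan", kind: "lambda", status: "degraded", uptimePct: 98.2 },
]);
console.log(overall); // "degraded"
```

The design choice worth noting is that the fold never averages: an average hides a down dependency behind nine healthy ones, which is exactly what a CIO dashboard must not do.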
The compliance tab (ComplianceTab.tsx) scores four frameworks against defensible criteria:
- HIPAA — data classification (PII vs PHI), encryption at rest, encryption in transit, access logging, incident response, and regulatory retention.
- SOX — change control, access control with RLS, append-only audit trail, segregation of duties, incident response.
- PII handling — data minimisation, auto-purge where applicable, encryption, IP anonymisation via SHA-256 hashing, GDPR right-to-erasure support.
- GDPR — DPA in place, consent management, data-subject rights, breach notification, DPO assignment.
The demo computes a score for each framework against its synthetic operating state. Those scores are demo measurements, not client results — they reflect the design of THIS site's infrastructure, not a client environment, and the UI labels them accordingly. A real client deployment would run the same scoring methodology against the client's own environment.
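The per-framework scoring described above reduces to checking a list of named controls and reporting both a percentage and the specific failing control IDs, since the IDs are what an auditor drills into. The sketch below is a hypothetical implementation under that assumption; control names come from the HIPAA list above, but their pass/fail statuses are illustrative, not this site's measurements.

```typescript
// Hypothetical framework scoring: percentage passing, plus failing control ids.
interface Control {
  id: string;
  description: string;
  passing: boolean;
}

interface FrameworkScore {
  framework: string;
  score: number;     // 0-100, rounded
  failing: string[]; // control ids an auditor would drill into
}

function scoreFramework(framework: string, controls: Control[]): FrameworkScore {
  const failing = controls.filter((c) => !c.passing).map((c) => c.id);
  const score = Math.round(
    (100 * (controls.length - failing.length)) / controls.length
  );
  return { framework, score, failing };
}

// HIPAA control set from the list above; statuses are illustrative.
const hipaa = scoreFramework("HIPAA", [
  { id: "data-classification", description: "PII vs PHI classification", passing: true },
  { id: "encryption-at-rest", description: "Encryption at rest", passing: true },
  { id: "encryption-in-transit", description: "Encryption in transit", passing: true },
  { id: "access-logging", description: "Access logging", passing: true },
  { id: "incident-response", description: "Incident response", passing: true },
  { id: "retention", description: "Regulatory retention", passing: false },
]);
console.log(hipaa.score, hipaa.failing); // 83 [ 'retention' ]
```

Returning the failing IDs rather than just the number is what makes the score inspectable; a bare percentage is the kind of figure the "Why this is hard" section calls theatre.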
Audit trail (AuditTrailTab.tsx) is written against five DynamoDB tables: peopleanalytics-bedrock-invocations (every Claude call with input/output tokens and cost), peopleanalytics-guardrail-log (guardrail evaluations and block reasons), peopleanalytics-audit-log (API gateway access), peopleanalytics-incident-log (compliance threshold breaches), and peopleanalytics-demo2-audit (Harry – HR Copilot Q&A cross-reference). Retention is configurable; the demo sets it to seven years for compliance events, aligning with SOX.
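A record in the invocations table above might look like the following sketch. The field names and per-token rates are assumptions for illustration, not the demo's actual schema; the one deliberate choice shown is storing token counts alongside the derived cost, so the trail can be re-priced if vendor rates change.

```typescript
// Hypothetical record shape for a per-invocation audit table.
// Field names and token rates are illustrative assumptions.
interface BedrockInvocationRecord {
  invocationId: string;
  timestamp: string;        // ISO 8601; a natural DynamoDB sort key
  modelId: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;          // derived, so the trail can be re-priced later
  guardrailBlocked: boolean;
}

function buildInvocationRecord(
  invocationId: string,
  modelId: string,
  inputTokens: number,
  outputTokens: number,
  ratePerInputToken = 0.000001,   // illustrative rate
  ratePerOutputToken = 0.000005   // illustrative rate
): BedrockInvocationRecord {
  return {
    invocationId,
    timestamp: new Date().toISOString(),
    modelId,
    inputTokens,
    outputTokens,
    costUsd: inputTokens * ratePerInputToken + outputTokens * ratePerOutputToken,
    guardrailBlocked: false,
  };
}

const rec = buildInvocationRecord("inv-001", "claude-haiku", 1200, 300);
console.log(rec.costUsd); // ≈ 0.0027
```

In an append-only design the write path only ever issues puts with fresh keys; updates and deletes are absent by construction, which is the property the SOX criterion above is checking for.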
LLM operations (LLMOperationsTab.tsx) runs a side-by-side benchmark of Claude Haiku 4.5 against a comparison model on the same retrieval context, the same guardrails, the same queries. The benchmark on this site runs against the demo's synthetic corpus; benchmark numbers are labelled as demo measurements and will differ on a real client corpus. The design point is that the benchmark runs continuously rather than being a one-time model selection — model performance on your retrieval corpus is the only measurement that matters, and it drifts.
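The fairness condition in the paragraph above, same retrieval, same guardrails, same queries, can be sketched as a loop where the guardrail fires once per query before either model runs and both models receive the identical retrieved context. The model callers below are synchronous stubs and the block rule is illustrative; the real console makes async Bedrock calls.

```typescript
// Hypothetical model-vs-model benchmark loop. Real model calls are async;
// synchronous stubs are used here for brevity.
type ModelCall = (query: string, context: string) => string;

interface BenchmarkRow {
  query: string;
  answers: Record<string, string>; // model name -> answer
  blocked: boolean;                // guardrail fired before any model ran
}

// Shared guardrail, applied once per query: the comparison is only fair
// if neither model sees a different policy. The block rule is illustrative.
function guardrail(query: string): boolean {
  return !/ssn|diagnosis/i.test(query);
}

function runBenchmark(
  queries: string[],
  retrieve: (q: string) => string,
  models: Record<string, ModelCall>
): BenchmarkRow[] {
  return queries.map((query) => {
    if (!guardrail(query)) return { query, answers: {}, blocked: true };
    const context = retrieve(query); // identical retrieval for every model
    const answers: Record<string, string> = {};
    for (const [name, call] of Object.entries(models)) {
      answers[name] = call(query, context);
    }
    return { query, answers, blocked: false };
  });
}

const rows = runBenchmark(
  ["pto carryover policy", "look up an employee ssn"],
  (q) => `retrieved:${q}`,
  { haiku: (_q, c) => `haiku(${c})`, other: (_q, c) => `other(${c})` }
);
console.log(rows[1].blocked); // true: blocked before either model ran
```

Blocking before fan-out also keeps the benchmark honest about guardrail behaviour: a query one model "handles better" because it slipped past a per-model guardrail is not a model comparison at all.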
Architecture & cost breaks spend by service and by demo. Current monthly totals for THIS site's demo environment are published in the UI; those are real measurements of this site's operating cost, not proposed figures for a client. Publishing them is the point; cost transparency is itself a compliance artefact.
What the system produces
- A single screen showing the health of every LLM-backed demo on the site, across platform, data, compliance, audit, LLM ops, and cost.
- Framework-by-framework compliance scoring (HIPAA/SOX/PII/GDPR) that auditors can inspect.
- An append-only audit trail of every Claude invocation, every guardrail evaluation, every incident.
- A continuous benchmark of the deployed model against at least one comparison model on the same workload.
- Cost attribution by service and by demo, with the allocation method disclosed.
What the system does not produce
- A SOC (security operations centre). The console is a telemetry and reporting layer, not a security function. Organisations that confuse the two will be disappointed by both.
- A replacement for framework-specific audit expertise. The scoring criteria are defensible defaults; reconciling them with a client's Legal team's interpretation is week-one work.
- A guarantee against novel failure modes. LLM risks are still being characterised; the console is instrumented to surface what we know how to monitor.
Patterns from engagement work
Build the audit-trail tables before the dashboards. Retrofitting audit schema is expensive. Every other tab depends on the trail existing; if it doesn't, those tabs show approximations, and approximations aren't audit-grade.
Align with the client's compliance team on framework scoring in week one. Sensible defaults exist; the client's Legal team usually has their own interpretation of SOX scoring, and the reconciliation takes longer than the build. Putting that conversation at the front avoids rework.
Instrument the benchmark console with real invocations from day one, even at a low sampling rate. Mock benchmark data is fine for a demo. It is not fine for a CIO making a model-selection decision. Sampled real invocations through both models on live traffic is the defensible version.
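One low-risk way to implement that sampling is a deterministic hash of the invocation ID: the sample/skip decision is reproducible from the audit trail rather than depending on a random draw that can't be replayed. The hash and the default rate below are illustrative assumptions, not the demo's mechanism.

```typescript
// Hypothetical deterministic sampler: hash the invocation id to [0, 1)
// and compare against the sampling rate. Reproducible from the audit trail.
function hashToUnit(id: string): number {
  let h = 0;
  for (const ch of id) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit hash
  return h / 0xffffffff;
}

// Default 5% rate is illustrative; tune to traffic volume and budget.
function sampleForBenchmark(invocationId: string, rate = 0.05): boolean {
  return hashToUnit(invocationId) < rate;
}
```

Because the decision is a pure function of the ID, an auditor can verify after the fact exactly which invocations were (and were not) routed through the comparison model.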
Cost attribution is a hybrid exercise. DynamoDB and BigQuery spend is hard to cleanly allocate across demos because tables are shared. Direct attribution where possible, proportional allocation where not, and disclose the method in the UI. An allocation without the disclosure is indefensible in audit.
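The hybrid described above can be sketched as direct metering plus a proportional-by-invocation split of shared spend, with the method string carried alongside every number so the UI can disclose it. Demo names and figures below are illustrative, not this site's costs.

```typescript
// Hypothetical hybrid cost attribution: direct spend where metering allows,
// proportional allocation of shared-table spend by invocation share.
// All names and dollar figures are illustrative.
interface Attribution {
  demo: string;
  directUsd: number;
  allocatedUsd: number;
  method: string; // disclosed in the UI next to the number
}

function attribute(
  direct: Record<string, number>,      // demo -> directly metered spend
  sharedUsd: number,                   // spend on shared tables
  invocations: Record<string, number>  // demo -> invocation count
): Attribution[] {
  const total = Object.values(invocations).reduce((a, b) => a + b, 0);
  return Object.keys(direct).map((demo) => ({
    demo,
    directUsd: direct[demo],
    allocatedUsd: sharedUsd * (invocations[demo] / total),
    method: "direct metering + proportional-by-invocation share of shared spend",
  }));
}

const out = attribute(
  { "hr-copilot": 12, "attrition": 8 }, // direct monthly spend (illustrative)
  10,                                   // shared DynamoDB/BigQuery spend
  { "hr-copilot": 750, "attrition": 250 }
);
console.log(out[0].allocatedUsd); // 7.5
```

Keeping `directUsd` and `allocatedUsd` as separate fields, rather than summing them before display, is what lets an auditor see how much of each figure is measured versus modelled.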
Where this applies
This pattern works for organisations deploying at least two LLM-backed HR systems into regulated environments — financial services, healthcare, public sector, critical infrastructure — where audit and governance are actual constraints, not checkbox exercises. It works for organisations already operating under SOX or HIPAA; the console meets them where they are.
It does not work for organisations running a single LLM application with no compliance obligation; the overhead is disproportionate to the risk. It also does not work as a replacement for a security function.