
Proving the AI systems touching employee data are actually sound

This whitepaper describes methodology and patterns from PeopleAnalytics.AI's engagement work on AI-governance consoles for regulated environments. Engagement details are anonymised by design; no specific client outcomes are claimed. Compliance-score and cost figures in this document are measurements from THIS site's demo against its own synthetic operating state — they are labelled as such and are not client results.


Summary

Deploying LLM-backed HR systems into a regulated enterprise introduces a governance problem classical infrastructure doesn't have. Model behaviour is not deterministic. Guardrails can fail in ways that classical access controls cannot. Vendor dashboards report on the vendor's own model, not on how it behaves under the organisation's retrieval layer, policy library, and employee data. This whitepaper describes the methodology for a single console covering platform health, data quality, HIPAA/SOX/PII/GDPR posture, audit-event logs, LLM operations, and a live model-vs-model benchmark.


The problem

The CHRO wants to deploy an attrition model, a policy assistant, and an executive rollup. The CIO has a different set of questions: how do we know the model isn't leaking PHI into logs? If Legal subpoenas us for a specific HR decision made six months ago, can we reconstruct the conversation that produced it? What happens when the vendor changes the model version? What do we pay per query, and is that budget line predictable?

HR's people analytics team often has good answers for each question individually. Nobody has put them in one place. When the CIO asks for the health of the whole stack on one screen, the answer tends to be a combination of three dashboards, one spreadsheet, and an email thread with the vendor. That's the gap this pattern fills.

The underlying problem is specific to AI. You can run classical infrastructure for decades without a compliance dashboard and be fine, because the behaviour is deterministic and the risks are well understood. LLM-backed systems are not deterministic and the risks are not all well understood. A model that hallucinates a policy citation, a guardrail that lets a PHI request through, a prompt-injection attack via an employee field — these are risks that traditional IT governance was not designed to monitor.

Why this is hard

Vendors report on vendors. Anthropic reports Anthropic. OpenAI reports OpenAI. Neither reports on how your system — your retrieval layer, your policy library, your employee data — behaves under their model. A governance dashboard that isn't built on your own telemetry is theatre.

Multi-framework scoping. A real compliance console has to cover multiple frameworks: HIPAA for health data, SOX for financial controls, PII for privacy, GDPR for EU operations. Most dashboards pick one. Framework-by-framework reporting is what auditors actually review; if the system doesn't support it, it gets rebuilt in shadow spreadsheets and the whole integrity argument collapses.

Fair benchmarking. CIOs increasingly ask whether they picked the right model. That is a legitimate question, and answering it requires running the same queries through multiple models on the same guardrails with the same retrieval, with the same security gate applied to every input — not cherry-picked examples from a vendor deck. It also requires breadth across genuinely different ecosystems (different SDKs, different auth, different pricing shapes), not three flavours of the same vendor's portfolio. Fair benchmarks are work, and most organisations don't do the work.

The approach

The console is implemented in src/app/demos/azure-planning/ with six tabs: Platform Health, Data Health, Compliance, Audit Trail, LLM Operations, and Architecture & Cost. (The directory name is a legacy artefact from an earlier cloud-planning prototype; the demo ships as "Enterprise Audit & Compliance.") The architecture and cost breakdown is rendered in ArchitectureCostTab.tsx.

Platform health (PlatformHealthTab.tsx) monitors the DynamoDB tables and BigQuery datasets that each demo depends on, with status, uptime, and alarm surfacing. Data health (DataHealthTab.tsx) reports scan completeness and data-quality scores across the underlying tables. These two together are the boring-but-essential base of the pyramid.

The compliance tab (ComplianceTab.tsx) scores four frameworks against defensible criteria:

  • HIPAA — data classification (PII vs PHI), encryption at rest, encryption in transit, access logging, incident response, and regulatory retention.
  • SOX — change control, access control with RLS, append-only audit trail, segregation of duties, incident response.
  • PII handling — data minimisation, auto-purge where applicable, encryption, IP anonymisation via SHA-256 hashing, GDPR right-to-erasure support.
  • GDPR — DPA in place, consent management, data-subject rights, breach notification, DPO assignment.

The demo computes a score for each framework against its synthetic operating state. Those scores are demo measurements, not client results — they reflect the design of THIS site's infrastructure, not a client environment, and the UI labels them accordingly. A real client deployment would run the same scoring methodology against the client's own environment.
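A minimal sketch of how a per-framework score could be derived from criterion-level results. It assumes each criterion is marked met, partial, or unmet and that a partial criterion earns half credit; the type names, the half-credit weighting, and the sample states are illustrative assumptions, not the logic in ComplianceTab.tsx.

```typescript
// Hypothetical types: the real ComplianceTab.tsx data model is not shown in this paper.
type CriterionState = "met" | "partial" | "unmet";

interface Criterion {
  label: string;
  state: CriterionState;
}

interface FrameworkScore {
  framework: string;
  percent: number; // 0-100, rounded to the nearest integer
  gaps: string[]; // criteria that are not fully met
}

// Assumption: "partial" earns half credit. A client's Legal team may weight this differently.
const CREDIT: Record<CriterionState, number> = { met: 1, partial: 0.5, unmet: 0 };

function scoreFramework(framework: string, criteria: Criterion[]): FrameworkScore {
  if (criteria.length === 0) {
    throw new Error(`No criteria defined for ${framework}`);
  }
  const earned = criteria.reduce((sum, c) => sum + CREDIT[c.state], 0);
  return {
    framework,
    percent: Math.round((earned / criteria.length) * 100),
    gaps: criteria.filter((c) => c.state !== "met").map((c) => c.label),
  };
}

// Synthetic states applied to the SOX criteria named above.
const soxExample = scoreFramework("SOX", [
  { label: "change control", state: "met" },
  { label: "access control with RLS", state: "met" },
  { label: "append-only audit trail", state: "met" },
  { label: "segregation of duties", state: "partial" },
  { label: "incident response", state: "unmet" },
]);
// soxExample.percent is 70; soxExample.gaps lists the two criteria that are not fully met.
```

Returning the gaps alongside the percentage keeps the number inspectable: an auditor can see which criteria pulled the score down without opening a second view.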

Below the scoring header, the compliance tab carries three structural artefacts that an auditor would actually open before reviewing scores:

  • A governance taxonomy (DataDictionaryPanel) defining nine data domains, five classification tiers (phi / pii / confidential / internal / public) with handling rules per tier, six lifecycle stages from collected through purged, and a role × domain access matrix that visualises what RBAC enforces. Source of truth: src/lib/governance/taxonomy.ts.
  • A field-level data catalogue (DataCatalogPanel) enumerating every field that the demos read — description, classification, compliance tags, retention pointer, FK relationships. Test coverage enforces that adding a typed field without a catalogue entry fails CI. Source of truth: src/lib/meridian/data-catalog.ts.
  • A framework report generator (ComplianceReportsPanel) producing on-demand reports for five frameworks: SOX IT General Controls, SOC 2 Type II Trust Services Criteria (AICPA TSP §100, 2017 revision), HIPAA Security Rule §164.312 Technical Safeguards, GDPR Articles 5/15/17/20/30/32/33, and ISO/IEC 27001:2022 Annex A. Each report renders coverage statistics (in-scope controls implemented / partial / not-applicable), the per-control objective, the evidence the platform witnesses through code or audit log, and cross-references to platform-native ITGC ids where applicable. Coverage against the synthetic demo state is SOX 100%, SOC 2 82%, HIPAA 86%, GDPR 50%, ISO 27001 89% — these are demo measurements of the artefacts the platform itself produces, not assertions about any client environment, and partial controls always carry a stateNote explaining what the platform witnesses versus what the deploying organisation owns. The same authored framework definitions back a CLI (scripts/generate-compliance-reports.ts) that writes versioned markdown + JSON under reports/<framework>/, reproducible from any commit hash.
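The catalogue-completeness rule in the second item above can be sketched in two layers: a compile-time mapping that the type checker rejects when a field has no entry, and a runtime check for untyped rows. The record fields, entry shape, and function name are hypothetical; the real data-catalog.ts and its test suite are not reproduced here, and the source enforces the rule through test coverage rather than necessarily through the type system.

```typescript
// Hypothetical shapes for illustration only.
type Classification = "phi" | "pii" | "confidential" | "internal" | "public";

interface CatalogEntry {
  description: string;
  classification: Classification;
  complianceTags: string[];
}

// An illustrative typed record that a demo might read.
interface EmployeeRecord {
  employeeId: string;
  salaryBand: string;
  leaveReason: string;
}

// Record<keyof EmployeeRecord, ...> makes the compiler reject a missing key,
// so adding a field to EmployeeRecord without a catalogue entry fails the type check.
const employeeCatalog: Record<keyof EmployeeRecord, CatalogEntry> = {
  employeeId: { description: "Stable employee identifier", classification: "pii", complianceTags: ["GDPR"] },
  salaryBand: { description: "Compensation band", classification: "confidential", complianceTags: ["SOX"] },
  leaveReason: { description: "Reason for leave; may reveal health data", classification: "phi", complianceTags: ["HIPAA", "GDPR"] },
};

// Runtime companion for rows that arrive untyped (for example, from a table scan):
// returns the field names that have no catalogue entry.
function uncataloguedFields(
  row: Record<string, unknown>,
  catalog: Record<string, CatalogEntry>,
): string[] {
  return Object.keys(row).filter((field) => !(field in catalog));
}

const missingFields = uncataloguedFields(
  { employeeId: "E-1", salaryBand: "B3", badgePhoto: "photo.jpg" },
  employeeCatalog,
);
// missingFields is ["badgePhoto"]; a CI test would fail when this list is non-empty.
```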

These structural artefacts are the answer to "show me your governance documentation" before the conversation moves to "show me your scores." Auditors want the taxonomy and ROPA-style records (GDPR Art. 30) before the dashboard percentages; the structural layer makes that conversation possible without slides.

Audit trail (AuditTrailTab.tsx) is written against five DynamoDB tables: peopleanalytics-bedrock-invocations (every Claude call with input/output tokens and cost), peopleanalytics-guardrail-log (guardrail evaluations and block reasons), peopleanalytics-audit-log (API gateway access), peopleanalytics-incident-log (compliance threshold breaches), and peopleanalytics-demo2-audit (HR Copilot Q&A cross-reference). Retention is configurable; the demo sets it to seven years for compliance events, aligning with SOX.
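A minimal sketch of what one invocation record might carry, assuming cost is derived from token counts and a per-million-token price, and that the retention deadline is stamped at write time. The attribute names and the prices are placeholders; the real table schemas and vendor prices are not given in this paper.

```typescript
// Hypothetical record shape; the real DynamoDB attribute names are not given in this paper.
interface InvocationAuditRecord {
  invocationId: string;
  modelId: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  recordedAt: string; // ISO 8601
  retainUntil: string; // ISO 8601; before this date the record is not eligible for purge review
}

// Placeholder prices per million tokens. These are NOT real vendor prices.
interface TokenPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

const RETENTION_YEARS = 7; // the demo's stated setting for compliance events

function buildInvocationRecord(
  fields: { invocationId: string; modelId: string; inputTokens: number; outputTokens: number },
  pricing: TokenPricing,
  now: Date,
): Readonly<InvocationAuditRecord> {
  const rawCost =
    (fields.inputTokens / 1e6) * pricing.inputPerMillionUsd +
    (fields.outputTokens / 1e6) * pricing.outputPerMillionUsd;
  const retainUntil = new Date(now.getTime());
  retainUntil.setUTCFullYear(retainUntil.getUTCFullYear() + RETENTION_YEARS);
  // Object.freeze only signals append-only intent in memory;
  // the data store's own permissions must enforce it for real.
  return Object.freeze({
    ...fields,
    costUsd: Number(rawCost.toFixed(6)),
    recordedAt: now.toISOString(),
    retainUntil: retainUntil.toISOString(),
  });
}

const sampleRecord = buildInvocationRecord(
  { invocationId: "inv-001", modelId: "example-model", inputTokens: 2000, outputTokens: 500 },
  { inputPerMillionUsd: 1, outputPerMillionUsd: 4 },
  new Date("2025-03-01T12:00:00.000Z"),
);
// sampleRecord.costUsd is 0.004 and sampleRecord.retainUntil is "2032-03-01T12:00:00.000Z".
```

Stamping retainUntil when the record is written, rather than computing it at query time, means a later change to the retention setting does not silently rewrite the deadline of records already in the trail.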

LLM operations (LLMOperationsTab.tsx) runs a side-by-side benchmark of AWS Bedrock Claude Haiku 4.5 (the production runtime) against the OpenAI direct API (GPT-4o-mini default; GPT-4o opt-in) and Google Gemini (2.5 Flash) on the same retrieval context, the same guardrails, the same queries. The three providers cross genuinely different ecosystems — different SDKs (@aws-sdk/client-bedrock-runtime, openai, fetch against generativelanguage.googleapis.com), different auth models (IAM, API key, query-string key), different pricing shapes — which is what fair benchmarking actually requires. Every prompt passes the same OWASP content-injection gate (src/lib/owasp.ts) before it reaches any provider, so the security posture is provider-agnostic. The benchmark on this site runs against the demo's synthetic corpus; benchmark numbers are labelled as demo measurements and will differ on a real client corpus. The design point is that the benchmark runs continuously rather than being a one-time model selection — model performance on your retrieval corpus is the only measurement that matters, and it drifts.
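The fan-out described above can be sketched as follows, assuming a small provider interface and a gate that runs once before any provider is called. The interface, the stub providers, and the single-pattern gate are illustrative stand-ins; they are not the contents of src/lib/llm/ or src/lib/owasp.ts, and a real gate would check far more than one pattern.

```typescript
// Hypothetical interfaces for illustration only.
interface ProviderReply {
  text: string;
  inputTokens: number;
  outputTokens: number;
}

interface LlmProvider {
  id: string;
  complete(prompt: string, context: string): Promise<ProviderReply>;
}

interface BenchmarkRow {
  providerId: string;
  ok: boolean;
  latencyMs: number;
  reply?: ProviderReply;
  error?: string;
}

// Stand-in for the content-injection gate: a deliberately naive pattern check.
function passesInjectionGate(prompt: string): boolean {
  return !/ignore (all )?previous instructions/i.test(prompt);
}

async function runBenchmark(
  providers: LlmProvider[],
  prompt: string,
  context: string,
): Promise<BenchmarkRow[]> {
  // Gate once, before any provider sees the prompt: every provider gets the same input or none does.
  if (!passesInjectionGate(prompt)) {
    throw new Error("Prompt rejected by injection gate");
  }
  return Promise.all(
    providers.map(async (provider): Promise<BenchmarkRow> => {
      const started = Date.now();
      try {
        const reply = await provider.complete(prompt, context);
        return { providerId: provider.id, ok: true, latencyMs: Date.now() - started, reply };
      } catch (err) {
        // One provider failing must not sink the comparison for the others.
        const message = err instanceof Error ? err.message : String(err);
        return { providerId: provider.id, ok: false, latencyMs: Date.now() - started, error: message };
      }
    }),
  );
}

// Stub providers; real adapters would wrap each vendor's SDK or HTTP API.
const stubProviders: LlmProvider[] = [
  { id: "stub-a", complete: async (p) => ({ text: `A:${p}`, inputTokens: 10, outputTokens: 5 }) },
  { id: "stub-b", complete: async () => { throw new Error("rate limited"); } },
];
```

Calling runBenchmark(stubProviders, "How many leave days do I have?", "policy text") resolves to one row per provider, with the failing stub recorded as an error row instead of rejecting the whole run.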

Architecture & cost breaks down spend by service and by demo. Current monthly totals for THIS site's demo environment are published in the UI; those are real measurements of this site's operating cost, not proposed figures for a client. Publishing them is the point; cost transparency is itself a compliance artefact.

The 2026 regulatory surface

HIPAA, SOX, PII, and GDPR are the foundational frameworks the console scores against. They are necessary but no longer sufficient. A 2026 enterprise audit of an HR-AI system has to also surface the AI-specific laws that took effect between 2023 and 2026. The console should be scored against each one the client operates under; the methodology below is regulator-side, not client-side — the scoring rubric is a function of what the law requires, not what any one client claims.

  • NYC Local Law 144 (Automated Employment Decision Tools) — any employer using an automated tool to make hiring or promotion decisions about candidates in New York City must have had the tool independently bias-audited within the prior year, publish a summary of the results, and notify candidates. The console must surface the latest audit date, the auditor's identity, and the impact ratios per protected class (sex, race/ethnicity, intersectional). The attrition demo's fairness panel and the model card described in the methodology notes are the input to this audit, not a substitute for it.

  • Colorado AI Act (SB24-205) — applies to "high-risk AI systems" used in consequential decisions (employment, compensation, performance). Requires an annual impact assessment, documentation of training-data provenance, and disclosure to the consumer when the system was a "substantial factor" in the decision. The effective date has moved through amendment activity since the statute's original 2026 target; the Colorado Attorney General's implementation guidance is the authoritative reference on timing and scope. The console must produce the impact-assessment artefact on demand and log the substantial-factor disclosure events.

  • Illinois AI Video Interview Act — employers using AI analysis of video interviews must disclose use, obtain consent, and explain the characteristics the AI evaluates; employers that rely solely on AI analysis to decide who advances to an in-person interview must also report demographic data annually to the Illinois Department of Commerce and Economic Opportunity. The console must log the consent event alongside the invocation record.

  • EU AI Act (Regulation 2024/1689) — HR use cases (recruitment, performance evaluation, task allocation, monitoring) are high-risk under Annex III. Obligations include a risk-management system, data-governance documentation, technical documentation, automatic logging of events, human oversight, and conformity assessment. The Act phases in: it entered into force in August 2024; the prohibited-practice rules applied from February 2025; the high-risk Annex-III obligations that actually bind HR deployments apply from August 2026, with conformity-assessment infrastructure expected in place by then. The console's audit-trail tables satisfy the logging obligation; the compliance tab's framework scoring needs a dedicated EU AI Act row showing coverage of each Annex-III obligation.

  • EU Pay Transparency Directive (2023/970) — member-state transposition deadline was 7 June 2026, so the binding form for each employer is the national law transposing the directive in the jurisdictions where they operate. Reporting obligations are staged by headcount: employers of 250+ report annually from transposition; 150–249 every three years from transposition; 100–149 every three years from 2031; under 100 are not subject to mandatory reporting under the directive (national law may layer on additional obligations). Where a reported gender pay gap exceeds 5% within a category of workers and cannot be objectively justified, the employer must conduct a joint pay assessment with worker representatives. The executive overview's pay-equity view is the input; the console surfaces the reporting artefact and the assessment-trigger events.
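The headcount staging and the 5% trigger in the Pay Transparency item above reduce to two small functions. This is a sketch under stated assumptions: the bands are taken as summarised in this paper, the gap is expressed as a percentage of men's mean pay, gaps in either direction are treated alike, and the type and function names are hypothetical. The transposing national law governs the exact boundary and any additional conditions.

```typescript
type ReportingCadence =
  | { required: false }
  | { required: true; everyYears: 1 | 3; startsFrom: "transposition" | "2031" };

// Headcount bands as summarised above; national law may add obligations.
function reportingCadence(headcount: number): ReportingCadence {
  if (headcount >= 250) return { required: true, everyYears: 1, startsFrom: "transposition" };
  if (headcount >= 150) return { required: true, everyYears: 3, startsFrom: "transposition" };
  if (headcount >= 100) return { required: true, everyYears: 3, startsFrom: "2031" };
  return { required: false };
}

interface CategoryPay {
  category: string;
  meanPayMen: number;
  meanPayWomen: number;
  objectivelyJustified: boolean;
}

// Gap as a percentage of men's mean pay (assumed convention for this sketch).
function payGapPercent(c: CategoryPay): number {
  return ((c.meanPayMen - c.meanPayWomen) / c.meanPayMen) * 100;
}

// Returns the worker categories that would raise an assessment-trigger event.
// Assumption: a gap in either direction counts, and the boundary is exclusive ("exceeds").
function jointAssessmentTriggers(categories: CategoryPay[], thresholdPercent = 5): string[] {
  return categories
    .filter((c) => Math.abs(payGapPercent(c)) > thresholdPercent && !c.objectivelyJustified)
    .map((c) => c.category);
}

// Synthetic figures for illustration.
const triggered = jointAssessmentTriggers([
  { category: "engineering", meanPayMen: 100000, meanPayWomen: 93000, objectivelyJustified: false },
  { category: "support", meanPayMen: 50000, meanPayWomen: 49000, objectivelyJustified: false },
  { category: "sales", meanPayMen: 80000, meanPayWomen: 70000, objectivelyJustified: true },
]);
// triggered is ["engineering"]: support is under the threshold and the sales gap is justified.
```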

Scoring each of these follows the same methodology as HIPAA/SOX/PII/GDPR: defensible criteria per obligation, scored against the client's own operating state, reconciled with the client's Legal team in week one. The compliance tab's design accommodates adding a framework row; the audit-trail schema already captures the events each of these laws requires.
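Of the obligations listed above, the Local Law 144 impact ratio is the most mechanical to compute, so it makes a useful worked example. The sketch assumes a selection-based tool, where the ratio is a category's selection rate divided by the selection rate of the most-selected category. The type and function names are hypothetical, the counts are synthetic, and the output is an input to the independent audit, not a replacement for it.

```typescript
interface CategoryOutcome {
  category: string; // a sex, race/ethnicity, or intersectional category
  applicants: number;
  selected: number;
}

interface ImpactRatioRow {
  category: string;
  selectionRate: number;
  impactRatio: number;
}

function impactRatios(outcomes: CategoryOutcome[]): ImpactRatioRow[] {
  // Categories with no applicants have no defined selection rate and are left out.
  const rates = outcomes
    .filter((o) => o.applicants > 0)
    .map((o) => ({ category: o.category, selectionRate: o.selected / o.applicants }));
  const highest = Math.max(...rates.map((r) => r.selectionRate));
  if (!(highest > 0)) {
    throw new Error("No selections recorded; impact ratios are undefined");
  }
  return rates.map((r) => ({ ...r, impactRatio: r.selectionRate / highest }));
}

// Synthetic counts for illustration.
const ratioRows = impactRatios([
  { category: "group A", applicants: 100, selected: 50 },
  { category: "group B", applicants: 80, selected: 30 },
]);
// group A: rate 0.5, ratio 1; group B: rate 0.375, ratio 0.75.
```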

What the system produces

  • A single screen showing the health of every LLM-backed demo on the site, across platform, data, compliance, audit, LLM ops, and cost.
  • Framework-by-framework compliance scoring (HIPAA/SOX/PII/GDPR) that auditors can inspect.
  • On-demand framework reports — SOX IT General Controls, SOC 2 Type II Trust Services Criteria, HIPAA §164.312, GDPR Articles 5/15/17/20/30/32/33, and ISO/IEC 27001:2022 Annex A — generated from authored control definitions, downloadable as markdown, reproducible from any commit hash. Partial controls carry explicit stateNotes splitting platform-witnessed evidence from deploying-organisation responsibilities.
  • A governance taxonomy (data domains, classification tiers, lifecycle stages, role × domain access matrix) and a field-level data catalogue — the structural documentation auditors want to read before the scoring conversation starts.
  • A human-in-the-loop purge-review queue for every record past its retention window. Records are surfaced with PII-redacted summaries, legal basis (EEOC/FLSA/HIPAA/GDPR/SOX/Contract), and days overdue. A compliance reviewer approves each purge with a documented reason that writes to the audit log; no record is auto-deleted. This is the primary illustration on the platform of human oversight in the style of Article 14 of the EU AI Act, and the backing policy lives in knowledge/data-retention-policy.md. The queue is the system of record for GDPR Art. 17 right-to-erasure handling and the evidence artefact for HIPAA §164.530 and SOX §802 retention obligations.
  • An append-only audit trail of every Claude invocation, every guardrail evaluation, every incident, and every purge-review decision.
  • A continuous benchmark of the deployed model (Bedrock Claude Haiku) against at least two cross-ecosystem comparison models (OpenAI GPT-4o-mini, Google Gemini 2.5 Flash) on the same workload, behind the same content-injection gate.
  • Cost attribution by service and by demo, with the allocation method disclosed.
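The purge-review queue in the list above can be sketched as two steps: surface overdue records, then record a reviewer's decision before anything is deleted. The record shape, function names, and in-memory audit log are hypothetical; the actual queue and its storage are not reproduced in this paper.

```typescript
type LegalBasis = "EEOC" | "FLSA" | "HIPAA" | "GDPR" | "SOX" | "Contract";

interface RetainedRecord {
  recordId: string;
  redactedSummary: string; // PII already removed upstream
  legalBasis: LegalBasis;
  retainUntil: Date;
}

interface PurgeQueueItem extends RetainedRecord {
  daysOverdue: number;
}

interface PurgeDecision {
  recordId: string;
  reviewer: string;
  reason: string;
  decidedAt: string; // ISO 8601
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Surfaces records past their retention window, most overdue first. Nothing is deleted here.
function buildPurgeQueue(records: RetainedRecord[], now: Date): PurgeQueueItem[] {
  return records
    .filter((r) => r.retainUntil.getTime() < now.getTime())
    .map((r) => ({
      ...r,
      daysOverdue: Math.floor((now.getTime() - r.retainUntil.getTime()) / MS_PER_DAY),
    }))
    .sort((a, b) => b.daysOverdue - a.daysOverdue);
}

// Records the reviewer's decision; the deletion itself would run elsewhere, after the log write.
function approvePurge(
  item: PurgeQueueItem,
  reviewer: string,
  reason: string,
  auditLog: PurgeDecision[],
  now: Date,
): PurgeDecision {
  if (reason.trim().length === 0) {
    throw new Error("A documented reason is required to approve a purge");
  }
  const decision: PurgeDecision = {
    recordId: item.recordId,
    reviewer,
    reason,
    decidedAt: now.toISOString(),
  };
  auditLog.push(decision);
  return decision;
}
```

Writing the decision before the deletion means a failed delete leaves an approved-but-pending entry to retry, whereas the reverse order could leave a deletion with no recorded reason.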

What the system does not produce:

  • A SOC. The console is a telemetry and reporting layer, not a security function. Organisations that confuse the two will be disappointed by both.
  • A replacement for framework-specific audit expertise. The scoring criteria are defensible defaults; reconciling them with a client's Legal team's interpretation is week-one work.
  • A guarantee against novel failure modes. LLM risks are still being characterised; the console is instrumented to surface what we know how to monitor.

Patterns from engagement work

Build the audit-trail tables before the dashboards. Retrofitting audit schema is expensive. Every tab above Audit Trail depends on the trail existing; if it doesn't, the tabs show approximations, and approximations aren't audit-grade.

Align with the client's compliance team on framework scoring in week one. Sensible defaults exist; the client's Legal team usually has their own interpretation of SOX scoring, and the reconciliation takes longer than the build. Putting that conversation at the front avoids rework.

Instrument the benchmark console with real invocations from day one, even at a low sampling rate. Mock benchmark data is fine for a demo. It is not fine for a CIO making a model-selection decision. Sampling real invocations from live traffic and running them through every comparison model is the defensible version. The provider abstraction (src/lib/llm/) reduces adding a fourth or fifth comparison provider to one adapter file plus one entry in the fan-out array; the persistence schema, cost tracking, and OWASP gate stay provider-agnostic.
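One way to sample live traffic at a low rate is to hash the request identifier and compare the result against the rate, so the same request always gets the same decision across retries and replays. The sketch below uses a 32-bit FNV-1a hash over UTF-16 code units as a swapped-in technique; the function names are hypothetical and nothing here is taken from the site's implementation.

```typescript
// 32-bit FNV-1a over UTF-16 code units (a simplification of the byte-oriented original).
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Deterministic sampling: maps the hash into [0, 1) and keeps requests below the rate.
function shouldSample(requestId: string, rate: number): boolean {
  if (rate < 0 || rate > 1) {
    throw new RangeError("rate must be between 0 and 1");
  }
  return fnv1a32(requestId) / 0x100000000 < rate;
}

// Example: decide whether a request also goes through the comparison models.
const sampledForBenchmark = shouldSample("request-12345", 0.05);
```

Hash-based sampling is chosen over a random draw here because the decision can be reproduced from the audit trail alone, which matters when someone later asks why a given request was or was not benchmarked.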

Cost attribution is a hybrid exercise. DynamoDB and BigQuery spend is hard to allocate cleanly across demos because tables are shared. Attribute directly where possible, allocate proportionally where not, and disclose the method in the UI. An allocation without the disclosure is indefensible in audit.
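A minimal sketch of the hybrid allocation, assuming each service reports a total, the portion traceable to individual demos, and a usage key for splitting the shared remainder. The field names, the usage key, and the dollar figures are illustrative; each output line carries its method so the UI can disclose it.

```typescript
interface ServiceSpend {
  service: string;
  totalUsd: number;
  directUsd: Record<string, number>; // spend traceable to a single demo
  usageShare: Record<string, number>; // allocation key for the shared remainder, e.g. read units per demo
}

interface AllocationLine {
  service: string;
  demo: string;
  usd: number;
  method: "direct" | "proportional";
}

function allocateSpend(spend: ServiceSpend): AllocationLine[] {
  const lines: AllocationLine[] = [];
  let directTotal = 0;
  for (const [demo, usd] of Object.entries(spend.directUsd)) {
    lines.push({ service: spend.service, demo, usd, method: "direct" });
    directTotal += usd;
  }
  const shared = spend.totalUsd - directTotal;
  if (shared < 0) {
    throw new Error(`Direct spend exceeds total for ${spend.service}`);
  }
  if (shared > 0) {
    const shareTotal = Object.values(spend.usageShare).reduce((a, b) => a + b, 0);
    if (shareTotal <= 0) {
      throw new Error(`No allocation key for shared spend on ${spend.service}`);
    }
    for (const [demo, share] of Object.entries(spend.usageShare)) {
      lines.push({
        service: spend.service,
        demo,
        usd: (shared * share) / shareTotal,
        method: "proportional",
      });
    }
  }
  return lines;
}

// Synthetic figures: 100 USD total, 40 USD traceable to one demo, remainder split 1:3 by usage.
const allocationLines = allocateSpend({
  service: "DynamoDB",
  totalUsd: 100,
  directUsd: { attrition: 40 },
  usageShare: { attrition: 1, copilot: 3 },
});
// Lines: attrition 40 (direct), attrition 15 (proportional), copilot 45 (proportional).
```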

Where this applies

This pattern works for organisations deploying at least two LLM-backed HR systems into regulated environments — financial services, healthcare, public sector, critical infrastructure — where audit and governance are actual constraints, not checkbox exercises. It works for organisations already operating under SOX or HIPAA; the console meets them where they are.

It does not work for organisations running a single LLM application with no compliance obligation; the overhead is disproportionate to the risk. It also does not work as a replacement for a security function.