Governance is the continuous practice of monitoring, evaluating, iterating, and re-evaluating clinical AI systems throughout their operational lifecycle. Unlike point-in-time evaluation, it requires architectural choices made at design time, not compliance activities added after deployment.
We integrate four evaluation dimensions within a single continuous system. While governance frameworks have been proposed in the literature, no published work has demonstrated operational governance of a deployed clinical AI agent with empirical evidence.
Case-specific rubrics authored by expert clinicians encode what correct documentation should contain for each clinical encounter.
Real-use failure detection during deployment. Clinicians report issues through in-workflow mechanisms during live patient encounters.
Latency, failure rates, and reliability are monitored through a tiered logging architecture with per-stage attribution.
Economic sustainability of governance. Token attribution, compute costs, and clinician time are tracked across every governance activity.
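The monitoring and cost dimensions above can be sketched together. This is a minimal illustration, not the deployed architecture: the names `StageRecord`, `EncounterLog`, and the flat per-encounter list are assumptions, standing in for the tiered logging with per-stage latency and token attribution the text describes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    """Telemetry for one pipeline stage: latency, outcome, token usage."""
    stage: str
    latency_s: float
    ok: bool
    tokens: int


@dataclass
class EncounterLog:
    """Per-encounter log attributing cost and failures to specific stages."""
    encounter_id: str
    stages: list = field(default_factory=list)

    def record(self, stage, fn, tokens=0):
        """Run one pipeline stage, attributing latency, success, and tokens to it."""
        start = time.perf_counter()
        try:
            result, ok = fn(), True
        except Exception:
            result, ok = None, False
        self.stages.append(StageRecord(stage, time.perf_counter() - start, ok, tokens))
        return result

    def cost(self, usd_per_1k_tokens):
        """Total compute cost attributed to this encounter, in USD."""
        return sum(s.tokens for s in self.stages) / 1000 * usd_per_1k_tokens
```

Per-stage attribution means a failure or latency spike is traceable to a named stage rather than to the pipeline as a whole, which is what makes the failure-detection dimension actionable.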
These dimensions are connected by controlled experimentation that gates every engineering change. Candidate system versions are tested against the full benchmark before deployment. No change ships without quantitative evidence.
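A gate of this kind can be sketched as a simple pre-deployment check. The function name `gate` and the mean-score comparison are illustrative assumptions; the text specifies only that candidates are scored on the full benchmark and that no change ships without quantitative evidence.

```python
def gate(candidate_scores, baseline_scores, min_margin=0.0):
    """Ship a candidate only if its mean rubric score on the full
    benchmark is at least the current version's, plus an optional margin.

    Both arguments map case id -> rubric score; they must cover the
    same cases so the comparison is controlled.
    """
    assert set(candidate_scores) == set(baseline_scores), "must cover the full benchmark"
    cand = sum(candidate_scores.values()) / len(candidate_scores)
    base = sum(baseline_scores.values()) / len(baseline_scores)
    return cand >= base + min_margin
```

Requiring identical case coverage is what makes the experiment controlled: both versions are evaluated on the same encounters under the same rubrics.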
Governability is an architectural property: the degree to which a system's design enables governance. These design choices, made during system development, determine whether governance is tractable.
Typed, schema-defined objects at every pipeline stage enable automated validation and cross-version comparison.
Inspectable reasoning layers enable failure attribution to specific pipeline stages.
Agent outputs are constrained to predefined clinical actions validated against the patient chart.
Case-specific rubrics produce quantitative scores enabling controlled version comparison.
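A case-specific rubric as a typed, schema-defined object might look like the following sketch. The `RubricItem` type, the weights, and the example criteria are hypothetical; the point is that clinician-authored criteria reduce to a quantitative score comparable across system versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One clinician-authored criterion: what correct documentation
    for this encounter must contain, and how heavily it weighs."""
    criterion: str
    weight: float


def score_note(items, satisfied):
    """Weighted fraction of rubric criteria the generated note satisfies."""
    total = sum(i.weight for i in items)
    met = sum(i.weight for i in items if i.criterion in satisfied)
    return met / total if total else 0.0


# Hypothetical rubric for a single encounter.
rubric = [
    RubricItem("documents chief complaint", 2.0),
    RubricItem("lists current medications", 1.0),
]
score_note(rubric, {"documents chief complaint"})  # 2.0 / 3.0
```

Because the score is a single number per case, the same rubric applied to two system versions yields a direct, controlled comparison.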