Building agentic AI for healthcare — HealthQCopilot

Multi-agent: specialised agents on Semantic Kernel, not one general model
FHIR R4: every clinical claim grounded in a retrieved resource
Cite-or-refuse: unsupported assertions are blocked, not surfaced

Multi-agent orchestration on Semantic Kernel, grounded against a FHIR R4 record.

The problem

Healthcare teams don't lack data — they drown in it. The brief for HealthQCopilot was to put an agentic layer over clinical workflows on Azure, using Semantic Kernel to orchestrate specialised agents over a patient's FHIR R4 record. One constraint shaped every decision that followed: in this domain, a confident wrong answer is worse than no answer at all.

That reframes the whole engineering problem. The hard part of a clinical copilot isn't generating fluent text; modern models do that easily. The hard part is guaranteeing that everything the copilot says is traceable to the record in front of it, and that anything it can't ground is withheld rather than guessed.

The architecture

The system is a set of specialised agents coordinated by a planner on Semantic Kernel — each with a narrow, well-defined tool surface rather than one general model handed broad access. Smaller domain-tuned models handle the bounded, high-volume tasks; heavier reasoning is reserved for the steps that genuinely need it. Keeping each agent narrow is what makes the system's behaviour predictable enough to trust in a clinical setting.

Planner agent — decomposes the request and routes each sub-task to the agent best suited to it, rather than letting one model attempt everything.
FHIR retrieval agent — pulls only the resources in scope for the request; every downstream claim must cite a resource it returned.
Hallucination guard — validates each generated assertion against the retrieved FHIR context before it reaches a clinician (see below).
Composition agent — assembles the grounded, verified pieces into the clinician-facing response, preserving each claim's provenance.
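
To make the "narrow tool surface" idea concrete, here is a minimal sketch in plain Python rather than Semantic Kernel itself. Every name in it (Agent, SubTask, plan, run, the tool names) is hypothetical and chosen for illustration, and the planner is a fixed decomposition standing in for a model call. The point it shows is only that each agent can call nothing outside its declared tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Agent:
    """A narrow agent: a name plus the only tools it is allowed to call."""
    name: str
    tools: Dict[str, Callable[[str], str]]

    def call(self, tool: str, arg: str) -> str:
        # Anything outside the agent's explicit tool surface is refused.
        if tool not in self.tools:
            raise PermissionError(f"{self.name} has no tool named {tool!r}")
        return self.tools[tool](arg)


@dataclass(frozen=True)
class SubTask:
    agent: str  # which agent the planner routes to
    tool: str   # which of that agent's tools to invoke
    arg: str


def plan(request: str) -> List[SubTask]:
    """Hypothetical planner: a fixed decomposition standing in for a model call."""
    return [
        SubTask("fhir_retrieval", "search", request),
        SubTask("composition", "assemble", request),
    ]


def run(request: str, agents: Dict[str, Agent]) -> List[str]:
    """Execute each planned sub-task on the agent it was routed to."""
    return [agents[t.agent].call(t.tool, t.arg) for t in plan(request)]


# Stub tools; a real system would wrap a FHIR client and a model here.
AGENTS = {
    "fhir_retrieval": Agent("fhir_retrieval", {"search": lambda q: f"resources for: {q}"}),
    "composition": Agent("composition", {"assemble": lambda q: f"response for: {q}"}),
}
```

In this sketch, asking the composition agent to run "search" raises PermissionError instead of quietly widening its access, which is the behaviour that keeps each agent's output easy to predict.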

Why the hallucination guard became the most important agent

The reasoning agents were supposed to be the hard part. The agent that actually decided whether the product was usable was the guard: a verification pass that rejects any statement not directly supported by a retrieved FHIR resource.

The model that generates is not allowed to be the model that approves. Grounding has to be a separate, auditable step.

Concretely, the guard sits between generation and the clinician: it extracts the discrete factual claims from a draft response and checks each one against the FHIR context the retrieval agent returned. A claim that maps cleanly to a resource passes with its provenance attached. A claim with no support isn't softened or hedged; it's dropped, and where the gap matters the request is escalated rather than shipped as a guess. Separating "write" from "approve" is what makes that decision auditable after the fact.
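
The check described above might look roughly like the following sketch. It is not the project's implementation. It assumes claim extraction has already happened upstream and produced structured claims, that the retrieved context is a dictionary keyed by "ResourceType/id", and that a claim is supported only when the value it asserts equals the value at a named path in the cited resource. Claim, lookup and guard are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Claim:
    text: str                    # the sentence as drafted
    reference: Optional[str]     # e.g. "Observation/obs-1", or None if uncited
    path: Tuple[str, ...] = ()   # where in the resource the supporting value lives
    value: Any = None            # the value the claim asserts


def lookup(resource: Dict[str, Any], path: Tuple[str, ...]) -> Any:
    """Walk a nested dict; return None if any key along the path is missing."""
    node: Any = resource
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def guard(
    claims: List[Claim], context: Dict[str, Dict[str, Any]]
) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (passed, dropped). Unsupported claims are never rewritten."""
    passed: List[Claim] = []
    dropped: List[Claim] = []
    for claim in claims:
        resource = context.get(claim.reference) if claim.reference else None
        supported = (
            resource is not None
            and len(claim.path) > 0
            and claim.value is not None
            and lookup(resource, claim.path) == claim.value
        )
        (passed if supported else dropped).append(claim)
    return passed, dropped


# Illustrative data only: one retrieved Observation and three drafted claims.
context = {
    "Observation/obs-1": {
        "resourceType": "Observation",
        "id": "obs-1",
        "valueQuantity": {"value": 7.2, "unit": "%"},
    }
}
claims = [
    Claim("HbA1c was 7.2%.", "Observation/obs-1", ("valueQuantity", "value"), 7.2),
    Claim("Patient is allergic to penicillin.", None),
    Claim("HbA1c was 9.1%.", "Observation/obs-1", ("valueQuantity", "value"), 9.1),
]
passed, dropped = guard(claims, context)
```

Here only the first claim passes. The uncited allergy statement and the claim that contradicts the cited Observation are both dropped, and the dropped list is what an escalation step or audit log would consume. Exact-match comparison is a deliberate simplification; a real guard would need unit handling and tolerance rules.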

What the work taught us about prior-auth

Prior-authorization looks, from the outside, like a document-summarisation problem — read the policy, read the record, decide. In practice the rules are conditional, payer-specific, and change often, so the early instinct to treat it as one big reasoning prompt produced answers that were plausible and unverifiable in equal measure. The lesson that reshaped the design: prior-auth is a retrieval-and-grounding problem before it's a reasoning problem. Pin every determination to the specific policy clause and record field it depends on, and the same cite-or-refuse discipline that protects clinical claims protects authorization decisions too.
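
One way to picture "pin every determination to the clause and field it depends on" is the sketch below. It rests on assumptions: the clause identifiers, the flattened record keys and the three outcome labels are invented for illustration and do not reflect any real payer policy or the project's own schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Criterion:
    clause_id: str     # hypothetical identifier of a payer policy clause
    record_field: str  # hypothetical key into a flattened view of the record
    required: Any      # the value the clause requires


@dataclass(frozen=True)
class Finding:
    clause_id: str
    record_field: str
    met: Optional[bool]  # None means the record holds no evidence either way


def evaluate(criteria: List[Criterion], record: Dict[str, Any]) -> List[Finding]:
    """Produce one finding per clause, each tied to the field it was checked against."""
    findings: List[Finding] = []
    for c in criteria:
        if c.record_field not in record:
            findings.append(Finding(c.clause_id, c.record_field, None))
        else:
            met = record[c.record_field] == c.required
            findings.append(Finding(c.clause_id, c.record_field, met))
    return findings


def determine(findings: List[Finding]) -> str:
    """Cite-or-refuse for authorization: missing evidence escalates, it is never guessed."""
    if any(f.met is None for f in findings):
        return "escalate"
    return "criteria_met" if all(f.met for f in findings) else "criteria_not_met"


# Illustrative criteria only.
criteria = [
    Criterion("policy-4.2(a)", "step_therapy_completed", True),
    Criterion("policy-4.2(b)", "diagnosis_code", "E11.9"),
]
```

Because each Finding names its clause and field, a reviewer can see exactly which rule and which piece of the record drove the outcome. A record missing the diagnosis code yields "escalate" rather than a plausible guess.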

Orchestration patterns that survived real use

Narrow tools over broad agents. The wider an agent's surface, the harder its output is to predict and verify. Constraining each agent to a small, explicit toolset traded a little flexibility for a lot of reliability.
Cite-or-refuse. Every clinical claim carries its FHIR provenance or it doesn't ship. This single rule did more for trust than any model upgrade.
Separate generation from verification. A distinct, auditable approval step — not the generating model grading its own work — is what made the system defensible in a clinical context.

Outcome

The architecture's value isn't a cleverer model; it's that the system is honest about what it knows. Grounding and verification as first-class, separate steps mean unsupported claims are withheld by design, and every statement that does reach a clinician cites the FHIR resource it came from. That auditability is the difference between a demo and something a clinician can actually rely on.

The same pattern generalises beyond healthcare: any domain where a wrong answer is expensive benefits from narrow agents, mandatory grounding, and a generation/approval split. It's the part of the design I'd reach for first on the next high-stakes agentic system.

Why I’m building this in the open

I’m developing HealthQCopilot as open source — AI for Humanity. Life-critical AI shouldn’t be a black box: in healthcare, a confident wrong answer can hurt someone, so the safety layer — grounding, verification, guardrails — belongs in the open, where anyone can inspect it, trust it, and build on it. Open-sourcing that trust layer makes safe AI a public good, not a proprietary moat — within reach of smaller clinics, regulated teams, and under-resourced settings, not just the giants.