Validating AI-generated code in regulated environments: a practical playbook

AI-generated code is showing up in medical device firmware, eQMS systems and production support tools. The FDA doesn't prohibit it. The FDA expects you to validate it the same way you'd validate human-written code, with attention to the failure modes AI generation introduces.

If your engineering team uses GitHub Copilot, Claude Code, Cursor or any other AI coding assistant, your validation team is already validating AI-generated code. They may not know it.

What's actually new

Validating AI-generated code is not validating an AI model. The model lives at the vendor. What you receive is a code artifact, no different in form from code a human wrote.

What is new is the failure mode profile. AI-generated code fails differently from human code. Three categories show up in production code review:

Plausible-but-wrong logic. The code looks correct. Naming is conventional. Comments exist. The right libraries are imported. The algorithm is subtly wrong. A unit calculation off by a factor of ten. A boundary condition that excludes the boundary. A regulatory threshold checked against the wrong field.

Hallucinated APIs. The code calls library functions that don't exist, or calls real functions with fabricated signatures. This category is shrinking as models improve. It hasn't disappeared.

Pattern-matched insecurity. The model produces code patterns common in its training data, including ones with known security flaws. SQL string concatenation. Weak password handling. Missing input validation.

Your verification needs to test for these specifically. Not just the failure modes you'd expect from a junior engineer's code.
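
A minimal, hypothetical sketch of the first category shows why it survives a fast review. Every name, field and threshold here is invented for illustration:

```python
# Hypothetical example of plausible-but-wrong logic. Names, fields and
# thresholds are invented; they do not come from any real device or standard.

def infusion_rate_alarm(reading: dict) -> bool:
    """Return True if the infusion rate reading should raise an alarm."""
    MAX_RATE_ML_PER_HR = 999.0

    # Looks right, reads cleanly, imports nothing exotic -- and contains
    # two of the failure modes described above:
    #
    # 1. Boundary exclusion: '>' silently accepts a reading exactly at the
    #    limit. If the requirement says the limit itself must alarm, this
    #    should be '>='.
    # 2. Wrong field: the check uses the commanded rate, not the measured
    #    rate, so a pump actually delivering above the limit never alarms.
    return reading["commanded_rate_ml_hr"] > MAX_RATE_ML_PER_HR
```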

The IEC 62304 mapping

Three sections of IEC 62304 take on extra weight when AI-generated code is in the mix.

§5.4 detailed software design requires documented design decisions. AI generation does not produce a design rationale. The engineer who accepted the suggestion needs to document why the design is appropriate. This is a procedural change to your code review process, not an architectural one.

§5.5.2 unit verification requires documented acceptance criteria. AI-generated code that passes unit tests still needs human review of the test coverage. AI is good at writing code that passes tests. AI is also good at writing tests that pass for the wrong reasons.
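
A hypothetical sketch of that failure: the constant and the first test are the kind of pair an assistant produces together, and the second test is the independent check a reviewer adds. The values are invented:

```python
# Hypothetical sketch of a test that passes for the wrong reason.

MG_PER_MCG = 0.01          # Wrong: 1 mcg is 0.001 mg, off by a factor of ten.

def mcg_to_mg(dose_mcg: float) -> float:
    return dose_mcg * MG_PER_MCG

def test_mcg_to_mg():
    # The test reuses the same constant instead of an independently derived
    # expected value, so it passes whether or not the constant is correct.
    # Coverage numbers alone will not catch this.
    assert mcg_to_mg(500) == 500 * MG_PER_MCG

def test_mcg_to_mg_independent():
    # A reviewer-added check against a hand-calculated value: 500 mcg is 0.5 mg.
    assert mcg_to_mg(500) == 0.5
```

The first test passes and contributes coverage; the second fails, exposing the factor-of-ten error. That failure is the evidence the review record needs.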

§8.1.1 SOUP designation arguably applies. The model is software of unknown provenance that contributed to your software item. Some quality systems treat AI-generated code as SOUP and document it accordingly. Others treat the model as a tool, like a compiler, and document the tool instead. The FDA hasn't formally taken a position. Pick an interpretation, write it into your QMS, defend it consistently.

What to test for

The failure modes drive the test strategy. For AI-generated code at any non-trivial risk class, the verification burden expands beyond traditional unit testing.

Functional correctness against requirements. Standard verification, with extra emphasis on boundary conditions, off-by-one errors and unit conversions. AI is statistically prone to these.

Algorithmic correctness against domain expertise. A human SME reviews the algorithm against domain knowledge. For medical calculations, that means a clinical reviewer or biomedical engineer reads the code, not just the test results.

API call validation. Every external library call gets verified against the actual library documentation. For high-risk software, run the suspect calls in isolation to confirm behavior matches the code's assumptions.
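
A minimal sketch of both checks using only the Python standard library. The module, function and assumed behavior are stand-ins for whatever the generated code actually calls:

```python
# Sketch of verifying a suspect external call in isolation. The calls checked
# here are placeholders; substitute the ones the generated code references.

import importlib
import statistics

def assert_call_exists(module_name: str, attr_name: str) -> None:
    """Confirm the generated code is not calling a hallucinated API."""
    module = importlib.import_module(module_name)
    attr = getattr(module, attr_name, None)
    assert callable(attr), f"{module_name}.{attr_name} is not a real callable"

def test_no_hallucinated_calls():
    assert_call_exists("statistics", "median_high")

def test_behavior_matches_assumption():
    # Suppose the generated code assumed median_high returns the larger of
    # the two middle values on even-length input. Confirm against the real
    # library rather than trusting the code's comment.
    assert statistics.median_high([1, 2, 3, 4]) == 3
```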

Security pattern review. Static analysis tools tuned for the AI failure mode profile. Hardcoded credentials, weak randomness, SQL injection patterns, missing authentication checks.
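
For reference, the kind of pattern a tuned analyzer should flag, sketched with sqlite3 and invented table names, next to the parameterized form that replaces it:

```python
# Minimal sqlite3 sketch of one pattern a tuned static analyzer should flag.
# Table and column names are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (serial TEXT, status TEXT)")

def find_device_unsafe(serial: str):
    # Pattern-matched insecurity: string concatenation builds the query,
    # so a crafted serial number becomes part of the SQL.
    query = "SELECT status FROM devices WHERE serial = '" + serial + "'"
    return conn.execute(query).fetchall()

def find_device_safe(serial: str):
    # Parameterized form: the driver handles quoting, the input stays data.
    return conn.execute(
        "SELECT status FROM devices WHERE serial = ?", (serial,)
    ).fetchall()
```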

Coverage analysis. Statement and decision coverage at minimum for Class B. Branch coverage and MC/DC for Class C critical units. AI-generated tests need particular scrutiny because the model may have written tests that exercise the same wrong logic the code implements.

The code review record

The FDA inspector who finds AI-generated code in your software will ask one question: "How do you know this code is correct?"

The defensible answer is a documented review record showing that the author identified the code as AI-generated, that a qualified human reviewer independently evaluated it, that the reviewer's evaluation criteria are documented, and that any modifications made during review are recorded.
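
As an illustration only, those four items map onto a record structure like the one below. The field names are invented; map them onto whatever your review tool or eQMS already stores:

```python
# Illustrative sketch of a review record capturing the four items above.

from dataclasses import dataclass, field

@dataclass
class AICodeReviewRecord:
    change_id: str                      # commit SHA or PR identifier
    ai_generated: bool                  # author identified the code as AI-generated
    reviewer: str                       # qualified human reviewer, independent of the author
    evaluation_criteria: list[str]      # what the reviewer checked (boundaries, units, APIs, ...)
    modifications: list[str] = field(default_factory=list)  # changes made during review
```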

Most teams don't currently capture the first item. Their code review tool tracks human authors; AI generation is invisible. The gap is fixable with a process change. Add a checkbox, a commit message convention, a PR template field. Capture it consistently. Any of these beats not capturing it at all.
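
One way to make the commit message convention stick is a commit-msg hook. The sketch below assumes an "AI-Assisted" trailer, which is an invented convention; substitute whatever your QMS defines:

```python
#!/usr/bin/env python3
# Sketch of a git commit-msg hook enforcing a provenance trailer.
# The trailer name and accepted values are assumptions, not a standard.
# Install as .git/hooks/commit-msg and make it executable.

import re
import sys

TRAILER = re.compile(r"^AI-Assisted:\s*(none|partial|generated)\s*$", re.MULTILINE)

def main() -> int:
    with open(sys.argv[1], encoding="utf-8") as f:
        message = f.read()
    if TRAILER.search(message):
        return 0
    sys.stderr.write(
        "Commit rejected: add an 'AI-Assisted: none|partial|generated' "
        "trailer so provenance is captured in the record.\n"
    )
    return 1

if __name__ == "__main__":
    sys.exit(main())
```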

The validation evidence chain

For a Class B or C software item that includes AI-generated code, the DHF should contain the provenance record identifying which code was AI-generated, the documented code review record described above, verification results against requirements covering the boundary, unit-conversion and API checks from the test strategy, coverage analysis appropriate to the safety class, and the QMS rationale for how the generating tool is classified.

The volume isn't larger than traditional software validation. The content is different.

Where teams cut corners

Three patterns I see at companies in the 50-to-500 person range.

Treating AI-generated code as if it were human-written. The review process doesn't distinguish, the test strategy doesn't address AI-specific failure modes, the DHF doesn't capture provenance. Works until an inspector asks.

Banning AI generation by policy. The QMS prohibits AI tools, the engineering team uses them anyway, the validation team has no visibility. The worst possible state. The risk exists. The controls don't.

Validating the AI tool instead of the code. Some teams have written validation packets for Copilot or Claude. This misses the point. You can't validate a tool whose internal behavior you don't control. You validate the output, with controls proportionate to the risk class.

The defensible posture is to allow AI tools, document their use and apply proportionate verification to the output.

The vendor evidence question

For higher-risk applications, some teams ask whether the AI vendor's evidence (model card, evaluation results, safety testing) can offset the verification burden.

The honest answer is no, not currently. Vendor model cards provide useful context, but they aren't designed as evidence for medical device validation. The evaluation datasets don't match your use case. The safety testing doesn't address your specific failure modes.

This may change. Anthropic, OpenAI and others are publishing more rigorous evaluation evidence. For now vendor evidence is supporting context. Not primary evidence.

What's worth doing now

Audit your engineering team for AI tool use. Ask honestly. The tools are pervasive. If the answer is "we don't use them," the answer is probably wrong.

Add AI provenance tracking to your code review process. A commit message convention or a PR template field. Capture which code was AI-generated, which was AI-modified and which was human-only.

Update the software development plan to address AI-generated code. Define your review criteria, your verification approach and your DHF expectations. Get it through change control before the next inspection. Not after.

The AI tooling shift in software engineering is irreversible. The validation discipline that addresses it explicitly will be more defensible than the one that pretends it isn't happening.


VibeVal validates IEC 62304 structural requirements for any code diff, AI-generated or otherwise. The check returns a verdict, a CSA-aligned rationale and a SHA-256 attestation hash. The verdict and hash give you a defensible record for the DHF on every change. Pay per check.

