When Your Mental Health App Learns: Applying ISO 14971 Risk Management to AI in DMHT

ISO 14971 is the risk management standard behind every medical device technical file. If you are building a digital mental health product, or signing off on one as a Clinical Safety Officer, you will encounter it. You will write a hazard log, a risk management plan, and eventually a clinical safety case. All of it sits on ISO 14971 foundations.

The problem is that ISO 14971 was written for devices that behave the same way every time. A machine learning model does not. And for mental health technology specifically, the problem runs deeper still: ISO 14971 is a product safety paradigm, and mental health harm is partly sociotechnical. The device can behave exactly as intended and the system can still fail.

To bridge the machine learning problem, BSI published BS AAMI TIR34971:2023 in May 2023. To be clear, this is not a new standard but a practical guide to applying ISO 14971 when your device uses AI or ML. The underlying standard remains ISO 14971. What TIR34971 adds is a set of additional questions you need to answer at each stage of that process, and an explanation of why those questions matter for a system that learns.

Here is what that means in practice for the three documents that matter most.

The hazard log

A standard ISO 14971 hazard log asks: what could go wrong with this device, and what is the likelihood and severity of the resulting harm?

For an AI product, those questions are necessary but not sufficient. The guide asks you to go further.

Who was the model trained on, and does that population match the people who will actually use the product? A conversational AI fine-tuned predominantly on data from young White British women will perform differently for young men, ethnic minority users, and people with severe depression. Imagine a subgroup validation study in which male-identified users rate AI responses as helpful 44% of the time compared with 71% for female-identified users, while the severe depression subgroup rates them helpful 38% of the time.

These numbers are illustrative, but if you assume that biases of this magnitude would be caught and mitigated before deployment, think again. The published literature on AI in mental health tells a different story. Research consistently finds that aggregate performance metrics mask clinically significant subgroup differences: gender disparities in speech-based depression detection, systematic under-diagnosis of specific demographic groups in anxiety screening, and performance gaps that persist even after bias mitigation is applied. These are not hypothetical risks. They are documented findings in peer-reviewed research on deployed systems.
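To make the point concrete, here is a minimal sketch of what subgroup validation looks like in code. The data echoes the illustrative figures above and is entirely hypothetical; the point is that the aggregate rate can look acceptable while a subgroup rate does not.

```python
from collections import defaultdict

# Hypothetical validation records: (subgroup, rated_helpful).
# A real validation set would hold thousands of labelled interactions.
ratings = [
    ("female-identified", True), ("female-identified", True),
    ("female-identified", True), ("female-identified", False),
    ("male-identified", True), ("male-identified", False),
    ("male-identified", False), ("severe-depression", False),
    ("severe-depression", True), ("severe-depression", False),
]

def helpful_rate(records: list[tuple[str, bool]]) -> float:
    return sum(helpful for _, helpful in records) / len(records)

# Aggregate metric: what a top-line report shows.
print(f"aggregate: {helpful_rate(ratings):.0%}")

# Subgroup metrics: what a clinical safety review needs.
by_group: dict[str, list[tuple[str, bool]]] = defaultdict(list)
for group, helpful in ratings:
    by_group[group].append((group, helpful))

for group, records in by_group.items():
    print(f"{group}: {helpful_rate(records):.0%}")
```

The aggregate figure alone would pass a naive acceptance criterion; only the per-subgroup breakdown exposes the foreseeable harm pathway.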

Beyond training data, two further failure modes require explicit documentation. The first is data drift: a model trained on mental health conversations from 2022 may respond less safely to presentations shaped by events that occurred afterwards. The hazard is real and foreseeable, the likelihood and severity can be estimated, and the appropriate control is a drift monitoring plan with clinically meaningful metrics, not output format consistency checks. The second is opacity: for models whose outputs cannot be fully traced to specific inputs, the clinical implications of that opacity must be documented. If a harmful response is generated and the root cause cannot be identified, prevention of recurrence is uncertain. That uncertainty is a foreseeable hazard in its own right.
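What a drift monitoring plan with clinically meaningful metrics might look like, in a minimal sketch. The metric here (recall of risk-flagged conversations on a periodically labelled audit sample) and the threshold are illustrative assumptions, not values prescribed by TIR34971:

```python
# Hypothetical drift check: track a clinical metric over time against a
# predefined threshold, rather than checking output format consistency.
DRIFT_THRESHOLD = 0.90  # illustrative: minimum acceptable recall

def risk_recall(audit_sample: list[tuple[bool, bool]]) -> float:
    """audit_sample: (clinician_says_risk, model_flagged_risk) pairs."""
    true_risk = [(truth, flagged) for truth, flagged in audit_sample if truth]
    if not true_risk:
        return 1.0
    return sum(flagged for _, flagged in true_risk) / len(true_risk)

def check_drift(monthly_audit_samples: dict[str, list[tuple[bool, bool]]]) -> None:
    for month, sample in monthly_audit_samples.items():
        recall = risk_recall(sample)
        if recall < DRIFT_THRESHOLD:
            # In a real plan this opens a clinical safety incident,
            # not just a log line.
            print(f"{month}: recall {recall:.0%} below threshold -- escalate to CSO review")
        else:
            print(f"{month}: recall {recall:.0%} within bounds")
```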

The risk management file

The guide introduces two additional documents that sit within the risk management file for AI and ML products.

The first is a training data specification. This documents who the model was trained on, what the demographic and clinical characteristics of the training population were, what the known gaps are relative to the intended use population, how bias was assessed, and what mitigations are in place. If you are building on a third-party foundation model and do not have access to the training data, that does not remove this requirement. You document what you know, infer the likely characteristics from publicly available information about the model, identify the subgroups in your intended user population that are likely to be underrepresented, and design post-market surveillance to detect differential performance in real-world use.
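A minimal sketch of what a training data specification might capture as a structured record. The field names and values are illustrative; the guide does not prescribe a schema, only that this information is documented and kept current:

```python
from dataclasses import dataclass

@dataclass
class TrainingDataSpecification:
    model_name: str
    data_provenance: str              # what is known, or inferred, about sources
    population_characteristics: dict  # demographic / clinical breakdown, where known
    known_gaps: list[str]             # subgroups underrepresented vs intended use
    bias_assessment: str              # method used and findings
    mitigations: list[str]            # design-time and post-market controls

spec = TrainingDataSpecification(
    model_name="third-party foundation model (training data not disclosed)",
    data_provenance="inferred from the vendor's public model card",
    population_characteristics={"language": "predominantly English web text"},
    known_gaps=[
        "young men with low mood",
        "people with severe depression",
        "ethnic minority users in the intended population",
    ],
    bias_assessment="subgroup validation on an internal labelled test set",
    mitigations=["post-market surveillance of per-subgroup helpfulness ratings"],
)
```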

The second is a Predetermined Change Control Plan. A machine learning model is updated frequently. The PCCP specifies in advance which changes are pre-approved, which require CSO review, which require formal change control assessment, and which would require a fresh conformity assessment. Without it, every model update could theoretically require a new conformity assessment. With it, you can update the model within defined boundaries without triggering a full re-assessment.
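The decision logic a PCCP encodes can be sketched in a few lines. The change categories and approval routes below are illustrative assumptions; a real PCCP defines these boundaries in regulatory language, not code:

```python
# Hypothetical mapping of change types to pre-agreed approval routes.
APPROVAL_ROUTES = {
    "prompt_wording_within_approved_templates": "pre-approved",
    "model_fine_tune_same_data_distribution": "CSO review",
    "new_training_data_source": "formal change control assessment",
    "change_to_intended_use_population": "fresh conformity assessment",
}

def route_for(change_type: str) -> str:
    # Anything the plan did not anticipate falls outside its boundaries
    # and defaults to the most conservative route.
    return APPROVAL_ROUTES.get(change_type, "fresh conformity assessment")

print(route_for("model_fine_tune_same_data_distribution"))  # CSO review
print(route_for("entirely_new_architecture"))               # fresh conformity assessment
```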

The clinical safety case

The clinical safety case is the document that brings everything together and carries the CSO's professional sign-off. For an AI product, a clinical safety case built only on standard ISO 14971 process compliance is incomplete.

Before signing, a CSO should be able to answer five questions:

  1. Is there a training data specification that identifies gaps relative to the intended use population?

  2. Has the model been validated across clinically relevant subgroups, not just the overall population?

  3. Is there a drift monitoring plan with clinically meaningful metrics and defined thresholds?

  4. Is there a documented explainability statement addressing the clinical implications of the model's opacity for this specific use case?

  5. Is there a Predetermined Change Control Plan?

These are not technical details. They are clinical safety questions. A model that performs significantly worse for young men with low mood, or for people with severe depression, is a product with a foreseeable harm pathway that has not been adequately addressed. Signing off on a clinical safety case without those questions answered is a professional risk, not just a regulatory one.

The gap between clinical safety and AI safety

Teams developing AI products sometimes misunderstand what clinical safety frameworks require. Not because they are not thinking about safety, but because they are answering a different question.

AI safety asks: can this model be made to produce harmful outputs? Clinical safety asks: even when the model behaves exactly as intended, does this product create foreseeable harm pathways for this population?

The communication gap becomes most visible in adversarial testing, and in what the developer hopes to answer through it. A red team can design technically sophisticated attacks: prompt injection, jailbreaking, attempts to extract harmful content. What they cannot do without clinical input is define what a harmful output looks like for a specific population. That definition is clinical, not technical.

The clinician’s role in adversarial testing is bounded: define the harm taxonomy before testing begins, interpret borderline outputs that technical classifiers cannot adjudicate, determine whether residual risk is acceptable at sign-off. Designing attack vectors belongs to the red team.
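A minimal sketch of what that division of labour produces: a clinician-defined harm taxonomy, fixed before testing begins, against which red-team outputs are adjudicated. The categories are illustrative assumptions for a depression-focused conversational product, not a published taxonomy:

```python
from enum import Enum

class HarmCategory(Enum):
    NONE = "no foreseeable harm"
    VALIDATION_WITHOUT_ESCALATION = "warm response to risk disclosure, no escalation"
    MINIMISATION = "downplays reported symptoms"
    HARMFUL_INSTRUCTION = "provides means or methods information"
    INAPPROPRIATE_CLINICAL_ADVICE = "advice outside intended use"

def label_output(output_id: str, category: HarmCategory, rationale: str) -> dict:
    """Clinician adjudication record for one red-team output."""
    return {"output_id": output_id, "category": category.name, "rationale": rationale}

record = label_output(
    "rt-0042",
    HarmCategory.VALIDATION_WITHOUT_ESCALATION,
    "Empathic reply to disclosure of not eating or sleeping; no signposting.",
)
```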

However, that contribution sits within a broader accountability. The CSO carries named responsibility for clinical safety across the product lifecycle. Adversarial testing is one part of that, not the definition of it.

The clinical definition matters because a model responding warmly to a user who has just described stopping eating, stopping sleeping, and feeling like a burden is doing exactly what its developers intended. The failure is not the response. It is the absence of escalation alongside it. The model passed its design criteria and failed the clinical safety test simultaneously.
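A minimal sketch of the design control this example implies: the warm response is not suppressed, but a risk check runs alongside generation and attaches escalation when indicated. The keyword matching here is a crude stand-in; a real product would use a validated risk classifier and an escalation pathway defined in the clinical safety case:

```python
# Illustrative risk signals; a real system would use a validated classifier.
RISK_SIGNALS = ("stopped eating", "stopped sleeping", "feel like a burden")

ESCALATION = (
    "It sounds like things are really hard right now. "
    "I want to connect you with someone who can help -- "
    "[signposting pathway defined in the clinical safety case]."
)

def respond(user_message: str, model_reply: str) -> str:
    if any(signal in user_message.lower() for signal in RISK_SIGNALS):
        # The failure mode is warmth *without* this step.
        return f"{model_reply}\n\n{ESCALATION}"
    return model_reply
```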

A practical note

The guide does not create new obligations from scratch. It reframes the existing ISO 14971 questions for a context where the device learns, adapts, and may perform differently for different people. The hazard identification process still asks what could go wrong. The guide adds: and for whom, and under what conditions of use, and as the model changes over time.
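Those added dimensions can be made concrete as extra fields on a hazard log entry. A minimal sketch with illustrative field names and an example entry; the standard does not prescribe a schema:

```python
from dataclasses import dataclass

@dataclass
class HazardLogEntry:
    hazard: str
    harm: str
    severity: int                  # per the scale in the risk management plan
    likelihood: int
    # Dimensions TIR34971 adds to the standard questions:
    affected_subgroups: list[str]  # "for whom"
    conditions_of_use: str         # "under what conditions"
    change_dependency: str         # "as the model changes over time"
    controls: list[str]

entry = HazardLogEntry(
    hazard="Model responds warmly to risk disclosure without escalation",
    harm="Delayed access to crisis support",
    severity=5,
    likelihood=2,
    affected_subgroups=["people with severe depression"],
    conditions_of_use="unsupervised use outside clinical hours",
    change_dependency="risk-signal recall may degrade after model updates",
    controls=["escalation layer independent of the generative model",
              "drift monitoring of risk-signal recall"],
)
```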

If you are building a digital mental health product with an AI component, the earlier you engage with these questions the better. A training data specification written before fine-tuning costs almost nothing. The same document written under pressure during a conformity assessment costs considerably more, and may require changes to the model or the intended use population that affect the product substantially.

Key takeaways

  • BS AAMI TIR34971:2023 is not a new standard. It is a guidance document for applying ISO 14971 when your device uses AI or ML.

  • The hazard log must address training data bias, data drift, and model opacity as foreseeable hazards, not just device malfunction.

  • Subgroup validation is a clinical safety requirement, not a statistical nicety. Aggregate performance metrics routinely mask clinically significant disparities.

  • Two additional documents are required in the risk management file for AI products: a training data specification and a Predetermined Change Control Plan.

  • A clinical safety case that covers only standard ISO 14971 compliance for an AI product is incomplete.

  • AI safety and clinical safety address different layers of the same problem. Both are necessary. Neither substitutes for the other.

Dr Hellen von Winckler is a Specialty Doctor in Psychiatry and Director and Clinical Safety Officer at F&G Strategy Ltd, a clinical safety and regulatory advisory practice specialising in digital mental health technology and Software as a Medical Device. Explore free resources and upcoming modules on SaMD risk management at the F&G Strategy Academy: academy.fgstrategy.co.uk.
