Why GAMP 5 AI Validation Is Different
The GAMP guidance series has shaped GxP computerised system validation since the first GAMP guide in 1994; GAMP 5, published in 2008 and revised in a second edition in 2022, is the current primary methodology. Its category-based risk classification, lifecycle approach, and IQ/OQ/PQ validation framework have guided thousands of system validations across pharmaceutical, biotech and medical device organisations. For conventional software, the framework works well.
AI and machine learning systems challenge several GAMP 5 assumptions. GAMP 5 was designed for deterministic software: given the same inputs, the system produces the same outputs. AI systems — particularly those that learn from operational data — may not meet this condition. A model that was validated against historical training data may behave differently as it processes new operational data over time. The second edition of GAMP 5, published in 2022, extends the framework to address these characteristics. The six validation gaps in this guide are the most common failures to apply this extended framework correctly.
The Six GAMP 5 AI Validation Failures
Gap 1: Undefined Intended Use Boundaries
Undefined intended use boundaries are the single most common GAMP 5 AI validation gap. Intended use boundaries define the data types, ranges, contexts and decision types for which the AI model was trained and validated. Without defined boundaries, there is no basis for determining when the system is operating within validated conditions — and no mechanism for detecting when it is not. Remediation: document intended use boundaries in the user requirements specification and validation summary report, and implement monitoring that detects out-of-boundary operation.
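The out-of-boundary monitoring described above can be sketched as a simple runtime check of model inputs against the documented validated ranges. The feature names and ranges below are hypothetical examples, not values from any real validation package:

```python
# Sketch: flag model inputs that fall outside the intended use boundaries
# documented at validation. Feature names and ranges are illustrative.
VALIDATED_RANGES = {
    "temperature_c": (15.0, 30.0),  # range covered by training/validation data
    "ph": (6.5, 8.0),
}

def out_of_boundary(sample: dict) -> list:
    """Return the names of features outside their validated range (or missing)."""
    violations = []
    for feature, (lo, hi) in VALIDATED_RANGES.items():
        value = sample.get(feature)
        if value is None or not (lo <= value <= hi):
            violations.append(feature)
    return violations

# A sample with temperature above the validated range triggers one violation.
alerts = out_of_boundary({"temperature_c": 34.2, "ph": 7.1})
```

In practice each flagged sample would be logged and the associated model output quarantined or escalated, rather than silently accepted.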
Gap 2: Ungoverned Training Data
Training data is the foundation on which model behaviour is built. If training data quality is not assessed, provenance is not documented, and preprocessing decisions are not recorded, the validation package cannot demonstrate that the model was built on reliable foundations. This is a data integrity failure. Remediation: apply ALCOA+ principles to training data — document provenance, preprocessing methodology, quality assessment criteria, and data version control. Retain training datasets as GxP records.
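One minimal way to make a retained training dataset tamper-evident is a provenance manifest that records a content hash, size, and preprocessing note for each file. This is a sketch under the assumption that dataset files are available as byte content; the file name and preprocessing note are invented examples:

```python
# Sketch: a minimal provenance manifest for a training dataset.
# SHA-256 hashes make the retained GxP record tamper-evident.
import hashlib
import json
from datetime import datetime, timezone

def make_manifest(files: dict, preprocessing_note: str) -> dict:
    """Record content hash and size per file, plus preprocessing decisions."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "preprocessing": preprocessing_note,
        "files": {
            name: {"sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
            for name, data in files.items()
        },
    }

# Hypothetical dataset file and preprocessing note.
manifest = make_manifest(
    {"batch_records.csv": b"id,result\n1,pass\n"},
    "removed duplicate batch IDs; no value imputation performed",
)
record = json.dumps(manifest, indent=2)  # serialised form for retention
```

A real implementation would also capture who created the dataset and approve the manifest under the document control procedure.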
Gap 3: Test Datasets That Are Not Independent
A model tested against data that was used in training will appear more capable than it is. Test datasets must be selected before training begins, must not overlap with training data, and must represent the operational conditions the model will face — not just the conditions it was trained on. Remediation: document test dataset selection methodology in the validation protocol, confirm independence from training data, and include operational edge cases that were not well-represented in training.
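Confirming independence from training data can be as simple as an exact-overlap check, assuming each record can be serialised to a canonical string. The record format below is a made-up example:

```python
# Sketch: verify that no record appears in both the training and test sets.
def overlapping_records(train: list, test: list) -> set:
    """Return records present in both datasets; an empty set confirms independence."""
    return set(train) & set(test)

# Hypothetical canonical record strings; one training record has leaked
# into the test set and should be detected.
train_set = ["lot=A1;assay=98.2", "lot=A2;assay=97.5"]
test_set = ["lot=B1;assay=96.9", "lot=A2;assay=97.5"]
leaked = overlapping_records(train_set, test_set)
```

Exact matching only catches verbatim duplication; near-duplicate records (the same sample with trivial formatting differences) need fuzzier checks, which is why the selection methodology itself belongs in the validation protocol.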
Gap 4: Acceptance Criteria Defined After Testing
Acceptance criteria for AI systems must be defined before validation testing begins — not selected after observing results. Post-hoc selection of metrics that the model happens to meet is not validation. Remediation: define primary performance metrics, acceptance thresholds, and the minimum test dataset size required to demonstrate statistical confidence in the metrics before the validation protocol is executed.
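The minimum test dataset size mentioned above can be estimated before protocol execution with the standard normal-approximation sample-size formula for a proportion. The expected accuracy and margin below are illustrative inputs, not recommended values:

```python
# Sketch: minimum number of test samples so that accuracy can be estimated
# to within +/- `margin` at ~95% confidence (normal approximation,
# n >= z^2 * p * (1 - p) / margin^2).
import math

def min_test_set_size(expected_accuracy: float, margin: float, z: float = 1.96) -> int:
    """Sample size for a binomial proportion, rounded up."""
    p = expected_accuracy
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# Example: expecting ~95% accuracy, demonstrated to within +/- 2 points.
n = min_test_set_size(expected_accuracy=0.95, margin=0.02)
```

Writing this number into the protocol before testing removes the temptation to stop collecting test data once a favourable result appears.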
Gap 5: No Continuous Performance Monitoring
AI systems that are not continuously monitored may drift from their validated performance without detection. A model that was validated with 97% accuracy may, over time, degrade to 89% — below the acceptance criterion established at validation — without any system-generated alert. Remediation: implement continuous monitoring for primary performance metrics, define alert thresholds, establish escalation procedures for threshold breaches, and integrate monitoring into the change control procedure as a trigger for revalidation assessment.
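The monitoring described above can be sketched as a rolling-window accuracy tracker that alerts when performance drops below the acceptance criterion fixed at validation. The threshold and window size are illustrative assumptions:

```python
# Sketch: rolling-window performance monitor with a fixed alert threshold.
from collections import deque

class AccuracyMonitor:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold          # acceptance criterion from validation
        self.results = deque(maxlen=window) # 1 = correct outcome, 0 = incorrect

    def record(self, correct: bool) -> bool:
        """Record one outcome; return True if an alert should be raised."""
        self.results.append(1 if correct else 0)
        if len(self.results) < self.results.maxlen:
            return False  # window not yet full, too little data to judge
        accuracy = sum(self.results) / len(self.results)
        return accuracy < self.threshold

# Simulated stream at 90% accuracy against a 95% threshold: alerts begin
# as soon as the 100-sample window fills.
monitor = AccuracyMonitor(threshold=0.95, window=100)
alerts = [monitor.record(i % 10 != 0) for i in range(200)]
```

A production version would feed alerts into the escalation procedure and change control trigger the text describes, rather than just returning a boolean.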
Gap 6: Change Control Blind to AI-Specific Changes
Standard software change control was designed for intentional, human-initiated changes. AI systems can change their behaviour through mechanisms that traditional change control does not capture: model retraining, fine-tuning, changes to training data, and in some architectures, continuous learning from operational data. Remediation: extend your change control procedure with AI-specific change categories and define the revalidation requirements for each category.
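The category-to-requirement mapping might look like the following sketch. Both the category names and the actions are illustrative, not prescriptive; a real procedure would define these under QA ownership:

```python
# Sketch: AI-specific change categories mapped to the revalidation action
# each one triggers. All entries are hypothetical examples.
REVALIDATION_RULES = {
    "retraining_same_data": "re-execute performance test suite",
    "retraining_new_data": "full revalidation incl. training data review",
    "fine_tuning": "impact assessment + targeted regression testing",
    "training_data_change": "data integrity review + performance re-test",
    "continuous_learning_update": "confirm monitoring coverage; periodic review",
}

def revalidation_action(change_category: str) -> str:
    """Look up the required action; unknown categories escalate to QA."""
    return REVALIDATION_RULES.get(change_category, "escalate to QA for assessment")

action = revalidation_action("fine_tuning")
```

The key design point is the default branch: a change that does not fit a pre-defined category must be assessed, never waved through.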
Remediation Priority Framework
| Gap | Risk Level | Typical Remediation Timeline | Inspection Consequence if Unaddressed |
|---|---|---|---|
| Intended use boundaries | Critical | 2–4 weeks | Major finding — caps element capability assessment |
| Training data governance | Critical | 4–12 weeks (depends on data reconstruction complexity) | Critical data integrity finding |
| Independent test datasets | High | 2–6 weeks | Major finding — validation package rejected |
| Pre-defined acceptance criteria | High | 1–2 weeks if model not yet deployed | Major finding for new systems; observation for legacy |
| Continuous monitoring | High | 4–8 weeks | Major finding — system may be considered uncontrolled |
| AI change control | Medium | 2–4 weeks | Observation — potential major if changes have been made without assessment |
Our consultants are GAMP 5 and AI validation specialists, and we can deliver a gap assessment proposal within 48 hours.