Domain V: Operationalize AI Solution — Comprehensive Study Guide
Exam weight: 17% of PMI-CPMAI exam (~20 scored questions)
Score-report framing: ❌ Below Target — PRIORITY 2 for rebuild
Maps to CPMAI methodology phase: Phase VI — Operationalization
Number of ECO tasks: 7 (V.1 through V.7)
Estimated study time: 11 hours
Note from docs/ECO_TASK_REFERENCE.md: the score report flagged Task V.6 (Manage AI solution transition plan) as having no questions on the first-attempt form. Cover it anyway — the retake form is randomized.
Overview
Domain V is the smallest of the three weak domains by weight (17%) but covers the most distinct business surface: what happens after the model passes the operationalization gate (IV.6) and goes into production. Every task in this domain begins with an oversight verb: manage, oversee, prepare. The PM is responsible for ensuring the deployment plan is built, the deployment is managed, governance is overseen, metrics are monitored, contingencies are planned, transitions are managed, and lessons are captured.
The unifying pattern: Domain V tests whether the project manager understands that an AI project doesn't end at deployment. Models drift. Data shifts. Real-world conditions diverge from training. A project treated as "deployed and done" is exactly the failure mode PMI is testing for. The right answers always involve continuous monitoring, governance, contingency, and iteration.
Domain V has no formal go/no-go gate (the model→ops gate is IV.6, in Domain IV). But Domain V questions often test the readiness checklist that comes before deployment can be authorized — and the wrong-answer traps are typically "deploy and assume the team will handle it" or "the data scientist owns post-deployment performance."
Table of Contents
- Module 1: Operationalization Foundations (Lessons 1-4)
- Module 2: The Four AI Technology Environments (Lessons 5-9)
- Module 3: Deployment — How and Where (Lessons 10-20)
- Module 4: Continuous Operations and Life Cycle (Lessons 21-29)
- Module 5: Governance, Monitoring, and Trustworthy AI in Production (Lessons 30-37)
- Module 6: Closeout — Reporting, Limits, and Next Iteration (Lessons 38-40)
- Quick Reference: Model Monitoring Checklist
- Quick Reference: Trustworthy AI in Production
- Cross-Domain Links
- Knowledge Check
- Memory Aids & Mnemonics Summary
Module 1: Operationalization Foundations
Lessons 1-4 | What operationalization means and why it's different from app deployment.
Lesson 1: ECO Task V.1 — Manage Creation of AI Solution Deployment Plan
The PM's first job in Domain V: ensure a deployment plan exists before any model is pushed into production. Deployment planning isn't a check-the-box step — it's a documented artifact answering: how will the model serve predictions, where will it run, who will operate it, what performance is expected, what happens on failure, and what monitoring is in place from day one.
The PM does not write the deployment plan alone. The data scientist, ML engineer, platform team, and operations team all contribute. The PM coordinates the contributions, ensures the plan is documented, and ensures stakeholders sign off before deployment begins.
KEY TAKEAWAYS
- The output of V.1 is a documented deployment plan with stakeholder sign-off — not a slide deck.
- The PM coordinates plan creation; the team builds it; stakeholders authorize it.
💡 Memory Aid — HOPE-MS Plan
A deployment plan answers: How served (batch / real-time / microservice / stream), Operation location (on-prem / edge / cloud / hybrid), Performance expected, Escalation path on failure, Monitoring from day one, Stakeholder sign-off.
PM Oversight Angle
- PM owns: Coordinating the cross-team effort to produce a documented deployment plan that covers serving method, location, performance criteria, monitoring, fail-over, and stakeholder authorization.
- Deliverable: AI Solution Deployment Plan — typically a workbook section in CPMAI. Includes serving architecture, environment selection, monitoring approach, ownership, performance baseline, contingency triggers.
- Iteration trigger: Deployment plan reveals an architectural assumption from Domain IV that doesn't fit production constraints → loop back to IV.1 (technique selection) or IV.6 (operationalization gate).
- Escalation trigger: Deployment plan requires infrastructure, security, or regulatory approval beyond the project's authority.
- Wrong-answer trap: "Have the ML engineer deploy the model and write the plan after." Plans precede deployment, not the reverse — production incidents often trace back to "we didn't plan for this."
- Question pattern signal: Stems mentioning "the model is ready to deploy," "the team is preparing for production," "before deployment begins" — testing whether you stop and confirm a plan exists.
- ECO task tag: Domain V, Task 1 — Manage creation of AI solution deployment plan
Lesson 2: The "Inference" Phase of an AI Project
ML projects have a different cycle from traditional software:
| | Traditional App | ML Model |
|---|---|---|
| Phases | Design → Build → Test → Deploy → Manage | Training → Inference |
| Output | Code that runs deterministic logic | Model that returns probabilistic predictions |
| Production lifecycle | Stable until updated | Continuously evolves with data drift |
KEY TAKEAWAYS
- ML has two phases (training + inference), not five (design/build/test/deploy/manage).
- Inference is where the operational lifetime begins — and where most production failures surface.
Lesson 3: What Is Operationalization?
Operationalization is "the term AI practitioners use to describe putting machine learning models into real-world environments." Note PMI's distinction:
- Operationalization ≠ deployment. Deployment is the act of placing the model. Operationalization is the broader process of ensuring it works, is monitored, governed, and maintained.
A model can be placed anywhere it needs to provide predictions: in a mobile app, on a server, in the cloud, on a desktop, in a web browser (via JavaScript), on an edge device, or even in a medical imaging system. The location is a deployment-plan decision, not a default.
KEY TAKEAWAYS
- Operationalization is the broader process; deployment is the placement act.
- The placement choice is explicit — anywhere from a mobile app to medical imaging.
Lesson 4: Model Operationalization and Life Cycle Questions
PMI lists the questions a project must answer before it's "successfully operationalized." Two categories:
Operationalization Questions:
- How will you put the model into operation? (Batch / real-time / microservices / hybrid)
- Where will you operationalize? (On-prem / cloud / edge / hybrid)
- How will you manage versioning? (Multiple model versions in production simultaneously)
- What guidance for developers on model scaffolding?
- How will you implement GenAI in production?
Life Cycle Questions:
- Approach for life cycle management?
- Integration with DevOps/DevSecOps?
- Tools for MLOps?
- Methods for monitoring (usage, abuse, performance, drift)?
- How to manage the data life cycle in production?
- Approaches for governance and security?
The PM doesn't answer these technically — the PM ensures they're answered, documented, and signed off as part of V.1.
KEY TAKEAWAYS
- Two categories of pre-deployment questions: operationalization (how/where/versioning/scaffolding/GenAI) and life cycle (lifecycle/DevOps/MLOps/monitoring/data/governance).
- All questions must be documented and answered before deployment, even if the answer is "iterative — start simple."
Module 2: The Four AI Technology Environments
Lessons 5-9 | The four distinct technology environments AI requires, and why one platform can't do it all.
Lesson 5: The Four AI Environments — Overview
There is no "one platform to rule them all" for AI. Different roles and stages have different requirements. PMI defines four core environments:
- Model Development and Training Environment — where data scientists build and train models
- Big Data / Data Engineering Environment — where data is processed and pipelines run
- Model Scaffolding Environment — where models are integrated into apps
- Model Operationalization Environment — where models run in production
The PM coordinates whichever combination the project needs.
KEY TAKEAWAYS
- Four environments: Development, Big Data/Data Engineering, Scaffolding, Operationalization.
- No single platform fits all four — expect a multi-tool stack.
💡 Memory Aid — DBSO (Four Environments)
Development (build the model), Big data/engineering (move and prep data), Scaffolding (integrate the model into apps), Operationalization (run the model in production). "Don't Build Stuff Outside" — build it across all four.
Lesson 6: Model Development and Training Environment
Where data scientists build, experiment, and train models. Typically: Jupyter notebooks, ML frameworks (TensorFlow, PyTorch, scikit-learn), GPU compute, experiment tracking tools (MLflow, Weights & Biases), and access to staged training data.
PM concern: ensuring the development environment has the compute, data access, and tooling the data science team needs — coordinated as part of V.1 / aligns with III.4 (workspace).
KEY TAKEAWAYS
- Development environment = build/train tooling (notebooks, frameworks, GPUs, tracking).
- PM coordinates access and capacity, doesn't pick the framework.
Lesson 7: Big Data / Data Engineering Environment
Where data is gathered, cleaned, transformed, and pipelined. Typically: data warehouses, data lakes, ETL/ELT tools (Spark, Airflow, dbt), streaming platforms (Kafka, Kinesis), and data catalogs.
PM concern: ensuring data pipelines are built and operate reliably. Pipeline ownership is a deployment-plan question (V.1) and an ongoing operational concern (V.4).
KEY TAKEAWAYS
- Big data / data engineering environment = data pipelines, warehouses, lakes, streaming.
- Pipeline ownership is named in the deployment plan.
Lesson 8: Model Scaffolding Environment
The "scaffolding" environment is where the model gets integrated into the application or system that consumes it. This is the layer where developers (often without ML backgrounds) interact with the model — APIs, SDKs, client libraries, integration patterns.
PM concern: providing developer guidance on how to consume the model, version it, handle errors, and fall back gracefully. Often overlooked until production incidents reveal that consumers don't know what to do when the model returns unexpected results.
KEY TAKEAWAYS
- Scaffolding environment = integration layer between the model and the app/system that uses it.
- Developer guidance is PM-coordinated, not assumed to exist.
Lesson 9: Model Operationalization Environment
The production environment where the model runs and serves predictions. Typically: model-serving infrastructure (TensorFlow Serving, TorchServe, Seldon, SageMaker, Vertex AI, custom REST APIs), monitoring stack, logging, and version routing.
PM concern: ensuring the operationalization environment matches the deployment plan's choices for serving method (batch/real-time/microservice/stream) and location (on-prem/cloud/edge/hybrid).
KEY TAKEAWAYS
- Operationalization environment = production serving infrastructure + monitoring + logging + version routing.
- It must match the V.1 deployment plan choices.
Module 3: Deployment — How and Where
Lessons 10-20 | The how (serving methods) and where (locations) of model deployment.
Lesson 10: ECO Task V.2 — Manage AI Solution Deployment
Deployment is the act of moving the model from staging into production according to the V.1 plan. The PM doesn't push the model — the ML engineer or DevOps team does. The PM manages the deployment — coordinates timing, ensures the plan is followed, surfaces blockers, and confirms success criteria are met before declaring deployment complete.
A subtle but exam-critical point: deployment success is not "the model is running." Deployment success is "the model is running, monitored, performant against the success criteria, and in compliance with governance." The PM declares completion against the full criteria, not the runtime check.
KEY TAKEAWAYS
- The PM manages deployment — coordinates and confirms — but does not push.
- Deployment success = runtime + monitoring + performance + compliance, not just runtime.
PM Oversight Angle
- PM owns: Coordinating execution of the V.1 deployment plan; surfacing blockers; confirming success against documented criteria; declaring deployment complete.
- Deliverable: Deployment Status Report — documented confirmation that the plan was executed, monitoring is live, performance baseline met, governance in place.
- Iteration trigger: Deployment reveals an unmet plan requirement (monitoring gap, performance miss, governance gap) → halt deployment, address, then resume — do not declare complete with gaps.
- Escalation trigger: Deployment failure that cannot be remedied within the plan; rollback decision; security or compliance incident.
- Wrong-answer trap: "Declare deployment successful as soon as the model is serving requests." Misses monitoring, performance, and governance criteria.
- Question pattern signal: Stems mentioning "the model has been deployed," "deployment is in progress," "the team is ready to declare deployment complete."
- ECO task tag: Domain V, Task 2 — Manage AI solution deployment
Lesson 11: Batch Prediction
The model runs on a schedule (e.g., nightly) against a batch of inputs and produces a batch of predictions. Used when:
- Predictions don't need to be real-time.
- Inputs are accumulated over time.
- Cost optimization matters (batch is typically cheaper than real-time).
Examples: monthly customer churn predictions, weekly fraud risk scoring, daily demand forecasts.
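To make "scheduled, accumulated inputs" concrete, here's a minimal batch-scoring sketch — all names (`load_model`, `load_inputs`, the output path) are illustrative stand-ins, and a real job would run under a scheduler such as cron or Airflow:

```python
# Minimal batch-scoring sketch: run once per cycle (e.g., nightly), score
# everything accumulated since the last run, write results out.
# All names here are illustrative placeholders.
import csv
import datetime

def load_model():
    # Stand-in for loading a serialized model artifact.
    return lambda row: 0.9 if row["days_inactive"] > 30 else 0.1

def load_inputs():
    # Stand-in for pulling the accumulated batch from a warehouse.
    return [{"customer_id": 1, "days_inactive": 45},
            {"customer_id": 2, "days_inactive": 3}]

def run_batch():
    model = load_model()
    out_path = f"churn_scores_{datetime.date.today().isoformat()}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "churn_score"])
        writer.writeheader()
        for row in load_inputs():
            writer.writerow({"customer_id": row["customer_id"],
                             "churn_score": model(row)})

if __name__ == "__main__":
    run_batch()  # the scheduler invokes this once per cycle
```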
KEY TAKEAWAYS
- Batch = scheduled, non-real-time, cost-efficient prediction mode.
- Best fit when inputs accumulate and predictions can wait.
Lesson 12: Microservices for AI
The model is exposed as an API that other services call on demand. Each call returns a prediction synchronously. Used when:
- Predictions are needed on-demand but volume is moderate.
- The model needs to integrate with existing application architecture.
- Different consumers need different SLAs.
Examples: a recommendation API called by the product page, a sentiment-analysis service called by support tooling.
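A toy sketch of the pattern using FastAPI (one common serving framework, not a CPMAI prescription) — the endpoint, request fields, and scoring function are all placeholders:

```python
# Toy model-as-a-microservice sketch (FastAPI is one common choice).
# Each POST /predict call returns one prediction synchronously.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    values: list[float]

def predict_fn(values: list[float]) -> float:
    # Stand-in for a loaded model; returns a fake score.
    return sum(values) / max(len(values), 1)

@app.post("/predict")
def predict(features: Features) -> dict:
    return {"score": predict_fn(features.values)}

# Run with: uvicorn serve:app   (assuming this file is saved as serve.py)
```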
KEY TAKEAWAYS
- Microservices = on-demand API, synchronous, integrates with app architecture.
- Best when predictions are moderate-volume and consumer-driven.
Lesson 13: Real-Time Prediction
The model produces predictions in real time on continuously incoming data, typically with strict latency requirements. Used when:
- Predictions must be returned in milliseconds.
- The application can't function without immediate prediction results.
- Volume is high and latency-sensitive.
Examples: fraud detection at point-of-sale, autonomous vehicle perception, real-time bidding.
KEY TAKEAWAYS
- Real-time = millisecond-latency, high-volume; latency is the success criterion.
- Best when the app fails without immediate predictions.
Lesson 14: Stream Learning
Stream learning continuously updates the model as new data arrives — the model both serves predictions and learns from incoming data simultaneously. Distinct from real-time prediction, which serves predictions but doesn't update the model.
Used when:
- Data distributions change rapidly (concept drift).
- The model must adapt without explicit retraining cycles.
Examples: recommendation engines with rapidly changing tastes, financial trading models, adaptive ad targeting.
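A plain-Python sketch of the serve-and-learn loop — one SGD step on a logistic model per arriving example, with synthetic data standing in for the stream; production systems use streaming-ML libraries, but the shape is the same:

```python
# Stream-learning sketch: the model serves a prediction for each arriving
# example AND updates its weights once the outcome is observed.
import math
import random

weights = [0.0, 0.0]
bias = 0.0
LR = 0.1  # learning rate

def predict(x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability of class 1

def learn(x, y):
    # One SGD step on log loss for a single example: gradient is (p - y) * x.
    global bias
    err = predict(x) - y
    for i in range(len(weights)):
        weights[i] -= LR * err * x[i]
    bias -= LR * err

random.seed(0)
for _ in range(1000):                 # stands in for an endless event stream
    x = [random.random(), random.random()]
    p = predict(x)                    # serve: prediction available immediately
    y = 1 if x[0] + x[1] > 1 else 0   # outcome arrives later in reality
    learn(x, y)                       # learn: model adapts to the new example

print(f"learned weights: {weights}, bias: {bias:.2f}")
```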
KEY TAKEAWAYS
- Stream learning = continuous model update + serving in one stream.
- Different from real-time: real-time serves; stream serves AND learns.
Lesson 15: Cold Path vs Hot Path Analytics
Not unique to AI but shows up in deployment-plan questions:
| Path | Latency | Use Case |
|---|---|---|
| Hot path | Low — milliseconds to seconds | Real-time alerts, fraud detection, dashboards |
| Cold path | High — minutes to hours | Long-term aggregation, reporting, historical analysis |
A typical AI architecture uses both: hot path for immediate response, cold path for retraining and longer-horizon analytics.
💡 Memory Aid — Hot vs Cold
Hot = Hurry (milliseconds, alerts, real-time). Cold = Consider (hours, aggregation, retraining).
KEY TAKEAWAYS
- Hot path = low latency, immediate response.
- Cold path = high latency, aggregation/reporting/retraining.
- Most AI systems run both.
Lesson 16: On-Premises Deployment
The model runs on infrastructure the organization owns and operates. Reasons:
- Data residency / regulatory requirements (some jurisdictions require on-prem).
- Sensitive data that can't leave organizational boundaries.
- Existing on-prem infrastructure investment.
- Latency-critical applications where round-trip to cloud is too slow.
Trade-offs: higher operational burden, harder to scale, more capital expense.
KEY TAKEAWAYS
- On-prem deployment = organizational control, regulatory fit, latency-sensitive.
- Trade-off: higher operational burden, harder scaling, capital expense.
Lesson 17: Edge Device Deployment
The model runs on a device close to the data source — a phone, an IoT sensor, a self-driving vehicle, a medical device. Reasons:
- Latency: predictions can't wait for a cloud round-trip.
- Bandwidth: too much data to send to cloud.
- Privacy: data shouldn't leave the device.
- Connectivity: device may operate offline.
Trade-offs: limited compute and memory, harder to update, model must be smaller.
KEY TAKEAWAYS
- Edge deployment = on the device — phones, IoT, vehicles, medical equipment.
- Trade-offs: constrained compute, harder updates, smaller model footprint.
Lesson 18: Cloud ML Deployment
The model runs in a cloud provider's managed environment — AWS SageMaker, Google Vertex AI, Azure ML, or similar. Reasons:
- Scale on demand.
- Managed infrastructure (less ops burden).
- Integrated tooling (training, serving, monitoring).
- Pay-as-you-go cost model.
Trade-offs: vendor lock-in, data egress costs, regulatory considerations.
KEY TAKEAWAYS
- Cloud deployment = scale, managed services, integrated tools, pay-as-you-go.
- Trade-offs: lock-in, egress, regulatory fit.
Lesson 19: Self-Hosted vs API-Hosted GenAI Models
Specific to GenAI, two production patterns:
| | Self-Hosted | API-Hosted |
|---|---|---|
| Where | Your infrastructure | Vendor's infrastructure (OpenAI, Anthropic, etc.) |
| Control | Full — model, data, latency, privacy | Limited — vendor's terms apply |
| Cost | High up-front (compute, storage, ops) | Pay-per-token, no infra cost |
| Privacy | Data never leaves your boundary | Data goes through vendor (review terms) |
| Best for | Regulated data, custom fine-tuning, high-volume | Prototyping, low-volume, varied use cases |
The PM coordinates this decision against trustworthy-AI constraints (Domain I) and the deployment plan (V.1).
KEY TAKEAWAYS
- Self-hosted = control, privacy, capital cost.
- API-hosted = speed, scale, vendor terms.
- The choice ties to trustworthy-AI constraints and operational scale.
Lesson 20: Risks of GenAI in Production
PMI lists specific GenAI production risks (Phase I covers these, but Domain V tests deployment-time mitigations):
- Hallucination — confidently producing false information.
- IP misappropriation — model trained on copyrighted material.
- Inappropriate responses — harmful, biased, or offensive content.
- Prompt injection — malicious users manipulating model behavior.
- Private data sharing — sensitive data exposed via prompts to public LLMs.
Mitigations belong in the deployment plan (V.1) and ongoing monitoring (V.4): content filters, prompt validation, output review, escalation paths, audit logging.
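As a deliberately toy illustration of deployment-time mitigation, here's a prompt-validation sketch — the deny-list and PII patterns are hypothetical, and production systems use dedicated content-safety tooling rather than hand-rolled regexes:

```python
# Toy prompt-validation sketch: reject obvious injection attempts and scrub
# obvious private data before the prompt reaches the model. Patterns are
# illustrative only — not a real content-safety implementation.
import re

DENY_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),  # crude injection tell
]
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-shaped strings
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
]

def validate_prompt(prompt: str) -> str:
    for pat in DENY_PATTERNS:
        if pat.search(prompt):
            raise ValueError("prompt rejected: possible injection attempt")
    for pat in PII_PATTERNS:
        prompt = pat.sub("[REDACTED]", prompt)  # scrub private data
    return prompt

print(validate_prompt("Summarize this note from alice@example.com"))
# -> Summarize this note from [REDACTED]
```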
💡 Memory Aid — HIIPP (5 GenAI Risks)
Hallucination, IP misappropriation, Inappropriate responses, Prompt injection, Private data sharing. (Same mnemonic as Phase I — same risks, deployment-stage tests.)
KEY TAKEAWAYS
- 5 GenAI risks: HIIPP — hallucination, IP, inappropriate, prompt injection, private data.
- Mitigations are deployment-plan items (V.1) and ongoing monitoring (V.4), not afterthoughts.
Module 4: Continuous Operations and Life Cycle
Lessons 21-29 | Why deployment isn't done — managing ongoing operation, drift, and contingency.
Lesson 21: Failure Reason — AI Life Cycles Are Continuous
A common AI project failure: organizations treat the model as a "one-and-done" deliverable. They deploy, declare success, and move on. Six months later, the model is producing degraded predictions — and nobody is monitoring.
PMI's framing: AI project life cycles are continuous. Models drift. Data shifts. The world changes. The deployment plan must include continuous management, monitoring, and iteration — or the project is failing on a delayed timer.
KEY TAKEAWAYS
- AI projects don't end at deployment — they continue through monitoring, drift detection, and iteration.
- "Deployed and done" is a project failure mode on a delay.
Lesson 22: AI Life Cycle Challenges — COVID-19 E-Commerce Example
PMI's classic real-life example: e-commerce demand forecasting models trained pre-2020 dramatically failed during COVID-19. Consumer behavior shifted overnight; training data no longer reflected reality.
The lesson: external shocks invalidate model assumptions. The deployment plan must include monitoring for distribution shift and a contingency plan for retraining or rollback when shift is detected.
KEY TAKEAWAYS
- External shocks can invalidate model assumptions overnight (COVID-19 / e-commerce).
- Monitoring for distribution shift + contingency plan = required, not optional.
Lesson 23: ECO Task V.4 — Oversee AI Solution Metrics
The PM oversees metrics that track AI solution health in production. Metrics include:
- Business KPIs — what the project was supposed to deliver (revenue, cost reduction, accuracy of operational decisions).
- Model performance — accuracy, precision, recall, F1, latency, throughput, drift indicators.
- Operational metrics — uptime, error rate, request volume, prediction-to-action conversion.
Metrics must tie back to the success criteria from Domain II (Task II.8). Without that linkage, you're measuring without a baseline.
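A sketch of that linkage — production metrics compared against Domain II success criteria; every metric name, target, and number below is hypothetical:

```python
# Sketch of V.4 metric oversight: a metric only means something relative to
# the success criteria baselined in Domain II (Task II.8).
SUCCESS_CRITERIA = {              # hypothetical Domain II baselines
    "precision": 0.90,
    "latency_p95_ms": 200,
    "monthly_cost_reduction": 50_000,
}
HIGHER_IS_BETTER = {"precision": True, "latency_p95_ms": False,
                    "monthly_cost_reduction": True}

production_metrics = {"precision": 0.87, "latency_p95_ms": 180,
                      "monthly_cost_reduction": 61_000}

for name, target in SUCCESS_CRITERIA.items():
    actual = production_metrics[name]
    ok = actual >= target if HIGHER_IS_BETTER[name] else actual <= target
    status = "OK" if ok else "BREACH -> contingency / escalate per plan"
    print(f"{name}: actual={actual} target={target} [{status}]")
```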
KEY TAKEAWAYS
- Metrics span business KPIs, model performance, operational health.
- Metrics must link to Domain II success criteria for context.
💡 Memory Aid — BMO Metrics
Business KPIs, Model performance, Operational health. Three tiers, all monitored, all linked to Phase I/II success criteria.
PM Oversight Angle
- PM owns: Overseeing the metrics regime — ensuring metrics are defined, instrumented, monitored, and reviewed against success criteria.
- Deliverable: Metrics Plan + dashboards / reports per the deployment plan. Reviewed cadence (weekly/monthly), with stakeholder access.
- Iteration trigger: Metrics consistently below threshold → loop back to investigate root cause (data drift, model decay, scope miss). May trigger retraining or rescoping.
- Escalation trigger: Sustained metric breach affecting business KPIs; trustworthy-AI metric breach (bias, privacy incident); SLA violation.
- Wrong-answer trap: "Have the data scientist watch model performance and let the PM know if there's an issue." Metric ownership is PM-driven, not data-scientist-on-call.
- Question pattern signal: Stems mentioning "model performance has degraded," "the business KPIs are below target," "the metrics show drift."
- ECO task tag: Domain V, Task 4 — Oversee AI solution metrics
Lesson 24: Model Life Cycle Management
Model life cycle management covers the model's entire production lifetime: deployment, monitoring, versioning, retraining, retirement. The PM coordinates the cadence, owners, and triggers for each:
- Deployment — V.2 task, executed via plan.
- Monitoring — V.4 task, ongoing.
- Versioning — version control of model artifacts, training data, and serving infrastructure.
- Retraining — triggered by drift detection or scheduled refreshes.
- Retirement — sunset old versions; document the replacement.
KEY TAKEAWAYS
- 5 life cycle phases: deployment, monitoring, versioning, retraining, retirement.
- Each has owners, triggers, and documentation requirements.
Lesson 25: Managing the Data Life Cycle (in Production)
Data drift — the production data slowly diverging from training data — is a primary cause of model decay. The PM coordinates:
- Monitoring data inputs for distribution shift.
- Updating data pipelines as new data types or formats appear.
- Adapting data prep to changes.
- Triggering retraining when shift exceeds tolerance.
This is the production extension of the data life cycle you learned in Domain III (Lesson 29).
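One common way to quantify "shift exceeds tolerance" is the Population Stability Index (PSI); the sketch below uses rule-of-thumb thresholds (~0.1 watch, ~0.25 act) that vary by organization, with synthetic data standing in for real features:

```python
# Illustrative data-drift check: PSI between the training-time distribution
# of a feature and its current production distribution.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between expected and actual samples."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, 10_000)    # what the model was trained on
production = rng.normal(0.5, 1.2, 10_000)  # shifted production inputs

score = psi(training, production)
if score > 0.25:
    print(f"PSI={score:.3f}: major shift — trigger retraining review")
elif score > 0.10:
    print(f"PSI={score:.3f}: moderate shift — watch closely")
else:
    print(f"PSI={score:.3f}: stable")
```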
KEY TAKEAWAYS
- Data drift = production data diverging from training distribution.
- Monitoring + adaptation + retraining triggers = continuous data life cycle in production.
Lesson 26: ECO Task V.6 — Manage AI Solution Transition Plan
Transition plans cover handoffs: model handed from project team to operations team, model handed from one operations team to another (reorg, vendor change), or AI solution transitioned to a successor project. The PM owns the transition plan — what's documented, what's transferred, what knowledge needs to move with the artifact.
Asterisked task: the first-attempt form had no V.6 questions. The retake form may or may not. Cover it.
KEY TAKEAWAYS
- Transition plan = handoff documentation + knowledge transfer + ownership change.
- Often triggered by team change, vendor change, or successor project.
PM Oversight Angle
- PM owns: Producing a transition plan documenting handoff, ownership, knowledge transfer, and ongoing maintenance arrangements.
- Deliverable: AI Solution Transition Plan — recipient teams, documented artifacts, training/onboarding, escalation paths, sign-off from receiving team.
- Iteration trigger: Transition reveals undocumented dependencies → produce documentation, do not transfer until complete.
- Escalation trigger: No receiving team identified or willing; capability gap that prevents safe transition.
- Wrong-answer trap: "Email the operations team and let them figure it out." Transition is documented, signed off, and the receiving team confirms readiness.
- Question pattern signal: Stems mentioning "the project team is wrapping up," "the model is being handed to operations," "the vendor contract is ending and we need to bring it in-house."
- ECO task tag: Domain V, Task 6 — Manage AI solution transition plan
Lesson 27: DevOps for AI
DevOps integrates development with operations — continuous integration, continuous deployment, monitoring, automation. For AI projects, DevOps practices apply but with added complexity from model artifacts and data dependencies.
DevSecOps adds security as a first-class concern. AI projects often need DevSecOps because of regulated data and trustworthy-AI requirements.
KEY TAKEAWAYS
- DevOps = CI/CD/monitoring/automation for software.
- DevSecOps adds security as first-class.
- AI projects benefit from DevOps practices, with extensions for model and data artifacts.
Lesson 28: MLOps — Machine Learning Operations
MLOps is DevOps adapted for ML. It addresses:
- Model versioning and lineage — tracking which model version was trained on which data with which hyperparameters.
- Reproducibility — being able to recreate any production model.
- Continuous training — automated retraining pipelines.
- Continuous deployment — automated rollout of new model versions.
- Monitoring — model and data drift detection, performance metrics.
The PM doesn't build the MLOps stack but ensures the project plan accounts for it as a capability the operations team must have.
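A minimal sketch of the lineage record behind "versioning and reproducibility" — tools like MLflow maintain this automatically; the field names here are illustrative:

```python
# Sketch of model lineage metadata: which model version, trained on which
# data, with which hyperparameters, when. Stored alongside the artifact.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ModelLineage:
    model_version: str
    training_data_hash: str   # fingerprint of the exact training set
    hyperparameters: dict
    trained_at: str

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:12]

record = ModelLineage(
    model_version="churn-v2.3.1",
    training_data_hash=fingerprint(b"...serialized training set..."),
    hyperparameters={"learning_rate": 0.1, "max_depth": 6},
    trained_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(record), indent=2))
```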
💡 Memory Aid — MLOps vs DevOps
DevOps = code-driven CI/CD. MLOps = code + model + data CI/CD. MLOps adds model versioning, data lineage, retraining automation.
KEY TAKEAWAYS
- MLOps extends DevOps with model + data + retraining + drift monitoring.
- The PM ensures the deployment plan accounts for MLOps capability.
Lesson 29: ECO Task V.7 — Oversee AI Solution Contingency Plan
A contingency plan documents what happens when things go wrong: the model breaks, predictions are unreliable, an incident occurs, a data feed fails. The PM oversees creation of the plan and ensures it's tested and ready before production goes live.
Common contingency scenarios:
- Model failure → fallback to previous version, rule-based system, or human handoff.
- Data feed failure → alternative source, cached predictions, graceful degradation.
- Performance breach → automated rollback or pause.
- Trustworthy-AI incident → containment, audit, stakeholder notification, regulatory reporting.
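A sketch of the first scenario — model failure falling back to the previous version, then to a rule-based default, with each degradation logged so it's visible; all predictors are placeholders:

```python
# Contingency-fallback sketch: try the current model, fall back in a
# documented order, log every step, escalate only when all options fail.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("contingency")

def current_model(x):
    raise RuntimeError("model v2 unavailable")  # simulate a failure

def previous_model(x):
    return 0.42                                 # last known-good version

def rule_based_default(x):
    return 0.5                                  # conservative fallback

def predict_with_fallback(x):
    for name, fn in [("current", current_model),
                     ("previous", previous_model),
                     ("rules", rule_based_default)]:
        try:
            return fn(x), name
        except Exception as exc:
            log.warning("predictor %r failed: %s — falling back", name, exc)
    raise RuntimeError("all predictors failed — escalate per contingency plan")

score, source = predict_with_fallback({"amount": 120})
print(f"score={score} served_by={source}")
```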
KEY TAKEAWAYS
- Contingency plans = what to do when AI breaks.
- Tested and ready before production, not improvised after.
PM Oversight Angle
- PM owns: Overseeing creation of the contingency plan covering model failure, data feed failure, performance breach, trustworthy-AI incidents.
- Deliverable: AI Solution Contingency Plan — scenarios, triggers, response procedures, owners, escalation paths, tested.
- Iteration trigger: Contingency plan reveals an unmitigated risk → loop back to V.1 deployment plan to address before production.
- Escalation trigger: A scenario that exceeds the project's mitigation capability and requires external dependency or executive decision.
- Wrong-answer trap: "Document the contingencies and trust the operations team to respond." Plans are tested, not just documented, before production.
- Question pattern signal: Stems mentioning "the model has stopped working," "predictions are unreliable," "a data feed has failed," "an incident has occurred."
- ECO task tag: Domain V, Task 7 — Oversee AI solution contingency plan
Module 5: Governance, Monitoring, and Trustworthy AI in Production
Lessons 30-37 | The governance, monitoring, and trustworthy-AI controls that run continuously in production.
Lesson 30: ECO Task V.3 — Oversee Model Governance
Model governance "provides controls, processes, procedures, and organizational guidance on how models are built, iterated, used, and shared." The PM oversees governance throughout production — ensuring it's defined, applied, and audited.Model governance covers:
- Access control, authorization, security — who can use which model, with what authorization.
- Provenance and auditing — documentation of how the model was trained, tested, deployed.
- Audit logs — usage and iteration history.
- Version control — supporting multiple model versions in production simultaneously.
- Bias monitoring — measuring and addressing informational bias.
- Sharing and extension controls — governing how others use, fine-tune, or extend the model.
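A sketch of the per-prediction record behind "provenance and audit logs" — field names are illustrative, and a real store would be append-only and access-controlled:

```python
# Sketch of a prediction audit record: enough to answer "which model version
# produced this decision, from what input, when, and did a human override it?"
import hashlib
import json
from datetime import datetime, timezone

def audit_record(model_version: str, features: dict, prediction: float,
                 override: str | None = None) -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_hash": hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()).hexdigest(),
        "prediction": prediction,
        "human_override": override,  # human-in-the-loop decision, if any
    }

entry = audit_record("credit-v1.4.0", {"income": 52_000, "age": 34}, 0.81)
print(json.dumps(entry, indent=2))
```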
KEY TAKEAWAYS
- Model governance = controls + processes + procedures + organizational guidance for production models.
- 6 components: access, provenance/auditing, audit logs, versioning, bias monitoring, extension controls.
💡 Memory Aid — APAVBE Governance
Access control, Provenance/auditing, Audit logs, Version control, Bias monitoring, Extension controls.
PM Oversight Angle
- PM owns: Overseeing the model governance program — ensuring governance is defined, applied, audited; coordinating governance with Trustworthy AI Framework (Domain I) and organizational policy.
- Deliverable: Model Governance Plan + ongoing governance reports (audits, access reviews, version logs).
- Iteration trigger: Governance gap detected (unauthorized access, version drift, missing audit) → halt, address, audit before resuming.
- Escalation trigger: Governance incident (bias breach, security incident, audit failure); regulatory inquiry.
- Wrong-answer trap: "Have the ML engineer maintain governance documentation." Governance is PM-overseen, organizationally-aligned, not engineer-maintained.
- Question pattern signal: Stems mentioning "an updated model was deployed," "access to the model is being granted," "the model needs an audit," "different versions of the model exist in production."
- ECO task tag: Domain V, Task 3 — Oversee model governance
Lesson 31: Model Deployment with Governance Framework
Why governance is needed during deployment and beyond:
- Models change. Even if the model itself doesn't, the world, data, and environment do. Models are versioned and iterated continuously.
- Thresholds shift. Initially-acceptable accuracy (e.g., 92%) may rise (95%).
- Sensitivity changes. False-positive vs false-negative tolerance evolves.
- Use cases expand. A batch model may need real-time. An old API may need new capabilities.
- External users emerge. Other teams may extend or fine-tune your model in unintended ways.
The governance framework includes: model versioning, continuous evaluation, deployment management, iteration controls, A/B testing, access control, security, extension controls.
KEY TAKEAWAYS
- Models need governance because models change, thresholds shift, use cases expand, external users emerge.
- Governance framework spans versioning, evaluation, deployment, iteration, A/B, access, security, extension.
Lesson 32: Model Monitoring
Model monitoring verifies that the production model performs as expected. Key aspects:
- Performance measures — latency, response time, request volume, prediction errors, accuracy, recall, F1, model-specific ML measures.
- Data visibility — monitoring inputs to detect drift; logging failures.
- Usage visibility — how is the model actually being used? Useful for training future versions.
- Drift measurement — both model drift (model performance degrading) and data drift (input data distribution shifting).
- Multi-dimensional dashboards — performance across user segments, time periods, geographies.
Model drift and data drift are inevitable, not exceptional. Monitoring quantifies how much.
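A sketch of the model-drift side: rolling-window accuracy against a baseline, alerting when the tolerance is breached — window size, baseline, and tolerance are illustrative numbers:

```python
# Rolling model-drift monitor sketch: track accuracy over the last WINDOW
# labeled outcomes and alert when it drops below baseline minus tolerance.
import random
from collections import deque

WINDOW = 500
BASELINE_ACCURACY = 0.92
TOLERANCE = 0.03

outcomes = deque(maxlen=WINDOW)

def record_outcome(predicted, actual) -> None:
    outcomes.append(predicted == actual)
    if len(outcomes) == WINDOW:
        acc = sum(outcomes) / WINDOW
        if acc < BASELINE_ACCURACY - TOLERANCE:
            alert(acc)

def alert(acc: float) -> None:
    # In production: page the owner named in the deployment plan.
    print(f"DRIFT ALERT: rolling accuracy {acc:.3f} below "
          f"{BASELINE_ACCURACY - TOLERANCE:.3f}")

# Simulate decay: predictions start accurate, then degrade to ~80% correct.
random.seed(1)
for i in range(1000):
    pred, actual = 1, 1 if (i < 500 or random.random() < 0.8) else 0
    record_outcome(pred, actual)
```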
💡 Memory Aid — Model Drift vs Data Drift
Model drift = the model's predictions degrade over time. Data drift = the inputs to the model shift over time. Drift is inevitable; monitoring is the response.
KEY TAKEAWAYS
- Monitoring covers performance, data, usage, drift, dimensional analysis.
- Both drifts are inevitable — monitoring quantifies them.
Lesson 33: Trustworthy AI Considerations in Production
Production AI must be:
- Compliant — regulatory frameworks and legal requirements in operating jurisdictions.
- Safe, Reliable, Secure — protected from malicious AI; doesn't cause harm.
- Ethical — addresses ethical and trustworthy concerns.
- Privacy-respecting — data privacy and security maintained.
This pulls heavily on Domain I (Trustworthy AI). Production is where Domain I requirements are actually enforced.
KEY TAKEAWAYS
- Production trustworthy-AI = compliant, safe/reliable/secure, ethical, privacy-respecting.
- Production is where Domain I is enforced, not theorized.
Lesson 34: AI System Safety and Reliability
Two distinct properties:
- Safety — AI must not endanger humans (through neglect or carelessness).
- Reliability — AI must operate as intended throughout its life cycle (software or hardware).
Both require: defined acceptable boundaries, fallback plans for when boundaries are crossed, resilience to operational failures.
PMI's example issue table (autonomous vehicles, facial recognition, content moderation) shows specific failure modes and mitigations — the PM understands these are project-level risks, not just technical concerns.
KEY TAKEAWAYS
- Safety = doesn't endanger. Reliability = operates as intended.
- Both require defined boundaries + fallback plans + resilience.
Lesson 35: Malicious AI
Malicious AI is the intentional use of AI for criminal, unethical, dangerous, or harmful purposes. Two main categories:
- Cyberthreats — attacks via intelligent bots, model poisoning, automated zero-day discovery.
- Physical threats — autonomous-vehicle hijacking, voice-assistant social engineering, physical denial-of-service.
Specific patterns:
- Adversarial attacks — manipulating input data to deceive ML models (e.g., a turtle image classified as a rifle).
- Model poisoning — corrupting training data so the model learns wrong patterns.
- Deepfakes — AI-generated fake media for fraud, public-figure impersonation, social engineering.
- AI-powered misinformation — generating and spreading fake content.
The PM coordinates defense via the contingency plan (V.7), monitoring (V.4), and trustworthy-AI framework (Domain I).
KEY TAKEAWAYS
- Malicious AI = cyberthreats + physical threats.
- Patterns: adversarial attacks, model poisoning, deepfakes, AI misinformation.
- Defense is layered across V.7 contingency, V.4 monitoring, Domain I framework.
Lesson 36: Securing Machine Learning Models
Security factors for production ML models:
- Protect the model — code, weights, architecture from theft or tampering.
- Protect training data — sensitive data shouldn't leak via the model or training pipeline.
- Protect inference data — runtime inputs may contain sensitive data.
- Detect adversarial inputs — flag and reject manipulated inputs.
- Audit access — who used the model, when, with what input.
Security is part of governance (V.3) and trustworthy-AI (Domain I), enforced through monitoring (V.4) and contingency (V.7).
KEY TAKEAWAYS
- ML security covers model + training data + inference data + adversarial detection + access audit.
- Enforced through governance, monitoring, and contingency.
Lesson 37: Resolving Issues of Ethical and Trustworthy AI
When ethical or trustworthy-AI issues arise in production, the PM coordinates resolution:
- Detect — monitoring surfaces the issue (bias, privacy breach, malicious use).
- Contain — pause or roll back the model if needed.
- Audit — document what happened, who was affected, what the cause was.
- Notify — stakeholders, regulators, affected users (per legal requirements).
- Remediate — fix the underlying issue (data, model, deployment).
- Document — feed lessons learned (V.5) for future projects.
The Trustworthy AI Framework (Domain I) provides the structural reference; Domain V is where execution happens.
KEY TAKEAWAYS
- 6 steps when ethical issue arises: Detect, Contain, Audit, Notify, Remediate, Document.
- Trustworthy AI Framework = reference. Domain V = execution.
Module 6: Closeout — Reporting, Limits, and Next Iteration
Lessons 38-40 | Closing the project loop with lessons learned, recognizing AI limits, and preparing for the next iteration.
Lesson 38: ECO Task V.5 — Prepare Final Report / Lessons Learned
The final Domain V deliverable: a formal report capturing what was built, what was learned, what worked, what didn't, and what's recommended for future iterations.
Required content per CPMAI methodology:
- Project summary — what was built and delivered.
- Performance against success criteria — how the project measured against Domain II success criteria.
- Lessons learned — what worked, what didn't, what to do differently.
- Recommendations — for next iterations, related projects, organizational learning.
- Outstanding risks and dependencies — what remains for ongoing operations.
Lessons learned are not "the end" — they're input to the next iteration's Phase I. CPMAI is iterative; the final report is the bridge.
KEY TAKEAWAYS
- Final report = project summary, performance vs criteria, lessons learned, recommendations, outstanding risks.
- Lessons learned feed the next iteration's Phase I, not the trash.
PM Oversight Angle
- PM owns: Producing the final report and lessons learned, ensuring it's reviewed by stakeholders and contributes to organizational AI learning.
- Deliverable: Final Report / Lessons Learned document — typically a workbook section in CPMAI plus a stakeholder briefing.
- Iteration trigger: Lessons learned reveal a fundamental scope or strategy issue → input to next iteration's Phase I.
- Escalation trigger: Lessons learned reveal organizational learning gaps that need program-level attention.
- Wrong-answer trap: "Send the deployment status report and call it the final report." Final report is broader — includes lessons learned, outstanding risks, recommendations.
- Question pattern signal: Stems mentioning "the project is wrapping up," "the deployment is complete," "the team is preparing closeout."
- ECO task tag: Domain V, Task 5 — Prepare final report / lessons learned
Lesson 39: The Limits of AI Technology
Even with successful operationalization, AI has hard limits:
- AI doesn't understand — pattern recognition isn't comprehension.
- AI doesn't reason causally — correlation, not causation.
- AI fails on out-of-distribution inputs — anything outside training data is unreliable.
- AI requires data — no data, no learning, no model.
- AI has no values — humans must define what "right" looks like.
- AI can't explain itself fully — even XAI gives explanations, not full understanding.
Recognizing limits is a Phase VI competency: it informs lessons learned, recommendations for the next iteration, and decisions about expanding or contracting the AI program.
KEY TAKEAWAYS
- AI's hard limits: doesn't understand, doesn't reason causally, fails OOD, needs data, has no values, can't fully explain.
- Recognition of limits is a Phase VI competency, not a defeatist stance.
Lesson 40: Phase VI Go/No-Go — Ready for Next Iteration
CPMAI is iterative. Phase VI ends not with "the project is done" but with "what's the next iteration?" The Phase VI go/no-go is a softer decision than Phase II/IV gates — it asks:
- Was the iteration's goal achieved?
- What did we learn?
- What's the next valuable iteration?
- Should the project continue, pivot, or pause?
The output is a recommendation, not a hard gate. The recommendation feeds back into Phase I (Business Understanding) for the next iteration.
KEY TAKEAWAYS
- Phase VI ends with a recommendation for the next iteration, not a hard gate.
- CPMAI's iterative design: Phase VI → next iteration's Phase I.
Quick Reference: Model Monitoring Checklist
| Category | What to Monitor | Why |
|---|---|---|
| Performance | Latency, throughput, error rate, accuracy, F1, recall | SLA + project success criteria |
| Drift — Model | Predictions degrading over time | Retraining trigger |
| Drift — Data | Input distribution shifting | Data pipeline + retraining trigger |
| Usage | Who is using the model, how, how often | Future training + governance |
| Bias | Fairness across user segments | Trustworthy AI compliance |
| Security | Adversarial inputs, unauthorized access, audit logs | Governance + safety |
Quick Reference: Trustworthy AI in Production
| Pillar | What to Verify |
|---|---|
| Compliant | Regulatory + legal in all operating jurisdictions |
| Safe | Doesn't endanger humans through neglect or carelessness |
| Reliable | Operates as intended throughout life cycle |
| Secure | Protected from malicious AI and adversarial attacks |
| Ethical | Addresses ethical and fairness concerns |
| Privacy-Respecting | Data privacy and security maintained |
Cross-Domain Links
- V.1 (Deployment Plan) ↔ Domain IV (Tasks IV.5, IV.6 — Gates): Deployment plan responds to the operationalization gate. Plan is the bridge from "ready for ops" to "running in ops."
- V.3 (Model Governance) ↔ Domain I (all 5 tasks): Privacy plan (I.1), transparency (I.2), bias checks (I.3), regulatory compliance (I.4), and accountability (I.5) are all enforced in production via V.3.
- V.4 (Solution Metrics) ↔ Domain II (Task II.8 — Success Criteria): Production metrics tie back to the success criteria from Domain II. Without that linkage, metrics are abstract.
- V.5 (Lessons Learned) ↔ Phase I (next iteration): Final report's lessons learned become input to the next iteration's Domain II business needs.
- V.7 (Contingency Plan) ↔ Domain I + Domain V tasks: Contingency covers trustworthy-AI incidents (Domain I), data feed failure (V.4 monitoring), model failure (V.4 + V.3).
- V.6 (Transition Plan) ↔ V.1 (Deployment Plan): Transition responds to changes in the deployment plan (new ops team, new vendor, new scope).
Knowledge Check
Question 1
The data scientist informs the PM that the model is ready to deploy and asks when production rollout can begin. The deployment plan has not yet been created. What should the PM do?
A. Approve deployment and have the team document the plan in parallel
B. Pause and coordinate creation of the deployment plan before deployment begins, including stakeholder sign-off
C. Have the ML engineer deploy to a staging environment while the plan is written
D. Schedule the deployment and have the data scientist write the plan that day
Click for answer and rationale
Correct: B
ECO Task V.1 requires a documented deployment plan with stakeholder sign-off before deployment. Production incidents trace back to "we didn't plan for this."
- A wrong: Concurrent deploy/plan = no plan when needed.
- C wrong: Wrong-answer trap — partial deployment without a plan still violates V.1.
- D wrong: Rushed plan without stakeholder sign-off doesn't meet the bar.
Question 2
A model has been running in production for 3 months. The data scientist notices that prediction accuracy has degraded by 8 percentage points. What should the PM do?
A. Have the data scientist retrain the model immediately
B. Investigate root cause (data drift, model decay, scope shift), engage stakeholders, and decide between retraining, rolling back, or rescoping per the contingency plan
C. Continue monitoring for another month to see if it's a temporary issue
D. Roll back to the previous version
Click for answer and rationale
Correct: B
ECO Tasks V.4 (metrics oversight) + V.7 (contingency) intersect. Performance breach triggers the contingency plan, not unilateral retraining.
- A wrong: Retraining without root-cause analysis is a technical workaround.
- C wrong: Wrong-answer trap — passive monitoring while production degrades.
- D wrong: Rollback may be right but should follow contingency-plan triage, not be the first move.
Question 3
A PM is told that the operationalization environment is ready and the model can be pushed at any time. Which is the BEST sequence?
A. Deploy → monitor → declare success
B. Deploy → declare success
C. Verify deployment plan executed → confirm monitoring is live → confirm performance baseline met → confirm governance in place → declare deployment complete
D. Deploy → wait 30 days → declare success
Click for answer and rationale
Correct: C
ECO Task V.2 requires deployment success against the full criteria — runtime + monitoring + performance + governance.
- A wrong: Misses governance and performance baseline confirmation.
- B wrong: Wrong-answer trap — runtime alone doesn't equal success.
- D wrong: Time isn't the criterion — verified criteria are.
Question 4
True or False: Model drift and data drift are exceptional events that indicate the deployment was flawed.
Click for answer and rationale
Correct: FALSE
PMI's framing: model drift and data drift are inevitable, not exceptional. Monitoring exists because drift happens. A deployment that assumes no drift is the flawed one.
Question 5
During a routine governance review, the PM discovers that an updated model version was deployed last month without going through change control. What is the BEST response?
A. Document the deployment retroactively to close the gap
B. Treat as a governance and accountability incident: validate the deployed version against requirements, document the deviation, escalate per accountability procedures, and reinforce change control
C. Roll back to the previous version immediately
D. Note it for next quarter's governance review
Click for answer and rationale
Correct: B
ECO Task V.3 (governance) crosses to Domain I Task 5 (accountability/audit trail). A bypass of change control is a governance incident requiring containment + audit + escalation, not retroactive documentation.
- A wrong: Retroactive documentation papers over the bypass.
- C wrong: Rollback may not be needed — validate first.
- D wrong: Wrong-answer trap — passive deferral allows the bypass pattern to repeat.
Question 6
The PM is finalizing the deployment plan and is asked whether to pre-define a contingency plan for model failure. What's the BEST PM response?
A. Defer contingency planning until after deployment and observe production for actual failure modes
B. Create the contingency plan now; test it pre-production; ensure response procedures, owners, and escalation paths are documented before going live
C. Have the operations team handle contingencies as they arise
D. Skip contingency planning since modern monitoring tools auto-detect failures
Click for answer and rationale
Correct: B
ECO Task V.7 requires contingency plans to be created, tested, and ready before production.
- A wrong: Wrong-answer trap — observing production for failures is not "planning."
- C wrong: Operations doesn't own AI-specific contingencies (model failure, data drift, trustworthy-AI incidents).
- D wrong: Auto-detection ≠ contingency response.
Question 7
A model has been running successfully for 6 months. A new vendor will take over operations of the model from the current team. What should the PM do?
A. Email the operations team and let them figure out the handoff
B. Coordinate a transition plan including documented artifacts, knowledge transfer, training, escalation paths, and sign-off from the receiving team
C. Have the data scientist train the new vendor's team on the model
D. Have the new vendor inherit the model as-is and start fresh on operations
Click for answer and rationale
Correct: B
ECO Task V.6 — transition plans are documented, signed off, and confirmed ready by the receiving team.
- A wrong: Wrong-answer trap — undocumented handoffs lose institutional knowledge.
- C wrong: Knowledge transfer is part of the plan, not a substitute for it.
- D wrong: "Start fresh" abandons documented governance, monitoring, and history.
Question 8
True or False: Lessons learned from a completed AI project are filed away for organizational record but don't influence ongoing or future projects.
Click for answer and rationale
Correct: FALSE
CPMAI is iterative. ECO Task V.5 — lessons learned are input to the next iteration's Phase I. Filing them away without integration breaks the methodology's iterative loop.
Question 9
A regulator inquires about how an AI-driven decision was made for a specific customer. The PM is asked to provide documentation. What's the BEST response?
A. Have the data scientist explain the model architecture
B. Provide the audit trail from model governance: input, model version, prediction, timestamp, decision rationale, and any human-in-the-loop overrides — sourced from the documented governance program
C. Decline since model decisions are confidential
D. Re-run the prediction and provide the new result
Click for answer and rationale
Correct: B
ECO Task V.3 (governance) crosses Domain I Task 5 (accountability). The audit trail is the prepared answer; this is exactly what governance documentation exists for.
- A wrong: Architecture explanation isn't decision-specific accountability.
- C wrong: Regulators have inquiry rights; declining is rarely correct.
- D wrong: Wrong-answer trap — re-running doesn't reproduce the original decision context.
Question 10
The team has deployed an AI model and is monitoring its performance. After 3 months, the model is still meeting performance targets. The PM is asked whether to declare the project complete. What's the BEST response?
A. Declare complete since performance targets are being met
B. Continue monitoring indefinitely — projects don't complete
C. Convene Phase VI close-out: document final report and lessons learned, decide on transition to operations or next iteration, capture outstanding risks and dependencies, formal stakeholder sign-off
D. Have the operations team take over and let them decide when to declare complete
Click for answer and rationale
Correct: C
ECO Task V.5 + Phase VI go/no-go. Project closeout has a structure: final report, lessons learned, transition plan, sign-off. Performance targets met = ready for closeout, not "automatically done."
- A wrong: Skips the closeout deliverables.
- B wrong: Projects do complete — operations continues, but the project closes formally.
- D wrong: Wrong-answer trap — closeout decision is PM-coordinated with stakeholders, not delegated to operations.
Memory Aids & Mnemonics Summary
| Mnemonic | What to Remember |
|---|---|
| HOPE-MS (Deployment Plan) | How served, Operation location, Performance, Escalation, Monitoring, Stakeholder sign-off |
| DBSO (4 AI Environments) | Development, Big data/engineering, Scaffolding, Operationalization |
| HIIPP (5 GenAI Risks) | Hallucination, IP misappropriation, Inappropriate, Prompt injection, Private data |
| BMO Metrics | Business KPIs, Model performance, Operational health |
| APAVBE Governance | Access, Provenance, Audit logs, Versioning, Bias, Extension controls |
| Hot vs Cold Path | Hot = Hurry (ms, alerts, real-time). Cold = Consider (hours, aggregation, retraining) |
| Model Drift vs Data Drift | Model drift = predictions degrade. Data drift = inputs shift. Both inevitable, monitor both. |
| MLOps vs DevOps | DevOps = code CI/CD. MLOps = code + model + data CI/CD + drift monitoring |
| Detect-Contain-Audit-Notify-Remediate-Document | 6 steps when an ethical AI issue arises in production |
| Limits of AI | No understanding, no causal reasoning, OOD failure, data-dependent, no values, no full self-explanation |
Closing reminders for Domain V
- AI projects don't end at deployment. "Deployed and done" is a project failure mode on a delay. Continuous monitoring, governance, and contingency are the methodology.
- The PM doesn't push the model. The PM ensures the deployment plan exists, the deployment is managed against it, governance is in place, metrics are monitored, and contingencies are tested.
- Domain V cross-domain links to Domain I are heavy. Trustworthy AI principles are enforced in production. When a stem mentions privacy breach, malicious AI, regulatory inquiry, or bias incident — Domain I + Domain V together.
- Drift is inevitable. Model drift and data drift are guaranteed; monitoring is the response. Stems that frame drift as a "surprise" are testing whether you treat it as exceptional (wrong) or expected (right).
Next:
domain-IV-model-dev-eval.md (PRIORITY 3 — Domain IV has the two highest-density gates: IV.5 and IV.6)