Domain III: Identify Data Needs — Comprehensive Study Guide

Exam weight: 26% of PMI-CPMAI exam (~31 scored questions)
Score-report framing: ❌ Below Target — PRIORITY 1 for rebuild
Maps to CPMAI methodology phase: Phase II — Data Understanding (and the front of Phase III — Data Preparation)
Number of ECO tasks: 9 (III.1 through III.9)
Estimated study time: 14 hours

Overview

Domain III is the largest weight in the exam (tied with Domain II) — and the one scored Below Target on first attempt. Every task in this domain begins with an oversight verb: define, identify, coordinate, gather, check, oversee, determine if, convey. Not one task says "build the dataset," "ingest the data," or "engineer the features." That work belongs to the data scientist, the data engineer, or the data steward — the project manager facilitates, documents, and decides.

The unifying pattern: Domain III is about ensuring the right data is identified, made available, and validated against the success criteria from Domain II — before the team commits to building anything. Most wrong-answer traps in Domain III are technically-correct moves that bypass either a documented requirement, a stakeholder decision, or a go/no-go gate.

This domain contains the first of three explicit go/no-go gates in the ECO (Task III.8 — Determine if data meets solution needs). Ten to fifteen exam questions concentrate around the three gates combined. Master the gate decision pattern and you secure outsized point value.

Table of Contents


Module 1: Foundation — The Role of Data in AI

Lessons 1-7 | Why data is the foundation of AI projects, and what "required data" actually means.

Lesson 1: ECO Task III.1 — Define Required Data

Before a single byte is collected, the project manager owns one question: what data does this AI project actually need? ECO Task III.1 — the entry point of Domain III — turns the business needs from Phase I into a documented, verifiable specification. Skipping this step is one of the most common reasons AI projects either fail or rework themselves into oblivion.

PMI's prep course breaks the activity into four sequenced steps: identify the required data type (driven by the AI pattern from Phase I), specify the attributes and field-level detail, identify the data sources (internal, external, mix), and identify the means to aggregate the data. The output of this task is a documented requirements specification — not a dataset.
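
As a concrete (hypothetical) illustration, the sketch below shows what one entry in such a specification could look like if captured in code; the dataclass fields mirror the four steps above, and the names and example values are invented for illustration, not drawn from PMI material.

```python
# Minimal sketch of one entry in a data requirements specification (Task III.1).
# Field names and example values are illustrative assumptions, not a CPMAI-mandated schema.
from dataclasses import dataclass

@dataclass
class DataRequirement:
    ai_pattern: str                # AI pattern from Phase I driving this need
    data_type: str                 # e.g., "image", "text", "structured"
    attributes: list[str]          # field-level detail: features, labels
    candidate_sources: list[str]   # internal, external, or a mix
    aggregation_method: str        # how multi-source data will be combined

requirement = DataRequirement(
    ai_pattern="recognition",
    data_type="image",
    attributes=["face_image", "identity_label", "capture_timestamp"],
    candidate_sources=["internal badge-photo system", "vendor dataset"],
    aggregation_method="nightly batch merge into a labeled training store",
)
print(requirement.attributes)
```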

KEY TAKEAWAYS

💡 Memory Aid — DRIP

When defining required data, run DRIP: Determine the data type (driven by the AI pattern), Required attributes, Identify sources, Plan aggregation.

PM Oversight Angle


Lesson 2: Data Fuels Intelligence

Data is the lifeblood of AI and has been since the field's beginnings in 1956. AI learns from data — without data, there is no learning, no generalization, no model. The implication for the PM: data scarcity, quality issues, or access blockers are project-level risks, not technical inconveniences. They need to surface in risk registers and Phase II go/no-go discussions, not get buried in data-engineering tickets.

KEY TAKEAWAYS


Lesson 3: The Data-First Approach

Modern AI assumes a "data-first" posture: identify the right data before committing to algorithms, infrastructure, or architecture. Choosing the model, framework, or compute first is backwards — those are downstream of what data exists, where it lives, what quality it has, and what trustworthy-AI constraints it carries.

A common failure mode you should recognize on the exam: "the team chose [framework / cloud platform / model architecture] and then went looking for data" — this scenario almost always tests Phase I/II discipline. The right answer reframes back to data understanding.

KEY TAKEAWAYS


Lesson 4: What Is Big Data?

Big Data is data at a scale where traditional storage, processing, and analysis tools no longer fit. PMI defines big data along four Vs: Volume, Velocity, Variety, and Veracity. Each "V" is not just a size descriptor — it's a challenge category that creates AI-project risk if not understood up front.

The PM's job here is not to know the technology stack that handles big data — it's to recognize when the project is dealing with big-data conditions and ensure the right SMEs and infrastructure are coordinated (Tasks III.2, III.4).

KEY TAKEAWAYS


Lesson 5: The 4 Vs of Big Data

| V | What It Means | Example |
|---|---|---|
| Volume | Massive amounts — petabytes, exabytes, zettabytes — often spread across locations. We're in the zettabyte era (2010s onward). | A retailer with 10 years of transaction logs across global regions |
| Velocity | Data changing rapidly OR moving from one place to another quickly. Streaming, real-time updates. | Stock tick data, IoT sensor streams, airplane engine telemetry |
| Variety | Different formats — structured (databases), unstructured (images, video, text), semi-structured (JSON, XML). One system can't handle all three well. | Customer record + email transcripts + product photos + sensor logs |
| Veracity | Different levels of quality, accuracy, trustworthiness, and consistency. Hard to assess at scale. | Multi-source data where some sources are accurate, some outdated, some incomplete |

💡 Memory Aid — VVVV (the 4 Vs)

Volume, Velocity, Variety, Veracity. Volume of data, Velocity of change, Variety of formats, Veracity of quality. Big data = big problems across all four.

KEY TAKEAWAYS


Lesson 6: Big Data — Lessons Learned

Decades of big data experience yielded a few hard-won lessons relevant to AI:

  1. Success requires scale AND speed. Traditional databases didn't scale to internet-era data growth. New approaches handle both.
  2. Data visualization is key. Humans can't interpret raw tables at scale — visualization is required for understanding.
  3. Big data success requires reliability and recovery. Data systems fail; the question is how quickly they recover and how much they protect.
  4. Big data requires a multi-platform, multi-tool approach. No single product solves the big-data problem.

For the PM, the practical takeaway: budget for data tooling, visualization, and resilience as first-class project line items — not as afterthoughts.

KEY TAKEAWAYS


Lesson 7: Applying Big Data Approaches to AI

AI and big data are deeply interconnected — success in one often depends on the other. AI projects pull from big-data infrastructure to access training data; big-data systems use AI for pattern detection and anomaly surfacing. The PM's job: ensure the AI project's data needs are mapped to the organization's existing big-data capabilities (or surface the gap if those capabilities don't exist yet).

KEY TAKEAWAYS


Module 2: Data Quality, Quantity, and AI-Specific Aspects

Lessons 8-11 | What "enough data" and "good data" actually mean for AI projects.

Lesson 8: Failure Reason — Data Quantity and Quality Issues

A common reason AI projects fail is lack of data understanding — specifically, not knowing whether you have enough data of high enough quality. Both dimensions matter:

Domain III's job is to detect these issues during evaluation (III.7), before the gate (III.8). Catching them at the gate prevents expensive Phase III/IV rework.

KEY TAKEAWAYS


Lesson 9: Data Quantity Issues

Specific quantity problems you should recognize on the exam:

Mitigations the PM may surface: data augmentation, synthetic data generation, transfer learning from pretrained models, or descoping the project to a use case the available data covers.
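
As an illustration of one of those mitigations, here is a minimal transfer-learning sketch, assuming PyTorch and torchvision are available; the backbone choice (resnet18) and the two-class head are arbitrary assumptions for the example, not part of the CPMAI material.

```python
# Minimal sketch: transfer learning from a pretrained model as a small-data mitigation.
# Assumes torch and torchvision are installed; downloads pretrained weights on first use.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                 # freeze the learned general features

model.fc = nn.Linear(model.fc.in_features, 2)   # new head for a small 2-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # train only the new head
loss_fn = nn.CrossEntropyLoss()
# A short training loop over the small task-specific dataset would go here.
```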

KEY TAKEAWAYS


Lesson 10: Data Quality Issues

PMI lists data quality dimensions as: accuracy, completeness, consistency, timeliness, uniqueness, validity, integrity. Issues across any of these dimensions degrade model performance. Specific failure modes:

💡 Memory Aid — ACCTUVI (Data Quality Dimensions)

Accuracy, Completeness, Consistency, Timeliness, Uniqueness, Validity, Integrity. "A Cat Caught Two Unwary Voles Inside." Each is a separate quality dimension to evaluate.

KEY TAKEAWAYS


Lesson 11: AI-Specific Aspects of Data Understanding

Beyond general data quality and quantity, AI projects have specific considerations that traditional analytics projects don't share:

KEY TAKEAWAYS


Module 3: Data Sets, Sources, and Gathering

Lessons 12-21 | Identifying sources, the data types AI consumes, and gathering data into the project.

Lesson 12: ECO Task III.3 — Identify Data Sources and Locations

Once required data is defined (III.1), the PM coordinates the team's identification of where the data lives. Sources are typically: internal systems (ERP, CRM, transactional databases), external feeds (vendor APIs, public datasets), partner data, IoT/sensor streams, or generated/synthetic data. Each carries different cost, access, and trustworthy-AI implications.

The output is a data source inventory that maps each required data element from III.1 to one or more candidate sources, including access mechanism, ownership, refresh cadence, format, and any compliance constraints.
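
A minimal sketch of what two inventory entries might look like follows; the keys and values are illustrative assumptions, not a prescribed CPMAI artifact format.

```python
# Minimal sketch of a data source inventory (Task III.3), mapping required data
# elements from III.1 to candidate sources. Keys and values are illustrative assumptions.
source_inventory = {
    "customer_transactions": {
        "source": "ERP order history (internal)",
        "access_mechanism": "read-only SQL replica",
        "owner": "Finance data owner",
        "refresh_cadence": "daily",
        "format": "structured (relational tables)",
        "compliance_constraints": ["PCI DSS scope", "7-year retention"],
    },
    "support_transcripts": {
        "source": "helpdesk vendor API (external)",
        "access_mechanism": "REST export, API key",
        "owner": "Customer Care",
        "refresh_cadence": "weekly",
        "format": "unstructured text",
        "compliance_constraints": ["GDPR consent check required"],
    },
}

# Quick completeness check: every required element should map to at least one source.
missing = [element for element, entry in source_inventory.items() if not entry.get("source")]
print("elements without a source:", missing)
```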

KEY TAKEAWAYS

💡 Memory Aid — SCALE (Source Identification)

Source type (internal/external), Cost & access mechanism, Accuracy & cadence, Legal/license constraints, Endpoint or location.

PM Oversight Angle


Lesson 13: Identifying Data Sets for ML — Activities

PMI's prep course lists four key activities for data collection:

  1. Identify required data for training — the dataset itself (e.g., faces for facial recognition).
  2. Identify specific attributes — fields, features, labels needed within that dataset.
  3. Identify data sources — internal, external, or both.
  4. Identify the means to aggregate — how data from multiple sources will be combined into one usable form.

These four are not separate Domain III tasks — they're activities that together execute Tasks III.1 (define) and III.3 (identify sources).

KEY TAKEAWAYS


Lesson 14: ECO Task III.5 — Gather Required Data

After requirements are defined (III.1) and sources identified (III.3), the team executes the actual data gathering. The PM does not run extraction queries or write ingestion code — the PM coordinates the team's effort, tracks completion against the Data Source Inventory, and surfaces blockers (access denied, source unavailable, format mismatch) for resolution.

A subtle but exam-relevant point: gathering happens before full evaluation (III.7). The PM ensures gathering proceeds as planned and produces the dataset that evaluation will then assess.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 15: Training Data — Definition and Role

Training data is "a data set of prepared, cleaned, and appropriately labeled data — used to incrementally train a machine learning model to perform a particular task." Three properties matter:

Common training data types: Image/Video, Text/Conversational, Structured/Quantified, Sensor/IoT, Behavioral. Match data type to AI pattern (Lesson 5 of Phase I).

KEY TAKEAWAYS


Lesson 16: Structured, Unstructured, and Semi-Structured Data

| Type | Definition | Examples |
|---|---|---|
| Structured | Defined format and schema. | Tables, databases, spreadsheets |
| Unstructured | No schema, highly variable. | Images, video, text, audio |
| Semi-Structured | Some schema + variability. | JSON, XML, invoices |

Roughly 80% of organizational data is unstructured — and most of it is untapped for analytics. AI (especially modern deep learning and generative AI) is the primary tool for extracting value from unstructured data.
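
To make the three types concrete, here is a small sketch showing the same customer touchpoint in structured, semi-structured, and unstructured form; all field names and values are invented for illustration.

```python
# Minimal sketch contrasting the three data types with one invented customer record.
import json

structured_row = ("C-1042", "Ada Lovelace", "2024-11-03", 129.99)  # fixed schema: a table/CSV row

semi_structured = json.loads("""
{
  "customer_id": "C-1042",
  "orders": [{"sku": "A-77", "qty": 2}],
  "notes": {"loyalty_tier": "gold"}
}
""")  # some schema plus variable nesting (JSON)

unstructured = "Customer emailed: 'The replacement part arrived damaged again.'"  # free text

print(semi_structured["orders"][0]["sku"])  # navigable, but fields can vary record to record
```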

KEY TAKEAWAYS


Lesson 17: The Untapped Value of Unstructured Data

Most enterprises have decades of unstructured data — emails, support tickets, contracts, product photos, customer call recordings, scanned documents — that's never been systematically analyzed. AI provides the capability to extract value: NLP for text, computer vision for images, speech recognition for audio.

For the exam: a stem mentioning "the company has years of [emails / support tickets / scanned forms / call recordings] but has never used them" is signaling unstructured-data territory and probably testing AI pattern selection.

KEY TAKEAWAYS


Lesson 18: Does AI Need a Lot of Data?

The famous "it depends" answer. Data needs are driven by:

KEY TAKEAWAYS


Lesson 19: Running AI Projects with Small Data

It is possible to run AI with limited data when:

The PM's job is to escalate the small-data approach decision to ensure stakeholders understand the trade-offs (less robust generalization, narrower use case fit).

KEY TAKEAWAYS


Lesson 20: Ground Truth Data

Ground truth data is "data derived from real-world observations that serves as the definitive reference" for evaluating model performance. Ground truth is the standard the model is measured against — without it, you have no objective accuracy metric.

For supervised learning, ground truth typically takes the form of human-labeled data ("this image is a cat"). For unsupervised problems, ground truth may be derived from external signals or expert review.
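
A minimal sketch of why ground truth enables an objective metric: compare model predictions against the labeled reference and compute accuracy. The labels and predictions below are invented for illustration.

```python
# Minimal sketch: ground truth labels as the reference for an objective accuracy metric.
ground_truth = ["cat", "dog", "cat", "cat", "dog"]   # human-labeled reference data
predictions  = ["cat", "dog", "dog", "cat", "dog"]   # model output on the same items

correct = sum(1 for truth, pred in zip(ground_truth, predictions) if truth == pred)
accuracy = correct / len(ground_truth)
print(f"accuracy vs. ground truth: {accuracy:.0%}")  # 80% -- no ground truth, no such number
```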

KEY TAKEAWAYS


Lesson 21: Data Management and Data Management Plans

Data management is the practice of collecting, organizing, mining, and storing organizational data. A Data Management Plan (DMP) documents how data will be handled across the project: collection, storage, access, transformation, retention, and security.

The PM owns ensuring a DMP exists (often a workbook section in CPMAI), is signed off by stakeholders, and aligns with the data life cycle (Lesson 29). For regulated industries, the DMP is also a compliance artifact.

KEY TAKEAWAYS


Module 4: Privacy, Compliance, Roles, and Infrastructure

Lessons 22-32 | The roles, governance, and infrastructure that make data usable and compliant.

Lesson 22: ECO Task III.6 — Check Data Privacy, Compliance, and Access

Privacy, compliance, and access are not Phase III problems — they are requirements-stage checks. The PM coordinates the team to verify that every required data element passes:

This task pulls heavily on Domain I — Trustworthy AI (privacy/security plan, regulatory compliance, accountability documentation).

KEY TAKEAWAYS

PM Oversight Angle


Lesson 23: Data Governance

Data governance is "the set of processes, procedures, and standards that ensure data is accurate, accessible, secure, and used responsibly." It's an organizational capability, not a project artifact — but the project consumes governance policy and contributes new artifacts (data lineage, access logs, etc.) to it.

Key governance components: data ownership (who owns each dataset), data classification (sensitivity tiers), access policies, retention policies, audit requirements.

KEY TAKEAWAYS


Lesson 24: Data Stewardship

Data stewardship is "the practice of ensuring an organization's data is accessible, trustworthy, usable, and secure." Stewardship operationalizes governance — it's the set of practices that make policy real day-to-day.

KEY TAKEAWAYS


Lesson 25: ECO Task III.2 — Identify Data SMEs

Every required dataset has subject-matter experts who know it best. The PM identifies these SMEs early and ensures they're engaged in requirements, source selection, evaluation, and gate decisions. Roles include data stewards, data custodians, data owners, business-domain SMEs, and external SMEs (vendors, consultants).

KEY TAKEAWAYS

PM Oversight Angle


Lesson 26: Data Stewards vs. Data Custodians

These two roles are distinct and commonly confused on the exam:

| Role | Responsibilities | Skill Mix |
|---|---|---|
| Data Steward | Enforces policy. Establishes data lineage, cataloging, monitoring, advocacy. Strategic — works with IT and business. | Data management + soft skills (communication, collaboration) |
| Data Custodian | Safe storage, transfer, and use of data. Administrative — not the data owner. | Operational/technical |

The key distinction: stewards enforce policy and curate; custodians operationally protect. Stewards are strategic and cross-functional; custodians are operational.

💡 Memory Aid — Steward vs. Custodian

Steward = Strategic policy enforcer. Custodian = Custodial protection (storage/transfer/use). "Stewards Set policy, Custodians Carry it."

KEY TAKEAWAYS


Lesson 27: Informational Bias

Three usages of the word "bias" in AI — and only one is "informational bias":

  1. Bias in neural networks — adjustment factor for fine-tuning model performance. Nothing to do with fairness.
  2. Bias vs. variance — model's tendency to underfit or overfit.
  3. Informational bias — overrepresentation or underrepresentation of categories in the data, with fairness implications.

Common types of informational bias: reporting bias (only some aspects recorded), recall bias (recent data weighted more), classification bias (data categorized in ways that misrepresent groups).

To build trustworthy AI, bias must be measurable, monitored, and managed.
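
One way to make representation bias measurable is a simple category-share check over the gathered records, as in the sketch below; the category values and the 15% flag threshold are illustrative assumptions, not a PMI rule.

```python
# Minimal sketch: measuring category representation to surface possible informational bias.
# The categories and the 15% flag threshold are illustrative assumptions.
from collections import Counter

records = ["urban", "urban", "urban", "urban", "rural", "urban", "suburban", "urban"]

counts = Counter(records)
total = len(records)
for category, count in counts.items():
    share = count / total
    flag = "  <- possibly underrepresented" if share < 0.15 else ""
    print(f"{category}: {share:.0%}{flag}")
```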

💡 Memory Aid — Three Biases (NVI)

Neural-network bias (adjustment factor), Variance bias (under/overfit), Informational bias (fairness — the one that matters here).

KEY TAKEAWAYS


Lesson 28: ECO Task III.4 — Coordinate AI Workspace and Infrastructure

The PM coordinates the technical infrastructure required for the project: compute environments, storage platforms, data pipelines, dev/test/prod separation, security controls, and access provisioning. The PM doesn't build any of it — the PM ensures it exists, fits the project, and is ready when needed.

This often means working with platform teams, cloud architects, IT security, and the data engineering team well before the data scientist starts.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 29: The Data Life Cycle

PMI's data life cycle has 10 stages, each with PM-relevant questions:

  1. Generation — How is the data generated?
  2. Collection — How will it be collected? From which sources?
  3. Storage — What storage methods? Safe and accessible?
  4. Access — Who gets access? Who manages access?
  5. Usage — What's the data used for? Specific purposes documented?
  6. Transfer — How is data transferred between systems?
  7. Security — How is data secured? Different levels for different data?
  8. Deletion — When/how is data deleted when no longer needed?
  9. Archival — How is long-term-retained data archived? Still secure and accessible?
  10. Privacy — Privacy policies and regulatory requirements, throughout.

💡 Memory Aid — Data Life Cycle (10 Stages)

Generation → Collection → Storage → Access → Usage → Transfer → Security → Deletion → Archival → Privacy. "Good Cats Sit Around Until Their Suppers Drop And Pause."

KEY TAKEAWAYS


Lesson 30: Data Quality Management

Data quality management is the ongoing process of measuring, improving, and maintaining data quality across the seven dimensions (Lesson 10). It's a practice, not a one-time activity.

For the AI project, quality management means:

KEY TAKEAWAYS


Lesson 31: Analytics — Definition and Scope

Analytics involves "using statistical and other methods to gain insights from data." Analytics is broader than AI — it includes descriptive (what happened), diagnostic (why), predictive (what will happen), and prescriptive (what should we do) analytics. AI overlaps mostly with predictive and prescriptive.

KEY TAKEAWAYS


Lesson 32: Data Science vs. Data Analytics

Closely related, distinct purposes:

| | Data Science | Data Analytics |
|---|---|---|
| Focus | Build predictive/prescriptive models, often involving ML | Analyze historical data for insight |
| Output | Models that make decisions or predictions | Reports, dashboards, recommendations |
| Skill Mix | Statistics + ML + programming + domain | Statistics + business + visualization |
| Tool Bias | Python/R, ML frameworks | SQL, BI tools, statistical packages |

For an AI project, you typically need both — a data scientist to build, a data analyst to monitor and interpret production performance.

KEY TAKEAWAYS


Module 5: Evaluation, the Gate, and Conveying Findings

Lessons 33-38 | Closing out Domain III — evaluation, the go/no-go gate, and reporting to leadership.

Lesson 33: ECO Task III.7 — Oversee Data Evaluation

Once data has been gathered (III.5), the team evaluates it against the requirements (III.1) and the success criteria from Domain II (II.8). The PM doesn't perform the evaluation — the PM oversees it, ensures it's documented, and ensures the result feeds into the gate decision (III.8).

Evaluation typically covers: completeness against requirements, quality across the seven dimensions, alignment with operational data, fitness for the AI pattern, identification of remaining gaps.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 34: Moving Beyond CPMAI Phase II

You're ready to move past Phase II (Domain III work) and into Phase III (Data Preparation) when:

  1. Data requirements are adequately determined.
  2. PMI-CPMAI Workbook items for Phase II have adequate responses (well-defined answers, understanding of known and unknown).
  3. No critical roadblocks remain (data location, quality, format, access, permissions, regulations, compliance).
  4. Key Phase II questions have answers.

Partial answers are okay if they're sufficient to address the Phase I business requirements. The bar is adequate, not complete.

KEY TAKEAWAYS


Lesson 35: ECO Task III.8 — Determine if Data Meets Solution Needs (THE GATE)

This is the first explicit go/no-go gate in the ECO and one of the most heavily tested concepts on the exam. The PM facilitates a documented decision with stakeholders covering three areas:

Data Sources — Do you know what data you need? Have you identified sources, access methods, and ownership? Are there pretrained models you could use to reduce data needs?

Data Description — Have you considered all 4 Vs? Is the quantity, type, change rate, and quality understood? Is the data on-premise / cloud / hybrid known?

Data Quality — Do you know the data's quality? Are labeling/augmentation requirements defined? Is the time/cost to prepare understood? Is a collection/ingestion pipeline defined?

The decision has three outcomes:

  1. GO — Proceed to Phase III (Data Preparation).
  2. ITERATE — Loop back to Phase I or earlier in Phase II to address gaps.
  3. DESCOPE — Reduce project scope to what the available data supports.

PMI's key concept: "If you can confidently say, 'We have the data and know the problem,' move to Phase III. If not, pause here. Clarify these points first to avoid issues later."
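
As a purely illustrative aid (not PMI's wording), the gate can be pictured as a checklist plus a decision record; the mapping from unresolved items to ITERATE vs. DESCOPE in the sketch below is a simplifying assumption made for the example.

```python
# Minimal sketch of a III.8 gate decision record. Checklist items and the simple
# decision rule are illustrative assumptions, not PMI-prescribed logic.
GATE_AREAS = {
    "sources":     ["required data known", "sources identified", "access and ownership defined"],
    "description": ["4 Vs understood", "quantity and type understood", "on-prem/cloud/hybrid known"],
    "quality":     ["quality assessed", "labeling/augmentation defined", "prep time and cost estimated"],
}

def gate_outcome(unresolved: dict[str, list[str]]) -> str:
    """Return GO, ITERATE, or DESCOPE based on unresolved checklist items per area."""
    assert set(unresolved) <= set(GATE_AREAS), "unknown gate area"
    if not any(unresolved.values()):
        return "GO"          # proceed to Phase III (Data Preparation)
    if unresolved.get("sources") or unresolved.get("description"):
        return "ITERATE"     # loop back to Phase I / earlier Phase II to close gaps
    return "DESCOPE"         # remaining quality gaps: shrink scope to what the data supports

print(gate_outcome({"sources": [], "description": [], "quality": ["labeling plan missing"]}))
```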

KEY TAKEAWAYS

💡 Memory Aid — SDQ Gate (Sources, Description, Quality)

Sources known? Description complete (4 Vs)? Quality understood? Three checkboxes for the gate. Three outcomes (Go/Iterate/Descope).

PM Oversight Angle


Lesson 36: The Phase II Go/No-Go Decision

Specific questions PMI's prep course lists for the gate, mapped to the three areas:

Data Sources:

Data Description:

Data Quality:

KEY TAKEAWAYS


Lesson 37: When to Iterate Back to Previous CPMAI Phases

PMI lists 12 scenarios where iterating back to Phase I (Business Understanding) is the right call:

  1. The business problem has shifted since Phase I.
  2. You need data you do not have and cannot reasonably obtain.
  3. You have the wrong type of data and cannot get the right type.
  4. Identified data is too little, and no augmentation is possible.
  5. Identified data is too much, and selecting from it requires rescoping.
  6. Training data and operational data differ materially.
  7. Any of the 4 Vs (Volume, Velocity, Variety, Veracity) creates a roadblock.
  8. Improving data quality would take too long.
  9. Organizational challenges in data collection or ingestion.
  10. Legal, compliance, security, or risk issues block access.
  11. Foundation / pretrained model / GenAI approach requires Phase I changes.
  12. Realizing the project is a Proof of Concept (PoC) rather than a Pilot.

The exam-critical insight: iterating back is not failure — it's the methodology working. The cost of unresolved issues compounds in Phase III/IV/V. CPMAI's iterative design specifically allows backing up "without penalty."

💡 Memory Aid — 12 Iteration Triggers (mnemonic: BIG WORDS)

Business shift, Infeasible data, Gap-too-big quantity; Wrong type / volume too high, Operational mismatch, Roadblock from any V, Data quality time, Stakeholder/legal blocks. (Plus pretrained-model and PoC-vs-pilot misalignment.)

KEY TAKEAWAYS


Lesson 38: ECO Task III.9 — Convey Data Understanding to Leadership

The final Domain III task: the PM communicates Phase II findings to leadership. This is not optional — without the briefing, leadership lacks visibility into data risk and can't make informed decisions about Phase III/IV/V investment.

The conveyance covers: data understanding state, gate decision, key risks identified, iteration recommendations (if any), changes to project scope or schedule that flow from data understanding.

KEY TAKEAWAYS

PM Oversight Angle


Quick Reference: Data Quality Dimensions

| Dimension | What to Check |
|---|---|
| Accuracy | Does the data match reality? |
| Completeness | Are all required fields present? |
| Consistency | Are formats/conventions uniform across sources? |
| Timeliness | Is the data current enough for the use case? |
| Uniqueness | Are duplicates resolved? |
| Validity | Does the data conform to defined rules/schemas? |
| Integrity | Are relationships between data elements maintained? |

Mnemonic: ACCTUVI ("A Cat Caught Two Unwary Voles Inside").
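
For teams that want to automate spot checks, here is a minimal sketch, assuming pandas is installed, that scores a few of these dimensions on a toy dataset; the column names, validity rule, and timeliness cutoff are illustrative assumptions.

```python
# Minimal sketch (assumes pandas is installed): spot checks for a few ACCTUVI
# dimensions on an invented dataset. Columns, rules, and thresholds are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "last_updated": pd.to_datetime(["2025-01-10", "2023-06-01", "2025-02-02", "2025-02-03"]),
})

completeness = df["email"].notna().mean()                      # Completeness: non-null share
uniqueness   = 1 - df["customer_id"].duplicated().mean()       # Uniqueness: share of non-duplicate IDs
validity     = df["email"].str.contains("@", na=False).mean()  # Validity: crude format rule
timeliness   = (df["last_updated"] >= "2024-01-01").mean()     # Timeliness vs. a use-case cutoff

print(f"completeness={completeness:.0%} uniqueness={uniqueness:.0%} "
      f"validity={validity:.0%} timeliness={timeliness:.0%}")
```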


Quick Reference: Domain III Go/No-Go Gate (Task III.8)

Three areas:
| Area | Key Questions |
|---|---|
| Sources | Do you know what data you need? Sources identified? Access defined? Pretrained alternatives considered? Owner identified? |
| Description | 4 Vs understood (Volume, Velocity, Variety, Veracity)? Edge-device needs? On-premise/cloud/hybrid? |
| Quality | Quality known? Labeling/augmentation defined? Time/cost to prep estimated? Pipeline defined and owned? |
Three outcomes: GO (proceed to Phase III) · ITERATE (loop back) · DESCOPE (reduce scope to what data supports)

Decision rule: "We have the data and know the problem" = GO. "We don't" = pause.

Wrong-answer traps to recognize:


Cross-Domain Links


Knowledge Check

Question 1

A project team is preparing to start data collection for a new AI initiative. The data scientist asks the PM what data should be pulled. What should the project manager do?

A. Approve the data scientist's choice and let them begin

B. Direct the data scientist to start with the most accessible internal data

C. Pause and produce a documented data requirements specification tying Phase I success criteria to specific data attributes, sources, and aggregation method, then engage stakeholders to confirm

D. Schedule a project review for the following sprint

Click for answer and rationale

Correct: C

ECO Task III.1 requires the PM to define required data — produce a documented specification — before collection begins. The data scientist's question is signaling a missing requirements step.

  • A wrong: Approves a choice with no documented requirements basis.
  • B wrong: Wrong-answer trap — convenience-based source selection bypasses requirements definition.
  • D wrong: Schedules without addressing the immediate gap.

Question 2

After data evaluation, the team finds that 20% of required features are unavailable from any source. What is the project manager's BEST next step?

A. Have the data scientist engineer proxy features for the missing 20%

B. Cancel the project

C. Conduct a formal go/no-go assessment, document the gap and impact on success criteria, and engage stakeholders to decide between proceed-with-mitigation, iterate, or descope

D. Proceed to data preparation and address the gaps in Phase III

Click for answer and rationale

Correct: C

ECO Task III.8 — the gate. The PM facilitates a documented stakeholder decision. Three outcomes: GO/ITERATE/DESCOPE. C reflects all three.

  • A wrong: Wrong-answer trap — technical workaround that bypasses governance.
  • B wrong: Premature without an impact assessment. Cancellation is a possible outcome of the gate, not a step that skips it.
  • D wrong: Bypasses the gate. Compounds rework in later phases.

Question 3

True or False: A project manager can confirm data privacy and compliance requirements during Phase III (Data Preparation), since that's when the data is actually transformed.

Click for answer and rationale

Correct: FALSE

ECO Task III.6 — privacy/compliance/access checks belong in Phase II (Data Understanding), not Phase III. Compliance is a requirements concern. Discovering a compliance gap in Phase III is rework. Domain I (Trustworthy AI) reinforces this — privacy/security plan (I.1) and regulatory compliance (I.4) run throughout the project, starting in Phase I.

Question 4

The data scientist informs the PM that the available data is materially different from the operational data the model will encounter in production. What should the PM do?

A. Direct the data scientist to use the available data and adjust the model later

B. Iterate back to Phase I to reconsider the business problem and project scope; document the misalignment and engage stakeholders

C. Proceed and let the model's performance in production guide adjustments

D. Have the team augment the available data to better match operational data

Click for answer and rationale

Correct: B

This is one of PMI's 12 documented iteration triggers (Lesson 37, scenario 6: "training data and operational data differ materially"). The methodology specifically allows iterating back to Phase I "without penalty."

  • A wrong: Pushes forward with a known fundamental issue.
  • C wrong: Letting production discover the issue is the worst outcome.
  • D wrong: Wrong-answer trap — augmentation is a technical fix; the misalignment is a Phase I scope question.

Question 5

What's the difference between a data steward and a data custodian?

Click for answer

Steward = strategic, policy-enforcing, cross-functional. Establishes data lineage, cataloging, monitoring, advocacy. Works with both IT and business.

Custodian = operational, administrative. Safe storage, transfer, and use of data. Not the data owner — administrative role over the data.

Mnemonic: "Stewards Set policy, Custodians Carry it."

Question 6

The team identifies that required data exists but is held by a partner organization that requires a Business Associate Agreement (BAA) under HIPAA. The legal team estimates a 6-week delay to execute the BAA. What should the PM do?

A. Proceed with project planning under the assumption the BAA will be signed

B. Replace the partner data with a non-regulated substitute that the data scientist suggests

C. Document the delay as a risk, escalate to leadership with options (proceed-and-wait, iterate to alternative sources, or descope), and engage stakeholders for the decision

D. Cancel the project

Click for answer and rationale

Correct: C

ECO Task III.6 (privacy/compliance) intersects with III.8 (the gate). A multi-week compliance dependency that affects schedule and scope is a leadership decision, not a technical workaround.

  • A wrong: Optimistic and risky — BAAs can fail to execute.
  • B wrong: Wrong-answer trap — substitute selection should be requirements-driven (III.1) and SME-validated (III.2), not data-scientist-suggested.
  • D wrong: Cancellation is a stakeholder decision, not a unilateral PM call.

Question 7

True or False: The PM's deliverable for ECO Task III.9 (Convey Data Understanding to Leadership) is a copy of the data evaluation report.

Click for answer and rationale

Correct: FALSE

The data evaluation report (III.7's deliverable) is an input to III.9, not the output. III.9's deliverable is a leadership briefing artifact: gate decision, risks, recommendations, scope/schedule impacts — packaged for a leadership audience and decision-making, not raw evaluation data.

Question 8

A team is using a pretrained foundation model and only needs a small amount of task-specific data for fine-tuning. The PM is asked whether to skip Domain III's gate (III.8). What should the PM do?

A. Skip the gate — III.8 is for projects with custom training data

B. Run the gate as written — sources, description, and quality questions still apply, even when data needs are smaller

C. Defer the gate to Phase IV when fine-tuning happens

D. Have the data scientist run the gate informally

Click for answer and rationale

Correct: B

The gate's three areas (Sources, Description, Quality) apply regardless of data volume. A small fine-tuning dataset still needs identified sources, described characteristics, and assessed quality — especially for trustworthy-AI considerations.

  • A wrong: Misreads the gate as data-volume-dependent.
  • C wrong: Phase IV gates are different gates (IV.5, IV.6) testing different things.
  • D wrong: Gate decisions are stakeholder-engaged and PM-facilitated, not informally run by the data scientist.

Question 9

The team's data evaluation reveals that 80% of required data has high quality but the remaining 20% is incomplete and would take 3 months to clean. The project's success criteria specify that all required features are needed. What's the PM's BEST move?

A. Proceed to data preparation and clean the 20% in parallel with model development

B. Iterate back to II.8 (success criteria) to determine if the success criteria can be revised to accept partial coverage, OR loop to III.1 to consider alternative data

C. Have the data scientist begin training on the 80% while the 20% is cleaned

D. Approve a 3-month schedule extension

Click for answer and rationale

Correct: B

This is a cross-domain pull. Success criteria are set in II.8. If the full feature set can't be met within tolerance, the right move is to revisit the success criteria and the required-data definition, not to work around the gap technically.

  • A wrong: Parallel work on incomplete data foundation = compounding risk.
  • C wrong: Wrong-answer trap — training on partial data inserts a quality issue into the model.
  • D wrong: Schedule extension without scope or success-criteria review is a unilateral decision that should be stakeholder-engaged.

Question 10

The PM is documenting data sources and notices that one critical source is held by a vendor whose contract expires in 6 months — and renewal is uncertain. What should the PM do?

A. Document the source and proceed; renewal is the procurement team's problem

B. Document the source AND escalate the contract risk for stakeholder visibility, since data availability is a project-level dependency

C. Replace the source proactively with an internal alternative

D. Wait until renewal is decided before completing the source inventory

Click for answer and rationale

Correct: B

Data source dependencies that affect project viability are PM-level concerns. III.3 (source identification) intersects with risk management — escalation is the right move.

  • A wrong: "Procurement's problem" abdicates PM accountability for project dependencies.
  • C wrong: Replacing without stakeholder engagement is a unilateral scope decision.
  • D wrong: Blocks the inventory progress on an external decision.


Memory Aids & Mnemonics Summary

| Mnemonic | What to Remember |
|---|---|
| DRIP (Define Required Data) | Determine pattern, Required attributes, Identify sources, Plan aggregation |
| VVVV (4 Vs of Big Data) | Volume, Velocity, Variety, Veracity |
| ACCTUVI (Data Quality) | Accuracy, Completeness, Consistency, Timeliness, Uniqueness, Validity, Integrity. "A Cat Caught Two Unwary Voles Inside." |
| SCALE (Source Identification) | Source type, Cost & access, Accuracy & cadence, Legal/license, Endpoint |
| Steward vs Custodian | Steward = Strategic policy. Custodian = Custodial protection. "Stewards Set, Custodians Carry." |
| NVI (Three Biases) | Neural-net bias (adjustment), Variance bias (over/underfit), Informational bias (fairness — the exam one) |
| Data Life Cycle (10) | Generation, Collection, Storage, Access, Usage, Transfer, Security, Deletion, Archival, Privacy |
| SDQ Gate (III.8) | Sources known, Description complete, Quality understood. Outcomes: GO / ITERATE / DESCOPE |
| 3 Gate Outcomes | GO (proceed) · ITERATE (loop back) · DESCOPE (reduce scope) |
| 12 Iteration Triggers | Business shift, infeasible data, wrong type, too little / too much, training-vs-operational mismatch, V-roadblock, quality time, org block, legal block, pretrained-model misfit, PoC-vs-pilot misalignment |

Closing reminders for Domain III


Next: domain-V-operationalize.md (PRIORITY 2)