Domain III: Identify Data Needs — Comprehensive Study Guide

Exam weight: 26% of PMI-CPMAI exam (~31 scored questions)
Score-report framing: ❌ Below Target — PRIORITY 1 for rebuild
Maps to CPMAI methodology phase: Phase II — Data Understanding (and the front of Phase III — Data Preparation)
Number of ECO tasks: 9 (III.1 through III.9)
Estimated study time: 14 hours

Overview

Domain III is the largest weight in the exam (tied with Domain II) — and the one scored Below Target on first attempt. Every task in this domain begins with an oversight verb: define, identify, coordinate, gather, check, oversee, determine if, convey. Not one task says "build the dataset," "ingest the data," or "engineer the features." That work belongs to the data scientist, the data engineer, or the data steward — the project manager facilitates, documents, and decides.

The unifying pattern: Domain III is about ensuring the right data is identified, made available, and validated against the success criteria from Domain II — before the team commits to building anything. Most wrong-answer traps in Domain III are technically-correct moves that bypass either a documented requirement, a stakeholder decision, or a go/no-go gate.

This domain contains the first of three explicit go/no-go gates in the ECO (Task III.8 — Determine if data meets solution needs). Ten to fifteen exam questions concentrate around the three gates combined. Master the gate decision pattern and you secure outsized point value.

Table of Contents


Module 1: Foundation — The Role of Data in AI

Lessons 1-7 | Why data is the foundation of AI projects, and what "required data" actually means.

Lesson 1: ECO Task III.1 — Define Required Data

Before a single byte is collected, the project manager owns one question: what data does this AI project actually need? ECO Task III.1 — the entry point of Domain III — turns the business needs from Phase I into a documented, verifiable specification. Skipping this step is one of the most common reasons AI projects either fail or rework themselves into oblivion.

PMI's prep course breaks the activity into four sequenced steps: identify the required data type (driven by the AI pattern from Phase I), specify the attributes and field-level detail, identify the data sources (internal, external, mix), and identify the means to aggregate the data. The output of this task is a documented requirements specification — not a dataset.
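
As a concrete (hypothetical) illustration, the sketch below shows what one entry in such a specification could look like if captured in code; the dataclass fields mirror the four steps above, and the names and example values are invented for illustration, not drawn from PMI material.

```python
# Minimal sketch of one entry in a data requirements specification (Task III.1).
# Field names and example values are illustrative assumptions, not a CPMAI-mandated schema.
from dataclasses import dataclass

@dataclass
class DataRequirement:
    ai_pattern: str                # AI pattern from Phase I driving this need
    data_type: str                 # e.g., "image", "text", "structured"
    attributes: list[str]          # field-level detail: features, labels
    candidate_sources: list[str]   # internal, external, or a mix
    aggregation_method: str        # how multi-source data will be combined

requirement = DataRequirement(
    ai_pattern="recognition",
    data_type="image",
    attributes=["face_image", "identity_label", "capture_timestamp"],
    candidate_sources=["internal badge-photo system", "vendor dataset"],
    aggregation_method="nightly batch merge into a labeled training store",
)
print(requirement.attributes)
```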

KEY TAKEAWAYS

💡 Memory Aid — DRIP

When defining required data, run DRIP: Determine the data type (driven by the AI pattern), Required attributes, Identify sources, Plan aggregation.

PM Oversight Angle


Lesson 2: Data Fuels Intelligence

Data is the lifeblood of AI and has been since the field's beginnings in 1956. AI learns from data — without data, there is no learning, no generalization, no model. The implication for the PM: data scarcity, quality issues, or access blockers are project-level risks, not technical inconveniences. They need to surface in risk registers and Phase II go/no-go discussions, not get buried in data-engineering tickets.

KEY TAKEAWAYS


Lesson 3: The Data-First Approach

Modern AI assumes a "data-first" posture: identify the right data before committing to algorithms, infrastructure, or architecture. Choosing the model, framework, or compute first is backwards — those are downstream of what data exists, where it lives, what quality it has, and what trustworthy-AI constraints it carries.

A common failure mode you should recognize on the exam: "the team chose [framework / cloud platform / model architecture] and then went looking for data" — this scenario almost always tests Phase I/II discipline. The right answer reframes back to data understanding.

KEY TAKEAWAYS


Lesson 4: What Is Big Data?

Big Data is data at a scale where traditional storage, processing, and analysis tools no longer fit. PMI defines big data along four Vs: Volume, Velocity, Variety, and Veracity. Each "V" is not just a size descriptor — it's a challenge category that creates AI-project risk if not understood up front.

The PM's job here is not to know the technology stack that handles big data — it's to recognize when the project is dealing with big-data conditions and ensure the right SMEs and infrastructure are coordinated (Tasks III.2, III.4).

KEY TAKEAWAYS


Lesson 5: The 4 Vs of Big Data

| V | What It Means | Example |
|---|---|---|
| Volume | Massive amounts — petabytes, exabytes, zettabytes — often spread across locations. We're in the zettabyte era (2010s onward). | A retailer with 10 years of transaction logs across global regions |
| Velocity | Data changing rapidly OR moving from one place to another quickly. Streaming, real-time updates. | Stock tick data, IoT sensor streams, airplane engine telemetry |
| Variety | Different formats — structured (databases), unstructured (images, video, text), semi-structured (JSON, XML). One system can't handle all three well. | Customer record + email transcripts + product photos + sensor logs |
| Veracity | Different levels of quality, accuracy, trustworthiness, and consistency. Hard to assess at scale. | Multi-source data where some sources are accurate, some outdated, some incomplete |

💡 Memory Aid — VVVV (the 4 Vs)

Volume, Velocity, Variety, Veracity. Volume of data, Velocity of change, Variety of formats, Veracity of quality. Big data = big problems across all four.

KEY TAKEAWAYS


Lesson 6: Big Data — Lessons Learned

Decades of big data experience yielded a few hard-won lessons relevant to AI:

  1. Success requires scale AND speed. Traditional databases didn't scale to internet-era data growth. New approaches handle both.
  2. Data visualization is key. Humans can't interpret raw tables at scale — visualization is required for understanding.
  3. Big data success requires reliability and recovery. Data systems fail; the question is how quickly they recover and how much they protect.
  4. Big data requires a multi-platform, multi-tool approach. No single product solves the big-data problem.

For the PM, the practical takeaway: budget for data tooling, visualization, and resilience as first-class project line items — not as afterthoughts.

KEY TAKEAWAYS


Lesson 7: Applying Big Data Approaches to AI

AI and big data are deeply interconnected — success in one often depends on the other. AI projects pull from big-data infrastructure to access training data; big-data systems use AI for pattern detection and anomaly surfacing. The PM's job: ensure the AI project's data needs are mapped to the organization's existing big-data capabilities (or surface the gap if those capabilities don't exist yet).

KEY TAKEAWAYS


Module 2: Data Quality, Quantity, and AI-Specific Aspects

Lessons 8-11 | What "enough data" and "good data" actually mean for AI projects.

Lesson 8: Failure Reason — Data Quantity and Quality Issues

A common reason AI projects fail is lack of data understanding — specifically, not knowing whether you have enough data of high enough quality. Both dimensions matter:

Domain III's job is to detect these issues during evaluation (III.7), before the gate (III.8). Catching them at the gate prevents expensive Phase III/IV rework.

KEY TAKEAWAYS


Lesson 9: Data Quantity Issues

Specific quantity problems you should recognize on the exam:

Mitigations the PM may surface: data augmentation, synthetic data generation, transfer learning from pretrained models, or descoping the project to a use case the available data covers.
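
As an illustration of one of those mitigations, here is a minimal transfer-learning sketch, assuming PyTorch and torchvision are available; the backbone choice (resnet18) and the two-class head are arbitrary assumptions for the example, not part of the CPMAI material.

```python
# Minimal sketch: transfer learning from a pretrained model as a small-data mitigation.
# Assumes torch and torchvision are installed; downloads pretrained weights on first use.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                 # freeze the learned general features

model.fc = nn.Linear(model.fc.in_features, 2)   # new head for a small 2-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # train only the new head
loss_fn = nn.CrossEntropyLoss()
# A short training loop over the small task-specific dataset would go here.
```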

KEY TAKEAWAYS


Lesson 10: Data Quality Issues

PMI lists data quality dimensions as: accuracy, completeness, consistency, timeliness, uniqueness, validity, integrity. Issues across any of these dimensions degrade model performance. Specific failure modes:

💡 Memory Aid — ACCTUVI (Data Quality Dimensions)

Accuracy, Completeness, Consistency, Timeliness, Uniqueness, Validity, Integrity. "A Cat Caught Two Unwary Voles Inside." Each is a separate quality dimension to evaluate.

KEY TAKEAWAYS


Lesson 11: AI-Specific Aspects of Data Understanding

Beyond general data quality and quantity, AI projects have specific considerations that traditional analytics projects don't share:

KEY TAKEAWAYS


Module 3: Data Sets, Sources, and Gathering

Lessons 12-21 | Identifying sources, the data types AI consumes, and gathering data into the project.

Lesson 12: ECO Task III.3 — Identify Data Sources and Locations

Once required data is defined (III.1), the PM coordinates the team's identification of where the data lives. Sources are typically: internal systems (ERP, CRM, transactional databases), external feeds (vendor APIs, public datasets), partner data, IoT/sensor streams, or generated/synthetic data. Each carries different cost, access, and trustworthy-AI implications.

The output is a data source inventory that maps each required data element from III.1 to one or more candidate sources, including access mechanism, ownership, refresh cadence, format, and any compliance constraints.
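
A minimal sketch of what two inventory entries might look like follows; the keys and values are illustrative assumptions, not a prescribed CPMAI artifact format.

```python
# Minimal sketch of a data source inventory (Task III.3), mapping required data
# elements from III.1 to candidate sources. Keys and values are illustrative assumptions.
source_inventory = {
    "customer_transactions": {
        "source": "ERP order history (internal)",
        "access_mechanism": "read-only SQL replica",
        "owner": "Finance data owner",
        "refresh_cadence": "daily",
        "format": "structured (relational tables)",
        "compliance_constraints": ["PCI DSS scope", "7-year retention"],
    },
    "support_transcripts": {
        "source": "helpdesk vendor API (external)",
        "access_mechanism": "REST export, API key",
        "owner": "Customer Care",
        "refresh_cadence": "weekly",
        "format": "unstructured text",
        "compliance_constraints": ["GDPR consent check required"],
    },
}

# Quick completeness check: every required element should map to at least one source.
missing = [element for element, entry in source_inventory.items() if not entry.get("source")]
print("elements without a source:", missing)
```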

KEY TAKEAWAYS

💡 Memory Aid — SCALE (Source Identification)

Source type (internal/external), Cost & access mechanism, Accuracy & cadence, Legal/license constraints, Endpoint or location.

PM Oversight Angle


Lesson 13: Identifying Data Sets for ML — Activities

PMI's prep course lists four key activities for data collection:

  1. Identify required data for training — the dataset itself (e.g., faces for facial recognition).
  2. Identify specific attributes — fields, features, labels needed within that dataset.
  3. Identify data sources — internal, external, or both.
  4. Identify the means to aggregate — how data from multiple sources will be combined into one usable form.

These four are not separate Domain III tasks — they're activities that together execute Tasks III.1 (define) and III.3 (identify sources).

KEY TAKEAWAYS


Lesson 14: ECO Task III.5 — Gather Required Data

After requirements are defined (III.1) and sources identified (III.3), the team executes the actual data gathering. The PM does not run extraction queries or write ingestion code — the PM coordinates the team's effort, tracks completion against the Data Source Inventory, and surfaces blockers (access denied, source unavailable, format mismatch) for resolution.

A subtle but exam-relevant point: gathering happens before full evaluation (III.7). The PM ensures gathering proceeds as planned and produces the dataset that evaluation will then assess.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 15: Training Data — Definition and Role

Training data is "a data set of prepared, cleaned, and appropriately labeled data — used to incrementally train a machine learning model to perform a particular task." Three properties matter:

Common training data types: Image/Video, Text/Conversational, Structured/Quantified, Sensor/IoT, Behavioral. Match data type to AI pattern (Lesson 5 of Phase I).

KEY TAKEAWAYS


Lesson 16: Structured, Unstructured, and Semi-Structured Data

| Type | Definition | Examples |
|---|---|---|
| Structured | Defined format and schema. | Tables, databases, spreadsheets |
| Unstructured | No schema, highly variable. | Images, video, text, audio |
| Semi-Structured | Some schema + variability. | JSON, XML, invoices |

Roughly 80% of organizational data is unstructured — and most of it is untapped for analytics. AI (especially modern deep learning and generative AI) is the primary tool for extracting value from unstructured data.
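
To make the three types concrete, here is a small sketch showing the same customer touchpoint in structured, semi-structured, and unstructured form; all field names and values are invented for illustration.

```python
# Minimal sketch contrasting the three data types with one invented customer record.
import json

structured_row = ("C-1042", "Ada Lovelace", "2024-11-03", 129.99)  # fixed schema: a table/CSV row

semi_structured = json.loads("""
{
  "customer_id": "C-1042",
  "orders": [{"sku": "A-77", "qty": 2}],
  "notes": {"loyalty_tier": "gold"}
}
""")  # some schema plus variable nesting (JSON)

unstructured = "Customer emailed: 'The replacement part arrived damaged again.'"  # free text

print(semi_structured["orders"][0]["sku"])  # navigable, but fields can vary record to record
```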

KEY TAKEAWAYS


Lesson 17: The Untapped Value of Unstructured Data

Most enterprises have decades of unstructured data — emails, support tickets, contracts, product photos, customer call recordings, scanned documents — that's never been systematically analyzed. AI provides the capability to extract value: NLP for text, computer vision for images, speech recognition for audio.

For the exam: a stem mentioning "the company has years of [emails / support tickets / scanned forms / call recordings] but has never used them" is signaling unstructured-data territory and probably testing AI pattern selection.

KEY TAKEAWAYS


Lesson 18: Does AI Need a Lot of Data?

The famous "it depends" answer. Data needs are driven by:

KEY TAKEAWAYS


Lesson 19: Running AI Projects with Small Data

It is possible to run AI with limited data when:

The PM's job is to escalate the small-data approach decision to ensure stakeholders understand the trade-offs (less robust generalization, narrower use case fit).

KEY TAKEAWAYS


Lesson 20: Ground Truth Data

Ground truth data is "data derived from real-world observations that serves as the definitive reference" for evaluating model performance. Ground truth is the standard the model is measured against — without it, you have no objective accuracy metric.

For supervised learning, ground truth typically takes the form of human-labeled data ("this image is a cat"). For unsupervised problems, ground truth may be derived from external signals or expert review.
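
A minimal sketch of why ground truth enables an objective metric: compare model predictions against the labeled reference and compute accuracy. The labels and predictions below are invented for illustration.

```python
# Minimal sketch: ground truth labels as the reference for an objective accuracy metric.
ground_truth = ["cat", "dog", "cat", "cat", "dog"]   # human-labeled reference data
predictions  = ["cat", "dog", "dog", "cat", "dog"]   # model output on the same items

correct = sum(1 for truth, pred in zip(ground_truth, predictions) if truth == pred)
accuracy = correct / len(ground_truth)
print(f"accuracy vs. ground truth: {accuracy:.0%}")  # 80% -- no ground truth, no such number
```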

KEY TAKEAWAYS


Lesson 21: Data Management and Data Management Plans

Data management is the practice of collecting, organizing, mining, and storing organizational data. A Data Management Plan (DMP) documents how data will be handled across the project: collection, storage, access, transformation, retention, and security.

The PM owns ensuring a DMP exists (often a workbook section in CPMAI), is signed off by stakeholders, and aligns with the data life cycle (Lesson 29). For regulated industries, the DMP is also a compliance artifact.

KEY TAKEAWAYS


Module 4: Privacy, Compliance, Roles, and Infrastructure

Lessons 22-32 | The roles, governance, and infrastructure that make data usable and compliant.

Lesson 22: ECO Task III.6 — Check Data Privacy, Compliance, and Access

Privacy, compliance, and access are not Phase III problems — they are requirements-stage checks. The PM coordinates the team to verify that every required data element passes:

This task pulls heavily on Domain I — Trustworthy AI (privacy/security plan, regulatory compliance, accountability documentation).

KEY TAKEAWAYS

PM Oversight Angle


Lesson 23: Data Governance

Data governance is "the set of processes, procedures, and standards that ensure data is accurate, accessible, secure, and used responsibly." It's an organizational capability, not a project artifact — but the project consumes governance policy and contributes new artifacts (data lineage, access logs, etc.) to it.

Key governance components: data ownership (who owns each dataset), data classification (sensitivity tiers), access policies, retention policies, audit requirements.

KEY TAKEAWAYS


Lesson 24: Data Stewardship

Data stewardship is "the practice of ensuring an organization's data is accessible, trustworthy, usable, and secure." Stewardship operationalizes governance — it's the set of practices that make policy real day-to-day.

KEY TAKEAWAYS


Lesson 25: ECO Task III.2 — Identify Data SMEs

Every required dataset has subject-matter experts who know it best. The PM identifies these SMEs early and ensures they're engaged in requirements, source selection, evaluation, and gate decisions. Roles include data stewards, data custodians, data owners, business-domain SMEs, and external SMEs (vendors, consultants).

KEY TAKEAWAYS

PM Oversight Angle


Lesson 26: Data Stewards vs. Data Custodians

These two roles are distinct and commonly confused on the exam:

| Role | Responsibilities | Skill Mix |
|---|---|---|
| Data Steward | Enforces policy. Establishes data lineage, cataloging, monitoring, advocacy. Strategic — works with IT and business. | Data management + soft skills (communication, collaboration) |
| Data Custodian | Safe storage, transfer, and use of data. Administrative — not the data owner. | Operational/technical |

The key distinction: stewards enforce policy and curate; custodians operationally protect. Stewards are strategic and cross-functional; custodians are operational.

💡 Memory Aid — Steward vs. Custodian

Steward = Strategic policy enforcer. Custodian = Custodial protection (storage/transfer/use). "Stewards Set policy, Custodians Carry it."

KEY TAKEAWAYS


Lesson 27: Informational Bias

Three usages of the word "bias" in AI — and only one is "informational bias":

  1. Bias in neural networks — adjustment factor for fine-tuning model performance. Nothing to do with fairness.
  2. Bias vs. variance — model's tendency to underfit or overfit.
  3. Informational bias — overrepresentation or underrepresentation of categories in the data, with fairness implications.

Common types of informational bias: reporting bias (only some aspects recorded), recall bias (recent data weighted more), classification bias (data categorized in ways that misrepresent groups).

To build trustworthy AI, bias must be measurable, monitored, and managed.
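
One way to make representation bias measurable is a simple category-share check over the gathered records, as in the sketch below; the category values and the 15% flag threshold are illustrative assumptions, not a PMI rule.

```python
# Minimal sketch: measuring category representation to surface possible informational bias.
# The categories and the 15% flag threshold are illustrative assumptions.
from collections import Counter

records = ["urban", "urban", "urban", "urban", "rural", "urban", "suburban", "urban"]

counts = Counter(records)
total = len(records)
for category, count in counts.items():
    share = count / total
    flag = "  <- possibly underrepresented" if share < 0.15 else ""
    print(f"{category}: {share:.0%}{flag}")
```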

💡 Memory Aid — Three Biases (NVI)

Neural-network bias (adjustment factor), Variance bias (under/overfit), Informational bias (fairness — the one that matters here).

KEY TAKEAWAYS


Lesson 28: ECO Task III.4 — Coordinate AI Workspace and Infrastructure

The PM coordinates the technical infrastructure required for the project: compute environments, storage platforms, data pipelines, dev/test/prod separation, security controls, and access provisioning. The PM doesn't build any of it — the PM ensures it exists, fits the project, and is ready when needed.

This often means working with platform teams, cloud architects, IT security, and the data engineering team well before the data scientist starts.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 29: The Data Life Cycle

PMI's data life cycle has 10 stages, each with PM-relevant questions:

  1. Generation — How is the data generated?
  2. Collection — How will it be collected? From which sources?
  3. Storage — What storage methods? Safe and accessible?
  4. Access — Who gets access? Who manages access?
  5. Usage — What's the data used for? Specific purposes documented?
  6. Transfer — How is data transferred between systems?
  7. Security — How is data secured? Different levels for different data?
  8. Deletion — When/how is data deleted when no longer needed?
  9. Archival — How is long-term-retained data archived? Still secure and accessible?
  10. Privacy — Privacy policies and regulatory requirements, throughout.

💡 Memory Aid — Data Life Cycle (10 Stages)

Generation → Collection → Storage → Access → Usage → Transfer → Security → Deletion → Archival → Privacy. "Good Cats Sit Around Until Their Suppers Drop And Pause."

KEY TAKEAWAYS


Lesson 30: Data Quality Management

Data quality management is the ongoing process of measuring, improving, and maintaining data quality across the seven dimensions (Lesson 10). It's a practice, not a one-time activity.

For the AI project, quality management means:

KEY TAKEAWAYS


Lesson 31: Analytics — Definition and Scope

Analytics involves "using statistical and other methods to gain insights from data." Analytics is broader than AI — it includes descriptive (what happened), diagnostic (why), predictive (what will happen), and prescriptive (what should we do) analytics. AI overlaps mostly with predictive and prescriptive.

KEY TAKEAWAYS


Lesson 32: Data Science vs. Data Analytics

Closely related, distinct purposes:

| | Data Science | Data Analytics |
|---|---|---|
| Focus | Build predictive/prescriptive models, often involving ML | Analyze historical data for insight |
| Output | Models that make decisions or predictions | Reports, dashboards, recommendations |
| Skill Mix | Statistics + ML + programming + domain | Statistics + business + visualization |
| Tool Bias | Python/R, ML frameworks | SQL, BI tools, statistical packages |

For an AI project, you typically need both — a data scientist to build, a data analyst to monitor and interpret production performance.

KEY TAKEAWAYS


Module 5: Evaluation, the Gate, and Conveying Findings

Lessons 33-38 | Closing out Domain III — evaluation, the go/no-go gate, and reporting to leadership.

Lesson 33: ECO Task III.7 — Oversee Data Evaluation

Once data has been gathered (III.5), the team evaluates it against the requirements (III.1) and the success criteria from Domain II (II.8). The PM doesn't perform the evaluation — the PM oversees it, ensures it's documented, and ensures the result feeds into the gate decision (III.8).

Evaluation typically covers: completeness against requirements, quality across the seven dimensions, alignment with operational data, fitness for the AI pattern, identification of remaining gaps.

KEY TAKEAWAYS

PM Oversight Angle


Lesson 34: Moving Beyond CPMAI Phase II

You're ready to move past Phase II (Domain III work) and into Phase III (Data Preparation) when:

  1. Data requirements are adequately determined.
  2. PMI-CPMAI Workbook items for Phase II have adequate responses (well-defined answers, understanding of known and unknown).
  3. No critical roadblocks remain (data location, quality, format, access, permissions, regulations, compliance).
  4. Key Phase II questions have answers.

Partial answers are okay if they're sufficient to address the Phase I business requirements. The bar is adequate, not complete.

KEY TAKEAWAYS


Lesson 35: ECO Task III.8 — Determine if Data Meets Solution Needs (THE GATE)

This is the first explicit go/no-go gate in the ECO and one of the most heavily tested concepts on the exam. The PM facilitates a documented decision with stakeholders covering three areas:

Data Sources — Do you know what data you need? Have you identified sources, access methods, and ownership? Are there pretrained models you could use to reduce data needs?

Data Description — Have you considered all 4 Vs? Is the quantity, type, change rate, and quality understood? Is the data on-premise / cloud / hybrid known?

Data Quality — Do you know the data's quality? Are labeling/augmentation requirements defined? Is the time/cost to prepare understood? Is a collection/ingestion pipeline defined?

The decision has three outcomes:

  1. GO — Proceed to Phase III (Data Preparation).
  2. ITERATE — Loop back to Phase I or earlier in Phase II to address gaps.
  3. DESCOPE — Reduce project scope to what the available data supports.

PMI's key concept: "If you can confidently say, 'We have the data and know the problem,' move to Phase III. If not, pause here. Clarify these points first to avoid issues later."
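
As a purely illustrative aid (not PMI's wording), the gate can be pictured as a checklist plus a decision record; the mapping from unresolved items to ITERATE vs. DESCOPE in the sketch below is a simplifying assumption made for the example.

```python
# Minimal sketch of a III.8 gate decision record. Checklist items and the simple
# decision rule are illustrative assumptions, not PMI-prescribed logic.
GATE_AREAS = {
    "sources":     ["required data known", "sources identified", "access and ownership defined"],
    "description": ["4 Vs understood", "quantity and type understood", "on-prem/cloud/hybrid known"],
    "quality":     ["quality assessed", "labeling/augmentation defined", "prep time and cost estimated"],
}

def gate_outcome(unresolved: dict[str, list[str]]) -> str:
    """Return GO, ITERATE, or DESCOPE based on unresolved checklist items per area."""
    assert set(unresolved) <= set(GATE_AREAS), "unknown gate area"
    if not any(unresolved.values()):
        return "GO"          # proceed to Phase III (Data Preparation)
    if unresolved.get("sources") or unresolved.get("description"):
        return "ITERATE"     # loop back to Phase I / earlier Phase II to close gaps
    return "DESCOPE"         # remaining quality gaps: shrink scope to what the data supports

print(gate_outcome({"sources": [], "description": [], "quality": ["labeling plan missing"]}))
```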

KEY TAKEAWAYS

💡 Memory Aid — SDQ Gate (Sources, Description, Quality)

Sources known? Description complete (4 Vs)? Quality understood? Three checkboxes for the gate. Three outcomes (Go/Iterate/Descope).

PM Oversight Angle


Lesson 36: The Phase II Go/No-Go Decision

Specific questions PMI's prep course lists for the gate, mapped to the three areas:

Data Sources:

Data Description:

Data Quality:

KEY TAKEAWAYS


Lesson 37: When to Iterate Back to Previous CPMAI Phases

PMI lists 12 scenarios where iterating back to Phase I (Business Understanding) is the right call:

  1. The business problem has shifted since Phase I.
  2. You need data you do not have and cannot reasonably obtain.
  3. You have the wrong type of data and cannot get the right type.
  4. Identified data is too little, and no augmentation is possible.
  5. Identified data is too much, and selecting from it requires rescoping.
  6. Training data and operational data differ materially.
  7. Any of the 4 Vs (Volume, Velocity, Variety, Veracity) creates a roadblock.
  8. Improving data quality would take too long.
  9. Organizational challenges in data collection or ingestion.
  10. Legal, compliance, security, or risk issues block access.
  11. Foundation / pretrained model / GenAI approach requires Phase I changes.
  12. Realizing the project is a Proof of Concept (PoC) rather than a Pilot.

The exam-critical insight: iterating back is not failure — it's the methodology working. The cost of unresolved issues compounds in Phase III/IV/V. CPMAI's iterative design specifically allows backing up "without penalty."

💡 Memory Aid — 12 Iteration Triggers (mnemonic: BIG WORDS)

Business shift, Infeasible data, Gap-too-big quantity; Wrong type / volume too high, Operational mismatch, Roadblock from any V, Data quality time, Stakeholder/legal blocks. (Plus pretrained-model and PoC-vs-pilot misalignment.)

KEY TAKEAWAYS


Lesson 38: ECO Task III.9 — Convey Data Understanding to Leadership

The final Domain III task: the PM communicates Phase II findings to leadership. This is not optional — without the briefing, leadership lacks visibility into data risk and can't make informed decisions about Phase III/IV/V investment.

The conveyance covers: data understanding state, gate decision, key risks identified, iteration recommendations (if any), changes to project scope or schedule that flow from data understanding.

KEY TAKEAWAYS

PM Oversight Angle


Quick Reference: Data Quality Dimensions

| Dimension | What to Check |
|---|---|
| Accuracy | Does the data match reality? |
| Completeness | Are all required fields present? |
| Consistency | Are formats/conventions uniform across sources? |
| Timeliness | Is the data current enough for the use case? |
| Uniqueness | Are duplicates resolved? |
| Validity | Does the data conform to defined rules/schemas? |
| Integrity | Are relationships between data elements maintained? |

Mnemonic: ACCTUVI ("A Cat Caught Two Unwary Voles Inside").
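
For teams that want to automate spot checks, here is a minimal sketch, assuming pandas is installed, that scores a few of these dimensions on a toy dataset; the column names, validity rule, and timeliness cutoff are illustrative assumptions.

```python
# Minimal sketch (assumes pandas is installed): spot checks for a few ACCTUVI
# dimensions on an invented dataset. Columns, rules, and thresholds are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "last_updated": pd.to_datetime(["2025-01-10", "2023-06-01", "2025-02-02", "2025-02-03"]),
})

completeness = df["email"].notna().mean()                      # Completeness: non-null share
uniqueness   = 1 - df["customer_id"].duplicated().mean()       # Uniqueness: share of non-duplicate IDs
validity     = df["email"].str.contains("@", na=False).mean()  # Validity: crude format rule
timeliness   = (df["last_updated"] >= "2024-01-01").mean()     # Timeliness vs. a use-case cutoff

print(f"completeness={completeness:.0%} uniqueness={uniqueness:.0%} "
      f"validity={validity:.0%} timeliness={timeliness:.0%}")
```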


Quick Reference: Domain III Go/No-Go Gate (Task III.8)

Three areas:
| Area | Key Questions |
|---|---|
| Sources | Do you know what data you need? Sources identified? Access defined? Pretrained alternatives considered? Owner identified? |
| Description | 4 Vs understood (Volume, Velocity, Variety, Veracity)? Edge-device needs? On-premise/cloud/hybrid? |
| Quality | Quality known? Labeling/augmentation defined? Time/cost to prep estimated? Pipeline defined and owned? |
Three outcomes: GO (proceed to Phase III) · ITERATE (loop back) · DESCOPE (reduce scope to what data supports)

Decision rule: "We have the data and know the problem" = GO. "We don't" = pause.

Wrong-answer traps to recognize:


Cross-Domain Links


Knowledge Check

Question 1

A project team is preparing to start data collection for a new AI initiative. The data scientist asks the PM what data should be pulled. What should the project manager do?

A. Approve the data scientist's choice and let them begin

B. Direct the data scientist to start with the most accessible internal data

C. Pause and produce a documented data requirements specification tying Phase I success criteria to specific data attributes, sources, and aggregation method, then engage stakeholders to confirm

D. Schedule a project review for the following sprint

Click for answer and rationale

Correct: C

ECO Task III.1 requires the PM to define required data — produce a documented specification — before collection begins. The data scientist's question is signaling a missing requirements step.

  • A wrong: Approves a choice with no documented requirements basis.
  • B wrong: Wrong-answer trap — convenience-based source selection bypasses requirements definition.
  • D wrong: Schedules without addressing the immediate gap.

Question 2

After data evaluation, the team finds that 20% of required features are unavailable from any source. What is the project manager's BEST next step?

A. Have the data scientist engineer proxy features for the missing 20%

B. Cancel the project

C. Conduct a formal go/no-go assessment, document the gap and impact on success criteria, and engage stakeholders to decide between proceed-with-mitigation, iterate, or descope

D. Proceed to data preparation and address the gaps in Phase III

Click for answer and rationale

Correct: C

ECO Task III.8 — the gate. The PM facilitates a documented stakeholder decision. Three outcomes: GO/ITERATE/DESCOPE. C reflects all three.

  • A wrong: Wrong-answer trap — technical workaround that bypasses governance.
  • B wrong: Premature without an impact assessment. Cancellation is a possible outcome of the gate, not a step that skips it.
  • D wrong: Bypasses the gate. Compounds rework in later phases.

Question 3

True or False: A project manager can confirm data privacy and compliance requirements during Phase III (Data Preparation), since that's when the data is actually transformed.

Click for answer and rationale

Correct: FALSE

ECO Task III.6 — privacy/compliance/access checks belong in Phase II (Data Understanding), not Phase III. Compliance is a requirements concern. Discovering a compliance gap in Phase III is rework. Domain I (Trustworthy AI) reinforces this — privacy/security plan (I.1) and regulatory compliance (I.4) run throughout the project, starting in Phase I.

Question 4

The data scientist informs the PM that the available data is materially different from the operational data the model will encounter in production. What should the PM do?

A. Direct the data scientist to use the available data and adjust the model later

B. Iterate back to Phase I to reconsider the business problem and project scope; document the misalignment and engage stakeholders

C. Proceed and let the model's performance in production guide adjustments

D. Have the team augment the available data to better match operational data

Click for answer and rationale

Correct: B

This is one of PMI's 12 documented iteration triggers (Lesson 37, scenario 6: "training data and operational data differ materially"). The methodology specifically allows iterating back to Phase I "without penalty."

  • A wrong: Pushes forward with a known fundamental issue.
  • C wrong: Letting production discover the issue is the worst outcome.
  • D wrong: Wrong-answer trap — augmentation is a technical fix; the misalignment is a Phase I scope question.

Question 5

What's the difference between a data steward and a data custodian?

Click for answer

Steward = strategic, policy-enforcing, cross-functional. Establishes data lineage, cataloging, monitoring, advocacy. Works with both IT and business.

Custodian = operational, administrative. Safe storage, transfer, and use of data. Not the data owner — administrative role over the data.

Mnemonic: "Stewards Set policy, Custodians Carry it."

Question 6

The team identifies that required data exists but is held by a partner organization that requires a Business Associate Agreement (BAA) under HIPAA. The legal team estimates a 6-week delay to execute the BAA. What should the PM do?

A. Proceed with project planning under the assumption the BAA will be signed

B. Replace the partner data with a non-regulated substitute that the data scientist suggests

C. Document the delay as a risk, escalate to leadership with options (proceed-and-wait, iterate to alternative sources, or descope), and engage stakeholders for the decision

D. Cancel the project

Click for answer and rationale

Correct: C

ECO Task III.6 (privacy/compliance) intersects with III.8 (the gate). A multi-week compliance dependency that affects schedule and scope is a leadership decision, not a technical workaround.

  • A wrong: Optimistic and risky — BAAs can fail to execute.
  • B wrong: Wrong-answer trap — substitute selection should be requirements-driven (III.1) and SME-validated (III.2), not data-scientist-suggested.
  • D wrong: Cancellation is a stakeholder decision, not a unilateral PM call.

Question 7

True or False: The PM's deliverable for ECO Task III.9 (Convey Data Understanding to Leadership) is a copy of the data evaluation report.

Click for answer and rationale

Correct: FALSE

The data evaluation report (III.7's deliverable) is an input to III.9, not the output. III.9's deliverable is a leadership briefing artifact: gate decision, risks, recommendations, scope/schedule impacts — packaged for a leadership audience and decision-making, not raw evaluation data.

Question 8

A team is using a pretrained foundation model and only needs a small amount of task-specific data for fine-tuning. The PM is asked whether to skip Domain III's gate (III.8). What should the PM do?

A. Skip the gate — III.8 is for projects with custom training data

B. Run the gate as written — sources, description, and quality questions still apply, even when data needs are smaller

C. Defer the gate to Phase IV when fine-tuning happens

D. Have the data scientist run the gate informally

Click for answer and rationale

Correct: B

The gate's three areas (Sources, Description, Quality) apply regardless of data volume. A small fine-tuning dataset still needs identified sources, described characteristics, and assessed quality — especially for trustworthy-AI considerations.

  • A wrong: Misreads the gate as data-volume-dependent.
  • C wrong: Phase IV gates are different gates (IV.5, IV.6) testing different things.
  • D wrong: Gate decisions are stakeholder-engaged and PM-facilitated, not informally run by the data scientist.

Question 9

The team's data evaluation reveals that 80% of required data has high quality but the remaining 20% is incomplete and would take 3 months to clean. The project's success criteria specify that all required features are needed. What's the PM's BEST move?

A. Proceed to data preparation and clean the 20% in parallel with model development

B. Iterate back to II.8 (success criteria) to determine if the success criteria can be revised to accept partial coverage, OR loop to III.1 to consider alternative data

C. Have the data scientist begin training on the 80% while the 20% is cleaned

D. Approve a 3-month schedule extension

Click for answer and rationale

Correct: B

This is a cross-domain pull. Success criteria are set in II.8. If the full feature set can't be met within tolerance, the right move is to revisit the success criteria and the required-data definition, not to work around the gap technically.

  • A wrong: Parallel work on incomplete data foundation = compounding risk.
  • C wrong: Wrong-answer trap — training on partial data inserts a quality issue into the model.
  • D wrong: Schedule extension without scope or success-criteria review is a unilateral decision that should be stakeholder-engaged.

Question 10

The PM is documenting data sources and notices that one critical source is held by a vendor whose contract expires in 6 months — and renewal is uncertain. What should the PM do?

A. Document the source and proceed; renewal is the procurement team's problem

B. Document the source AND escalate the contract risk for stakeholder visibility, since data availability is a project-level dependency

C. Replace the source proactively with an internal alternative

D. Wait until renewal is decided before completing the source inventory

Click for answer and rationale

Correct: B

Data source dependencies that affect project viability are PM-level concerns. III.3 (source identification) intersects with risk management — escalation is the right move.

  • A wrong: "Procurement's problem" abdicates PM accountability for project dependencies.
  • C wrong: Replacing without stakeholder engagement is a unilateral scope decision.
  • D wrong: Blocks the inventory progress on an external decision.


Memory Aids & Mnemonics Summary

| Mnemonic | What to Remember |
|---|---|
| DRIP (Define Required Data) | Determine pattern, Required attributes, Identify sources, Plan aggregation |
| VVVV (4 Vs of Big Data) | Volume, Velocity, Variety, Veracity |
| ACCTUVI (Data Quality) | Accuracy, Completeness, Consistency, Timeliness, Uniqueness, Validity, Integrity. "A Cat Caught Two Unwary Voles Inside." |
| SCALE (Source Identification) | Source type, Cost & access, Accuracy & cadence, Legal/license, Endpoint |
| Steward vs Custodian | Steward = Strategic policy. Custodian = Custodial protection. "Stewards Set, Custodians Carry." |
| NVI (Three Biases) | Neural-net bias (adjustment), Variance bias (over/underfit), Informational bias (fairness — the exam one) |
| Data Life Cycle (10) | Generation, Collection, Storage, Access, Usage, Transfer, Security, Deletion, Archival, Privacy |
| SDQ Gate (III.8) | Sources known, Description complete, Quality understood. Outcomes: GO / ITERATE / DESCOPE |
| 3 Gate Outcomes | GO (proceed) · ITERATE (loop back) · DESCOPE (reduce scope) |
| 12 Iteration Triggers | Business shift, infeasible data, wrong type, too little / too much, training-vs-operational mismatch, V-roadblock, quality time, org block, legal block, pretrained-model misfit, PoC-vs-pilot misalignment |

Closing reminders for Domain III


Next: domain-V-operationalize.md (PRIORITY 2)