Writing Requirements for AI/ML Features

Contributor
Jun 3

Traditional requirements assume deterministic behavior: given the same input, the system produces the same output. AI features break this assumption. The same prompt may produce different outputs across runs. Models change. Outputs degrade. Requirements written as if the system were deterministic produce false expectations and untrustworthy tests.

This guide explains how to write requirements for AI features that account for non-determinism without giving up on quality.

The Specific Challenge

For a traditional feature: "The system shall calculate sales tax." Given an address and amount, the result is deterministic. Test it once, and the test keeps passing until the code or the tax rules change.

For an AI feature: "The system shall summarize customer feedback." The same feedback can produce a different summary on each run. Both summaries might be acceptable. Neither matches a fixed expected output.

The requirement still exists. The way to verify it has to change.

Specify Behavior Properties, Not Exact Outputs

Requirements should specify properties the output should have, not the output itself.

Bad: "The system shall produce summary X for input Y."

Better: "The system shall produce summaries that:

Mention all sentiment categories present in the input
Are between 50 and 200 words
Don't fabricate facts not present in the input
Maintain professional tone"

Each property is testable, even if the specific output varies.
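
As a minimal sketch, two of the properties above can be written as checks that hold for any acceptable output. The function names are hypothetical, and the substring match for sentiment categories is a naive stand-in for whatever classifier a real pipeline would use.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def check_summary_properties(summary: str, required_categories: list[str]) -> dict[str, bool]:
    """Return one pass/fail flag per property, independent of exact wording."""
    lowered = summary.lower()
    return {
        # Property: summary is between 50 and 200 words.
        "length_50_to_200_words": 50 <= word_count(summary) <= 200,
        # Property: every sentiment category found in the input is mentioned.
        # Assumption: categories are detected upstream and passed in as strings.
        "mentions_all_categories": all(c.lower() in lowered for c in required_categories),
    }
```

Two different summaries of the same feedback can both pass these checks, which is the point: the test asserts the property, not the text.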

Distinguish Hard and Soft Properties

Hard properties: can be verified deterministically.

"Output is JSON"
"Output is between 50 and 200 words"
"Output references only entities mentioned in the input"
"Output doesn't contain explicit content (per detector)"

Soft properties: require subjective evaluation or statistical aggregation.

"Summary captures the main themes"
"Tone is professional"
"Recommendation is relevant"

Both kinds matter. Hard properties get automated checks; soft properties need human evaluation, model-graded evaluation, or aggregate metrics.

Quality Thresholds and Pass Rates

For soft properties, requirements often need percentage thresholds.

Bad: "Summaries are accurate."

Better: "At least 90% of summaries pass human review for accuracy in a sample of 100 representative inputs, with no factual errors of consequence."

The percentage acknowledges non-determinism. Some outputs will be lower-quality; the requirement bounds the rate.
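
A threshold like the one above can be evaluated mechanically once reviewers have recorded their verdicts. The sketch below assumes a hypothetical `ReviewResult` record with two reviewer-supplied flags; the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    passed: bool          # reviewer judged the summary accurate
    critical_error: bool  # reviewer flagged a factual error of consequence


def meets_accuracy_requirement(results: list[ReviewResult], min_pass_rate: float = 0.90) -> bool:
    """True when the pass rate meets the threshold and no critical error occurred."""
    if not results:
        return False  # no evidence means the requirement is not shown to be met
    pass_rate = sum(r.passed for r in results) / len(results)
    no_critical_errors = not any(r.critical_error for r in results)
    return pass_rate >= min_pass_rate and no_critical_errors
```

Note that the two conditions behave differently: the pass rate tolerates a bounded number of weak outputs, while a single critical error fails the requirement outright.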

Test Sets and Golden Examples

For AI features, requirements should reference specific test sets.

Requirement: The summarization system shall produce summaries acceptable
to human reviewers at 90%+ pass rate on the standard evaluation set
(eval-set-v1, 200 examples).

The eval set is part of the requirement. When the eval set changes, requirements may need updating.

Golden examples — known-good outputs for specific inputs — anchor expectations even when most outputs vary. They catch regressions in behavior on the cases that matter most.
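
One way to use golden examples without demanding exact output matches is to record, per input, the phrases an acceptable output must and must not contain. This is a sketch under that assumption; `GoldenCase` and `failing_golden_cases` are hypothetical names, and `generate` stands for whatever function calls the model.

```python
from typing import Callable, NamedTuple


class GoldenCase(NamedTuple):
    case_id: str
    input_text: str
    must_include: tuple[str, ...]
    must_not_include: tuple[str, ...] = ()


def failing_golden_cases(generate: Callable[[str], str], cases: list[GoldenCase]) -> list[str]:
    """Return the ids of golden cases whose output violates its expectations."""
    failures = []
    for case in cases:
        output = generate(case.input_text).lower()
        missing = any(p.lower() not in output for p in case.must_include)
        forbidden = any(p.lower() in output for p in case.must_not_include)
        if missing or forbidden:
            failures.append(case.case_id)
    return failures
```

Running this on every model or prompt change gives a regression signal on the cases that matter most, even though the wording of each output is free to vary.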

Failure Mode Requirements

Specify what failure modes are unacceptable:

"Output shall not contain hallucinated facts (entities not in input)"
"Output shall not include personally identifying information"
"Output shall not produce content from the prohibited content list"
"Confidence below threshold X shall produce an 'I don't know' response, not a guess"

Failure modes are often more important than success criteria. The wrong answer can be worse than no answer.
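
Failure mode requirements usually translate into a guard that runs before any output reaches the user. The sketch below covers two of the items above: abstaining below a confidence threshold and withholding output that contains an email address. The email pattern is a deliberately simple stand-in for a real PII detector, and all names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumption: a simple email regex stands in for a dedicated PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ABSTAIN_MESSAGE = "I don't know"


class GuardedOutput(NamedTuple):
    text: str
    blocked_reason: Optional[str]  # None when the output was allowed through


def guard_output(text: str, confidence: float, threshold: float) -> GuardedOutput:
    """Replace the model output with an abstention when a failure bound is hit."""
    if confidence < threshold:
        return GuardedOutput(ABSTAIN_MESSAGE, "low_confidence")
    if EMAIL_PATTERN.search(text):
        return GuardedOutput(ABSTAIN_MESSAGE, "contains_email")
    return GuardedOutput(text, None)
```

Recording the reason alongside the abstention makes it possible to report how often each failure bound is triggered.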

Latency and Cost Requirements

AI features have non-trivial costs in compute and time.

Requirement: The summarization endpoint shall complete in under 5 seconds
at p95 under normal load (10 RPS).

Requirement: Average cost per summarization shall not exceed $0.02.

These constrain implementation choices (model size, batching, caching).
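
Both requirements can be checked against measurements from a load test. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and the shape of the inputs are assumptions.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_and_cost(
    latencies_s: list[float],
    costs_usd: list[float],
    max_p95_s: float = 5.0,
    max_avg_cost_usd: float = 0.02,
) -> bool:
    """Check p95 latency (strictly under the limit) and average cost per call."""
    p95 = percentile_nearest_rank(latencies_s, 95)
    avg_cost = sum(costs_usd) / len(costs_usd)
    return p95 < max_p95_s and avg_cost <= max_avg_cost_usd
```

The measurements only mean something if they were collected under the load the requirement names, here 10 RPS.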

Drift and Degradation Requirements

AI models drift. Their behavior changes when they are retrained, when underlying APIs change, or when inputs shift.

Requirement: The system shall monitor output quality continuously and
alert when the pass rate drops below 85% over a 7-day window.

Requirement: The system shall pin to a specific model version; upgrades
require explicit re-evaluation against the standard test set.

These requirements address the lifecycle of AI features, not just their initial behavior.
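
The monitoring requirement can be sketched as a rolling-window check over graded outputs. This assumes each graded output is stored as a `(timestamp, passed)` pair; how outputs get graded in production (sampling, human review, model grading) is left open.

```python
from datetime import datetime, timedelta
from typing import Optional


def pass_rate_in_window(
    events: list[tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> Optional[float]:
    """Pass rate over events inside the window ending at `now`; None if there are none."""
    recent = [passed for ts, passed in events if now - window <= ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


def should_alert(events: list[tuple[datetime, bool]], now: datetime, min_pass_rate: float = 0.85) -> bool:
    """True when the 7-day pass rate has dropped below the required minimum."""
    rate = pass_rate_in_window(events, now)
    return rate is not None and rate < min_pass_rate
```

An empty window returns no alert here; a stricter design might treat missing data as its own alert condition.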

Human Oversight Requirements

For high-stakes AI features, human oversight is part of the requirement.

Requirement: Decisions affecting customer accounts shall be reviewable
by a human within 24 hours.

Requirement: When the model's confidence is below threshold, the
recommendation shall be routed to a human for review before action.

Requirement: All AI-driven decisions shall be logged with the model
version, inputs, and outputs for audit.

A requirement that says "the AI does this autonomously" is sometimes the wrong requirement, even when the AI is capable of it. Requiring human oversight is part of responsible deployment.
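
A minimal sketch of the routing and logging requirements follows. The route labels, field names, and JSON format are assumptions chosen for illustration, not a prescribed audit schema.

```python
import json
from datetime import datetime, timezone


def route_recommendation(confidence: float, threshold: float) -> str:
    """Send low-confidence recommendations to a human before any action is taken."""
    return "human_review" if confidence < threshold else "auto"


def audit_record(model_version: str, inputs: str, outputs: str, route: str) -> str:
    """Serialize one decision as a JSON line containing the fields the requirement names."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "inputs": inputs,
            "outputs": outputs,
            "route": route,
        },
        sort_keys=True,
    )
```

Logging the route with each record lets an auditor confirm that low-confidence decisions were in fact reviewed.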

Bias and Fairness Requirements

For AI features that affect people, fairness requirements may apply.

Requirement: Approval rates for the credit decision feature shall not
differ by more than 5 percentage points across protected demographic
groups, measured on the standard evaluation set.

These are easier to state than to satisfy. They typically require dedicated evaluation pipelines.
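
The measurement itself is simple arithmetic; the hard part is assembling a representative evaluation set. This sketch assumes decisions arrive as `(group, approved)` pairs and computes the largest gap in approval rate between any two groups, in percentage points.

```python
from collections import defaultdict


def approval_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Approval rate per group, as a fraction between 0 and 1."""
    totals: dict[str, int] = defaultdict(int)
    approved: dict[str, int] = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def max_gap_percentage_points(decisions: list[tuple[str, bool]]) -> float:
    """Largest difference in approval rate between any two groups, in percentage points."""
    rates = approval_rates(decisions)
    if not rates:
        raise ValueError("no decisions to evaluate")
    return (max(rates.values()) - min(rates.values())) * 100


def meets_fairness_requirement(decisions: list[tuple[str, bool]], max_gap_pp: float = 5.0) -> bool:
    return max_gap_percentage_points(decisions) <= max_gap_pp
```

With small groups the rates are noisy, so a real pipeline would also want a minimum sample size per group before trusting the comparison.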

Privacy Requirements

AI features can leak information.

Requirement: User inputs shall not be included in model training data
without explicit user consent.

Requirement: The system shall not retain prompts containing PII beyond
30 days.

Requirement: Outputs shall not reveal training-set examples or other
users' inputs.

Privacy requirements for AI overlap with general privacy requirements but often need AI-specific phrasing.
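
The retention requirement, for example, can be backed by a periodic job that finds stored prompts due for deletion. The record shape below (`id`, `stored_at`, `contains_pii`) is a hypothetical schema; how PII is detected is out of scope for the sketch.

```python
from datetime import datetime, timedelta


def prompts_past_retention(
    prompts: list[dict],
    now: datetime,
    max_age: timedelta = timedelta(days=30),
) -> list[str]:
    """Return ids of stored prompts that contain PII and are older than the retention limit."""
    return [
        p["id"]
        for p in prompts
        if p["contains_pii"] and now - p["stored_at"] > max_age
    ]
```

A test for the requirement then asserts that this list is empty after the cleanup job has run.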

Capability vs. Behavior Requirements

Distinguish:

"The model can do X" (capability: what the trained model is able to do in principle)
"The system does X" (behavior: what users actually experience)

Bad: "The system uses GPT-4."

Better: "The system shall summarize text up to 100k tokens, producing summaries that pass the standard evaluation."

The capability is the model's; the behavior is the system's. Requirements specify behavior; the model is one component of the system that delivers it.

Versioning

AI features have multiple things that can change:

Underlying model version
Prompt template
Fine-tuning data
Post-processing logic
Evaluation criteria

Requirements should specify which version they apply to, and explicitly require evaluation when versions change.
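
One way to enforce re-evaluation is to fingerprint the full set of versioned components and compare it with the fingerprint recorded at the last evaluation. The sketch below is an assumption about how this could be wired up; the class and field names mirror the list above but are otherwise hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureVersion:
    model_version: str
    prompt_template_version: str
    fine_tuning_data_version: str
    post_processing_version: str
    evaluation_criteria_version: str


def fingerprint(version: FeatureVersion) -> str:
    """Stable hash over every versioned component of the feature."""
    payload = json.dumps(asdict(version), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_reevaluation(current: FeatureVersion, last_evaluated_fingerprint: str) -> bool:
    """True when any component changed since the last recorded evaluation."""
    return fingerprint(current) != last_evaluated_fingerprint
```

A deployment gate can then refuse to ship any configuration whose fingerprint has no evaluation result attached.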

Acceptance Criteria for AI Features

Acceptance criteria typically include:

Behavior properties with verification methods
Test set pass rates with thresholds
Failure mode bounds (none of these should happen)
Performance targets (latency, cost)
Monitoring and alerting for drift
Audit and logging requirements
Fallback behavior when AI fails

This is more complex than traditional acceptance criteria, because the features themselves are more complex.

A Worked Example

For a customer support response suggestion feature:

Functional requirement:
The system shall suggest response drafts for customer support agents
based on incoming customer messages.

Acceptance criteria:

Behavior:
- Suggestions shall be in the same language as the customer message
- Suggestions shall not exceed 300 words
- Suggestions shall not reference customers other than the one in the message
- Suggestions shall not promise specific resolutions (refunds, etc.) without agent edit

Quality:
- At least 80% of suggestions shall be rated "usable with light edits" or 
  better by reviewing agents on the standard 200-example evaluation set
- Less than 2% shall contain factual errors (incorrect product info, made-up
  policies)
- 0% shall contain explicit or offensive content

Performance:
- Suggestion shall be generated within 3 seconds at p95
- Cost shall not exceed $0.01 per suggestion

Monitoring:
- Output quality shall be monitored continuously via agent acceptance rate
- An alert shall fire when the acceptance rate drops below 60% over a 24-hour window

Failure handling:
- If the model is unavailable, no suggestion shall be shown (no fallback to
  lower-quality suggestions)
- Errors shall not block the agent's normal workflow

Audit:
- All suggestions shall be logged with model version, prompt, response,
  and whether the agent accepted/edited/rejected

This is more detailed than a typical functional requirement, but it captures what's actually expected of the feature.
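
Parts of these criteria can be turned directly into checks. The sketch below covers the word limit, a naive phrase-based check for resolution promises, and the three quality thresholds. The phrase list and the rating record shape are assumptions for illustration; a real check for promised resolutions would need something more robust than substring matching.

```python
# Assumption: a short, hypothetical phrase list stands in for a real promise detector.
PROMISE_PHRASES = ("we will refund", "you will receive a refund", "we guarantee")


def check_suggestion_behavior(suggestion: str) -> dict[str, bool]:
    """Hard checks that can run automatically on every suggestion."""
    lowered = suggestion.lower()
    return {
        "max_300_words": len(suggestion.split()) <= 300,
        "no_resolution_promise": not any(p in lowered for p in PROMISE_PHRASES),
    }


def meets_quality_criteria(ratings: list[dict]) -> bool:
    """Soft checks aggregated over reviewer ratings on the evaluation set.

    Each rating is a dict with boolean keys 'usable', 'factual_error', 'offensive'.
    """
    if not ratings:
        return False
    n = len(ratings)
    usable_rate = sum(r["usable"] for r in ratings) / n
    error_rate = sum(r["factual_error"] for r in ratings) / n
    offensive_count = sum(r["offensive"] for r in ratings)
    return usable_rate >= 0.80 and error_rate < 0.02 and offensive_count == 0
```

The behavior checks run on every suggestion in production, while the quality check runs on the 200-example evaluation set whenever the model or prompt changes.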

Communicating to Stakeholders

Stakeholders unfamiliar with AI often have unrealistic expectations.

Helpful framing:

AI is statistical, not deterministic
Quality is measured as pass rates, not pass/fail
Drift requires ongoing monitoring
Edge cases are abundant

Setting realistic expectations early prevents disappointment later.

Key Takeaway

AI requirements differ from traditional requirements because the underlying behavior is non-deterministic. Specify behavior properties, not exact outputs. Distinguish hard properties (deterministically verifiable) from soft ones (which need human, model-graded, or aggregate evaluation). Use test sets and pass rates. Include failure mode bounds, performance targets, monitoring, fallback behavior, and audit requirements. The complexity reflects the underlying reality: AI features need different acceptance criteria than traditional ones, and shortcuts produce false confidence.

ShiftQuality