Tutorial 5: Chain-of-Thought Prompts
- Contributor
- Jun 13
- 3 min read
Chain-of-thought (CoT) prompting asks the model to reason through a problem before answering. For complex, multi-step tasks, that reasoning step often improves accuracy noticeably. This tutorial walks through how to use it.
What You'll Build
Prompts that use CoT effectively for reasoning-heavy tasks.
Step 1: Identify When CoT Helps (10 min)
CoT helps for:
Math and logic problems
Multi-step decisions
Tasks requiring analysis before conclusion
Problems where the model often gets the answer wrong
CoT doesn't help much for:
Simple lookups
Tasks where the direct answer is already correct
Open-ended generation tasks such as creative writing
Step 2: The Basic CoT Pattern (5 min)
Add to your prompt: "Think step by step before answering."
You are a customer service analyst. Classify each ticket by
priority (P1-P4).
Think step by step: identify the impact, urgency, and customer
tier. Then assign the priority.
Ticket: {ticket_text}
The model writes its reasoning, then the conclusion.
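As a minimal sketch, the prompt above can be kept as a reusable template and filled in per ticket. The names `COT_TEMPLATE` and `build_cot_prompt`, and the sample ticket, are illustrative assumptions rather than part of any library:

```python
# Hypothetical template holding the basic CoT prompt from this step.
COT_TEMPLATE = (
    "You are a customer service analyst. Classify each ticket by\n"
    "priority (P1-P4).\n\n"
    "Think step by step: identify the impact, urgency, and customer\n"
    "tier. Then assign the priority.\n\n"
    "Ticket: {ticket_text}"
)


def build_cot_prompt(ticket_text: str) -> str:
    """Fill one ticket into the chain-of-thought template."""
    return COT_TEMPLATE.format(ticket_text=ticket_text)


# Made-up ticket, used only to show the assembled prompt.
prompt = build_cot_prompt("Checkout page returns a 500 error for all users.")
print(prompt)
```

Keeping the instruction text in one place makes it easier to compare a CoT version against a plain version later.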
Step 3: Few-Shot With CoT (15 min)
For complex reasoning, examples with visible reasoning teach the pattern:
Example 1:
Ticket: "Production database is down; all customers affected."
Reasoning:
- Impact: All customers — severe
- Urgency: Service outage — immediate
- Customer tier: Affects all tiers including enterprise
- All three factors at maximum
Priority: P1
Example 2:
Ticket: "Sales rep can't access their dashboard."
Reasoning:
- Impact: Single user affected — limited
- Urgency: Work blocked but workarounds may exist
- Customer tier: Internal user
- Limited impact, internal user
Priority: P3
The model then patterns its own reasoning on these examples.
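One way to assemble such a prompt is to store the worked examples as data and render them in a fixed layout. This is a sketch under assumptions: the `EXAMPLES` structure, the helper names, and the closing instruction line are all hypothetical choices, and the example wording is paraphrased from the two tickets above.

```python
# Worked examples stored as data; the structure is an illustrative assumption.
EXAMPLES = [
    {
        "ticket": "Production database is down; all customers affected.",
        "reasoning": [
            "Impact: all customers, severe",
            "Urgency: service outage, immediate",
            "Customer tier: affects all tiers including enterprise",
        ],
        "priority": "P1",
    },
    {
        "ticket": "Sales rep can't access their dashboard.",
        "reasoning": [
            "Impact: single user affected, limited",
            "Urgency: work blocked but workarounds may exist",
            "Customer tier: internal user",
        ],
        "priority": "P3",
    },
]


def format_example(index, example):
    """Render one example with its visible reasoning."""
    ticket = example["ticket"]
    lines = [f"Example {index}:", f'Ticket: "{ticket}"', "Reasoning:"]
    lines += [f"- {step}" for step in example["reasoning"]]
    lines.append("Priority: " + example["priority"])
    return "\n".join(lines)


def build_few_shot_prompt(ticket_text, examples=EXAMPLES):
    """Join the examples, then leave the new ticket open at 'Reasoning:'."""
    shots = [format_example(i, ex) for i, ex in enumerate(examples, start=1)]
    new_case = f'Now classify this ticket the same way.\nTicket: "{ticket_text}"\nReasoning:'
    return "\n\n".join(shots + [new_case])


few_shot_prompt = build_few_shot_prompt("Invoice PDF export fails for one customer.")
```

Ending the prompt at "Reasoning:" nudges the model to continue in the same layout as the examples.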
Step 4: Structured Reasoning (15 min)
For more reliable reasoning, use structured templates:
For each ticket, work through:
1. **What's broken?** [identify the actual issue]
2. **Who's affected?** [scope of impact]
3. **Is there a workaround?** [yes/no]
4. **Customer tier?** [enterprise/standard/free]
5. **Priority decision:** [P1/P2/P3/P4]
6. **Justification:** [one sentence]
The structure ensures the model considers each factor.
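A structured template also makes the response checkable. The sketch below assumes the model echoes the numbered, bold-labelled lines from the template; the field names, the regular expression, and the sample response are illustrative assumptions, not a fixed format.

```python
import re

# Hypothetical short names for the six numbered template lines.
FIELD_NAMES = {
    1: "issue",
    2: "scope",
    3: "workaround",
    4: "tier",
    5: "priority",
    6: "justification",
}
# Matches lines such as: 3. **Is there a workaround?** no
LINE_PATTERN = re.compile(r"^[ \t]*(\d+)\.[ \t]*\*\*.+?\*\*[ \t]*(.*)$", re.MULTILINE)


def parse_structured_reasoning(response):
    """Return the six fields as a dict, or raise if any is missing or invalid."""
    fields = {}
    for number, value in LINE_PATTERN.findall(response):
        name = FIELD_NAMES.get(int(number))
        if name:
            fields[name] = value.strip()
    missing = [name for name in FIELD_NAMES.values() if not fields.get(name)]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    if fields["priority"] not in {"P1", "P2", "P3", "P4"}:
        raise ValueError("Unexpected priority: " + fields["priority"])
    return fields


# Made-up model response, used only to exercise the parser.
sample_response = """1. **What's broken?** Login page times out
2. **Who's affected?** All users in one region
3. **Is there a workaround?** no
4. **Customer tier?** enterprise
5. **Priority decision:** P1
6. **Justification:** Enterprise users are locked out with no workaround."""

fields = parse_structured_reasoning(sample_response)
```

Rejecting responses with a skipped step is one way to confirm that each factor was actually considered.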
Step 5: Hidden CoT (10 min)
Sometimes you want the reasoning but not in the final output. Two patterns:
Pattern A: Two-stage:
# call_llm is a placeholder for your own model client wrapper.
# Stage 1: Get reasoning
reasoning = call_llm(f"Analyze this ticket. Think step by step: {ticket}")
# Stage 2: Get clean output
final = call_llm(f"Given this analysis: {reasoning}\n\nOutput priority only.")
Pattern B: Markers:
[Reasoning - will be stripped]
{reasoning steps}
[Final Answer]
{just the priority}
Then parse out the final answer section.
Hidden CoT gives you the benefits without cluttering the output.
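For Pattern B, parsing can be as simple as splitting on the marker. This is a minimal sketch that assumes the model reproduces the `[Final Answer]` marker exactly; `extract_final_answer` is a hypothetical helper name.

```python
FINAL_MARKER = "[Final Answer]"


def extract_final_answer(response):
    """Return the text after the last [Final Answer] marker."""
    # rpartition splits on the last occurrence, in case the reasoning
    # itself happens to mention the marker.
    _, marker, tail = response.rpartition(FINAL_MARKER)
    if not marker:
        raise ValueError("Response has no [Final Answer] section")
    return tail.strip()


# Made-up response in the Pattern B layout.
sample = (
    "[Reasoning - will be stripped]\n"
    "Single user affected, workaround exists.\n"
    "[Final Answer]\n"
    "P3\n"
)
final_answer = extract_final_answer(sample)
```

Raising on a missing marker, rather than returning the whole response, keeps stray reasoning text out of downstream systems.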
Step 6: Self-Consistency (advanced, 20 min)
For high-stakes decisions, run CoT multiple times and pick the majority answer:
from collections import Counter

def consistent_answer(prompt, n=5):
    answers = []
    for _ in range(n):
        response = call_llm(prompt, temperature=0.7)
        answer = extract_answer(response)
        answers.append(answer)
    # Pick the most common
    return Counter(answers).most_common(1)[0][0]
Each call's reasoning differs slightly; the majority answer is typically more accurate than any single sample.
This costs more (N× the tokens), but for critical decisions it can be worth it.
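A possible extension of the majority vote above is to also report how strongly the samples agree, so that split votes can be handled differently. This is a sketch, not part of the original pattern: the function name and the sample answers are assumptions, and what to do with a low agreement score is left to you.

```python
from collections import Counter


def majority_with_agreement(answers):
    """Return (winning_answer, share_of_votes) for a non-empty list of answers."""
    if not answers:
        raise ValueError("Need at least one answer to vote on")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Five made-up extracted answers from five CoT runs.
answer, agreement = majority_with_agreement(["P2", "P2", "P1", "P2", "P3"])
print(answer, agreement)
```

A low share suggests the runs disagree, which may be a reason to review the ticket manually instead of trusting the vote.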
Step 7: When to Avoid CoT (5 min)
CoT increases tokens (cost) and latency. Don't use it for:
High-volume simple classification
Real-time, user-facing features where latency matters
Tasks the model handles well without reasoning
Sometimes simpler prompts produce better outcomes faster.
Step 8: Combine With Structured Output (15 min)
CoT + JSON output:
Analyze this ticket. Output JSON with:
- reasoning: array of analysis steps
- priority: P1/P2/P3/P4
- justification: one-sentence explanation
The reasoning array should have one element per consideration.
The reasoning is captured, but in a parseable form.
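On the receiving side, the JSON can be parsed and checked before it is used. The sketch below assumes the model returns a bare JSON object with the three keys named in the prompt; `parse_cot_json` and the sample payload are hypothetical.

```python
import json

VALID_PRIORITIES = {"P1", "P2", "P3", "P4"}


def parse_cot_json(raw):
    """Parse the model output and check the three expected fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, list) or not reasoning:
        raise ValueError("reasoning must be a non-empty array")
    if data.get("priority") not in VALID_PRIORITIES:
        raise ValueError("priority must be one of P1-P4")
    justification = data.get("justification")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("justification must be a non-empty string")
    return data


# Made-up model output in the requested shape.
raw_output = json.dumps({
    "reasoning": ["Impact: one team", "Urgency: workaround exists"],
    "priority": "P3",
    "justification": "Limited impact with a known workaround.",
})
parsed = parse_cot_json(raw_output)
```

Validating the fields up front turns a malformed response into an explicit error rather than a silent misclassification.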
Step 9: Test Effectiveness (15 min)
Compare:
# Run eval cases without CoT
results_simple = run_eval(eval_cases, prompt_simple)
# Run eval cases with CoT
results_cot = run_eval(eval_cases, prompt_cot)
# Compare accuracy
print(f"Without CoT: {results_simple.accuracy}")
print(f"With CoT: {results_cot.accuracy}")
print(f"Token cost ratio: {results_cot.tokens / results_simple.tokens}")
For your task, does CoT help enough to justify the extra cost?
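The comparison above leaves `run_eval` undefined. One possible shape for it is sketched below, under stated assumptions: each eval case is a dict with `ticket` and `expected` keys, `call_llm` returns a `(text, tokens_used)` pair, and the model call and answer extractor are passed in explicitly so the sketch is self-contained. None of this is prescribed by the tutorial.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    correct: int
    total: int
    tokens: int

    @property
    def accuracy(self):
        """Fraction of cases answered correctly (0.0 for an empty run)."""
        return self.correct / self.total if self.total else 0.0


def run_eval(eval_cases, prompt_template, call_llm, extract_answer):
    """Run every case through the prompt and tally accuracy and token use."""
    correct = 0
    tokens = 0
    for case in eval_cases:
        prompt = prompt_template.format(ticket_text=case["ticket"])
        text, used = call_llm(prompt)  # assumed to return (text, tokens_used)
        tokens += used
        if extract_answer(text) == case["expected"]:
            correct += 1
    return EvalResult(correct=correct, total=len(eval_cases), tokens=tokens)


# Stand-in model that always answers P1, so the sketch runs offline.
def fake_llm(prompt):
    return "Priority: P1", 10


cases = [
    {"ticket": "Database is down", "expected": "P1"},
    {"ticket": "Typo on the pricing page", "expected": "P4"},
]
result = run_eval(cases, "Ticket: {ticket_text}", fake_llm, lambda t: t.split(": ")[1])
print(result.accuracy, result.tokens)
```

Swapping `fake_llm` for a real client wrapper and running the same cases with both prompts gives the accuracy and token figures needed for the comparison.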
Step 10: Debug Bad CoT Output (varies)
When CoT produces wrong answers:
Read the reasoning. Where did it go wrong?
Often the reasoning is fine but the conclusion doesn't follow
Sometimes the model invents constraints
Sometimes it makes arithmetic errors
The reasoning is visible. Use it to debug the prompt.
What You Just Did
You added CoT to your prompting toolkit. For reasoning-heavy tasks, it often improves accuracy enough to justify the extra tokens. For simple tasks, you know to skip it.
Common Failure Modes
CoT on simple tasks. Adds cost without benefit.
Reasoning ignored. Generated but not validated.
Inconsistent reasoning. Each call reasons differently. Use self-consistency.
Wrong format. Reasoning interferes with structured output parsing.
No comparison. Never measuring whether CoT actually helps your specific task.
Next Tutorial
Code generation is a special case. See Tutorial 6: Prompting for Code Generation.
Related reading
Keep learning. This article is part of the AI in Quality & Delivery path in the ShiftQuality Learning Center. Use AI in delivery — and evaluate it honestly — without the hype.


