
Tutorial 5: Chain-of-Thought Prompts

  • Contributor
  • Jun 13
  • 3 min read

Chain-of-thought (CoT) prompting asks the model to reason through a problem before answering. For complex tasks, this reasoning step can substantially improve accuracy. This tutorial walks through how to use it.

What You'll Build

Prompts that use CoT effectively for reasoning-heavy tasks.

Step 1: Identify When CoT Helps (10 min)

CoT helps for:

  • Math and logic problems

  • Multi-step decisions

  • Tasks requiring analysis before conclusion

  • Problems where the model often answers incorrectly when asked directly

CoT doesn't help much for:

  • Simple lookups

  • Tasks where the direct answer is already correct

  • Open-ended generation tasks, such as creative writing

Step 2: The Basic CoT Pattern (5 min)

Add to your prompt: "Think step by step before answering."

You are a customer service analyst. Classify each ticket by 
priority (P1-P4).

Think step by step: identify the impact, urgency, and customer 
tier. Then assign the priority.

Ticket: {ticket_text}

The model writes its reasoning, then the conclusion.
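If you build prompts in code, the pattern above can be kept as a template. The following is a minimal sketch; the names `COT_TEMPLATE` and `build_cot_prompt` are illustrative assumptions, not part of any library.

```python
# Hypothetical template; the wording mirrors the prompt shown above.
COT_TEMPLATE = (
    "You are a customer service analyst. Classify each ticket by "
    "priority (P1-P4).\n\n"
    "Think step by step: identify the impact, urgency, and customer "
    "tier. Then assign the priority.\n\n"
    "Ticket: {ticket_text}"
)


def build_cot_prompt(ticket_text: str) -> str:
    """Fill the ticket text into the CoT template."""
    return COT_TEMPLATE.format(ticket_text=ticket_text.strip())


prompt = build_cot_prompt("Checkout page returns a 500 error for all users.")
print(prompt)
```

Keeping the instruction text in one place makes it easier to compare a CoT version of the prompt against a version without it later.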

Step 3: Few-Shot With CoT (15 min)

For complex reasoning, examples with visible reasoning teach the pattern:

Example 1:

Ticket: "Production database is down; all customers affected."

Reasoning: 
- Impact: All customers — severe
- Urgency: Service outage — immediate
- Customer tier: Affects all tiers including enterprise
- All three factors at maximum

Priority: P1

Example 2:

Ticket: "Sales rep can't access their dashboard."

Reasoning:
- Impact: Single user affected — limited
- Urgency: Work blocked but workarounds may exist
- Customer tier: Internal user
- Limited impact, internal user

Priority: P3

The model then patterns its own reasoning on these examples.
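One way to keep the examples maintainable is to store them as data and render the few-shot prompt from them. This is a sketch under assumptions: the `Example` class, the helper names, and the closing instruction line are all hypothetical, and the example content paraphrases the two tickets above.

```python
from dataclasses import dataclass


@dataclass
class Example:
    ticket: str
    reasoning: list[str]
    priority: str


# Content paraphrased from the two worked examples above.
EXAMPLES = [
    Example(
        ticket="Production database is down; all customers affected.",
        reasoning=[
            "Impact: all customers, severe",
            "Urgency: service outage, immediate",
            "Customer tier: affects all tiers including enterprise",
            "All three factors at maximum",
        ],
        priority="P1",
    ),
    Example(
        ticket="Sales rep can't access their dashboard.",
        reasoning=[
            "Impact: single user affected, limited",
            "Urgency: work blocked but workarounds may exist",
            "Customer tier: internal user",
            "Limited impact, internal user",
        ],
        priority="P3",
    ),
]


def format_example(index: int, ex: Example) -> str:
    """Render one example in the Ticket / Reasoning / Priority layout."""
    steps = "\n".join(f"- {step}" for step in ex.reasoning)
    return (
        f"Example {index}:\n\n"
        f'Ticket: "{ex.ticket}"\n\n'
        f"Reasoning:\n{steps}\n\n"
        f"Priority: {ex.priority}"
    )


def build_few_shot_prompt(ticket_text: str, examples=EXAMPLES) -> str:
    """Join the examples, then leave 'Reasoning:' open for the model."""
    shots = "\n\n".join(
        format_example(i, ex) for i, ex in enumerate(examples, start=1)
    )
    return (
        f"{shots}\n\n"
        "Now classify this ticket the same way.\n\n"
        f'Ticket: "{ticket_text}"\n\n'
        "Reasoning:"
    )


print(build_few_shot_prompt("Login page is slow for one customer."))
```

Ending the prompt on "Reasoning:" nudges the model to continue in the same layout as the examples.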

Step 4: Structured Reasoning (15 min)

For more reliable reasoning, use structured templates:

For each ticket, work through:

1. **What's broken?** [identify the actual issue]
2. **Who's affected?** [scope of impact]
3. **Is there a workaround?** [yes/no]
4. **Customer tier?** [enterprise/standard/free]
5. **Priority decision:** [P1/P2/P3/P4]
6. **Justification:** [one sentence]

The structure pushes the model to consider each factor before it commits to a priority.
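A template only helps if the model actually fills in every section, so it can be worth checking the response. The sketch below assumes the model echoes the numbered, bolded labels from the template; the function name and the regular expression are illustrative, not a standard API.

```python
import re

# Labels copied from the template above.
REQUIRED_FIELDS = [
    "What's broken?",
    "Who's affected?",
    "Is there a workaround?",
    "Customer tier?",
    "Priority decision:",
    "Justification:",
]

# Matches lines like: 1. **What's broken?** Login API returns 500
FIELD_PATTERN = re.compile(
    r"^[ \t]*\d+\.[ \t]*\*\*(.+?)\*\*[ \t]*(.*)$", re.MULTILINE
)


def parse_structured_reasoning(response: str) -> dict[str, str]:
    """Map each template label to the model's answer for it.

    Raises ValueError if any required section is missing.
    """
    fields = {
        label.strip(): value.strip()
        for label, value in FIELD_PATTERN.findall(response)
    }
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"Response skipped required fields: {missing}")
    return fields


sample = """1. **What's broken?** Login API returns 500 errors
2. **Who's affected?** All users
3. **Is there a workaround?** no
4. **Customer tier?** enterprise
5. **Priority decision:** P1
6. **Justification:** Full outage with no workaround."""

parsed = parse_structured_reasoning(sample)
print(parsed["Priority decision:"])
```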

Step 5: Hidden CoT (10 min)

Sometimes you want the reasoning but not in the final output. Two patterns:

Pattern A: Two-stage

# call_llm is a placeholder for your own LLM client wrapper.

# Stage 1: Get the reasoning
reasoning = call_llm(f"Analyze this ticket. Think step by step: {ticket}")

# Stage 2: Get clean output based on that reasoning
final = call_llm(f"Given this analysis: {reasoning}\n\nOutput priority only.")

Pattern B: Markers

[Reasoning - will be stripped]
{reasoning steps}
[Final Answer]
{just the priority}

Then parse out the final answer section.
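Parsing out the final answer for Pattern B can be a few lines of string handling. This is a minimal sketch that assumes the model reproduces the `[Final Answer]` marker exactly; `extract_final_answer` is a hypothetical helper name.

```python
FINAL_MARKER = "[Final Answer]"


def extract_final_answer(response: str) -> str:
    """Return only the text after the last [Final Answer] marker."""
    _, marker, tail = response.rpartition(FINAL_MARKER)
    if not marker:
        # rpartition returns an empty separator when the marker is absent.
        raise ValueError("No [Final Answer] marker found in response")
    return tail.strip()


response = (
    "[Reasoning - will be stripped]\n"
    "Impact is high and there is no workaround.\n"
    "[Final Answer]\n"
    "P1\n"
)
print(extract_final_answer(response))
```

Raising on a missing marker, instead of returning the whole response, keeps stray reasoning text from leaking into downstream systems.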

Hidden CoT gives you the accuracy benefits of reasoning without cluttering the output.

Step 6: Self-Consistency (advanced, 20 min)

For high-stakes decisions, run CoT multiple times and pick the majority answer:

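from collections import Counter


def consistent_answer(prompt, n=5):
    """Sample n CoT responses and return the most common answer.

    call_llm and extract_answer are placeholders for your own helpers.
    """
    answers = []
    for _ in range(n):
        # Non-zero temperature so each run can reason differently
        response = call_llm(prompt, temperature=0.7)
        answers.append(extract_answer(response))

    # Pick the most common answer
    return Counter(answers).most_common(1)[0][0]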

Each call's reasoning differs slightly, and the majority answer is typically more accurate than any single run.

This costs roughly N× the tokens, but for critical decisions it can be worth it.
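The function above relies on an `extract_answer` helper that the tutorial leaves undefined. One possible version is sketched below, assuming each response states its conclusion as "Priority: P1" through "Priority: P4". The `majority_vote` helper is an added assumption that also reports how strongly the runs agree, which can be used to flag low-confidence tickets for human review.

```python
import re
from collections import Counter
from typing import Optional

# Assumes responses state the conclusion as "Priority: P1" .. "Priority: P4".
PRIORITY_PATTERN = re.compile(r"Priority:\s*(P[1-4])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the last priority mentioned, or None if there is none."""
    matches = PRIORITY_PATTERN.findall(response)
    return matches[-1] if matches else None


def majority_vote(responses: list[str]) -> tuple[Optional[str], float]:
    """Return (winning answer, share of all responses that agreed)."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(responses)


responses = [
    "Impact is severe.\nPriority: P1",
    "Only one user is affected.\nPriority: P2",
    "Outage for everyone.\nPriority: P1",
    "I am not sure how to classify this.",
]
print(majority_vote(responses))
```

Unparseable responses are excluded from the vote but still count in the denominator, so a low agreement share reflects both disagreement and formatting failures.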

Step 7: When to Avoid CoT (5 min)

CoT increases tokens (cost) and latency. Don't use it for:

  • High-volume simple classification

  • Real-time, user-facing features where latency matters

  • Tasks the model handles well without reasoning

Sometimes simpler prompts produce better outcomes faster.

Step 8: Combine With Structured Output (15 min)

CoT + JSON output:

Analyze this ticket. Output JSON with:
- reasoning: array of analysis steps
- priority: P1/P2/P3/P4
- justification: one-sentence explanation

The reasoning array should have one element per consideration.

The reasoning is still captured, but in a parseable form.
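On the receiving side, the JSON can be parsed and checked before it is used. The sketch below uses only the standard `json` module; `parse_cot_json` and the specific validation rules are assumptions based on the three fields requested in the prompt above.

```python
import json

VALID_PRIORITIES = {"P1", "P2", "P3", "P4"}


def parse_cot_json(raw: str) -> dict:
    """Parse the model's JSON and check the three requested fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, list) or not all(
        isinstance(step, str) for step in reasoning
    ):
        raise ValueError("'reasoning' must be an array of strings")

    priority = data.get("priority")
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        raise ValueError(f"'priority' must be one of {sorted(VALID_PRIORITIES)}")

    if not isinstance(data.get("justification"), str):
        raise ValueError("'justification' must be a string")
    return data


raw = """{
  "reasoning": ["All customers affected", "No workaround available"],
  "priority": "P1",
  "justification": "Full outage with no workaround."
}"""
result = parse_cot_json(raw)
print(result["priority"], len(result["reasoning"]))
```

Listing `reasoning` before `priority` in the requested schema matters: the model generates fields in order, so the reasoning is written before the priority is chosen.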

Step 9: Test Effectiveness (15 min)

Compare accuracy and token cost with and without CoT:

# Run eval cases without CoT
results_simple = run_eval(eval_cases, prompt_simple)

# Run eval cases with CoT
results_cot = run_eval(eval_cases, prompt_cot)

# Compare accuracy
print(f"Without CoT: {results_simple.accuracy}")
print(f"With CoT: {results_cot.accuracy}")
print(f"Token cost ratio: {results_cot.tokens / results_simple.tokens}")

For your task, does CoT help enough to justify the extra cost?
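The snippet above assumes a `run_eval` helper that returns an object with `accuracy` and `tokens`. One way such a helper could look is sketched below. Everything here is hypothetical: the `EvalResult` class, the extra `call_model` parameter (added so the model call can be swapped for a stub), and the assumption that each eval case is a `(ticket_text, expected_priority)` pair.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    accuracy: float
    tokens: int


def run_eval(
    eval_cases: list[tuple[str, str]],
    prompt_template: str,
    call_model: Callable[[str], tuple[str, int]],
) -> EvalResult:
    """Score a prompt template against labeled cases.

    call_model is a hypothetical callable that takes the filled prompt and
    returns (predicted_priority, tokens_used).
    """
    if not eval_cases:
        raise ValueError("eval_cases must not be empty")
    correct = 0
    tokens = 0
    for ticket_text, expected in eval_cases:
        prompt = prompt_template.format(ticket_text=ticket_text)
        predicted, used = call_model(prompt)
        correct += predicted == expected
        tokens += used
    return EvalResult(accuracy=correct / len(eval_cases), tokens=tokens)


# Stub model for a dry run: always answers P1, counts words as "tokens".
def fake_model(prompt: str) -> tuple[str, int]:
    return "P1", len(prompt.split())


cases = [("db down", "P1"), ("typo on page", "P4")]
result = run_eval(cases, "Ticket: {ticket_text}", fake_model)
print(result)
```

Running both prompt variants through the same cases and the same scoring function keeps the comparison fair.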

Step 10: Debug Bad CoT Output (varies)

When CoT produces wrong answers:

  • Read the reasoning. Where did it go wrong?

  • Often the reasoning is fine but the conclusion doesn't follow

  • Sometimes the model invents constraints

  • Sometimes it makes arithmetic errors

The reasoning is visible. Use it to debug the prompt.

What You Just Did

You added CoT to your prompting toolkit. For reasoning-heavy tasks, it can improve accuracy significantly. For simple tasks, you know to skip it.

Common Failure Modes

CoT on simple tasks. Adds cost without benefit.

Reasoning ignored. The reasoning is generated but never read or validated.

Inconsistent reasoning. Each call reasons differently. Use self-consistency.

Wrong format. Reasoning interferes with structured output parsing.

No comparison. Nobody measures whether CoT actually helps the specific task.

Next Tutorial

Code generation is a special case. See Tutorial 6: Prompting for Code Generation.

Keep learning. This article is part of the AI in Quality & Delivery path in the ShiftQuality Learning Center. Use AI in delivery — and evaluate it honestly — without the hype.
