Fairness Metrics: What They Measure and Where They Conflict

Contributor
Sep 8, 2025

Updated: Jun 22

You built a model. You know responsible AI practice says to measure fairness across subgroups. You open the documentation for a fairness toolkit and encounter a dozen metrics: demographic parity, equalized odds, predictive parity, calibration, individual fairness, counterfactual fairness. Each one has a mathematical definition. Each one encodes a different concept of what "fair" means.

Here is the uncomfortable truth that most introductions to AI fairness avoid: these definitions conflict. Not just in emphasis — mathematically. You cannot satisfy all of them simultaneously except in trivial cases. Choosing a fairness metric is not a technical decision. It is a values decision about which kind of unfairness you are willing to accept.

This post explains the major fairness metrics, what they actually measure, and why the conflicts between them force you to make choices that no algorithm can make for you.

Demographic Parity

Demographic parity — sometimes called statistical parity — requires that the model's positive prediction rate is the same across groups. If 30% of Group A receives a positive outcome, then 30% of Group B should too.

This is the simplest fairness criterion and the one that most closely maps to intuitive notions of equal treatment. If a hiring model recommends 40% of male applicants but only 20% of female applicants for interviews, demographic parity says that's unfair regardless of any other consideration.

The strength of demographic parity is its simplicity and its focus on outcomes. It is easy to measure and easy to explain.

The limitation is that it ignores base rates. If the two groups differ in how often the true outcome occurs (for example, different shares of applicants who are actually qualified), then forcing equal prediction rates means the model must select some unqualified people from one group, pass over some qualified people from the other, or both. In some contexts, this is acceptable or even desirable (correcting for historical discrimination). In others, it undermines the model's utility.
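
As a rough illustration, demographic parity can be checked by comparing selection rates per group. The sketch below assumes binary predictions (1 = positive) and one group label per record. The function names and the toy data are hypothetical and mirror the 40% versus 20% hiring example above.

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Return the fraction of positive predictions (1s) for each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(predictions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: 2 of 5 "M" applicants and 1 of 5 "F" applicants selected.
preds = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
groups = ["M"] * 5 + ["F"] * 5

print(selection_rates(preds, groups))
print(round(demographic_parity_gap(preds, groups), 3))
```

A gap of zero means demographic parity holds exactly. How large a gap is tolerable is a policy question that the code cannot answer.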

Equalized Odds

Equalized odds requires that the model's true positive rate and false positive rate are the same across groups. If the model correctly identifies 90% of qualified candidates in Group A, it should correctly identify 90% of qualified candidates in Group B. And if it incorrectly flags 5% of unqualified candidates in Group A, it should incorrectly flag 5% in Group B.

This metric says: the model's errors should fall equally on every group. No group should face a higher false negative rate or a higher false positive rate than another. A loan model that wrongly denies 2% of creditworthy white applicants but 15% of creditworthy Black applicants fails equalized odds, not because the overall accuracy is low, but because the errors are unevenly distributed.

Equalized odds is more nuanced than demographic parity because it accounts for actual qualifications. It does not require equal outcomes. It requires equal treatment conditional on the truth. A model can approve more applicants from one group if that group genuinely has more qualified applicants. What it cannot do is miss qualified people, or wrongly flag unqualified people, at a higher rate in one group than in another.

The limitation: achieving equalized odds can reduce overall model accuracy. Equalizing error rates across groups sometimes means accepting higher error rates for the group the model originally performed well on.
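
A minimal sketch of an equalized odds check, assuming binary labels and predictions and assuming every group contains at least one actual positive and one actual negative (otherwise the rates are undefined). The function names and the toy data are hypothetical.

```python
def error_rates_by_group(y_true, y_pred, groups):
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        tpr = c["tp"] / (c["tp"] + c["fn"])  # share of actual positives caught
        fpr = c["fp"] / (c["fp"] + c["tn"])  # share of actual negatives wrongly flagged
        rates[group] = (tpr, fpr)
    return rates

def equalized_odds_gaps(y_true, y_pred, groups):
    """Return (TPR gap, FPR gap): the largest between-group difference in each rate."""
    rates = error_rates_by_group(y_true, y_pred, groups)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Hypothetical toy data: 4 qualified and 4 unqualified candidates per group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 8

print(error_rates_by_group(y_true, y_pred, groups))
print(equalized_odds_gaps(y_true, y_pred, groups))
```

Equalized odds holds only when both gaps are zero. Reporting the two gaps separately keeps it visible which kind of error is unevenly distributed.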

Predictive Parity

Predictive parity requires that when the model says "positive," the probability of actually being positive is the same across groups. If the model predicts that a defendant is high-risk, predictive parity says the actual recidivism rate among those predictions should be the same regardless of the defendant's race.

This metric focuses on the meaning of the model's output. A "high-risk" prediction should mean the same thing for everyone. If "high-risk" means 70% chance of recidivism for Group A but 50% for Group B, the label is misleading — the same prediction carries different information depending on who receives it.

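Predictive parity is particularly relevant in criminal justice, lending, and healthcare: domains where the model's output drives consequential decisions and a positive prediction needs to carry the same meaning across groups for those decisions to be equitable. Calibration, mentioned in the introduction, is the closely related score-level requirement: among people who receive a score of 0.7, about 70% should turn out to be positive in every group.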
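
Predictive parity can be checked by computing, for each group, the share of positive predictions that turn out to be correct (the positive predictive value). The sketch below assumes binary labels and predictions. The function name and the toy data are hypothetical, with the data chosen to reproduce the 70% versus 50% example above.

```python
def ppv_by_group(y_true, y_pred, groups):
    """Return {group: share of positive predictions that were actually positive}."""
    flagged = {}
    correct = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        if pred == 1:
            flagged[group] = flagged.get(group, 0) + 1
            correct[group] = correct.get(group, 0) + truth
    # Groups with no positive predictions are omitted: their PPV is undefined.
    return {g: correct[g] / flagged[g] for g in flagged}

# Hypothetical toy data: 10 "high-risk" predictions per group, plus 2 "low-risk" ones.
y_true = [1] * 7 + [0] * 3 + [0, 0] + [1] * 5 + [0] * 5 + [1, 0]
y_pred = [1] * 10 + [0, 0] + [1] * 10 + [0, 0]
groups = ["A"] * 12 + ["B"] * 12

print(ppv_by_group(y_true, y_pred, groups))
```

Here the same "high-risk" label is right 7 times in 10 for Group A and 5 times in 10 for Group B, so predictive parity fails.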

The Impossibility Result

Here is where it gets difficult. In 2016, researchers proved that except in trivial cases (where base rates are identical across groups or the model is perfect), you cannot simultaneously satisfy equalized odds and predictive parity. This is not a limitation of current methods. It is a mathematical impossibility.
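
One way to see why, sketched here with symbols that do not appear in the original text: let p be a group's base rate, TPR and FPR its true and false positive rates, and PPV the probability that a positive prediction is correct. Bayes' rule ties these four quantities together:

```latex
\mathrm{PPV} = \frac{p\,\mathrm{TPR}}{p\,\mathrm{TPR} + (1-p)\,\mathrm{FPR}}
\quad\Longrightarrow\quad
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\mathrm{TPR}
```

If two groups share the same TPR and FPR (equalized odds) and the same PPV (predictive parity), the right-hand side can only match for both groups when p/(1-p) is the same, meaning equal base rates, or when FPR is zero, meaning the model never produces a false positive. Those are the trivial cases named above.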

The practical consequence: you must choose. And the choice encodes values.

Choosing equalized odds says: the model should make the same proportion of errors for everyone, even if that means the model's positive predictions are more reliable for one group than another.

Choosing predictive parity says: the model's predictions should mean the same thing for everyone, even if that means one group experiences higher error rates.

These are not equivalent commitments. They lead to different outcomes for different people. And neither is objectively "more fair" — they formalize different conceptions of fairness that are both intuitively compelling and mutually exclusive.
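
To make the two choices concrete, the sketch below works through both with hypothetical numbers: two groups with base rates of 50% and 20%, and a model with a 90% true positive rate. It relies only on Bayes' rule. The function names and every figure are assumptions for illustration, not results from a real system.

```python
def ppv(base_rate, tpr, fpr):
    """Positive predictive value implied by a base rate and error rates (Bayes' rule)."""
    true_pos = base_rate * tpr
    false_pos = (1 - base_rate) * fpr
    return true_pos / (true_pos + false_pos)

def fpr_needed(base_rate, tpr, target_ppv):
    """False positive rate required to reach target_ppv at a given base rate and TPR."""
    return base_rate / (1 - base_rate) * (1 - target_ppv) / target_ppv * tpr

BASE_RATES = {"A": 0.5, "B": 0.2}  # hypothetical base rates

# Choice 1: equalized odds. Same TPR and FPR for both groups; PPV then differs.
equal_odds_ppv = {g: ppv(p, tpr=0.90, fpr=0.05) for g, p in BASE_RATES.items()}

# Choice 2: predictive parity. Same PPV and TPR for both groups; FPR then differs.
equal_ppv_fpr = {g: fpr_needed(p, tpr=0.90, target_ppv=0.90) for g, p in BASE_RATES.items()}

print({g: round(v, 3) for g, v in equal_odds_ppv.items()})
print({g: round(v, 3) for g, v in equal_ppv_fpr.items()})
```

Under the first choice, a positive prediction is correct about 95% of the time for Group A but only about 82% of the time for Group B. Under the second, Group A's false positive rate is four times Group B's. Neither discrepancy can be tuned away while the base rates differ.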

How to Choose

Given that you cannot satisfy all fairness criteria simultaneously, how do you choose which ones to prioritize?

Start with the decision context. Who makes decisions based on this model's output? What are the consequences of a false positive versus a false negative? In a spam filter, a false positive (real email marked as spam) is annoying. In a criminal risk assessment, a false positive (low-risk person scored as high-risk) affects someone's liberty.

Consider the stakeholders. Demographic parity focuses on group-level outcomes. Equalized odds focuses on error rates among people with the same true outcome. Predictive parity focuses on the reliability of a positive prediction. Different stakeholders may prioritize different metrics: applicants care about equal error rates, decision-makers care about what a prediction means, and regulators care about group-level outcomes.

Be explicit about the trade-off. Document which metric you chose, why, and what you are accepting in trade. "We prioritize equalized odds because unequal error rates in this context cause disproportionate harm to the affected group. We accept that this may reduce predictive parity and overall accuracy." This transparency is itself a responsible AI practice — it makes the values decision visible and debatable rather than hidden in a model configuration.

Monitor all metrics, even the ones you don't optimize for. Choosing equalized odds does not mean ignoring predictive parity. Track every relevant metric. Understand the trade-offs in practice, not just in theory. If a secondary metric degrades beyond an acceptable threshold, revisit the design decision.
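
One possible shape for that monitoring step is a check that compares every tracked gap against an agreed limit, including the metrics that are not being optimized. The metric names, the observed values, and the thresholds below are all hypothetical. Acceptable limits have to come from the documented trade-off, not from the code.

```python
def metrics_over_threshold(gaps, thresholds):
    """Return {metric: (observed_gap, limit)} for every gap that exceeds its limit.

    Metrics without a configured limit are skipped rather than flagged.
    """
    flagged = {}
    for name, gap in gaps.items():
        limit = thresholds.get(name)
        if limit is not None and gap > limit:
            flagged[name] = (gap, limit)
    return flagged

# Hypothetical between-group gaps from the latest evaluation run.
observed = {
    "demographic_parity_gap": 0.04,
    "tpr_gap": 0.02,   # optimized for (equalized odds)
    "fpr_gap": 0.01,   # optimized for (equalized odds)
    "ppv_gap": 0.12,   # secondary metric, monitored only
}

# Hypothetical limits agreed when the trade-off was documented.
limits = {
    "demographic_parity_gap": 0.10,
    "tpr_gap": 0.05,
    "fpr_gap": 0.05,
    "ppv_gap": 0.10,
}

for name, (gap, limit) in metrics_over_threshold(observed, limits).items():
    print(f"{name}: gap {gap:.2f} exceeds limit {limit:.2f}; revisit the design decision")
```

In this example the optimized metrics stay within bounds while the secondary predictive parity gap crosses its limit, which is the signal to reopen the trade-off.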

The Takeaway

Fairness in machine learning is not a problem with a single correct answer. Different metrics encode different values, and some are provably incompatible. Choosing a fairness metric is a decision about which kind of unfairness you are willing to tolerate — and that is a human judgment, not a technical one.

The responsible practice is not to find the perfect metric. It is to choose deliberately, document transparently, and monitor continuously. The model cannot tell you what is fair. It can only optimize for the definition you give it.

Make sure the definition reflects the values you actually hold.

Next in the "Responsible AI Practice" learning path: We'll cover AI auditing in practice — how to structure a systematic evaluation of an AI system for bias, safety, and compliance, and how to make audit findings actionable rather than decorative.

ShiftQuality