
Responsible AI and Ethics - Building Systems That Don't Cause Harm

Reading time: ~40 minutes | Level: ML System Design | Role: MLE, AI Engineer, MLOps


The Real Interview Moment

In 2016, ProPublica published an analysis of COMPAS - a recidivism prediction algorithm widely used in US courtrooms to help judges decide bail amounts and sentence lengths. The investigation found that Black defendants who did not go on to reoffend were falsely flagged as high risk at nearly twice the rate of white defendants who did not reoffend. The headline spread: "Machine Bias."

Northpointe, the company that built COMPAS, responded with a technical rebuttal. They were not wrong: the algorithm had roughly equal predictive accuracy (AUC) across racial groups. In the sense that mattered to them, the algorithm was fair. ProPublica's researchers agreed on the accuracy metric - and disagreed about everything else. When they looked at the error rates, a stark pattern emerged. Black defendants who did not reoffend were classified as high-risk at a rate of 44.9%. White defendants who did not reoffend were classified as high-risk at a rate of 23.5%. The racial disparity in risk scores persisted even after controlling for age, criminal history, and recidivism itself.

Both sides were technically correct. They were measuring different things. Northpointe was measuring calibration - whether a given risk score corresponds to the same probability of reoffending for every group. ProPublica was measuring error rate balance - whether false positive and false negative rates are equal across groups. These two objectives, it turns out, are mathematically incompatible when base rates differ between groups.

This is not a quirk of COMPAS. It is a theorem. Chouldechova (2017) proved that when the base rate of the predicted outcome differs between demographic groups - and in recidivism prediction, it does - you cannot simultaneously achieve calibration, equal false positive rate, and equal false negative rate. You must choose which fairness definition to prioritize, and that choice has real consequences for real people.

Every ML engineer building systems that affect human lives - hiring, lending, healthcare triage, content moderation, criminal justice - will eventually face this choice. This lesson is about understanding it before your system is deployed in a courtroom, a bank, or a hospital.


Bias Sources in the ML Pipeline

Bias is not a single thing that enters a model in one place. It is woven into every stage of the ML pipeline, often invisibly. Understanding where bias enters is the first step toward addressing it.

Historical Bias

Historical bias is perhaps the most insidious type because it is invisible in the data itself. The data looks clean - it accurately reflects what happened. The problem is that what happened was shaped by historical discrimination.

A classic example: if you train a hiring model on historical hiring decisions, and those decisions were made by humans who systematically hired fewer women for engineering roles, the model learns that being female is negatively correlated with being hired for engineering. The model is accurate with respect to the historical data. It is replicating discrimination.

Historical bias cannot be removed by cleaning the data. The data is "clean" - the bias is in the world the data was recorded from. Addressing it requires intervention at the modeling or post-processing stage.

Representation Bias

Representation bias occurs when the training set underrepresents certain groups relative to their prevalence in the deployment population, or relative to their importance as stakeholders.

Dermatology AI systems trained primarily on images of light-skinned patients perform significantly worse on dark-skinned patients - not because of any malice, but because the training data came from clinics that served a predominantly light-skinned population. The model never learned to recognize melanoma in darker skin because it rarely saw it.

Representation bias is fixable in principle: collect more data from underrepresented groups, use oversampling, or apply group-specific loss weights. The challenge is knowing which groups are underrepresented - this requires deliberate demographic analysis of your training data.

Measurement Bias

Measurement bias occurs when the ground truth labels themselves contain systematic errors that correlate with protected attributes.

Arrest records are commonly used as a proxy for criminal behavior in recidivism models. But arrest rate ≠ criminal behavior rate. Police over-patrol certain neighborhoods and racial groups, leading to higher arrest rates that do not fully reflect underlying criminal activity. Using arrest records as ground truth bakes this policing bias into the label.

Similarly, clinical diagnoses used as training labels for healthcare AI inherit the biases of the clinicians who made those diagnoses - including documented disparities in pain assessment across racial groups.

Aggregation Bias

Aggregation bias occurs when a single model is trained on data from heterogeneous subpopulations with different underlying relationships, without accounting for the heterogeneity.

A glucose prediction model trained on a mixed population may learn the average relationship between biomarkers and blood glucose - which is different for diabetic and non-diabetic patients. The model works adequately on average but poorly for both subgroups. The solution is either to train separate models per group or to include group membership as an explicit feature.

Deployment Bias

Deployment bias occurs when a model is used in a context that differs systematically from the context it was trained in.

A facial recognition system trained on passport photos (frontal, well-lit, controlled environment) and then deployed for surveillance in uncontrolled outdoor environments will degrade unevenly: accuracy drops most for the groups whose real-world images differ most from the controlled training images. The bias is not in the training data or the model alone - it is in the gap between the training distribution and the deployment distribution.


Fairness Metrics

The COMPAS controversy exposed a fundamental truth: there is no single definition of fairness. Different stakeholders have different legitimate fairness concerns, and these concerns are often mathematically incompatible. An ML engineer must understand the major definitions and their trade-offs.

Demographic Parity (Statistical Parity)

The positive prediction rate should be equal across protected groups:

$$P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)$$

where $A$ is the protected attribute (0 and 1 denoting two groups). If your loan approval model approves 60% of group A applications and 40% of group B applications, it violates demographic parity.

When it makes sense: when you believe the underlying rate of the positive outcome should be equal across groups - for example, if you believe loan repayment ability is not causally related to the protected attribute and any difference in approval rates reflects historical discrimination.

Limitation: demographic parity can require giving different treatment to individuals with identical qualifications. If group B has a genuinely lower qualification rate due to historical factors, enforcing demographic parity means approving some less-qualified group B candidates while rejecting more-qualified group A candidates. This buys group-level equity at the cost of individual fairness.

Equal Opportunity

The true positive rate (recall) should be equal across groups:

$$P(\hat{Y}=1 \mid Y=1, A=0) = P(\hat{Y}=1 \mid Y=1, A=1)$$

Among all people who would actually repay the loan ($Y=1$), the model should approve equally high proportions from both groups. Equal opportunity focuses on not disadvantaging qualified members of a group.

When it makes sense: when you primarily care about ensuring qualified individuals from all groups are not unfairly denied positive outcomes.

Limitation: it only constrains the true positive rate - it says nothing about false positive rates. Two models can satisfy equal opportunity while having very different false positive rates across groups.

Equalized Odds

Both true positive rate (TPR) and false positive rate (FPR) should be equal across groups:

$$P(\hat{Y}=1 \mid Y=1, A=0) = P(\hat{Y}=1 \mid Y=1, A=1)$$

$$P(\hat{Y}=1 \mid Y=0, A=0) = P(\hat{Y}=1 \mid Y=0, A=1)$$

This is the combination of equal opportunity and equal false positive rate. In the COMPAS context: the same fraction of genuinely low-risk individuals should be falsely flagged as high-risk, regardless of race. This is what ProPublica was measuring when they found the 44.9% vs 23.5% disparity.
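As a minimal sketch of how the two equalized odds gaps could be measured, the following pure-Python example computes TPR and FPR per group on a tiny hand-made dataset. The function name `group_rates` and the toy records are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} from binary labels and predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if yt == yp else "f") + ("p" if yp == 1 else "n")
        counts[g][key] += 1
    rates = {}
    for g, c in counts.items():
        pos, neg = c["tp"] + c["fn"], c["fp"] + c["tn"]
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        rates[g] = (tpr, fpr)
    return rates

# Toy data (hypothetical): 4 actual positives and 4 actual negatives per group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A"] * 8 + ["B"] * 8

rates = group_rates(y_true, y_pred, groups)
tpr_gap = abs(rates["A"][0] - rates["B"][0])
fpr_gap = abs(rates["A"][1] - rates["B"][1])
print(rates, tpr_gap, fpr_gap)
```

Equalized odds asks both gaps to be (close to) zero; equal opportunity constrains only the TPR gap.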

Individual Fairness

Similar individuals should receive similar predictions: if two individuals are alike in all relevant respects, the model should treat them alike. Formally, if $d(x_i, x_j)$ is a task-relevant similarity metric:

$$d_{\hat{Y}}(\hat{Y}(x_i), \hat{Y}(x_j)) \leq L \cdot d(x_i, x_j)$$

The model must be Lipschitz continuous with respect to the task-relevant metric. Individual fairness is appealing but hard to operationalize: defining the "task-relevant similarity metric" $d(x_i, x_j)$ requires specifying what attributes are relevant, which is itself a value judgment.
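One way to probe the Lipschitz condition empirically is to scan pairs of individuals and flag those whose scores differ by more than $L$ times their distance. The sketch below assumes Euclidean distance on two normalized features and $L = 2$; both choices, along with the function names and data, are hypothetical and would need domain justification in a real audit.

```python
from itertools import combinations

def lipschitz_violations(features, scores, L, dist):
    """Return index pairs (i, j) with |score_i - score_j| > L * dist(x_i, x_j)."""
    bad = []
    for i, j in combinations(range(len(features)), 2):
        if abs(scores[i] - scores[j]) > L * dist(features[i], features[j]):
            bad.append((i, j))
    return bad

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# Hypothetical applicants: (normalized income, normalized credit history length)
features = [(0.50, 0.50), (0.52, 0.50), (0.90, 0.10)]
scores = [0.70, 0.30, 0.20]  # model outputs in [0, 1]

violations = lipschitz_violations(features, scores, L=2.0, dist=euclidean)
print(violations)  # pairs of similar applicants that received very different scores
```

The pairwise scan is quadratic in the number of individuals, so at production scale it would typically be run on a sample.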

Calibration

For each predicted probability $p$ and each group $a$:

$$P(Y=1 \mid \hat{p}=p, A=a) = p$$

A calibrated model is one where "a 70% predicted risk score corresponds to 70% actual risk," regardless of group membership. This is what Northpointe measured in the COMPAS analysis.
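A simple per-group calibration check bins the predicted probabilities and compares the mean prediction in each bin with the observed positive rate. The following is a minimal sketch under assumed toy data; `calibration_by_group` and the bin count are illustrative choices.

```python
from collections import defaultdict

def calibration_by_group(probs, y_true, groups, n_bins=5):
    """Return {(group, bin): (mean predicted prob, observed positive rate, count)}."""
    cells = defaultdict(list)
    for p, y, g in zip(probs, y_true, groups):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        cells[(g, b)].append((p, y))
    table = {}
    for key, items in sorted(cells.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        rate = sum(y for _, y in items) / len(items)
        table[key] = (mean_p, rate, len(items))
    return table

# Hypothetical scores: everyone receives 0.7, but outcomes differ by group.
probs = [0.7] * 20
y_true = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6
groups = ["A"] * 10 + ["B"] * 10

table = calibration_by_group(probs, y_true, groups)
for (g, b), (mean_p, rate, n) in table.items():
    print(f"group={g} bin={b} mean_pred={mean_p:.2f} observed={rate:.2f} n={n}")
```

In this toy data the score is calibrated for group A (70% observed) and over-predicts for group B (40% observed), which is the kind of gap a per-group calibration check is meant to surface.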

The Impossibility Theorem

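Chouldechova (2017) proved that when base rates differ between groups - that is, when $P(Y=1 \mid A=0) \neq P(Y=1 \mid A=1)$ - and the classifier is not perfect, it is mathematically impossible to simultaneously achieve:

  • Calibration (in Chouldechova's formulation, predictive parity: equal positive predictive value across groups)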
  • Equal false positive rates across groups
  • Equal false negative rates across groups

You must sacrifice at least one. There is no free lunch in fairness. The choice is an ethical decision, not a technical one:

| If you prioritize... | You sacrifice... | Stakeholder most affected |
| --- | --- | --- |
| Calibration | Equal error rates | Groups with higher base rates get higher false positive rates |
| Equal false positive rates | Calibration or false negative equality | May under-detect risk in one group |
| Equal false negative rates | Calibration or false positive equality | May over-flag low-risk members of one group |

This theorem does not mean fairness is impossible. It means that fairness requires a value judgment about which type of error matters most in your specific application.
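The incompatibility can be seen from a single identity relating the quantities in a confusion matrix. Writing $p$ for the base rate, $\text{FPR} = \frac{p}{1-p} \cdot \frac{1-\text{PPV}}{\text{PPV}} \cdot (1-\text{FNR})$, so if two groups share the same PPV and FNR but have different base rates, their false positive rates must differ. The sketch below evaluates this identity for hypothetical numbers; it uses equal PPV (predictive parity) as the calibration-style condition.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR forced by a confusion matrix with the given base rate, PPV and FNR.

    Per unit of population: TP = (1 - FNR) * p, and PPV = TP / (TP + FP)
    gives FP = TP * (1 - PPV) / PPV. Dividing FP by the negatives (1 - p):
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    p = base_rate
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

# Hypothetical numbers: identical PPV and FNR, different base rates.
ppv, fnr = 0.6, 0.3
fpr_a = implied_fpr(0.5, ppv, fnr)  # group with the higher base rate
fpr_b = implied_fpr(0.4, ppv, fnr)  # group with the lower base rate
print(f"FPR group A: {fpr_a:.3f}, FPR group B: {fpr_b:.3f}")
```

With these inputs the group with the higher base rate ends up with the higher false positive rate, even though PPV and FNR are held equal.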


Bias Detection in Practice

Understanding fairness metrics abstractly is not enough. You need concrete detection methods that work at production scale.

Slice-Based Evaluation

The most basic and most important detection method: compute all your evaluation metrics separately for each demographic group. Do not report only aggregate AUC or accuracy - those can look fine while hiding severe disparities within subgroups.

For a loan approval model, compute for each group:

| Group | Approval Rate | True Positive Rate | False Positive Rate | AUC |
| --- | --- | --- | --- | --- |
| Overall | 55% | 78% | 22% | 0.82 |
| Group A | 62% | 81% | 19% | 0.84 |
| Group B | 43% | 69% | 31% | 0.76 |

This table immediately reveals that Group B has lower approval rates, lower recall of creditworthy applicants, and higher false positive rates. The aggregate AUC of 0.82 hides a 0.08 AUC gap between groups - which in a high-stakes lending application is substantial.

Disparate Impact Ratio

The four-fifths (80%) rule, from the US federal Uniform Guidelines on Employee Selection Procedures and applied by the EEOC (US Equal Employment Opportunity Commission), is a rule of thumb for flagging disparate impact in employment decisions:

$$\text{DI} = \frac{P(\hat{Y}=1 \mid A=1)}{P(\hat{Y}=1 \mid A=0)}$$

Here $A=1$ denotes the disadvantaged group and $A=0$ the favored group. If $\text{DI} < 0.8$ (the favored group receives positive outcomes at more than 1.25x the rate of the disadvantaged group), the process is generally treated as showing evidence of disparate impact under US employment law. The 80% rule is widely used in audits of automated hiring systems.

Note: the DI ratio addresses demographic parity (selection rate equality) only. It does not address equalized odds or individual fairness.
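A minimal sketch of the ratio, assuming binary selection decisions and a caller-chosen reference group (the function name and data are hypothetical):

```python
def disparate_impact(selected, groups, reference):
    """Selection rate of each group divided by the reference group's rate."""
    totals, hits = {}, {}
    for s, g in zip(selected, groups):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + s
    rates = {g: hits[g] / totals[g] for g in totals}
    return {g: rates[g] / rates[reference] for g in rates}

# Hypothetical: 100 applicants per group; 60 selected in A, 40 selected in B.
selected = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60
groups = ["A"] * 100 + ["B"] * 100

di = disparate_impact(selected, groups, reference="A")
flagged = [g for g, ratio in di.items() if ratio < 0.8]
print(di, flagged)
```

Here group B's ratio is 0.4 / 0.6, about 0.67, which falls below the 0.8 threshold.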

Counterfactual Fairness

A model is counterfactually fair if its prediction would not change if the individual's protected attribute were different, all else being equal:

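$$P(\hat{Y}_{A=a}(x) = y) = P(\hat{Y}_{A=a'}(x) = y)$$

In practice, a first approximation is a flip test: change the protected attribute (e.g., "gender=female" to "gender=male") while keeping all other features identical, and observe whether predictions change. If they change substantially, the model depends directly on the protected attribute. A flip test that shows no change is not proof of counterfactual fairness: features that are causally downstream of the protected attribute (proxies) are left untouched by the flip, so indirect dependence requires a causal model or the proxy analysis in the next section to detect.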
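The flip test can be sketched as follows. The two scoring functions are hypothetical stand-ins for a trained model, and `flip_test` is an illustrative helper rather than a library function. Note that a flip test of this kind only detects direct dependence on the attribute column itself.

```python
def flip_test(predict, rows, attr, swap):
    """Return the largest absolute change in prediction when `attr` is swapped."""
    worst = 0.0
    for row in rows:
        flipped = dict(row)
        flipped[attr] = swap[row[attr]]
        worst = max(worst, abs(predict(row) - predict(flipped)))
    return worst

# Hypothetical scoring functions standing in for a trained model.
def model_uses_gender(row):
    penalty = 0.1 if row["gender"] == "female" else 0.0
    return 0.5 + 0.03 * row["years_experience"] - penalty

def model_ignores_gender(row):
    return 0.5 + 0.03 * row["years_experience"]

rows = [
    {"gender": "female", "years_experience": 4},
    {"gender": "male", "years_experience": 7},
]
swap = {"female": "male", "male": "female"}

direct_effect = flip_test(model_uses_gender, rows, "gender", swap)
no_effect = flip_test(model_ignores_gender, rows, "gender", swap)
print(direct_effect, no_effect)
```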

Proxy Features and Indirect Discrimination

Removing the protected attribute from the feature set does not prevent discrimination. Many features are correlated with protected attributes - ZIP code correlates with race, name correlates with gender and ethnicity, browsing history correlates with age. The model can reconstruct the protected attribute from these proxies with high accuracy and discriminate indirectly.

Test for proxy discrimination by measuring how well you can predict the protected attribute from the model's features. If a simple logistic regression can predict race from ZIP code + income + occupation with 80% accuracy, then your model effectively has access to race even if it is not an explicit feature.
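As a dependency-free illustration of the proxy test, the sketch below swaps the logistic regression mentioned above for a simpler majority-vote lookup: it predicts the protected attribute as the most common value within each feature bucket and compares the accuracy with always guessing the majority group. The data and function names are hypothetical, and the accuracy is measured in-sample, so it is optimistic; a real audit would use held-out data.

```python
from collections import Counter, defaultdict

def proxy_accuracy(feature_values, protected):
    """In-sample accuracy of predicting `protected` by majority vote per feature value."""
    buckets = defaultdict(Counter)
    for f, a in zip(feature_values, protected):
        buckets[f][a] += 1
    correct = sum(c.most_common(1)[0][1] for c in buckets.values())
    return correct / len(protected)

def baseline_accuracy(protected):
    """Accuracy of always guessing the overall majority group."""
    return Counter(protected).most_common(1)[0][1] / len(protected)

# Hypothetical records: a ZIP code and a protected attribute per person.
zips = ["10001"] * 10 + ["10002"] * 10
protected = ["x"] * 9 + ["y"] * 1 + ["x"] * 2 + ["y"] * 8

proxy_acc = proxy_accuracy(zips, protected)
base_acc = baseline_accuracy(protected)
print(proxy_acc, base_acc)
```

A large gap between the two numbers (here 0.85 versus 0.55) suggests the feature carries substantial information about the protected attribute.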


Bias Mitigation Strategies

Once bias is detected, there are three categories of mitigation - each with different trade-offs between implementation complexity and effectiveness.

Pre-Processing Mitigation

Modify the training data before fitting the model. These methods are model-agnostic.

Reweighting samples: assign higher loss weights to underrepresented groups or to samples that are being systematically misclassified. In sklearn, this is the sample_weight parameter in fit().

Resampling: oversample the disadvantaged group (SMOTE for synthetic oversampling) or undersample the advantaged group to rebalance the training distribution.

Disparate impact remover: transform features to remove correlation with the protected attribute while preserving as much predictive signal as possible (Feldman et al., 2015). Available in the aif360 library.

Massaging / relabeling: identify borderline cases near the decision boundary and flip labels for members of the disadvantaged group to increase their positive rate. This is aggressive and legally controversial.
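To make the reweighting idea concrete, here is a minimal sketch of one common scheme, the reweighing weights of Kamiran and Calders, $w(g, y) = P(g)\,P(y) / P(g, y)$, which up-weight (group, label) combinations that are rarer than independence would predict. This particular formula is an assumption on my part about which reweighting variant to show; the data are hypothetical.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-sample weight w(g, y) = P(g) * P(y) / P(g, y)."""
    n = len(labels)
    n_g = Counter(groups)
    n_y = Counter(labels)
    n_gy = Counter(zip(groups, labels))
    return [(n_g[g] * n_y[y]) / (n * n_gy[(g, y)]) for g, y in zip(groups, labels)]

# Hypothetical training set: group B has few positive labels.
groups = ["A"] * 6 + ["B"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
print([round(w, 3) for w in weights])
```

The resulting list could then be passed to an estimator via `model.fit(X, y, sample_weight=weights)`.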

In-Processing Mitigation

Modify the training objective to include a fairness constraint. More powerful than pre-processing but requires modifying the training code.

Fairness-constrained optimization: add a fairness penalty to the loss function:

$$\mathcal{L}_{\text{fair}} = \mathcal{L}_{\text{base}} + \lambda \cdot \mathcal{F}$$

where $\mathcal{F}$ is a differentiable fairness penalty (e.g., difference in positive prediction rates between groups). The trade-off between accuracy and fairness is controlled by $\lambda$.
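The penalized objective can be sketched numerically. The example below assumes binary cross-entropy as the base loss and, as the penalty, the absolute gap in mean predicted score between two groups labeled "A" and "B" (a soft stand-in for the gap in positive prediction rates). These choices and all names are illustrative.

```python
import math

def fair_loss(probs, labels, groups, lam):
    """Return (total, base, penalty): BCE plus lam * |mean score gap between A and B|."""
    eps = 1e-12  # guards against log(0)
    bce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)

    def group_mean(g):
        return sum(p for p, gg in zip(probs, groups) if gg == g) / groups.count(g)

    penalty = abs(group_mean("A") - group_mean("B"))
    return bce + lam * penalty, bce, penalty

probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "B", "B", "B"]

for lam in (0.0, 1.0):
    total, base, penalty = fair_loss(probs, labels, groups, lam)
    print(f"lambda={lam}: base={base:.3f} penalty={penalty:.3f} total={total:.3f}")
```

In actual training the same expression would be written in an autodiff framework so that gradients flow through both terms.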

Adversarial debiasing: train a main predictor alongside an adversarial network that tries to predict the protected attribute from the main predictor's output. The main predictor is trained to fool the adversary - to make predictions that the adversary cannot use to recover the protected attribute.

Reduction methods: Fairlearn's ExponentiatedGradient frames fairness as a constrained optimization problem and solves it by reducing to a sequence of weighted classification problems. Supports equalized odds, demographic parity, and true positive rate parity as constraints.

Post-Processing Mitigation

Adjust the model's output (thresholds) after training to satisfy fairness constraints. The most practical approach - it does not require retraining, works with any model, and is easy to audit.

Threshold optimization: instead of using a single decision threshold (e.g., approve if score > 0.5) across all groups, calibrate separate thresholds per group to equalize the false positive rate or true positive rate:

$$\tau_a = \arg\min_\tau \left|\text{FPR}(\tau, \text{group}_a) - \text{target\_FPR}\right|$$

Fairlearn's ThresholdOptimizer implements this efficiently.
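To show the mechanics without any library, the following sketch searches the observed scores of each group for the threshold whose false positive rate is closest to a target. It is a brute-force illustration of the formula above, not how Fairlearn implements it; the data, the `>=` decision rule, and the function name are assumptions.

```python
def threshold_for_target_fpr(scores, labels, target_fpr):
    """Pick the observed score whose FPR (predict 1 if score >= tau) is closest to target."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best_tau, best_gap = None, float("inf")
    for tau in sorted(set(scores)):
        fpr = sum(s >= tau for s in negatives) / len(negatives)
        gap = abs(fpr - target_fpr)
        if gap < best_gap:  # ties keep the lower threshold
            best_tau, best_gap = tau, gap
    return best_tau

# Hypothetical scores: group B's true negatives tend to score higher than group A's.
data = {
    "A": ([0.1, 0.2, 0.3, 0.6, 0.7, 0.9], [0, 0, 0, 0, 1, 1]),
    "B": ([0.3, 0.5, 0.6, 0.8, 0.7, 0.9], [0, 0, 0, 0, 1, 1]),
}
thresholds = {
    g: threshold_for_target_fpr(scores, labels, target_fpr=0.25)
    for g, (scores, labels) in data.items()
}
print(thresholds)
```

Both groups end up with a 25% false positive rate, but only because they are given different thresholds, which requires the protected attribute at prediction time.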

Reject option classification: for predictions near the decision boundary (low confidence), assign the favorable outcome to members of the disadvantaged group and the unfavorable outcome to members of the advantaged group. This increases the positive rate for the disadvantaged group without changing confident predictions.
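A minimal sketch of the rule, assuming a score threshold of 0.5 and a critical region of ±0.1 around it (both values, and the function name, are illustrative):

```python
def reject_option_predict(score, group, disadvantaged, threshold=0.5, margin=0.1):
    """Return 1 (favorable) or 0, overriding low-confidence scores by group."""
    if abs(score - threshold) <= margin:       # inside the critical region
        return 1 if group == disadvantaged else 0
    return 1 if score > threshold else 0       # confident predictions are unchanged

cases = [(0.45, "B"), (0.55, "A"), (0.45, "A"), (0.90, "A"), (0.20, "B")]
decisions = [reject_option_predict(s, g, disadvantaged="B") for s, g in cases]
print(decisions)
```

Only the first three cases fall inside the critical region; the confident scores 0.90 and 0.20 keep their original decisions.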


Privacy-Preserving ML

Beyond fairness, production ML systems must also protect the privacy of individuals whose data was used in training. Two major threats:

Model inversion attack: an attacker queries the model many times and reconstructs training data, or representative approximations of it, from the model's outputs. In facial recognition, model inversion has been used to reconstruct recognizable approximations of training faces from the model's confidence scores.

Membership inference attack: given a data point and model access, an attacker determines whether that data point was in the training set. This is particularly concerning for medical data: "Was patient X's record used to train this cancer model?"

Differential Privacy

Differential privacy is a mathematically rigorous framework for limiting what a model can reveal about any single training record, and it is the standard defense against both types of attacks. A randomized mechanism $M$ is $(\varepsilon, \delta)$-differentially private if for any two neighboring datasets $D_1$ and $D_2$ (differing in exactly one individual's record) and any output set $S$:

$$P[M(D_1) \in S] \leq e^\varepsilon \cdot P[M(D_2) \in S] + \delta$$

Intuitively: the mechanism's outputs look nearly identical whether or not any individual's data is included. An attacker cannot confidently determine whether you were in the training set by looking at the model's outputs.

The parameters:

  • $\varepsilon$ (epsilon): the privacy budget. Smaller $\varepsilon$ = stronger privacy guarantee. $\varepsilon = 0$ means perfect privacy (but zero utility). $\varepsilon = 1$ is strong privacy. $\varepsilon = 10$ is weak privacy. Reported Apple and Google deployments of DP have used $\varepsilon$ roughly in the range of 1–8.
  • $\delta$ (delta): the probability that the guarantee breaks down. Typically set well below $1/n$ (for example $1/n^2$), where $n$ is the dataset size.
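To make the definition concrete, the sketch below uses randomized response, one of the simplest mechanisms that satisfies $\varepsilon$-differential privacy with $\delta = 0$. Randomized response is not discussed in the text above; it is used here only as an illustration. Each person reports their true bit with probability $e^\varepsilon / (1 + e^\varepsilon)$ and the flipped bit otherwise, so the ratio of output probabilities between the two possible inputs is exactly $e^\varepsilon$.

```python
import math
import random

def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else the flipped bit."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit

def worst_case_ratio(epsilon):
    """Largest ratio P[output | bit=1] / P[output | bit=0] over both outputs."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    return max(p / (1 - p), (1 - p) / p)

epsilon = 1.0
rng = random.Random(0)
reports = [randomized_response(1, epsilon, rng) for _ in range(10_000)]
truthful_fraction = sum(reports) / len(reports)

print(truthful_fraction)                        # empirical fraction of truthful reports
print(worst_case_ratio(epsilon), math.exp(epsilon))  # the DP bound is met with equality
```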

DP-SGD: Training Neural Networks with Differential Privacy

DP-SGD (Abadi et al., 2016) adds differential privacy to gradient descent by:

  1. Per-sample gradient clipping: clip each individual sample's gradient to norm $C$, preventing any single training example from having disproportionate influence on the model:

$$\tilde{g}_{t,i} = g_{t,i} \cdot \min\!\left(1, \frac{C}{\|g_{t,i}\|_2}\right)$$

  2. Gaussian noise addition: add calibrated noise to the sum of the clipped gradients, then average over the batch of size $B$:

$$\tilde{g}_t = \frac{1}{B}\left(\sum_{i=1}^{B} \tilde{g}_{t,i} + \mathcal{N}(0, \sigma^2 C^2 I)\right)$$

The noise scale $\sigma$ is calibrated to the clipping norm $C$ and the desired privacy budget $(\varepsilon, \delta)$ using the privacy accountant (moments accountant or Rényi DP accounting).
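The two steps can be sketched for a single batch in plain Python. This is an illustration of the clipping and noising arithmetic only: it does no privacy accounting, and the gradients, function name, and parameter values are hypothetical.

```python
import math
import random

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += g[k] * scale
    std = noise_multiplier * clip_norm  # sigma * C, per coordinate
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [v / len(per_sample_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # L2 norms 5.0, 0.5, 10.0
rng = random.Random(0)

clipped_only = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(clipped_only, with_noise)
```

With the noise multiplier set to zero, only the clipping is visible: the first and third gradients are rescaled to norm 1, while the second (already below the clipping norm) is left unchanged.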

The privacy-accuracy trade-off is real: DP-SGD typically incurs a 2–5% accuracy reduction on standard benchmarks at reasonable privacy levels ($\varepsilon \approx 3$). The cost is higher for smaller datasets (less averaging of the noise) and for high-dimensional output spaces.

Federated Learning

Federated learning trains models without centralizing data. The canonical setting:

  1. A central server sends the current model to $K$ devices (phones, hospitals, edge servers).
  2. Each device trains the model on its local data for a few steps, producing a local model update $\Delta_k$.
  3. Devices send only the model update (gradient or weight difference) to the server - not the raw data.
  4. The server aggregates updates (typically by weighted averaging proportional to local dataset size $n_k$):

$$\theta_{t+1} = \theta_t + \frac{1}{\sum_k n_k} \sum_{k=1}^{K} n_k \cdot \Delta_k$$
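The aggregation step can be sketched as follows, with model weights represented as plain lists of floats. The function name `fedavg` and the client updates are hypothetical.

```python
def fedavg(global_weights, client_updates):
    """Apply the dataset-size-weighted mean of client updates to the global weights.

    client_updates: list of (n_k, delta_k) pairs, where delta_k is a list of floats.
    """
    total_n = sum(n_k for n_k, _ in client_updates)
    new_weights = list(global_weights)
    for n_k, delta_k in client_updates:
        for i, d in enumerate(delta_k):
            new_weights[i] += (n_k / total_n) * d
    return new_weights

theta = [0.0, 1.0]
updates = [
    (100, [0.2, -0.1]),  # client with 100 local examples
    (300, [0.4, 0.1]),   # client with 300 local examples
]
new_theta = fedavg(theta, updates)
print(new_theta)
```

The client with three times as much data contributes three times the weight to the aggregated update.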

Privacy benefit: raw training data never leaves the device. Medical records stay on hospital servers. Personal browsing history stays on the phone. This is an architectural property, not a formal privacy guarantee.

Remaining risks: model updates (gradients) can still leak information - gradient inversion attacks can reconstruct training data from gradients. Combining federated learning with differential privacy (adding noise to model updates before sending) is a common way to address this in production.

Challenges: communication cost (sending model updates over the network repeatedly), statistical heterogeneity (non-IID data distribution across devices leads to slower convergence), and system heterogeneity (different compute capabilities across devices).


Regulatory Landscape

The regulatory environment for AI is evolving rapidly. ML engineers working on systems that affect individuals must understand the key frameworks.

EU AI Act (2024)

The EU AI Act is the world's first comprehensive AI regulation, effective August 2024 with a staggered implementation timeline.

Risk tiers:

  • Unacceptable risk (prohibited): real-time remote biometric identification in publicly accessible spaces for law enforcement (subject to narrow exceptions), social scoring systems, AI that exploits vulnerabilities of specific groups, subliminal manipulation. These practices are prohibited.
  • High risk (regulated): AI in critical infrastructure, education (exam grading), employment and worker management, essential services (credit scoring, health, life insurance), law enforcement, border control, justice administration. High-risk systems require conformity assessment, human oversight, transparency documentation, and registration in a public EU database.
  • Limited risk: systems with specific transparency obligations (e.g., chatbots must disclose they are AI).
  • Minimal risk: most commercial AI - no specific requirements.

Technical requirements for high-risk systems: risk management system throughout lifecycle, data governance (training data quality requirements), technical documentation, automatic logging, transparency to users, human oversight capability, accuracy/robustness requirements.

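Penalties: up to €35 million or 7% of global annual turnover (whichever is higher) for prohibited practices. Up to €15 million or 3% of global annual turnover for most other violations, and up to €7.5 million or 1% for supplying incorrect information to authorities.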

GDPR Article 22 - Right to Explanation

The EU General Data Protection Regulation (GDPR), in force since 2018, gives individuals the right not to be subject to decisions based solely on automated processing that produce legal or similarly significant effects. Where such automated decisions are permitted, individuals have the right to:

  • Obtain meaningful information about the logic involved (Articles 13-15)
  • Express their point of view and contest the decision
  • Obtain human intervention in the decision

Whether these provisions amount to a full legal "right to explanation" is debated. In practice, ML systems affecting EU residents in lending, hiring, insurance, or similar domains should be able to explain individual decisions - not just model-level feature importance, but instance-level explanations (e.g., LIME or SHAP values for the specific decision).

US EEOC and Employment Law

The EEOC (Equal Employment Opportunity Commission) applies disparate impact analysis to employment algorithms. The 80% rule: if the selection rate for any group is less than 80% of the selection rate for the highest-selected group, disparate impact is indicated and the employer must demonstrate business necessity.

In 2023, the EEOC issued technical assistance guidance specifically addressing algorithmic employment decisions, clarifying that employers cannot avoid liability by outsourcing algorithmic decision-making to third-party vendors.

US Fair Credit Reporting Act and Equal Credit Opportunity Act

The ECOA prohibits credit discrimination on the basis of race, color, religion, national origin, sex, marital status, age, and receipt of public assistance. The Fair Housing Act extends similar protections to lending for housing. ML credit scoring models should be audited for disparate impact, and lenders must provide adverse action notices stating the principal reasons for a credit denial (required under the ECOA's Regulation B and, when a consumer report is used, under the FCRA).

FDA Guidance for AI/ML Medical Devices

The FDA published the discussion paper "Proposed Regulatory Framework for Modifications to AI/ML-Based Software as a Medical Device" in 2019 and followed it with an AI/ML-Based Software as a Medical Device Action Plan in 2021. Key elements: predetermined change control plans (you must specify in advance how the algorithm will be updated), performance monitoring, and real-world performance data requirements. The highest-risk medical AI (e.g., autonomous diagnostic algorithms) generally requires premarket review with clinical evidence.


Code: Bias Detection and Mitigation

Slice-Based Evaluation with pandas

```python
import pandas as pd
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix,
)


def evaluate_fairness(
    df: pd.DataFrame,
    y_true_col: str,
    y_pred_col: str,
    y_prob_col: str,
    protected_col: str,
) -> pd.DataFrame:
    """
    Compute fairness metrics by demographic group.

    Args:
        df: DataFrame with predictions and demographics
        y_true_col: Column name for true labels
        y_pred_col: Column name for predicted binary labels
        y_prob_col: Column name for predicted probabilities
        protected_col: Column name for protected attribute

    Returns:
        DataFrame with metrics per group
    """
    groups = df[protected_col].unique()
    results = []

    for group in sorted(groups):
        mask = df[protected_col] == group
        y_true = df.loc[mask, y_true_col]
        y_pred = df.loc[mask, y_pred_col]
        y_prob = df.loc[mask, y_prob_col]

        # labels=[0, 1] keeps the matrix 2x2 even if a group contains a single class
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

        results.append({
            "group": group,
            "n": mask.sum(),
            "positive_rate": y_pred.mean(),  # demographic parity metric
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall_tpr": recall_score(y_true, y_pred, zero_division=0),  # equal opportunity metric
            "fpr": fp / (fp + tn) if (fp + tn) > 0 else 0,  # equalized odds metric
            "fnr": fn / (fn + tp) if (fn + tp) > 0 else 0,
            "auc": roc_auc_score(y_true, y_prob) if y_true.nunique() > 1 else float("nan"),
            "base_rate": y_true.mean(),
        })

    results_df = pd.DataFrame(results)

    # Add disparate impact ratio (relative to highest positive rate group)
    max_pos_rate = results_df["positive_rate"].max()
    results_df["disparate_impact"] = results_df["positive_rate"] / max_pos_rate
    results_df["eeoc_flag"] = results_df["disparate_impact"] < 0.8

    print("\n=== Fairness Evaluation ===")
    print(results_df.to_string(index=False))

    # Identify violations
    print("\n=== Fairness Violations ===")
    eeoc_violators = results_df[results_df["eeoc_flag"]]
    if len(eeoc_violators) > 0:
        print(f"EEOC 80% rule violated for groups: {eeoc_violators['group'].tolist()}")
    else:
        print("No EEOC disparate impact violations detected.")

    # Equal opportunity gap
    max_tpr = results_df["recall_tpr"].max()
    min_tpr = results_df["recall_tpr"].min()
    print(f"Equal opportunity gap (max TPR - min TPR): {max_tpr - min_tpr:.3f}")
    if max_tpr - min_tpr > 0.1:
        print("WARNING: TPR gap exceeds 10% - equal opportunity may be violated.")

    # FPR gap (equalized odds component)
    max_fpr = results_df["fpr"].max()
    min_fpr = results_df["fpr"].min()
    print(f"Equalized odds FPR gap: {max_fpr - min_fpr:.3f}")

    return results_df


# Generate synthetic loan data with demographic group
np.random.seed(42)
n = 5000
groups = np.random.choice(["Group A", "Group B", "Group C"], n, p=[0.5, 0.3, 0.2])

# Simulate different base rates per group (real-world disparity)
base_rates = {"Group A": 0.35, "Group B": 0.25, "Group C": 0.40}
y_true = np.array([np.random.binomial(1, base_rates[g]) for g in groups])

# Simulate a biased model (slightly over-predicts Group A)
y_prob = np.clip(
    y_true * 0.7 + np.random.normal(0, 0.2, n)
    + np.where(groups == "Group A", 0.05, -0.03),
    0, 1
)
y_pred = (y_prob > 0.5).astype(int)

df_eval = pd.DataFrame({
    "group": groups,
    "y_true": y_true,
    "y_pred": y_pred,
    "y_prob": y_prob,
})

fairness_report = evaluate_fairness(
    df_eval, "y_true", "y_pred", "y_prob", "group"
)
```

Fairlearn ThresholdOptimizer for Post-Processing Fairness

```python
"""
Post-processing fairness with Fairlearn ThresholdOptimizer.
pip install fairlearn scikit-learn
"""
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate, true_positive_rate


def train_fair_model(
    X: np.ndarray,
    y: np.ndarray,
    sensitive_features: np.ndarray,
    constraint: str = "equalized_odds",
) -> tuple:
    """
    Train a base classifier and apply ThresholdOptimizer for fairness.

    ThresholdOptimizer finds group-specific (possibly randomized) decision
    thresholds that maximize the objective subject to the fairness constraint.

    Args:
        X: Feature matrix
        y: Labels
        sensitive_features: Protected attribute (group membership)
        constraint: Fairness constraint - 'equalized_odds', 'demographic_parity',
            'true_positive_rate_parity', 'false_positive_rate_parity'

    Returns:
        (base_model, fair_model, X_test, y_test, sensitive_test)
    """
    X_train, X_test, y_train, y_test, sf_train, sf_test = train_test_split(
        X, y, sensitive_features, test_size=0.3, random_state=42, stratify=y
    )

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Step 1: Train the base model (no fairness constraints)
    base_model = LogisticRegression(max_iter=1000, random_state=42)
    base_model.fit(X_train_scaled, y_train)

    # Step 2: ThresholdOptimizer wraps the base model
    # It learns separate thresholds per group during fit() and needs the
    # sensitive feature again at predict() time.
    # prefit=True makes this pure post-processing: the base model is not refit
    fair_model = ThresholdOptimizer(
        estimator=base_model,
        constraints=constraint,  # which fairness definition to enforce
        objective="accuracy_score",  # what to maximize while satisfying constraint
        prefit=True,
        predict_method="predict_proba",
    )
    fair_model.fit(X_train_scaled, y_train, sensitive_features=sf_train)

    # Evaluate both
    print("\n=== Base Model (no fairness constraint) ===")
    y_pred_base = base_model.predict(X_test_scaled)
    mf_base = MetricFrame(
        metrics={
            "accuracy": lambda y, yp: (y == yp).mean(),
            "selection_rate": selection_rate,
            "true_positive_rate": true_positive_rate,
            "false_positive_rate": false_positive_rate,
        },
        y_true=y_test,
        y_pred=y_pred_base,
        sensitive_features=sf_test,
    )
    print(mf_base.by_group)

    print(f"\n=== Fair Model (ThresholdOptimizer, constraint={constraint}) ===")
    # Predictions can be randomized between thresholds; fix the seed for reproducibility
    y_pred_fair = fair_model.predict(
        X_test_scaled, sensitive_features=sf_test, random_state=42
    )
    mf_fair = MetricFrame(
        metrics={
            "accuracy": lambda y, yp: (y == yp).mean(),
            "selection_rate": selection_rate,
            "true_positive_rate": true_positive_rate,
            "false_positive_rate": false_positive_rate,
        },
        y_true=y_test,
        y_pred=y_pred_fair,
        sensitive_features=sf_test,
    )
    print(mf_fair.by_group)

    print(f"\nAccuracy difference (base - fair): "
          f"{mf_base.overall['accuracy'] - mf_fair.overall['accuracy']:.4f}")
    print("(Small accuracy reduction is the cost of fairness)")

    return base_model, fair_model, X_test_scaled, y_test, sf_test


# Generate synthetic data
np.random.seed(42)
n = 3000
group = np.random.choice(["A", "B"], n, p=[0.6, 0.4])
X = np.column_stack([
    np.random.randn(n, 4),  # informative features
    (group == "A").astype(float) + np.random.randn(n) * 0.3,  # feature correlated with group
])
y = ((X[:, 0] + X[:, 1] * 0.5 + (group == "A") * 0.3 + np.random.randn(n) * 0.5) > 0).astype(int)

base, fair, X_te, y_te, sf_te = train_fair_model(X, y, group, constraint="equalized_odds")
```

DP-SGD with Opacus

"""
Training with Differential Privacy using Opacus.
pip install opacus torch torchvision
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator


def train_with_dp(
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    target_epsilon: float = 3.0,  # privacy budget: 3.0 is reasonable
    target_delta: float = 1e-5,   # probability of guarantee breaking
    max_grad_norm: float = 1.0,   # per-sample gradient clipping norm C
    epochs: int = 10,
    batch_size: int = 256,
    lr: float = 0.05,
) -> nn.Module:
    """
    Train a neural network with DP-SGD (Differential Privacy via Opacus).

    DP-SGD provides mathematical guarantees that the trained model does not
    reveal whether any individual's data was in the training set.

    Privacy budget ε=3 is a reasonable practical choice. Rough rules of thumb
    (the actual accuracy loss depends on the task, model, and dataset size):
    - ε < 1: very strong privacy, significant accuracy loss
    - ε = 1-5: strong privacy, moderate accuracy loss
    - ε = 10-100: weak privacy, minimal accuracy loss
    - No DP: no privacy guarantee

    Args:
        target_epsilon: Maximum privacy budget to spend
        target_delta: Probability of guarantee failure (commonly chosen
            below 1/n, where n is the training set size)
        max_grad_norm: Per-sample gradient clipping norm (C in DP-SGD formula)
    """
    input_dim = X_train.shape[1]

    # Model: must be compatible with Opacus
    # (no batch norm - use group norm instead; Opacus requires per-sample gradients)
    model = nn.Sequential(
        nn.Linear(input_dim, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 1),
        nn.Sigmoid(),
    )

    # Validate and fix any incompatible modules
    model = ModuleValidator.fix(model)

    dataset = TensorDataset(X_train, y_train.float().unsqueeze(1))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

    optimizer = optim.SGD(model.parameters(), lr=lr)
    criterion = nn.BCELoss()

    # Attach Opacus PrivacyEngine to the training loop
    # This automatically:
    # 1. Computes per-sample gradients (instead of batch gradients)
    # 2. Clips per-sample gradients to max_grad_norm
    # 3. Adds calibrated Gaussian noise to the clipped gradients
    # 4. Tracks the privacy budget spent via Rényi DP accounting
    privacy_engine = PrivacyEngine()
    model, optimizer, loader = privacy_engine.make_private_with_epsilon(
        module=model,
        optimizer=optimizer,
        data_loader=loader,
        epochs=epochs,
        target_epsilon=target_epsilon,
        target_delta=target_delta,
        max_grad_norm=max_grad_norm,
    )

    print(f"DP-SGD training: target ε={target_epsilon}, δ={target_delta}")
    print("Noise multiplier σ is chosen by Opacus to meet the privacy budget")

    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for batch_X, batch_y in loader:
            optimizer.zero_grad()
            output = model(batch_X)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Report actual privacy budget spent so far
        epsilon = privacy_engine.get_epsilon(target_delta)
        print(f"Epoch {epoch + 1}/{epochs}: loss={total_loss / len(loader):.4f}, "
              f"ε spent so far={epsilon:.2f} (budget={target_epsilon})")

    final_epsilon = privacy_engine.get_epsilon(target_delta)
    print(f"\nFinal privacy guarantee: (ε={final_epsilon:.2f}, δ={target_delta})")
    print(f"Interpretation: any output is at most e^{final_epsilon:.2f} ≈ "
          f"{torch.exp(torch.tensor(final_epsilon)).item():.1f}x more or less likely "
          f"whether or not any one individual was in the training data")

    return model


# Synthetic demo
n, d = 2000, 10
X = torch.randn(n, d)
y = ((X[:, 0] + X[:, 1]) > 0).long()

dp_model = train_with_dp(X.float(), y, target_epsilon=3.0, epochs=5)

Common Mistakes

:::danger Omitting Protected Attributes but Keeping Proxies

Removing race, gender, or age from the feature set does not prevent discrimination if you keep correlated proxies. ZIP code correlates strongly with race in the US. First name correlates with gender and ethnicity. Job title correlates with gender. A model with access to these proxies can effectively reconstruct the protected attribute and discriminate indirectly. This is called "proxy discrimination", and it can be just as unlawful under disparate impact law as using the protected attribute directly. Always test how predictable the protected attribute is from your remaining features.

:::
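One way to run the proxy test described above is to ask how well each remaining feature, on its own, separates the protected groups. The sketch below is a minimal illustration on synthetic data; the feature names (`income`, `zip_code_score`), the helper `proxy_auc`, and the 0.2 flagging margin are assumptions made for this example, not a standard.

```python
import numpy as np


def proxy_auc(feature: np.ndarray, protected: np.ndarray) -> float:
    """AUC obtained by using one feature as a score for protected == 1."""
    pos = feature[protected == 1]
    neg = feature[protected == 0]
    # Fraction of (pos, neg) pairs ranked correctly; ties count as half.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


rng = np.random.default_rng(0)
n = 2000
protected = rng.integers(0, 2, size=n)  # synthetic binary protected attribute

features = {
    # Independent of the protected attribute.
    "income": rng.normal(size=n),
    # Hypothetical proxy: shifted by group membership.
    "zip_code_score": protected + rng.normal(scale=0.5, size=n),
}

proxy_report = {name: proxy_auc(col, protected) for name, col in features.items()}
for name, auc in proxy_report.items():
    # AUC near 0.5 means the feature carries little group information.
    flag = "possible proxy" if abs(auc - 0.5) > 0.2 else "ok"
    print(f"{name}: AUC for predicting group = {auc:.2f} ({flag})")
```

A single-feature check like this can miss proxies that only emerge from combinations of features, so a fuller audit would also train a classifier on all remaining features jointly to predict the protected attribute.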

:::danger Evaluating Only Aggregate Metrics

Reporting only overall AUC, accuracy, or F1 is insufficient and misleading for fairness-critical systems. A model with overall AUC 0.85 can have AUC 0.90 for Group A and AUC 0.72 for Group B. The aggregate masks the harm. Always compute per-group metrics. The groups to evaluate should include all legally protected classes relevant to your deployment context (race, gender, age, disability, national origin, religion).

:::

:::danger Choosing a Fairness Metric Without Ethical Analysis

Picking demographic parity because it is easy to implement without analyzing whether it is appropriate for your context can cause harm in the other direction. If your goal is to evaluate loan creditworthiness fairly, demographic parity (equal approval rates across groups) might require approving less-creditworthy applicants from the disadvantaged group to hit the quota. Equal opportunity (equal recall of truly creditworthy applicants) might be more ethically appropriate. The choice is not technical - it requires input from ethicists, legal counsel, and affected community representatives.

:::

:::warning Achieving Demographic Parity at the Cost of Catastrophically Degraded Accuracy

Fairness is not a binary constraint - it exists on a trade-off curve with accuracy. Fairlearn's ExponentiatedGradient, run at several constraint tolerances, or its GridSearch can trace out the frontier of fairness vs accuracy trade-offs. Do not automatically accept the maximum fairness solution if it destroys utility. Present the full trade-off curve to decision-makers and let them choose the operating point based on the application's requirements and the relative harms of different error types.

:::

:::tip Use SHAP for Regulatory Explanations

GDPR Article 22 and similar regulations give individuals rights around automated decisions, often summarized as a "right to explanation". SHAP (SHapley Additive exPlanations) provides per-instance, per-feature attributions that can support such explanations. For a loan denial, you can tell the applicant: "The three most influential factors in your application were: credit utilization (negative), length of credit history (positive), and number of recent inquiries (negative)." SHAP is widely used for this purpose, but whether a given explanation satisfies a specific regulation is a question for legal counsel.

:::

:::tip Establish a Model Card for Every Production System

A model card (Mitchell et al., 2019) is a standardized documentation artifact for ML models that includes: intended use, out-of-scope uses, training data description, evaluation results by demographic group, ethical considerations, and limitations. Google, Hugging Face, and major cloud providers have standardized model card formats. Producing a model card forces you to conduct the fairness analysis before deployment rather than after an incident.

:::


YouTube Resources

| Video | Creator | Why Watch |
| --- | --- | --- |
| 21 Fairness Definitions | Arvind Narayanan | Mathematical fairness definitions and the impossibility result |
| COMPAS Recidivism Analysis | ProPublica | The famous bias case study in full |
| Differential Privacy | Googlers | DP explained accessibly with concrete examples |
| EU AI Act Overview | European Parliament | Regulatory requirements and risk tiers |

Interview Questions and Answers

Q1: Explain the fairness impossibility theorem. Why does it matter in practice?

Chouldechova (2017) proved that when base rates differ between demographic groups - $P(Y=1 \mid A=0) \neq P(Y=1 \mid A=1)$ - it is mathematically impossible to simultaneously achieve calibration (scores mean the same thing for all groups), equal false positive rates across groups, and equal false negative rates across groups. You must sacrifice at least one.

In practice: the COMPAS system was calibrated but had unequal false positive rates across racial groups - exactly because Black defendants had higher base rates of reoffending in that dataset (itself a result of historical over-policing and criminalization). A classifier cannot simultaneously say "7 out of 10 means 70% risk for everyone" and "the same fraction of low-risk people are falsely flagged as high-risk regardless of group" when the base rates of actual risk differ.

This matters enormously in practice because different stakeholders legitimately prioritize different fairness criteria. Defendants care about false positive rates (being wrongly detained). Courts care about calibration (a given score implying the same reoffense risk in every group). There is no objectively correct answer - the choice is an ethical and political one that must involve all affected stakeholders. Engineers who pretend this is a technical decision are abdicating responsibility.
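The incompatibility can be checked with a few lines of arithmetic. From the definitions of positive predictive value (PPV), false positive rate (FPR), and false negative rate (FNR), a group with base rate $p$ satisfies $\text{FPR} = \frac{p}{1-p} \cdot \frac{1-\text{PPV}}{\text{PPV}} \cdot (1-\text{FNR})$. The sketch below plugs in illustrative numbers (assumptions chosen for the example, not the actual COMPAS figures) and holds PPV and FNR equal across two groups.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by the identity
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)."""
    return base_rate / (1.0 - base_rate) * (1.0 - ppv) / ppv * (1.0 - fnr)


# Illustrative values only: the same PPV and FNR for both groups.
ppv, fnr = 0.6, 0.35

fpr_high = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)  # higher base-rate group
fpr_low = implied_fpr(base_rate=0.4, ppv=ppv, fnr=fnr)   # lower base-rate group

print(f"FPR with base rate 0.5: {fpr_high:.3f}")
print(f"FPR with base rate 0.4: {fpr_low:.3f}")
```

With PPV and FNR held equal, the group with the higher base rate is forced to a higher false positive rate. Equalizing FPR as well would require giving up equal PPV or equal FNR.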


Q2: What is the difference between demographic parity and equalized odds? When would you choose each?

Demographic parity requires equal positive prediction rates across groups: $P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)$. The model approves equal fractions of each group regardless of qualifications.

Equalized odds requires equal true positive rates AND equal false positive rates across groups. Among qualified individuals, equal fractions are approved. Among unqualified individuals, equal fractions are erroneously approved.

Choose demographic parity when: you believe the underlying rate of "deserving positive outcomes" should be equal across groups, and any empirical difference in that rate is due to historical discrimination rather than genuine differential qualification. Example: ensuring equal representation in a job training program regardless of past credentials (which may reflect unequal access to education).

Choose equalized odds when: you believe the underlying qualification rate genuinely differs across groups (for non-discriminatory reasons), but you want to ensure that equally qualified individuals receive equal treatment. Example: a credit model where creditworthiness genuinely differs across income groups, but you want to ensure that creditworthy individuals from all income groups are approved at equal rates.

In practice: equalized odds is more commonly operationally feasible and legally defensible, because demographic parity can require actively preferential treatment that may conflict with anti-discrimination law in some jurisdictions.
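The two definitions can disagree on the same predictions. The following sketch uses a small hand-made dataset (an assumption for illustration) where both groups are selected at the same rate, so demographic parity holds, yet true and false positive rates differ, so equalized odds is violated. The helper `group_rates` is a name introduced here.

```python
import numpy as np


def group_rates(y_true, y_pred, group, value):
    """Selection rate, TPR, and FPR for the rows where group == value."""
    mask = group == value
    yt, yp = y_true[mask], y_pred[mask]
    return {
        "selection_rate": float(yp.mean()),
        "tpr": float(yp[yt == 1].mean()),  # approved among truly qualified
        "fpr": float(yp[yt == 0].mean()),  # approved among truly unqualified
    }


# Toy data: first 8 rows are group A, last 8 rows are group B.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
group = np.array(["A"] * 8 + ["B"] * 8)

a = group_rates(y_true, y_pred, group, "A")
b = group_rates(y_true, y_pred, group, "B")

# Demographic parity gap: difference in selection rates.
dp_gap = abs(a["selection_rate"] - b["selection_rate"])
# Equalized odds gap: the larger of the TPR gap and the FPR gap.
eo_gap = max(abs(a["tpr"] - b["tpr"]), abs(a["fpr"] - b["fpr"]))

print(f"Group A: {a}")
print(f"Group B: {b}")
print(f"Demographic parity gap: {dp_gap:.3f}")
print(f"Equalized odds gap:     {eo_gap:.3f}")
```

Here both groups have a 50% selection rate, but group A's qualified members are approved less often than group B's, which is exactly the kind of disparity demographic parity cannot see.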


Q3: Explain differential privacy and the epsilon parameter intuitively.

Differential privacy is a mathematical framework for quantifying and bounding the privacy risk from releasing information about a dataset. A mechanism $M$ is $\varepsilon$-differentially private if for any two datasets $D_1$ and $D_2$ that differ in exactly one person's record, and for every set of outputs $S$, the mechanism's outputs are nearly indistinguishable: $P[M(D_1) \in S] \leq e^\varepsilon \cdot P[M(D_2) \in S]$.

The $\varepsilon$ parameter (the privacy budget) controls how indistinguishable the outputs are. At $\varepsilon = 0$: perfect indistinguishability - the output reveals nothing about whether you are in the dataset (but typically the mechanism reveals nothing useful either). At $\varepsilon = 1$: the probability of any output can differ between neighboring datasets by at most a factor of $e^1 \approx 2.7$ - an attacker learns little about whether you are in the dataset. At $\varepsilon = 10$: the probabilities can differ by a factor of $e^{10} \approx 22{,}000$ - very weak privacy.

The intuition: DP adds carefully calibrated noise to computations. The noise is just large enough that no statistical test can reliably determine whether your particular record influenced the output. This is what Apple uses for keyboard usage statistics (keyboard auto-correction learning) and Google uses for Chrome browsing trends - they collect aggregate statistics without learning about any individual.
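To make the role of $\varepsilon$ concrete, here is a minimal sketch of the textbook Laplace mechanism for a counting query. It is an illustration under stated assumptions (a made-up list of ages, a helper named `private_count`), not the mechanism used by any particular product.

```python
import numpy as np


def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-DP using the Laplace mechanism."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1
    # (sensitivity = 1), so Laplace noise with scale 1/epsilon is enough.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)


rng = np.random.default_rng(0)
ages = [23, 35, 41, 52, 29, 64, 38, 47]  # toy dataset (made up)
true_count = sum(1 for a in ages if a > 40)

noise_std = {}
for eps in (0.1, 1.0, 10.0):
    releases = np.array(
        [private_count(ages, lambda a: a > 40, eps, rng) for _ in range(5000)]
    )
    noise_std[eps] = float(releases.std())
    print(f"eps={eps:>4}: true count={true_count}, "
          f"one release={releases[0]:.2f}, noise std={noise_std[eps]:.2f}")
```

Smaller $\varepsilon$ means a larger noise scale: the released count becomes less accurate, and it becomes harder to tell whether any one person's record was included.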

In DP-SGD, the mechanism is gradient descent: each per-sample gradient is clipped, and the aggregated update is perturbed with noise calibrated to spend a small increment $\Delta\varepsilon$ of the privacy budget. After $T$ gradient steps, the total budget spent is tracked by the privacy accountant.
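The core of a single DP-SGD update can be sketched in NumPy: clip each per-sample gradient to norm $C$, sum, add Gaussian noise with standard deviation $\sigma C$, then average. This is a simplified illustration with random stand-in gradients and a hypothetical helper `dp_sgd_step`; it omits the privacy accounting that a library such as Opacus performs.

```python
import numpy as np


def dp_sgd_step(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Return (noisy averaged gradient, clipped per-sample gradients)."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds max_grad_norm (C).
    clip_factor = np.minimum(1.0, max_grad_norm / (norms + 1e-12))
    clipped = per_sample_grads * clip_factor
    # Gaussian noise with std sigma * C, added once to the summed gradient.
    noise = rng.normal(
        0.0, noise_multiplier * max_grad_norm, size=per_sample_grads.shape[1]
    )
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_sample_grads)
    return noisy_mean, clipped


rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 10)) * 3.0  # stand-in per-sample gradients

noisy_grad, clipped = dp_sgd_step(
    grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng
)
# With noise_multiplier=0 the step reduces to plain clipped averaging.
noiseless_grad, _ = dp_sgd_step(
    grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=rng
)

print("largest clipped norm:", np.linalg.norm(clipped, axis=1).max())
print("noisy gradient:", np.round(noisy_grad, 4))
```

Clipping bounds how much any one example can move the update, which is what lets a fixed amount of noise hide that example's contribution.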


Q4: How would you audit a production ML model for bias?

A systematic bias audit has five components:

Step 1 - Identify protected attributes and proxy features: list all legally protected classes relevant to the deployment context (race, gender, age, disability, etc.) and identify features that proxy for them in your feature set.

Step 2 - Measure base rates: compute the prevalence of the positive outcome ($Y=1$) separately for each demographic group. Document base rate differences - these drive the impossibility constraints.

Step 3 - Slice-based evaluation: compute accuracy, TPR, FPR, precision, AUC, and selection rate separately for each demographic group. Compute the disparate impact ratio (EEOC 80% rule). Flag groups with DI below 0.8 or fairness metric gaps above your chosen threshold.

Step 4 - Counterfactual analysis: for a sample of predictions, flip the protected attribute (or a proxy) and observe whether predictions change. High change rate indicates the model is using protected attribute information.

Step 5 - Error analysis by group: for false positives and false negatives, examine whether the errors are concentrated in specific demographic groups. Document the distribution of harmful errors.

Produce a model card documenting findings, the chosen fairness trade-offs, and the ethical rationale. Reaudit after every significant model or data update.
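Two of the audit steps above lend themselves to short helper functions: the disparate impact ratio from Step 3 and the counterfactual flip test from Step 4. The sketch below is a minimal illustration; `toy_model` is a deliberately biased hypothetical classifier, and the helper names and data are assumptions made for the example.

```python
import numpy as np


def disparate_impact_ratio(y_pred, group, unprivileged, privileged):
    """Selection rate of the unprivileged group divided by the privileged one."""
    return float(
        y_pred[group == unprivileged].mean() / y_pred[group == privileged].mean()
    )


def counterfactual_flip_rate(predict, X, attr_col):
    """Share of rows whose prediction changes when a binary attribute is flipped."""
    X_flipped = X.copy()
    X_flipped[:, attr_col] = 1 - X_flipped[:, attr_col]
    return float((predict(X) != predict(X_flipped)).mean())


def toy_model(X):
    # Hypothetical model that leaks the protected attribute (column 1).
    return (X[:, 0] + 0.3 * X[:, 1] > 0.65).astype(int)


scores = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
# Column 0: score; column 1: binary protected attribute (0 or 1).
X = np.column_stack([np.tile(scores, 2), np.repeat([0.0, 1.0], 5)])
group = X[:, 1].astype(int)
y_pred = toy_model(X)

di = disparate_impact_ratio(y_pred, group, unprivileged=0, privileged=1)
flip_rate = counterfactual_flip_rate(toy_model, X, attr_col=1)

print(f"Disparate impact ratio: {di:.3f} (flag if below 0.8)")
print(f"Counterfactual flip rate: {flip_rate:.2f}")
```

A nonzero flip rate shows the model reacts directly to the attribute. When the attribute is not a model input, the same test can be run on a suspected proxy column instead.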


Q5: What are the EU AI Act requirements for high-risk AI systems, and how do they affect ML engineering practice?

The EU AI Act defines high-risk AI systems as those used in: critical infrastructure, education assessment, employment decisions, essential services (credit, health, life insurance), law enforcement, border control, and justice administration. Any ML engineer working on systems in these domains must understand the requirements.

Technical requirements for ML engineers:

  1. Data governance: training data must be documented for relevance, representativeness, and freedom from errors. You must conduct bias analysis on training data.
  2. Technical documentation: comprehensive documentation of the system's purpose, design choices, limitations, and performance characteristics across demographic groups.
  3. Automatic logging: the system must log inputs and outputs for each high-consequence decision, with logs retained for defined periods.
  4. Transparency to operators: non-technical operators must be able to understand the system's output and its limitations.
  5. Human oversight: the system must be designed so that people can monitor it, interpret its outputs, and override or stop automated decisions. Consequential decisions should not be left fully automated without effective human review.
  6. Accuracy and robustness: the system must meet defined accuracy standards and be robust to adversarial inputs and distribution shift.
  7. Conformity assessment: before deployment, high-risk systems must undergo a conformity assessment (either self-assessment or third-party) demonstrating compliance with the above requirements.

The practical implication: fairness auditing, explainability, logging, and human-in-the-loop design are not optional features. They are regulatory requirements for high-risk EU deployments.
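As one possible shape for the logging and human-oversight requirements, the sketch below builds a single audit record per decision and serializes it as one JSON line. The schema, field names, and model version string are hypothetical choices for illustration; the Act does not prescribe this format.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_decision_record(model_version, features, score, decision, reviewer=None):
    """Build one audit-log entry for an automated decision (hypothetical schema)."""
    payload = json.dumps(features, sort_keys=True)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        # Hash of the canonicalized inputs, so later tampering can be detected.
        "input_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "features": features,
        "score": score,
        "decision": decision,
        # Stays None until a person reviews or overrides the decision.
        "human_reviewer": reviewer,
    }


record = make_decision_record(
    model_version="credit-risk-1.4.2",  # made-up version identifier
    features={"credit_utilization": 0.82, "recent_inquiries": 3},
    score=0.71,
    decision="refer_to_human",
)

# One JSON object per line is easy to append to a file and to query later.
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```

In a real deployment, retention periods, access controls, and whether raw features may be stored at all would need to be settled with legal and privacy teams.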


Summary

Responsible AI is not a soft concern bolted onto the side of ML engineering - it is a core technical discipline with rigorous mathematical foundations and growing regulatory teeth.

The key principles:

  1. Bias enters at every stage: historical bias, representation bias, measurement bias, aggregation bias, and deployment bias each require different interventions. Understanding where bias originates is the first step toward addressing it.

  2. There is no single correct fairness metric: demographic parity, equal opportunity, equalized odds, calibration, and individual fairness are all legitimate, and several of them cannot be satisfied at the same time when base rates differ. Choosing among them is an ethical and political decision requiring stakeholder input, not a technical optimization.

  3. The impossibility theorem is real: when base rates differ between groups, you cannot simultaneously achieve calibration, equal false positive rates, and equal false negative rates. Document which trade-off your system makes and why.

  4. Slice-based evaluation is the minimum baseline: never report only aggregate metrics for systems that affect human lives. Compute and report metrics separately for every relevant demographic group.

  5. Privacy-preserving ML has rigorous guarantees: differential privacy ($(\varepsilon, \delta)$-DP) provides a mathematically quantifiable privacy guarantee, and federated learning keeps raw data on the device (it is usually combined with DP or secure aggregation to obtain a formal guarantee). For sensitive domains (health, finance, criminal justice), these are not optional.

  6. Regulation is arriving: the EU AI Act, GDPR Article 22, EEOC guidelines, ECOA, and FDA guidance create concrete legal requirements for ML systems affecting individuals. Engineers working in these domains must understand the regulatory requirements, not just the technical ones.


© 2026 EngineersOfAI. All rights reserved.