Privacy and Ethics in Synthetic Data Generation
The synthesis pipeline has been running for three weeks. One hundred thousand training examples generated. The fine-tuned model looks excellent in internal demos. Legal signs off because - as the team explains it - "it's AI-generated, not scraped, so there's no copyright concern." Management approves the budget for the next phase. Deployment is scheduled.
Then a researcher on the red team runs a systematic extraction probe. She constructs prompts that lead the teacher model to complete specific sentence prefixes. She runs two thousand probes and compares the completions to a database of known copyrighted texts. The results land in the team's inbox on a Thursday afternoon: 3,200 of the generated training examples are near-verbatim reproductions of copyrighted material. The first four chapters of a medical reference manual. A substantial portion of a programming textbook. Seventeen verbatim paragraphs from a Wall Street Journal article. The "synthetic" dataset contains clearly infringing content - content the team legally cannot use to train a model they plan to distribute.
This is not a hypothetical. The New York Times v. OpenAI lawsuit includes exhibits showing GPT-4 reproducing NYT articles almost word-for-word when prompted correctly. The EU AI Act (obligations for general-purpose AI models apply from August 2025, with most remaining provisions following in August 2026) requires providers to document and summarize their training data. California AB 2013 requires disclosure of training data sources. The legal and ethical landscape around synthetic data is evolving faster than most engineering teams realize, and the teams that treat it as a non-issue are accumulating liability that will eventually materialize.
What makes synthetic data privacy different from traditional data privacy is that the risks are not where people expect them. There are no real users in your synthetic dataset - but there may be real copyrighted text reproduced from your teacher model's training data. There is no PII you collected - but there may be realistic-looking PII that trains a downstream model to generate similar realistic-looking PII in production. The risks are indirect and invisible until someone looks for them specifically. This lesson is about building the processes that find them before your legal team does.
The Privacy Paradox of Synthetic Data
Synthetic data is not privacy-safe by definition. It is privacy-safe only when you have verified specific properties and built specific safeguards. Understanding the three distinct risk channels is the starting point.
Risk 1: Memorization and Verbatim Reproduction
Language models memorize training data, especially text that appears multiple times or long contiguous sequences. The memorization rate for frontier models is estimated at 1–3% of training sequences under extractive probing (Carlini et al., 2023). When you use a frontier model as a teacher to generate synthetic training data, some generated text will be near-verbatim reproductions of the model's training data. If that training data includes copyrighted text - books, articles, code, private communications - your synthetic dataset is contaminated with potentially infringing content.
The mechanism matters for understanding the risk: memorization is not uniform. It is concentrated in text that appears frequently in pretraining corpora (popular books, widely-cited articles), text with distinctive low-entropy patterns (code with specific APIs, formulaic legal language), and text that was present in fine-tuning data. Prompts that closely echo training data structure are more likely to elicit memorized completions.
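Because memorized text tends to surface as long contiguous runs, a cheap first check is word n-gram overlap between a generated example and any reference text you hold a copy of. The following is a minimal sketch, not a production detector: the function names, the 8-word window, and the sample strings are all illustrative assumptions, and it can only catch reproductions of texts you actually have on hand.

```python
def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also occur in reference."""
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & word_ngrams(reference, n)) / len(gen)


# Hypothetical strings, for illustration only
reference = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
copied = "As noted, the quick brown fox jumps over the lazy dog near the quiet river bank"
fresh = "A slow red hen walks under a busy bridge far from any noisy city street"

print(f"copied overlap: {verbatim_overlap(copied, reference):.2f}")
print(f"fresh overlap:  {verbatim_overlap(fresh, reference):.2f}")
```

A long window such as 8 words keeps common phrases from triggering matches; shorter windows catch more paraphrase-like reuse at the cost of false positives.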
Risk 2: Attribute Association
Even without verbatim reproduction, LLMs learn statistical associations between real individuals and their attributes. If a specific doctor's name, hospital affiliation, and specialty appear together frequently in pretraining data (medical directories, journal articles, news), the model learns this association and may reproduce it in generated text - surfacing real personal information without any memorized sequence. This is the quasi-identifier problem: no single piece of information is sensitive, but the combination uniquely identifies a real person.
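The quasi-identifier problem can be made concrete by counting how many records share each combination of attributes. The sketch below assumes you have already extracted attribute mentions from generated text into dictionaries; the field names, the `rare_combinations` helper, and the threshold `k=2` are hypothetical choices for illustration.

```python
from collections import Counter


def rare_combinations(
    records: list[dict],
    quasi_identifiers: list[str],
    k: int = 2,
) -> list[tuple]:
    """Return attribute combinations shared by fewer than k records."""
    combos = Counter(
        tuple(record.get(q) for q in quasi_identifiers) for record in records
    )
    return [combo for combo, count in combos.items() if count < k]


# Hypothetical extracted attributes, not real people
records = [
    {"specialty": "cardiology", "hospital": "Hospital A", "city": "City X"},
    {"specialty": "cardiology", "hospital": "Hospital A", "city": "City X"},
    {"specialty": "neurology", "hospital": "Hospital B", "city": "City Y"},
]

print(rare_combinations(records, ["specialty", "hospital", "city"]))
```

A combination that appears only once is a candidate for manual review: each attribute is harmless alone, but together they may point at one real individual.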
Risk 3: Model Collapse
An underappreciated long-term risk: if your synthetic data becomes part of the training data for the next generation of models, which then generate the synthetic data for the generation after that, the distribution of generated content degrades. Shumailov et al. (Nature, 2024) demonstrated "model collapse" - iterative training on AI-generated data causes progressive loss of distribution diversity and increased concentration around modal outputs. Your synthetic data pipeline may be contributing to a long-term problem in the broader AI ecosystem.
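A toy simulation can illustrate the loss of diversity. This is not the experimental setup of Shumailov et al.; it is a sketch under one stated assumption, namely that each generation's generator under-samples the tails of the distribution it was trained on (modeled here by discarding samples more than `keep_within` standard deviations from the mean). All names and parameter values are illustrative.

```python
import numpy as np


def simulate_collapse(
    n_generations: int = 6,
    n_samples: int = 5000,
    keep_within: float = 1.5,
    seed: int = 0,
) -> list[float]:
    """Track the fitted standard deviation across generations of retraining."""
    rng = np.random.default_rng(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(n_generations):
        # "Generate" data from the current model
        samples = rng.normal(mean, std, n_samples)
        # Assumption: tail outputs are under-sampled, so only near-modal data survives
        kept = samples[np.abs(samples - mean) <= keep_within * std]
        # "Train" the next model by refitting on the surviving data
        mean, std = float(kept.mean()), float(kept.std())
        history.append(std)
    return history


for generation, std in enumerate(simulate_collapse()):
    print(f"generation {generation}: std = {std:.3f}")
```

The fitted spread shrinks every generation because each model only ever sees the center of its predecessor's distribution.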
Detecting Memorized Content
import anthropic
import re
import json
import hashlib
from dataclasses import dataclass
from typing import Optional
from collections import Counter
client = anthropic.Anthropic()
@dataclass
class MemorizationCheckResult:
"""Result of a memorization probe on a piece of generated text."""
text_hash: str
is_likely_memorized: bool
memorization_score: float # 0.0 = clearly generated, 1.0 = clearly memorized
evidence: str
recommendation: str # "ok", "investigate", "reject"
probe_completions: Optional[list[str]] = None
def compute_rouge_l(text1: str, text2: str) -> float:
"""
Compute ROUGE-L score (longest common subsequence based).
Used to measure overlap between expected and actual completions.
"""
tokens1 = text1.lower().split()
tokens2 = text2.lower().split()
if not tokens1 or not tokens2:
return 0.0
m, n = len(tokens1), len(tokens2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if tokens1[i-1] == tokens2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
lcs_len = dp[m][n]
precision = lcs_len / n if n > 0 else 0
recall = lcs_len / m if m > 0 else 0
if precision + recall == 0:
return 0.0
return 2 * precision * recall / (precision + recall)
def counterfactual_memorization_probe(
generated_text: str,
n_probes: int = 3,
suspicious_threshold: float = 0.25,
model: str = "claude-haiku-4-5-20251001",
) -> MemorizationCheckResult:
"""
Detect memorization via counterfactual probing.
Method: Extract the first sentence of the generated text.
Use the first 6 words as a prefix. Ask the model to complete it.
If completions consistently match the original text, it may be memorized.
Limitation: This is a proxy method. True memorization detection
requires access to the training data, which is not available for
closed-source models. Use this as a first-pass filter.
"""
text_hash = hashlib.md5(generated_text.encode()).hexdigest()[:8]
# Extract test sentence
sentences = [
s.strip()
for s in re.split(r'(?<=[.!?])\s+', generated_text)
if len(s.split()) >= 12
]
if not sentences:
return MemorizationCheckResult(
text_hash=text_hash,
is_likely_memorized=False,
memorization_score=0.0,
evidence="Text too short or fragmented for reliable probing",
recommendation="ok"
)
test_sentence = sentences[0]
words = test_sentence.split()
prefix = " ".join(words[:6])
expected_completion = " ".join(words[6:min(25, len(words))])
completions = []
rouge_scores = []
for probe_idx in range(n_probes):
try:
response = client.messages.create(
model=model,
max_tokens=80,
temperature=0.7, # Some temperature to avoid trivial exact matches from context
messages=[{
"role": "user",
"content": f"Complete this sentence naturally: {prefix}"
}]
)
completion = response.content[0].text.strip()
completions.append(completion)
score = compute_rouge_l(expected_completion, completion)
rouge_scores.append(score)
except Exception:
continue
if not rouge_scores:
return MemorizationCheckResult(
text_hash=text_hash,
is_likely_memorized=False,
memorization_score=0.0,
evidence="All probes failed; memorization status unknown",
recommendation="investigate" # Fail closed: a failed probe is not evidence of safety
)
avg_score = sum(rouge_scores) / len(rouge_scores)
is_memorized = avg_score > suspicious_threshold
return MemorizationCheckResult(
text_hash=text_hash,
is_likely_memorized=is_memorized,
memorization_score=avg_score,
evidence=f"ROUGE-L: {avg_score:.3f} over {len(rouge_scores)} probes (threshold: {suspicious_threshold})",
recommendation="investigate" if is_memorized else "ok",
probe_completions=completions
)
def detect_pii_in_text(text: str) -> dict[str, list[str]]:
"""
Detect potentially identifiable information in generated text.
Use as a post-generation filter before adding to training set.
"""
patterns = {
"email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
"phone_us": r'\b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
"ssn": r'\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b',
"credit_card": r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
"ip_address": r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
"date_of_birth": r'\b(?:born|dob|date of birth|birthday)[:\s]+\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b',
"medical_record": r'\b(?:mrn|patient id|medical record)[:\s]*\d{5,12}\b',
"passport": r'\b[A-Z]{1,2}\d{6,9}\b',
}
found: dict[str, list[str]] = {}
for pii_type, pattern in patterns.items():
# finditer + group(0) returns the full match even when a pattern contains
# a capturing group (re.findall would return only the group's contents)
flat_matches = [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]
if flat_matches:
found[pii_type] = flat_matches
return found
def scrub_pii(text: str) -> str:
"""Replace detected PII with placeholder tokens."""
replacements = [
(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]'),
(r'\b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b', '[PHONE]'),
(r'\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b', '[SSN]'),
(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CREDIT_CARD]'),
]
for pattern, replacement in replacements:
text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
return text
def scan_dataset_for_pii(
examples: list[dict],
text_fields: tuple[str, ...] = ("instruction", "output", "input"),
reject_on_pii: bool = True,
) -> tuple[list[dict], list[dict]]:
"""
Scan a dataset for PII and split into clean and flagged examples.
Returns:
(clean_examples, flagged_examples)
"""
clean = []
flagged = []
for example in examples:
combined_text = " ".join(
str(example.get(field, "")) for field in text_fields
)
pii_found = detect_pii_in_text(combined_text)
if pii_found:
example["_pii_detected"] = pii_found
flagged.append(example)
else:
clean.append(example)
print(f"PII scan: {len(clean)} clean, {len(flagged)} flagged")
if flagged:
pii_types = Counter()
for ex in flagged:
for pii_type in ex.get("_pii_detected", {}).keys():
pii_types[pii_type] += 1
print(f"PII types found: {dict(pii_types)}")
return clean, flagged
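The 16-digit credit card pattern above will also match order numbers, tracking IDs, and other long digit strings. One common way to cut those false positives is the Luhn checksum, which valid payment card numbers satisfy. The sketch below is an optional add-on, not part of the pipeline above; `luhn_valid` is a hypothetical helper name, and the sample numbers are a widely used test number and made-up strings.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # Payment card numbers are at least 13 digits
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # Standard test card number
print(luhn_valid("4111 1111 1111 1112"))  # Last digit altered
print(luhn_valid("1234-5678-9012-3456"))  # Arbitrary digit string
```

Matches that fail the checksum are unlikely to be real card numbers, but they may still be worth scrubbing if a downstream model should never emit card-shaped strings.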
Copyright Risk Assessment and Terms-of-Service Compliance
The most immediate legal risk in synthetic data generation is using a teacher model whose Terms of Service prohibit the intended use. This is not a minor technicality - it is a contractual liability that can invalidate your entire synthetic data pipeline retroactively.
@dataclass
class TeacherModelRisk:
"""Risk assessment for using a specific model as a teacher."""
model_name: str
provider: str
training_data_prohibition: bool # Does ToS prohibit training competing models?
prohibition_text: str # Exact ToS language (keep on file)
review_date: str # Date ToS was reviewed
alternative_if_prohibited: str
# IMPORTANT: These are examples only. Always review current ToS yourself.
# ToS change frequently; the text below is illustrative, not authoritative.
TEACHER_MODEL_RISK_REGISTRY = {
"gpt-4o": TeacherModelRisk(
model_name="gpt-4o",
provider="OpenAI",
training_data_prohibition=True,
prohibition_text="You may not use output from the Services to develop models that compete with OpenAI (ToS Section 2(c)(iii) as of 2024).",
review_date="2024-01-01",
alternative_if_prohibited="Use Anthropic Claude (check current ToS), open-weight models like Llama 3, or Mistral."
),
"claude-opus-4-6": TeacherModelRisk(
model_name="claude-opus-4-6",
provider="Anthropic",
training_data_prohibition=False, # Check current Anthropic ToS
prohibition_text="Review current Anthropic usage policies at anthropic.com/legal/aup",
review_date="2024-01-01",
alternative_if_prohibited="N/A - check current ToS before relying on this"
),
"llama-3-70b": TeacherModelRisk(
model_name="llama-3-70b",
provider="Meta",
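training_data_prohibition=False, # Conditional: depends on license version and on what you train
prohibition_text="Meta Llama 3 Community License permits fine-tuning and derivative works, but restricts using outputs to improve other large language models; Llama 3.1 and later relax this with naming and attribution conditions. Verify the license version that applies to your checkpoint.",
review_date="2024-04-01",
alternative_if_prohibited="N/A - confirm the applicable license version before relying on this"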
),
}
@dataclass
class CopyrightRiskProfile:
"""Copyright risk profile for a specific domain."""
domain: str
risk_level: str # low, medium, high, critical
specific_risks: list[str]
mitigations: list[str]
legal_counsel_required: bool
DOMAIN_COPYRIGHT_PROFILES = {
"general_knowledge": CopyrightRiskProfile(
domain="general_knowledge",
risk_level="low",
specific_risks=[
"Minor chance of reproducing widely published facts",
"Low risk of verbatim reproduction from popular books",
],
mitigations=["Standard memorization probe", "Avoid generating from specific book passages"],
legal_counsel_required=False
),
"code": CopyrightRiskProfile(
domain="code",
risk_level="medium",
specific_risks=[
"Code is copyrightable expression; the GitHub Copilot litigation tests how that applies to model outputs",
"Function-level reproduction of MIT/GPL/BSD code may infringe",
"License incompatibility: GPL code cannot appear in a dataset licensed as MIT",
],
mitigations=[
"Fuzzy match generated code against known copyrighted repositories",
"Add license compatibility filter (detect GPL markers in generated code)",
"Prefer generating from first principles rather than referencing specific libraries",
],
legal_counsel_required=False
),
"medical": CopyrightRiskProfile(
domain="medical",
risk_level="high",
specific_risks=[
"Clinical guidelines (UpToDate, AHA, USPSTF) are copyrighted",
"Drug information databases (Micromedex, Lexicomp) have strict licensing",
"Reproduction of FDA label text verbatim requires attribution",
"Clinical trial results in journals carry journal copyright",
],
mitigations=[
"Expert review of all generated clinical content before use",
"Attribution required for any specific clinical guidance generated",
"Source only from documents your organization has licensed for derivative use",
"Memorization detection with lower threshold (0.15 not 0.25) for clinical text",
],
legal_counsel_required=True
),
"legal": CopyrightRiskProfile(
domain="legal",
risk_level="high",
specific_risks=[
"Legal treatises (Westlaw, LexisNexis) are heavily copyrighted",
"Case law summaries may infringe editorial compilation copyright",
"Statutory text is public domain; annotations and headnotes are not",
],
mitigations=[
"Distinguish statutory text (public domain) from commentary (copyrighted)",
"Attribute all legal analysis to source if generated from specific documents",
"Avoid generating from commercial legal database content",
],
legal_counsel_required=True
),
"news": CopyrightRiskProfile(
domain="news",
risk_level="critical",
specific_risks=[
"NYT v. OpenAI (still in litigation) shows news is a high-risk domain for verbatim reproduction",
"Hot news misappropriation, where it survives as a narrow state-law doctrine, may protect time-sensitive reporting separately from copyright",
"Even paraphrased news content may infringe under some interpretations",
],
mitigations=[
"Avoid news content in synthetic training datasets unless licensed",
"If required: obtain explicit license from publishers",
"Time-delay: avoid generating from content published in past 12 months",
],
legal_counsel_required=True
),
}
def generate_risk_report(
teacher_model: str,
domain: str,
intended_use: str,
geography: list[str],
) -> dict:
"""Generate a risk assessment report for a synthetic data project."""
model_risk = TEACHER_MODEL_RISK_REGISTRY.get(teacher_model)
domain_risk = DOMAIN_COPYRIGHT_PROFILES.get(domain, DOMAIN_COPYRIGHT_PROFILES["general_knowledge"])
amplifiers = []
blockers = []
# Check model ToS
if model_risk and model_risk.training_data_prohibition:
blockers.append(
f"BLOCKER: {teacher_model} ToS prohibits using outputs to train competing models. "
f"Switch to: {model_risk.alternative_if_prohibited}"
)
# Commercial use increases exposure
if "commercial" in intended_use.lower() or "product" in intended_use.lower():
amplifiers.append("Commercial use increases copyright exposure vs. research use")
# EU AI Act
if any(g in geography for g in ["EU", "Europe", "Germany", "France"]):
amplifiers.append("EU AI Act requires training data provenance documentation")
return {
"teacher_model": teacher_model,
"domain": domain,
"copyright_risk_level": domain_risk.risk_level,
"model_tos_compliant": not (model_risk and model_risk.training_data_prohibition) if model_risk else "unknown",
"blockers": blockers,
"amplifiers": amplifiers,
"domain_risks": domain_risk.specific_risks,
"mitigations": domain_risk.mitigations,
"legal_counsel_required": domain_risk.legal_counsel_required or bool(blockers),
"recommendation": "STOP - address blockers first" if blockers else (
"Proceed with mitigations" if domain_risk.risk_level in ("low", "medium")
else "Consult legal counsel before proceeding"
)
}
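Because Terms of Service change, the `review_date` stored in the registry above is only useful if something checks it. The following sketch flags stale reviews. It takes a plain mapping of model name to ISO date so that it runs on its own; `stale_reviews` and the 90-day window are illustrative assumptions, not a legal standard.

```python
from datetime import date


def stale_reviews(
    review_dates: dict[str, str],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Return model names whose ToS review is older than max_age_days."""
    stale = []
    for model, iso_date in review_dates.items():
        age_days = (today - date.fromisoformat(iso_date)).days
        if age_days > max_age_days:
            stale.append(model)
    return stale


# Hypothetical review log
reviews = {"model-a": "2024-01-01", "model-b": "2024-03-15"}
print(stale_reviews(reviews, today=date(2024, 4, 1)))
```

Running a check like this in CI before each generation job turns the review date from passive documentation into an enforced gate.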
Differential Privacy for Sensitive Source Documents
When your source documents contain sensitive personal data - medical records, financial transactions, private communications - simply prompting an LLM to generate synthetic versions does not provide formal privacy protection. Differential Privacy (DP) gives mathematical guarantees.
import numpy as np
class DifferentiallyPrivateGenerator:
"""
Generate synthetic training data with differential privacy guarantees.
Key insight: instead of generating from individual documents (which can
reproduce individual-level details), extract aggregate statistics with
DP noise and generate from those statistics. This breaks the direct
link between individuals and outputs.
When to use:
- Source documents contain PHI (protected health information)
- Source data contains financial records with individual identifiers
- Source data contains private communications
When NOT to use:
- Source documents are public (no individual privacy at stake)
- Documents are organizational policies (no personal data)
"""
def __init__(
self,
epsilon: float = 1.0,
delta: float = 1e-5,
):
"""
epsilon: Privacy budget.
0.1 = Very strong privacy (academic research standard)
1.0 = Standard production (good balance of privacy and utility)
10.0 = Weak privacy (use only for low-sensitivity data)
delta: Failure probability. Should be << 1/n_individuals.
For 10,000 individuals: delta should be << 1e-4.
"""
self.epsilon = epsilon
self.delta = delta
def add_laplace_noise(self, value: float, sensitivity: float) -> float:
"""Add Laplace noise calibrated to sensitivity/epsilon."""
scale = sensitivity / self.epsilon
return value + np.random.laplace(0, scale)
def private_count(self, true_count: int, sensitivity: float = 1.0) -> int:
"""Return a differentially private count."""
noisy = self.add_laplace_noise(float(true_count), sensitivity)
return max(0, int(round(noisy)))
def private_histogram(
self,
counts: dict[str, int],
sensitivity: float = 1.0,
) -> dict[str, int]:
"""
Apply DP noise to an entire histogram (dict of counts).
Any category with a noisy count ≤ 0 is suppressed.
"""
return {
category: noisy_count
for category, count in counts.items()
if (noisy_count := self.private_count(count, sensitivity)) > 0
}
def generate_from_private_statistics(
self,
documents: list[str],
task_description: str,
n_examples: int = 100,
) -> list[dict]:
"""
Generate synthetic examples from private aggregate statistics.
Pipeline:
1. Extract aggregate theme/topic frequencies from documents
2. Apply DP noise to the frequencies (private_histogram)
3. Sample topic distribution from noisy frequencies
4. Generate synthetic examples FROM TOPIC DESCRIPTIONS (not from documents)
This breaks the individual-document → output link.
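The DP guarantee: for a fixed, data-independent set of topic categories,
including or excluding any single document changes the probability of any
noisy histogram by a factor of at most exp(epsilon). Topic labels discovered
from the documents themselves (as in _extract_topics) fall outside this
guarantee, because the mere presence of a rare label can reveal a document.
The Laplace mechanism used here gives pure epsilon-DP and does not use delta.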
"""
# Step 1: Extract aggregate topic frequencies (no DP yet)
topic_counts = self._extract_topics(documents)
print(f" Extracted {len(topic_counts)} topics from {len(documents)} documents")
# Step 2: Apply DP noise to frequencies
private_topics = self.private_histogram(topic_counts)
print(f" After DP noise (epsilon={self.epsilon}): {len(private_topics)} topics retained")
# Step 3: Sample topics weighted by private counts
if not private_topics: return [] # Noise suppressed every topic; nothing left to sample from
total = sum(private_topics.values())
topic_probs = {t: c / total for t, c in private_topics.items()}
# Step 4: Generate synthetic examples from topic descriptions
examples = []
topics = list(topic_probs.keys())
probs = list(topic_probs.values())
for _ in range(n_examples):
sampled_topic = np.random.choice(topics, p=probs)
example = self._generate_example_for_topic(sampled_topic, task_description)
if example:
examples.append(example)
return examples
def _extract_topics(self, documents: list[str]) -> dict[str, int]:
"""
Extract topic frequencies from documents.
In production: use LDA, BERTopic, or embedding clustering.
This simplified version uses keyword frequency.
"""
from collections import Counter
STOP = {"the", "a", "an", "is", "are", "was", "were", "be", "been",
"have", "has", "had", "this", "that", "of", "in", "on",
"for", "to", "and", "or", "with", "by", "from", "at"}
topic_counts: Counter = Counter()
for doc in documents:
words = doc.lower().split()
content_words = [w for w in words if w not in STOP and len(w) > 4]
if content_words:
most_common_word = Counter(content_words).most_common(1)[0][0]
topic_counts[most_common_word] += 1
return dict(topic_counts)
def _generate_example_for_topic(
self,
topic: str,
task_description: str,
) -> Optional[dict]:
"""
Generate a synthetic example for a topic WITHOUT using original documents.
This is the privacy-preserving step: examples come from LLM knowledge
about the topic, not from the individual source documents.
"""
prompt = f"""Generate a synthetic training example for the following task: {task_description}
The example should be related to the topic: "{topic}"
Requirements:
- Clearly fictional - not based on any specific real person or real event
- Realistic and domain-appropriate
- Suitable as training data for a fine-tuned model
Output JSON:
{{"instruction": "...", "input": "...", "output": "..."}}"""
try:
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=400,
messages=[{"role": "user", "content": prompt}]
)
text = response.content[0].text.strip()
json_match = re.search(r'\{.*\}', text, re.DOTALL)
if json_match:
return json.loads(json_match.group())
except Exception:
pass
return None
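The `add_laplace_noise` method above draws noise with scale `sensitivity / epsilon`. To build intuition for the epsilon values in the docstring, the sketch below computes that scale and the resulting standard deviation (for a Laplace distribution, the square root of 2 times the scale), plus the per-query budget under basic sequential composition. The helper names are hypothetical and the printed values are properties of the Laplace mechanism, not measurements from any dataset.

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale parameter b of the Laplace noise; also its mean absolute size."""
    return sensitivity / epsilon


def laplace_std(sensitivity: float, epsilon: float) -> float:
    """Standard deviation of Laplace noise with scale b is sqrt(2) * b."""
    return math.sqrt(2) * laplace_scale(sensitivity, epsilon)


def per_query_epsilon(total_epsilon: float, n_queries: int) -> float:
    """Basic sequential composition: k queries at eps/k each spend eps in total."""
    return total_epsilon / n_queries


for eps in (0.1, 1.0, 10.0):
    print(
        f"epsilon={eps}: scale={laplace_scale(1.0, eps):.2f}, "
        f"std={laplace_std(1.0, eps):.2f}"
    )
```

With sensitivity 1, a topic count of 5 is mostly noise at epsilon 0.1 (typical error around 10) and nearly exact at epsilon 10, which is why small source corpora lose most rare topics under strong privacy settings.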
Bias Auditing for Synthetic Datasets
Synthetic data inherits biases from the teacher model and can amplify them. A model that overrepresents male pronouns for doctors or underrepresents non-Western naming conventions will produce a fine-tuned student model with the same biases, potentially stronger ones.
import math
@dataclass
class BiasAuditResult:
"""Result of a single bias dimension audit."""
dimension: str
observed_distribution: dict[str, float]
target_distribution: dict[str, float]
divergence_score: float # 0.0 = no bias, higher = more biased
flagged: bool
flagging_threshold: float
notes: str
def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
"""
Jensen-Shannon divergence between distributions p and q.
Symmetric, bounded [0, ln 2] (about 0.693) because the natural log is used;
use log base 2 for a [0, 1] range. Higher = more different.
"""
all_keys = set(p.keys()) | set(q.keys())
p_vals = [p.get(k, 1e-10) for k in all_keys]
q_vals = [q.get(k, 1e-10) for k in all_keys]
p_sum = sum(p_vals)
q_sum = sum(q_vals)
p_norm = [v / p_sum for v in p_vals]
q_norm = [v / q_sum for v in q_vals]
m = [(a + b) / 2 for a, b in zip(p_norm, q_norm)]
def kl(dist_a, dist_b):
return sum(
ai * math.log(ai / bi + 1e-10)
for ai, bi in zip(dist_a, dist_b)
if ai > 1e-10
)
return (kl(p_norm, m) + kl(q_norm, m)) / 2
def audit_gender_representation(
examples: list[dict],
text_fields: tuple = ("instruction", "output"),
flag_threshold: float = 0.15, # JSD threshold for flagging
) -> BiasAuditResult:
"""
Check gender pronoun distribution in generated text.
Target: roughly balanced across gendered and neutral pronouns.
"""
he_count = 0
she_count = 0
they_count = 0
for ex in examples:
# Tokenize on letters so that "he," and "she." are counted like "he" and "she"
text = re.findall(r"[a-z]+", " ".join(str(ex.get(f, "")) for f in text_fields).lower())
he_count += text.count("he") + text.count("him") + text.count("his")
she_count += text.count("she") + text.count("her") + text.count("hers")
they_count += text.count("they") + text.count("them") + text.count("their")
total = he_count + she_count + they_count
if total == 0:
return BiasAuditResult(
dimension="gender_pronouns",
observed_distribution={"he": 0.0, "she": 0.0, "they": 0.0},
target_distribution={"he": 0.40, "she": 0.40, "they": 0.20},
divergence_score=0.0,
flagged=False,
flagging_threshold=flag_threshold,
notes="No gendered pronouns found - may indicate avoidance or topic domain"
)
observed = {
"he": he_count / total,
"she": she_count / total,
"they": they_count / total,
}
target = {"he": 0.40, "she": 0.40, "they": 0.20}
divergence = js_divergence(observed, target)
flagged = divergence > flag_threshold or observed["he"] > 0.70 or observed["she"] > 0.70
return BiasAuditResult(
dimension="gender_pronouns",
observed_distribution={k: round(v, 4) for k, v in observed.items()},
target_distribution=target,
divergence_score=round(divergence, 4),
flagged=flagged,
flagging_threshold=flag_threshold,
notes=(
f"Strong gender skew detected: he={observed['he']:.1%}, she={observed['she']:.1%}"
if flagged else "Distribution within acceptable range"
)
)
def audit_profession_gender_bias(
examples: list[dict],
professions: list[str] = None,
flag_threshold: float = 0.75,
) -> dict[str, dict]:
"""
Check if specific professions are systematically paired with one gender.
Example: "doctor" should not be predominantly "he"; "nurse" should not be predominantly "she".
"""
if professions is None:
professions = ["doctor", "nurse", "engineer", "teacher", "scientist",
"lawyer", "pilot", "ceo", "manager", "programmer"]
results = {}
for profession in professions:
he_count = 0
she_count = 0
n_mentions = 0
for ex in examples:
text = " ".join(str(ex.get(f, "")) for f in ("instruction", "output")).lower()
if profession not in text:
continue
n_mentions += 1
words = re.findall(r"[a-z]+", text) # Letter-only tokens, so trailing punctuation does not hide pronouns
he_count += words.count("he") + words.count("him") + words.count("his")
she_count += words.count("she") + words.count("her") + words.count("hers")
if n_mentions < 5: # Not enough data to audit
continue
total = he_count + she_count
if total == 0:
continue
he_ratio = he_count / total
she_ratio = she_count / total
results[profession] = {
"n_mentions": n_mentions,
"he_ratio": round(he_ratio, 3),
"she_ratio": round(she_ratio, 3),
"flagged": he_ratio > flag_threshold or she_ratio > flag_threshold,
"dominant_gender": "he" if he_ratio > she_ratio else "she",
}
return results
def audit_geographic_representation(
examples: list[dict],
text_fields: tuple = ("instruction", "output"),
) -> BiasAuditResult:
"""
Check whether examples are disproportionately Western/US-centric.
"""
us_signals = ["dollar", "usd", "$", "united states", "usa", "american",
"federal", "irs", "social security", "zip code", "state of"]
eu_signals = ["euro", "eur", "european", "gdpr", "vat", "uk", "germany",
"france", "parliament", "postal code"]
asia_signals = ["china", "india", "japan", "korea", "yuan", "yen", "rupee",
"asia", "chinese", "japanese", "korean", "hindi"]
other_signals = ["africa", "latin america", "brazil", "middle east",
"developing", "global south", "spanish", "portuguese"]
counts = {"us": 0, "eu": 0, "asia": 0, "other": 0}
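# Match signals as whole words or phrases: plain substring tests would fire
# "irs" on "first", "usa" on "usage", and "uk" on "ukraine".
for ex in examples:
text = " ".join(str(ex.get(f, "")) for f in text_fields).lower()
if any(re.search(rf"(?<![a-z]){re.escape(s)}(?![a-z])", text) for s in us_signals):
counts["us"] += 1
if any(re.search(rf"(?<![a-z]){re.escape(s)}(?![a-z])", text) for s in eu_signals):
counts["eu"] += 1
if any(re.search(rf"(?<![a-z]){re.escape(s)}(?![a-z])", text) for s in asia_signals):
counts["asia"] += 1
if any(re.search(rf"(?<![a-z]){re.escape(s)}(?![a-z])", text) for s in other_signals):
counts["other"] += 1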
total = sum(counts.values())
if total == 0:
observed = {k: 0.0 for k in counts}
else:
observed = {k: v / total for k, v in counts.items()}
target = {"us": 0.35, "eu": 0.25, "asia": 0.25, "other": 0.15}
divergence = js_divergence(observed, target) if total > 0 else 0.0
return BiasAuditResult(
dimension="geographic_representation",
observed_distribution={k: round(v, 4) for k, v in observed.items()},
target_distribution=target,
divergence_score=round(divergence, 4),
flagged=divergence > 0.20 or observed.get("us", 0) > 0.70,
flagging_threshold=0.20,
notes=(
f"Dataset is US-centric ({observed.get('us', 0):.1%} US signals)"
if observed.get("us", 0) > 0.70
else "Geographic distribution diverges from target" if divergence > 0.20
else "Geographic distribution within range"
)
)
def run_full_bias_audit(
examples: list[dict],
print_report: bool = True,
) -> dict[str, BiasAuditResult]:
"""Run the full bias audit suite on a synthetic dataset."""
results = {
"gender_pronouns": audit_gender_representation(examples),
"geographic": audit_geographic_representation(examples),
"profession_gender": None, # Separate structure
}
profession_results = audit_profession_gender_bias(examples)
if print_report:
print(f"\nBias Audit Report ({len(examples)} examples)")
print("=" * 50)
for dimension, result in results.items():
if result is None:
continue
status = "FLAGGED" if result.flagged else "OK"
print(f"\n{dimension}: {status} (JSD={result.divergence_score:.4f})")
print(f" Observed: {result.observed_distribution}")
print(f" Target: {result.target_distribution}")
if result.notes:
print(f" Notes: {result.notes}")
if profession_results:
print("\nProfession-Gender Bias:")
for profession, data in profession_results.items():
status = "FLAGGED" if data["flagged"] else "OK"
print(f" {profession:12s}: {status} | he={data['he_ratio']:.2f} she={data['she_ratio']:.2f} (n={data['n_mentions']})")
return results
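The audits above compare observed and target distributions with Jensen-Shannon divergence. To make threshold values easier to interpret, the sketch below computes the same quantity in bits (log base 2), where 0 means identical distributions and 1 means no overlap at all. `jsd_bits` is a hypothetical standalone helper, and the pronoun shares are made-up inputs rather than audit results.

```python
import math


def jsd_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits; inputs must each sum to 1."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = (pk + qk) / 2  # Mixture distribution
        # Terms with zero probability contribute nothing to KL divergence
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


target = {"he": 0.40, "she": 0.40, "they": 0.20}
skewed = {"he": 0.70, "she": 0.20, "they": 0.10}  # Hypothetical observed shares

print(f"identical: {jsd_bits(target, target):.4f}")
print(f"skewed:    {jsd_bits(skewed, target):.4f}")
print(f"disjoint:  {jsd_bits({'a': 1.0}, {'b': 1.0}):.4f}")
```

Values computed with the natural log, as in `js_divergence`, are smaller by a factor of ln 2, so thresholds tuned on one scale do not transfer directly to the other.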
The Ethical Decision Framework
Before generating any synthetic dataset, four questions determine whether the project is ethically defensible:
Model Card Documentation
Every model trained on synthetic data must disclose this in its model card. This is not optional documentation - it is the legal and ethical record of the decisions made.
```python
from typing import Optional


def generate_data_card(
    teacher_models: list[str],
    generation_methods: list[str],
    n_examples: int,
    domain: str,
    filtering_description: str,
    pii_scan_performed: bool,
    memorization_check_performed: bool,
    dp_applied: bool,
    dp_epsilon: Optional[float],
    bias_audit_results: dict,
    tos_review_date: str,
    intended_use: str,
    known_limitations: list[str],
) -> str:
    """
    Generate the data provenance section of a model card.
    This output should be included in any published model card.
    """
    # Compare against None so a very small epsilon is still reported.
    # delta is hard-coded here; pass it in if your mechanism uses another value.
    dp_text = (
        f"Applied (epsilon={dp_epsilon}, delta=1e-5)"
        if dp_applied and dp_epsilon is not None
        else "Not applied"
    )

    # Each audit result is expected to expose `flagged` and `divergence_score`;
    # results without a `flagged` attribute are skipped.
    bias_summary = []
    for dimension, result in bias_audit_results.items():
        if hasattr(result, 'flagged'):
            status = "FLAGGED - review needed" if result.flagged else "Within acceptable range"
            bias_summary.append(f" - {dimension}: {status} (JSD={result.divergence_score:.4f})")

    limitations_text = "\n".join(f" - {lim}" for lim in known_limitations)
    bias_text = "\n".join(bias_summary) if bias_summary else " - Audit not performed"

    # The ToS and copyright lines below are static text: only publish this card
    # after those reviews have actually been completed and documented.
    card = f"""## Training Data Provenance

### Synthetic Data Usage
This model was fine-tuned using a synthetically generated dataset.
**Teacher model(s)**: {', '.join(teacher_models)}
**Generation method(s)**: {', '.join(generation_methods)}
**Domain**: {domain}
**Dataset size**: {n_examples:,} examples
**Intended use**: {intended_use}

### Quality and Filtering
{filtering_description}

### Copyright and Terms of Service
- ToS review date: {tos_review_date}
- ToS compliance verified: Yes (documentation on file)
- Known copyright concerns: Memorization detection applied; high-risk domains excluded

### Privacy Assessment
- PII detection scan: {'Performed - no PII found in final dataset' if pii_scan_performed else 'Not performed'}
- Memorization probe: {'Applied (ROUGE-L counterfactual probing)' if memorization_check_performed else 'Not applied'}
- Differential privacy: {dp_text}

### Bias Audit Results
{bias_text}

### Known Limitations
{limitations_text}

### Transparency Statement
Users of this model should be aware that it was fine-tuned on AI-generated synthetic data.
The synthetic data reflects the biases and limitations of the teacher model(s) listed above.
Independent evaluation on domain-specific benchmarks is recommended before deployment in
high-stakes settings.
"""
    return card
```
Legal Compliance by Jurisdiction
| Regulation | Geography | Key Requirements for Synthetic Data | Effective |
|---|---|---|---|
| EU AI Act | EU/EEA | Training data documentation, provenance records, transparency for high-risk systems | Aug 2025 (GPAI obligations); Aug 2026 (most other provisions) |
| GDPR | EU/EEA | Synthetic data derived from EU resident data needs legal basis; re-identification risk must be assessed | Since 2018 |
| CCPA/CPRA | California, USA | Disclosure requirements for data derived from CA residents; opt-out for data used in AI training | Since 2020/2023 |
| HIPAA | USA (healthcare) | PHI cannot be used without authorization even to generate "synthetic" data; de-identification must meet the Safe Harbor or Expert Determination standard | Ongoing |
| PIPL | China | Personal information cannot be used for AI training without separate consent; strict data localization | Since 2021 |
| Copyright (US) | USA | Fair use doctrine may protect some research use; commercial use less protected; NYT v. OpenAI (still pending) is a reason for added caution | Ongoing |
```python
COMPLIANCE_REQUIREMENTS = {
    "eu_ai_act": {
        "applies_when": ["EU deployment", "EU development", "EU users"],
        "requirements": [
            "Document all training data sources with provenance chain",
            "For high-risk systems: register in EU AI Act database before deployment",
            "General-purpose AI: summarize training data policy, publish copyright policy",
            "Implement data governance processes (logging, auditing)",
        ],
        "penalty": (
            "Up to EUR 15M or 3% of global annual turnover for most violations "
            "(up to EUR 35M or 7% for prohibited practices)"
        ),
    },
    "hipaa": {
        "applies_when": ["Healthcare domain", "US covered entities or BAs", "PHI in source documents"],
        "requirements": [
            "PHI cannot be used for synthetic data generation without explicit authorization",
            "De-identification claims must meet the Safe Harbor or Expert Determination standard",
            "Even de-identified data that enables re-identification is still PHI",
            "Business Associate Agreement required if using cloud AI provider to process PHI",
        ],
        "penalty": (
            "$100–$50,000 per violation under the statutory tiers "
            "(amounts are adjusted for inflation); criminal penalties possible"
        ),
    },
    "copyright": {
        "applies_when": ["All commercial use", "Any teacher model output"],
        "requirements": [
            "Document ToS of every teacher model at time of use (save a copy)",
            "Run memorization detection for any content-heavy generation",
            "Maintain complete audit trail of generation process",
            "Apply fair use analysis for any copyrighted seed documents",
        ],
        "penalty": "Statutory damages up to $150,000 per work for willful infringement",
    },
}


def check_compliance_requirements(
    domain: str,
    geography: list[str],
    uses_personal_data_as_source: bool,
    is_commercial: bool,
) -> list[dict]:
    """Identify applicable compliance requirements."""
    applicable = []

    # EU AI Act: triggered by any EU footprint in the geography list.
    if any(g in geography for g in ["EU", "Europe"]):
        applicable.append({
            "regulation": "EU AI Act",
            "details": COMPLIANCE_REQUIREMENTS["eu_ai_act"],
            "urgency": "Required",
        })

    # Conservative trigger: any healthcare domain or personal-data source prompts a
    # HIPAA review, even though HIPAA itself only covers PHI handled by US covered
    # entities and their business associates.
    if domain in ["medical", "healthcare", "clinical"] or uses_personal_data_as_source:
        applicable.append({
            "regulation": "HIPAA",
            "details": COMPLIANCE_REQUIREMENTS["hipaa"],
            "urgency": "Required",
        })

    # Copyright obligations are attached to every commercial project.
    if is_commercial:
        applicable.append({
            "regulation": "Copyright",
            "details": COMPLIANCE_REQUIREMENTS["copyright"],
            "urgency": "Required",
        })

    return applicable
```
Common Pitfalls
:::danger Using Teacher Model Outputs Without Checking ToS The most common and most serious mistake in synthetic data pipelines is assuming that "AI-generated means we own it and can do anything with it." OpenAI's terms have prohibited using output to develop models that compete with OpenAI (Section 2(c)(iii) of the March 2023 Terms of Use; later revisions keep the restriction under different numbering). This is a contractual prohibition with real legal teeth. Before building any distillation pipeline, read the current ToS of every model you plan to use as a teacher. Save a copy of the ToS with the date of review: if there is ever a dispute, evidence that you acted in good faith based on the policies in force at the time is your strongest defense. If your teacher model prohibits this use, switch to one whose terms permit it, and verify rather than assume. Other commercial APIs (including Anthropic's) and some open-weight licenses (such as the Llama 3 Community License) carry their own restrictions on training other models, while Apache-2.0 releases such as several Mistral models are generally permissive. :::
:::danger "De-identified" Does Not Mean Privacy-Safe Removing names, dates, and explicit identifiers from source documents before generating synthetic data does not make the result privacy-safe. LLMs learn and can reproduce statistical associations between quasi-identifiers - sets of attributes that, combined, uniquely identify a real individual. A medical document about a 67-year-old male with Type 2 diabetes in a rural Montana county may not contain a name, but combined with other attributes it could uniquely identify a real patient. True privacy protection requires formal mechanisms: differential privacy with proper epsilon bounds, k-anonymity verified by an expert, or simply not using sensitive source documents at all. Informal de-identification is security theater. :::
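One way to make quasi-identifier risk concrete for tabular source data is to measure the smallest group of records that share the same quasi-identifier values (the "k" in k-anonymity). The sketch below is a minimal illustration under stated assumptions, not a replacement for expert review: the function name `smallest_equivalence_class` and the field names are hypothetical.

```python
from collections import Counter


def smallest_equivalence_class(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    groups = Counter(
        tuple(record.get(field) for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical de-identified records: no names, but one row is still unique.
records = [
    {"age_band": "60-69", "sex": "M", "county": "Rural MT", "diagnosis": "T2 diabetes"},
    {"age_band": "60-69", "sex": "M", "county": "Urban MT", "diagnosis": "Hypertension"},
    {"age_band": "60-69", "sex": "M", "county": "Urban MT", "diagnosis": "Asthma"},
]
k = smallest_equivalence_class(records, ["age_band", "sex", "county"])
print(k)  # 1: the rural record is unique on its quasi-identifiers
```

A result of k = 1 means at least one person is singled out by the combination of attributes alone, which is exactly the situation the warning above describes.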
:::warning Model Collapse from Iterative Synthetic Training Shumailov et al. (Nature, 2024) demonstrated that models trained on AI-generated data, which then generate data for the next model generation, exhibit "model collapse" - progressive loss of output diversity, concentration around modal responses, and degraded performance on tail distributions. If you are building iterative synthetic data pipelines (generate → train → use as teacher → generate again), monitor output diversity metrics at each iteration. Inject fresh human-written data at regular intervals to counteract distributional narrowing. Never train a model exclusively on its own outputs or the outputs of a model in the same generation lineage. :::
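The diversity monitoring mentioned above can start with something as simple as a distinct-n ratio tracked per iteration. This is a minimal sketch, assuming whitespace tokenization; the function names and the 10% relative-drop threshold are illustrative assumptions, not values from Shumailov et al.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique across all texts (1.0 = no repetition)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def diversity_dropped(previous: float, current: float, max_relative_drop: float = 0.10) -> bool:
    """Flag an iteration whose distinct-n fell by more than the allowed fraction."""
    if previous <= 0:
        return False
    return (previous - current) / previous > max_relative_drop


# Toy outputs from two successive generations of a pipeline.
generation_1 = ["the cat sat on the mat", "a dog ran in the park"]
generation_2 = ["the cat sat on the mat", "the cat sat on the rug"]

d1 = distinct_n(generation_1)
d2 = distinct_n(generation_2)
print(diversity_dropped(d1, d2))  # True: bigram diversity fell from 1.0 to 0.6
```

Distinct-n only captures surface repetition; it is a cheap early-warning signal rather than a full measure of distributional narrowing.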
:::tip Document Everything for Legal Defense The best protection for a synthetic data project is documentation - comprehensive, timestamped, and stored somewhere that survives team turnover. What teacher model was used, what version, what the ToS said on the date of use, what memorization detection was run, what bias audits were performed, what quality filters were applied, who reviewed and approved the process. If a legal challenge ever arrives - copyright infringement claim, GDPR investigation, EU AI Act audit - documentation showing a good-faith, systematic effort to identify and mitigate violations is your strongest defense. Make documentation a first-class output of every synthetic data pipeline, not an afterthought that happens if someone remembers. :::
Interview Q&A
Q: What is the memorization problem in LLMs, and why does it make synthetic data legally risky?
Memorization in LLMs refers to the model's ability to reproduce verbatim or near-verbatim sequences from its training data when prompted. Carlini et al. (2023) found that GPT-J 6B memorizes at least 1% of its training data, and that memorization grows with model size, data duplication, and prompt length; recovering such sequences through targeted prompting is called "training data extraction." When you use a frontier model as a teacher to generate synthetic training data, some of the generated output can be a near-verbatim reproduction of the model's training data. If that training data included copyrighted text (books, articles, code repositories), your synthetic dataset is contaminated with potentially infringing content. This is not theoretical: the NYT v. OpenAI lawsuit includes exhibits showing GPT-4 reproducing NYT articles nearly verbatim under certain prompting conditions. The legal exposure exists even for a training dataset you never distribute publicly. Detection: use counterfactual probing. Give the model a sentence prefix from the generated text and check whether its completions consistently match the original continuation, which signals memorization. Filter examples with high ROUGE-L overlap between probe completions and the original generated text.
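The probing idea in this answer can be sketched in a few lines. The code below is an illustration under stated assumptions, not a definitive detector: ROUGE-L is computed over whitespace tokens with a hand-written LCS, and `complete` is a hypothetical callable wrapping whatever teacher model API you use (sampled with non-zero temperature so repeated probes can differ). The score threshold for filtering is left to the project.

```python
from typing import Callable


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def memorization_score(
    text: str,
    complete: Callable[[str], str],  # hypothetical wrapper around the teacher model
    prefix_fraction: float = 0.5,
    n_probes: int = 5,
) -> float:
    """Mean ROUGE-L between probe completions and the true continuation of `text`."""
    tokens = text.split()
    split = max(1, int(len(tokens) * prefix_fraction))
    prefix, continuation = " ".join(tokens[:split]), " ".join(tokens[split:])
    if not continuation:
        return 0.0
    scores = [rouge_l_f1(complete(prefix), continuation) for _ in range(n_probes)]
    return sum(scores) / len(scores)


# Stub "models" standing in for a real API call.
example = "It was the best of times it was the worst of times"
memorizing_model = lambda prefix: "it was the worst of times"
novel_model = lambda prefix: "and everyone went home early"

high = memorization_score(example, memorizing_model)
low = memorization_score(example, novel_model)
print(high, low)  # 1.0 0.0
```

A consistently high score across probes suggests the continuation is being recalled rather than composed, which is the signal used to filter an example.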
Q: What is differential privacy and when should you apply it to synthetic data generation?
Differential Privacy (DP) is a mathematical privacy framework that provides a formal guarantee: including or excluding any single individual's record changes the probability of any output by at most a multiplicative factor of exp(epsilon). In practice for synthetic data: instead of generating examples directly from sensitive source documents (which can reproduce individual-level details), extract aggregate statistics from those documents (topic frequencies, attribute distributions), apply calibrated Laplace noise to those statistics (the DP mechanism), and generate synthetic examples from the noisy aggregate statistics - not from individual records. This breaks the direct link between individuals and outputs. You should apply DP when your source documents contain PHI (protected health information), financial records with individual identifiers, or private communications. The tradeoff is utility: lower epsilon means more noise means more privacy but lower utility. For healthcare applications, epsilon = 1.0 is a common production balance; epsilon = 0.1 is research-grade strong privacy. Do not apply DP when your source documents are already public - there is no individual privacy to protect, and the utility cost is unwarranted.
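The aggregate-then-noise step described above can be sketched with the Laplace mechanism on a topic histogram. This is a minimal illustration, assuming each individual contributes to exactly one bin (so the L1 sensitivity is 1) and that the list of bins is fixed in advance rather than derived from the data; the function names and topic labels are hypothetical. For production use, a vetted DP library such as OpenDP is a safer choice than hand-rolled noise.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_histogram(counts: dict[str, int], epsilon: float, rng: random.Random) -> dict[str, float]:
    """Release a noisy, normalized histogram under epsilon-DP.

    Assumes each individual contributes to exactly one bin, so adding or
    removing one record changes the count vector by at most 1 (L1 sensitivity 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon  # sensitivity / epsilon
    # Clamping and normalizing are post-processing and do not weaken the guarantee.
    noisy = {k: max(0.0, v + laplace_noise(scale, rng)) for k, v in counts.items()}
    total = sum(noisy.values())
    if total == 0:
        return {k: 1 / len(noisy) for k in noisy}
    return {k: v / total for k, v in noisy.items()}


# Hypothetical topic counts extracted from sensitive source documents.
topic_counts = {"cardiology": 420, "oncology": 310, "endocrinology": 270}
rng = random.Random(0)
noisy_distribution = dp_histogram(topic_counts, epsilon=1.0, rng=rng)

# Generation prompts are then driven by the noisy distribution, never by individual records.
sampled_topics = rng.choices(
    list(noisy_distribution), weights=list(noisy_distribution.values()), k=5
)
print(sampled_topics)
```

Lowering `epsilon` widens the noise (scale = 1/epsilon), which is the privacy/utility tradeoff the answer describes.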
Q: How would you conduct a bias audit on a synthetic dataset, and what would you do if you found bias?
A bias audit has three parts: identify dimensions to audit, measure the distribution for each, and compare against a target distribution. Key dimensions: (1) Gender pronoun balance - measure he/she/they pronoun ratios across all generated text; flag if any gender exceeds 65% of pronoun usage. (2) Profession-gender correlation - check if "doctor" examples predominantly use male pronouns, "nurse" female, etc.; flag individual professions where one gender exceeds 75%. (3) Geographic representation - check for cultural signals that indicate US/European bias; flag if US-centric signals appear in more than 60% of examples. For each dimension, compute Jensen-Shannon divergence (JSD) between the observed and target distribution - JSD > 0.15 generally warrants attention. When you find bias, two responses: resampling (generate more examples from underrepresented groups by explicitly specifying demographic attributes in the generation prompt), or prompt-level correction (add diversity instructions to the system prompt, e.g., "Use a roughly equal mix of he, she, and they pronouns"). Document audit results and remediation steps in the model card.
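As a rough sketch of the pronoun-balance check, the code below counts pronoun groups, computes Jensen-Shannon divergence against a target distribution, and applies the 65% share and 0.15 JSD thresholds mentioned above. The `BiasAuditResult` class, the pronoun lists, and the uniform target are assumptions made for illustration; note that "they" is also counted when used as a plural, so real audits need more careful parsing.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

PRONOUN_GROUPS = {
    "he": {"he", "him", "his", "himself"},
    "she": {"she", "her", "hers", "herself"},
    "they": {"they", "them", "their", "theirs", "themselves", "themself"},
}


@dataclass
class BiasAuditResult:
    dimension: str
    observed: dict[str, float]
    divergence_score: float
    flagged: bool


def jensen_shannon_divergence(p: list[float], q: list[float]) -> float:
    """JSD with base-2 logs, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: list[float], b: list[float]) -> float:
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def audit_pronouns(
    texts: list[str],
    target: dict[str, float],
    jsd_threshold: float = 0.15,
    max_share: float = 0.65,
) -> BiasAuditResult:
    """Compare observed pronoun-group shares against a target distribution."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            for group, forms in PRONOUN_GROUPS.items():
                if token in forms:
                    counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pronouns found; nothing to audit")
    groups = list(PRONOUN_GROUPS)
    observed = {g: counts[g] / total for g in groups}
    jsd = jensen_shannon_divergence(
        [observed[g] for g in groups], [target[g] for g in groups]
    )
    flagged = jsd > jsd_threshold or max(observed.values()) > max_share
    return BiasAuditResult("gender_pronouns", observed, jsd, flagged)


texts = [
    "He said his results were final.",
    "He asked him to review it.",
    "She approved their plan.",
]
uniform_target = {"he": 1 / 3, "she": 1 / 3, "they": 1 / 3}
result = audit_pronouns(texts, uniform_target)
print(result.flagged)  # True: "he" forms are 4 of 6 pronouns, above the 65% share limit
```

In this toy case the JSD stays below 0.15 while the share limit is exceeded, which shows why the two checks are worth running together.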
Q: What are the key legal differences between scraping real data, collecting user data, and generating synthetic data for AI training? Which approach carries the least legal risk?
Scraping real data: legal exposure through the scraped site's ToS (breach of contract), copyright law if you copy protected expression verbatim, and data protection law (GDPR, CCPA) if the scraped content includes personal information. hiQ v. LinkedIn and related cases suggest that scraping publicly accessible data may not violate the CFAA, but copyright still applies to the expression scraped. Collecting user data: requires consent mechanisms, is subject to GDPR/CCPA depending on geography and user location, creates ongoing data subject rights obligations (right to erasure, portability), and often requires regulatory disclosures. Synthetic data: initially seemed like the lowest-risk option, but it now has its own distinct risk profile: teacher model ToS may prohibit using outputs to train competing models (OpenAI explicitly does this), copyright protection for AI-generated content is unresolved in most jurisdictions, memorization means synthetic data may reproduce copyrighted or private training material, and the EU AI Act requires training data documentation regardless of whether the data is synthetic. Bottom line: no approach is risk-free. Synthetic data reduces privacy risk substantially (no real users) but does not eliminate copyright or ToS risk. For most commercial applications, synthetic data from teacher models whose license or ToS you have verified permits this use (for example, permissively licensed open-weight models), generated on non-sensitive domains, carries the lowest combined legal risk.
Q: How would you design an enterprise-grade synthetic data governance process?
Five pillars. First, pre-generation review: before any generation run, a documented checklist covering teacher model ToS verification, copyright risk assessment for the domain, privacy risk assessment for source documents, and ethics review sign-off for high-risk domains. All sign-offs are stored with timestamps and the reviewed ToS text. Second, automated compliance gates: every generated dataset runs through PII detection, memorization probing (for high-risk domains), and bias auditing before it is approved for training use. These are hard gates - data that fails does not enter the training pipeline. Third, provenance tracking: every training example has metadata: teacher model, model version, generation date, generation run ID, prompt template version, filter pipeline version. The metadata is stored alongside the example and queryable. If a legal issue arises, you can trace any example back to its complete production history. Fourth, model card disclosure: every model fine-tuned on synthetic data must include a data provenance section in its model card covering teacher models, generation methods, dataset size, quality processes, privacy measures taken, bias audit results, and known limitations. This is a release requirement, not optional. Fifth, ToS and legal review cadence: quarterly review of all teacher model ToS for changes (ToS change without notice is common), a standing process to retire datasets that become legally problematic due to ToS changes or new litigation outcomes, and annual review of applicable regulations. The governance process should be lightweight enough to not block engineering velocity but rigorous enough to provide a defensible audit trail.
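The provenance-tracking and hard-gate pillars can be sketched as two small data structures. This is an illustration only: the class names, gate names, and the example values (model name, run ID, versions) are hypothetical placeholders, and the fields simply mirror the metadata listed in the answer above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Per-example metadata stored alongside each synthetic training example."""
    teacher_model: str
    model_version: str
    generation_date: str  # ISO 8601 date
    generation_run_id: str
    prompt_template_version: str
    filter_pipeline_version: str


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str = ""


def approve_for_training(gate_results: list[GateResult], required: set[str]) -> bool:
    """Hard gate: every required check must be present and must not have failed."""
    status: dict[str, bool] = {}
    for gate in gate_results:
        # A single failure for a gate name overrides any passing result.
        status[gate.name] = status.get(gate.name, True) and gate.passed
    return all(status.get(name, False) for name in required)


record = ProvenanceRecord(
    teacher_model="example-teacher",  # placeholder, not a real model name
    model_version="v1",
    generation_date="2025-01-15",
    generation_run_id="run-0001",
    prompt_template_version="templates-3",
    filter_pipeline_version="filters-7",
)
required_gates = {"pii_scan", "memorization_probe", "bias_audit"}
results = [
    GateResult("pii_scan", True),
    GateResult("memorization_probe", True),
    GateResult("bias_audit", False, "gender_pronouns flagged"),
]
print(asdict(record)["generation_run_id"])            # run-0001
print(approve_for_training(results, required_gates))  # False: bias audit failed
```

Treating a missing gate the same as a failed one keeps the default safe: a dataset cannot reach training simply because a check was never run.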
