EDA Best Practices - The Foundation That Makes or Breaks Your Project
Reading time: ~30 min | Interview relevance: High | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer
The Real Interview Moment
You are fifteen minutes into reviewing a take-home submission. The candidate built an XGBoost model with a respectable 0.86 AUC. But something feels wrong. You scroll back to their EDA section: df.describe(), a single histogram, and a correlation heatmap. No narrative. No insights. No evidence that the candidate understood the data before modeling it.
You check the dataset. The target variable has a 95/5 class split. The candidate never mentioned this. They used accuracy as their metric (96%, barely above the 95% majority-class baseline). Two features are 99% correlated - effectively duplicates - and both are in the model. A date column was converted to a Unix timestamp and treated as an ordinary numeric feature. A critical feature has 12% missing values, and the affected rows were silently dropped.
Every one of these issues would have been caught by a competent EDA. Instead, the candidate rushed to modeling, built on a foundation of sand, and produced results that cannot be trusted. This is the most common failure pattern in take-home projects.
EDA is not a box to check. It is the phase that determines whether everything that follows is sound or meaningless.
What You Will Master
- A systematic framework for exploratory data analysis
- What to explore: distributions, missing data, correlations, outliers, target leakage
- Visualization best practices that communicate rather than decorate
- Statistical tests that add rigor to your observations
- How to document EDA findings that evaluators actually value
- Common EDA mistakes and how to avoid them
- How much time to allocate to EDA
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "My EDA is df.describe() and a few histograms" | Read all parts carefully |
| Intermediate | "I do thorough EDA but my insights are not actionable" | Focus on Step S in Part 2 (decision linking) and Part 5 (documentation) |
| Advanced | "I do good EDA but want to optimize time spent" | Jump to Part 6 (mistakes) and Part 7 (time allocation) |
Part 1 - The Systematic EDA Framework
The DICES Framework
Use this framework to ensure your EDA is comprehensive without being unfocused:
| Step | Focus | Key Questions |
|---|---|---|
| D - Data Profile | Shape, types, quality | How big? What types? What is missing? |
| I - Investigate Target | Target distribution and relationships | Balanced? How does each feature relate to target? |
| C - Correlations and Relationships | Feature interactions | Multicollinearity? Non-linear relationships? |
| E - Edge Cases and Outliers | Anomalies and boundaries | Extreme values? Impossible values? Rare categories? |
| S - Summarize and Strategize | Documented findings | What matters for modeling? What needs handling? |
Part 2 - What to Explore (With Code)
Step D: Data Profiling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

df = pd.read_csv("data/dataset.csv")

# Comprehensive data profile
def full_data_profile(df: pd.DataFrame) -> None:
    """Generate a complete data profile report."""
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Rows: {df.shape[0]:,}")
    print(f"Columns: {df.shape[1]}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(f"Duplicate rows: {df.duplicated().sum():,} ({df.duplicated().mean():.1%})")

    print("\n" + "=" * 60)
    print("COLUMN TYPES")
    print("=" * 60)
    for dtype, count in df.dtypes.value_counts().items():
        cols = df.select_dtypes(include=[dtype]).columns.tolist()
        print(f" {dtype}: {count} columns - {cols[:5]}{'...' if len(cols) > 5 else ''}")

    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if len(missing) == 0:
        print(" No missing values.")
    else:
        for col, count in missing.items():
            pct = count / len(df) * 100
            print(f" {col}: {count:,} ({pct:.1f}%)")

    print("\n" + "=" * 60)
    print("NUMERIC SUMMARY")
    print("=" * 60)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"\n {col}:")
        print(f" Range: [{df[col].min():.2f}, {df[col].max():.2f}]")
        print(f" Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
        print(f" Std: {df[col].std():.2f}")
        print(f" Skewness: {df[col].skew():.2f}")

    print("\n" + "=" * 60)
    print("CATEGORICAL SUMMARY")
    print("=" * 60)
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        unique = df[col].nunique()
        top = df[col].value_counts().head(3)
        print(f"\n {col}: {unique} unique values")
        for val, count in top.items():
            print(f" '{val}': {count:,} ({count/len(df):.1%})")

full_data_profile(df)
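The profile above reports dtypes but does not question them. As a minimal sketch (the helper name `flag_suspicious_dtypes`, the `max_unique` cutoff of 10, the 90% parse threshold, and the toy column names are all assumptions for illustration, not part of the original workflow), the following flags numeric columns with very few distinct values, which may be encoded categoricals, and text columns that parse cleanly as dates:

```python
import warnings

import numpy as np
import pandas as pd


def flag_suspicious_dtypes(df: pd.DataFrame, max_unique: int = 10) -> dict[str, str]:
    """Return {column: reason} for columns whose dtype deserves a second look."""
    flags = {}
    for col in df.columns:
        series = df[col].dropna()
        if series.empty or pd.api.types.is_bool_dtype(series):
            continue
        if pd.api.types.is_numeric_dtype(series):
            # Few distinct numeric values often means category codes, not a quantity
            if series.nunique() <= max_unique:
                flags[col] = "numeric with low cardinality: possibly an encoded categorical"
        elif not pd.api.types.is_datetime64_any_dtype(series):
            # Try to parse text as dates; unparseable entries become NaT
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                parsed = pd.to_datetime(series.astype(str), errors="coerce")
            if parsed.notna().mean() >= 0.9:
                flags[col] = "text that parses as dates: convert with pd.to_datetime"
    return flags


# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "plan_code": [1, 2, 3, 4] * 5,
    "monthly_charges": np.linspace(20, 100, 20),
    "signup_date": pd.date_range("2023-01-01", periods=20).strftime("%Y-%m-%d"),
    "city": ["Austin", "Boston"] * 10,
})
flags = flag_suspicious_dtypes(toy)
for col, reason in flags.items():
    print(f"{col}: {reason}")
```

The thresholds are judgment calls; a low-cardinality numeric column can still be a genuine count, so treat each flag as a prompt to look, not as a verdict.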
"Good EDA starts with data profiling - understanding the shape, types, quality, and basic statistics of your data before any visualization. This 2-minute step prevents 2 hours of debugging later. The most important things to check immediately: missing value patterns (are they random or systematic?), data types (is that numeric column actually an encoded categorical?), and the target distribution (is the problem balanced?)."
Step I: Investigate Target
TARGET = "churn" # Replace with actual target

def investigate_target(df: pd.DataFrame, target: str) -> None:
    """Analyze the target variable and its relationship with features."""
    print("=" * 60)
    print("TARGET ANALYSIS")
    print("=" * 60)
    # Distribution
    dist = df[target].value_counts()
    dist_pct = df[target].value_counts(normalize=True)
    for val in dist.index:
        print(f" Class {val}: {dist[val]:,} ({dist_pct[val]:.1%})")
    imbalance = dist.min() / dist.max()
    print(f"\n Imbalance ratio: {imbalance:.3f}")
    if imbalance < 0.3:
        print(" WARNING: Significant class imbalance detected.")
        print(" Recommendation: Use precision-recall metrics, not accuracy.")
        print(" Consider: stratified sampling, class weights, or resampling.")

    # Feature-target relationships (numeric); assumes a binary target
    print("\n" + "=" * 60)
    print("FEATURE-TARGET RELATIONSHIPS (Numeric)")
    print("=" * 60)
    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    for col in numeric_cols:
        group_means = df.groupby(target)[col].mean()
        group_stds = df.groupby(target)[col].std()
        diff = abs(group_means.iloc[0] - group_means.iloc[1]) / df[col].std()
        print(f"\n {col}:")
        for val in group_means.index:
            print(f" Class {val}: mean={group_means[val]:.3f}, std={group_stds[val]:.3f}")
        print(f" Standardized difference: {diff:.3f} {'(potential signal)' if diff > 0.3 else ''}")

investigate_target(df, TARGET)
def plot_feature_vs_target(df: pd.DataFrame, target: str, n_cols: int = 3) -> None:
    """Visualize each feature's relationship with the target."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    n_features = len(numeric_cols)
    n_rows = (n_features + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows))
    # plt.subplots returns a single Axes, a 1-D array, or a 2-D array
    # depending on the grid shape; normalize to a flat array in every case
    axes = np.atleast_1d(axes).flatten()
    for i, col in enumerate(numeric_cols):
        for label in sorted(df[target].unique()):
            subset = df[df[target] == label][col].dropna()
            axes[i].hist(subset, bins=30, alpha=0.5, label=f"Class {label}", density=True)
        axes[i].set_title(f"{col} by {target}")
        axes[i].legend()
    # Hide unused panels in the last row
    for ax in axes[n_features:]:
        ax.set_visible(False)
    plt.suptitle("Feature Distributions by Target Class", fontsize=14, y=1.02)
    plt.tight_layout()
    plt.savefig("results/figures/feature_vs_target.png", dpi=150, bbox_inches="tight")
    plt.show()

plot_feature_vs_target(df, TARGET)
Step C: Correlations and Relationships
def analyze_correlations(df: pd.DataFrame, target: str, threshold: float = 0.8) -> None:
    """Analyze feature correlations and flag multicollinearity."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    # Correlation matrix
    corr = df[numeric_cols].corr()

    # Heatmap
    fig, ax = plt.subplots(figsize=(12, 10))
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(
        corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
        center=0, ax=ax, vmin=-1, vmax=1, square=True,
    )
    ax.set_title("Feature Correlation Matrix")
    plt.tight_layout()
    plt.savefig("results/figures/correlation_matrix.png", dpi=150, bbox_inches="tight")
    plt.show()

    # Flag highly correlated pairs
    print("\nHighly correlated feature pairs (|r| > {:.1f}):".format(threshold))
    high_corr_pairs = []
    for i in range(len(corr.columns)):
        for j in range(i + 1, len(corr.columns)):
            if abs(corr.iloc[i, j]) > threshold:
                pair = (corr.columns[i], corr.columns[j], corr.iloc[i, j])
                high_corr_pairs.append(pair)
                print(f" {pair[0]} <-> {pair[1]}: r = {pair[2]:.3f}")
    if not high_corr_pairs:
        print(" None found.")
    else:
        print("\n ACTION: Consider removing one feature from each pair to reduce multicollinearity.")

    # Target correlation ranking
    if target in numeric_cols:
        target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
        print(f"\nFeature correlation with {target} (absolute):")
        for feat, val in target_corr.items():
            bar = "█" * int(val * 20)
            print(f" {feat:30s} {val:.3f} {bar}")

analyze_correlations(df, TARGET)
Correlation heatmaps are the most overused and under-interpreted visualization in take-home projects. Simply generating a heatmap is not EDA. You must: (1) interpret what the correlations mean for your modeling strategy, (2) flag highly correlated pairs and decide which to keep, (3) note that correlation is only linear - features with low correlation may still have non-linear relationships with the target. If you include a heatmap, add a markdown cell below it explaining what you learned from it.
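To make point (3) concrete, here is a minimal sketch on synthetic data. It compares Pearson correlation with a binned correlation ratio (eta squared), which measures how much of the target's variance is explained by quantile bins of the feature. The helper name `correlation_ratio`, the choice of 10 bins, and the quadratic toy relationship are assumptions for illustration; mutual information or a quick tree model are common alternatives.

```python
import numpy as np
import pandas as pd


def correlation_ratio(x: pd.Series, y: pd.Series, bins: int = 10) -> float:
    """Share of y's variance explained by quantile bins of x (eta squared)."""
    binned = pd.qcut(x, q=bins, duplicates="drop")
    # Replace each y value with the mean of y inside its bin of x
    group_means = y.groupby(binned, observed=True).transform("mean")
    total_var = ((y - y.mean()) ** 2).sum()
    between_var = ((group_means - y.mean()) ** 2).sum()
    return float(between_var / total_var) if total_var > 0 else 0.0


# Synthetic example: y depends strongly on x, but not linearly
rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(-3, 3, size=2000))
y = x ** 2 + rng.normal(0, 0.5, size=2000)

pearson_r = x.corr(y)          # close to zero: no linear trend
eta_sq = correlation_ratio(x, y)  # high: strong non-linear dependence

print(f"Pearson r:        {pearson_r:.3f}")
print(f"Correlation ratio: {eta_sq:.3f}")
```

A feature like this would look useless in a heatmap yet be highly predictive for a tree-based model, which is why the heatmap alone should not drive feature selection.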
Step E: Edge Cases and Outliers
from scipy import stats

def detect_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Detect outliers using multiple methods and report findings.

    Methods:
    - IQR: Traditional statistical outlier detection
    - Z-score: Standard deviation based
    - Domain: Impossible or implausible values
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    outlier_report = []
    for col in numeric_cols:
        series = df[col].dropna()
        # IQR method
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        iqr_outliers = ((series < lower_bound) | (series > upper_bound)).sum()
        # Z-score method
        z_scores = np.abs(stats.zscore(series))
        z_outliers = (z_scores > 3).sum()
        outlier_report.append({
            "feature": col,
            "iqr_outliers": iqr_outliers,
            "iqr_outlier_pct": iqr_outliers / len(series) * 100,
            "z_outliers": z_outliers,
            "z_outlier_pct": z_outliers / len(series) * 100,
            "min": series.min(),
            "max": series.max(),
            "lower_bound": lower_bound,
            "upper_bound": upper_bound,
        })
    report = pd.DataFrame(outlier_report).sort_values("iqr_outlier_pct", ascending=False)

    print("OUTLIER REPORT")
    print("=" * 80)
    for _, row in report.iterrows():
        if row["iqr_outlier_pct"] > 1:
            print(f"\n {row['feature']}:")
            print(f" IQR outliers: {row['iqr_outliers']:.0f} ({row['iqr_outlier_pct']:.1f}%)")
            print(f" Z-score outliers: {row['z_outliers']:.0f} ({row['z_outlier_pct']:.1f}%)")
            print(f" Range: [{row['min']:.2f}, {row['max']:.2f}]")
            print(f" Expected range: [{row['lower_bound']:.2f}, {row['upper_bound']:.2f}]")
    return report

outlier_report = detect_outliers(df)
def check_impossible_values(df: pd.DataFrame) -> None:
    """Check for domain-impossible values.

    These checks should be customized based on the dataset.
    Examples are provided as templates.
    """
    checks = {
        # "age": {"min": 0, "max": 120, "desc": "Age cannot be negative or > 120"},
        # "price": {"min": 0, "max": None, "desc": "Price cannot be negative"},
        # "percentage": {"min": 0, "max": 100, "desc": "Percentage must be 0-100"},
    }
    print("DOMAIN VALIDATION CHECKS")
    print("=" * 60)
    for col, rules in checks.items():
        if col not in df.columns:
            continue
        violations = pd.Series(False, index=df.index)
        if rules.get("min") is not None:
            violations |= df[col] < rules["min"]
        if rules.get("max") is not None:
            violations |= df[col] > rules["max"]
        n_violations = violations.sum()
        if n_violations > 0:
            print(f" WARNING - {col}: {n_violations} violations ({rules['desc']})")
            print(f" Violating values: {df.loc[violations, col].describe()}")
        else:
            print(f" OK - {col}: All values within expected range")

check_impossible_values(df)
Step S: Summarize and Strategize
This is the step most candidates skip - and the one evaluators value most.
"""
## EDA Summary and Strategy
### Key Findings
| Finding | Impact | Action |
|---------|--------|--------|
| Target is 95/5 imbalanced | Accuracy is misleading | Use PR-AUC and F1 as primary metrics |
| Feature X and Y are 0.97 correlated | Multicollinearity | Drop one (keeping X, higher target correlation) |
| 12% missing in feature Z | Cannot ignore | Impute with median; add binary indicator for missingness |
| Feature W has 3% negative values | Domain-impossible | These appear to be refunds; create binary flag |
| Date column has no clear trend | May not be useful as-is | Extract month, day-of-week, recency features instead |
### Feature Engineering Plan
Based on the EDA findings above, I will:
1. [Engineering decision 1 - linked to finding]
2. [Engineering decision 2 - linked to finding]
3. [Engineering decision 3 - linked to finding]
### Modeling Implications
- Primary metric: PR-AUC (due to class imbalance)
- Use stratified cross-validation
- Consider class weights in model training
- Feature Z missingness pattern may be informative - include as a feature
"""
"The EDA summary is the single most valuable section of your entire notebook. It demonstrates that you can extract actionable insights from data and translate them into modeling decisions. An evaluator who reads a clear EDA summary immediately trusts your subsequent modeling choices because they can see the reasoning chain. Without this summary, every modeling decision looks arbitrary."
Part 3 - Visualization Best Practices
The Three Rules of Take-Home Visualizations
- Every plot must have a title that states the finding, not just the variable name. Wrong: "Feature Distribution". Right: "Monthly Charges Are Higher for Churning Customers (Median $78 vs $62)".
- Every plot must have labeled axes with units. Wrong: unlabeled axes. Right: "Monthly Charges ($)" and "Number of Customers".
- Every plot must be followed by a markdown cell interpreting it. A plot without interpretation is a plot without purpose.
# BAD visualization
plt.hist(df["monthly_charges"])
plt.show()

# GOOD visualization
fig, ax = plt.subplots(figsize=(10, 5))
for label, color in [(0, "#2563eb"), (1, "#dc2626")]:
    subset = df[df["churn"] == label]["monthly_charges"]
    ax.hist(subset, bins=30, alpha=0.6, color=color,
            label=f"{'Churned' if label == 1 else 'Retained'} (median=${subset.median():.0f})",
            density=True, edgecolor="white")
ax.set_xlabel("Monthly Charges ($)", fontsize=12)
ax.set_ylabel("Density", fontsize=12)
ax.set_title("Churning Customers Pay More on Average\n(Median $78 vs $62, p < 0.001)",
             fontsize=13, fontweight="bold")
ax.legend(fontsize=11)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.savefig("results/figures/charges_by_churn.png", dpi=150, bbox_inches="tight")
plt.show()
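The title in the example above hardcodes the medians, which can drift out of sync with the data if the dataset or filters change. One way to avoid that, sketched here with a hypothetical helper `finding_title` and toy data (a binary 0/1 target is assumed), is to build the finding-style title from the computed values:

```python
import pandas as pd


def finding_title(df: pd.DataFrame, feature: str, target: str, unit: str = "$") -> str:
    """Build a title that states the finding, using medians computed from the data."""
    medians = df.groupby(target)[feature].median()
    churned, retained = medians.loc[1], medians.loc[0]
    direction = "More" if churned > retained else "Less"
    return (
        f"Churning Customers Pay {direction} "
        f"(Median {unit}{churned:.0f} vs {unit}{retained:.0f})"
    )


# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "churn": [0, 0, 0, 1, 1, 1],
    "monthly_charges": [60, 62, 64, 70, 78, 90],
})
title = finding_title(toy, "monthly_charges", "churn")
print(title)
```

The returned string can be passed straight to `ax.set_title(...)`, so the headline number always matches what the plot shows.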
Visualization Selection Guide
| What You Want to Show | Best Plot Type | When to Use |
|---|---|---|
| Distribution of one numeric variable | Histogram or KDE | Understanding feature spread |
| Distribution by category | Box plot or violin plot | Comparing groups |
| Relationship between two numeric variables | Scatter plot | Checking linearity, clusters |
| Feature vs target (numeric) | Overlapping histograms or box plots | Feature importance exploration |
| Feature vs target (categorical) | Grouped bar chart | Category-target relationships |
| Correlation between many features | Heatmap (with interpretation!) | Multicollinearity detection |
| Missing value patterns | msno.matrix or custom heatmap | Understanding missing data structure |
| Time trends | Line plot | Time series EDA |
| Categorical proportions | Stacked bar chart | Understanding compositions |
| Outlier context | Box plot with jittered points | Outlier investigation |
Do not include plots generated by automated profiling tools (like pandas-profiling, now published as ydata-profiling) as a substitute for EDA. Evaluators have seen this trick many times. Auto-generated reports contain dozens of plots with zero interpretation. They signal that you cannot distinguish important patterns from noise. Use profiling tools for your own understanding, but present curated, interpreted visualizations in your submission.
Creating a Publication-Quality Style
def set_plot_style():
    """Set a consistent, clean plotting style for the entire notebook."""
    plt.rcParams.update({
        "figure.figsize": (10, 6),
        "figure.dpi": 100,
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.labelsize": 12,
        "axes.titlesize": 13,
        "font.size": 11,
        "legend.fontsize": 10,
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
    })
    sns.set_palette("husl")

set_plot_style()
Part 4 - Statistical Tests
When to Use Statistical Tests in EDA
Statistical tests add rigor to your visual observations. They transform "it looks like churning customers have higher charges" into "churning customers have significantly higher monthly charges (t-test, p < 0.001, Cohen's d = 0.42)."
| Test | When to Use | Python Implementation |
|---|---|---|
| t-test | Compare means of two groups (numeric feature by binary target) | scipy.stats.ttest_ind |
| Mann-Whitney U | Compare distributions when normality is not guaranteed | scipy.stats.mannwhitneyu |
| Chi-squared | Test association between two categorical variables | scipy.stats.chi2_contingency |
| ANOVA / Kruskal-Wallis | Compare means (ANOVA) or rank distributions (Kruskal-Wallis) across 3+ groups | scipy.stats.f_oneway / scipy.stats.kruskal |
| Kolmogorov-Smirnov | Compare two distributions | scipy.stats.ks_2samp |
| Pearson/Spearman correlation | Test linear/monotonic relationship significance | scipy.stats.pearsonr / spearmanr |
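The example sentence above quotes Cohen's d alongside the t-test p-value. SciPy's `ttest_ind` returns the statistic and p-value but not the effect size, so it has to be computed separately. A minimal sketch using the pooled standard deviation (the helper name `cohens_d` and the toy numbers are assumptions for illustration):

```python
import numpy as np


def cohens_d(group_a, group_b) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((b.mean() - a.mean()) / np.sqrt(pooled_var))


# Hypothetical toy groups: retained vs churned charges
retained = [1, 2, 3, 4, 5]
churned = [3, 4, 5, 6, 7]
d = cohens_d(retained, churned)
print(f"Cohen's d = {d:.3f}")
```

As a rough convention, values around 0.2, 0.5, and 0.8 are often read as small, medium, and large effects, though what counts as meaningful depends on the problem.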
from scipy import stats

def statistical_feature_analysis(
    df: pd.DataFrame,
    target: str,
    numeric_cols: list[str] = None,
    categorical_cols: list[str] = None,
) -> pd.DataFrame:
    """Run statistical tests for feature-target relationships.

    For numeric features: Mann-Whitney U test (robust to non-normality)
    For categorical features: Chi-squared test of independence
    """
    results = []
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    if categorical_cols is None:
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns

    # Numeric features (assumes a binary target coded 0/1)
    for col in numeric_cols:
        group_0 = df[df[target] == 0][col].dropna()
        group_1 = df[df[target] == 1][col].dropna()
        stat, p_value = stats.mannwhitneyu(group_0, group_1, alternative="two-sided")
        # Effect size (rank-biserial correlation)
        n0, n1 = len(group_0), len(group_1)
        effect_size = 1 - (2 * stat) / (n0 * n1)
        results.append({
            "feature": col,
            "type": "numeric",
            "test": "Mann-Whitney U",
            "statistic": stat,
            "p_value": p_value,
            "effect_size": abs(effect_size),
            "significant": p_value < 0.05,
        })

    # Categorical features
    for col in categorical_cols:
        contingency = pd.crosstab(df[col], df[target])
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
        # Effect size (Cramer's V)
        n = contingency.sum().sum()
        min_dim = min(contingency.shape) - 1
        cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0
        results.append({
            "feature": col,
            "type": "categorical",
            "test": "Chi-squared",
            "statistic": chi2,
            "p_value": p_value,
            "effect_size": cramers_v,
            "significant": p_value < 0.05,
        })

    results_df = pd.DataFrame(results).sort_values("p_value")
    print("STATISTICAL FEATURE ANALYSIS")
    print("=" * 80)
    for _, row in results_df.iterrows():
        sig = "***" if row["p_value"] < 0.001 else "**" if row["p_value"] < 0.01 else "*" if row["p_value"] < 0.05 else "ns"
        print(f" {row['feature']:30s} p={row['p_value']:.4f} {sig} effect={row['effect_size']:.3f} ({row['test']})")
    return results_df

stat_results = statistical_feature_analysis(df, TARGET)
Do not p-hack your way through EDA. Running 50 statistical tests and reporting only the significant ones is bad science and evaluators will notice. If you run multiple tests, mention the multiple comparisons issue. Also, statistical significance is not the same as practical significance - a p-value of 0.001 on a feature with a tiny effect size (d = 0.02) means the difference is reliably detected but practically meaningless for prediction.
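One common way to address the multiple comparisons issue is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a plain NumPy implementation written for illustration (the function name and the example p-values are made up; the source does not prescribe a specific correction, and Bonferroni or Holm are reasonable alternatives):

```python
import numpy as np


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask of hypotheses rejected under Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # The k-th smallest p-value is compared against alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes its threshold
        cutoff = np.max(np.nonzero(below)[0])
        reject[order[: cutoff + 1]] = True
    return reject


# Hypothetical p-values from ten feature-target tests
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
naive_significant = int(sum(p < 0.05 for p in p_vals))
bh_reject = benjamini_hochberg(p_vals)
bh_significant = int(bh_reject.sum())

print(f"Significant at p < 0.05 (uncorrected): {naive_significant}")
print(f"Significant after Benjamini-Hochberg:  {bh_significant}")
```

Reporting both counts in the notebook shows the evaluator that you ran many tests knowingly and adjusted for it.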
Part 5 - Documenting EDA Findings
The Documentation Pattern
Every EDA finding should be documented with this structure:
### Finding: [Descriptive title]
**Observation:** [What you see in the data]
**Evidence:** [Plot reference, statistical test result, or summary statistic]
**Implication:** [What this means for the modeling approach]
**Action:** [What you will do about it]
Example Documented Findings
### Finding 1: Severe class imbalance in churn target
**Observation:** Only 5.2% of customers in the dataset have churned.
**Evidence:** Target distribution shows 47,400 retained vs. 2,600 churned customers.
This means a model that always predicts "no churn" achieves 94.8% accuracy.
**Implication:** Accuracy is an inappropriate metric for this problem. The model
must be evaluated on its ability to identify the minority class (churners).
**Action:**
- Use precision-recall AUC and F1-score as primary metrics
- Apply stratified cross-validation
- Consider class weights in tree-based models
- Report per-class precision and recall, not just overall metrics
---
### Finding 2: Monthly charges strongly differentiate churners
**Observation:** Customers who churn have significantly higher monthly charges
(median $78 vs $62 for retained customers).
**Evidence:** Mann-Whitney U test p < 0.001, effect size (rank-biserial) = 0.34.
See Figure 2 for the distribution comparison.
**Implication:** Monthly charges is likely an important predictive feature.
The relationship appears approximately linear, so it should work well in
both linear and tree-based models without transformation.
**Action:** Include monthly_charges as a feature. Also create a derived feature
"charge_per_service" to capture whether high charges are proportional to
services used.
---
### Finding 3: Missing values in total_charges are systematic
**Observation:** 11 records have missing total_charges, and ALL of them have
tenure = 0 (new customers with no billing history).
**Evidence:** Cross-tabulation of missingness with tenure confirms 100%
overlap with tenure = 0.
**Implication:** This is not random missing data - it is structurally missing
because new customers have no charge history. Imputing with the mean would be
inappropriate.
**Action:** Impute with 0 (they have been charged nothing yet) and create a
binary feature "is_new_customer" to capture this distinct segment.
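The action in Finding 3 could be sketched as follows. The column names `tenure` and `total_charges` come from the example finding; the toy rows are invented, and the check that every missing value coincides with `tenure == 0` is included so the zero-imputation is only applied when the structural explanation actually holds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
customers = pd.DataFrame({
    "tenure": [0, 12, 0, 30],
    "total_charges": [np.nan, 840.0, np.nan, 2310.5],
})

# Verify the claim first: is every missing total_charges a tenure-0 customer?
missing_mask = customers["total_charges"].isna()
structurally_missing = bool((customers.loc[missing_mask, "tenure"] == 0).all())

# Keep the segment information as an explicit feature
customers["is_new_customer"] = (customers["tenure"] == 0).astype(int)

# Impute with 0 only if the structural explanation holds
if structurally_missing:
    customers["total_charges"] = customers["total_charges"].fillna(0.0)

print(customers)
```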
Part 6 - Common EDA Mistakes
Mistake 1: Skipping EDA Entirely
The most common mistake. Candidates jump straight to model.fit() because they think model accuracy is all that matters.
Why it fails: Evaluators interpret missing EDA as either inexperience or laziness. More practically, skipping EDA leads to poor feature engineering, wrong metric choices, and undetected data quality issues that corrupt your results.
Fix: Allocate 15-20% of your total time to EDA. Even under severe time pressure, spend at least 30 minutes on data profiling and target analysis.
Mistake 2: EDA Without Narrative
Running df.describe(), showing 20 plots, and moving on. No interpretation, no findings, no connection to modeling decisions.
Why it fails: Evaluators cannot tell if you understood the data or just ran boilerplate code. Plots without narrative are decoration, not analysis.
Fix: After every visualization, add a markdown cell: "This shows that [observation]. This means [implication]. Therefore, I will [action]."
Mistake 3: Exhaustive Unfocused EDA
Analyzing every possible feature combination, creating 40 plots, running every statistical test. The EDA section is longer than the modeling section.
Why it fails: It suggests you cannot prioritize. Evaluators do not want to scroll through 40 plots to find the 3 that matter.
Fix: Focus on task-relevant exploration. Ask: "Does this analysis help me build a better model or make a better decision?" If not, skip it or put it in an appendix.
Mistake 4: Not Checking for Target Leakage
Failing to investigate whether any feature is suspiciously correlated with the target because it was derived from the target.
Why it fails: Target leakage inflates your metrics and produces a model that will fail in production. It is one of the most damaging technical failure modes in take-home projects, because the inflated scores look like success.
Fix: For every feature that correlates strongly with the target (|r| > 0.8), ask: "Would this feature be available at prediction time?" If a feature perfectly separates the classes, it is almost certainly a leakage variable.
def check_for_leakage(df: pd.DataFrame, target: str) -> None:
    """Flag potential target leakage in features."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    correlations = df[numeric_cols].corrwith(df[target]).abs().sort_values(ascending=False)
    print("LEAKAGE CHECK - Features Most Correlated with Target")
    print("=" * 60)
    for feat, corr in correlations.head(10).items():
        flag = " <-- INVESTIGATE" if corr > 0.8 else " <-- SUSPICIOUS" if corr > 0.6 else ""
        print(f" {feat:30s} r = {corr:.4f}{flag}")
    if correlations.max() > 0.9:
        print("\n WARNING: Feature(s) with r > 0.9 detected.")
        print(" Ask: Would this feature be available at prediction time?")
        print(" If it is derived from the target, it MUST be removed.")

check_for_leakage(df, TARGET)
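Correlation only catches leakage that is roughly linear. A feature can separate the classes perfectly while showing a more modest Pearson r. As a complementary check, the sketch below computes a single-feature ROC AUC from ranks (the Mann-Whitney U identity), with no model fitting. The helper name `single_feature_auc` and the synthetic data are assumptions for illustration, and a binary 0/1 target is assumed.

```python
import numpy as np
import pandas as pd


def single_feature_auc(feature: pd.Series, target: pd.Series) -> float:
    """ROC AUC of one raw feature against a binary 0/1 target, computed from ranks."""
    mask = feature.notna()
    x, y = feature[mask], target[mask]
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    ranks = x.rank(method="average")
    # Mann-Whitney U for the positive class, normalized to [0, 1]
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))


# Synthetic example: one leaky feature, one pure-noise feature
rng = np.random.default_rng(42)
target = pd.Series((rng.random(1000) < 0.2).astype(int))
leaky = target * 10 + pd.Series(rng.random(1000))  # derived from the target
noise = pd.Series(rng.normal(size=1000))           # unrelated to the target

auc_leaky = single_feature_auc(leaky, target)
auc_noise = single_feature_auc(noise, target)
print(f"AUC (leaky feature): {auc_leaky:.3f}")
print(f"AUC (noise feature): {auc_noise:.3f}")
```

An AUC very close to 1.0 (or to 0.0, for an inverted relationship) from a single raw feature is a cue to ask where that feature came from before trusting it.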
Mistake 5: Ignoring Missing Data Patterns
Treating all missing data as random and applying blanket imputation.
Why it fails: Missing data is often informative. A customer with no "total_charges" is probably a new customer - that information is lost if you impute with the mean.
Fix: Always investigate why data is missing before deciding how to handle it.
def analyze_missing_patterns(df: pd.DataFrame) -> None:
    """Analyze patterns in missing data."""
    missing = df.isnull()
    missing_cols = missing.sum()[missing.sum() > 0].sort_values(ascending=False)
    if len(missing_cols) == 0:
        print("No missing values detected.")
        return
    print("MISSING DATA PATTERN ANALYSIS")
    print("=" * 60)
    for col in missing_cols.index:
        pct = missing_cols[col] / len(df) * 100
        print(f"\n {col}: {missing_cols[col]:,} missing ({pct:.1f}%)")
        # Check if missingness correlates with other features
        missing_indicator = df[col].isnull().astype(int)
        numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(col, errors="ignore")
        for other_col in numeric_cols[:5]:
            missing_mean = df.loc[missing_indicator == 1, other_col].mean()
            present_mean = df.loc[missing_indicator == 0, other_col].mean()
            if abs(missing_mean - present_mean) > df[other_col].std() * 0.5:
                print(f" Correlated with {other_col}: missing mean={missing_mean:.2f}, present mean={present_mean:.2f}")

analyze_missing_patterns(df)
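The function above compares missingness against other features. A quick complementary check is whether missingness relates to the target itself, which is what makes a missing-value indicator worth adding as a feature. A minimal sketch, with a hypothetical helper `target_rate_by_missingness` and invented toy data:

```python
import numpy as np
import pandas as pd


def target_rate_by_missingness(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Mean of a binary target for rows where `col` is missing vs present."""
    flag = df[col].isna().map({True: "missing", False: "present"})
    return df.groupby(flag)[target].mean()


# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "feature_z": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "churn": [1, 1, 0, 0, 1, 0],
})
rates = target_rate_by_missingness(toy, "feature_z", "churn")
print(rates)
```

A large gap between the two rates suggests the missingness is informative; on real data, check the group sizes too, since a gap based on a handful of rows means little.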
Mistake 6: Not Visualizing the Right Things
Part 7 - Time Allocation for EDA
How Much Time Should You Spend on EDA?
| Total Project Time | EDA Time | What to Cover |
|---|---|---|
| 4 hours | 30-45 min | Data profile, target analysis, top 3 feature relationships |
| 8 hours | 1-1.5 hours | Full DICES framework, key visualizations, documented findings |
| Weekend | 2-3 hours | Comprehensive EDA, statistical tests, detailed narrative |
| 1 week | 3-5 hours | Publication-quality EDA, all findings documented and linked to modeling |
Some companies (especially in consulting and analytics) weight EDA more heavily than modeling. If the prompt says "What insights can you find in this data?" or "Tell us a story with this data," EDA is the deliverable, not a precursor to modeling. In these cases, allocate 40-50% of your time to EDA.
The Time-Boxed EDA Approach
When time is tight, use this priority order:
- 5 minutes: Data profile (shape, types, missing values)
- 5 minutes: Target distribution and imbalance check
- 10 minutes: Top features vs. target visualization
- 5 minutes: Correlation matrix and leakage check
- 5 minutes: Document the three most important findings
Total: 30 minutes. This is the absolute minimum EDA that will keep your submission credible.
Practice Exercises
Exercise 1: EDA From Scratch
Download any dataset from the UCI ML Repository or Kaggle. Perform a complete EDA using the DICES framework in under 90 minutes. Document at least 5 findings with the observation-evidence-implication-action format.
Exercise 2: Spot the Issues
Given these EDA "findings," identify what is wrong with each:
- "The correlation heatmap shows feature correlations." (What is wrong?)
- "I removed all outliers using the IQR method." (What is wrong?)
- "All features are significant (p < 0.05)." (What is wrong?)
- "Feature X has a 0.95 correlation with the target, which is great for prediction." (What is wrong?)
Answers
- This is a description of the plot, not a finding. A finding would be: "Features X and Y are 0.93 correlated, indicating redundancy - I will remove Y."
- Blanket outlier removal without domain justification. An "outlier" in monthly charges might be a legitimate high-value customer. Always justify outlier handling with domain knowledge.
- With a large enough dataset, virtually everything is statistically significant. Report effect sizes alongside p-values to distinguish practically meaningful differences.
- A 0.95 correlation with the target is a strong warning sign of data leakage. Legitimate predictive features rarely correlate this strongly. Investigate whether this feature was derived from or is a proxy for the target before using it.
Exercise 3: Time-Pressured EDA
Set a timer for 30 minutes. Perform an EDA on a new dataset and document your findings. Practice the time-boxed approach until you can produce credible EDA in 30 minutes consistently.
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "How do you approach EDA?" | DICES framework: Data profile, Investigate target, Correlations, Edge cases, Summarize |
| "What is the most important thing to check first?" | Target distribution - determines metric choice and handling strategy |
| "How do you handle missing data?" | Analyze WHY it is missing first; then impute or flag appropriately |
| "How do you detect target leakage?" | Check features with |r| > 0.8 against the target; ask if they would be available at prediction time |
| "What makes a good EDA visualization?" | Title states the finding, axes labeled, followed by interpretation |
| "How much time should EDA take?" | 15-20% of total project time; minimum 30 minutes even under time pressure |
| "What is the difference between statistical and practical significance?" | Large datasets make everything significant; effect size measures practical importance |
| "How do you handle outliers?" | Detect with IQR/z-score, but handle with domain knowledge - not blanket removal |
Next Steps
With a solid EDA foundation, you are ready to make informed modeling decisions. The next chapter, Model Selection Strategy, covers how to choose the right model for your take-home - starting with baselines, comparing models fairly, tuning hyperparameters efficiently, and knowing when to stop iterating.
