EDA Best Practices - The Foundation That Makes or Breaks Your Project

Reading time: ~30 min | Interview relevance: High | Roles: MLE, Data Scientist, Applied Scientist, AI Engineer

The Real Interview Moment

You are fifteen minutes into reviewing a take-home submission. The candidate built an XGBoost model with a respectable 0.86 AUC. But something feels wrong. You scroll back to their EDA section: df.describe(), a single histogram, and a correlation heatmap. No narrative. No insights. No evidence that the candidate understood the data before modeling it.

You check the dataset. The target variable has a 95/5 class split. The candidate never mentioned this. They used accuracy as their metric and reported 96%, barely above the 95% majority-class baseline. Two features are 99% correlated - effectively duplicates - and both are in the model. A date column was converted to a Unix timestamp and fed to the model as an ordinary numeric feature. A critical feature has 12% missing values, and the affected rows were silently dropped.

Every one of these issues would have been caught by a competent EDA. Instead, the candidate rushed to modeling, built on a foundation of sand, and produced results that cannot be trusted. This is the most common failure pattern in take-home projects.

EDA is not a box to check. It is the phase that determines whether everything that follows is sound or meaningless.

What You Will Master

A systematic framework for exploratory data analysis
What to explore: distributions, missing data, correlations, outliers, target leakage
Visualization best practices that communicate rather than decorate
Statistical tests that add rigor to your observations
How to document EDA findings that evaluators actually value
Common EDA mistakes and how to avoid them
How much time to allocate to EDA

Self-Assessment: Where Are You Now?

Level	Description	Target
Beginner	"My EDA is `df.describe()` and a few histograms"	Read all parts carefully
Intermediate	"I do thorough EDA but my insights are not actionable"	Focus on Step S in Part 2 and on Part 5 (documentation and decision linking)
Advanced	"I do good EDA but want to optimize time spent"	Jump to Part 6 (mistakes) and Part 7 (time allocation)

Part 1 - The Systematic EDA Framework

The DICES Framework

Use this framework to ensure your EDA is comprehensive without being unfocused:

Step	Focus	Key Questions
D - Data Profile	Shape, types, quality	How big? What types? What is missing?
I - Investigate Target	Target distribution and relationships	Balanced? How does each feature relate to target?
C - Correlations and Relationships	Feature interactions	Multicollinearity? Non-linear relationships?
E - Edge Cases and Outliers	Anomalies and boundaries	Extreme values? Impossible values? Rare categories?
S - Summarize and Strategize	Documented findings	What matters for modeling? What needs handling?

EDA Framework: DICES - Data Profile, Investigate Target, Correlations, Edge Cases, Summarize

Part 2 - What to Explore (With Code)

Step D: Data Profiling

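import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# The plotting cells below save figures here; create the folder so savefig does not fail
os.makedirs("results/figures", exist_ok=True)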

df = pd.read_csv("data/dataset.csv")

# Comprehensive data profile
def full_data_profile(df: pd.DataFrame) -> None:
    """Generate a complete data profile report."""

    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Rows: {df.shape[0]:,}")
    print(f"Columns: {df.shape[1]}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(f"Duplicate rows: {df.duplicated().sum():,} ({df.duplicated().mean():.1%})")

    print("\n" + "=" * 60)
    print("COLUMN TYPES")
    print("=" * 60)
    for dtype, count in df.dtypes.value_counts().items():
        cols = df.select_dtypes(include=[dtype]).columns.tolist()
        print(f"  {dtype}: {count} columns - {cols[:5]}{'...' if len(cols) > 5 else ''}")

    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if len(missing) == 0:
        print("  No missing values.")
    else:
        for col, count in missing.items():
            pct = count / len(df) * 100
            print(f"  {col}: {count:,} ({pct:.1f}%)")

    print("\n" + "=" * 60)
    print("NUMERIC SUMMARY")
    print("=" * 60)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"\n  {col}:")
        print(f"    Range: [{df[col].min():.2f}, {df[col].max():.2f}]")
        print(f"    Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
        print(f"    Std: {df[col].std():.2f}")
        print(f"    Skewness: {df[col].skew():.2f}")

    print("\n" + "=" * 60)
    print("CATEGORICAL SUMMARY")
    print("=" * 60)
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        unique = df[col].nunique()
        top = df[col].value_counts().head(3)
        print(f"\n  {col}: {unique} unique values")
        for val, count in top.items():
            print(f"    '{val}': {count:,} ({count/len(df):.1%})")

full_data_profile(df)
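The profile above reports dtypes as pandas sees them, which is not always what the columns mean. The sketch below is one way to flag two common mismatches: numeric columns that look like encoded categories, and text columns that parse as dates. The function name `flag_suspicious_types` and the thresholds (10 distinct values, 90% parse rate) are illustrative assumptions, not fixed rules.

```python
import pandas as pd


def flag_suspicious_types(df: pd.DataFrame, max_levels: int = 10) -> dict[str, list[str]]:
    """Flag columns whose pandas dtype may not match their meaning (heuristic)."""
    flags = {"numeric_but_maybe_categorical": [], "text_but_maybe_datetime": []}

    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        # Few distinct whole-number values often indicate an encoded category.
        if len(values) > 0 and values.nunique() <= max_levels and (values % 1 == 0).all():
            flags["numeric_but_maybe_categorical"].append(col)

    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna()
        if len(values) == 0:
            continue
        # Values that cannot be parsed become NaT instead of raising.
        parsed = pd.to_datetime(values, errors="coerce")
        if parsed.notna().mean() > 0.9:
            flags["text_but_maybe_datetime"].append(col)

    return flags


example = pd.DataFrame({
    "plan_code": [1, 2, 3, 1, 2, 3],
    "monthly_charges": [20.5, 31.25, 48.1, 19.99, 75.3, 62.75],
    "signup": pd.Series(["2024-01-05", "2024-02-11", "2024-03-20",
                         "2024-04-02", "2024-05-17", "2024-06-30"], dtype=object),
    "region": pd.Series(["north", "south", "north", "south", "north", "south"], dtype=object),
})
print(flag_suspicious_types(example))
```

Treat every flagged column as a question to answer in the notebook, not as an automatic conversion: a column of small integers can also be a genuine count.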

60-Second Answer

"Good EDA starts with data profiling - understanding the shape, types, quality, and basic statistics of your data before any visualization. This 2-minute step prevents 2 hours of debugging later. The most important things to check immediately: missing value patterns (are they random or systematic?), data types (is that numeric column actually an encoded categorical?), and the target distribution (is the problem balanced?)."

Step I: Investigate Target

TARGET = "churn"  # Replace with actual target

def investigate_target(df: pd.DataFrame, target: str) -> None:
    """Analyze the target variable and its relationship with features."""

    print("=" * 60)
    print("TARGET ANALYSIS")
    print("=" * 60)

    # Distribution
    dist = df[target].value_counts()
    dist_pct = df[target].value_counts(normalize=True)

    for val in dist.index:
        print(f"  Class {val}: {dist[val]:,} ({dist_pct[val]:.1%})")

    imbalance = dist.min() / dist.max()
    print(f"\n  Imbalance ratio: {imbalance:.3f}")
    if imbalance < 0.3:
        print("  WARNING: Significant class imbalance detected.")
        print("  Recommendation: Use precision-recall metrics, not accuracy.")
        print("  Consider: stratified sampling, class weights, or resampling.")

    # Feature-target relationships (numeric)
    print("\n" + "=" * 60)
    print("FEATURE-TARGET RELATIONSHIPS (Numeric)")
    print("=" * 60)

    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")

    for col in numeric_cols:
        group_means = df.groupby(target)[col].mean()
        group_stds = df.groupby(target)[col].std()
        # Assumes a binary target: compares the means of the first two classes only
        diff = abs(group_means.iloc[0] - group_means.iloc[1]) / df[col].std()
        print(f"\n  {col}:")
        for val in group_means.index:
            print(f"    Class {val}: mean={group_means[val]:.3f}, std={group_stds[val]:.3f}")
        print(f"    Standardized difference: {diff:.3f} {'(potential signal)' if diff > 0.3 else ''}")

investigate_target(df, TARGET)

def plot_feature_vs_target(df: pd.DataFrame, target: str, n_cols: int = 3) -> None:
    """Visualize each feature's relationship with the target."""

    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    n_features = len(numeric_cols)
    n_rows = (n_features + n_cols - 1) // n_cols

    # squeeze=False always returns a 2D array of axes, so flatten() works for any grid shape
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows), squeeze=False)
    axes = axes.flatten()

    for i, col in enumerate(numeric_cols):
        for label in sorted(df[target].unique()):
            subset = df[df[target] == label][col].dropna()
            axes[i].hist(subset, bins=30, alpha=0.5, label=f"Class {label}", density=True)
        axes[i].set_title(f"{col} by {target}")
        axes[i].legend()

    for j in range(i + 1, len(axes)):
        axes[j].set_visible(False)

    plt.suptitle("Feature Distributions by Target Class", fontsize=14, y=1.02)
    plt.tight_layout()
    plt.savefig("results/figures/feature_vs_target.png", dpi=150, bbox_inches="tight")
    plt.show()

plot_feature_vs_target(df, TARGET)
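To make the imbalance warning concrete, it helps to compute the accuracy that a model earns by always predicting the most frequent class. The snippet below is a minimal sketch on a synthetic 95/5 target; `majority_class_baseline` is a hypothetical helper name.

```python
import pandas as pd


def majority_class_baseline(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())


# Synthetic target with a 95/5 split, mirroring the opening scenario
y = pd.Series([0] * 95 + [1] * 5)
baseline = majority_class_baseline(y)
print(f"Majority-class baseline accuracy: {baseline:.1%}")
```

Any reported accuracy should be read against this number. A 96% accuracy on this target is only one point above doing nothing.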

Step C: Correlations and Relationships

def analyze_correlations(df: pd.DataFrame, target: str, threshold: float = 0.8) -> None:
    """Analyze feature correlations and flag multicollinearity."""

    numeric_cols = df.select_dtypes(include=[np.number]).columns

    # Correlation matrix
    corr = df[numeric_cols].corr()

    # Heatmap
    fig, ax = plt.subplots(figsize=(12, 10))
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(
        corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
        center=0, ax=ax, vmin=-1, vmax=1, square=True,
    )
    ax.set_title("Feature Correlation Matrix")
    plt.tight_layout()
    plt.savefig("results/figures/correlation_matrix.png", dpi=150, bbox_inches="tight")
    plt.show()

    # Flag highly correlated pairs
    print("\nHighly correlated feature pairs (|r| > {:.1f}):".format(threshold))
    high_corr_pairs = []
    for i in range(len(corr.columns)):
        for j in range(i + 1, len(corr.columns)):
            if abs(corr.iloc[i, j]) > threshold:
                pair = (corr.columns[i], corr.columns[j], corr.iloc[i, j])
                high_corr_pairs.append(pair)
                print(f"  {pair[0]} <-> {pair[1]}: r = {pair[2]:.3f}")

    if not high_corr_pairs:
        print("  None found.")
    else:
        print("\n  ACTION: Consider removing one feature from each pair to reduce multicollinearity.")

    # Target correlation ranking
    if target in numeric_cols:
        target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
        print(f"\nFeature correlation with {target} (absolute):")
        for feat, val in target_corr.items():
            bar = "█" * int(val * 20)
            print(f"  {feat:30s} {val:.3f} {bar}")

analyze_correlations(df, TARGET)
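Flagging correlated pairs still leaves the decision of which feature to keep. One simple rule, assuming a numeric (or 0/1) target, is to keep the feature in each pair with the stronger absolute correlation to the target. The helper `pick_redundant_features` below is a hypothetical sketch of that rule, not the only reasonable choice; domain meaning or data quality may justify keeping the other feature.

```python
import pandas as pd


def pick_redundant_features(df: pd.DataFrame, target: str, threshold: float = 0.8) -> list[str]:
    """For each highly correlated pair, mark the feature less correlated with the target."""
    features = [c for c in df.select_dtypes(include="number").columns if c != target]
    corr = df[features].corr().abs()
    target_corr = df[features].corrwith(df[target]).abs()

    to_drop: set[str] = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # this pair is already resolved by an earlier decision
            if corr.loc[a, b] > threshold:
                weaker = a if target_corr[a] < target_corr[b] else b
                to_drop.add(weaker)
    return sorted(to_drop)


# Small deterministic example: x_dup is x plus a small alternating perturbation
x = list(range(20))
example = pd.DataFrame({
    "x": x,
    "x_dup": [v + (0.5 if v % 2 == 0 else -0.5) for v in x],
    "z": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4],
    "y": [2.0 * v for v in x],
})
print(pick_redundant_features(example, target="y"))
```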

Common Trap

Correlation heatmaps are the most overused and under-interpreted visualization in take-home projects. Simply generating a heatmap is not EDA. You must: (1) interpret what the correlations mean for your modeling strategy, (2) flag highly correlated pairs and decide which to keep, (3) note that correlation is only linear - features with low correlation may still have non-linear relationships with the target. If you include a heatmap, add a markdown cell below it explaining what you learned from it.
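Point (3) is easy to demonstrate on synthetic data. In the sketch below, `y` depends strongly on `x` through a quadratic, so the Pearson correlation is close to zero, while a binned measure (the correlation ratio, eta squared) shows that `x` explains most of the variance in `y`. The data and the choice of 10 quantile bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=5000)
y = x**2 + rng.normal(0, 0.05, size=5000)  # strong but non-monotonic dependence

# Pearson only measures linear association
pearson_r = np.corrcoef(x, y)[0, 1]

# Correlation ratio: share of the variance in y explained by the bin of x
bins = pd.qcut(x, q=10, labels=False)
frame = pd.DataFrame({"y": y, "bin": bins})
bin_means = frame.groupby("bin")["y"].transform("mean")
eta_squared = ((bin_means - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"Pearson r:   {pearson_r:.3f}")
print(f"Eta squared: {eta_squared:.3f}")
```

A feature like this would look useless on a heatmap and still be valuable to a tree-based model, which is why the per-class distribution plots from Step I matter.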

Step E: Edge Cases and Outliers

from scipy import stats

def detect_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Detect outliers using multiple methods and report findings.

    Methods:
    - IQR: Traditional statistical outlier detection
    - Z-score: Standard deviation based
    - Domain: Impossible or implausible values
    """

    numeric_cols = df.select_dtypes(include=[np.number]).columns
    outlier_report = []

    for col in numeric_cols:
        series = df[col].dropna()

        # IQR method
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        iqr_outliers = ((series < lower_bound) | (series > upper_bound)).sum()

        # Z-score method
        z_scores = np.abs(stats.zscore(series))
        z_outliers = (z_scores > 3).sum()

        outlier_report.append({
            "feature": col,
            "iqr_outliers": iqr_outliers,
            "iqr_outlier_pct": iqr_outliers / len(series) * 100,
            "z_outliers": z_outliers,
            "z_outlier_pct": z_outliers / len(series) * 100,
            "min": series.min(),
            "max": series.max(),
            "lower_bound": lower_bound,
            "upper_bound": upper_bound,
        })

    report = pd.DataFrame(outlier_report).sort_values("iqr_outlier_pct", ascending=False)

    print("OUTLIER REPORT")
    print("=" * 80)
    for _, row in report.iterrows():
        if row["iqr_outlier_pct"] > 1:
            print(f"\n  {row['feature']}:")
            print(f"    IQR outliers: {row['iqr_outliers']:.0f} ({row['iqr_outlier_pct']:.1f}%)")
            print(f"    Z-score outliers: {row['z_outliers']:.0f} ({row['z_outlier_pct']:.1f}%)")
            print(f"    Range: [{row['min']:.2f}, {row['max']:.2f}]")
            print(f"    Expected range: [{row['lower_bound']:.2f}, {row['upper_bound']:.2f}]")

    return report

outlier_report = detect_outliers(df)

def check_impossible_values(df: pd.DataFrame) -> None:
    """Check for domain-impossible values.

    These checks should be customized based on the dataset.
    Examples are provided as templates.
    """
    checks = {
        # "age": {"min": 0, "max": 120, "desc": "Age cannot be negative or > 120"},
        # "price": {"min": 0, "max": None, "desc": "Price cannot be negative"},
        # "percentage": {"min": 0, "max": 100, "desc": "Percentage must be 0-100"},
    }

    print("DOMAIN VALIDATION CHECKS")
    print("=" * 60)

    for col, rules in checks.items():
        if col not in df.columns:
            continue

        violations = pd.Series(False, index=df.index)
        if rules.get("min") is not None:
            violations |= df[col] < rules["min"]
        if rules.get("max") is not None:
            violations |= df[col] > rules["max"]

        n_violations = violations.sum()
        if n_violations > 0:
            print(f"  WARNING - {col}: {n_violations} violations ({rules['desc']})")
            print(f"    Violating values: {df.loc[violations, col].describe()}")
        else:
            print(f"  OK - {col}: All values within expected range")

check_impossible_values(df)
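Detection is separate from handling. When domain knowledge does not justify deleting rows, one alternative is to cap extreme values and keep a flag that records where capping happened. The sketch below assumes percentile-based caps at 1% and 99%; both the helper name `flag_and_cap` and the cut-offs are illustrative and should be justified per feature.

```python
import pandas as pd


def flag_and_cap(series: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.DataFrame:
    """Cap values outside the given quantiles and keep an indicator column."""
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    is_extreme = (series < lower) | (series > upper)
    return pd.DataFrame({
        f"{series.name}_capped": series.clip(lower=lower, upper=upper),
        f"{series.name}_is_extreme": is_extreme.astype(int),
    })


# 99 ordinary values and one extreme value
charges = pd.Series(list(range(1, 100)) + [10_000], name="monthly_charges")
result = flag_and_cap(charges)
print(result.tail(3))
```

No rows are dropped, so the decision stays reversible and the model can still learn from the indicator column.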

Step S: Summarize and Strategize

This is the step most candidates skip - and the one evaluators value most.

"""
## EDA Summary and Strategy

### Key Findings

| Finding | Impact | Action |
|---------|--------|--------|
| Target is 95/5 imbalanced | Accuracy is misleading | Use PR-AUC and F1 as primary metrics |
| Feature X and Y are 0.97 correlated | Multicollinearity | Drop one (keeping X, higher target correlation) |
| 12% missing in feature Z | Cannot ignore | Impute with median; add binary indicator for missingness |
| Feature W has 3% negative values | Domain-impossible | These appear to be refunds; create binary flag |
| Date column has no clear trend | May not be useful as-is | Extract month, day-of-week, recency features instead |

### Feature Engineering Plan
Based on the EDA findings above, I will:
1. [Engineering decision 1 - linked to finding]
2. [Engineering decision 2 - linked to finding]
3. [Engineering decision 3 - linked to finding]

### Modeling Implications
- Primary metric: PR-AUC (due to class imbalance)
- Use stratified cross-validation
- Consider class weights in model training
- Feature Z missingness pattern may be informative - include as a feature
"""

60-Second Answer

"The EDA summary is the single most valuable section of your entire notebook. It demonstrates that you can extract actionable insights from data and translate them into modeling decisions. An evaluator who reads a clear EDA summary immediately trusts your subsequent modeling choices because they can see the reasoning chain. Without this summary, every modeling decision looks arbitrary."

Part 3 - Visualization Best Practices

The Three Rules of Take-Home Visualizations

Every plot must have a title that states the finding, not just the variable name. Wrong: "Feature Distribution". Right: "Monthly Charges Are Higher for Churning Customers (Median $78 vs $62)".
Every plot must have labeled axes with units. Wrong: unlabeled axes. Right: "Monthly Charges ($)" and "Number of Customers".
Every plot must be followed by a markdown cell interpreting it. A plot without interpretation is a plot without purpose.

# BAD visualization
plt.hist(df["monthly_charges"])
plt.show()

# GOOD visualization
fig, ax = plt.subplots(figsize=(10, 5))

for label, color in [(0, "#2563eb"), (1, "#dc2626")]:
    subset = df[df["churn"] == label]["monthly_charges"]
    ax.hist(subset, bins=30, alpha=0.6, color=color,
            label=f"{'Churned' if label == 1 else 'Retained'} (median=${subset.median():.0f})",
            density=True, edgecolor="white")

ax.set_xlabel("Monthly Charges ($)", fontsize=12)
ax.set_ylabel("Density", fontsize=12)
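# Escape the dollar signs: matplotlib renders text between two unescaped "$" as mathtext.
# The numbers in the title are illustrative; compute them from the data in a real notebook.
ax.set_title("Churning Customers Pay More on Average\n(Median \\$78 vs \\$62, p < 0.001)",
             fontsize=13, fontweight="bold")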
ax.legend(fontsize=11)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

plt.tight_layout()
plt.savefig("results/figures/charges_by_churn.png", dpi=150, bbox_inches="tight")
plt.show()
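A finding-style title is only trustworthy if its numbers come from the data. The sketch below builds such a title from group medians instead of typing the values by hand. The helper name `median_comparison_title` and the label mapping are assumptions for illustration.

```python
import pandas as pd


def median_comparison_title(df: pd.DataFrame, feature: str, target: str, labels: dict) -> str:
    """Build a plot title that states which group has the higher median."""
    medians = df.groupby(target)[feature].median()
    high, low = medians.idxmax(), medians.idxmin()
    return (f"{labels[high]} customers have higher median {feature} "
            f"({medians[high]:.0f} vs {medians[low]:.0f})")


example = pd.DataFrame({
    "churn": [1, 1, 1, 0, 0, 0],
    "monthly_charges": [70, 78, 90, 50, 62, 80],
})
title = median_comparison_title(
    example, "monthly_charges", "churn", labels={0: "Retained", 1: "Churned"}
)
print(title)
```

If a currency symbol is added to the title before it is passed to matplotlib, a literal dollar sign should be written as `\$` so that it is not parsed as math text.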

Visualization Selection Guide

What You Want to Show	Best Plot Type	When to Use
Distribution of one numeric variable	Histogram or KDE	Understanding feature spread
Distribution by category	Box plot or violin plot	Comparing groups
Relationship between two numeric variables	Scatter plot	Checking linearity, clusters
Feature vs target (numeric)	Overlapping histograms or box plots	Feature importance exploration
Feature vs target (categorical)	Grouped bar chart	Category-target relationships
Correlation between many features	Heatmap (with interpretation!)	Multicollinearity detection
Missing value patterns	`msno.matrix` or custom heatmap	Understanding missing data structure
Time trends	Line plot	Time series EDA
Categorical proportions	Stacked bar chart	Understanding compositions
Outlier context	Box plot with jittered points	Outlier investigation

Instant Rejection

Do not include plots generated by automated profiling tools (like pandas-profiling) as a substitute for EDA. Evaluators have seen this trick thousands of times. Auto-generated reports contain 50 plots with zero interpretation. They signal that you cannot distinguish important patterns from noise. Use profiling tools for your own understanding, but present curated, interpreted visualizations in your submission.

Creating a Publication-Quality Style

def set_plot_style():
    """Set a consistent, clean plotting style for the entire notebook."""
    plt.rcParams.update({
        "figure.figsize": (10, 6),
        "figure.dpi": 100,
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.labelsize": 12,
        "axes.titlesize": 13,
        "font.size": 11,
        "legend.fontsize": 10,
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
    })
    sns.set_palette("husl")

set_plot_style()

Part 4 - Statistical Tests

When to Use Statistical Tests in EDA

Statistical tests add rigor to your visual observations. They transform "it looks like churning customers have higher charges" into "churning customers have significantly higher monthly charges (t-test, p < 0.001, Cohen's d = 0.42)."

Test	When to Use	Python Implementation
t-test	Compare means of two groups (numeric feature by binary target)	`scipy.stats.ttest_ind`
Mann-Whitney U	Compare distributions when normality is not guaranteed	`scipy.stats.mannwhitneyu`
Chi-squared	Test association between two categorical variables	`scipy.stats.chi2_contingency`
ANOVA / Kruskal-Wallis	Compare means across 3+ groups	`scipy.stats.f_oneway` / `kruskal`
Kolmogorov-Smirnov	Compare two distributions	`scipy.stats.ks_2samp`
Pearson/Spearman correlation	Test linear/monotonic relationship significance	`scipy.stats.pearsonr` / `spearmanr`
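The Cohen's d quoted in the example sentence above is the difference in group means divided by the pooled standard deviation. A minimal NumPy sketch, assuming two independent groups (the function name `cohens_d` is a hypothetical helper):

```python
import numpy as np


def cohens_d(a, b) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


d = cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(f"Cohen's d = {d:.3f}")
```

Reporting d next to the p-value tells the reader how large the difference is, not only whether it was detected.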

from scipy import stats

def statistical_feature_analysis(
    df: pd.DataFrame,
    target: str,
    numeric_cols: list[str] = None,
    categorical_cols: list[str] = None,
) -> pd.DataFrame:
    """Run statistical tests for feature-target relationships.

    For numeric features: Mann-Whitney U test (robust to non-normality)
    For categorical features: Chi-squared test of independence
    """
    results = []

    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    if categorical_cols is None:
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns

    # Numeric features
    for col in numeric_cols:
        group_0 = df[df[target] == 0][col].dropna()
        group_1 = df[df[target] == 1][col].dropna()

        stat, p_value = stats.mannwhitneyu(group_0, group_1, alternative="two-sided")

        # Effect size (rank-biserial correlation)
        n0, n1 = len(group_0), len(group_1)
        effect_size = 1 - (2 * stat) / (n0 * n1)

        results.append({
            "feature": col,
            "type": "numeric",
            "test": "Mann-Whitney U",
            "statistic": stat,
            "p_value": p_value,
            "effect_size": abs(effect_size),
            "significant": p_value < 0.05,
        })

    # Categorical features
    for col in categorical_cols:
        contingency = pd.crosstab(df[col], df[target])
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

        # Effect size (Cramer's V)
        n = contingency.sum().sum()
        min_dim = min(contingency.shape) - 1
        cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0

        results.append({
            "feature": col,
            "type": "categorical",
            "test": "Chi-squared",
            "statistic": chi2,
            "p_value": p_value,
            "effect_size": cramers_v,
            "significant": p_value < 0.05,
        })

    results_df = pd.DataFrame(results).sort_values("p_value")

    print("STATISTICAL FEATURE ANALYSIS")
    print("=" * 80)
    for _, row in results_df.iterrows():
        sig = "***" if row["p_value"] < 0.001 else "**" if row["p_value"] < 0.01 else "*" if row["p_value"] < 0.05 else "ns"
        print(f"  {row['feature']:30s} p={row['p_value']:.4f} {sig}  effect={row['effect_size']:.3f}  ({row['test']})")

    return results_df

stat_results = statistical_feature_analysis(df, TARGET)

Common Trap

Do not p-hack your way through EDA. Running 50 statistical tests and reporting only the significant ones is bad science and evaluators will notice. If you run multiple tests, mention the multiple comparisons issue. Also, statistical significance is not the same as practical significance - a p-value of 0.001 on a feature with a tiny effect size (d = 0.02) means the difference is reliably detected but practically meaningless for prediction.
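One common way to address the multiple comparisons issue is the Benjamini-Hochberg procedure, which controls the false discovery rate across all tests. The NumPy sketch below is a minimal version, and the p-values are made up for illustration. The function name `benjamini_hochberg` is a hypothetical helper, not part of the code earlier in this chapter.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds

    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest rank that meets its threshold
        reject[order[: k + 1]] = True    # reject it and every smaller p-value
    return reject


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values)
print(f"Uncorrected p < 0.05: {sum(p < 0.05 for p in p_values)}")
print(f"Significant after BH: {int(rejected.sum())}")
```

In this example five features pass the naive 0.05 cut-off, but only two survive the correction. That gap is exactly what the notebook should acknowledge.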

Part 5 - Documenting EDA Findings

The Documentation Pattern

Every EDA finding should be documented with this structure:

### Finding: [Descriptive title]

**Observation:** [What you see in the data]

**Evidence:** [Plot reference, statistical test result, or summary statistic]

**Implication:** [What this means for the modeling approach]

**Action:** [What you will do about it]

Example Documented Findings

### Finding 1: Severe class imbalance in churn target

**Observation:** Only 5.2% of customers in the dataset have churned.

**Evidence:** Target distribution shows 47,400 retained vs. 2,600 churned customers.
This means a model that always predicts "no churn" achieves 94.8% accuracy.

**Implication:** Accuracy is an inappropriate metric for this problem. The model
must be evaluated on its ability to identify the minority class (churners).

**Action:**
- Use precision-recall AUC and F1-score as primary metrics
- Apply stratified cross-validation
- Consider class weights in tree-based models
- Report per-class precision and recall, not just overall metrics

---

### Finding 2: Monthly charges strongly differentiate churners

**Observation:** Customers who churn have significantly higher monthly charges
(median $78 vs $62 for retained customers).

**Evidence:** Mann-Whitney U test p < 0.001, effect size (rank-biserial) = 0.34.
See Figure 2 for the distribution comparison.

**Implication:** Monthly charges is likely an important predictive feature.
The relationship appears approximately linear, so it should work well in
both linear and tree-based models without transformation.

**Action:** Include monthly_charges as a feature. Also create a derived feature
"charge_per_service" to capture whether high charges are proportional to
services used.

---

### Finding 3: Missing values in total_charges are systematic

**Observation:** 11 records have missing total_charges, and ALL of them have
tenure = 0 (new customers with no billing history).

**Evidence:** Cross-tabulation of missingness with tenure confirms 100%
overlap with tenure = 0.

**Implication:** This is not random missing data - it is structurally missing
because new customers have no charge history. Imputing with the mean would be
inappropriate.

**Action:** Impute with 0 (they have been charged nothing yet) and create a
binary feature "is_new_customer" to capture this distinct segment.
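The evidence and action in Finding 3 translate into a few lines of pandas. The sketch below uses a small synthetic frame with the same column names; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    "tenure": [0, 0, 5, 12, 0, 24],
    "total_charges": [np.nan, np.nan, 350.0, 900.0, np.nan, 1800.0],
})

# Evidence: what share of the rows with missing total_charges have tenure == 0?
missing = df_example["total_charges"].isna()
overlap = (df_example.loc[missing, "tenure"] == 0).mean()
print(f"Missing rows with tenure == 0: {overlap:.0%}")

# Action: only apply the structural fix if the evidence supports it
if overlap == 1.0:
    df_example["is_new_customer"] = (df_example["tenure"] == 0).astype(int)
    df_example["total_charges"] = df_example["total_charges"].fillna(0.0)

print(df_example)
```

Keeping the check and the fix in the same cell shows the evaluator that the imputation choice follows from the evidence.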

Part 6 - Common EDA Mistakes

Mistake 1: Skipping EDA Entirely

The most common mistake. Candidates jump straight to model.fit() because they think model accuracy is all that matters.

Why it fails: Evaluators interpret missing EDA as either inexperience or laziness. More practically, skipping EDA leads to poor feature engineering, wrong metric choices, and undetected data quality issues that corrupt your results.

Fix: Allocate 15-20% of your total time to EDA. Even under severe time pressure, spend at least 30 minutes on data profiling and target analysis.

Mistake 2: EDA Without Narrative

Running df.describe(), showing 20 plots, and moving on. No interpretation, no findings, no connection to modeling decisions.

Why it fails: Evaluators cannot tell if you understood the data or just ran boilerplate code. Plots without narrative are decoration, not analysis.

Fix: After every visualization, add a markdown cell: "This shows that [observation]. This means [implication]. Therefore, I will [action]."

Mistake 3: Exhaustive Unfocused EDA

Analyzing every possible feature combination, creating 40 plots, running every statistical test. The EDA section is longer than the modeling section.

Why it fails: It suggests you cannot prioritize. Evaluators do not want to scroll through 40 plots to find the 3 that matter.

Fix: Focus on task-relevant exploration. Ask: "Does this analysis help me build a better model or make a better decision?" If not, skip it or put it in an appendix.

Mistake 4: Not Checking for Target Leakage

Failing to investigate whether any feature is suspiciously correlated with the target because it was derived from the target.

Why it fails: Target leakage inflates your metrics and produces a model that will fail in production. It is one of the most serious technical failure modes in take-home projects.

Fix: For every feature that correlates strongly with the target (|r| > 0.8), ask: "Would this feature be available at prediction time?" If a feature perfectly separates the classes, it is almost certainly a leakage variable.

def check_for_leakage(df: pd.DataFrame, target: str) -> None:
    """Flag potential target leakage in features."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")

    correlations = df[numeric_cols].corrwith(df[target]).abs().sort_values(ascending=False)

    print("LEAKAGE CHECK - Features Most Correlated with Target")
    print("=" * 60)
    for feat, corr in correlations.head(10).items():
        flag = " <-- INVESTIGATE" if corr > 0.8 else " <-- SUSPICIOUS" if corr > 0.6 else ""
        print(f"  {feat:30s} r = {corr:.4f}{flag}")

    if correlations.max() > 0.9:
        print("\n  WARNING: Feature(s) with r > 0.9 detected.")
        print("  Ask: Would this feature be available at prediction time?")
        print("  If it is derived from the target, it MUST be removed.")

check_for_leakage(df, TARGET)
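Linear correlation can miss a feature that separates the classes through a threshold or a monotonic but non-linear relationship. A complementary check, sketched below under the assumption of a 0/1 target, scores each feature by the AUC obtained when rows are ranked by that feature alone. A value near 1.0 means the feature nearly separates the classes by itself. The helper `single_feature_auc` is hypothetical and uses the rank-based (Mann-Whitney) formulation of AUC.

```python
import pandas as pd


def single_feature_auc(feature: pd.Series, target: pd.Series) -> float:
    """AUC from ranking rows by one feature; direction-agnostic, so always >= 0.5."""
    valid = feature.notna() & target.notna()
    x, y = feature[valid], target[valid]
    n_pos, n_neg = int((y == 1).sum()), int((y == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")

    ranks = x.rank(method="average")                  # ties share the average rank
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    auc = u / (n_pos * n_neg)
    return float(max(auc, 1 - auc))


target = pd.Series([0, 0, 0, 1, 1, 1])
leaky = pd.Series([1, 2, 3, 10, 11, 12])   # perfectly separates the classes
constant = pd.Series([5, 5, 5, 5, 5, 5])   # carries no information

print(single_feature_auc(leaky, target))
print(single_feature_auc(constant, target))
```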

Mistake 5: Ignoring Missing Data Patterns

Treating all missing data as random and applying blanket imputation.

Why it fails: Missing data is often informative. A customer with no "total_charges" is probably a new customer - that information is lost if you impute with the mean.

Fix: Always investigate why data is missing before deciding how to handle it.

def analyze_missing_patterns(df: pd.DataFrame) -> None:
    """Analyze patterns in missing data."""
    missing = df.isnull()
    missing_cols = missing.sum()[missing.sum() > 0].sort_values(ascending=False)

    if len(missing_cols) == 0:
        print("No missing values detected.")
        return

    print("MISSING DATA PATTERN ANALYSIS")
    print("=" * 60)

    for col in missing_cols.index:
        pct = missing_cols[col] / len(df) * 100
        print(f"\n  {col}: {missing_cols[col]:,} missing ({pct:.1f}%)")

        # Check if missingness correlates with other features
        missing_indicator = df[col].isnull().astype(int)
        numeric_cols = df.select_dtypes(include=[np.number]).columns.drop(col, errors="ignore")

        for other_col in numeric_cols[:5]:
            missing_mean = df.loc[missing_indicator == 1, other_col].mean()
            present_mean = df.loc[missing_indicator == 0, other_col].mean()
            if abs(missing_mean - present_mean) > df[other_col].std() * 0.5:
                print(f"    Correlated with {other_col}: missing mean={missing_mean:.2f}, present mean={present_mean:.2f}")

analyze_missing_patterns(df)
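The function above compares missingness against other features. It is also worth checking missingness against the target, since a large gap between classes suggests the missing indicator is itself predictive. A minimal sketch, using a hypothetical helper `missingness_by_target` and synthetic data:

```python
import numpy as np
import pandas as pd


def missingness_by_target(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Share of missing values per feature, split by target class."""
    feature_cols = [c for c in df.columns if c != target]
    is_missing = df[feature_cols].isna().astype(int)
    rates = is_missing.groupby(df[target]).mean().T   # rows: features, columns: classes
    return rates.loc[rates.sum(axis=1) > 0]           # keep features with any missing values


example = pd.DataFrame({
    "churn": [0, 0, 0, 0, 1, 1],
    "total_charges": [10.0, 20.0, 30.0, 40.0, np.nan, np.nan],
    "tenure": [1, 2, 3, 4, 5, 6],
})
rates = missingness_by_target(example, "churn")
print(rates)
```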

Mistake 6: Not Visualizing the Right Things

What to Visualize by Task Type - Classification, Regression, Time Series, NLP

Part 7 - Time Allocation for EDA

How Much Time Should You Spend on EDA?

Total Project Time	EDA Time	What to Cover
4 hours	30-45 min	Data profile, target analysis, top 3 feature relationships
8 hours	1-1.5 hours	Full DICES framework, key visualizations, documented findings
Weekend	2-3 hours	Comprehensive EDA, statistical tests, detailed narrative
1 week	3-5 hours	Publication-quality EDA, all findings documented and linked to modeling

Company Variation

Some companies (especially in consulting and analytics) weight EDA more heavily than modeling. If the prompt says "What insights can you find in this data?" or "Tell us a story with this data," EDA is the deliverable, not a precursor to modeling. In these cases, allocate 40-50% of your time to EDA.

The Time-Boxed EDA Approach

When time is tight, use this priority order:

5 minutes: Data profile (shape, types, missing values)
5 minutes: Target distribution and imbalance check
10 minutes: Top features vs. target visualization
5 minutes: Correlation matrix and leakage check
5 minutes: Document the three most important findings

Total: 30 minutes. This is the absolute minimum EDA that will keep your submission credible.

Practice Exercises

Exercise 1: EDA From Scratch

Download any dataset from the UCI ML Repository or Kaggle. Perform a complete EDA using the DICES framework in under 90 minutes. Document at least 5 findings with the observation-evidence-implication-action format.

Exercise 2: Spot the Issues

Given these EDA "findings," identify what is wrong with each:

1. "The correlation heatmap shows feature correlations." (What is wrong?)
2. "I removed all outliers using the IQR method." (What is wrong?)
3. "All features are significant (p < 0.05)." (What is wrong?)
4. "Feature X has a 0.95 correlation with the target, which is great for prediction." (What is wrong?)

Answers

1. This is a description of the plot, not a finding. A finding would be: "Features X and Y are 0.93 correlated, indicating redundancy - I will remove Y."
2. Blanket outlier removal without domain justification. An "outlier" in monthly charges might be a legitimate high-value customer. Always justify outlier handling with domain knowledge.
3. With a large enough dataset, even trivial differences become statistically significant. Report effect sizes alongside p-values to distinguish practically meaningful differences.
4. A 0.95 correlation with the target is a strong sign of data leakage. Legitimate predictive features rarely correlate this strongly. Investigate whether this feature was derived from or is a proxy for the target.

Exercise 3: Time-Pressured EDA

Set a timer for 30 minutes. Perform an EDA on a new dataset and document your findings. Practice the time-boxed approach until you can produce credible EDA in 30 minutes consistently.

Interview Cheat Sheet

Question	Key Points
"How do you approach EDA?"	DICES framework: Data profile, Investigate target, Correlations, Edge cases, Summarize
"What is the most important thing to check first?"	Target distribution - determines metric choice and handling strategy
"How do you handle missing data?"	Analyze WHY it is missing first; then impute or flag appropriately
"How do you detect target leakage?"	Check features with r > 0.8 with target; ask if they would be available at prediction time
"What makes a good EDA visualization?"	Title states the finding, axes labeled, followed by interpretation
"How much time should EDA take?"	15-20% of total project time; minimum 30 minutes even under time pressure
"What is the difference between statistical and practical significance?"	Large datasets make everything significant; effect size measures practical importance
"How do you handle outliers?"	Detect with IQR/z-score, but handle with domain knowledge - not blanket removal

Next Steps

With a solid EDA foundation, you are ready to make informed modeling decisions. The next chapter, Model Selection Strategy, covers how to choose the right model for your take-home - starting with baselines, comparing models fairly, tuning hyperparameters efficiently, and knowing when to stop iterating.
