Prerequisites
- Basic understanding of programming concepts 📝
- Python installation (3.8+) 🐍
- VS Code or preferred IDE 💻
What you'll learn
- Understand how pandas represents missing values 🎯
- Apply missing-data handling in real projects 🏗️
- Debug common issues 🐛
- Write clean, Pythonic code ✨
📘 Pandas Data Cleaning: Missing Values
Welcome to the wonderful world of data cleaning with pandas! 🎉 Have you ever opened a spreadsheet only to find empty cells staring back at you? That’s what we’ll tackle today - those pesky missing values that can turn your data analysis into a nightmare! 😱
In this tutorial, we’ll explore how pandas makes handling missing data as easy as pie 🥧. Whether you’re analyzing customer data, scientific measurements, or social media statistics, you’ll learn to clean your data like a pro! 💪
📚 Understanding Missing Values in Pandas
Think of missing values like holes in Swiss cheese 🧀 - they’re gaps in your data that need special attention. In pandas, missing values can appear in different forms:
- NaN (Not a Number) - the most common type
- None - Python's way of saying "nothing here"
- NaT (Not a Time) - for missing datetime values
import pandas as pd
import numpy as np
# 🎯 Let's create a dataset with missing values
data = {
    'name': ['Alice', 'Bob', None, 'David', 'Emma'],
    'age': [25, np.nan, 30, 35, 28],
    'salary': [50000, 60000, np.nan, np.nan, 55000],
    'department': ['IT', 'HR', 'IT', None, 'Finance']
}
df = pd.DataFrame(data)
print("Our messy data:")
print(df)
# 👀 Notice the NaN and None values!
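Before going further, it's worth confirming that pandas treats all three flavors of "missing" the same way. Here's a tiny check (nothing beyond pandas and NumPy) showing that pd.isna flags NaN, None, and NaT alike:
# ✅ pd.isna() recognizes every flavor of "missing"
print(pd.isna(np.nan))   # True
print(pd.isna(None))     # True
print(pd.isna(pd.NaT))   # True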
🔧 Basic Syntax and Usage
Let’s learn the essential pandas methods for detecting and handling missing values! These are your data cleaning superpowers 🦸♀️:
Detecting Missing Values
# 🔍 Check for missing values
print("\nMissing values check:")
print(df.isnull()) # Returns True where values are missing
# 📊 Count missing values per column
print("\nMissing values count:")
print(df.isnull().sum())
# 🎯 Check if ANY values are missing
print(f"\nDoes our data have missing values? {df.isnull().any().any()}")
Handling Missing Values
# 🗑️ Drop rows with ANY missing values
df_dropped = df.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)
# 🎨 Fill missing values with a specific value
df_filled = df.fillna(value={
    'name': 'Unknown',
    'age': df['age'].mean(),  # 👍 Use average age
    'salary': 0,
    'department': 'Unassigned'
})
print("\nAfter filling missing values:")
print(df_filled)
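dropna has more knobs than "drop every row with any gap". For example, it can drop columns instead of rows, or only drop rows where every value is missing - a quick sketch on the same df:
# 🧹 Drop COLUMNS that contain any missing values
print("\nColumns with no gaps:")
print(df.dropna(axis=1))
# 🧹 Drop only rows where EVERY value is missing (none here, so nothing is lost)
print("\nRows that aren't completely empty:")
print(df.dropna(how='all'))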
💡 Practical Examples
Let’s dive into real-world scenarios where you’ll use these skills! 🌟
Example 1: Customer Shopping Data 🛒
# 🛍️ Customer purchase history with missing data
shopping_data = {
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['Sarah', 'Mike', None, 'Lisa', 'Tom'],
    'last_purchase': ['2024-01-15', np.nan, '2024-02-20', '2024-01-30', np.nan],
    'total_spent': [250.50, np.nan, 180.00, np.nan, 320.75],
    'loyalty_points': [500, 300, np.nan, 750, np.nan]
}
customers = pd.DataFrame(shopping_data)
print("Customer data with missing values:")
print(customers)
# 🎯 Smart filling strategy
customers_cleaned = customers.copy()
# Fill missing names with 'Guest'
customers_cleaned['name'] = customers_cleaned['name'].fillna('Guest')
# Fill missing purchase amounts with median (more robust than mean!)
median_spent = customers_cleaned['total_spent'].median()
customers_cleaned['total_spent'] = customers_cleaned['total_spent'].fillna(median_spent)
# Fill loyalty points with 0 (new customers)
customers_cleaned['loyalty_points'] = customers_cleaned['loyalty_points'].fillna(0)
print("\n✨ Cleaned customer data:")
print(customers_cleaned)
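One loose end: last_purchase is still a string column, so its gaps are plain NaN. Converting it with pd.to_datetime turns those gaps into NaT, the datetime flavor of missing we met earlier:
# 📅 Convert dates - missing entries become NaT (Not a Time)
customers_cleaned['last_purchase'] = pd.to_datetime(customers_cleaned['last_purchase'])
print("\nPurchase dates with NaT for missing values:")
print(customers_cleaned['last_purchase'])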
Example 2: Weather Station Data 🌡️
# 🌤️ Weather data with sensor failures
weather_data = {
    'date': pd.date_range('2024-01-01', periods=7),
    'temperature': [22.5, np.nan, 24.1, 23.8, np.nan, 25.2, 24.5],
    'humidity': [65, 70, np.nan, 68, 72, np.nan, 69],
    'rainfall': [0, 5.2, np.nan, 0, 0, 3.1, np.nan]
}
weather = pd.DataFrame(weather_data)
print("Weather station data:")
print(weather)
# 🔧 Forward fill for continuous measurements
weather_ffill = weather.ffill()  # fillna(method='ffill') is deprecated in modern pandas
print("\n📈 Forward filled (using previous value):")
print(weather_ffill)
# 🎯 Interpolation for smoother estimates
weather_interpolated = weather.interpolate()
print("\n📊 Interpolated values:")
print(weather_interpolated)
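Since this dataset has a real date column, you can also interpolate against time rather than row position. A short sketch that assumes the dates become the index first (with evenly spaced daily readings the result matches plain linear interpolation, but it matters when readings are unevenly spaced):
# ⏰ Time-aware interpolation (requires a DatetimeIndex)
weather_time = weather.set_index('date').interpolate(method='time')
print("\n🕐 Time-based interpolation:")
print(weather_time)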
Example 3: Student Grades Analysis 📚
# 🎓 Student performance data
grades_data = {
    'student': ['Amy', 'Ben', 'Carl', 'Dana', 'Eve'],
    'math': [85, np.nan, 92, 78, np.nan],
    'english': [90, 88, np.nan, 85, 92],
    'science': [np.nan, 85, 90, np.nan, 88],
    'attendance_rate': [0.95, 0.88, np.nan, 0.92, 0.90]
}
grades = pd.DataFrame(grades_data)
# ❌ Wrong way - dropping too much data!
grades_wrong = grades.dropna()
print(f"😱 We lost {len(grades) - len(grades_wrong)} students!")
# ✅ Right way - smart handling
grades_smart = grades.copy()
# Fill missing grades with subject average
for subject in ['math', 'english', 'science']:
    subject_mean = grades_smart[subject].mean()
    grades_smart[subject] = grades_smart[subject].fillna(subject_mean)
# Fill missing attendance with minimum threshold
grades_smart['attendance_rate'] = grades_smart['attendance_rate'].fillna(0.75)
print("\n✅ Smart handling preserves all data:")
print(grades_smart)
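By the way, the per-subject loop above can be collapsed into one call: fillna also accepts a Series of per-column fill values, so passing the column means does the same job. Note this compact version would also fill attendance_rate with its mean rather than the 0.75 threshold:
# 🧮 One-liner alternative: fill each numeric column with its own mean
grades_oneliner = grades.fillna(grades.mean(numeric_only=True))
print(grades_oneliner)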
🚀 Advanced Concepts
Ready to level up your missing data game? Let’s explore advanced techniques! 🎮
Custom Missing Value Indicators
# 🎯 Working with custom missing value markers
sales_data = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999, -999, 45, 299],  # -999 means missing
    'stock': [10, 0, -1, 5],  # -1 means missing
    'rating': [4.5, 0, 4.2, 4.8]  # 0 means missing
})
# 🔄 Replace custom missing markers with NaN
sales_cleaned = sales_data.replace({
    'price': {-999: np.nan},
    'stock': {-1: np.nan},
    'rating': {0: np.nan}
})
print("After converting custom missing markers:")
print(sales_cleaned)
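If data like this arrives as a CSV, you can often skip the replace step entirely: read_csv has an na_values argument that converts custom markers to NaN while loading. A hedged sketch - the file name and markers below are made up for illustration:
# 📂 Hypothetical CSV load: treat sentinel values as missing on the way in
# sales = pd.read_csv(
#     'sales.csv',
#     na_values={'price': [-999], 'stock': [-1], 'rating': [0]}
# )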
Strategic Missing Data Handling
# 🏥 Medical data example
medical_data = {
    'patient_id': range(1, 6),
    'blood_pressure': [120, np.nan, 130, 125, np.nan],
    'cholesterol': [200, 180, np.nan, 210, 190],
    'bmi': [22.5, 25.1, np.nan, 24.3, 26.8]
}
patients = pd.DataFrame(medical_data)
# 🎯 Different strategies for different columns
def smart_fill(df):
    result = df.copy()
    # Blood pressure: use median (outlier-resistant)
    result['blood_pressure'] = result['blood_pressure'].fillna(
        result['blood_pressure'].median()
    )
    # Cholesterol: use mean of similar BMI patients
    for idx in result[result['cholesterol'].isnull()].index:
        patient_bmi = result.loc[idx, 'bmi']
        if pd.notna(patient_bmi):
            # Find patients with similar BMI (±2 points)
            similar = result[
                (result['bmi'] >= patient_bmi - 2) &
                (result['bmi'] <= patient_bmi + 2)
            ]
            result.loc[idx, 'cholesterol'] = similar['cholesterol'].mean()
    return result
patients_smart = smart_fill(patients)
print("Smart filling based on data relationships:")
print(patients_smart)
⚠️ Common Pitfalls and Solutions
Let’s learn from common mistakes so you don’t have to make them! 🛡️
Pitfall 1: Dropping Too Much Data
# ❌ Wrong - losing valuable information
messy_df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
# This drops 50% of your data! 😱
wrong_approach = messy_df.dropna()
print(f"Lost {len(messy_df) - len(wrong_approach)} rows!")
# ✅ Right - be selective
right_approach = messy_df.dropna(subset=['A']) # Only require column A
print(f"Kept {len(right_approach)} rows by being selective!")
Pitfall 2: Inappropriate Filling Methods
# ❌ Wrong - filling categories with numbers
categories_df = pd.DataFrame({
    'product_type': ['Electronics', np.nan, 'Clothing', 'Food'],
    'price': [100, 50, np.nan, 25]
})
# Don't fill categorical data with mean!
# wrong_fill = categories_df['product_type'].fillna(
#     categories_df['product_type'].mean()  # This would error!
# )
# ✅ Right - use mode for categories
mode_value = categories_df['product_type'].mode()[0]
categories_df['product_type'] = categories_df['product_type'].fillna(mode_value)
print("Correctly filled categorical data:")
print(categories_df)
🛠️ Best Practices
Follow these golden rules for handling missing data like a pro! 🏆
1. Understand Your Data First
# 🔍 Always investigate before cleaning
def missing_data_report(df):
    """Generate a comprehensive missing data report"""
    missing_count = df.isnull().sum()
    missing_percent = (missing_count / len(df)) * 100
    report = pd.DataFrame({
        'Missing Count': missing_count,
        'Missing %': missing_percent
    })
    return report[report['Missing Count'] > 0].sort_values(
        'Missing %', ascending=False
    )
# Example usage
sample_df = pd.DataFrame({
    'complete': [1, 2, 3, 4, 5],
    'half_missing': [1, np.nan, 3, np.nan, 5],
    'mostly_missing': [np.nan, np.nan, 3, np.nan, np.nan]
})
print("Missing Data Report:")
print(missing_data_report(sample_df))
2. Document Your Decisions
# 📝 Create a cleaning log
def clean_with_log(df, strategies):
    """Clean data and log all decisions"""
    cleaned = df.copy()
    log = []
    for column, strategy in strategies.items():
        before_missing = cleaned[column].isnull().sum()
        if strategy == 'drop':
            cleaned = cleaned.dropna(subset=[column])
        elif strategy == 'mean':
            cleaned[column] = cleaned[column].fillna(cleaned[column].mean())
        elif strategy == 'mode':
            cleaned[column] = cleaned[column].fillna(cleaned[column].mode()[0])
        elif isinstance(strategy, (int, float, str)):
            cleaned[column] = cleaned[column].fillna(strategy)
        after_missing = cleaned[column].isnull().sum()
        log.append({
            'column': column,
            'strategy': strategy,
            'missing_before': before_missing,
            'missing_after': after_missing
        })
    return cleaned, pd.DataFrame(log)
# Example
strategies = {
    'age': 'mean',
    'department': 'Unassigned',
    'salary': 'drop'
}
# cleaned_data, cleaning_log = clean_with_log(df, strategies)
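For completeness, here's what actually running that call against the df from earlier in the tutorial looks like - you get both the cleaned frame and an audit trail of every decision:
cleaned_data, cleaning_log = clean_with_log(df, strategies)
print("Cleaning log:")
print(cleaning_log)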
3. Consider Domain Knowledge
# 🏪 Retail example - domain-specific handling
retail_data = pd.DataFrame({
    'product': ['TV', 'Laptop', 'Phone', 'Tablet'],
    'sales_monday': [5, np.nan, 12, 8],
    'sales_tuesday': [np.nan, 10, 15, np.nan],
    'returns': [1, 0, np.nan, 2]
})
# 📊 Business logic: closed on Monday = 0 sales, not missing
retail_cleaned = retail_data.copy()
retail_cleaned['sales_monday'] = retail_cleaned['sales_monday'].fillna(0)
# Returns: missing likely means no returns
retail_cleaned['returns'] = retail_cleaned['returns'].fillna(0)
print("Domain-aware cleaning:")
print(retail_cleaned)
🧪 Hands-On Exercise
Time to put your skills to the test! 🎯 Try solving this real-world problem:
# 🏃♀️ Fitness tracker data with missing values
fitness_data = {
    'date': pd.date_range('2024-01-01', periods=10),
    'steps': [8000, np.nan, 12000, 9500, np.nan, 11000, 7500, np.nan, 10000, 8500],
    'calories': [2200, 2100, np.nan, 2300, np.nan, np.nan, 2000, 2250, np.nan, 2150],
    'sleep_hours': [7, 6.5, np.nan, 8, 7.5, np.nan, 6, 7, 7.5, np.nan],
    'workout': ['Yes', 'No', np.nan, 'Yes', 'No', 'Yes', np.nan, 'Yes', 'Yes', 'No']
}
fitness_df = pd.DataFrame(fitness_data)
# Your mission:
# 1. Create a missing data report
# 2. Fill missing steps with the weekly average
# 3. Fill missing calories with interpolation
# 4. Fill missing sleep with forward fill
# 5. Fill missing workout with 'No' (assumption: people log when they work out)
# 6. Create a summary showing before/after missing counts
# Try it yourself first! 💪
💡 Solution
# 🎯 Solution
print("=== Fitness Tracker Data Cleaning ===\n")
# 1. Missing data report
print("1. Missing Data Report:")
missing_report = pd.DataFrame({
    'Missing Count': fitness_df.isnull().sum(),
    'Missing %': (fitness_df.isnull().sum() / len(fitness_df)) * 100
})
print(missing_report[missing_report['Missing Count'] > 0])
# 2-5. Apply cleaning strategies
fitness_cleaned = fitness_df.copy()
# Fill missing steps with weekly average (7-day rolling mean)
fitness_cleaned['steps'] = fitness_cleaned['steps'].fillna(
    fitness_cleaned['steps'].rolling(window=7, min_periods=1).mean()
)
# Fill missing calories with interpolation
fitness_cleaned['calories'] = fitness_cleaned['calories'].interpolate()
# Fill missing sleep with forward fill
fitness_cleaned['sleep_hours'] = fitness_cleaned['sleep_hours'].ffill()  # fillna(method=...) is deprecated
# Fill missing workout with 'No'
fitness_cleaned['workout'] = fitness_cleaned['workout'].fillna('No')
# 6. Summary
print("\n6. Cleaning Summary:")
summary = pd.DataFrame({
    'Before': fitness_df.isnull().sum(),
    'After': fitness_cleaned.isnull().sum()
})
print(summary[summary['Before'] > 0])
print("\n✨ Cleaned fitness data:")
print(fitness_cleaned)
# Bonus: Visualize the difference
print("\n📊 Average daily metrics:")
print(f"Steps: {fitness_cleaned['steps'].mean():.0f}")
print(f"Calories: {fitness_cleaned['calories'].mean():.0f}")
print(f"Sleep: {fitness_cleaned['sleep_hours'].mean():.1f} hours")
print(f"Workout days: {(fitness_cleaned['workout'] == 'Yes').sum()}/10")
🎓 Key Takeaways
Congratulations! You’ve mastered the art of handling missing values in pandas! 🎉 Here’s what you’ve learned:
- Detection is Key 🔍 - Always check for missing values before analysis
- Different Data, Different Strategies 🎯 - Numbers vs. categories need different approaches
- Domain Knowledge Matters 🧠 - Understanding your data leads to better cleaning decisions
- Preserve What You Can 💎 - Don’t drop data unless absolutely necessary
- Document Everything 📝 - Keep track of your cleaning decisions
Remember: Missing data isn’t your enemy - it’s just another puzzle to solve! 🧩
🤝 Next Steps
You’re doing amazing! 🌟 Here’s what to explore next:
- Advanced Imputation 🚀 - Learn about machine learning-based imputation methods
- Time Series Missing Data ⏰ - Special techniques for temporal data
- Missing Data Visualization 📊 - Create heatmaps to visualize patterns (see the sketch after this list)
- Data Quality Automation 🤖 - Build automated data quality checks
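As a small taste of the visualization idea above, here's a minimal sketch (assuming matplotlib is installed) that renders the fitness tracker's missing-value pattern as a heatmap, with dark cells marking the gaps:
import matplotlib.pyplot as plt
# 🔥 Dark cells = missing values
plt.imshow(fitness_df.isnull(), aspect='auto', cmap='gray_r')
plt.xticks(range(len(fitness_df.columns)), fitness_df.columns, rotation=45)
plt.ylabel('Row number')
plt.title('Missing value pattern')
plt.tight_layout()
plt.show()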
Keep practicing with real datasets, and soon you’ll be cleaning data in your sleep! 😴 (Though we recommend doing it while awake for better results! 😄)
Happy data cleaning! Remember, every data scientist spends 80% of their time cleaning data - now you know why and how to do it like a pro! 💪✨