🔧 MemoLearning Data Wrangling

Master data cleaning, transformation, and preparation techniques for analysis-ready datasets


Data Wrangling Curriculum

  • 10 Core Units
  • ~120 Wrangling Techniques
  • 6+ Essential Tools
  • 40+ Real-world Examples
Unit 1: Data Quality Assessment

Learn to identify and assess data quality issues including completeness, accuracy, consistency, and validity.

  • Data quality dimensions
  • Completeness assessment
  • Accuracy and precision evaluation
  • Consistency checking
  • Validity and integrity rules
  • Data profiling techniques
  • Quality metrics and scoring
  • Automated quality checks
Unit 2: Handling Missing Data

Master techniques for identifying, understanding, and dealing with missing values in datasets.

  • Types of missing data (MCAR, MAR, MNAR)
  • Missing data patterns
  • Deletion strategies
  • Imputation methods
  • Forward and backward fill
  • Statistical imputation
  • Advanced imputation techniques
  • Evaluation of imputation quality
Unit 3: Data Type Conversion and Formatting

Convert between data types, standardize formats, and ensure data consistency across datasets.

  • Data type identification
  • Numeric type conversions
  • String formatting and parsing
  • Date and time formatting
  • Boolean conversion
  • Categorical data encoding
  • Custom type conversions
  • Format standardization
Unit 4: Outlier Detection and Treatment

Identify anomalous data points and apply appropriate strategies for handling outliers in your datasets; a short sketch of the IQR method follows the topic list.

  • Statistical outlier detection
  • Interquartile range (IQR) method
  • Z-score and modified Z-score
  • Isolation Forest
  • Local Outlier Factor
  • Visual outlier identification
  • Outlier treatment strategies
  • Domain-specific considerations
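
A minimal sketch of the IQR method from this unit, assuming a pandas DataFrame df with a numeric column; the column name 'value' is illustrative:

# Flag points outside 1.5 * IQR of the middle 50% (hypothetical column 'value')
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['value'] < lower) | (df['value'] > upper)]
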
Unit 5: Text Data Cleaning

Clean and standardize text data by removing noise, normalizing text, and handling encoding issues; a brief cleaning example follows the topic list.

  • Text encoding and decoding
  • Removing special characters
  • Case normalization
  • Whitespace handling
  • Regular expressions for cleaning
  • Text standardization
  • Handling unicode issues
  • Text validation patterns
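
A minimal cleaning sketch, assuming df holds a free-text column; the column name 'comment' and the set of characters kept are illustrative choices:

import re

def clean_text(text):
  # lowercase, drop non-alphanumeric characters, collapse whitespace
  text = str(text).lower()
  text = re.sub(r'[^a-z0-9\s]', '', text)
  return re.sub(r'\s+', ' ', text).strip()

df['comment_clean'] = df['comment'].apply(clean_text)
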
Unit 6: Data Transformation and Normalization

Transform data to appropriate scales and distributions for analysis and modeling; a scaling example follows the topic list.

  • Feature scaling techniques
  • Min-max normalization
  • Standard scaling (Z-score)
  • Robust scaling
  • Log transformations
  • Box-Cox transformations
  • Power transformations
  • Custom transformation functions
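
For a taste of the unit, min-max scaling, z-score standardization, and a log transform, assuming hypothetical numeric columns 'income' and 'age' in df:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Column names below are illustrative
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()
df['age_zscore'] = StandardScaler().fit_transform(df[['age']]).ravel()
df['income_log'] = np.log1p(df['income'])  # log(1 + x) for right-skewed data
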
Unit 7: Data Integration and Merging

Combine data from multiple sources, resolve conflicts, and create unified datasets for analysis; a merge example follows the topic list.

  • Data source identification
  • Schema mapping and alignment
  • Join operations and strategies
  • Handling duplicate records
  • Entity resolution
  • Conflict resolution rules
  • Data lineage tracking
  • Integration validation
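
A small merge-and-deduplicate sketch, assuming two hypothetical DataFrames customers and orders that share a 'customer_id' key and that orders carries an 'order_id' column:

# Left join keeps every customer even if they have no orders
merged = customers.merge(orders, on='customer_id', how='left')
# Drop exact duplicate rows introduced by the integration
merged = merged.drop_duplicates()
# Customers with no matching order show up with a missing order_id
unmatched = merged[merged['order_id'].isnull()]
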
Unit 8: Feature Engineering and Creation

Create new, meaningful features from existing data to improve analysis and model performance; a short feature example follows the topic list.

  • Feature extraction techniques
  • Mathematical transformations
  • Binning and discretization
  • Polynomial features
  • Interaction features
  • Date and time features
  • Text feature extraction
  • Domain-specific features
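
As a quick example, date-part extraction and binning, assuming hypothetical columns 'signup_date' and 'age' in df:

import pandas as pd

df['signup_date'] = pd.to_datetime(df['signup_date'])
df['signup_month'] = df['signup_date'].dt.month          # 1-12
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek  # Monday = 0
# Discretize age into labeled bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 65, 120],
                         labels=['minor', 'young adult', 'adult', 'senior'])
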
Unit 9: Data Validation and Quality Control

Implement validation rules and quality control processes to ensure data integrity and reliability; a sketch of simple checks follows the topic list.

  • Validation rule design
  • Range and constraint checking
  • Referential integrity
  • Business rule validation
  • Data consistency checks
  • Automated validation pipelines
  • Error reporting and logging
  • Quality monitoring dashboards
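
A sketch of simple range, uniqueness, and category checks that collect failures instead of stopping at the first one; the column names and allowed values are assumptions:

errors = []
if not df['age'].between(0, 120).all():
  errors.append('age outside 0-120')
if df['customer_id'].duplicated().any():
  errors.append('duplicate customer_id values')
if not df['status'].isin(['active', 'inactive']).all():
  errors.append('unexpected status codes')
print(f'{len(errors)} validation issue(s) found: {errors}')
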
Unit 10: Data Wrangling Pipelines

Build automated, reproducible data wrangling workflows for efficient data processing; a minimal pipeline sketch follows the topic list.

  • Pipeline architecture design
  • Workflow orchestration
  • Error handling and recovery
  • Performance optimization
  • Parallel processing
  • Pipeline monitoring
  • Version control for data
  • Documentation and maintenance
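
A minimal sketch of a reproducible pipeline built from small functions chained with DataFrame.pipe, assuming a raw input frame raw_df; the step names and thresholds are illustrative:

def standardize_columns(df):
  out = df.copy()
  out.columns = [c.strip().lower().replace(' ', '_') for c in out.columns]
  return out

def drop_sparse_rows(df, min_fraction=0.8):
  # keep rows with at least min_fraction non-missing values
  return df.dropna(thresh=int(len(df.columns) * min_fraction))

clean_df = (raw_df
  .pipe(standardize_columns)
  .pipe(drop_sparse_rows))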

Unit 1: Data Quality Assessment

Learn to identify and assess data quality issues including completeness, accuracy, consistency, and validity.

Data Quality Dimensions

Understand the fundamental dimensions of data quality and how they impact analysis outcomes.

Completeness · Accuracy · Consistency · Validity
# Data quality dimensions checklist
# (missing_values_check, data_validation_check, format_check are
# placeholder check functions defined elsewhere)
quality_dimensions = {
  'completeness': missing_values_check,
  'accuracy': data_validation_check,
  'consistency': format_check
}

Completeness Assessment

Evaluate the completeness of your dataset by identifying missing values and incomplete records.

# Check for missing values
df.info()
df.isnull().sum()
missing_percentage = df.isnull().mean() * 100

Accuracy and Precision Evaluation

Assess the accuracy of data values and identify potential errors or inconsistencies in your dataset.

# Check for impossible values
age_errors = df[df['age'] < 0]['age']
# Validate against reference data
invalid_codes = df[~df['code'].isin(valid_codes)]

Consistency Checking

Ensure data consistency across columns, records, and related datasets for reliable analysis.

# Check date consistency
inconsistent = df[df['start_date'] > df['end_date']]
# Format consistency
phone_format = df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')

Validity and Integrity Rules

Define and apply business rules to validate data integrity and ensure compliance with domain requirements.

# Define validation rules
def validate_email(email):
  # naive check; a regex or a dedicated validator is more robust in practice
  return '@' in email and '.' in email

df['valid_email'] = df['email'].apply(validate_email)

Data Profiling Techniques

Use statistical and analytical techniques to understand data characteristics and identify quality issues.

# Basic profiling
df.describe()
df.dtypes
# Unique value counts
df.nunique()

Quality Metrics and Scoring

Develop quantitative metrics to measure and track data quality over time.

# Calculate quality score
completeness_score = 1 - df.isnull().mean().mean()  # share of non-missing cells
validity_score = df['valid_records'].mean()         # share of rows passing validation
overall_score = (completeness_score + validity_score) / 2

Automated Quality Checks

Implement automated systems for continuous data quality monitoring and alerting.

# Automated quality pipeline
def quality_check_pipeline(df):
  # check_completeness and check_validity are custom checks returning True/False
  checks = []
  checks.append(check_completeness(df))
  checks.append(check_validity(df))
  return all(checks)

Unit 2: Handling Missing Data

Master techniques for identifying, understanding, and dealing with missing values in datasets.

Types of Missing Data

Understand the different mechanisms that lead to missing data: MCAR, MAR, and MNAR.

MCAR · MAR · MNAR
# Identify missing data patterns
import missingno as msno
msno.matrix(df)
msno.heatmap(df)

Missing Data Patterns

Analyze patterns in missing data to understand the underlying causes and choose appropriate handling strategies.

# Count rows sharing each missingness pattern
missing_pattern = df.isnull().value_counts()
# Check correlation of missingness
msno.dendrogram(df)

Deletion Strategies

Learn when and how to remove records or features with missing values effectively.

# Listwise deletion
df_complete = df.dropna()
# Drop columns with less than 80% non-missing values (thresh must be an int)
df_cleaned = df.dropna(axis=1, thresh=int(len(df) * 0.8))

Imputation Methods

Apply various imputation techniques to fill in missing values based on data characteristics.

# Simple imputation: mean for numeric, mode for categorical
df['age'] = df['age'].fillna(df['age'].mean())
df['category'] = df['category'].fillna(df['category'].mode()[0])

Forward and Backward Fill

Use temporal relationships in time series data to fill missing values using adjacent observations.

# Forward fill (fillna(method=...) is deprecated in recent pandas)
df['price'] = df['price'].ffill()
# Backward fill for gaps at the start of the series
df['price'] = df['price'].bfill()

Statistical Imputation

Apply statistical methods like regression and clustering for more sophisticated missing value imputation.

from sklearn.impute import KNNImputer
# KNN imputation (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

Advanced Imputation Techniques

Explore advanced methods like multiple imputation and deep learning approaches for handling missing data.

from sklearn.experimental import enable_iterative_imputer  # activates the experimental API
from sklearn.impute import IterativeImputer
# Iterative (MICE-style) imputation
imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

Evaluation of Imputation Quality

Assess the effectiveness of imputation methods and validate the quality of filled values.

# Compare distributions before and after imputation
original_mean = df_original['column'].mean()
imputed_mean = df_imputed['column'].mean()
# Statistical test for distribution shift (Kolmogorov-Smirnov)
from scipy.stats import ks_2samp
stat, p_value = ks_2samp(df_original['column'].dropna(), df_imputed['column'])

Unit 3: Data Type Conversion and Formatting

Convert between data types, standardize formats, and ensure data consistency across datasets.

Data Type Identification

Identify current data types and determine the appropriate target types for each column in your dataset.

# Check current data