What Is Classification?
Learn the fundamental concept of classification, a core supervised learning task in machine learning.
Supervised Learning
Categorical Output
Discrete Labels
Classification is a supervised learning task where the goal is to predict the category or class of new observations based on a training dataset of observations whose category membership is known.
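The definition above can be sketched in a few lines: fit a classifier on observations whose classes are known, then ask it for the class of new observations. This is only an illustration. The toy data, the "small"/"large" labels, and the choice of a 1-nearest-neighbor classifier are assumptions made for the example, not part of the lesson.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training observations (two features each) with known class labels.
# The numbers and label names are made up for illustration.
X_train = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.1]]
y_train = ["small", "small", "large", "large"]

# 1-nearest-neighbor: a new point gets the label of its closest training point
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

# New observations whose class is unknown
X_new = [[1.1, 0.9], [4.1, 4.0]]
predicted = clf.predict(X_new).tolist()
print(predicted)  # ['small', 'large']
```

Any scikit-learn classifier could stand in here, since they all share the same fit/predict interface.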
Binary vs Multiclass Problems
Understand the difference between binary, multiclass, and multilabel classification problems.
Binary Classification: 2 classes (Yes/No, Spam/Not Spam, Fraud/Not Fraud)
Multiclass Classification: 3+ mutually exclusive classes (Red/Green/Blue, Cat/Dog/Bird)
Multilabel Classification: Each instance can carry several labels at once (e.g., a news article tagged both Politics and Technology)
# Examples of classification problems
classification_types = {
    "Binary": {
        "Email": ["Spam", "Not Spam"],
        "Medical": ["Disease", "Healthy"],
        "Finance": ["Fraud", "Legitimate"]
    },
    "Multiclass": {
        "Image": ["Cat", "Dog", "Bird", "Fish"],
        "Text": ["Sports", "Politics", "Technology"],
        "Iris": ["Setosa", "Versicolor", "Virginica"]
    }
}
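The three problem types differ mainly in the shape of the target. Binary and multiclass targets are one label per instance, while a multilabel target is a 0/1 indicator matrix with one column per label. The sketch below uses scikit-learn's `type_of_target` and `MultiLabelBinarizer` to show this. The label lists are illustrative and reuse category names from the examples above.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# One label per instance, two distinct values -> binary
y_binary = ["Spam", "Not Spam", "Spam", "Spam"]

# One label per instance, three or more distinct values -> multiclass
y_multiclass = ["Cat", "Dog", "Bird", "Dog"]

# Several labels per instance -> encode as a 0/1 indicator matrix
article_tags = [["Politics", "Technology"], ["Sports"], ["Sports", "Politics"]]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(article_tags)

print(type_of_target(y_binary))      # binary
print(type_of_target(y_multiclass))  # multiclass
print(type_of_target(y_multilabel))  # multilabel-indicator
print(mlb.classes_.tolist())         # ['Politics', 'Sports', 'Technology']
print(y_multilabel.tolist())         # [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
```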
Decision Boundaries
Understand how classifiers create decision boundaries to separate different classes.
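Decision Boundary: The surface in feature space on which a classifier switches from predicting one class to another. In general this surface can be curved. For logistic regression it is a hyperplane (a straight line in 2D) made up of the points where the predicted probability equals 0.5, that is, where w·x + b = 0.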
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create sample 2D data
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)
# Decision boundary: w[0]*x1 + w[1]*x2 + b = 0 (the points where P(y=1|x) = 0.5)
w = model.coef_[0]
b = model.intercept_[0]
# Plot decision boundary
x_boundary = np.linspace(-3, 3, 100)
y_boundary = -(w[0] * x_boundary + b) / w[1]
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu')
plt.plot(x_boundary, y_boundary, 'k-', linewidth=2)
plt.title('Logistic Regression Decision Boundary')
plt.show()
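To check the claim that the boundary is where the predicted probability equals 0.5, the sketch below fits a model on synthetic data, solves w[0]*x1 + w[1]*x2 + b = 0 for one point, and inspects the model's output there. The data, the seed, and the chosen x1 value are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 2D data, labeled by which side of the line x1 + x2 = 0 a point is on
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]

# Pick x1 freely, then solve w[0]*x1 + w[1]*x2 + b = 0 for x2
x1 = 0.5
x2 = -(w[0] * x1 + b) / w[1]
point = np.array([[x1, x2]])

# On the boundary the linear score is 0 and sigmoid(0) = 0.5
score = model.decision_function(point)[0]
prob = model.predict_proba(point)[0, 1]
print(f"score = {score:.6f}, P(y=1) = {prob:.6f}")

# Stepping along +w or -w moves the point into class 1 or class 0
side_pos = model.predict(point + w)[0]
side_neg = model.predict(point - w)[0]
print(side_pos, side_neg)
```

The weight vector w is perpendicular to the boundary, which is why stepping along it changes the predicted class.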
Classification vs Regression
Learn the key differences between classification and regression tasks.
Discrete
Continuous
Categorical
Numerical
# Key differences
differences = {
    "Output Type": {
        "Classification": "Discrete/Categorical",
        "Regression": "Continuous/Numerical"
    },
    "Examples": {
        "Classification": "Email spam detection",
        "Regression": "House price prediction"
    },
    "Evaluation": {
        "Classification": "Accuracy, Precision, Recall",
        "Regression": "MSE, RMSE, MAE, R²"
    },
    "Algorithms": {
        "Classification": "Logistic Regression, SVM, Random Forest",
        "Regression": "Linear Regression, Ridge, Lasso"
    }
}
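The difference in evaluation follows from the difference in output type. Classification metrics count matching and mismatching labels, while regression metrics measure numeric distance. The sketch below computes a few of the metrics listed above with `sklearn.metrics`. All values are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Classification: labels are discrete, so predictions are right or wrong
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true_cls, y_pred_cls)     # 4 of 6 correct
precision = precision_score(y_true_cls, y_pred_cls)   # 3 TP / (3 TP + 1 FP)
recall = recall_score(y_true_cls, y_pred_cls)         # 3 TP / (3 TP + 1 FN)
print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f}")

# Regression: targets are continuous, so we measure how far off each prediction is
y_true_reg = [200.0, 150.0, 320.0]   # e.g., house prices in thousands (hypothetical)
y_pred_reg = [210.0, 140.0, 300.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)      # (100 + 100 + 400) / 3
mae = mean_absolute_error(y_true_reg, y_pred_reg)     # (10 + 10 + 20) / 3
print(f"MSE={mse:.1f} RMSE={mse ** 0.5:.2f} MAE={mae:.2f}")
```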
Real-world Applications
Explore practical applications of classification in various industries.
Healthcare
Finance
Marketing
Technology
# Real-world classification applications
applications = {
    "Healthcare": [
        "Disease diagnosis from symptoms",
        "Medical image classification",
        "Drug response prediction"
    ],
    "Finance": [
        "Credit approval decisions",
        "Fraud detection systems",
        "Risk assessment models"
    ],
    "Marketing": [
        "Assigning customers to predefined segments",
        "Churn prediction",
        "Ad targeting optimization"
    ],
    "Technology": [
        "Email spam filtering",
        "Image recognition",
        "Text classification (e.g., sentiment analysis)"
    ]
}
Probabilistic Interpretation
Understand how classification can be viewed through a probabilistic lens.
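Instead of producing only a hard class label, a probabilistic classifier estimates P(y=1|X), the probability that an instance belongs to class 1 given its features X. The estimate expresses how confident the model is in each prediction, and the decision threshold can be tuned to the relative cost of false positives and false negatives.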
from sklearn.linear_model import LogisticRegression
import numpy as np
# Sample data
X = np.array([[1, 2], [2, 3], [3, 1], [4, 5]])
y = np.array([0, 0, 1, 1])
# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)
# Get probability predictions
probabilities = model.predict_proba(X)
print("Probabilities for each class:")
print(probabilities)
# Get class predictions (threshold = 0.5)
predictions = model.predict(X)
print("Class predictions:", predictions)
# Custom threshold: lowering it to 0.3 assigns more instances to class 1
custom_predictions = (probabilities[:, 1] > 0.3).astype(int)
print("Custom threshold predictions:", custom_predictions)
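Moving the threshold trades one kind of error for another. The sketch below compares precision and recall at thresholds 0.5 and 0.3 on a handful of made-up probabilities. The arrays and the helper `precision_recall` are hypothetical and written only for this illustration.

```python
import numpy as np

# True labels and predicted P(y=1|X) for six instances (illustrative values)
y_true = np.array([0, 1, 0, 1, 1, 0])
p_class1 = np.array([0.10, 0.35, 0.40, 0.80, 0.65, 0.20])


def precision_recall(y_true, p_class1, threshold):
    """Turn probabilities into labels at `threshold`, then score them."""
    y_pred = (p_class1 > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = {}
for threshold in (0.5, 0.3):
    results[threshold] = precision_recall(y_true, p_class1, threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=1.00, recall=0.67
# threshold=0.3: precision=0.75, recall=1.00
```

A lower threshold catches every positive here at the cost of one false alarm. That can be the right trade in settings such as fraud or disease screening, where a missed positive is more expensive than a false alarm.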