MLP based CNN for Classification
- Machine Learning, Deep Learning
- June 10, 2025
Perceptrons are the foundational elements of artificial neural networks, inspired by the biological neuron. Developed by Frank Rosenblatt in the late 1950s, they represent one of the earliest models of machine learning capable of learning from data.
At its core, a perceptron is a simple binary classifier. It takes multiple binary (or real-valued) inputs and produces a single binary output.
Components:
- Inputs (x₁, x₂, ..., xₙ): Features or attributes of the data point.
- Weights (w₁, w₂, ..., wₙ): Importance of each input.
- Summation Function (Weighted Sum): Calculates the weighted sum of inputs.
- Bias (b): Shifts the activation threshold.
- Activation Function: Determines the perceptron’s output. While the original perceptron used a simple step (threshold) function, modern neural networks use differentiable activation functions. Common choices include:
- Sigmoid: Maps input values to a range between 0 and 1. Useful for probabilistic outputs but can suffer from vanishing gradients. σ(x) = 1 / (1 + exp(-x)); its derivative peaks at 0.25.
- Tanh (Hyperbolic Tangent): Outputs values between -1 and 1, centering the data and often leading to faster convergence than sigmoid. tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)); its derivative peaks at 1.
- ReLU (Rectified Linear Unit): Outputs zero for negative inputs and the input itself for positive values. It is computationally efficient and helps mitigate the vanishing gradient problem. ReLU(x) = max(0, x); its derivative is 0 for negative inputs and 1 for positive inputs.
These activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
Equations:
Weighted Sum = x₁·w₁ + x₂·w₂ + ... + xₙ·wₙ
Net Input = (x₁·w₁ + x₂·w₂ + ... + xₙ·wₙ) + b
Output = 1 if Net Input ≥ 0, else 0
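The equations above can be sketched as a small NumPy function; the AND-gate weights below are an illustrative choice, not from the text:

```python
import numpy as np

def perceptron_output(x, w, b):
    # Net input = weighted sum of inputs plus bias; step activation.
    net = np.dot(x, w) + b
    return 1 if net >= 0 else 0

# Hypothetical weights that implement a 2-input AND gate.
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron_output(np.array([1, 1]), w, b))  # 1
print(perceptron_output(np.array([1, 0]), w, b))  # 0
```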
A perceptron can be visualized as a decision boundary (line or hyperplane) separating two classes.
The perceptron learns by adjusting weights and bias based on classification errors using labeled data.
Algorithm Steps:
- Initialization: Start with random weights and bias.
- Iteration: For each training example:
- Calculate output.
- Compare with target.
- If incorrect, update weights and bias:
wᵢ ← wᵢ + α · (T - Y) · xᵢ
b ← b + α · (T - Y)
where α is the learning rate, T is the target, and Y is the predicted output.
- Convergence: Guaranteed if data is linearly separable.
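A minimal sketch of these steps on the OR gate, which is linearly separable; NumPy and zero initialization (instead of random) are illustrative choices to keep the run deterministic:

```python
import numpy as np

def train_perceptron(X, T, alpha=0.1, epochs=10):
    # Zero init keeps this toy example deterministic; random init is typical.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = 1 if np.dot(x, w) + b >= 0 else 0
            # Update only matters when the prediction is wrong (T - Y != 0).
            w += alpha * (t - y) * x
            b += alpha * (t - y)
    return w, b

# OR gate: linearly separable, so convergence is guaranteed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, T)
preds = [1 if np.dot(x, w) + b >= 0 else 0 for x in X]
print(preds)  # [0, 1, 1, 1]
```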
A single perceptron solves only linearly separable problems, but combining perceptrons in layers forms multi-layer perceptrons (MLPs).
Structure:
- Input Layer: Receives features.
- Hidden Layers: One or more layers of perceptrons.
- Output Layer: Produces final output.
Training: Multi-layer networks use backpropagation to adjust all weights and biases, enabling learning of complex, non-linear relationships.
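A forward pass through this structure might look like the following sketch, assuming sigmoid activations and arbitrary layer sizes (2 inputs, 3 hidden units, 1 output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # One hidden layer of perceptron-like units, then one output unit.
    W1, b1, W2, b2 = params
    a1 = sigmoid(x @ W1 + b1)   # hidden-layer activations
    a2 = sigmoid(a1 @ W2 + b2)  # output-layer activation
    return a2

rng = np.random.default_rng(1)
params = (rng.normal(size=(2, 3)), np.zeros(3),   # input -> hidden
          rng.normal(size=(3, 1)), np.zeros(1))   # hidden -> output
out = forward(np.array([0.5, -0.2]), params)
print(out.shape)  # (1,)
```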
The Problem: How to Adjust Hidden Layer Weights?
In MLPs, hidden layer weights indirectly affect the output, making it challenging to assign blame for errors.
Backpropagation works backward from the output, propagating error signals and computing gradients for each layer.
Algorithm Steps:
- Forward Pass: Compute predictions layer by layer.
- Calculate Loss: Measure error using a loss function.
- Backward Pass: Compute gradients for each layer using the chain rule.
- For each neuron in the output layer, calculate the error term (δ) based on the derivative of the loss with respect to the neuron’s output.
- For each neuron in the hidden layers, propagate the error backward:
δⱼ = f'(zⱼ) · Σₖ (δₖ · wₖⱼ)
∂L/∂wⱼᵢ = δⱼ · aᵢ
∂L/∂bⱼ = δⱼ
Here, f'(zⱼ) is the derivative of the activation function at neuron j, δₖ is the error term from the next layer, wₖⱼ is the weight from neuron j to neuron k, and aᵢ is the activation from the previous layer.
- Update Weights and Biases:
w_new = w_old - α · ∂L/∂w_old
b_new = b_old - α · ∂L/∂b_old
- Repeat: Iterate over epochs and batches.
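Putting the steps together, here is a NumPy sketch that trains a one-hidden-layer MLP on XOR with sigmoid activations and mean-squared-error loss; the hyperparameters (hidden size 4, learning rate 0.5, 5000 epochs) are illustrative choices, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
alpha = 0.5
losses = []

for epoch in range(5000):
    # Forward pass, layer by layer.
    a1 = sigmoid(X @ W1 + b1)
    y = sigmoid(a1 @ W2 + b2)
    losses.append(float(np.mean((y - T) ** 2)))
    # Backward pass via the chain rule; sigmoid'(z) = a·(1 - a).
    # Constant factors from the MSE derivative are folded into alpha.
    delta2 = (y - T) * y * (1 - y)            # output-layer error term
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # δⱼ = f'(zⱼ) · Σₖ δₖ·wₖⱼ
    # Gradient-descent updates: w ← w − α · ∂L/∂w.
    W2 -= alpha * a1.T @ delta2
    b2 -= alpha * delta2.sum(axis=0)
    W1 -= alpha * X.T @ delta1
    b1 -= alpha * delta1.sum(axis=0)

# The loss should shrink substantially if training succeeds.
print(losses[0], losses[-1])
```

Note how `delta1` implements the backward propagation equation from the previous section: the output error is pushed back through `W2` and scaled by the local sigmoid derivative before the weight updates are applied.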
Requirements for backpropagation:
- Differentiable Activation Functions: For the chain rule to work, activation functions in neurons must be differentiable. This is why functions like Sigmoid, Tanh, and ReLU (with a minor tweak at 0) are widely used, while the hard step function of a classic perceptron is not suitable for MLPs trained with backpropagation.
- Loss Function: A differentiable loss function is essential to quantify the error and guide the learning process.
- Chain Rule: This mathematical rule enables us to compute how a small change in a weight in an early layer propagates through the network to affect the final loss. It allows efficient computation of all gradients without calculating each one independently.
- Gradient Descent: This optimization algorithm uses the gradients computed by backpropagation to iteratively adjust the network’s parameters toward minimizing the loss function.
Why backpropagation matters:
- Efficiency: Computes all gradients in a single backward pass, far cheaper than estimating each one separately.
- Generalization: Trained networks can perform well on unseen data.
- Foundation for Deep Learning: Enables training of deep neural networks.
Notebook
You can find the implementation of a CNN for a binary classification task in this Jupyter Notebook. It also contains visual representations of the filters learned by the convolutional layers.