Multi-Layer Perceptrons

The Perceptron

The Perceptron is the composition of an affine function (a linear map plus a bias) with a nonlinear activation function \(\sigma\):
\[\begin{align} y(x,\theta) = \sigma(f(x,\theta)) = \sigma(w^\top x + b) = \sigma\left( \sum_i w_i x_i + b\right), \end{align}\] where \(w\) has the same length as \(x\).
It defines the simplest neural network: one neuron. The neuron receives inputs \(x\), linearly combines them into a scalar using weights \(w\) with a bias \(b\), and finally applies a nonlinearity \(\sigma(\cdot)\) to produce output \(y\).
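
As a concrete illustration, here is a minimal sketch of a single perceptron in PyTorch; the sigmoid activation, the three-dimensional input, and the weight and bias values are all made-up choices for the example.

import torch

w = torch.tensor([0.5, -1.0, 2.0])  # hypothetical weights
b = torch.tensor(0.1)               # hypothetical bias

def perceptron(x, w, b, sigma=torch.sigmoid):
    return sigma(w @ x + b)  # y = sigma(w^T x + b)

x = torch.tensor([1.0, 0.0, -1.0])  # an example input
y = perceptron(x, w, b)             # a scalar in (0, 1) for the sigmoid
print(y)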

Layers

How many layers does a perceptron have, given that it is a neural network? I’d say it has one layer, defined by the single neuron. Since the output of that neuron is the output of the network, it could be called the output layer. Is there an input layer? Often, the input is said to be produced by a set of input neurons forming an input layer. Does that mean the perceptron has two layers after all? The simple principle is that we only count neurons that perform computations as belonging to layers. The input neurons are just sources that compute nothing, so everything works out: the perceptron is a one-layer network consisting of a single computing neuron.

Logistic Regression

When the activation function is the sigmoid and the model predicts a scalar value, fitting a perceptron to the data is the same as solving a logistic regression problem. Once the data are loaded, we can run nearly the same code as before; the only changes to the model are the input dimension (four iris features instead of one) and the added activation function.

The linear model

class Lin(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 1)  # one input feature, one output

    def forward(self, x):
        return self.net(x)

model = Lin()

becomes

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 1),  # four iris features in, one score out
            nn.Sigmoid(),     # squash the score into (0, 1)
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
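
The training loop that produced the log below is not shown in this section. As a rough sketch, a hand-rolled gradient-descent loop of the kind used in the earlier lesson could look like the following, reusing the model = MLP() instance defined just above. The data selection, the mean-squared-error loss, the zero initialization, and the learning rate are assumptions for illustration (MSE on a zero-initialized model is consistent with the starting loss of 0.2500), not the exact code behind the output.

import torch
import torch.nn as nn
from sklearn.datasets import load_iris

# Assumed data: keep only versicolor (1) and virginica (2) and relabel them 0/1
iris = load_iris()
mask = iris.target != 0
X = torch.tensor(iris.data[mask], dtype=torch.float32)
y = torch.tensor(iris.target[mask] - 1, dtype=torch.float32).unsqueeze(1)

criterion = nn.MSELoss()  # assumed loss
lr = 0.01                 # hypothetical learning rate

with torch.no_grad():     # start from zero weights, as in the "Initial w" line below
    model.net[0].weight.zero_()
    model.net[0].bias.zero_()

for epoch in range(150000):
    loss = criterion(model(X), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():  # manual gradient-descent update
        for p in model.parameters():
            p -= lr * p.grad
    if epoch % 15000 == 0:
        print(f"Epoch {epoch+1}/150000 - Loss: {loss.item():.4f}")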
Initial w:  tensor([[0., 0., 0., 0.]]) b: tensor([0.])
Start training: 
Epoch 1/150000 - Loss: 0.2500
Epoch 15001/150000 - Loss: 0.0291
Epoch 30001/150000 - Loss: 0.0260
Epoch 45001/150000 - Loss: 0.0245
Epoch 60001/150000 - Loss: 0.0236
Epoch 75001/150000 - Loss: 0.0230
Epoch 90001/150000 - Loss: 0.0225
Epoch 105001/150000 - Loss: 0.0221
Epoch 120001/150000 - Loss: 0.0218
Epoch 135001/150000 - Loss: 0.0216
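
The classification report below compares thresholded predictions against the true labels. Here is a sketch of how such predictions can be obtained, assuming the tensors X and y and the trained model from the sketch above, and using the natural 0.5 threshold on the sigmoid output:

from sklearn.metrics import classification_report

with torch.no_grad():
    probs = model(X)                          # sigmoid outputs in (0, 1)
    preds = (probs >= 0.5).long().squeeze(1)  # predict class 1 (virginica) when >= 0.5

print("\nClassification Report:\n")
print(classification_report(y.long().squeeze(1).numpy(), preds.numpy(),
                            target_names=["versicolor", "virginica"]))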

Classification Report:

              precision    recall  f1-score   support

  versicolor       0.98      0.96      0.97        50
   virginica       0.96      0.98      0.97        50

    accuracy                           0.97       100
   macro avg       0.97      0.97      0.97       100
weighted avg       0.97      0.97      0.97       100

Multi-Layer Perceptron

The output of one set of perceptrons can be treated as the input to another set of perceptrons. Each such set is called a layer, and the stacked model is a multi-layer perceptron. Strictly speaking, using non-sigmoid activations yields a generic artificial feed-forward neural network (AFFNN), but nowadays people call an AFFNN an MLP regardless of the activation function.

When we have \(L\) layers, the neural network function may be written as
\[\begin{align} y = x_L = \sigma\ \circ \mathrm{Aff}_L \circ \sigma\ \circ \mathrm{Aff}_{L-1} \circ \sigma\ \circ \mathrm{Aff}_{L-2} \circ \cdots \circ \sigma\ \circ \mathrm{Aff}_{2} \circ \sigma\ \circ \mathrm{Aff}_{1} (x_0), \end{align}\] where \(\mathrm{Aff}_i(x) = W_i x + b_i\).
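
To connect the formula with code, here is a sketch of the forward pass written out by hand for \(L = 2\); the layer widths 4, 64, and 32 anticipate the torch.nn example below, and the random weights are placeholders.

import torch

d0, d1, d2 = 4, 64, 32                 # input width and two layer widths
x0 = torch.randn(d0)                   # example input

W1, b1 = torch.randn(d1, d0), torch.randn(d1)
W2, b2 = torch.randn(d2, d1), torch.randn(d2)

x1 = torch.sigmoid(W1 @ x0 + b1)       # sigma(Aff_1(x0))
x2 = torch.sigmoid(W2 @ x1 + b2)       # sigma(Aff_2(x1))
y = x2                                 # output, a vector in R^32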

An example code snippet of an MLP defined using torch.nn:

self.net = nn.Sequential(
    nn.Linear(4, 64),
    nn.Sigmoid(),
    nn.Linear(64, 32),
    nn.Sigmoid(),
)

The code says that this network has two layers, with the first layer having \(64\) neurons and the second layer having \(32\). So the output dimension is \(y \in \mathbb{R}^{32}\). The activations are sigmoids. We also expect the network to have parameters \(W_1 \in \mathbb{R}^{64 \times 4}\), \(b_1 \in \mathbb{R}^{64}\), \(W_2 \in \mathbb{R}^{32 \times 64}\) and \(b_2 \in \mathbb{R}^{32}\).
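
The listing below can be produced by iterating over the model's parameters, for example with a small loop like this (assuming the Sequential above is stored as self.net inside an nn.Module instance called model, as in the classes defined earlier):

for name, param in model.named_parameters():
    print(f"name: {name}\tsize: {param.size()}")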

name: net.0.weight   size: torch.Size([64, 4])
name: net.0.bias     size: torch.Size([64])
name: net.2.weight   size: torch.Size([32, 64])
name: net.2.bias     size: torch.Size([32])

Application: Iris Classification

We modify the code used in the lesson on PyTorch to build a full classifier for the iris dataset. The code below differs from that earlier code, and from the two-class case above, in a few ways:

  1. We standardize the data so that each feature has zero mean and unit variance.
  2. We split data into train and test sets.
  3. Instead of a model that outputs a single scalar (fine for two species), we now predict three numbers, one for each species. Whichever value is highest for an input determines its species.
  4. We use a cross-entropy loss instead of a mean squared error loss. It measures how far the model is from assigning a value of \(1\) to the true species and \(0\) to the others, and it pushes the model to behave like a probability distribution over the species. This is made explicit by a softmax, which nn.CrossEntropyLoss applies internally to the model's raw outputs (logits), so no softmax layer appears in the model below; the loss is written out after this list.
  5. We use an in-built optimizer instead of rolling our own gradient update.
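
For reference, the loss that nn.CrossEntropyLoss computes for a single example with raw outputs (logits) \(z \in \mathbb{R}^3\) and true class \(c\) is \[\begin{align} \ell(z, c) = -\log \frac{e^{z_c}}{\sum_{j} e^{z_j}}, \end{align}\] which is minimized when the softmax of \(z\) puts all of its probability mass on class \(c\). The full code:
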
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# 1. Load and preprocess data
iris = load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.long)

# 2. Define MLP model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 64),
            nn.Sigmoid(),
            nn.Linear(64, 32),
            nn.Sigmoid(),
            nn.Linear(32, 3),  # 3 classes
        )

    def forward(self, x):
        return self.net(x)

model = MLP()

# 3. Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# 4. Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/100 - Loss: {loss.item():.4f}")

# 5. Evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test)
    _, predicted = torch.max(outputs, 1)
    print("\nClassification Report:\n")
    print(classification_report(y_test.numpy(), predicted.numpy(), target_names=iris.target_names))
Epoch 10/100 - Loss: 0.8600
Epoch 20/100 - Loss: 0.4502
Epoch 30/100 - Loss: 0.2779
Epoch 40/100 - Loss: 0.1800
Epoch 50/100 - Loss: 0.1144
Epoch 60/100 - Loss: 0.0759
Epoch 70/100 - Loss: 0.0603
Epoch 80/100 - Loss: 0.0541
Epoch 90/100 - Loss: 0.0497
Epoch 100/100 - Loss: 0.0471

Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
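
Finally, to classify a new flower, the same scaler must be applied before the model, and the predicted species is the argmax over the three outputs. A sketch, reusing the scaler, model, and iris objects from the script above, with a made-up measurement:

new_flower = [[5.9, 3.0, 4.2, 1.5]]  # hypothetical sepal/petal measurements in cm
x_new = torch.tensor(scaler.transform(new_flower), dtype=torch.float32)

model.eval()
with torch.no_grad():
    logits = model(x_new)
    pred = logits.argmax(dim=1).item()

print(iris.target_names[pred])  # prints the predicted species name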