Multi-Layer Perceptrons
The Perceptron
The Perceptron is the composition of a linear function with an element-wise application of a nonlinear activation function \(\sigma\):
\[\begin{align}
y(x,\theta) = \sigma(f(x,\theta)) = \sigma(w^\top x + b) = \sigma\left( \sum w_i x_i + b\right),
\end{align}\] where \(w\) has the same length as \(x\).
It defines the simplest neural network: one neuron. The neuron receives inputs \(x\), linearly combines them into a scalar using weights \(w\) with a bias \(b\), and finally applies a nonlinearity \(\sigma(\cdot)\) to produce output \(y\).
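For concreteness, here is a minimal sketch (not from the lesson) of a single neuron computed directly in PyTorch; the particular weights, bias, and input below are made-up values.

import torch

# One neuron: y = sigmoid(w^T x + b), with arbitrary illustrative numbers.
x = torch.tensor([1.0, 2.0, 3.0])    # input vector
w = torch.tensor([0.5, -0.25, 0.1])  # weights, same length as x
b = torch.tensor(0.2)                # bias

y = torch.sigmoid(w @ x + b)         # scalar output of the perceptron
print(y)  # sigmoid(0.5) ≈ 0.62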
Layers
How many layers does a perceptron have, given that it is a neural network? I’d say it has one layer, defined by the single neuron. Since the output of that neuron is the output of the network, it could be called the output layer. Is there an input layer? Often, the input is said to be generated by a set of input neurons in the input layer. So is the perceptron really a single neuron? The simple principle is that we only count neurons performing computations as being part of layers. The input neurons are just sources without computations, so everything works out.
Logistic Regression
When the activation function is the sigmoid function and the model predicts a scalar value, fitting a perceptron model to the data is the same as solving a logistic regression problem. Once data are loaded, we can run nearly the same code with the exception of adding the activation function to the model.
The linear model

class Lin(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 1)

    def forward(self, x):
        return self.net(x)

model = Lin()

becomes
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
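The training loop that produced the log below is not shown in this section; the following is only a sketch of one that would, assuming (as the bullet list later in this section suggests) a hand-rolled gradient-descent update, zero-initialized parameters, a mean squared error loss, and tensors X (shape [N, 4]) and y (shape [N, 1]) holding the two selected iris species. The learning rate and all other specifics are assumptions.

import torch
import torch.nn as nn

# Sketch only: hyperparameters and printing format are guesses, not the original code.
nn.init.zeros_(model.net[0].weight)  # start from w = 0, b = 0, as the log shows
nn.init.zeros_(model.net[0].bias)
print("Initial w:", model.net[0].weight.data, "b:", model.net[0].bias.data)

lr = 0.1  # assumed learning rate
print("Start training:")
for epoch in range(150000):
    y_pred = model(X)                  # forward pass: Linear then Sigmoid
    loss = ((y_pred - y) ** 2).mean()  # mean squared error
    model.zero_grad()
    loss.backward()
    with torch.no_grad():              # hand-rolled gradient-descent step
        for p in model.parameters():
            p -= lr * p.grad
    if epoch % 15000 == 0:
        print(f"Epoch {epoch+1}/150000 - Loss: {loss.item():.4f}")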
Initial w: tensor([[0., 0., 0., 0.]]) b: tensor([0.])
Start training:
Epoch 1/150000 - Loss: 0.2500
Epoch 15001/150000 - Loss: 0.0291
Epoch 30001/150000 - Loss: 0.0260
Epoch 45001/150000 - Loss: 0.0245
Epoch 60001/150000 - Loss: 0.0236
Epoch 75001/150000 - Loss: 0.0230
Epoch 90001/150000 - Loss: 0.0225
Epoch 105001/150000 - Loss: 0.0221
Epoch 120001/150000 - Loss: 0.0218
Epoch 135001/150000 - Loss: 0.0216
Classification Report:

              precision    recall  f1-score   support

  versicolor       0.98      0.96      0.97        50
   virginica       0.96      0.98      0.97        50

    accuracy                           0.97       100
   macro avg       0.97      0.97      0.97       100
weighted avg       0.97      0.97      0.97       100
Multi-Layer Perceptron
The output of one set of perceptrons can be treated as the input to another set of perceptrons. Each set is called a layer of a multi-layer perceptron. When you use non-sigmoid activations, you get a generic artificial feed-forward neural network (AFFNN). Nowadays people call an AFFNN an MLP regardless of the activation function.
When we have \(L\) layers, the neural network function may be written as
\[\begin{align}
y = x_L = \sigma\ \circ \mathrm{Aff}_L \circ \sigma\ \circ \mathrm{Aff}_{L-1} \circ \sigma\ \circ \mathrm{Aff}_{L-2} \circ \cdots \circ \sigma\ \circ \mathrm{Aff}_{2} \circ \sigma\ \circ \mathrm{Aff}_{1} (x_0),
\end{align}\] where \(\mathrm{Aff}_i(x) = W_i x + b_i\).
An example code snippet of an MLP defined using torch.nn:
self.net = nn.Sequential(
    nn.Linear(4, 64),
    nn.Sigmoid(),
    nn.Linear(64, 32),
    nn.Sigmoid(),
)
The code says that this network has two layers, with the first layer having \(64\) neurons and the second layer having \(32\). So the output dimension is \(y \in \mathbb{R}^{32}\). The activations are sigmoids. We also expect the network to have parameters \(W_1 \in \mathbb{R}^{64 \times 4}\), \(b_1 \in \mathbb{R}^{64}\), \(W_2 \in \mathbb{R}^{32 \times 64}\) and \(b_2 \in \mathbb{R}^{32}\).
name: net.0.weight size: torch.Size([64, 4])
name: net.0.bias size: torch.Size([64])
name: net.2.weight size: torch.Size([32, 64])
name: net.2.bias size: torch.Size([32])
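A listing like the one above can be produced with a short loop over named_parameters (the exact print format here is a guess, not taken from the lesson):

# Print the name and shape of every learnable parameter in the model.
for name, param in model.named_parameters():
    print("name:", name, "size:", param.size())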
Application: Iris Classification
We modify the code used in the lesson on PyTorch to build a full classifier for the iris dataset. There are some differences from that code and from the two-class case:
- We transform the data to normalize it, so that each feature has zero mean and variance close to \(1\).
- We split data into train and test sets.
- Instead of a model that outputs a single scalar (fine for two species), we now predict three numbers, one for each species. Whichever value is highest for an input determines its species.
- We use a cross-entropy loss instead of a mean squared error loss. It measures how far the model is from predicting a value of \(1\) for the true species and \(0\) for the others, and it pushes the model to behave like a probability distribution over the species. This is often made explicit with a softmax final layer; here the model ends in a plain linear layer, since nn.CrossEntropyLoss applies the softmax internally (see the short check after this list).
- We use a built-in optimizer instead of rolling our own gradient update.
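The following small check (not part of the original code) illustrates the cross-entropy point: nn.CrossEntropyLoss applied to raw scores equals the negative log of the softmax probability assigned to the true class.

import torch
import torch.nn as nn

# Hypothetical raw scores (logits) for one sample, and its true class index.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])

ce = nn.CrossEntropyLoss()(logits, target)
manual = -torch.log_softmax(logits, dim=1)[0, target.item()]
print(ce.item(), manual.item())  # the two numbers agree

With these pieces in place, the full program is: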
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# 1. Load and preprocess data
iris = load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.long)

# 2. Define MLP model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 64),
            nn.Sigmoid(),
            nn.Linear(64, 32),
            nn.Sigmoid(),
            nn.Linear(32, 3),  # 3 classes
        )

    def forward(self, x):
        return self.net(x)

model = MLP()

# 3. Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# 4. Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/100 - Loss: {loss.item():.4f}")

# 5. Evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test)
    _, predicted = torch.max(outputs, 1)

print("\nClassification Report:\n")
print(classification_report(y_test.numpy(), predicted.numpy(), target_names=iris.target_names))
Epoch 10/100 - Loss: 0.8600
Epoch 20/100 - Loss: 0.4502
Epoch 30/100 - Loss: 0.2779
Epoch 40/100 - Loss: 0.1800
Epoch 50/100 - Loss: 0.1144
Epoch 60/100 - Loss: 0.0759
Epoch 70/100 - Loss: 0.0603
Epoch 80/100 - Loss: 0.0541
Epoch 90/100 - Loss: 0.0497
Epoch 100/100 - Loss: 0.0471
Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30