Step 1: Import NumPy

import numpy as np

Step 2: Define the Activation Function


def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)  # Derivative of the sigmoid, where x is assumed to already be sigmoid(z)
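
Note that sigmoid_derivative takes the output of the sigmoid rather than its raw input, which is convenient during backpropagation because those activations are already available. A quick sanity check (a minimal sketch):

# sigma(0) = 0.5 and sigma'(0) = 0.5 * (1 - 0.5) = 0.25
a = sigmoid(0)
print(a)                      # 0.5
print(sigmoid_derivative(a))  # 0.25, i.e. the sigmoid derivative evaluated at z = 0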

Step 3: Initialize Weights and Biases

np.random.seed(42)
input_dim = 2  # Number of input features
hidden_dim = 2  # Number of hidden neurons
output_dim = 1  # Number of output neurons

# Weights and biases as column vectors
W1 = np.random.rand(hidden_dim, input_dim)  # Shape: (h, d)
b1 = np.random.rand(hidden_dim, 1)  # Shape: (h, 1)
W2 = np.random.rand(output_dim, hidden_dim)  # Shape: (o, h)
b2 = np.random.rand(output_dim, 1)  # Shape: (o, 1)


print("Initial Weights (Input to Hidden):\n", W1)
print("Initial Weights (Hidden to Output):\n", W2)

Initial Weights (Input to Hidden):
 [[0.37454012 0.95071431]
 [0.73199394 0.59865848]]
Initial Weights (Hidden to Output):
 [[0.05808361 0.86617615]]
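
As a quick optional check (a minimal sketch), the shapes can be verified against the (h, d), (h, 1), (o, h), (o, 1) layout noted in the comments above:

# Verify that the parameter shapes match the dimensions defined above
assert W1.shape == (hidden_dim, input_dim)
assert b1.shape == (hidden_dim, 1)
assert W2.shape == (output_dim, hidden_dim)
assert b2.shape == (output_dim, 1)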

Step 4: Forward Pass

For a single example, the hidden-layer activation $ v^{[1]} $ (the value of the hidden layer after applying the activation function) is computed as follows:

\[z^{[1]} = W^{[1]} x + b^{[1]}, \qquad v^{[1]} = \sigma(z^{[1]})\]

Where:

  • $ x $ is the input vector (shape: $d \times 1$).
  • $ W^{[1]} $ is the weight matrix between the input and hidden layer (shape: $h \times d$).
  • $ b^{[1]} $ is the bias vector for the hidden layer (shape: $h \times 1$) (broadcasted over samples when batched).

Similarly, for the output layer:

\[z^{[2]} = W^{[2]} v^{[1]} + b^{[2]}, \qquad \hat{y} = \sigma(z^{[2]})\]

Where:

  • $ W^{[2]} $ is the weight matrix between the hidden and output layer (shape: $o \times h$).
  • $ b^{[2]} $ is the bias for the output layer (shape: $o \times 1$).

# Forward pass
def forward_pass(x, W1, b1, W2, b2):
    # x is a column vector (d, 1)
    z1 = np.dot(W1, x) + b1  # (h, d) * (d, 1) + (h, 1) => (h, 1)
    v1 = sigmoid(z1)  # Apply activation function (h, 1)

    z2 = np.dot(W2, v1) + b2  # (o, h) * (h, 1) + (o, 1) => (o, 1)
    y_hat = sigmoid(z2)  # Apply activation function (o, 1)
    
    return v1, y_hat
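
Because b1 and b2 are column vectors, NumPy broadcasting lets the same function process a whole batch of inputs stacked as columns, which is how the trained network is tested at the end of this notebook. A minimal sketch, using the four XOR inputs as the assumed batch:

# Each column of X_batch is one example, so X_batch has shape (d, n)
X_batch = np.array([[0, 0, 1, 1],
                    [0, 1, 0, 1]])
v1_batch, y_hat_batch = forward_pass(X_batch, W1, b1, W2, b2)
print(y_hat_batch.shape)  # (1, 4): one prediction per column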

Step 5: Loss Calculation

The loss function measures the difference between the actual and predicted outputs. For this implementation, we use the Mean Squared Error (MSE) as the loss function:

\[\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)} \right)^2\]

def compute_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Sample input and output for testing
x_sample = np.array([[0], [1]])  # Example input
y_true_sample = np.array([[1]])  # Example true output

# Perform forward pass
v1_sample, y_pred_sample = forward_pass(x_sample, W1, b1, W2, b2)

# Compute loss
loss_sample = compute_loss(y_true_sample, y_pred_sample)
print(f'Loss for the initialized weights: {loss_sample}')
Loss for the initialized weights: 0.05084754666230897

Step 6: Backpropagation

1. Output Layer Gradients:

We first compute the error at the output layer:

\[\delta^{[2]} = \frac{\partial L}{\partial z^{[2]}} = (\hat{y} - y) \cdot \sigma'(z^{[2]})\]

Where:

  • $ \delta^{[2]} $ is the error at the output layer.
  • $ \sigma'(z^{[2]}) = \hat{y} \cdot (1 - \hat{y}) $ is the derivative of the sigmoid function.
  • The constant factor 2 from differentiating the squared error is dropped here; it is simply absorbed into the learning rate.

The gradients for the weights and biases between the hidden and output layers are:

\[\frac{\partial L}{\partial W^{[2]}} = \delta^{[2]} \, v^{[1]T}, \qquad \frac{\partial L}{\partial b^{[2]}} = \delta^{[2]}\]

2. Hidden Layer Gradients:

Next, we compute the error at the hidden layer:

\[\delta^{[1]} = \left( W^{[2]T} \delta^{[2]} \right) \cdot \sigma'(z^{[1]})\]

Where:

  • $ \sigma'(z^{[1]}) = v^{[1]} \cdot (1 - v^{[1]}) $ is the derivative of the sigmoid function at the hidden layer.

The gradients for the weights and biases between the input and hidden layers are:

\[\frac{\partial L}{\partial W^{[1]}} = \delta^{[1]} \, x^T, \qquad \frac{\partial L}{\partial b^{[1]}} = \delta^{[1]}\]

def backpropagation(x, v1, y_hat, y, W1, b1, W2, b2, learning_rate):
    # 1. Output layer error
    error_output = y_hat - y  # Shape: (o, 1)
    d_output = error_output * sigmoid_derivative(y_hat)  # (o, 1)

    # 2. Hidden layer error (backpropagate)
    error_hidden = np.dot(W2.T, d_output)  # (h, 1)
    d_hidden = error_hidden * sigmoid_derivative(v1)  # (h, 1)

    # 3. Update weights and biases for output layer
    W2 -= learning_rate * np.dot(d_output, v1.T)  # (o, h) -= (o, 1) * (1, h)
    b2 -= learning_rate * d_output  # (o, 1)

    # 4. Update weights and biases for hidden layer
    W1 -= learning_rate * np.dot(d_hidden, x.T)  # (h, d) -= (h, 1) * (1, d)
    b1 -= learning_rate * d_hidden  # (h, 1)

    return W1, b1, W2, b2
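
Before training, the analytic gradients can be checked against finite differences. The sketch below (the entry W2[0, 0], the test point, and eps are arbitrary choices) compares the two for one weight; it reinstates the constant factor 2 that backpropagation drops, so the numbers should agree closely.

# Finite-difference check of dL/dW2[0, 0] for a single example
eps = 1e-6
x_chk = np.array([[0.0], [1.0]])
y_chk = np.array([[1.0]])

# Analytic gradient (including the factor 2 from the squared-error derivative)
v1_chk, y_hat_chk = forward_pass(x_chk, W1, b1, W2, b2)
d_out = 2 * (y_hat_chk - y_chk) * sigmoid_derivative(y_hat_chk)
grad_analytic = np.dot(d_out, v1_chk.T)  # Shape: (o, h)

# Numerical gradient via central differences on the loss
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
loss_plus = compute_loss(y_chk, forward_pass(x_chk, W1, b1, W2_plus, b2)[1])
loss_minus = compute_loss(y_chk, forward_pass(x_chk, W1, b1, W2_minus, b2)[1])
grad_numeric = (loss_plus - loss_minus) / (2 * eps)

print("Analytic dL/dW2[0, 0]: ", grad_analytic[0, 0])
print("Numerical dL/dW2[0, 0]:", grad_numeric)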


# Sample data
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
outputs = np.array([[0], [1], [1], [0]])  # XOR problem

# Training parameters
epochs = 10000
learning_rate = 0.1

# Training loop
for epoch in range(epochs):
    total_loss = 0
    for i in range(len(inputs)):
        x = inputs[i].reshape(-1, 1)
        y = outputs[i].reshape(-1, 1)

        v1, y_hat = forward_pass(x, W1, b1, W2, b2) # intermediate results (such as v1) are saved for backpropagation
        W1, b1, W2, b2 = backpropagation(x, v1, y_hat, y, W1, b1, W2, b2, learning_rate)

        total_loss += compute_loss(y, y_hat)

    if epoch % 1000 == 0:
        print(f'Epoch {epoch+1}, Loss: {total_loss / len(inputs)}')

Epoch 1, Loss: 0.3239537158130419
Epoch 1001, Loss: 0.24817211793449434
Epoch 2001, Loss: 0.21459076862934467
Epoch 3001, Loss: 0.15936526038440854
Epoch 4001, Loss: 0.04888994977531156
Epoch 5001, Loss: 0.016090123221200778
Epoch 6001, Loss: 0.00861826270292999
Epoch 7001, Loss: 0.005697165950060215
Epoch 8001, Loss: 0.00419533701162551
Epoch 9001, Loss: 0.0032952036447040957
# Test the trained model
_, final_output = forward_pass(inputs.T, W1, b1, W2, b2)
print("Predicted Output after Training:\n", final_output)
Predicted Output after Training:
 [[0.05464331 0.95032601 0.95006665 0.05336686]]
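
The outputs lie in (0, 1); thresholding at 0.5 (a conventional cut-off for a sigmoid output) recovers the XOR truth table:

# Convert the sigmoid outputs into hard 0/1 predictions
predictions = (final_output > 0.5).astype(int)
print("Thresholded predictions:", predictions)  # expected: [[0 1 1 0]]
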
Exercise: Extend the training loop to batch gradient descent rather than stochastic gradient descent. Which one converges faster? Plot the loss over time for both approaches.