General architecture of a neural network

A neural network is a collection of neurons (vertices) and connections (edges). Each neuron is characterised by an activation function, which receives an input and generates an output. A commonly used activation function is the sigmoid function, which takes a real number as input and outputs a number between 0 and 1.
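As a small illustration of the sigmoid function described above, the sketch below evaluates it at a few points. The name `sigmoid` and the sample inputs are chosen here for illustration only.

```r
# Sigmoid activation: maps any real number into the open interval (0, 1)
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Illustrative inputs: a large negative value, zero, and a large positive value
inputs <- c(-5, 0, 5)
outputs <- sigmoid(inputs)

print(outputs)
```

Large negative inputs land close to 0, large positive inputs close to 1, and an input of 0 gives exactly 0.5.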

Neurons are organised into layers, and consecutive layers are joined by a set of connections, each with its own weight. The outputs of one layer are multiplied by these weights before being fed to the next layer as part of its input.

Each neuron also has its own bias, which is loosely akin to the intercept term in linear regression. A neuron’s bias is added to the weighted sum of the previous layer’s outputs.
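To make the roles of the weights and the bias concrete, here is a minimal sketch of the input that a layer of two neurons would receive from a previous layer of three neurons. All numbers and the names `prev_output`, `weights`, `bias` and `z` are illustrative assumptions.

```r
# Outputs of the previous layer (3 neurons); values are illustrative
prev_output <- c(0.2, 0.7, 0.1)

# One row of weights per neuron in the next layer (2 neurons here)
weights <- matrix(c(0.5, -1.0,  2.0,
                    1.5,  0.0, -0.5), nrow = 2, byrow = TRUE)

# One bias per neuron in the next layer
bias <- c(0.1, -0.2)

# Weighted sum of the previous outputs, plus each neuron's bias
z <- as.vector(weights %*% prev_output) + bias

print(z)
```

Each entry of `z` is what the corresponding neuron passes through its activation function.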

The general procedure of creating a layer of neurons, giving each neuron a bias, then connecting the layer to the previous one with weights is repeated layer by layer until the neural network is complete. The first layer (the input layer) has a number of neurons equal to the number of predictors (i.e. each predictor activates its own neuron), whereas the last layer (the output layer) can vary in size. For example, if we’re doing regression, the last layer would be a single neuron that gives us our regression output for a given set of predictors. If we’re doing classification, we could use as many neurons as there are classes and label outputs according to the “most activated” neuron, i.e. the one with the highest value. The layers between the input and output layers are called hidden layers. They can vary in size and number.
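For the classification case, picking the “most activated” neuron amounts to taking the position of the largest output. The sketch below assumes three hypothetical classes and made-up output values.

```r
# Hypothetical class labels, one per output neuron
class_labels <- c("cat", "dog", "bird")

# Illustrative activations of the three output neurons
output_layer <- c(0.12, 0.81, 0.35)

# The predicted label is the one attached to the most activated neuron
predicted_class <- class_labels[which.max(output_layer)]

print(predicted_class)
```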

Neural network mathematics

The architecture described above can be neatly represented in terms of vectors and matrices. Consider the following two layers, focusing solely on the red connections and the first neuron of the second layer:

The input to be fed to the second layer’s first neuron would then be:

We can generalize this concept to the blue and purple connections in order to form the following expression for the 3 inputs to be fed to the second layer:

This can in turn be simplified by expressing the system of equations in terms of matrices and vectors:

The only remaining step is to apply the sigmoid function to every element of Z2 in order to obtain the output vector of the second layer. In other words, we have the following recurrence relation linking the output of the (n-1)th layer to the output of the nth layer:

Where “f” denotes the sigmoid function applied to every element of Zn.
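The relations described in this section can be written out as follows. This is a sketch under assumed notation, not necessarily the author's: the first layer has m neurons with outputs a_1, the second layer has 3 neurons, w_{ij} is the weight of the connection from neuron j of the first layer to neuron i of the second (the entries of W_2), and b_i are the biases of the second layer (the entries of b_2).

```latex
\begin{align*}
Z_{2,i} &= \sum_{j=1}^{m} w_{ij}\, a_{1,j} + b_{i}, \qquad i = 1, 2, 3 \\
Z_2 &= W_2\, a_1 + b_2 \\
a_n &= f(Z_n) = f\left( W_n\, a_{n-1} + b_n \right)
\end{align*}
```

The first line is the input to each neuron of the second layer, the second line is the same system in matrix form, and the third is the layer-to-layer recurrence.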

Tuning a neural network

In order to select optimal weights and biases, one needs to devise an algorithm that updates them with respect to a loss function. The loss function takes two inputs: what we predicted, and what we should have predicted. Its output reflects the magnitude of the error between the two. A common loss function for regression is the root mean squared error (RMSE), namely the square root of the mean of the squared residuals.
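The RMSE just described can be sketched in a couple of lines. The function name `rmse` and the sample values are illustrative assumptions.

```r
# Root mean squared error: square root of the mean of the squared residuals
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Illustrative predictions and targets
predicted <- c(2.5, 0.0, 2.0, 8.0)
actual    <- c(3.0, -0.5, 2.0, 7.0)

print(rmse(predicted, actual))
```

The loss is 0 only when every prediction matches its target, and it grows with the size of the residuals.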

As with most optimisation problems, selecting the best parameters is done with a calculus-based algorithm. Our goal is to find the gradient of the loss function with respect to the weights and biases. To do so, we apply the chain rule recursively, a process called backpropagation.
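Once a gradient is available, the parameters are moved a small step in the opposite direction, and the process is repeated. The sketch below shows that idea on a one-parameter toy loss, (w - 3)^2, rather than on a network. The names `loss_grad`, `w` and `step_size`, and the toy loss itself, are assumptions made for illustration.

```r
# Gradient of the toy loss (w - 3)^2 with respect to w
loss_grad <- function(w) {
  2 * (w - 3)
}

w <- 0            # initial guess
step_size <- 0.1  # how far to move against the gradient at each step

# Repeatedly step against the gradient
for (i in 1:100) {
  w <- w - step_size * loss_grad(w)
}

print(w)
```

After enough steps `w` settles near 3, the minimiser of the toy loss. Backpropagation supplies the same kind of gradient for every weight and bias of a network.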

First, we will take a set of predictors and feed it to our neural network. We’ll use the same notation as before, that is:

By the chain rule:

Similarly:

Where:

And finally:

Next, we compute the derivatives:

Where the last expression is a third-order tensor. The final expression for the gradient with respect to the bias is simply sigma_k, whereas the gradient with respect to Wk is the outer product of sigma_k with the vector that Wk multiplies, namely the activated output of the previous layer.
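The result of this derivation can be summarised by the standard backpropagation identities below. The notation is an assumption made for this sketch and may differ from the author's: a_k = f(Z_k) is the activated output of layer k, Z_k = W_k a_{k-1} + b_k, sigma_k is the gradient of the loss L with respect to Z_k, N is the index of the output layer, and \odot denotes element-wise multiplication.

```latex
\begin{align*}
\sigma_N &= \frac{\partial L}{\partial a_N} \odot f'(Z_N) \\
\sigma_k &= \left( W_{k+1}^{\top}\, \sigma_{k+1} \right) \odot f'(Z_k), \qquad k < N \\
\frac{\partial L}{\partial b_k} &= \sigma_k \\
\frac{\partial L}{\partial W_k} &= \sigma_k\, a_{k-1}^{\top}
\end{align*}
```

The second line is the recursive step: each sigma is obtained from the sigma of the layer after it, which is why the computation runs from the output layer backwards.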

To illustrate this procedure, let’s consider a network with the following architecture: three input neurons, a single hidden layer of four neurons, and one output neuron.

We’ll use R to tune the network. First, we declare all our parameters and propagate the input through the network before printing the first output:

#Declare the weights

w1 <- matrix(nrow = 4, ncol = 3, rnorm(mean = 0, sd = 1/(4*3), n = 4*3))
w2 <- matrix(nrow = 1, ncol = 4, rnorm(mean = 0, sd = 1/4, n = 4))

#Declare the bias

b1 <- matrix(nrow = 4, ncol = 1, 0)
b2 <- matrix(nrow = 1, ncol = 1, 0)

#Declare the input and target variable

x <- matrix(nrow = 3, ncol = 1, c(0.5, 0.2, -1.2))
y <- 0.987654321

#Declare the neurons

n1 <- matrix(nrow = 3, ncol = 1)
n2 <- matrix(nrow = 4, ncol = 1)
n3 <- matrix(nrow = 1, ncol = 1)

#Declare the sigmoid function

A <- function(x){return(1 / (1+exp(-x)))}

#Declare the derivative of the sigmoid function

A_prime <- function(x){return(exp(-x) / (1 + exp(-x))^2)}

#Declare the loss function

L <- function(x,y){return((x-y)^2)}

#Declare the derivative of the loss function

L_prime <- function(x,y){return(2*(x-y))}

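#Declare a step size for the gradient updates
#(t is also the name of R's transpose function; calls such as t(x) still resolve to the function)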

t <- 0.005

#Propagate the input through the network

n1 <- A(x)

z2 <- w1 %*% n1 + b1
n2 <- A(z2)

z3 <- w2 %*% n2 + b2
n3 <- A(z3)

print(paste("output:", n3), quote = FALSE)

Next, we obtain our gradient and adjust our weights and bias:

#Compute the sigmas (gradients of the loss with respect to z = W n + b)
#(For the sigmoid, A'(z3) = n3 * (1 - n3))

sigma1 <- L_prime(n3, y) * n3 * (1 - n3)
sigma2 <- sigma1 %*% w2 * t(A_prime(z2))

#Get the weight derivatives
#(It is computed as the outer product of the activated neurons and their z-gradients)

w1_deriv <- t(n1 %*% sigma2)
w2_deriv <- t(n2 %*% sigma1)

#Get the bias derivatives
#(They are the sigmas themselves)

b1_deriv <- t(sigma2)
b2_deriv <- sigma1

#Adjust the weights and bias

w1 <- w1 - t*w1_deriv
w2 <- w2 - t*w2_deriv

b1 <- b1 - t*b1_deriv
b2 <- b2 - t*b2_deriv
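One way to gain confidence in backpropagated derivatives is to compare them with central finite differences of the loss. The sketch below does this for the first weight matrix of a small 3-4-1 sigmoid network with squared-error loss. Everything is wrapped in a function so that it does not overwrite the variables used elsewhere; the function name `gradient_check`, the fixed parameter values and the step `h` are assumptions made for illustration.

```r
gradient_check <- function(h = 1e-6) {
  A <- function(x) 1 / (1 + exp(-x))   # sigmoid
  L <- function(x, y) (x - y)^2        # squared-error loss

  # Fixed, illustrative parameters for a 3-4-1 network
  w1 <- matrix(seq(-0.5, 0.6, length.out = 12), nrow = 4, ncol = 3)
  w2 <- matrix(c(0.3, -0.2, 0.5, 0.1), nrow = 1, ncol = 4)
  b1 <- matrix(0, nrow = 4, ncol = 1)
  b2 <- matrix(0, nrow = 1, ncol = 1)
  x  <- matrix(c(0.5, 0.2, -1.2), nrow = 3, ncol = 1)
  y  <- 0.9

  # Loss as a function of the first weight matrix only
  loss_given_w1 <- function(w) {
    n2 <- A(w %*% A(x) + b1)
    n3 <- A(w2 %*% n2 + b2)
    as.numeric(L(n3, y))
  }

  # Analytic gradient by backpropagation
  n1 <- A(x)
  n2 <- A(w1 %*% n1 + b1)
  n3 <- A(w2 %*% n2 + b2)
  sigma_out <- 2 * (n3 - y) * n3 * (1 - n3)            # 1 x 1
  sigma_hid <- (t(w2) %*% sigma_out) * n2 * (1 - n2)   # 4 x 1
  w1_grad   <- sigma_hid %*% t(n1)                     # 4 x 3

  # Numerical gradient by central finite differences, one entry at a time
  w1_num <- matrix(0, nrow = 4, ncol = 3)
  for (i in 1:4) {
    for (j in 1:3) {
      up <- w1; up[i, j] <- up[i, j] + h
      dn <- w1; dn[i, j] <- dn[i, j] - h
      w1_num[i, j] <- (loss_given_w1(up) - loss_given_w1(dn)) / (2 * h)
    }
  }

  # Largest absolute disagreement between the two gradients
  max(abs(w1_grad - w1_num))
}

max_diff <- gradient_check()
print(max_diff)
```

A very small `max_diff` suggests the analytic derivatives agree with the numerical ones. A large value usually points to a sign error or a missing factor in the chain rule.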

Finally, we loop the procedure until we get an absolute error less than or equal to 0.001:

#Repeat the forward and backward passes until the absolute error is at most 0.001
#(With this small step size the loop may need many iterations)

results <- c()

repeat{

  #Forward pass

  n1 <- A(x)

  z2 <- w1 %*% n1 + b1
  n2 <- A(z2)

  z3 <- w2 %*% n2 + b2
  n3 <- A(z3)

  #Record the current output

  results[length(results) + 1] <- n3

  if(abs(n3 - y) <= 0.001){break}

  #Compute the sigmas (gradients of the loss with respect to z = W n + b)
  #(For the sigmoid, A'(z3) = n3 * (1 - n3))

  sigma1 <- L_prime(n3, y) * n3 * (1 - n3)
  sigma2 <- sigma1 %*% w2 * t(A_prime(z2))

  #Get the weight derivatives
  #(Each is the outer product of the activated neurons and their z-gradients)

  w1_deriv <- t(n1 %*% sigma2)
  w2_deriv <- t(n2 %*% sigma1)

  #Get the bias derivatives
  #(They are the sigmas themselves)

  b1_deriv <- t(sigma2)
  b2_deriv <- sigma1

  #Adjust the weights and bias

  w1 <- w1 - t*w1_deriv
  w2 <- w2 - t*w2_deriv

  b1 <- b1 - t*b1_deriv
  b2 <- b2 - t*b2_deriv

}

#Plot the absolute error

plot.frame <- matrix(nrow = length(results), ncol = 2)
plot.frame[,1] <- abs(results - y)
plot.frame[,2] <- c(1:length(results))

colnames(plot.frame) <- c("Error", "Iteration")

library(ggplot2)

ggplot(data = data.frame(plot.frame), aes(x = Iteration, y = Error)) + geom_line(color = "blue")

By construction, the loop exits once the output is within 0.001 of the target value y.

GitHub repositories

Link to the neural network code (in R): https://github.com/frankfredj/NNetexample/blob/master/NNet.Example.R

Link to neural network code with dropout (in R): https://github.com/frankfredj/nnet/blob/master/NNetFile.

Link to neural network code with dropout and batch norm (in R): https://github.com/frankfredj/NNet-with-batch-norm/blob/master/NNet.Batch.Norm.R
