It’s Not Rocket Science: Machine Learning to the Rescue!

Kayla Vecera

The surge in the application of machine learning techniques to space exploration stems, in part, from the need to manage extremely large datasets: the data produced by space exploration efforts are enormous, and making sense of them calls for machine learning techniques.

The Square Kilometre Array (SKA) project, an international effort to construct the world's largest radio telescope, is a perfect illustration of this data overload. SKA uses thousands of dishes and one million antennas to monitor the entire sky in great detail, transmitting data at higher speeds than any system in existence today. Together, these radio telescopes generate approximately seven hundred terabytes of data per second. For perspective, that is roughly the amount of data transmitted through the internet every two days (Kirkovska, 2018).


Transmitting such high volumes of data from deep space to Earth, however, has generated its own issues. Because host planets differ in their rotational speeds, directions and orbits, these massive data packets must be transmitted to Earth during specific windows of time. This delay, which can last from months to years, depends on how far Earth lies from the spacecraft's host planet (Tian, 2018). The process is critical: since deep space missions do not return to Earth after launch, a spacecraft's tracking and communications systems are the only means of interacting with it (Knosp, 2018). If a data packet transmission is unsuccessful, the data can be permanently lost once it is overwritten by new data in the onboard memory (Tian, 2018).

Machine learning is instrumental in managing these transmission issues. In 2005, the Mars Express AI Tool (MEXAR2) was introduced by the Institute for Science and Technology (ISTC-CNR) (Tian, 2018). The learning algorithm leverages historical data to filter out superfluous data and determine the download schedule, ultimately optimizing data packet transmission (Tian, 2018). This data transmission technique is already employed by NASA and other space agencies worldwide in space research programs (Tian, 2018).


MEXAR2 weighs variables that may impact data downloading. More specifically, it weighs the overall science observation schedule for all Mars Express instruments in an attempt to predict which on-board data packets might later be lost due to memory conflicts. Based on these predictions, it generates a data download schedule and creates the commands needed to implement the download. Fred Jansen, ESA's mission manager for Mars Express, was quoted as saying, “With MEXAR2, any loss of stored data packets has been largely eliminated” (ESA, 2008).


Thus, machine learning is imperative for storing and processing the exponentially growing volume of incoming data and for turning it into valuable insights.

References:

Chowdhury, Amit Paul. “How Big Data advances are fuelling space exploration.” Analytics India Magazine, January 10, 2017, accessed February 4, 2019.

Tian, Robert. “The New Age of Discovery: Space Exploration and Machine Learning.” Medium, March 31, 2018, accessed February 4, 2019.

Martin, David. “Origin of the Universe (3268): Projects- Square Kilometre Array.” Jet Propulsion Laboratory: California Institute of Technology, accessed February 4, 2019.

Kirkovska, Anita. “Big Data and its impact in the space sector, one bit at a time.” Medium, September 14, 2018, accessed February 4, 2019.

Knosp, Brian. "Deep Space Communications.” NASA: Jet Propulsion Laboratory, accessed February 28, 2019.

ESA. “Artificial Intelligence Boosts Science From Mars.” ESA: Our Activities: Operations, April 29, 2008, accessed April 11, 2019.

General architecture of a neural network

Neural networks are an ensemble of neurons[1] (vertices) and connections[1] (edges). Each neuron is characterised by an activation function[1], which receives an input and generates an output. A commonly used activation function is the sigmoid function[2], which takes a real number as an input and outputs a number between 0 and 1.

[Figure: the sigmoid activation function]
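As a minimal illustration in R (the same function reappears as A in the worked example further down), the sigmoid can be written and evaluated as follows:

#The sigmoid activation function: it maps any real number to a value between 0 and 1

sigmoid <- function(x){return(1 / (1 + exp(-x)))}

sigmoid(c(-5, 0, 5))   #approximately 0.0067, 0.5000, 0.9933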



Neurons are sorted into layers[1], which are connected to each other by an ensemble of connections, each with its own associated weight[1]. The weights are used to multiply the output of a layer of neurons before feeding it to the next layer as part of its input.

[Figure: layers of neurons joined by weighted connections]

Neurons also each possess their own bias[1], which is loosely akin to the intercept term in linear regression. A neuron's respective bias is added to the weighted sum of the previous layer's output.

[Figure: a neuron's bias added to the weighted sum it receives from the previous layer]

The general procedure of creating a layer of neurons, giving it a bias, then connecting it to the previous layer with weights is repeated in a recursive fashion until our neural network is complete. The first layer (the input layer[1]) has a number of neurons equal to the number of predictors (i.e.: each predictor activates its own neuron), whereas the last layer (the output layer[1]) can vary in size. For example, if we're doing regression, the last layer would be a single neuron which would give us our regression output according to a set of predictors. If we're doing classification, we could use as many neurons as there are classes and label outputs with respect to the "most activated" neuron, i.e.: the one with the highest value. The layers in-between the input and output layers are called hidden layers[1]. They can vary in size and number.
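For instance, here is a minimal sketch in R of the two output conventions described above (the numbers are made up purely for illustration):

#Regression: the output layer is a single neuron whose value is the prediction

regression_output <- 0.42

#Classification: one output neuron per class; the predicted class is the "most activated" neuron

class_activations <- c(0.12, 0.85, 0.33)
predicted_class <- which.max(class_activations)

print(predicted_class)   #prints 2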









Neural network mathematics

The architecture described above can be neatly represented in terms of vectors and matrices. Consider the following two layers, while focusing solely on the red connections and the first neuron of the second layer:

[Figure: two consecutive layers, with the connections into the first neuron of the second layer highlighted in red]

The input to be fed to the second layer’s first neuron would then be:

$(Z_2)_1 = \sum_{j} (W_2)_{1j}\,(a_1)_j + (b_2)_1,$

where $a_1$ denotes the vector of outputs of the first layer, and $W_2$ and $b_2$ denote the weights and bias of the second layer.

 

We can generalize this concept to the blue and purple connections in order to form the following expression for the 3 inputs to be fed to the second layer:

$(Z_2)_i = \sum_{j} (W_2)_{ij}\,(a_1)_j + (b_2)_i, \qquad i = 1, 2, 3$

This can in turn be simplified by expressing the system of equations as matrices and vectors:

$Z_2 = W_2\,a_1 + b_2$

The only remaining step is to apply the sigmoid function to every element of $Z_2$ in order to obtain the output vector of the second layer. In other words, we have the following recurrence relationship linking the output of the (n-1)th layer to the output of the nth layer:

$a_n = f(Z_n), \qquad Z_n = W_n\,a_{n-1} + b_n$

where “f” denotes the sigmoid function being applied to every element of $Z_n$.

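As a minimal sketch of this recurrence in R (the 3-4-2 layer sizes, constant weights and input values below are arbitrary choices made purely for illustration):

#The sigmoid, applied element-wise

f <- function(x){return(1 / (1 + exp(-x)))}

#One weight matrix and one bias vector per layer of a hypothetical 3-4-2 network

weights <- list(matrix(0.1, nrow = 4, ncol = 3), matrix(0.1, nrow = 2, ncol = 4))
biases <- list(matrix(0, nrow = 4, ncol = 1), matrix(0, nrow = 2, ncol = 1))

#Output of the input layer

a <- matrix(c(0.5, -1, 2), ncol = 1)

#Apply a_n = f(W_n a_(n-1) + b_n) layer by layer

for(n in seq_along(weights)){
  Z <- weights[[n]] %*% a + biases[[n]]
  a <- f(Z)
}

print(a)   #output vector of the last layer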




Tuning a neural network

 

In order to select optimal weights and bias, one needs to devise an algorithm to update weights and bias with respect to a loss function[3]. The loss function takes two variables as its input: what we predicted, and what we should have predicted. Its output is based on the magnitude of the error between its inputs. A common loss function for regression is the root mean squared error (RMSE), namely the square root of the mean of the squared residuals.
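As a small R sketch of the RMSE described above (the prediction and target vectors are made up purely for illustration):

#Root mean squared error: the square root of the mean of the squared residuals

rmse <- function(predicted, actual){return(sqrt(mean((predicted - actual)^2)))}

rmse(c(1.2, 0.8, 2.5), c(1.0, 1.0, 2.0))   #approximately 0.33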

 

As with most optimisation problems, selecting the best parameters is done with a calculus-based algorithm. Our goal will be to find the gradient of the loss function with respect to the weights and bias. To do so, we will use the chain rule in a recursive manner by resorting to a process called backpropagation[1].

  

First, we will take a set of predictors and feed it to our neural network. We'll use the same notation as before, that is:

$a_n = f(Z_n), \qquad Z_n = W_n\,a_{n-1} + b_n,$

and we write $L(a_N, y)$ for the loss evaluated at the output $a_N$ of the final layer. Let $\sigma_k = \partial L / \partial Z_k$ denote the gradient of the loss with respect to the $k$-th layer's pre-activation. By the chain rule:

$\sigma_N = \dfrac{\partial L}{\partial a_N} \odot f'(Z_N).$

Similarly, the sigma of a layer can be obtained from the sigma of the layer that follows it, which leads to:

$\sigma_{k-1} = \left(W_k^{\top}\,\sigma_k\right) \odot f'(Z_{k-1}),$

where $\odot$ denotes element-wise multiplication. And finally, the parameter gradients follow from:

$\dfrac{\partial L}{\partial W_k} = \sigma_k\,\dfrac{\partial Z_k}{\partial W_k}, \qquad \dfrac{\partial L}{\partial b_k} = \sigma_k\,\dfrac{\partial Z_k}{\partial b_k}.$

Next, we compute the derivatives:

$\dfrac{\partial (Z_k)_i}{\partial (b_k)_p} = \delta_{ip}, \qquad \dfrac{\partial (Z_k)_i}{\partial (W_k)_{pq}} = \delta_{ip}\,(a_{k-1})_q,$

where the last expression is a 3rd order tensor. The final expression for the gradient with respect to bias is simply $\sigma_k$, whereas the gradient for $W_k$ is the outer product of the vector acting on its columns, which would be the previous layer of activated neurons $a_{k-1}$, with the vector $\sigma_k$:

$\dfrac{\partial L}{\partial b_k} = \sigma_k, \qquad \dfrac{\partial L}{\partial W_k} = \sigma_k\,a_{k-1}^{\top}.$ [1]
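As a quick dimension check, consider the hidden layer of the 3-4-1 network used in the R example below (these sizes come from that example, not from the derivation itself): its sigma is a $4 \times 1$ vector and the activated inputs form a $3 \times 1$ vector, so

$\dfrac{\partial L}{\partial W_{\text{hidden}}} = \sigma_{\text{hidden}}\,a_{\text{input}}^{\top} \in \mathbb{R}^{4 \times 3}, \qquad \dfrac{\partial L}{\partial b_{\text{hidden}}} = \sigma_{\text{hidden}} \in \mathbb{R}^{4 \times 1},$

which matches the $4 \times 3$ weight matrix w1 and the $4 \times 1$ bias b1 declared in the code.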


To illustrate this procedure, let’s consider a network with the following architecture:

[Figure: a network with 3 input neurons, 4 hidden neurons and 1 output neuron]


We’ll use R to tune the network. First, we declare all our parameters and propagate the input through the network before printing the first output:


#Declare the weights

w1 <- matrix(nrow = 4, ncol = 3, rnorm(mean = 0, sd = 1/(4*3), n = 4*3))
w2 <- matrix(nrow = 1, ncol = 4, rnorm(mean = 0, sd = 1/4, n = 4))

#Declare the bias

b1 <- matrix(nrow = 4, ncol = 1, 0)
b2 <- matrix(nrow = 1, ncol = 1, 0)

#Declare the input and target variable

x <- matrix(nrow = 3, ncol = 1, c(0.5, 0.2, -1.2))
y <- 0.987654321

#Declare the neurons

n1 <- matrix(nrow = 3, ncol = 1)
n2 <- matrix(nrow = 4, ncol = 1)
n3 <- matrix(nrow = 1, ncol = 1)

#Declare the sigmoid function

A <- function(x){return(1 / (1+exp(-x)))}

#Declare the derivative of the sigmoid function

A_prime <- function(x){return(exp(-x) / (1 + exp(-x))^2)}

#Declare the loss function

L <- function(x,y){return((x-y)^2)}

#Declare the derivative of the loss function

L_prime <- function(x,y){return(2*(x-y))}

#Declare a step-size for backpropagation

t <- 0.005

#Feed the input through the network

n1 <- A(x)

z2 <- w1 %*% n1 + b1
n2 <- A(z2)

z3 <-w2 %*% n2 + b2
n3 <- A(z3)

print(paste("output:", n3), quote = FALSE)


Next, we obtain our gradient and adjust our weights and bias:


#Compute the sigmas (gradients with respect to z = Aw + b)

sigma1 <- L_prime(n3, y)
sigma2 <- sigma1 %*% w2 * t(A_prime(z2))
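#(Note: sigma1 leaves out the chain-rule factor A_prime(z3); since that factor is
#a positive scalar here, omitting it only rescales the size of the updates and does
#not change their direction)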

#Get the weight derivatives
#(Each is computed as the outer product of the previous layer's activated neurons and the corresponding z-gradients)

w1_deriv <- t(n1 %*% sigma2)
w2_deriv <- t(n2 %*% sigma1)

#Get the bias derivatives
#(They are the sigmas themselves)

b1_deriv <- t(sigma2)
b2_deriv <- sigma1

#Adjust the weights and bias

w1 <- w1 - t*w1_deriv
w2 <- w2 - t*w2_deriv

b1 <- b1 - t*b1_deriv
b2 <- b2 - t*b2_deriv
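As an optional sanity check (a hypothetical addition that reuses the objects declared above and is not part of the original script), we can recompute the gradient of the loss with respect to w1 exactly as in the derivation, including the A_prime(z3) factor, and compare one of its entries with a centred finite difference:


#Hypothetical check of the backpropagation formulas against a finite difference

loss_given_w1 <- function(w1_try){
  a1 <- A(x)
  a2 <- A(w1_try %*% a1 + b1)
  a3 <- A(w2 %*% a2 + b2)
  return(L(a3, y))
}

#Analytic gradient of the loss with respect to w1, using the full chain rule

a1 <- A(x)
z2_chk <- w1 %*% a1 + b1
a2 <- A(z2_chk)
z3_chk <- w2 %*% a2 + b2
a3 <- A(z3_chk)

s3 <- L_prime(a3, y) * A_prime(z3_chk)     #output-layer sigma
s2 <- (t(w2) %*% s3) * A_prime(z2_chk)     #hidden-layer sigma (4 x 1)
w1_grad <- s2 %*% t(a1)                    #outer product sigma_k a_(k-1)^T (4 x 3)

#Centred finite difference for the (1, 1) entry of w1

eps <- 1e-6
w1_up <- w1; w1_up[1, 1] <- w1_up[1, 1] + eps
w1_dn <- w1; w1_dn[1, 1] <- w1_dn[1, 1] - eps
fd_grad <- (loss_given_w1(w1_up) - loss_given_w1(w1_dn)) / (2 * eps)

print(c(w1_grad[1, 1], fd_grad))           #the two numbers should nearly agree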

 

Finally, we loop the procedure until we get an absolute error less than or equal to 0.001:


#Let's loop through the procedure a few times

results <- c()

repeat{

n1 <- A(x)

z2 <- w1 %*% n1 + b1
n2 <- A(z2)

z3 <-w2 %*% n2 + b2
n3 <- A(z3)

results <- c(results, n3)

if(abs(n3 - y) <= 0.001){break}

#Compute the sigmas (gradients with respect to z = Aw + b)

sigma1 <- L_prime(n3, y)
sigma2 <- sigma1 %*% w2 * t(A_prime(z2))

#Get the weight derivatives
#(Each is computed as the outer product of the previous layer's activated neurons and the corresponding z-gradients)

w1_deriv <- t(n1 %*% sigma2)
w2_deriv <- t(n2 %*% sigma1)

#Get the bias derivatives
#(They are the sigmas themselves)

b1_deriv <- t(sigma2)
b2_deriv <- sigma1

#Adjust the weights and bias

w1 <- w1 - t*w1_deriv
w2 <- w2 - t*w2_deriv

b1 <- b1 - t*b1_deriv
b2 <- b2 - t*b2_deriv

}

#Plot the absolute error

plot.frame <- matrix(nrow = length(results), ncol = 2)
plot.frame[,1] <- abs(results - y)
plot.frame[,2] <- c(1:length(results))

colnames(plot.frame) <- c("Error", "Iteration")

library(ggplot2)

ggplot(data = data.frame(plot.frame), aes(x = Iteration, y = Error)) + geom_line(color = "blue")
[Plot: absolute error against iteration]

As expected, the output is pretty close to the actual value of y.
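For a quick look at the final fit (a hypothetical follow-up using the objects created by the loop above):

#Inspect the final output and the number of iterations the loop needed

print(paste("final output:", tail(results, 1)), quote = FALSE)
print(paste("target:", y), quote = FALSE)
print(paste("iterations:", length(results)), quote = FALSE)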

Github repositories

Link to the neural network code (in R): https://github.com/frankfredj/NNetexample/blob/master/NNet.Example.R 

Link to neural network code with dropout (in R): https://github.com/frankfredj/nnet/blob/master/NNetFile.

Link to neural network code with dropout and batch norm (in R): https://github.com/frankfredj/NNet-with-batch-norm/blob/master/NNet.Batch.Norm.R

  

References

[1]  Rojas, R. (1996). Neural Networks: A Systematic Introduction. Springer.

[2]  deepai.org. (n.d.). What is an Activation Function.

[3]  Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.