What Does Learning Mean For A Machine?

Jun 25, 2024

Unveiling the secrets of machine learning: How machines truly learn and transform our world.

In the age of artificial intelligence, the term “machine learning” has become ubiquitous, resonating far beyond the realms of data science and technology. From predicting our next favorite song to diagnosing diseases, machine learning is revolutionizing how we interact with the world. In fact, I’m not a big fan of the term AI, because it is thrown around rather loosely these days, and most of the time people really mean machine learning rather than artificial intelligence. But what does “learning” truly mean for a machine?

This article aims to demystify the concept of machine learning by exploring it from different perspectives. We will start with a basic explanation that highlights the essence of machine learning in simple terms. We will then delve into more detailed descriptions, shedding light on the mechanisms and theories that underpin this transformative technology. By the end, you’ll have a comprehensive understanding of what it means for a machine to “learn” and why this process is pivotal in the advancement of artificial intelligence.


The Intuition Behind Learning

Grasping the essence of learning isn’t a simple feat. Humans naturally pick up knowledge through exposure to various experiences, steadily shaping our understanding of the world. Computers, on the other hand, tackle this differently: they sift through data. From a computer’s viewpoint, learning revolves around improving at a specific task through data analysis. The ultimate aim is for the machine to execute this task effectively on new, unseen data after the learning phase.

The main difference between a classical algorithm and a so-called machine learning algorithm is that the machine learning algorithm is not explicitly programmed to solve a task. Rather, it is programmed to find the best possible answer by iteratively adjusting its parameters to the data it is given.

A Mathematical Angle

Mathematically, the problem can be defined in simple terms: knowing x and y such that f(x) = y, how can we find the best approximation of the function f? This process of learning f while being provided with x and y is done by iteratively adjusting a set of parameters θ so that our current estimate of f is the best one according to the data we have seen so far.
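
In symbols, and with notation that varies from one presentation to another, the goal can be summarized as finding the parameters θ for which the parameterized estimate f_θ stays as close as possible to the observations:

f_θ(x) ≈ y for every observed pair (x, y)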

The algorithm learns the best estimation of f while being provided with x and y.

There can be some variation to this definition, as sometimes the algorithm is only provided with x and not y. For the purposes of this article, we will stick with this definition.


A Simple Case: The Linear Regression

Let’s say we have gathered some observations about two variables. This could be house prices and their corresponding sizes, the income of a group of people and their level of education, or the same group of people’s weights and heights. We want to find the relationship between these two variables. If we assume that this relationship is linear, the idea would be to find the straight line that best describes the mapping between those variables. So if we take our equation from the previous section, we are trying to find the best approximation of f(x) such that:

f(x) = a·x + b

which is simply the equation of a straight line. In the previously stated example, x could be the house price and y could be its size. But what is our θ? In this case, we want to learn the a and b that best fit the data, so θ = {a, b}. Now we need to find a way to learn these parameters.

An Iterative Process

As we stated earlier, the main difference compared to a classical algorithm is that we will not provide the program with an explicit way to find the exact answer to this problem. What we will do, however, is show the algorithm a subset of the data at each step of the learning process and refine our estimate iteratively. This might seem abstract at first, but it will become clearer as we define all the concepts related to this process. The important bit is to understand that this process is iterative, so our estimate at time i+1 should be better than our estimate at time i.

The Loss Function

Our goal is to find the best approximation of this function. But in order to say that it is the best one, we need to define a value which tells us how close our estimate is to the actual data. Ultimately, we will try to minimize that value so that our approximation is as close as possible to the data. Such a criterion is called the error, and we will denote it J. Our goal will be to find the set of parameters a and b such that J is minimal. There are many functions we can use for the error (also commonly called the loss or cost function). A common choice for linear regression is the mean squared error, defined as follows.

J(a, b) = (1/n) · Σᵢ (ŷᵢ − yᵢ)²

Mean Squared Error function, where n is the number of observations.

where ŷ is our current approximation of y. The idea is that once we have optimized this function with respect to the x and y values that we already know, our approximation will accurately predict previously unseen values (in this case y when provided with unseen x). But how can we make sure that we choose our parameters so that we decrease this value over time? This question takes us to a core concept in machine learning, called gradient descent.
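
To make this concrete, here is a minimal sketch in Python of the error for a candidate line, assuming the observations are stored in NumPy arrays x and y (the names and the data are purely illustrative):

import numpy as np

# Mean squared error of the candidate line y = a*x + b on the observed data
def mse(a, b, x, y):
    y_hat = a * x + b                 # predictions of the current estimate
    return np.mean((y_hat - y) ** 2)  # average squared difference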

Gradient Descent

There is one last piece missing before we can derive the algorithm. It is easy to compute the error using our estimate of f and the actual data, but how do we know in which direction to go to select the next set of parameters θ = {a,b} that will make this error decrease? This is where gradient descent comes into play. Gradient descent is an optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In practice, we compute the derivative of the loss function with respect to each learnable parameter θ, then subtract this value from the original parameter to move in the direction of the steepest descent.
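
For our two parameters, this gives the following update rule, where α is a small positive coefficient:

a ← a − α · ∂J/∂a
b ← b − α · ∂J/∂b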

Alpha (α) is a parameter that we choose to be relatively small in order not to overshoot when updating the parameters, which would risk never finding a local minimum of J. Alpha is called the learning rate because it describes the rate at which the algorithm makes its learning updates. If the learning rate is very small, the algorithm will converge slowly; if it is very high, on the other hand, it will take large steps, at the risk of overshooting and never converging at all.

If we derive the values of the partial derivatives using the previously defined loss, we get:

∂J/∂a = (2/n) · Σᵢ (ŷᵢ − yᵢ) · ∂ŷᵢ/∂a
∂J/∂b = (2/n) · Σᵢ (ŷᵢ − yᵢ) · ∂ŷᵢ/∂b

And if we refactor the last term of both equations (since ∂ŷᵢ/∂a = xᵢ and ∂ŷᵢ/∂b = 1):

∂J/∂a = (2/n) · Σᵢ (ŷᵢ − yᵢ) · xᵢ
∂J/∂b = (2/n) · Σᵢ (ŷᵢ − yᵢ)

At each step of the algorithm, we recompute the error J to evaluate our current estimate. If we are satisfied with the value, we stop the process; if not, we take one more step and update the parameters.

This image illustrates the process of gradient descent. In the graph, y represents the loss function J(a,b) while x represents a parameter of the model (either a or b in this case). At each step, the gradient is computed and we move x in the direction of the steepest descent, in the hope of finding a minimum of J(a,b). (source)

With this last addition, we have everything that we need to define the learning algorithm.

The Code

Let’s dive into some pseudo code to see how everything fits together.

# Data is given
X = [...]
y = [...]
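
Fleshed out into runnable Python, a minimal sketch of the whole loop could look like this (the data, the learning rate and the number of steps are illustrative choices, not the only way to do it):

import numpy as np

# Hypothetical observations
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

a, b = 0.0, 0.0   # initial guess for the parameters
alpha = 0.01      # learning rate
n = len(X)

for step in range(1000):
    y_hat = a * X + b                          # current predictions
    J = np.mean((y_hat - y) ** 2)              # mean squared error
    dJ_da = (2 / n) * np.sum((y_hat - y) * X)  # partial derivative w.r.t. a
    dJ_db = (2 / n) * np.sum(y_hat - y)        # partial derivative w.r.t. b
    a -= alpha * dJ_da                         # step in the direction of steepest descent
    b -= alpha * dJ_db

print(a, b)  # should approach the slope and intercept that best fit the data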

Learning vs Classical Algorithm

We have described how the learning process works for linear regression. If we had to do the same exercise using a classical algorithm, we would model this as an optimization problem, where one would try to minimize the residual sum of squares by setting its gradient to 0.
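
Spelled out for our two parameters a and b, setting the gradient of the residual sum of squares to 0 amounts to solving the following system:

∂/∂a Σᵢ (a·xᵢ + b − yᵢ)² = 0
∂/∂b Σᵢ (a·xᵢ + b − yᵢ)² = 0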

The linear regression problem if we are trying to solve it using classical optimization modelling.

For the univariate case, this set of equations is quite simple and probably much easier to solve directly than with gradient descent. But as we increase the number of dimensions, these formulas become much more complicated and expensive to compute. Gradient descent does not suffer from such scalability issues.
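
For the univariate case, the direct solution fits in a couple of lines of Python (a sketch, reusing the hypothetical X and y arrays from the code above):

# Closed-form least-squares solution for the line y = a*x + b
x_mean, y_mean = X.mean(), y.mean()
a = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - a * x_mean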

Why do we say that one method learns but the other does not? In essence, gradient descent learns the parameters a and b by continually refining them until they make up a suitable approximation of the actual data. The linear algebra solution simply computes the optimal solution, without any iterative learning process. This process of iterative learning for a linear regression is illustrated by the figure below.

Illustration of the learning process for a linear regression. At each step, a new set of parameters {a,b} is computed to fit the data better than the previous iteration, until convergence is reached.

Although this use case is useful to illustrate the process of learning, the strong assumptions of this model are usually not suitable for more complex use cases. For problems involving nonlinear relationships, interactions between features, or high-dimensional data, more advanced techniques are required. In the next section, we will see how we can extend the concepts we have seen so far to model more complicated functions.


A Less Simple Case: The Neural Network

Unlike linear regression, which assumes a straightforward linear relationship between inputs and outputs, neural networks are designed to capture complex, nonlinear patterns in data. This capability stems from their architecture, which mimics the structure of the human brain, allowing them to learn and generalize from vast amounts of data.

The Neuron

In a neural network, the basic building block is the neuron, also known as a node or perceptron. Each neuron performs a simple computation that, when combined with many other neurons, enables the network to solve complex problems. A neuron is made of inputs, weights, a bias term and an activation function. Combined, these give us the neuron’s final output.

The architecture of a neuron. 

Let’s say we are trying to classify pictures of dogs and cats. The input in this case would be an array of numbers where an element of the array corresponds to a pixel in the image. If the image is in black-and-white then each number can have a value between 0 and 1 depending on the brightness of the pixel.

Each input xᵢ is associated with a weight wᵢ. The weights are parameters that are learned during the learning process (also called training). They determine the importance of each input in contributing to the neuron’s output. The neuron computes a weighted sum of its inputs. This sum is often called the activation of the neuron. Mathematically, it is represented as:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Here, b is the bias term, an additional learned parameter that ensures that the neuron can learn from patterns that do not pass through the origin. In our previous example of the linear regression, a was a weight and b was a bias.

The weighted sum z is then passed through an activation function g(z). The activation function introduces a non-linearity into the model, allowing the network to learn and represent complex patterns. There are many activation functions to choose from, like the sigmoid function, tanh and more. For the purpose of this blog post, we will stick with the Rectified Linear Unit function (ReLU), which is fairly common but also very simple.

g(z) = max(0, z)

ReLU activation function.

That gives us the output of our neuron. While the mathematical setup of the neuron enables us to compute a simple nonlinear transformation of the input, it is insufficient for capturing the complex, hierarchical patterns often present in real-world data. This is why we need to combine these neurons into a network.
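
A single neuron is simple enough to write out directly. Here is a minimal sketch in Python, with purely illustrative input values, weights and bias:

# A single neuron: weighted sum of the inputs plus a bias, passed through ReLU
def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum z
    return relu(z)                                          # activation g(z)

output = neuron([0.2, 0.8, 0.5], weights=[0.4, -0.1, 0.7], bias=0.1)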

The Network

Transitioning from single neurons to neural networks significantly enhances a model’s capability to capture complex patterns in data. But how does it work in practice? There are two main actions at play here:

  • Layering: we combine many neurons into layers of neurons. Each layer has its own set of weights and biases, and the activation function is the same for every neuron of the layer.

  • Stacking: once our layers are defined, we stack them on top of each other. This means that the inputs of one layer are the outputs of the previous one. The input of the first layer is simply the input data and the output of the last layer is the output of the model; every layer in between is a hidden layer. The output of each neuron is fed into every neuron of the following layer. Stacking in this manner results in what we call a fully connected (or dense) neural network.

Representation of a neural network.
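
To make layering and stacking concrete, here is a minimal sketch of a forward pass through a small fully connected network; the layer sizes are arbitrary and the weights are random placeholders rather than learned values:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer(inputs, weights, biases):
    # Each row of `weights` holds the weights of one neuron in the layer
    return relu(weights @ inputs + biases)

# Hypothetical network: 3 inputs -> hidden layer of 4 neurons -> output layer of 1 neuron
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.2, 0.8, 0.5])   # input data
hidden = layer(x, W1, b1)       # hidden layer output
output = layer(hidden, W2, b2)  # model output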

The output layer of a neural network can be a single neuron or a set of neurons depending on the task. If we are trying to predict whether a dog or a cat is in an image, the output layer will be a single neuron that outputs 1 if there is a dog in the image and 0 if there is a cat. If we are trying to predict the next frame in a video given the previous frame, then the output will be a set of neurons of the same dimension as the input, each corresponding to a pixel of the next frame.
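
For the dog-or-cat example, one common option (among several) is to pass the single output neuron through a sigmoid so it can be read as a probability; this reuses the output value from the sketch above:

# Squash the raw output into a value between 0 and 1
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p_dog = sigmoid(output)  # close to 1 -> "dog", close to 0 -> "cat"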

We have seen previously how we can learn two parameters for a linear regression. But how can we apply gradient descent in such a complex model with potentially thousands of parameters? The answer to this question lies in the concept of back-propagation.

Learning with Back-Propagation

Back-propagation is an algorithm that computes the gradient of the loss function with respect to each weight using the chain rule, enabling the network to learn from its errors and improve its predictions iteratively. It is nothing different from what we have seen previously, except that this time we need to propagate the gradient backwards, from the loss function at the output layer through all the layers up to the input.

The algorithm is divided into two steps, the forward pass and the backward pass. In the forward pass, the input data is passed through the network layer by layer. Each neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinear activation function. This process continues until the final output is produced. The output is then compared to the actual target values to calculate the loss, which quantifies the error of the network’s prediction.

Illustration of the forward pass without computation of the loss, which would be an additional layer after Layer 2.

The backward pass, as its name points out, happens from the output layer back to the input. The role of the backward pass is to adjust the weights of the network so that the error is minimized with regard to the input and output that were just forwarded through the network. At a high level, it tries to change the parameters of the network such that the next time this input comes up, the network will predict the right answer. We have seen how this works previously in the simple case of linear regression. The tricky part here is computing the partial derivatives of the loss with respect to each parameter. The relationship between most parameters of the network and the error is defined by a composition of functions, as one layer’s output is another’s input. This means that we have to use the chain rule of differentiation to make this computation, as illustrated for one parameter in the image below.
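
As an illustration for a single weight w feeding into a neuron with weighted sum z and output h = g(z), the chain rule gives (a generic sketch; the exact chain of factors depends on where the weight sits in the network):

∂J/∂w = (∂J/∂h) · (∂h/∂z) · (∂z/∂w)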

Example representation of how we use the chain rule to compute the partial derivatives of the loss with respect to the network’s parameters.

When we have computed the partial derivatives of the loss function with respect to each parameter in the network, we can update the weights by taking a step in the direction of steepest descent. This means that we subtract the partial derivative times some coefficient (the learning rate) from the original weight value, exactly like we did in our linear regression case.

And that’s it! Even though there is a lot more to machine learning than what is described in this post, these core concepts are the foundations of what it means for a machine to learn.


Conclusion

In this journey through the essence of machine learning, we’ve explored the fundamental concepts that enable machines to learn from data. From understanding the basics of linear regression to delving into the complexities of neural networks and back-propagation, we’ve seen how machines can iteratively improve their performance on a given task.

To try to give a one-sentence answer to the original question: what we, humans, have defined as “learning” for a machine is improving its performance on a task over time by adjusting internal parameters to minimize errors based on the data it is given, thus making better predictions or decisions. The rabbit hole is still deep and there are many other topics that could be covered, like the link between human learning and machine learning, training hyper-parameters, under-fitting and over-fitting, etc. But these are topics for another time.


