We introduced the basics of the stochastic gradient descent (SGD) algorithm and its implementation in Python, with the MNIST dataset as a test case, in our article Mastering Stochastic Gradient Descent algorithm with the MNIST Dataset. Stochastic gradient descent is one of the most widely used algorithms in machine learning.
We prefer SGD over gradient descent (GD) for one main reason: the dataset is often too large for our machine to handle at once. So, instead of processing the entire batch of samples for each update, we process one sample at a time.
A key thing to remember: we use SGD/GD algorithms in machine learning to minimize the loss function associated with our machine learning problem (the problem can be classification or regression). The loss function $f$ maps the model parameters to a scalar that measures how poorly the model fits the data; smaller values mean a better fit.
Suggested Read: Convergence analysis of Stochastic Gradient Descent (SGD) when a function is L-smooth and non-convex
In this article, we will investigate the proof of convergence when our loss function $f$ is convex.
Preliminaries
Let’s quickly recap a few things before jumping into the convergence analysis. The stochastic gradient descent algorithm is given as:
Algorithm - Stochastic Gradient Descent (SGD)
Input: A random initial vector $x_0 \in X$ and a decaying step size sequence $\{\gamma_t\}$;
1. for $t = 0, 1, \ldots, T-1$ do
2. Evaluate the stochastic gradient mapping at $x_t$, i.e., evaluate $\nabla F(x_t, \xi_t)$;
3. Do the following update: $x_{t+1} = \Pi_X\left(x_t - \gamma_t \nabla F(x_t, \xi_t)\right)$;
4. end for
In the above algorithm, $\xi_t$ is the random sample drawn at iteration $t$, $\nabla F(x_t, \xi_t)$ is the stochastic gradient evaluated at the current iterate $x_t$, and $\Pi_X(\cdot)$ is the projection operator onto the convex set $X$, defined below.
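To make the steps above concrete, here is a minimal Python sketch of the algorithm. The names `grad_F` and `project` stand in for a problem-specific stochastic gradient oracle and projection; both are assumptions for illustration, not fixed by the article.

```python
import numpy as np

def projected_sgd(grad_F, project, x0, gamma0, T, rng):
    """Minimal sketch of projected SGD with step size gamma_t = gamma0 / sqrt(t + 1)."""
    x = x0.copy()
    running_sum = np.zeros_like(x0)
    for t in range(T):
        g = grad_F(x, rng)                             # Step 2: stochastic gradient at x_t
        running_sum += x                               # accumulate iterates for averaging
        x = project(x - gamma0 / np.sqrt(t + 1) * g)   # Step 3: gradient step + projection
    return x, running_sum / T                          # last iterate and averaged iterate
```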
Projection Operator:
Since the update rule uses the projection operator, and our convergence analysis relies on some of its properties, we take a brief detour to define it.
The idea of the projection operator is to project the update iterate back onto the convex set by finding the point in the convex set at minimum Euclidean distance. Mathematically, it is defined as:

$$\Pi_X(y) = \arg\min_{x \in X} \|x - y\|^2.$$
Properties of Projection Operator:
Let the set $X \subseteq \mathbb{R}^n$ be closed and convex. Then:
- Non-expansive: $\|\Pi_X(u) - \Pi_X(v)\| \leq \|u - v\|$ for all $u, v \in \mathbb{R}^n$.
- Distance: The distance of a point $y$ from the set $X$ is given by $\text{dist}(y, X) = \min_{x \in X} \|x - y\|$, and it is a convex function in terms of $y$.
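As an illustration, here is a sketch of the projection onto a Euclidean ball, one convex set where the projection has a simple closed form, together with a quick numerical check of the non-expansive property (the ball radius is an assumed parameter):

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball X = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

# Quick numerical check of non-expansiveness: ||P(u) - P(v)|| <= ||u - v||
rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
assert np.linalg.norm(project_ball(u) - project_ball(v)) <= np.linalg.norm(u - v)
```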
The Stochastic Optimization Problem
The fundamental idea of the stochastic gradient descent algorithm is to find the optimal solution of the following stochastic optimization problem:

$$\min_{x \in X} \; f(x) := \mathbb{E}_{\xi}\left[F(x, \xi)\right],$$

where $X$ is a closed convex set, $\xi$ is a random variable representing a data sample, and $F(x, \xi)$ is the loss incurred on that single sample. We denote an optimal solution by $x^*$ and the optimal value by $f^* = f(x^*)$.
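As one concrete instance (an assumed least-squares setup, for illustration only), $\xi$ can index a random data point $(a_i, b_i)$ and $F(x, \xi)$ the squared error on that single point:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 5))                             # features a_i
b = A @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)   # noisy targets b_i

def F(x, i):
    # Per-sample loss: F(x, xi_i) = 0.5 * (a_i^T x - b_i)^2
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_F(x, i):
    # Stochastic gradient used in Step 2 of the algorithm
    return (A[i] @ x - b[i]) * A[i]

def f(x):
    # Population objective f(x) = E[F(x, xi)], here an average over the pool
    return 0.5 * np.mean((A @ x - b) ** 2)
```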
Concept of Filtration:
In algorithms like stochastic gradient descent, we process sample by sample. The samples arrive randomly, and in the convergence analysis we need to keep track of the history of these random samples. This history is denoted by a set known as the filtration, which contains the random samples up to iteration $t$:

$$\mathcal{F}_t = \{x_0, \xi_0, \xi_1, \ldots, \xi_{t-1}\},$$

and we write $\mathbb{E}[\,\cdot \mid \mathcal{F}_t]$ for the conditional expectation given this history.
The main idea of using conditional expectation is that, if we know $\mathcal{F}_t$, then the iterate $x_t$ is fully determined and can be treated as a constant; the only remaining randomness at iteration $t$ comes from the fresh sample $\xi_t$.
Apart from the filtration, we need the following two assumptions to prove the convergence rate for the SGD algorithm:
Assumptions:
- Assumption 1 (unbiased gradients): $\mathbb{E}[\nabla F(x_t, \xi_t) \mid \mathcal{F}_t] = \nabla f(x_t)$ for all $t \geq 0$.
- Assumption 2 (bounded second moment): $\mathbb{E}[\|\nabla F(x_t, \xi_t)\|^2 \mid \mathcal{F}_t] \leq C^2$ for all $t \geq 0$, for some scalar $C > 0$.
These assumptions ensure that, on average, each stochastic gradient points in the direction of the true gradient, and that the noise in the updates stays under control.
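As a quick numerical illustration of both assumptions, consider an assumed toy model (not from the article): $F(x, \xi) = \frac{1}{2}\|x - \xi\|^2$ with $\xi \sim \mathcal{N}(x^*, \sigma^2 I)$, so that $\nabla F(x, \xi) = x - \xi$ and $\nabla f(x) = x - x^*$:

```python
import numpy as np

# Toy model: grad F(x, xi) = x - xi with xi ~ N(x_star, sigma^2 I),
# so the true gradient is grad f(x) = x - x_star.
rng = np.random.default_rng(0)
x_star, sigma = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.7])   # a fixed iterate: conditioning on the filtration

grads = x - (x_star + sigma * rng.normal(size=(100_000, 2)))
print(grads.mean(axis=0), "vs", x - x_star)          # Assumption 1: sample mean ≈ true gradient
print((np.linalg.norm(grads, axis=1) ** 2).mean())   # Assumption 2: second moment stays bounded
```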
Main Theorem
Consider the stochastic optimization problem and the SGD algorithm defined above. Let Assumptions 1 and 2 hold. Let the convex set $X$ be bounded with diameter $D$, i.e., $\|u - v\| \leq D$ for all $u, v \in X$, and let $x^*$ be an optimal solution with $f^* = f(x^*)$.
Let the step size be diminishing, given by:

$$\gamma_t = \frac{\gamma_0}{\sqrt{t+1}}, \qquad \gamma_0 > 0.$$

Consider the averaged iterate defined by:

$$\bar{x}_T = \frac{1}{T} \sum_{t=0}^{T-1} x_t.$$

Then, there exist scalars $\theta_1, \theta_2 > 0$, depending only on $D$, $C$, and $\gamma_0$, such that:

$$\mathbb{E}[f(\bar{x}_T)] - f^* \leq \frac{\theta_1 + \theta_2}{\sqrt{T}},$$

i.e., the averaged iterate converges to the optimal value in expectation at a rate of $\mathcal{O}(1/\sqrt{T})$.
Proof:
From the update rule of the algorithm (Step 3), the fact that $x^* \in X$ implies $\Pi_X(x^*) = x^*$, and the non-expansivity of the projection operator, we obtain:

$$\|x_{t+1} - x^*\|^2 = \|\Pi_X(x_t - \gamma_t \nabla F(x_t, \xi_t)) - \Pi_X(x^*)\|^2 \leq \|x_t - \gamma_t \nabla F(x_t, \xi_t) - x^*\|^2$$

$$= \|x_t - x^*\|^2 - 2\gamma_t \nabla F(x_t, \xi_t)^\top (x_t - x^*) + \gamma_t^2 \|\nabla F(x_t, \xi_t)\|^2.$$
We now define the stochastic error as $e_t := \nabla F(x_t, \xi_t) - \nabla f(x_t)$, so that $\nabla F(x_t, \xi_t) = \nabla f(x_t) + e_t$ and the bound becomes:

$$\|x_{t+1} - x^*\|^2 \leq \|x_t - x^*\|^2 - 2\gamma_t \left(\nabla f(x_t) + e_t\right)^\top (x_t - x^*) + \gamma_t^2 \|\nabla F(x_t, \xi_t)\|^2.$$

Taking the conditional expectation with respect to the filtration $\mathcal{F}_t$ on both sides (recall that $x_t$ is deterministic given $\mathcal{F}_t$), we obtain:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \leq \|x_t - x^*\|^2 - 2\gamma_t \nabla f(x_t)^\top (x_t - x^*) - 2\gamma_t \, \mathbb{E}[e_t \mid \mathcal{F}_t]^\top (x_t - x^*) + \gamma_t^2 \, \mathbb{E}[\|\nabla F(x_t, \xi_t)\|^2 \mid \mathcal{F}_t].$$
We now take a brief detour and state the following lemma, which captures the statistical properties of the stochastic error.
Lemma 1: Consider the stochastic error $e_t = \nabla F(x_t, \xi_t) - \nabla f(x_t)$ defined above, and let Assumptions 1 and 2 hold. Then $\mathbb{E}[e_t \mid \mathcal{F}_t] = 0$ and $\mathbb{E}[\|e_t\|^2 \mid \mathcal{F}_t] \leq C^2$.
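Both claims follow directly from the assumptions. Since $x_t$ is fully determined given $\mathcal{F}_t$:

$$\mathbb{E}[e_t \mid \mathcal{F}_t] = \mathbb{E}[\nabla F(x_t, \xi_t) \mid \mathcal{F}_t] - \nabla f(x_t) = \nabla f(x_t) - \nabla f(x_t) = 0,$$

$$\mathbb{E}[\|e_t\|^2 \mid \mathcal{F}_t] = \mathbb{E}[\|\nabla F(x_t, \xi_t)\|^2 \mid \mathcal{F}_t] - \|\nabla f(x_t)\|^2 \leq C^2,$$

where the first line uses Assumption 1, and the second expands the square, applies the first line, and then invokes Assumption 2.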
(Now continuing the proof): Since $\mathbb{E}[e_t \mid \mathcal{F}_t] = 0$ by Lemma 1, the cross term vanishes, and by Assumption 2 the last term is at most $\gamma_t^2 C^2$:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \leq \|x_t - x^*\|^2 - 2\gamma_t \nabla f(x_t)^\top (x_t - x^*) + \gamma_t^2 C^2.$$

From convexity of $f$, we have $f(x^*) \geq f(x_t) + \nabla f(x_t)^\top (x^* - x_t)$, i.e., $\nabla f(x_t)^\top (x_t - x^*) \geq f(x_t) - f^*$. Substituting this gives:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \leq \|x_t - x^*\|^2 - 2\gamma_t \left(f(x_t) - f^*\right) + \gamma_t^2 C^2.$$

Dividing both sides by $2\gamma_t$ and rearranging:

$$f(x_t) - f^* \leq \frac{\|x_t - x^*\|^2 - \mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t]}{2\gamma_t} + \frac{\gamma_t C^2}{2}.$$

Taking expectation w.r.t. all the remaining randomness (using the tower property of conditional expectation):

$$\mathbb{E}[f(x_t)] - f^* \leq \frac{\mathbb{E}[\|x_t - x^*\|^2] - \mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_t} + \frac{\gamma_t C^2}{2}.$$
Adding and subtracting $\frac{\mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_{t+1}}$ on the right-hand side, we can rewrite this as:

$$\mathbb{E}[f(x_t)] - f^* \leq \frac{\mathbb{E}[\|x_t - x^*\|^2]}{2\gamma_t} - \frac{\mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_{t+1}} + \mathbb{E}[\|x_{t+1} - x^*\|^2]\left(\frac{1}{2\gamma_{t+1}} - \frac{1}{2\gamma_t}\right) + \frac{\gamma_t C^2}{2}.$$

From boundedness of the set $X$ (with diameter $D$), we have $\mathbb{E}[\|x_{t+1} - x^*\|^2] \leq D^2$; since the step size is decreasing, the coefficient $\frac{1}{2\gamma_{t+1}} - \frac{1}{2\gamma_t}$ is nonnegative, so:

$$\mathbb{E}[f(x_t)] - f^* \leq \frac{\mathbb{E}[\|x_t - x^*\|^2]}{2\gamma_t} - \frac{\mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_{t+1}} + D^2\left(\frac{1}{2\gamma_{t+1}} - \frac{1}{2\gamma_t}\right) + \frac{\gamma_t C^2}{2}.$$

Summing both sides over $t = 0, 1, \ldots, T-1$:

$$\sum_{t=0}^{T-1}\left(\mathbb{E}[f(x_t)] - f^*\right) \leq \sum_{t=0}^{T-1}\left(\frac{\mathbb{E}[\|x_t - x^*\|^2]}{2\gamma_t} - \frac{\mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_{t+1}}\right) + D^2 \sum_{t=0}^{T-1}\left(\frac{1}{2\gamma_{t+1}} - \frac{1}{2\gamma_t}\right) + \frac{C^2}{2}\sum_{t=0}^{T-1}\gamma_t.$$
Now we apply the fact that $\bar{x}_T = \frac{1}{T} \sum_{t=0}^{T-1} x_t$ together with convexity of $f$ (Jensen's inequality), which gives:

$$\mathbb{E}[f(\bar{x}_T)] - f^* \leq \frac{1}{T}\sum_{t=0}^{T-1}\left(\mathbb{E}[f(x_t)] - f^*\right).$$

By noticing that the first two sums telescope (to at most $\frac{D^2}{2\gamma_0}$ and exactly $D^2\left(\frac{1}{2\gamma_T} - \frac{1}{2\gamma_0}\right)$, respectively, which together are at most $\frac{D^2}{2\gamma_T}$) and applying the inequality $\sum_{t=0}^{T-1}\gamma_t = \gamma_0 \sum_{t=1}^{T}\frac{1}{\sqrt{t}} \leq 2\gamma_0\sqrt{T}$,
we obtain:

$$\mathbb{E}[f(\bar{x}_T)] - f^* \leq \frac{1}{T}\left(\frac{D^2}{2\gamma_T} + \gamma_0 C^2 \sqrt{T}\right) = \frac{D^2\sqrt{T+1}}{2\gamma_0 T} + \frac{\gamma_0 C^2}{\sqrt{T}} \leq \frac{\theta_1 + \theta_2}{\sqrt{T}},$$

with $\theta_1 = \frac{D^2}{\sqrt{2}\,\gamma_0}$ and $\theta_2 = \gamma_0 C^2$, where we used $\sqrt{T+1} \leq \sqrt{2T}$ for $T \geq 1$.
This concludes the proof.
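To sanity-check the rate, here is a small end-to-end experiment on an assumed toy problem (the quadratic objective, Gaussian samples, ball radius, starting point, and $\gamma_0 = 0.5$ are all illustrative choices, not from the article). It runs projected SGD for increasing $T$ and prints the optimality gap of the averaged iterate, which should shrink roughly like $1/\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(42)
x_star, sigma, radius = np.array([0.6, -0.4]), 1.0, 2.0   # optimum lies inside X

def project(y):
    # Projection onto the ball X = {x : ||x|| <= radius}
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

for T in [100, 1000, 10000]:
    x, avg = np.array([2.0, 0.0]), np.zeros(2)
    for t in range(T):
        avg += x / T                                       # accumulate the averaged iterate
        xi = x_star + sigma * rng.normal(size=2)           # fresh sample at iteration t
        x = project(x - 0.5 / np.sqrt(t + 1) * (x - xi))   # SGD step with gamma_0 = 0.5
    gap = 0.5 * np.sum((avg - x_star) ** 2)                # f(x_bar) - f* for this toy f
    print(T, gap)                                          # shrinks roughly like 1/sqrt(T)
```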