We introduced the basics of the stochastic gradient descent (SGD) algorithm and its implementation in Python, with the MNIST dataset as a test case, in our article Mastering Stochastic Gradient Descent algorithm with the MNIST Dataset. Stochastic gradient descent is one of the most widely used and powerful algorithms in machine learning.

We prefer the SGD algorithm over Gradient Descent (GD) for one main reason: the dataset is too large for our machine to handle, so instead of processing the entire batch of samples for each update, we process one sample at a time to update our iterate.

A key thing to remember: we use SGD/GD algorithms in machine learning to minimize the loss function associated with our machine learning problem (the problem can be classification or regression). The loss function $f(\cdot)$, however, can have certain attributes: it can be convex or non-convex, smooth or non-smooth, etc. The nature of this function $f(\cdot)$ governs the performance of the SGD algorithm and its rate of convergence.

Suggested Read: Convergence analysis of Stochastic Gradient Descent (SGD) when a function is L-smooth and non-convex

In this article, we will investigate the proof of convergence when our loss function $f(\cdot)$ is convex.

Preliminaries

Let’s quickly recap a few things before jumping into the convergence analysis. The stochastic gradient descent algorithm is given as:


Algorithm - Stochastic Gradient Descent (SGD)

Input: a random initial vector $x_0 \in \mathbb{R}^n$ and a decaying step size $\gamma_t > 0$;
1. for $t = 0, 1, 2, \ldots$ do
2. Evaluate the stochastic gradient mapping at $x_t$, i.e., evaluate $\nabla F(x_t, \xi_t)$;
3. Do the following update: $x_{t+1} := P_X\left(x_t - \gamma_t \nabla F(x_t, \xi_t)\right)$;
4. end for
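To make the steps concrete, here is a minimal Python sketch of this loop. The Euclidean-ball feasible set, the projection `project_onto_ball`, and the stochastic gradient oracle `stoch_grad` are illustrative assumptions for this sketch, not a prescribed implementation:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    # An example projection operator P_X for X = {x : ||x||_2 <= radius}.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def sgd(stoch_grad, x0, num_iters, gamma0=0.1, radius=1.0, seed=0):
    # Projected SGD: x_{t+1} := P_X(x_t - gamma_t * grad F(x_t, xi_t)),
    # with the decaying step size gamma_t = gamma0 / sqrt(t + 1).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        gamma_t = gamma0 / np.sqrt(t + 1)
        g = stoch_grad(x, rng)                           # Step 2
        x = project_onto_ball(x - gamma_t * g, radius)   # Step 3
    return x
```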

In the above algorithm, $x_t$ is the value of the objective variable $x$ at time $t$, $\nabla F(x_t, \xi_t)$ is the stochastic gradient of the loss function $f$ evaluated at the random sample $\xi_t$, and $P_X$ is the projection operator. We use the projection to ensure that the update of our algorithm doesn't cause the objective variable $x$ to lie outside the feasible set; if it does, the operator projects the update $x_{t+1}$ back onto the feasible set. The details of the projection operator are given below:

Projection Operator:

We will use some properties of the projection operator, as applied in the update rule, in our convergence analysis. So, let's take a brief detour through the projection operator.

The idea of the projection operator is to project the updated iterate back onto the convex set by finding the point in the convex set with minimum Euclidean distance. Mathematically, it is defined as:

$$P_X(\hat{x}) = \arg\min_{x \in X} \|x - \hat{x}\|_2^2.$$

Properties of Projection Operator:

Let $X \subseteq \mathbb{R}^n$ be a nonempty closed convex set. Consider the projection mapping $P_X : \mathbb{R}^n \to X$. The projection operator has the following properties (illustrated numerically in the sketch after this list):

  1. Non-expansive: $\|P_X(x) - P_X(y)\|_2 \le \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$.
  2. Distance: the distance of a point $x$ from the set $X$ is a convex function of $x$ and is given by $\text{dist}(x, X) = \|P_X(x) - x\|_2$.
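As a quick numerical illustration of these two properties, here is a sketch using the box $X = [-1, 1]^n$, whose projection is a coordinate-wise clip (the set and the names here are assumptions chosen for simplicity):

```python
import numpy as np

def project_onto_box(x, lo=-1.0, hi=1.0):
    # P_X for the box X = [lo, hi]^n: clipping each coordinate solves
    # argmin_{z in X} ||z - x||_2^2 coordinate-wise.
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
x, y = rng.normal(scale=3.0, size=4), rng.normal(scale=3.0, size=4)

# Property 1 (non-expansiveness): ||P_X(x) - P_X(y)||_2 <= ||x - y||_2.
print(np.linalg.norm(project_onto_box(x) - project_onto_box(y))
      <= np.linalg.norm(x - y))                       # expected: True

# Property 2 (distance of x from X): dist(x, X) = ||P_X(x) - x||_2.
print(np.linalg.norm(project_onto_box(x) - x))
```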

The Stochastic Optimization Problem

The fundamental idea of the stochastic gradient descent algorithm is to find the optimal solution of the following stochastic optimization problem:

$$\text{minimize} \quad f(x) \triangleq \mathbb{E}[F(x, \xi)] \quad \text{subject to} \quad x \in X,$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is an unknown deterministic function, $F : \mathbb{R}^n \times \Omega \to \mathbb{R}$ is a known stochastic function, $\xi : \Omega \to \mathbb{R}^d$ is a random variable (a random data sample), and $\mathbb{E}[\cdot]$ denotes the expectation taken with respect to the random variable $\xi$.
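As a toy illustration of this setup (the linear data model below is an assumption made for demonstration only), $F(x, \xi)$ can be the squared loss on a single random sample $\xi = (a, b)$, while $f(x) = \mathbb{E}[F(x, \xi)]$ is unknown in closed form but can be approximated by a Monte Carlo average:

```python
import numpy as np

def F(x, xi):
    # Stochastic function F(x, xi): squared loss on one sample xi = (a, b).
    a, b = xi
    return (a @ x - b) ** 2

def sample_xi(rng, x_star=np.array([1.0, -2.0, 0.5])):
    # Draw one random data sample xi (synthetic linear data plus noise).
    a = rng.normal(size=3)
    return a, a @ x_star + 0.1 * rng.normal()

rng = np.random.default_rng(1)
x = rng.normal(size=3)

# f(x) = E[F(x, xi)], approximated by averaging over many random samples.
f_estimate = np.mean([F(x, sample_xi(rng)) for _ in range(10_000)])
print(f_estimate)
```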

Concept of Filtration:

In algorithms like stochastic gradient descent, we process one sample at a time. The samples arrive randomly, and in the convergence analysis we need to keep track of the history of these random samples. This history is captured by a set known as the filtration, which contains the random samples up to iteration $t$. Mathematically, it is defined as:

$$\mathcal{F}_t \triangleq \{x_0, \xi_0, \xi_1, \ldots, \xi_{t-1}\} \quad \text{for all } t \ge 1,$$

and $\mathcal{F}_0 \triangleq \{x_0\}$. Whenever randomness is involved, we need expectations to remove it, so we define the conditional expectation given the filtration as $\mathbb{E}[\cdot \mid \mathcal{F}_t]$.

The main idea of using the conditional expectation is that if our filtration contains $\xi_{t-1}$, then we know $x_t$; i.e., by knowing the realization of $\xi_{t-1}$ (and the samples before it), we can compute $x_t$ using the update rule of the algorithm (Step 3). This is how the filtration and the conditional expectation are useful in removing the randomness, allowing us to characterize the next iterate $x_{t+1}$.

Apart from the filtration, we need the following two assumptions to prove the convergence rate of the SGD algorithm:

Assumptions:

  1. $\mathbb{E}[\nabla F(x, \xi) \mid x] = \nabla f(x)$ for all $x \in X$.
  2. $\mathbb{E}[\|\nabla F(x, \xi) - \nabla f(x)\|^2 \mid x] \le \sigma^2$ for all $x \in X$, for some $\sigma > 0$.

Assumption 1 implies that the stochastic gradient is an unbiased estimator of the true gradient of the objective function, and Assumption 2 implies that the variance of the stochastic gradient is bounded by $\sigma^2$.
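Both assumptions can be sanity-checked numerically on a finite dataset, where $\xi$ is a uniformly random sample index. The least-squares setup below is an illustrative assumption, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 5_000
A = rng.normal(size=(N, n))                  # data samples a_i
b = A @ np.array([1.0, -2.0, 0.5])           # targets b_i
x = rng.normal(size=n)

def stoch_grad(x, i):
    # Gradient of F(x, xi_i) = (a_i^T x - b_i)^2 for a single sample i.
    return 2 * (A[i] @ x - b[i]) * A[i]

full_grad = 2 * A.T @ (A @ x - b) / N        # gradient of f(x)

# Assumption 1 (unbiasedness): the average of the per-sample gradients
# over a uniformly random index equals the full gradient of f.
avg = np.mean([stoch_grad(x, i) for i in range(N)], axis=0)
print(np.allclose(avg, full_grad))           # expected: True

# Assumption 2 (bounded variance): estimate E[||grad F - grad f||^2].
var = np.mean([np.linalg.norm(stoch_grad(x, i) - full_grad) ** 2
               for i in range(N)])
print(var)                                   # a finite sigma^2 proxy
```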

Main Theorem

Consider the stochastic optimization problem and the SGD algorithm defined above. Let Assumptions 1 and 2 hold. Let $X \subseteq \mathbb{R}^n$ be nonempty, compact, and convex. Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and convex.

Let the step size be the diminishing sequence given by:

$$\gamma_t := \frac{\gamma_0}{\sqrt{t+1}} \quad \text{for all } t \ge 0.$$

Consider the averaged iterate defined by:

$$\bar{x}_T \triangleq \frac{1}{T}\sum_{t=0}^{T-1} x_t \quad \text{for all } T \ge 1.$$

Then, there exist scalars $M > 0$ and $C > 0$ and an optimal solution vector $x^* \in X$ such that:

$$\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \left(\frac{M}{2\gamma_0} + \gamma_0(C^2 + \sigma^2)\right)\frac{1}{\sqrt{T}} \quad \text{for all } T \ge 2.$$
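Before the proof, here is a small empirical sketch of this rate on a toy convex problem (the data model, the unit-ball feasible set, and all names are assumptions made for illustration): we run projected SGD with $\gamma_t = \gamma_0/\sqrt{t+1}$ and return the averaged iterate $\bar{x}_T$:

```python
import numpy as np

rng = np.random.default_rng(3)
x_star = np.array([0.3, -0.2, 0.1])          # optimum (inside the ball)

def stoch_grad(x):
    # Stochastic gradient of f(x) = E[(a^T x - b)^2] from one fresh sample.
    a = rng.normal(size=3)
    b = a @ x_star + 0.1 * rng.normal()
    return 2 * (a @ x - b) * a

def project(x, radius=1.0):
    # Euclidean projection onto the unit ball, our example P_X.
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

gamma0, T = 0.1, 20_000
x, running_sum = np.zeros(3), np.zeros(3)
for t in range(T):
    running_sum += x                          # accumulate x_0, ..., x_{T-1}
    x = project(x - gamma0 / np.sqrt(t + 1) * stoch_grad(x))

x_bar = running_sum / T                       # averaged iterate \bar{x}_T
print(np.linalg.norm(x_bar - x_star))         # shrinks as T grows
```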

Proof:

From the update rule of the algorithm (Step 3), the fact that $x^* \in X$ implies $x^* = P_X(x^*)$, and the non-expansiveness of the projection operator, we obtain:

$$\|x_{t+1} - x^*\|^2 = \|P_X(x_t - \gamma_t \nabla F(x_t, \xi_t)) - P_X(x^*)\|^2 \le \|x_t - \gamma_t \nabla F(x_t, \xi_t) - x^*\|^2 = \|x_t - x^*\|^2 + \gamma_t^2 \|\nabla F(x_t, \xi_t)\|^2 - 2\gamma_t \nabla F(x_t, \xi_t)^T (x_t - x^*).$$

We now define the stochastic error as $w_t \triangleq \nabla F(x_t, \xi_t) - \nabla f(x_t)$ for all $t \ge 0$. Substituting $\nabla F(x_t, \xi_t) = \nabla f(x_t) + w_t$, we obtain:

$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 + \gamma_t^2 \|\nabla f(x_t) + w_t\|^2 - 2\gamma_t (\nabla f(x_t) + w_t)^T (x_t - x^*) = \|x_t - x^*\|^2 + \gamma_t^2 \left(\|\nabla f(x_t)\|^2 + \|w_t\|^2 + 2\nabla f(x_t)^T w_t\right) - 2\gamma_t (\nabla f(x_t) + w_t)^T (x_t - x^*).$$

Taking the conditional expectation given the filtration $\mathcal{F}_t$ on both sides (note that $x_t$, and hence $\nabla f(x_t)$, is fully determined by $\mathcal{F}_t$), we obtain:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \le \|x_t - x^*\|^2 + \gamma_t^2 \left(\|\nabla f(x_t)\|^2 + \mathbb{E}[\|w_t\|^2 \mid \mathcal{F}_t] + 2\nabla f(x_t)^T \mathbb{E}[w_t \mid \mathcal{F}_t]\right) - 2\gamma_t \left(\nabla f(x_t) + \mathbb{E}[w_t \mid \mathcal{F}_t]\right)^T (x_t - x^*).$$

We now take a brief detour and state the following lemma, which captures the statistical properties of the stochastic errors.

Lemma 1: Consider the definition of the stochastic error $w_t$. Under Assumptions 1 and 2, we have, for all $t \ge 0$:

$$\mathbb{E}[w_t \mid \mathcal{F}_t] = 0_n, \qquad \mathbb{E}[\|w_t\|^2 \mid \mathcal{F}_t] \le \sigma^2.$$
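For intuition on why Lemma 1 holds: since $x_t$ is completely determined by $\mathcal{F}_t$, Assumption 1 gives

$$\mathbb{E}[w_t \mid \mathcal{F}_t] = \mathbb{E}[\nabla F(x_t, \xi_t) \mid \mathcal{F}_t] - \nabla f(x_t) = \nabla f(x_t) - \nabla f(x_t) = 0_n,$$

and the bound $\mathbb{E}[\|w_t\|^2 \mid \mathcal{F}_t] \le \sigma^2$ follows in exactly the same way from Assumption 2.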

(Now continuing the proof.) Since $X$ is compact and $\nabla f$ is continuous (because $f$ is continuously differentiable), there exists $C > 0$ such that $\sup_{x \in X} \|\nabla f(x)\| \le C$. Invoking Assumption 1 and Lemma 1, we obtain:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \le \|x_t - x^*\|^2 + \gamma_t^2 (C^2 + \sigma^2) - 2\gamma_t \nabla f(x_t)^T (x_t - x^*).$$

From the convexity of $f$, we have that $\nabla f(x_t)^T (x_t - x^*) \ge f(x_t) - f(x^*)$. We obtain:

$$\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \le \|x_t - x^*\|^2 + \gamma_t^2 (C^2 + \sigma^2) - 2\gamma_t (f(x_t) - f(x^*)).$$

Dividing both sides by $2\gamma_t$ and rearranging the terms, we obtain:

$$f(x_t) - f(x^*) \le \frac{1}{2\gamma_t}\left(\|x_t - x^*\|^2 - \mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t]\right) + \frac{\gamma_t}{2}(C^2 + \sigma^2).$$

Taking the total expectation on both sides and using the tower property of expectation, $\mathbb{E}\big[\mathbb{E}[\cdot \mid \mathcal{F}_t]\big] = \mathbb{E}[\cdot]$, we obtain:

$$\mathbb{E}[f(x_t)] - f(x^*) \le \frac{1}{2\gamma_t}\left(\mathbb{E}[\|x_t - x^*\|^2] - \mathbb{E}\big[\mathbb{E}[\|x_{t+1} - x^*\|^2 \mid \mathcal{F}_t]\big]\right) + \frac{\gamma_t}{2}(C^2 + \sigma^2) = \frac{1}{2\gamma_t}\left(\mathbb{E}[\|x_t - x^*\|^2] - \mathbb{E}[\|x_{t+1} - x^*\|^2]\right) + \frac{\gamma_t}{2}(C^2 + \sigma^2).$$

Adding and subtracting $\frac{1}{2\gamma_{t-1}}\mathbb{E}[\|x_t - x^*\|^2]$ on the right-hand side (with the convention $1/\gamma_{-1} := 0$, which is consistent with $1/\gamma_{t-1} = \sqrt{t}/\gamma_0$ at $t = 0$), and using the boundedness of the set $X$, which guarantees a scalar $M > 0$ such that $\max_{x \in X} \|x - x^*\|^2 \le M$, we have:

$$\mathbb{E}[f(x_t)] - f(x^*) \le \left(\frac{1}{2\gamma_{t-1}}\mathbb{E}[\|x_t - x^*\|^2] - \frac{1}{2\gamma_t}\mathbb{E}[\|x_{t+1} - x^*\|^2]\right) + \left(\frac{1}{2\gamma_t} - \frac{1}{2\gamma_{t-1}}\right) M + \frac{\gamma_t}{2}(C^2 + \sigma^2).$$

Summing both sides over $t = 0, \ldots, T-1$ and dividing by $T$: on the left-hand side, since $f$ is convex, we can apply Jensen's inequality, $f\big(\frac{1}{T}\sum_{t=0}^{T-1} x_t\big) \le \frac{1}{T}\sum_{t=0}^{T-1} f(x_t)$:

$$\sum_{t=0}^{T-1}\mathbb{E}[f(x_t)] - T f(x^*) \le \sum_{t=0}^{T-1}\left(\frac{1}{2\gamma_{t-1}}\mathbb{E}[\|x_t - x^*\|^2] - \frac{1}{2\gamma_t}\mathbb{E}[\|x_{t+1} - x^*\|^2]\right) + \sum_{t=0}^{T-1}\left(\frac{1}{2\gamma_t} - \frac{1}{2\gamma_{t-1}}\right) M + \sum_{t=0}^{T-1}\frac{\gamma_t}{2}(C^2 + \sigma^2),$$

$$\mathbb{E}\left[f\left(\frac{1}{T}\sum_{t=0}^{T-1} x_t\right)\right] - f(x^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\left[\frac{1}{2\gamma_{t-1}}\mathbb{E}[\|x_t - x^*\|^2] - \frac{1}{2\gamma_t}\mathbb{E}[\|x_{t+1} - x^*\|^2]\right] + \frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1}{2\gamma_t} - \frac{1}{2\gamma_{t-1}}\right) M + \frac{C^2 + \sigma^2}{2T}\sum_{t=0}^{T-1}\gamma_t.$$

Now we apply the fact that $\bar{x}_T = \frac{1}{T}\sum_{t=0}^{T-1} x_t$ and $\gamma_t = \frac{\gamma_0}{\sqrt{t+1}}$, so that $\frac{1}{\gamma_{t-1}} = \frac{\sqrt{t}}{\gamma_0}$ and $\frac{1}{\gamma_t} = \frac{\sqrt{t+1}}{\gamma_0}$:

$$\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\left[\frac{\sqrt{t}\,\mathbb{E}[\|x_t - x^*\|^2] - \sqrt{t+1}\,\mathbb{E}[\|x_{t+1} - x^*\|^2]}{2\gamma_0}\right] + \frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{\sqrt{t+1} - \sqrt{t}}{2\gamma_0}\right) M + \frac{C^2 + \sigma^2}{2T}\sum_{t=0}^{T-1}\gamma_t.$$

By noticing that the first sum telescopes to $-\frac{\sqrt{T}}{2\gamma_0}\mathbb{E}[\|x_T - x^*\|^2]$ and the second telescopes to $\frac{\sqrt{T}}{2\gamma_0} M$, applying the inequality $\sum_{t=0}^{T-1}\gamma_t = \gamma_0 \sum_{k=1}^{T}\frac{1}{\sqrt{k}} \le 2\gamma_0\sqrt{T}$, and further noticing the fact that:

$$\frac{M}{2\gamma_0} + \gamma_0(C^2 + \sigma^2) - \frac{\mathbb{E}[\|x_T - x^*\|^2]}{2\gamma_0} \le \frac{M}{2\gamma_0} + \gamma_0(C^2 + \sigma^2),$$

We obtain:

$$\mathbb{E}[f(\bar{x}_T)] - f(x^*) \le \frac{1}{\sqrt{T}}\left[\frac{M}{2\gamma_0} + \gamma_0(C^2 + \sigma^2)\right] \quad \text{for all } T \ge 2.$$

This concludes the proof, establishing the $O(1/\sqrt{T})$ convergence rate of SGD for convex loss functions.