We provided the proof of convergence of the gradient descent algorithm for the smooth and nonconvex case in our article on Convergence of Gradient Descent for Smooth and Nonconvex Case.

In this article, we offer a different flavor and derive the convergence bound when the function is smooth and convex.

A quick recap:

Gradient descent is a widely used optimization algorithm that aims to find the minimum of a given function by iteratively updating its parameters in the direction of the negative gradient. In recent years, gradient descent has become particularly popular in the context of training deep neural networks for machine learning tasks. However, many of these problems involve smooth functions, both convex and nonconvex, which present unique challenges for convergence analysis.

Recall that the gradient descent algorithm is given as:


Algorithm 1 - Gradient Descent (GD)

Input: An arbitrary vector $x_0 \in \mathbb{R}^n$ and a decaying step size $\gamma_t > 0$;
1. for $t = 1, 2, \ldots$ do
2. Evaluate the gradient mapping at $x_t$, i.e., evaluate $\nabla f(x_t)$;
3. Do the following update: $x_{t+1} := x_t - \gamma_t \nabla f(x_t)$;
4. end for
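For concreteness, here is a minimal Python sketch of Algorithm 1 with a constant step size (the function names and signature are our own illustration, not part of the original algorithm statement):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, num_iters):
    """Minimal sketch of Algorithm 1 with a constant step size gamma.

    grad_f    -- callable returning the gradient of f at a point
    x0        -- initial iterate, a NumPy array
    gamma     -- step size (the analysis below assumes gamma <= 1/L)
    num_iters -- number of iterations T
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        x = x - gamma * grad_f(x)  # x_{t+1} := x_t - gamma * grad f(x_t)
    return x
```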

Pre-requisite for convergence:

L-smooth: A differentiable function $f: X \to \mathbb{R}$ over the set $X \subseteq \mathbb{R}^n$ is called $L$-smooth if the following property is satisfied, where $L > 0$ is a scalar:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in X.$$
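As a concrete example (our illustration): for the quadratic $f(x) = \frac{1}{2} x^T A x$ with symmetric positive semidefinite $A$, we have $\nabla f(x) = Ax$, so

$$\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \|A\|_2 \, \|x - y\|,$$

and $f$ is $L$-smooth with $L = \|A\|_2 = \lambda_{\max}(A)$.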

Descent Lemma: Let $f: \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth over the set $X \subseteq \mathbb{R}^n$. Then, for any $x, y \in X$ we have

$$f(x) \le f(y) + \nabla f(y)^T (x - y) + \frac{L}{2} \|x - y\|^2.$$
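To see the lemma in action (a worked one-dimensional instance of our own): $f(x) = x^2$ is $L$-smooth with $L = 2$, and

$$f(y) + f'(y)(x - y) + \frac{L}{2}(x - y)^2 = y^2 + 2y(x - y) + (x - y)^2 = x^2 = f(x),$$

so the descent lemma holds with equality here; for functions whose curvature stays strictly below $L$, the inequality is strict.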

Subgradient Inequality: A differentiable and convex function $f$ satisfies the following inequality for all $x$:

$$f(x_t) + \nabla f(x_t)^T (x - x_t) \le f(x).$$
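Geometrically, the tangent plane at $x_t$ lies below the graph of $f$. For example (our illustration), with $f(x) = x^2$ and $x_t = 1$:

$$f(x_t) + f'(x_t)(x - x_t) = 1 + 2(x - 1) = 2x - 1 \le x^2 \quad \text{for all } x,$$

since $x^2 - 2x + 1 = (x - 1)^2 \ge 0$.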

Theorem for smooth and convex case

Let $\min_{x \in \mathbb{R}^n} f(x)$ have at least one global optimal solution, denoted by $x^* \in \mathbb{R}^n$, where $f$ is $L$-smooth and convex. Then, when $\gamma \le \frac{1}{L}$:

$$f(x_T) - f(x^*) \le \frac{0.5 \gamma^{-1} \|x_0 - x^*\|^2}{T} \quad \text{for all } T \ge 1.$$
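Before the proof, here is a quick numerical sanity check of this bound (our own experiment, on a convex quadratic where every quantity can be computed exactly):

```python
import numpy as np

# Check the O(1/T) bound on the convex quadratic f(x) = 0.5 * ||Ax - b||^2,
# whose gradient is A^T (Ax - b) and whose smoothness constant is
# L = lambda_max(A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L                                   # satisfies gamma <= 1/L
x_star = np.linalg.lstsq(A, b, rcond=None)[0]     # a global minimizer
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2

x0 = np.zeros(5)
x = x0.copy()
for T in range(1, 501):
    x = x - gamma * (A.T @ (A @ x - b))           # one GD step; x is now x_T
    bound = np.linalg.norm(x0 - x_star) ** 2 / (2 * gamma * T)
    assert f(x) - f(x_star) <= bound + 1e-12      # the theorem's guarantee
```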

Proof:

We now use $L$-smoothness, via the Descent Lemma, to prove the above result.

$$
\begin{aligned}
f(x_{t+1}) &\overset{(i)}{\le} f(x_t) + \nabla f(x_t)^T (x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&\overset{(ii)}{=} f(x_t) + \nabla f(x_t)^T \left(-\gamma \nabla f(x_t)\right) + \frac{L}{2}\|\gamma \nabla f(x_t)\|^2 \\
&\overset{(iii)}{=} f(x_t) - \gamma \|\nabla f(x_t)\|^2 + \frac{\gamma^2 L}{2}\|\nabla f(x_t)\|^2 \\
&= f(x_t) - \gamma\left(1 - \frac{\gamma L}{2}\right)\|\nabla f(x_t)\|^2 \\
&\overset{(iv)}{\le} f(x_t) - \frac{\gamma}{2}\|\nabla f(x_t)\|^2.
\end{aligned}
$$

Here, in (i) we used the Descent Lemma with $x := x_{t+1}$ and $y := x_t$; in (ii) we applied the gradient descent update rule $x_{t+1} := x_t - \gamma \nabla f(x_t)$; in (iii) we used the definition of the squared Euclidean norm, i.e., $a^T a = \|a\|^2$ for all $a \in \mathbb{R}^n$; and finally in (iv) we assumed that $\gamma L \le 1$, which implies $1 - \frac{\gamma L}{2} \ge \frac{1}{2}$. So, we have:

$$f(x_{t+1}) \le f(x_t) - \frac{\gamma}{2}\|\nabla f(x_t)\|^2. \tag{1}$$
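As an aside (our remark, not part of the original derivation): inequality (1) uses no convexity at all. Rearranging and summing it over $t = 0, \ldots, T-1$ already recovers the nonconvex guarantee from the companion article:

$$\min_{0 \le t \le T-1} \|\nabla f(x_t)\|^2 \le \frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x_t)\|^2 \le \frac{2\left(f(x_0) - f(x^*)\right)}{\gamma T}.$$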

We now substitute the subgradient inequality, with $x := x^*$, into (1) to get:

$$
\begin{aligned}
f(x_{t+1}) &\le f(x^*) - \nabla f(x_t)^T (x^* - x_t) - \frac{\gamma}{2}\|\nabla f(x_t)\|^2 \\
f(x_{t+1}) - f(x^*) &\le -\nabla f(x_t)^T (x^* - x_t) - \frac{\gamma}{2}\|\nabla f(x_t)\|^2 \\
&\overset{(i)}{=} \frac{1}{\gamma}(x_{t+1} - x_t)^T (x^* - x_t) - \frac{1}{2\gamma}\|x_{t+1} - x_t\|^2 \\
&\overset{(ii)}{=} \frac{1}{\gamma}(x_{t+1} - x_t)^T (x^* - x_t) - \frac{1}{2\gamma}\|(x_{t+1} - x^*) - (x_t - x^*)\|^2 \\
&\overset{(iii)}{=} \frac{1}{\gamma}(x_{t+1} - x_t)^T (x^* - x_t) - \frac{1}{2\gamma}\left(\|x_{t+1} - x^*\|^2 + \|x_t - x^*\|^2 - 2(x_{t+1} - x^*)^T (x_t - x^*)\right) \\
&= \left[\frac{1}{\gamma}(x_{t+1} - x_t)^T (x^* - x_t) + \frac{1}{\gamma}(x_{t+1} - x^*)^T (x_t - x^*)\right] - \frac{1}{2\gamma}\left(\|x_{t+1} - x^*\|^2 + \|x_t - x^*\|^2\right) \\
&= \frac{1}{\gamma}\|x_t - x^*\|^2 - \frac{1}{2\gamma}\|x_{t+1} - x^*\|^2 - \frac{1}{2\gamma}\|x_t - x^*\|^2 \\
&= \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right)
\end{aligned}
$$

Here, in (i) we replaced $\nabla f(x_t)$ using the update rule, i.e., $\nabla f(x_t) = -\frac{1}{\gamma}(x_{t+1} - x_t)$; in (ii) we added and subtracted $x^*$ in the last term; and in (iii) we expanded the squared norm. The step after (iii) uses the identity $(x_{t+1} - x_t)^T(x^* - x_t) + (x_{t+1} - x^*)^T(x_t - x^*) = \|x_t - x^*\|^2$, which can be verified by writing $x_{t+1} - x_t = (x_{t+1} - x^*) - (x_t - x^*)$ and $x^* - x_t = -(x_t - x^*)$.

Now, summing over $t = 0, 1, \ldots, T-1$ and noting that the right-hand side telescopes, we obtain:

$$\left(\sum_{t=0}^{T-1} f(x_{t+1})\right) - T f(x^*) \le \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \|x_T - x^*\|^2\right)$$

Note that $f(x_{t+1}) \le f(x_t)$ by the descent property (1), so $\left(\sum_{t=0}^{T-1} f(x_{t+1})\right) \ge T f(x_T)$, and that $\|x_T - x^*\|^2 \ge 0$. So, we have:

$$T f(x_T) - T f(x^*) \le \left(\sum_{t=0}^{T-1} f(x_{t+1})\right) - T f(x^*) \le \frac{1}{2\gamma}\|x_0 - x^*\|^2$$

Hence, we have:

$$f(x_T) - f(x^*) \le \frac{0.5 \gamma^{-1} \|x_0 - x^*\|^2}{T}$$
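A direct consequence (our remark, under the same assumptions): to guarantee $f(x_T) - f(x^*) \le \epsilon$, it suffices to run

$$T \ge \frac{\|x_0 - x^*\|^2}{2\gamma\epsilon}$$

iterations; with the largest admissible step size $\gamma = \frac{1}{L}$ this becomes $T \ge \frac{L\|x_0 - x^*\|^2}{2\epsilon}$, the familiar $O(1/\epsilon)$ iteration complexity for smooth convex problems.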

This concludes the proof of convergence of the gradient descent algorithm when the function is $L$-smooth and convex.