We proved the convergence of the gradient descent algorithm for the smooth and nonconvex case in our article Convergence of Gradient Descent for Smooth and Nonconvex Case, and for the smooth and convex case in Convergence of Gradient Descent for Smooth and Convex Case. This article continues the series by analyzing the convergence of gradient descent when the function is smooth and strongly convex.

A quick recap:

Gradient descent is a widely used optimization algorithm that aims to find the minimum of a given function by iteratively updating its parameters in the direction of the negative gradient. In recent years, gradient descent has become particularly popular for training deep neural networks in machine learning. However, many of these problems involve functions that may be smooth, convex, or nonconvex, and each case presents its own challenges for convergence.


Algorithm 1 - Gradient Descent (GD)

Input: An arbitrary vector $x_0 \in \mathbb{R}^n$ and a decaying step size $\gamma_t > 0$;
1. for $t = 1, 2, \ldots$ do
2. Evaluate the gradient mapping at $x_t$, i.e., evaluate $\nabla f(x_t)$;
3. Do the following update: $x_{t+1} := x_t - \gamma_t \nabla f(x_t)$;
4. end for
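The steps above can be sketched in Python as follows. This is a minimal illustration, not a production implementation; the function and parameter names (`grad_f`, `step_size`, `num_iters`) are our own choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters=100):
    """Run x_{t+1} = x_t - gamma_t * grad_f(x_t) for num_iters steps.

    step_size may be a constant float, or a function t -> gamma_t
    to allow the decaying schedule mentioned in Algorithm 1.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, num_iters + 1):
        gamma = step_size(t) if callable(step_size) else step_size
        x = x - gamma * grad_f(x)  # the update in step 3
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is simply x.
x_min = gradient_descent(lambda x: x, x0=[5.0, -3.0], step_size=0.1)
```

With a constant step size the iterate here contracts toward the minimizer $x^* = 0$ at every step, which is exactly the behavior the analysis below quantifies.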

Pre-requisite for convergence analysis

Before jumping into the convergence analysis of gradient descent for the smooth and strongly convex case, we first state some basic inequalities that will be used:

Lipschitz smoothness: By definition, a differentiable function $f : \mathcal{X} \to \mathbb{R}$ over the set $\mathcal{X} \subseteq \mathbb{R}^n$ is called $L$-smooth if the following property is satisfied, where $L > 0$ is a scalar:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathcal{X}.$$

Strong Convexity: Consider a continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$. We say $f$ is $\mu$-strongly convex over $\mathcal{X} \subseteq \mathbb{R}^n$ if for some $\mu > 0$ we have

$$f(y) + \nabla f(y)^T (x - y) + \frac{\mu}{2} \|x - y\|^2 \le f(x), \quad \text{for all } x, y \in \mathcal{X}.$$

By some algebraic manipulation of the above expression (adding it to the same inequality with the roles of $x$ and $y$ swapped), we can show that for a strongly convex function the following also holds:

$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge \mu \|x - y\|^2, \quad \text{for all } x, y \in \mathcal{X}.$$
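As a concrete sanity check (not part of the original analysis), both inequalities can be verified numerically on a simple quadratic $f(x) = \frac{1}{2} x^T A x$, for which $\nabla f(x) = Ax$, $L = \lambda_{\max}(A)$, and $\mu = \lambda_{\min}(A)$. The matrix $A$ below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 0.5 * x^T A x with A symmetric positive definite is
# L-smooth with L = lambda_max(A) and mu-strongly convex with
# mu = lambda_min(A); its gradient is A x.
A = np.diag([1.0, 4.0])      # eigenvalues give mu = 1, L = 4
grad = lambda x: A @ x
mu, L = 1.0, 4.0

for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    # Lipschitz smoothness: ||grad f(x) - grad f(y)|| <= L ||x - y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12
    # Strong convexity (gradient form):
    # (grad f(x) - grad f(y))^T (x - y) >= mu ||x - y||^2
    assert (grad(x) - grad(y)) @ (x - y) >= mu * np.linalg.norm(x - y) ** 2 - 1e-12
```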

Theorem for smooth and strongly convex functions:

Let $x^* \in \mathbb{R}^n$ denote the unique global optimal solution of $\min_{x \in \mathbb{R}^n} f(x)$, where $f$ is $L$-smooth and $\mu$-strongly convex. Then, for a constant step size $\gamma < \frac{2\mu}{L^2}$, we have:

$$\|x_T - x^*\|^2 \le \left(1 - 2\gamma\mu + \gamma^2 L^2\right)^T \|x_0 - x^*\|^2 \quad \text{for all } T \ge 1.$$

Note that the condition $\gamma < \frac{2\mu}{L^2}$ guarantees $1 - 2\gamma\mu + \gamma^2 L^2 < 1$, so the right-hand side shrinks geometrically in $T$.

Proof:

$$
\begin{aligned}
\|x_{t+1} - x^*\|^2 &\overset{(i)}{=} \|x_t - \gamma \nabla f(x_t) - x^*\|^2 \\
&\overset{(ii)}{=} \|x_t - x^*\|^2 - 2\gamma \nabla f(x_t)^T (x_t - x^*) + \gamma^2 \|\nabla f(x_t)\|^2 \\
&\overset{(iii)}{=} \|x_t - x^*\|^2 - 2\gamma \left(\nabla f(x_t) - \nabla f(x^*)\right)^T (x_t - x^*) + \gamma^2 \|\nabla f(x_t) - \nabla f(x^*)\|^2 \\
&\overset{(iv)}{\le} \|x_t - x^*\|^2 - 2\gamma\mu \|x_t - x^*\|^2 + \gamma^2 L^2 \|x_t - x^*\|^2 \\
&= \left(1 - 2\gamma\mu + \gamma^2 L^2\right) \|x_t - x^*\|^2.
\end{aligned}
$$

Here, in (i) we applied the update rule of the gradient descent algorithm; in (ii) we expanded the squared norm in terms of $x_t - x^*$ and $\gamma \nabla f(x_t)$; in (iii) we used the first-order optimality condition $\nabla f(x^*) = 0$, which holds since $x^*$ is the optimal point, so inserting $\nabla f(x^*)$ changes nothing; finally, in (iv), we applied the $L$-smoothness and $\mu$-strong convexity inequalities defined above.

By recursion, we obtain:

$$\|x_T - x^*\|^2 \le \left(1 - 2\gamma\mu + \gamma^2 L^2\right)^T \|x_0 - x^*\|^2 \quad \text{for all } T \ge 1.$$
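As an illustrative sanity check of this geometric rate (our own addition, using a simple quadratic where $x^* = 0$ and the constants are known exactly), the bound can be verified numerically:

```python
import numpy as np

# Check ||x_T - x*||^2 <= rho^T ||x_0 - x*||^2 with
# rho = 1 - 2*gamma*mu + gamma^2 * L^2, on f(x) = 0.5 * x^T A x,
# for which x* = 0, mu = lambda_min(A), L = lambda_max(A).
A = np.diag([1.0, 4.0])
mu, L = 1.0, 4.0
gamma = 0.9 * 2 * mu / L**2          # satisfies gamma < 2*mu / L^2
rho = 1 - 2 * gamma * mu + gamma**2 * L**2

x = np.array([5.0, -3.0])            # x_0
x0_dist = np.linalg.norm(x) ** 2     # ||x_0 - x*||^2
for T in range(1, 51):
    x = x - gamma * (A @ x)          # gradient descent step
    assert np.linalg.norm(x) ** 2 <= rho**T * x0_dist + 1e-12
```

Every iterate satisfies the bound, and since $\rho < 1$ the squared distance to the optimum decays geometrically, as the theorem predicts.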

This concludes the proof of convergence of gradient descent when the function is smooth and strongly convex.