Chapter 2 Mathematical Tools
Convex Analysis
Definition 2.1 (Convex Combination) A convex combination of a set of n vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ is a vector $x_\theta := \sum_{i=1}^{n} \theta_i x_i$, where $\theta = (\theta_1, \ldots, \theta_n)$ satisfies $\theta_i \ge 0$ for all $i$ and $\sum_{i=1}^{n} \theta_i = 1$.
A set that is closed under arbitrary convex combinations is a convex set. Geometrically speaking, convex sets are those that contain all line segments that join two points inside the set. As a result, they cannot have any inward “bulges”.
Definition 2.2 (Convex Set) A set $\mathcal{C} \subseteq \mathbb{R}^p$ is considered convex if, for every $x, y \in \mathcal{C}$ and every $\lambda \in [0, 1]$, we have $(1 - \lambda)x + \lambda y \in \mathcal{C}$ as well.
The restriction $\lambda \in [0, 1]$ ensures that the point $(1 - \lambda)x + \lambda y$ ranges exactly over the line segment joining $x$ and $y$.
Figure 2.1 gives visual representations of prototypical convex and non-convex sets.

A convex function can be defined as one satisfying $f((1 - \lambda)x + \lambda y) \le (1 - \lambda)f(x) + \lambda f(y)$ for all $x, y$ and all $\lambda \in [0, 1]$.
This means that any value on the line segment (chord) joining two points $(x, f(x))$ and $(y, f(y))$ on the graph of the function lies on or above the function value at the corresponding point, i.e., the function lies below its chords.

Definition 2.3 (Convex Function) A continuously differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is considered convex if for every $x, y \in \mathbb{R}^p$ we have $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$, where $\nabla f(x)$ is the gradient of $f$ at $x$.

A more general definition that extends to non-differentiable functions uses the notion of subgradient to replace the gradient in the above expression. A special class of convex functions is the class of strongly convex and strongly smooth functions. Figure 2.2 provides a handy visual representation of these classes of functions.
Definition 2.4 (Strongly Convex / Smooth Function) A continuously differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is considered $\alpha$-strongly convex (SC) and $\beta$-strongly smooth (SS) if for every $x, y \in \mathbb{R}^p$ we have
$$\frac{\alpha}{2}\|x - y\|_2^2 \le f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{\beta}{2}\|x - y\|_2^2.$$
By Taylor's theorem, for a twice differentiable $f$ and any $x, y \in \mathbb{R}^p$ there exists a point $z$ on the segment joining them such that
$$f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}(y - x)^\top \nabla^2 f(z)\,(y - x).$$
Thus, the strong convexity and strong smoothness conditions of Definition 2.4 amount to
$$\frac{\alpha}{2}\|x - y\|_2^2 \le \frac{1}{2}(y - x)^\top \nabla^2 f(z)\,(y - x) \le \frac{\beta}{2}\|x - y\|_2^2.$$
The Hessian matrix $\nabla^2 f$ of a twice differentiable, $\alpha$-SC and $\beta$-SS function therefore has all of its eigenvalues in the interval $[\alpha, \beta]$.

Therefore, we know that the largest eigenvalue of the Hessian of a $\beta$-SS function is at most $\beta$, while the smallest eigenvalue of the Hessian of an $\alpha$-SC function is at least $\alpha$.
It is useful to note that strong convexity places a quadratic lower bound on the growth of the function: the function must grow at least as fast as a quadratic, with the SC parameter $\alpha$ dictating how fast. Strong smoothness similarly places a quadratic upper bound: the function cannot grow faster than a quadratic, with the SS parameter $\beta$ dictating the fastest allowed growth.
These two properties are extremely useful in forcing optimization algorithms to rapidly converge to optima.
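As a concrete illustration (our own example, not from the text), consider the quadratic $f(x) = \frac{1}{2}x^\top A x$ for a symmetric positive definite matrix $A$. A direct computation gives
$$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \frac{1}{2}(y - x)^\top A\,(y - x),$$
so $f$ is $\alpha$-SC and $\beta$-SS with $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$, in agreement with the eigenvalue characterization above.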
Whereas strongly convex functions are always convex, strong smoothness does not imply convexity: strongly smooth functions may very well be non-convex. See Exercise 2.1 for an example (try it!).
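As a quick numerical sanity check (our own illustration), take $f(x) = \sin(x)$, which is 1-strongly smooth since $|f''(x)| \le 1$, yet clearly non-convex. The snippet below verifies the strong smoothness upper bound of Definition 2.4 on random pairs of points and also finds pairs violating the convexity condition of Definition 2.3.

import numpy as np

rng = np.random.default_rng(0)
f, grad = np.sin, np.cos
beta = 1.0

x, y = rng.uniform(-10, 10, size=(2, 100_000))
gap = f(y) - f(x) - grad(x) * (y - x)   # f(y) - f(x) - <grad f(x), y - x>

# beta-strong smoothness: gap <= (beta/2) * ||y - x||^2 should always hold
print(np.all(gap <= beta / 2 * (y - x) ** 2 + 1e-12))   # True
# convexity would require gap >= 0 everywhere; sin violates this
print(np.any(gap < -1e-6))                              # True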
Definition 2.5 (Lipschitz Function) A function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is $B$-Lipschitz if for every $x, y \in \mathbb{R}^p$ we have $|f(x) - f(y)| \le B \cdot \|x - y\|_2$.
By the mean value theorem (Lagrange's theorem), for a differentiable $f$ and any $x, y$ there exists a point $c$ on the segment joining them such that $f(x) - f(y) = \langle \nabla f(c), x - y \rangle$, and hence $|f(x) - f(y)| \le \|\nabla f(c)\|_2 \cdot \|x - y\|_2$.
Similar to strong smoothness, Lipschitzness places an upper bound on the growth of the function, but one that is linear (rather than quadratic) in the perturbation, i.e., $|f(x) - f(y)| \le B \cdot \|x - y\|_2$.
Lipschitz functions need not be differentiable. However, differentiable functions with bounded gradients, $\|\nabla f(x)\|_2 \le B$ for all $x$, are always $B$-Lipschitz, as the mean value argument above shows.
An important property that generalizes the behavior of convex functions on convex combinations is Jensen's inequality.
Lemma 2.1 (Jensen's Inequality) If $X$ is a random variable taking values in the domain of a convex function $f$, then $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$.
This property will be useful while analyzing iterative algorithms.
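As a small numerical illustration (our own), Lemma 2.1 can be checked by Monte Carlo for the convex function $f(x) = x^2$ and any random variable $X$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1_000_000)   # E[X] = 2, E[X^2] = 8
f = lambda x: x ** 2

print(f(X.mean()), "<=", f(X).mean())   # approx 4.0 <= approx 8.0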
Convex Projections
The projected gradient descent technique is a popular method for constrained optimization problems, both convex and non-convex. The projection step plays an important role in this technique. Given any closed set $\mathcal{C} \subseteq \mathbb{R}^p$, the projection operator $\Pi_{\mathcal{C}}(\cdot)$ is defined as
$$\Pi_{\mathcal{C}}(z) := \arg\min_{x \in \mathcal{C}} \|x - z\|_2.$$

This definition means that $\Pi_{\mathcal{C}}(z)$ is the point of $\mathcal{C}$ closest to $z$; in particular, if $z \in \mathcal{C}$ then $\Pi_{\mathcal{C}}(z) = z$.
In general, computing the projection may itself require solving an optimization problem, but for several commonly used constraint sets the projection has a simple closed form.
For instance, if $\mathcal{C} = \mathcal{B}_2(r) := \{x : \|x\|_2 \le r\}$ is the L2 ball of radius $r$, projection amounts to a rescaling: $\Pi_{\mathcal{C}}(z) = z$ if $\|z\|_2 \le r$, and $\Pi_{\mathcal{C}}(z) = r \cdot z / \|z\|_2$ otherwise.

Similarly, if $\mathcal{C}$ is an axis-aligned box $\{x : l_i \le x_i \le u_i\}$, projection amounts to clipping each coordinate: $[\Pi_{\mathcal{C}}(z)]_i = \min(\max(z_i, l_i), u_i)$.
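The two closed-form projections above can be written in a few lines of numpy; this is a minimal sketch (the function names are ours):

import numpy as np

def project_l2_ball(z, r=1.0):
    # Projection onto {x : ||x||_2 <= r}: rescale z only if it lies outside the ball.
    norm = np.linalg.norm(z)
    return z if norm <= r else (r / norm) * z

def project_box(z, lo, hi):
    # Projection onto the box {x : lo <= x <= hi}: clip each coordinate independently.
    return np.clip(z, lo, hi)

z = np.array([3.0, 4.0])
print(project_l2_ball(z, r=1.0))        # [0.6 0.8]
print(project_box(z, lo=0.0, hi=2.0))   # [2. 2.]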
Lemma 2.2 (Projection Property-O) For any set (convex or not) $\mathcal{C}$ and any $z \in \mathbb{R}^p$, let $\hat z := \Pi_{\mathcal{C}}(z)$. Then for all $x \in \mathcal{C}$, $\|\hat z - z\|_2 \le \|x - z\|_2$.
Lemma 2.3 (Projection Property-I) For any convex set $\mathcal{C}$ and any $z \in \mathbb{R}^p$, let $\hat z := \Pi_{\mathcal{C}}(z)$. Then for all $x \in \mathcal{C}$, $\langle x - \hat z, z - \hat z \rangle \le 0$.
Projection Property-I can be used to prove a very useful contraction property for convex projections. In some sense, a convex projection brings a point closer to all points in the convex set simultaneously.
Lemma 2.4 (Projection Property-II) For any convex set $\mathcal{C}$ and any $z \in \mathbb{R}^p$, let $\hat z := \Pi_{\mathcal{C}}(z)$. Then for all $x \in \mathcal{C}$, $\|\hat z - x\|_2 \le \|z - x\|_2$.
Definition 2.2 and the definition of the projection operator can be used to prove the three lemmas above.

Note that Projection Properties I and II are also called first-order properties, whereas Projection Property-O is often called a zeroth-order property.
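The three projection properties are easy to verify numerically; the following sketch (our own) checks them for the unit L2 ball, which is convex:

import numpy as np

rng = np.random.default_rng(0)

def project_l2_ball(z, r=1.0):
    norm = np.linalg.norm(z)
    return z if norm <= r else (r / norm) * z

z = 3.0 * rng.normal(size=5)                 # an arbitrary point, typically outside the ball
z_hat = project_l2_ball(z)                   # its projection
xs = rng.normal(size=(1000, 5))
xs /= np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # points inside the ball

print(np.all(np.linalg.norm(z_hat - z) <= np.linalg.norm(xs - z, axis=1) + 1e-9))           # Property-O
print(np.all((xs - z_hat) @ (z - z_hat) <= 1e-9))                                           # Property-I
print(np.all(np.linalg.norm(z_hat - xs, axis=1) <= np.linalg.norm(z - xs, axis=1) + 1e-9))  # Property-II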
Projected Gradient Descent
This is an extremely simple and efficient technique that can effortlessly scale to large problems. The projected gradient descent algorithm is stated in Algorithm 1. The procedure generates iterates by alternating a gradient step with a projection step: given the current iterate $x_t$, it computes $z_{t+1} = x_t - \eta_t \cdot \nabla f(x_t)$ and then sets $x_{t+1} = \Pi_{\mathcal{C}}(z_{t+1})$, where $\eta_t$ is the step length.
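A minimal numpy sketch of Algorithm 1, assuming the user supplies a gradient oracle grad_f and a projection operator project for the constraint set (both names are ours):

import numpy as np

def pgd(grad_f, project, x1, eta, T):
    # Repeat: gradient step z_{t+1} = x_t - eta * grad f(x_t), then projection x_{t+1} = Pi_C(z_{t+1}).
    x = np.asarray(x1, dtype=float)
    for _ in range(T):
        z = x - eta * grad_f(x)   # gradient step
        x = project(z)            # projection step
    return x

# Example: minimize f(x) = ||x - c||_2^2 over the unit L2 ball.
c = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - c)
project = lambda z: z if np.linalg.norm(z) <= 1.0 else z / np.linalg.norm(z)

print(pgd(grad_f, project, x1=np.zeros(2), eta=0.25, T=100))   # approx c / ||c||_2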

Convergence Guarantees for PGD
We will analyze PGD for objective functions that are either a) convex with bounded gradients, or b) strongly convex and strongly smooth. Let $x^* := \arg\min_{x \in \mathcal{C}} f(x)$ denote an optimum of the constrained problem; all convergence guarantees will be stated relative to $x^*$.
Convergence with Bounded Gradient Convex Functions
Consider a convex objective function $f$ whose gradients are bounded over the constraint set, i.e., $\|\nabla f(x)\|_2 \le G$ for all $x \in \mathcal{C}$.
Theorem 2.5 Let $f$ be a convex objective with bounded gradients, and let PGD be executed for $T$ time steps with step lengths $\eta_t = \eta = \frac{1}{\sqrt{T}}$. Then, for any $\varepsilon > 0$, if $T = O\!\left(\frac{1}{\varepsilon^2}\right)$, we have $\frac{1}{T}\sum_{t=1}^{T} f(x_t) \le f(x^*) + \varepsilon$.
Proof of Theorem 2.5.
Let $x^* \in \mathcal{C}$ be the constrained optimum.
Let the potential function be $\Phi_t := f(x_t) - f(x^*)$.
The statement of the theorem is then equivalent to claiming that $\frac{1}{T}\sum_{t=1}^{T}\Phi_t \le \varepsilon$ once $T = O\!\left(\frac{1}{\varepsilon^2}\right)$.
Given the convexity of $f$ (Definition 2.3), we have
$$\Phi_t = f(x_t) - f(x^*) \le \langle \nabla f(x_t), x_t - x^* \rangle.$$
In the PGD algorithm, we set $z_{t+1} = x_t - \eta \cdot \nabla f(x_t)$, so that $\nabla f(x_t) = \frac{1}{\eta}(x_t - z_{t+1})$, and hence, using the identity $2\langle a, b \rangle = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$ with $a = x_t - z_{t+1}$ and $b = x_t - x^*$,
$$\langle \nabla f(x_t), x_t - x^* \rangle = \frac{1}{2\eta}\left(\|x_t - z_{t+1}\|_2^2 + \|x_t - x^*\|_2^2 - \|z_{t+1} - x^*\|_2^2\right).$$
Given the fact that the objective function has bounded gradients, $\|x_t - z_{t+1}\|_2 = \eta\,\|\nabla f(x_t)\|_2 \le \eta G$. Applying Lemma 2.4 (Projection Property-II) to $x_{t+1} = \Pi_{\mathcal{C}}(z_{t+1})$ and $x^* \in \mathcal{C}$ gives $\|x_{t+1} - x^*\|_2 \le \|z_{t+1} - x^*\|_2$, which means that
$$\Phi_t \le \frac{1}{2\eta}\left(\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2\right) + \frac{\eta G^2}{2}.$$
Apart from the constant term $\frac{\eta G^2}{2}$, the right-hand side is a difference of successive values of the same quantity. Summing up across time steps and performing telescopic cancellations, using $\|x_{T+1} - x^*\|_2^2 \ge 0$ and writing $D := \|x_1 - x^*\|_2$ for the initial distance to the optimum, we get
$$\sum_{t=1}^{T}\Phi_t \le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}.$$
Since $\eta = \frac{1}{\sqrt{T}}$, dividing by $T$ gives
$$\frac{1}{T}\sum_{t=1}^{T}\Phi_t \le \frac{D^2 + G^2}{2\sqrt{T}},$$
so $T = O\!\left(\frac{1}{\varepsilon^2}\right)$ steps suffice, and we finally obtain the claimed result.
In case we are not sure what the value of $T$ is going to be in advance, the step lengths can instead be set adaptively as $\eta_t = \frac{1}{\sqrt{t}}$. This setting still ensures that the function value of the averaged iterate approaches $f(x^*)$ at a similar rate. Finally, note that Theorem 2.5 bounds the average function value over the iterates; there are three natural options for what the procedure may actually output.
OPTION 1 (final iterate): Output $x_{T+1}$. This is the cheapest option and doesn't require any additional operations, but the final iterate does not in general converge to the optimum for convex functions and may keep oscillating close to the optimum. However, it does converge if the objective function is strongly convex, as the next subsection shows.
OPTION 2 (average iterate): Output $\bar x := \frac{1}{T}\sum_{t=1}^{T} x_t$. This is cheaper than OPTION 3, since we do not have to perform function evaluations to find the best iterate, and convergence of the PGD algorithm follows by applying Lemma 2.1 (Jensen's inequality) to Theorem 2.5: for a convex $f$,
$$f\!\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) \le \frac{1}{T}\sum_{t=1}^{T} f(x_t) \le f(x^*) + \varepsilon.$$
Do note that Jensen's inequality may be applied only when the function $f$ is convex.
OPTION 3 (best iterate): Output $x_{\mathrm{best}} := \arg\min_{t \le T} f(x_t)$. Convergence of the PGD algorithm also holds with this construction, since $f(x_{\mathrm{best}}) \le \frac{1}{T}\sum_{t=1}^{T} f(x_t) \le f(x^*) + \varepsilon$, but identifying the best iterate requires a function evaluation at every time step.
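All three options are cheap to track inside a single PGD run; a small sketch (our own, with hypothetical names such as x_best):

import numpy as np

def pgd_with_outputs(f, grad_f, project, x1, eta, T):
    x = np.asarray(x1, dtype=float)
    x_sum = np.zeros_like(x)
    x_best, f_best = x.copy(), f(x)
    for _ in range(T):
        x = project(x - eta * grad_f(x))
        x_sum += x
        if f(x) < f_best:                     # OPTION 3 needs one function evaluation per step
            x_best, f_best = x.copy(), f(x)
    return x, x_sum / T, x_best               # OPTION 1, OPTION 2, OPTION 3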
Convergence with Strongly Convex and Smooth Functions
Theorem 2.6 Let $f$ be an $\alpha$-strongly convex and $\beta$-strongly smooth objective, and let PGD be executed with step length $\eta = \frac{1}{\beta}$. Then, for any $\varepsilon > 0$, after at most $T = O\!\left(\frac{\beta}{\alpha}\log\frac{1}{\varepsilon}\right)$ steps, we have $\|x_{T+1} - x^*\|_2^2 \le \varepsilon$.
When the objective is strongly convex and strongly smooth, PGD thus converges much faster: the number of iterations grows only logarithmically with $1/\varepsilon$, and the guarantee applies to the final iterate itself rather than to an average.
Notice that the setting of the step length, $\eta = \frac{1}{\beta}$, requires knowledge of the strong smoothness parameter $\beta$; in practice this is often replaced by a hand-tuned or line-searched step length.
Proof of Theorem 2.6.
Since $f$ is $\alpha$-strongly convex, Definition 2.4 together with the optimality of $x^*$ (which gives $\langle \nabla f(x^*), x - x^* \rangle \ge 0$ for all $x \in \mathcal{C}$) yields $\frac{\alpha}{2}\|x_t - x^*\|_2^2 \le f(x_t) - f(x^*)$ for every iterate.
Thus, any bound on the excess function value also bounds the distance of the iterate to the optimum.
Since the theorem is stated in terms of this distance, we will work with the potential $\Phi_t := \|x_t - x^*\|_2^2$.
Therefore, the statement of the theorem follows if we establish the per-step contraction $\Phi_{t+1} \le \left(1 - \frac{\alpha}{\beta}\right)\Phi_t$, which we now do.
Given the $\beta$-strong smoothness of $f$ (Definition 2.4),
$$f(x_{t+1}) - f(x_t) \le \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{\beta}{2}\|x_{t+1} - x_t\|_2^2.$$
In PGD, $x_{t+1} = \Pi_{\mathcal{C}}(z_{t+1})$ with $z_{t+1} = x_t - \frac{1}{\beta}\nabla f(x_t)$.
The above expression contains an unwieldy term, $\langle \nabla f(x_t), x_{t+1} - x_t \rangle$, which couples the gradient with the yet-to-be-computed next iterate; we will handle it using strong convexity and Projection Property-I.
Combining the above results and using $\alpha$-strong convexity at $x_t$, i.e., $f(x_t) - f(x^*) \le \langle \nabla f(x_t), x_t - x^* \rangle - \frac{\alpha}{2}\|x_t - x^*\|_2^2$, we obtain
$$f(x_{t+1}) - f(x^*) \le \langle \nabla f(x_t), x_{t+1} - x^* \rangle + \frac{\beta}{2}\|x_{t+1} - x_t\|_2^2 - \frac{\alpha}{2}\|x_t - x^*\|_2^2.$$
The above expression would be perfect for a contraction argument but for the inner product term. Given the convexity of $\mathcal{C}$, Lemma 2.3 (Projection Property-I) applied to $z_{t+1}$, its projection $x_{t+1}$, and the point $x^* \in \mathcal{C}$ gives $\langle x^* - x_{t+1}, z_{t+1} - x_{t+1} \rangle \le 0$, i.e.,
$$\langle \nabla f(x_t), x_{t+1} - x^* \rangle \le \beta\,\langle x_t - x_{t+1}, x_{t+1} - x^* \rangle.$$
Thus,
$$f(x_{t+1}) - f(x^*) \le \beta\,\langle x_t - x_{t+1}, x_{t+1} - x^* \rangle + \frac{\beta}{2}\|x_{t+1} - x_t\|_2^2 - \frac{\alpha}{2}\|x_t - x^*\|_2^2.$$
We have the identity $2\langle a, b \rangle + \|a\|_2^2 = \|a + b\|_2^2 - \|b\|_2^2$ with $a = x_t - x_{t+1}$ and $b = x_{t+1} - x^*$, which can be written as the following with the fact that $a + b = x_t - x^*$:
$$f(x_{t+1}) - f(x^*) \le \frac{\beta}{2}\left(\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2\right) - \frac{\alpha}{2}\|x_t - x^*\|_2^2.$$
Since $f(x_{t+1}) - f(x^*) \ge 0$, rearranging yields the promised contraction
$$\Phi_{t+1} = \|x_{t+1} - x^*\|_2^2 \le \left(1 - \frac{\alpha}{\beta}\right)\|x_t - x^*\|_2^2 = \left(1 - \frac{\alpha}{\beta}\right)\Phi_t.$$
Applying this result recursively gives us
$$\|x_{T+1} - x^*\|_2^2 \le \left(1 - \frac{\alpha}{\beta}\right)^{T}\|x_1 - x^*\|_2^2.$$
Since $\left(1 - \frac{\alpha}{\beta}\right)^{T} \le \exp\!\left(-\frac{\alpha T}{\beta}\right)$, taking $T = O\!\left(\frac{\beta}{\alpha}\log\frac{\|x_1 - x^*\|_2^2}{\varepsilon}\right)$ ensures $\|x_{T+1} - x^*\|_2^2 \le \varepsilon$, which proves the theorem.
Notice that for strongly convex and smooth objectives, the convergence of the PGD algorithm is of the form $\|x_{T+1} - x^*\|_2^2 \le \left(1 - \frac{\alpha}{\beta}\right)^{T}\|x_1 - x^*\|_2^2$, i.e., the error decreases geometrically; such a rate is often called linear convergence, since the error decreases linearly on a logarithmic scale.
Comparing Theorems 2.5 and 2.6, the latter is clearly preferable whenever it applies: reaching an $\varepsilon$-accurate solution takes $O\!\left(\frac{1}{\varepsilon^2}\right)$ iterations for general convex functions with bounded gradients, but only $O\!\left(\frac{\beta}{\alpha}\log\frac{1}{\varepsilon}\right)$ iterations for strongly convex and smooth functions. The latter rate, however, depends on the ratio $\beta/\alpha$, which motivates the notion of the condition number.
Definition 2.6 (Condition Number - Informal) The condition number of a differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is the ratio $\kappa = \beta/\alpha$, where $\beta$ is the smallest strong smoothness constant and $\alpha$ is the largest strong convexity constant of $f$.
Functions with a small condition number are stable and changes to their input do not affect the function output values too much. However, functions with a large condition number can be quite jumpy and experience abrupt changes in output values even if the input is changed slightly.
Assume that a differentiable function $f$ is $\alpha$-strongly convex and $\beta$-strongly smooth, with condition number $\kappa = \beta/\alpha$, and let $x^*$ be its (unconstrained) global minimum, so that $\nabla f(x^*) = 0$.
Thus, upon perturbing the input from the global minimum $x^*$ to $x^* + \delta$, Definition 2.4 gives
$$\frac{\alpha}{2}\|\delta\|_2^2 \le f(x^* + \delta) - f(x^*) \le \frac{\beta}{2}\|\delta\|_2^2,$$
i.e., depending on the direction of the perturbation, the resulting change in function value can differ by a factor as large as $\kappa$.
The condition number of the objective function can significantly affect the convergence rate of algorithms. Indeed, if $\kappa = \beta/\alpha$ is large, the $O\!\left(\kappa\log\frac{1}{\varepsilon}\right)$ bound in Theorem 2.6 becomes weak and gradient-based methods may need many iterations to converge, whereas a condition number close to 1 yields rapid convergence.
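The effect of the condition number is easy to see experimentally; the sketch below (our own) runs gradient descent with $\eta = 1/\beta$ on the quadratic $f(x) = \frac{1}{2}x^\top \mathrm{diag}(1, \kappa)\,x$, which is 1-SC and $\kappa$-SS, and counts the iterations needed to reach a fixed accuracy:

import numpy as np

def steps_to_converge(kappa, tol=1e-6, max_steps=100_000):
    diag = np.array([1.0, kappa])            # Hessian eigenvalues: alpha = 1, beta = kappa
    grad = lambda x: diag * x
    x = np.array([1.0, 1.0])
    eta = 1.0 / kappa                        # step length 1 / beta
    for t in range(1, max_steps + 1):
        x = x - eta * grad(x)
        if np.linalg.norm(x) ** 2 <= tol:    # squared distance to the optimum x* = 0
            return t
    return max_steps

for kappa in [1, 10, 100, 1000]:
    print(kappa, steps_to_converge(kappa))   # iteration counts grow roughly linearly with kappa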