Non-Convex Optimization: Chapter 2 Notes

Chapter 2 Mathematical Tools

Convex Analysis

Definition 2.1 (Convex Combination) A convex combination of a set of $n$ vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, in an arbitrary real space is a vector $x_\theta := \sum_{i=1}^n \theta_i \cdot x_i$, where $\theta = (\theta_1, \ldots, \theta_n)$, $\theta_i \geq 0$, and $\sum_{i=1}^n \theta_i = 1$.

A set that is closed under arbitrary convex combinations is a convex set. Geometrically speaking, convex sets are those that contain all line segments that join two points inside the set. As a result, they cannot have any inward “bulges”.

Definition 2.2 (Convex Set) A set $C \subseteq \mathbb{R}^p$ is considered convex if for every $x, y \in C$ and $\lambda \in [0, 1]$, we have $(1 - \lambda) \cdot x + \lambda \cdot y \in C$ as well.

Note that this condition constrains the variable (input) values: every point on the segment joining $x$ and $y$ must itself lie in $C$.

Figure 2.1 gives visual representations of prototypical convex and non-convex sets.

A convex function can also be defined without derivatives: $f$ is convex if, for every $x, y \in \mathbb{R}^p$ and $\lambda \in [0, 1]$, we have $f((1 - \lambda) \cdot x + \lambda \cdot y) \leq (1 - \lambda) \cdot f(x) + \lambda \cdot f(y)$.

This means that the chord joining any two points on the graph of $f$ lies above the graph: at every input along the segment, the value on the chord upper bounds the function value. In contrast to Definition 2.2, which constrains the variable values, this condition constrains the function values.

Definition 2.3 (Convex Function) A continuously differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is considered convex if for every $x, y \in \mathbb{R}^p$ we have $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle$, where $\nabla f(x)$ is the gradient of $f$ at $x$.

A more general definition that extends to non-differentiable functions uses the notion of subgradient to replace the gradient in the above expression. A special class of convex functions is the class of strongly convex and strongly smooth functions. Figure 2.2 provides a handy visual representation of these classes of functions.

Definition 2.4 (Strongly Convex / Smooth Function) A continuously differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is considered $\alpha$-strongly convex (SC) and $\beta$-strongly smooth (SS) if for every $x, y \in \mathbb{R}^p$ we have
$$\frac{\alpha}{2} \|x - y\|_2^2 \leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle \leq \frac{\beta}{2} \|x - y\|_2^2.$$

By Taylor's theorem, for some $\zeta$ on the segment joining $x$ and $y$, we have
$$f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} (y - x)^\top \nabla^2 f(\zeta) (y - x).$$

Thus,
$$\frac{\alpha}{2} \|x - y\|_2^2 \leq \frac{1}{2} (y - x)^\top \nabla^2 f(\zeta) (y - x) \leq \frac{\beta}{2} \|x - y\|_2^2.$$

The Hessian matrix of $f$ is the square matrix of second-order partial derivatives, usually defined as follows:
$$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j}.$$

Therefore, for twice-differentiable $f$, we know that the largest eigenvalue of the Hessian of $f$ is uniformly upper bounded by $\beta$, while the smallest eigenvalue is uniformly lower bounded by $\alpha$; equivalently, $\alpha \cdot I \preceq \nabla^2 f(x) \preceq \beta \cdot I$ for all $x$.

It is useful to note that strong convexity places a quadratic lower bound on the function: the function must grow at least as fast as a quadratic, with the SC parameter $\alpha$ dictating the lower limit. Similarly, strong smoothness places a quadratic upper bound and does not let the function grow too fast, with the SS parameter $\beta$ dictating the upper limit.
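To make these bounds concrete, here is a minimal numeric check (my own sketch in Python/numpy, not from the book): for a quadratic $f(x) = \frac{1}{2} x^\top A x$ with positive definite $A$, the SC/SS constants are exactly the extreme eigenvalues of the Hessian $A$.

```python
import numpy as np

# A fixed positive definite matrix makes f(x) = 0.5 * x^T A x
# alpha-SC and beta-SS with alpha = lambda_min(A), beta = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)          # positive definite Hessian

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]  # smallest / largest eigenvalue

# Check the sandwich inequality of Definition 2.4 on random pairs (x, y).
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    gap = f(y) - f(x) - grad(x) @ (y - x)     # Bregman divergence of f
    d2 = np.sum((x - y) ** 2)
    assert alpha / 2 * d2 - 1e-9 <= gap <= beta / 2 * d2 + 1e-9
print(f"alpha={alpha:.3f}, beta={beta:.3f}: sandwich holds on all samples")
```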

These two properties are extremely useful in forcing optimization algorithms to rapidly converge to optima.

Whereas strongly convex functions are necessarily convex, strong smoothness does not imply convexity. Strongly smooth functions may very well be non-convex.

An explanation can be found in Exercise 2.1. TRY IT!
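(Spoiler for the exercise, in case you get stuck: one possible example, my own choice rather than the book's, is $f(x) = -\|x\|_2^2$, which is strongly smooth yet concave. A quick numeric check:)

```python
import numpy as np

# f(x) = -||x||_2^2 is 2-strongly smooth: its Bregman gap
# f(y) - f(x) - <grad f(x), y - x> equals -||y - x||^2, which is
# <= (2/2) * ||y - x||^2; but convexity would require the gap to be >= 0.
rng = np.random.default_rng(1)
f = lambda x: -np.sum(x ** 2)
grad = lambda x: -2 * x

gaps = []
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    gap = f(y) - f(x) - grad(x) @ (y - x)
    assert gap <= np.sum((y - x) ** 2) + 1e-9   # the SS upper bound holds
    gaps.append(gap)
print("max gap =", max(gaps))  # negative: the convexity lower bound fails
```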

Definition 2.5 (Lipschitz Function) A function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is $B$-Lipschitz if for every $x, y \in \mathbb{R}^p$ we have
$$|f(x) - f(y)| \leq B \cdot \|x - y\|_2.$$

By the mean value theorem (Lagrange's theorem), for some $\zeta$ on the segment joining $x$ and $y$, we have
$$f(x) - f(y) = \langle \nabla f(\zeta), x - y \rangle.$$

Thus $|f(x) - f(y)| \leq \|\nabla f(\zeta)\|_2 \cdot \|x - y\|_2$ by the Cauchy-Schwarz inequality. So, analogously to the way $\beta$-SS uniformly bounds the eigenvalues of the Hessian of $f$, $B$-Lipschitzness corresponds to the norm of the gradient of $f$ being uniformly upper bounded by $B$.

Lipschitzness places an upper bound on the growth of the function that is linear in the perturbation, i.e., $|f(x) - f(y)| \leq B \cdot \|x - y\|_2$, whereas strong smoothness places a quadratic upper bound.

Lipschitz functions need not be differentiable. However, differentiable functions with bounded gradients are always Lipschitz.
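A quick numeric illustration of the latter claim (a sketch; the function $f(x) = \sum_i \sqrt{1 + x_i^2}$ is my own choice): its partial derivatives $x_i / \sqrt{1 + x_i^2}$ lie in $(-1, 1)$, so $\|\nabla f(x)\|_2 \leq \sqrt{p}$ and $f$ is $\sqrt{p}$-Lipschitz.

```python
import numpy as np

# f(x) = sum(sqrt(1 + x_i^2)) has |grad_i| < 1, hence ||grad f||_2 <= sqrt(p);
# by the mean value theorem, f is then B-Lipschitz with B = sqrt(p).
rng = np.random.default_rng(2)
p = 4
f = lambda x: np.sum(np.sqrt(1 + x ** 2))
B = np.sqrt(p)

for _ in range(1000):
    x, y = 10 * rng.standard_normal(p), 10 * rng.standard_normal(p)
    assert abs(f(x) - f(y)) <= B * np.linalg.norm(x - y) + 1e-9
print("bounded gradients imply Lipschitzness")
```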

An important property that generalizes the behavior of convex functions on convex combinations is Jensen's inequality.

Lemma 2.1 (Jensen's Inequality) If $X$ is a random variable taking values in the domain of a convex function $f$, then
$$\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).$$

This property will be useful while analyzing iterative algorithms.
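A quick Monte Carlo sanity check of Lemma 2.1 (my own sketch; the choice of $f$ and the distribution of $X$ are arbitrary):

```python
import numpy as np

# Monte Carlo check of Jensen's inequality E[f(X)] >= f(E[X]) for convex f.
rng = np.random.default_rng(3)
f = lambda x: np.sum(x ** 2)                      # a simple convex function

X = rng.standard_normal((100_000, 2)) + np.array([1.0, -2.0])
lhs = np.mean(np.sum(X ** 2, axis=1))             # estimate of E[f(X)]
rhs = f(X.mean(axis=0))                           # f(E[X])
print(f"E[f(X)] = {lhs:.3f} >= f(E[X]) = {rhs:.3f}")
assert lhs >= rhs
```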

Convex Projections

The projected gradient descent technique is a popular method for constrained optimization problems, both convex and non-convex. The projection step plays an important role in this technique. Given any closed set $C \subseteq \mathbb{R}^p$, the projection operator $\Pi_C(\cdot)$ is defined as
$$\Pi_C(z) := \arg\min_{x \in C} \|x - z\|_2.$$

This definition means that if $z$ is outside $C$, it gets mapped to the closest point of $C$, while if $z$ is already in $C$, nothing needs to be changed: $\Pi_C(z) = z$.

In general, the $L_2$-norm is not the only way to define projections, but it is the most common. If $C$ is a convex set, the above problem reduces to a convex optimization problem.

For instance, if $C = B_2(1)$, i.e., the unit $L_2$ ball, then projection is equivalent to a normalization step:
$$\Pi_C(z) = \begin{cases} z / \|z\|_2 & \text{if } \|z\|_2 > 1 \\ z & \text{otherwise.} \end{cases}$$
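In code, this projection is essentially a one-liner (a sketch; the function name is mine):

```python
import numpy as np

def project_l2_ball(z, radius=1.0):
    """Projection onto the L2 ball: normalize z if it lies outside the ball."""
    norm = np.linalg.norm(z)
    return z.copy() if norm <= radius else (radius / norm) * z

print(project_l2_ball(np.array([3.0, 4.0])))  # -> [0.6, 0.8]
```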

For instance, if $C = B_1(1)$, i.e., the unit $L_1$ ball, the projection step reduces to the popular soft thresholding operation, which finds the minimizer of an objective function that involves data fitting in an $L_2$ sense as well as minimization of the $L_1$ norm (i.e., absolute values). If $z \notin B_1(1)$, then $\hat{z}_i = \mathrm{sign}(z_i) \cdot \max\{|z_i| - \theta, 0\}$, where $\theta$ is a threshold that can be decided by a sorting operation on the vector.
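A standard way to implement this (a sketch following the sorting-based scheme of Duchi et al. (2008); the function name is mine):

```python
import numpy as np

def project_l1_ball(z, radius=1.0):
    """Project z onto the L1 ball via soft thresholding, with the threshold
    theta found by a sorting operation (cf. Duchi et al., 2008)."""
    if np.abs(z).sum() <= radius:
        return z.copy()                       # already inside the ball
    u = np.sort(np.abs(z))[::-1]              # sorted magnitudes, descending
    cumsum = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = k[u > (cumsum - radius) / k].max()  # last index where the
                                              # candidate threshold is valid
    theta = (cumsum[rho - 1] - radius) / rho
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)
```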

Lemma 2.2 (Projection Property-O) For any set $C \subseteq \mathbb{R}^p$ (convex or not) and any $z \in \mathbb{R}^p$, let $\hat{z} = \Pi_C(z)$. Then $\|\hat{z} - z\|_2 \leq \|x - z\|_2$ for all $x \in C$.

Lemma 2.3 (Projection Property-I) For any convex set $C \subseteq \mathbb{R}^p$ and any $z \in \mathbb{R}^p$, let $\hat{z} = \Pi_C(z)$. Then $\langle x - \hat{z}, z - \hat{z} \rangle \leq 0$ for all $x \in C$.

Projection Property-I can be used to prove a very useful contraction property for convex projections. In some sense, a convex projection brings a point closer to all points in the convex set simultaneously.

Lemma 2.4 (Projection Property-II) For any convex set $C \subseteq \mathbb{R}^p$ and any $z \in \mathbb{R}^p$, let $\hat{z} = \Pi_C(z)$. Then $\|\hat{z} - x\|_2 \leq \|z - x\|_2$ for all $x \in C$.

Definition 2.2, together with the definition of the projection operator, can be used to prove the three lemmata above.

Note that Projection Properties-I and II are also called first order properties, and Projection Property-O is often called a zeroth order property.
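These three properties are easy to sanity-check numerically; here is a minimal sketch using the unit $L_2$ ball as $C$:

```python
import numpy as np

# Numerically sanity-check Properties O, I, II for the unit L2 ball.
rng = np.random.default_rng(4)
proj = lambda z: z / max(1.0, np.linalg.norm(z))   # projection onto B_2(1)

for _ in range(1000):
    z = 3 * rng.standard_normal(3)       # a point, possibly outside the ball
    zhat = proj(z)
    x = rng.standard_normal(3)
    x /= max(1.0, np.linalg.norm(x))     # an arbitrary point inside the ball
    # Property-O: zhat is the closest point of C to z.
    assert np.linalg.norm(zhat - z) <= np.linalg.norm(x - z) + 1e-9
    # Property-I: <x - zhat, z - zhat> <= 0.
    assert (x - zhat) @ (z - zhat) <= 1e-9
    # Property-II: projection brings z closer to every x in C.
    assert np.linalg.norm(zhat - x) <= np.linalg.norm(z - x) + 1e-9
print("Properties O, I and II hold on all samples")
```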

Projected Gradient Descent

This is an extremely simple and efficient technique that can effortlessly scale to large problems. The projected gradient descent algorithm is stated in Algorithm 1. The procedure generates iterates $x^t$ by taking steps guided by the gradient in an effort to reduce the function value locally. Finally, it returns either the final iterate, the average iterate, or the best iterate.
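Since the listing of Algorithm 1 is not reproduced in these notes, here is a minimal Python sketch of it (the function and parameter names are my own):

```python
import numpy as np

def projected_gradient_descent(grad, project, x1, T, eta):
    """Algorithm 1 (PGD): z^{t+1} = x^t - eta * grad f(x^t)  (Line 3),
    x^{t+1} = Pi_C(z^{t+1})  (Line 4). Returns all iterates so the
    caller can pick the final, average, or best one."""
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for _ in range(T):
        z = x - eta * grad(x)   # gradient step
        x = project(z)          # projection step
        iterates.append(x)
    return iterates
```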

Convergence Guarantees for PGD

We will analyze PGD for objective functions that are either (a) convex with bounded gradients, or (b) strongly convex and strongly smooth. Let $f^* = f(x^*)$ be the optimal value of the optimization problem. A point $x \in C$ will be said to be an $\epsilon$-optimal solution if $f(x) \leq f^* + \epsilon$.

Convergence with Bounded Gradient Convex Functions

Consider a convex objective function $f$ with bounded gradients over a convex constraint set $C$, i.e., $\|\nabla f(x)\|_2 \leq G$ for all $x \in C$.

Theorem 2.5 Let $f$ be a convex objective with bounded gradients and Algorithm 1 be executed for $T$ time steps with step lengths $\eta_t = \eta = \frac{1}{\sqrt{T}}$. Then, for any $\epsilon > 0$, if $T = O\left(\frac{1}{\epsilon^2}\right)$, then $\frac{1}{T} \sum_{t=1}^T f(x^t) \leq f(x^*) + \epsilon$.

The proof of Theorem 2.5

Let $x^* = \arg\min_{x \in C} f(x)$; such a point always exists if the constraint set is closed and bounded and the objective function is continuous.

Let the potential function be $\Phi_t = f(x^t) - f(x^*)$, which measures the sub-optimality of the $t$-th iterate.

Then the statement of the theorem is equivalent to claiming that $\frac{1}{T} \sum_{t=1}^T \Phi_t \leq \epsilon$.

Convexity is a global property and is very useful for upper bounding the sub-optimality of the current iterate in such analyses. We apply it to upper bound the potential function at every step.

Applying the convexity of $f$ (Definition 2.3) gives
$$\Phi_t = f(x^t) - f(x^*) \leq \langle \nabla f(x^t), x^t - x^* \rangle.$$

In the PGD algorithm, we set $z^{t+1} = x^t - \eta \cdot \nabla f(x^t)$, thus
$$\langle \nabla f(x^t), x^t - x^* \rangle = \frac{1}{\eta} \langle x^t - z^{t+1}, x^t - x^* \rangle = \frac{1}{2\eta} \left( \|x^t - x^*\|_2^2 + \|x^t - z^{t+1}\|_2^2 - \|z^{t+1} - x^*\|_2^2 \right),$$
where the second equality uses the identity $2 \langle a, b \rangle = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2$.

And given the fact that the objective function has bounded gradients, i.e., $\|\nabla f(x)\|_2 \leq G$ for all $x \in C$, we get
$$\|x^t - z^{t+1}\|_2^2 = \eta^2 \|\nabla f(x^t)\|_2^2 \leq \eta^2 G^2.$$

Applying Lemma 2.4 with $x = x^*$ (since, per PGD (Line 4), $x^{t+1}$ is the result of projecting $z^{t+1}$ onto $C$), we have
$$\|z^{t+1} - x^*\|_2 \geq \|x^{t+1} - x^*\|_2.$$

Which means that
$$-\|z^{t+1} - x^*\|_2^2 \leq -\|x^{t+1} - x^*\|_2^2.$$

Thus,
$$\Phi_t \leq \frac{\eta G^2}{2} + \frac{1}{2\eta} \left( \|x^t - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2 \right).$$

Apart from the first term $\frac{\eta G^2}{2}$, which is small since $\eta = \frac{1}{\sqrt{T}}$, the second term is small if the consecutive iterates $x^t$ and $x^{t+1}$ are close to each other (and hence similar in distance from $x^*$). This observation tells us that once PGD stops making a lot of progress, it has actually converged to the optimum. Indeed, since the step length is constant, only a vanishing gradient can cause PGD to stop progressing, and for convex functions this happens only at global optima.

Summing up across time steps, performing telescopic cancellations, using $\|x^{T+1} - x^*\|_2^2 \geq 0$, and dividing throughout by $T$ gives us
$$\frac{1}{T} \sum_{t=1}^T \Phi_t \leq \frac{\eta G^2}{2} + \frac{\|x^1 - x^*\|_2^2}{2 \eta T}.$$

Since $\eta = \frac{1}{\sqrt{T}}$ and $\|x^1 - x^*\|_2$ is a constant,
$$\frac{1}{T} \sum_{t=1}^T \Phi_t \leq \frac{G^2 + \|x^1 - x^*\|_2^2}{2\sqrt{T}},$$
which is at most $\epsilon$ for $T = O\left(\frac{1}{\epsilon^2}\right)$.

This finally gives the claimed result $\frac{1}{T} \sum_{t=1}^T f(x^t) \leq f(x^*) + \epsilon$.

Note that the setting of the step length $\eta = \frac{1}{\sqrt{T}}$ depends on the total number of iterations $T$ for which the PGD algorithm will be executed. This is called a horizon-aware setting of the step length. In case we are not sure what the value of $T$ would be, a horizon-oblivious setting such as $\eta_t = \frac{1}{\sqrt{t}}$ can be used instead.

This setting ensures that the function value of the iterates approaches $f(x^*)$ on average, which proves the convergence of the PGD algorithm.

OPTION 1 (final iterate): It is the cheapest option and does not require any additional operations, but in general it does not converge to the optimum for convex functions and may oscillate close to the optimum. However, it does converge if the objective function is strongly smooth, since strongly smooth functions may not grow at a faster-than-quadratic rate.

OPTION 2 (average iterate): It is cheaper than OPTION 3, since we do not have to perform function evaluations to find the best iterate. Its convergence follows by applying Jensen's inequality (Lemma 2.1) to Theorem 2.5: for $\hat{x} = \frac{1}{T} \sum_{t=1}^T x^t$, we have
$$f(\hat{x}) \leq \frac{1}{T} \sum_{t=1}^T f(x^t) \leq f(x^*) + \epsilon.$$

And do note that Jensen’s inequality may be applied only when the function is convex.

OPTION 3 (best iterate): Its convergence also follows from the construction $x_{\text{best}} = \arg\min_{t \leq T} f(x^t)$ and Theorem 2.5:
$$f(x_{\text{best}}) = \min_{t \leq T} f(x^t) \leq \frac{1}{T} \sum_{t=1}^T f(x^t) \leq f(x^*) + \epsilon.$$
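Putting the three options together on a toy problem (my own sketch, mirroring Algorithm 1 inline): $f(x) = \|x - b\|_2$ over the unit $L_2$ ball is convex with $G = 1$, and since $\|b\|_2 > 1$ its optimum is $x^* = b / \|b\|_2$.

```python
import numpy as np

# Toy problem: minimize f(x) = ||x - b||_2 over the unit L2 ball.
b = np.array([2.0, 1.0])
f = lambda x: np.linalg.norm(x - b)
grad = lambda x: (x - b) / np.linalg.norm(x - b)   # ||grad f|| = 1, so G = 1
proj = lambda z: z / max(1.0, np.linalg.norm(z))

T = 10_000
eta = 1.0 / np.sqrt(T)              # horizon-aware step length of Theorem 2.5
x, iterates = np.zeros(2), []
for _ in range(T):
    x = proj(x - eta * grad(x))
    iterates.append(x)

xs = np.array(iterates)
f_star = f(b / np.linalg.norm(b))   # optimal value
print("final  :", f(xs[-1]) - f_star)            # OPTION 1
print("average:", f(xs.mean(axis=0)) - f_star)   # OPTION 2 (via Jensen)
print("best   :", min(map(f, xs)) - f_star)      # OPTION 3
```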

Convergence with Strongly Convex and Smooth Functions

Theorem 2.6 Let $f$ be an objective that satisfies the $\alpha$-SC and $\beta$-SS properties. Let Algorithm 1 be executed with step lengths $\eta_t = \eta = \frac{1}{\beta}$. Then after at most $T = O\left(\frac{\beta}{\alpha} \log \frac{1}{\epsilon}\right)$ steps, we have $f(x^T) \leq f(x^*) + \epsilon$.

When the objective is $\alpha$-SC/$\beta$-SS, the final iterate itself converges. The rate of convergence is also accelerated by these properties: for general convex functions, PGD requires $O\left(\frac{1}{\epsilon^2}\right)$ iterations to reach an $\epsilon$-optimal solution, whereas it requires only $O\left(\log \frac{1}{\epsilon}\right)$ iterations for SC/SS functions.

Notice that the step length setting $\eta = \frac{1}{\beta}$ in this case is chosen only to simplify the calculations. In practice, the step length is tuned globally by doing a grid search over several $\eta$ values, or per-iteration using line search mechanisms, to obtain a step length value that assures good convergence rates.
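A quick illustration of the linear rate (my own sketch on an unconstrained strongly convex quadratic, so the projection step is the identity):

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x is alpha-SC / beta-SS, with alpha and beta the
# extreme eigenvalues of A. With eta = 1/beta, the suboptimality of the
# final iterate should decay geometrically, as Theorem 2.6 predicts.
rng = np.random.default_rng(5)
M = rng.standard_normal((20, 20))
A = M @ M.T + np.eye(20)             # positive definite Hessian
b = rng.standard_normal(20)

alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
x_star = np.linalg.solve(A, b)       # unconstrained optimum
f = lambda x: 0.5 * x @ A @ x - b @ x

x = np.zeros(20)
for t in range(2001):
    if t % 500 == 0:
        print(f"t={t:4d}  f(x) - f* = {f(x) - f(x_star):.3e}")
    x = x - (1.0 / beta) * (A @ x - b)   # eta = 1/beta, projection = identity
```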

The proof of Theorem 2.6

Since $f$ satisfies the $\alpha$-SC and $\beta$-SS properties, for every $x, y \in \mathbb{R}^p$ we have
$$\frac{\alpha}{2} \|x - y\|_2^2 \leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle \leq \frac{\beta}{2} \|x - y\|_2^2.$$

Thus, applying the SS upper bound with $x = x^*$ and $y = x^T$,
$$f(x^T) - f(x^*) \leq \langle \nabla f(x^*), x^T - x^* \rangle + \frac{\beta}{2} \|x^T - x^*\|_2^2.$$

Since $x^*$ is an optimal point for the constrained optimization problem with a convex constraint set $C$, the first order optimality condition gives us $\langle \nabla f(x^*), x - x^* \rangle \geq 0$ for any $x \in C$. Applying the SC lower bound with $x = x^*$, $y = x^T$ along with this condition gives us
$$f(x^T) - f(x^*) \geq \frac{\alpha}{2} \|x^T - x^*\|_2^2.$$

Therefore, the sub-optimality of $x^T$ is controlled by its distance to $x^*$ (the inner product term above is bounded by $\|\nabla f(x^*)\|_2 \cdot \|x^T - x^*\|_2$ via Cauchy-Schwarz, with $\|\nabla f(x^*)\|_2$ a fixed constant). The statement of the theorem is thus equivalent to claiming that $\|x^T - x^*\|_2$ becomes sufficiently small, at which point $x^T$ is an $\epsilon$-optimal point.

Given the $\beta$-SS property,
$$f(x^{t+1}) \leq f(x^t) + \langle \nabla f(x^t), x^{t+1} - x^t \rangle + \frac{\beta}{2} \|x^{t+1} - x^t\|_2^2.$$

In PGD, $z^{t+1} = x^t - \frac{1}{\beta} \nabla f(x^t)$, i.e., $\nabla f(x^t) = \beta (x^t - z^{t+1})$, thus
$$f(x^{t+1}) \leq f(x^t) + \beta \langle x^t - z^{t+1}, x^{t+1} - x^t \rangle + \frac{\beta}{2} \|x^{t+1} - x^t\|_2^2 = f(x^t) + \frac{\beta}{2} \left( \|x^{t+1} - z^{t+1}\|_2^2 - \|x^t - z^{t+1}\|_2^2 \right).$$

The above expression contains an unwieldy term $z^{t+1}$. Since this term only appears due to the projection step, we eliminate it by applying Projection Property-I (Lemma 2.3) with $x = x^*$, i.e., $\langle x^* - x^{t+1}, z^{t+1} - x^{t+1} \rangle \leq 0$, to get
$$\|x^{t+1} - z^{t+1}\|_2^2 \leq \|x^* - z^{t+1}\|_2^2 - \|x^* - x^{t+1}\|_2^2.$$

Combining the above results and using $\eta = \frac{1}{\beta}$, i.e., $\nabla f(x^t) = \beta (x^t - z^{t+1})$, once more, we have
$$f(x^{t+1}) \leq f(x^t) + \langle \nabla f(x^t), x^* - x^t \rangle + \frac{\beta}{2} \left( \|x^* - x^t\|_2^2 - \|x^* - x^{t+1}\|_2^2 \right).$$

The above expression would be perfect for a telescoping step but for the inner product term. Given the $\alpha$-SC property, this term can be eliminated: $\langle \nabla f(x^t), x^* - x^t \rangle \leq f(x^*) - f(x^t) - \frac{\alpha}{2} \|x^* - x^t\|_2^2$.

Thus,
$$\Phi_{t+1} = f(x^{t+1}) - f(x^*) \leq \frac{\beta - \alpha}{2} \|x^t - x^*\|_2^2 - \frac{\beta}{2} \|x^{t+1} - x^*\|_2^2.$$

We have $\Phi_{t+1} = f(x^{t+1}) - f(x^*) \geq 0$, which means that
$$\frac{\beta}{2} \|x^{t+1} - x^*\|_2^2 \leq \frac{\beta - \alpha}{2} \|x^t - x^*\|_2^2.$$

Which can be written as the following, with the fact that $1 - x \leq e^{-x}$ for all $x$:
$$\|x^{t+1} - x^*\|_2^2 \leq \left(1 - \frac{\alpha}{\beta}\right) \|x^t - x^*\|_2^2 \leq e^{-\alpha/\beta} \|x^t - x^*\|_2^2.$$

Applying this result recursively gives us
$$\|x^{T+1} - x^*\|_2^2 \leq e^{-\alpha T / \beta} \|x^1 - x^*\|_2^2.$$

Since $\|x^1 - x^*\|_2$ is a fixed constant, the right-hand side drops below any desired level after at most $T = O\left(\frac{\beta}{\alpha} \log \frac{1}{\epsilon}\right)$ steps, which (by the reduction above) gives $f(x^T) \leq f(x^*) + \epsilon$ and finishes the proof.

Notice that the convergence of the PGD algorithm is of the form $\exp\left(-\frac{\alpha T}{\beta}\right)$. The number $\kappa = \frac{\beta}{\alpha}$ is the condition number of the optimization problem, a notion that is central to numerical optimization. The exact numerical form of the condition number will change depending on the application at hand. In general, all these definitions of condition number will satisfy the following property.

Comparing Theorems 2.5 and 2.6, we may prefer the latter setting, because it converges faster than the former.

Definition 2.6 (Condition Number - Informal) The condition number of a function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is a scalar $\kappa$ that bounds how much the function value can change relative to a perturbation of the input.

Functions with a small condition number are stable and changes to their input do not affect the function output values too much. However, functions with a large condition number can be quite jumpy and experience abrupt changes in output values even if the input is changed slightly.

Assume that a differentiable function $f$ is also $\alpha$-SC and $\beta$-SS, and let $x^*$ be a stationary point for $f$, i.e., a point such that $\nabla f(x^*) = 0$. For a general function, such a point can be a local optimum or a saddle point; however, since $f$ is strongly convex, $x^*$ is the (unique) global minimum of $f$. Then for any other point $x$ we have
$$\frac{\alpha}{2} \|x - x^*\|_2^2 \leq f(x) - f(x^*) \leq \frac{\beta}{2} \|x - x^*\|_2^2.$$
Thus, upon perturbing the input from the global minimum $x^*$ to a point a distance $\epsilon$ away, the function value changes by at most $\frac{\beta}{2} \epsilon^2$ (and by at least $\frac{\alpha}{2} \epsilon^2$). Such a well-behaved response to perturbations is very easy for optimization algorithms to exploit to give fast convergence.

The condition number of the objective function can significantly affect the convergence rate of algorithms. Indeed, if $\kappa = \frac{\beta}{\alpha}$ is small, then $\exp\left(-\frac{\alpha T}{\beta}\right) = \exp\left(-\frac{T}{\kappa}\right)$ decays quickly, ensuring fast convergence. However, if $\kappa$ is large, the exponent is close to zero and the procedure might offer slow convergence; such problems are called ill-conditioned. In this case, preconditioning is a useful technique.
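The effect is easy to see on quadratics, where $\kappa = \frac{\beta}{\alpha}$ is the ratio of the extreme Hessian eigenvalues (my own sketch; the diagonal preconditioner below is a standard choice, not one prescribed by the text):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x with well- vs ill-conditioned A.
def gd_steps_to_eps(A, eps=1e-6):
    """Run GD with eta = 1/beta until f(x) <= eps; return (steps, kappa)."""
    alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
    x = np.ones(A.shape[0])
    for t in range(1, 1_000_000):
        x = x - (1.0 / beta) * (A @ x)
        if 0.5 * x @ A @ x <= eps:
            return t, beta / alpha
    return None, beta / alpha

well = np.diag([1.0, 2.0])      # kappa = 2
ill = np.diag([1.0, 1000.0])    # kappa = 1000
for A in (well, ill):
    steps, kappa = gd_steps_to_eps(A)
    print(f"kappa = {kappa:7.1f}: converged in {steps} iterations")

# Preconditioning: rescaling coordinates by D = diag(A)^{-1/2} turns the
# ill-conditioned Hessian into the identity, i.e., kappa = 1.
D = np.diag(1.0 / np.sqrt(np.diag(ill)))
print("preconditioned kappa =", np.linalg.cond(D @ ill @ D))
```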