How to do constrained optimization

Ronny Bergmann

This tutorial is a short introduction to using solvers for constraint optimisation in Manopt.jl.

Introduction

A constraint optimisation problem is given by

\[\tag{P} \begin{align*} \operatorname*{arg\,min}_{p∈\mathcal M} & f(p)\\ \text{such that} &\quad g(p) \leq 0\\ &\quad h(p) = 0,\\ \end{align*}\]

where $f: \mathcal M → ℝ$ is a cost function, and $g: \mathcal M → ℝ^m$ and $h: \mathcal M → ℝ^n$ are the inequality and equality constraints, respectively. The $\leq$ and $=$ in (P) are meant element-wise.

This can be seen as a balance between moving constraints into the geometry of a manifold $\mathcal M$ and keeping some, since they can be handled well in algorithms, see [BH19], [LB19] for details.

using Distributions, LinearAlgebra, Manifolds, Manopt, Random
Random.seed!(42);

In this tutorial we want to look at different ways to specify the problem and its implications. We start with specifying an example problems to illustrate the different available forms.

We consider the problem of a Nonnegative PCA, cf. Section 5.1.2 in [LB19]

let $v_0 ∈ ℝ^d$, $\lVert v_0 \rVert=1$ be given spike signal, that is a signal that is sparse with only $s=\lfloor δd \rfloor$ nonzero entries.

\[Z = \sqrt{σ} v_0v_0^{\mathrm{T}}+N,\]

where $\sigma$ is a signal-to-noise ratio and $N$ is a matrix with random entries, where the diagonal entries are distributed with zero mean and standard deviation $1/d$ on the off-diagonals and $2/d$ on the diagonal

d = 150; # dimension of v0
σ = 0.1^2; # SNR
δ = 0.1; sp = Int(floor(δ * d)); # Sparsity
S = sample(1:d, sp; replace=false);
v0 =  [i ∈ S ? 1 / sqrt(sp) : 0.0 for i in 1:d];
N = rand(Normal(0, 1 / d), (d, d)); N[diagind(N, 0)] .= rand(Normal(0, 2 / d), d);
Z = Z = sqrt(σ) * v0 * transpose(v0) + N;

In order to recover $v_0$ we consider the constrained optimisation problem on the sphere $\mathcal S^{d-1}$ given by

\[\begin{align*} \operatorname*{arg\,min}_{p∈\mathcal S^{d-1}} & -p^{\mathrm{T}}Zp^{\mathrm{T}}\\ \text{such that} &\quad p \geq 0\\ \end{align*}\]

or in the previous notation $f(p) = -p^{\mathrm{T}}Zp^{\mathrm{T}}$ and $g(p) = -p$. We first initialize the manifold under consideration

M = Sphere(d - 1)
Sphere(149, ℝ)

A first augmented Lagrangian run

We first defined $f$ and $g$ as usual functions

f(M, p) = -transpose(p) * Z * p;
g(M, p) = -p;

since $f$ is a functions defined in the embedding $ℝ^d$ as well, we obtain its gradient by projection.

grad_f(M, p) = project(M, p, -transpose(Z) * p - Z * p);

For the constraints this is a little more involved, since each function $g_i=g(p)_i=p_i$ has to return its own gradient. These are again in the embedding just $\operatorname{grad} g_i(p) = -e_i$ the $i$ th unit vector. We can project these again onto the tangent space at $p$:

grad_g(M, p) = project.(
    Ref(M), Ref(p), [[i == j ? -1.0 : 0.0 for j in 1:d] for i in 1:d]
);

We further start in a random point:

p0 = rand(M);

Let’s verify a few things for the initial point

f(M, p0)
0.0057476048331242344

How much the function g is positive

maximum(g(M, p0))
0.17885478285466855

Now as a first method we can just call the Augmented Lagrangian Method with a simple call:

@time v1 = augmented_Lagrangian_method(
    M, f, grad_f, p0; g=g, grad_g=grad_g,
    debug=[:Iteration, :Cost, :Stop, " | ", (:Change, "Δp : %1.5e"), 20, "\n"],
    stopping_criterion = StopAfterIteration(300) | (
        StopWhenSmallerOrEqual(:ϵ, 1e-5) & StopWhenChangeLess(M, 1e-8)
    )
);
Initial f(x): 0.005748 | 
# 20    f(x): -0.123842 | Δp : 1.00014e+00
# 40    f(x): -0.123842 | Δp : 4.29831e-07
# 60    f(x): -0.123842 | Δp : 7.12598e-08
The value of the variable (ϵ) is smaller than or equal to its threshold (1.0e-5).
At iteration 68 the algorithm performed a step with a change (3.74507028292393e-7) less than 9.77237220955808e-6.
  5.070427 seconds (7.48 M allocations: 1.112 GiB, 8.30% gc time, 94.26% compilation time)

Now we have both a lower function value and the point is nearly within the constraints, namely up to numerical inaccuracies

f(M, v1)
-0.1238417838899298
maximum( g(M, v1) )
8.423262829014554e-13

A faster augmented Lagrangian run

Now this is a little slow, so we can modify two things:

  1. Gradients should be evaluated in place, so for example
grad_f!(M, X, p) = project!(M, X, p, -transpose(Z) * p - Z * p);
  1. The constraints are currently always evaluated all together, since the function grad_g always returns a vector of gradients. We first change the constraints function into a vector of functions. We further change the gradient both into a vector of gradient functions $\operatorname{grad} g_i,i=1,\ldots,d$, as well as gradients that are computed in place.
g2 = [(M, p) -> -p[i] for i in 1:d];
grad_g2! = [
    (M, X, p) -> project!(M, X, p, [i == j ? -1.0 : 0.0 for j in 1:d]) for i in 1:d
];

We obtain

@time v2 = augmented_Lagrangian_method(
        M, f, grad_f!, p0; g=g2, grad_g=grad_g2!, evaluation=InplaceEvaluation(),
        debug=[:Iteration, :Cost, :Stop, " | ", (:Change, "Δp : %1.5e"), 20, "\n"],
        stopping_criterion = StopAfterIteration(300) | (
          StopWhenSmallerOrEqual(:ϵ, 1e-5) & StopWhenChangeLess(M, 1e-8)
        )
    );
Initial f(x): 0.005748 | 
# 20    f(x): -0.123842 | Δp : 1.00014e+00
# 40    f(x): -0.123842 | Δp : 4.29831e-07
# 60    f(x): -0.123842 | Δp : 7.12598e-08
The value of the variable (ϵ) is smaller than or equal to its threshold (1.0e-5).
At iteration 68 the algorithm performed a step with a change (3.74507028292393e-7) less than 9.77237220955808e-6.
  1.934760 seconds (1.76 M allocations: 539.962 MiB, 1.58% gc time, 94.15% compilation time)

As a technical remark: note that (by default) the change to InplaceEvaluations affects both the constrained solver as well as the inner solver of the subproblem in each iteration.

f(M, v2)
-0.1238417838899298
maximum(g(M, v2))
8.423262829014554e-13

These are the very similar to the previous values but the solver took much less time and less memory allocations.

Exact penalty method

As a second solver, we have the Exact Penalty Method, which currently is available with two smoothing variants, which make an inner solver for smooth optimization, that is by default again [quasi Newton] possible: LogarithmicSumOfExponentials and LinearQuadraticHuber. We compare both here as well. The first smoothing technique is the default, so we can just call

@time v3 = exact_penalty_method(
    M, f, grad_f!, p0; g=g2, grad_g=grad_g2!, evaluation=InplaceEvaluation(),
    debug=[:Iteration, :Cost, :Stop, " | ", :Change, 50, "\n"],
);
Initial f(x): 0.005748 | 
# 50    f(x): -0.123072 | Last Change: 0.980989
# 100   f(x): -0.123840 | Last Change: 0.014102
The value of the variable (ϵ) is smaller than or equal to its threshold (1.0e-6).
At iteration 102 the algorithm performed a step with a change (2.5721142829292014e-7) less than 1.0e-6.
  2.079786 seconds (7.12 M allocations: 5.025 GiB, 3.25% gc time, 59.63% compilation time)

We obtain a similar cost value as for the Augmented Lagrangian Solver from before, but here the constraint is actually fulfilled and not just numerically “on the boundary”.

f(M, v3)
-0.12384011134138288
maximum(g(M, v3))
-3.5997453361260124e-6

The second smoothing technique is often beneficial, when we have a lot of constraints (in the previously mentioned vectorial manner), since we can avoid several gradient evaluations for the constraint functions here. This leads to a faster iteration time.

@time v4 = exact_penalty_method(
    M, f, grad_f!, p0; g=g2, grad_g=grad_g2!,
    evaluation=InplaceEvaluation(),
    smoothing=LinearQuadraticHuber(),
    debug=[:Iteration, :Cost, :Stop, " | ", :Change, 50, "\n"],
);
Initial f(x): 0.005748 | 
# 50    f(x): -0.123844 | Last Change: 0.008021
# 100   f(x): -0.123841 | Last Change: 0.000026
The value of the variable (ϵ) is smaller than or equal to its threshold (1.0e-6).
At iteration 101 the algorithm performed a step with a change (1.0085214370089587e-8) less than 1.0e-6.
  1.756924 seconds (3.97 M allocations: 2.119 GiB, 3.52% gc time, 82.05% compilation time)

For the result we see the same behaviour as for the other smoothing.

f(M, v4)
-0.12384142450064946
maximum(g(M, v4))
2.72367275454451e-8

Comparing to the unconstrained solver

We can compare this to the global optimum on the sphere, which is the unconstrained optimisation problem, where we can just use Quasi Newton.

Note that this is much faster, since every iteration of the algorithm does a quasi-Newton call as well.

@time w1 = quasi_Newton(
    M, f, grad_f!, p0; evaluation=InplaceEvaluation()
);
  0.582718 seconds (504.40 k allocations: 50.537 MiB, 1.23% gc time, 98.29% compilation time)
f(M, w1)
-0.14021901809763088

But for sure here the constraints here are not fulfilled and we have quite positive entries in $g(w_1)$

maximum(g(M, w1))
0.11762276192957001

Technical details

This tutorial is cached. It was last run on the following package versions.

using Pkg
Pkg.status()
Status `~/work/Manopt.jl/Manopt.jl/tutorials/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [5ae59095] Colors v0.12.11
  [31c24e10] Distributions v0.25.112
  [26cc04aa] FiniteDifferences v0.12.32
  [7073ff75] IJulia v1.25.0
  [8ac3fa9e] LRUCache v1.6.1
  [af67fdf4] ManifoldDiff v0.3.12
  [1cead3c2] Manifolds v0.10.3
  [3362f125] ManifoldsBase v0.15.17
  [0fc0a36d] Manopt v0.5.2 `~/work/Manopt.jl/Manopt.jl`
  [91a5bcdd] Plots v1.40.8
  [731186ca] RecursiveArrayTools v3.27.0
using Dates
now()
2024-10-18T19:18:27.210

Literature

[BH19]
R. Bergmann and R. Herzog. Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization 29, 2423–2444 (2019), arXiv:1804.06214.
[LB19]
C. Liu and N. Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization (2019), arXiv:1091.10000.