# How to count and cache function calls

Ronny Bergmann

In this tutorial, we want to investigate the caching and counting (statistics) features of Manopt.jl. We reuse the optimization tasks from the introductory tutorial Get started: optimize!.

## Introduction

There are surely many ways to keep track for example of how often the cost function is called, for example with a functor, as we used in an example in How to Record Data

```
mutable struct MyCost{I<:Integer}
count::I
end
MyCost() = MyCost{Int64}(0)
function (c::MyCost)(M, x)
c.count += 1
# [ .. Actual implementation of the cost here ]
end
```

This still leaves a bit of work to the user, especially for tracking more than just the number of cost function evaluations.

When a function like the objective or gradient is expensive to compute, it may make sense to cache its results. Manopt.jl tries to minimize the number of repeated calls but sometimes they are necessary and harmless when the function is cheap to compute. Caching of expensive function calls can for example be added using Memoize.jl by the user. The approach in the solvers of Manopt.jl aims to simplify adding both these capabilities on the level of calling a solver.

## Technical background

The two ingredients for a solver in Manopt.jl are the `AbstractManoptProblem`

and the `AbstractManoptSolverState`

, where the former consists of the domain, that is the `AsbtractManifold`

and `AbstractManifoldObjective`

.

Both recording and debug capabilities are implemented in a decorator pattern to the solver state. They can be easily added using the `record=`

and `debug=`

in any solver call. This pattern was recently extended, such that also the objective can be decorated. This is how both caching and counting are implemented, as decorators of the `AbstractManifoldObjective`

and hence for example changing/extending the behaviour of a call to `get_cost`

.

Let’s finish off the technical background by loading the necessary packages. Besides Manopt.jl and Manifolds.jl we also need LRUCaches.jl which are (since Julia 1.9) a weak dependency and provide the *least recently used* strategy for our caches.

```
using Manopt, Manifolds, Random, LRUCache, LinearAlgebra, ManifoldDiff
using ManifoldDiff: grad_distance
```

## Counting

We first define our task, the Riemannian Center of Mass from the Get started: optimize! tutorial.

```
n = 100
σ = π / 8
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
Random.seed!(42)
data = [exp(M, p, σ * rand(M; vector_at=p)) for i in 1:n];
f(M, p) = sum(1 / (2 * n) * distance.(Ref(M), Ref(p), data) .^ 2)
grad_f(M, p) = sum(1 / n * grad_distance.(Ref(M), data, Ref(p)));
```

to now count how often the cost and the gradient are called, we use the `count=`

keyword argument that works in any solver to specify the elements of the objective whose calls we want to count calls to. A full list is available in the documentation of the `AbstractManifoldObjective`

. To also see the result, we have to set `return_objective=true`

. This returns `(objective, p)`

instead of just the solver result `p`

. We can further also set `return_state=true`

to get even more information about the solver run.

`gradient_descent(M, f, grad_f, data[1]; count=[:Cost, :Gradient], return_objective=true, return_state=true)`

```
# Solver state for `Manopt.jl`s Gradient Descent
After 66 iterations
## Parameters
* retraction method: ExponentialRetraction()
## Stepsize
ArmijoLinesearch(;
initial_stepsize=1.0
retraction_method=ExponentialRetraction()
contraction_factor=0.95
sufficient_decrease=0.1
)
## Stopping criterion
Stop When _one_ of the following are fulfilled:
Max Iteration 200: not reached
|grad f| < 1.0e-8: reached
Overall: reached
This indicates convergence: Yes
## Statistics on function calls
* :Gradient : 199
* :Cost : 275
```

And we see that statistics are shown in the end.

## Caching

To now also cache these calls, we can use the `cache=`

keyword argument. Since now both the cache and the count “extend” the capability of the objective, the order is important: on the high-level interface, the `count`

is treated first, which means that only actual function calls and not cache look-ups are counted. With the proper initialisation, you can use any caches here that support the `get!(function, cache, key)!`

update. All parts of the objective that can currently be cached are listed at `ManifoldCachedObjective`

. The solver call has a keyword `cache`

that takes a tuple`(c, vs, n)`

of three arguments, where `c`

is a symbol for the type of cache, `vs`

is a vector of symbols, which calls to cache and `n`

is the size of the cache. If the last element is not provided, a suitable default (currently`n=10`

) is used.

Here we want to use `c=:LRU`

caches for `vs=[Cost, :Gradient]`

with a size of `n=25`

.

```
r = gradient_descent(M, f, grad_f, data[1];
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true, return_state=true)
```

```
# Solver state for `Manopt.jl`s Gradient Descent
After 66 iterations
## Parameters
* retraction method: ExponentialRetraction()
## Stepsize
ArmijoLinesearch(;
initial_stepsize=1.0
retraction_method=ExponentialRetraction()
contraction_factor=0.95
sufficient_decrease=0.1
)
## Stopping criterion
Stop When _one_ of the following are fulfilled:
Max Iteration 200: not reached
|grad f| < 1.0e-8: reached
Overall: reached
This indicates convergence: Yes
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 66
* :Cost : 149
```

Since the default setup with `ArmijoLinesearch`

needs the gradient and the cost, and similarly the stopping criterion might (independently) evaluate the gradient, the caching is quite helpful here.

And of course also for this advanced return value of the solver, we can still access the result as usual:

`get_solver_result(r)`

```
3-element Vector{Float64}:
0.6868392807355564
0.006531599748261925
0.7267799809043942
```

## Advanced caching examples

There are more options other than caching single calls to specific parts of the objective. For example you may want to cache intermediate results of computing the cost and share that with the gradient computation. We present three solutions to this:

- An easy approach from within
`Manopt.jl`

: the`ManifoldCostGradientObjective`

- A shared storage approach using a functor
- A shared (internal) cache approach also using a functor

For that we switch to another example: the Rayleigh quotient. We aim to maximize the Rayleigh quotient $\displaystyle\frac{x^{\mathrm{T}}Ax}{x^{\mathrm{T}}x}$, for some $A∈ℝ^{m+1\times m+1}$ and $x∈ℝ^{m+1}$ but since we consider this on the sphere and `Manopt.jl`

(as many other optimization toolboxes) minimizes, we consider

\[g(p) = -p^{\mathrm{T}}Ap,\qquad p∈\mathbb S^{m}\]

The Euclidean gradient (that is in $ R^{m+1}$) is actually just $\nabla g(p) = -2Ap$, the Riemannian gradient the projection of $\nabla g(p)$ onto the tangent space $T_p\mathbb S^{m}$.

```
m = 25
Random.seed!(42)
A = randn(m + 1, m + 1)
A = Symmetric(A)
p_star = eigvecs(A)[:, end] # minimizer (or similarly -p)
f_star = -eigvals(A)[end] # cost (note that we get - the largest Eigenvalue)
N = Sphere(m);
g(M, p) = -p' * A*p
∇g(p) = -2 * A * p
grad_g(M,p) = project(M, p, ∇g(p))
grad_g!(M,X, p) = project!(M, X, p, ∇g(p))
```

`grad_g! (generic function with 1 method)`

But since both the cost and the gradient require the computation of the matrix-vector product $Ap$, it might be beneficial to only compute this once.

### The `ManifoldCostGradientObjective`

approach

The `ManifoldCostGradientObjective`

uses a combined function to compute both the gradient and the cost at the same time. We define the in-place variant as

```
function g_grad_g!(M::AbstractManifold, X, p)
X .= -A*p
c = p'*X
X .*= 2
project!(M, X, p, X)
return (c, X)
end
```

`g_grad_g! (generic function with 1 method)`

where we only compute the matrix-vector product once. The small disadvantage might be, that we always compute *both*, the gradient and the cost. Luckily, the cache we used before, takes this into account and caches both results, such that we indeed end up computing `A*p`

only once when asking to a cost and a gradient.

Let’s compare both methods

```
p0 = [(1/5 .* ones(5))..., zeros(m-4)...];
@time s1 = gradient_descent(N, g, grad_g!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
```

```
0.962349 seconds (1.08 M allocations: 73.423 MiB, 1.14% gc time, 99.47% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
```

versus

```
obj = ManifoldCostGradientObjective(g_grad_g!; evaluation=InplaceEvaluation())
@time s2 = gradient_descent(N, obj, p0;
stopping_criterion=StopWhenGradientNormLess(1e-5),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
```

```
0.796163 seconds (784.83 k allocations: 60.645 MiB, 2.03% gc time, 98.06% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 1448
* :Cost : 1448
To access the solver result, call `get_solver_result` on this variable.
```

first of all both yield the same result

```
p1 = get_solver_result(s1)
p2 = get_solver_result(s2)
[distance(N, p1, p2), g(N, p1), g(N, p2), f_star]
```

```
4-element Vector{Float64}:
0.0
-7.8032957637779
-7.8032957637779
-7.803295763793949
```

and we can see that the combined number of evaluations is once 2051, once just the number of cost evaluations 1449. Note that the involved additional 847 gradient evaluations are merely a multiplication with 2. On the other hand, the additional caching of the gradient in these cases might be less beneficial. It is beneficial, when the gradient and the cost are very often required together.

### A shared storage approach using a functor

An alternative to the previous approach is the usage of a functor that introduces a “shared storage” of the result of computing `A*p`

. We additionally have to store `p`

though, since we have to make sure that we are still evaluating the cost and/or gradient at the same point at which the cached `A*p`

was computed. We again consider the (more efficient) in-place variant. This can be done as follows

```
struct StorageG{T,M}
A::M
Ap::T
p::T
end
function (g::StorageG)(::Val{:Cost}, M::AbstractManifold, p)
if !(p==g.p) #We are at a new point -> Update
g.Ap .= g.A*p
g.p .= p
end
return -g.p'*g.Ap
end
function (g::StorageG)(::Val{:Gradient}, M::AbstractManifold, X, p)
if !(p==g.p) #We are at a new point -> Update
g.Ap .= g.A*p
g.p .= p
end
X .= -2 .* g.Ap
project!(M, X, p, X)
return X
end
```

Here we use the first parameter to distinguish both functions. For the mutating case the signatures are different regardless of the additional argument but for the allocating case, the signatures of the cost and the gradient function are the same.

```
#Define the new functor
storage_g = StorageG(A, zero(p0), zero(p0))
# and cost and gradient that use this functor as
g3(M,p) = storage_g(Val(:Cost), M, p)
grad_g3!(M, X, p) = storage_g(Val(:Gradient), M, X, p)
@time s3 = gradient_descent(N, g3, grad_g3!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 2),
return_objective=true#, return_state=true
)
```

```
0.465353 seconds (330.75 k allocations: 23.824 MiB, 98.98% compilation time)
## Cache
* :Cost : 2/2 entries of type Float64 used
* :Gradient : 2/2 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
```

This of course still yields the same result

```
p3 = get_solver_result(s3)
g(N, p3) - f_star
```

`1.6049384043981263e-11`

And while we again have a split off the cost and gradient evaluations, we can observe that the allocations are less than half of the previous approach.

### A local cache approach

This variant is very similar to the previous one, but uses a whole cache instead of just one place to store `A*p`

. This makes the code a bit nicer, and it is possible to store more than just the last `p`

either cost or gradient was called with.

```
struct CacheG{C,M}
A::M
cache::C
end
function (g::CacheG)(::Val{:Cost}, M, p)
Ap = get!(g.cache, copy(M,p)) do
g.A*p
end
return -p'*Ap
end
function (g::CacheG)(::Val{:Gradient}, M, X, p)
Ap = get!(g.cache, copy(M,p)) do
g.A*p
end
X .= -2 .* Ap
project!(M, X, p, X)
return X
end
```

However, the resulting solver run is not always faster, since the whole cache instead of storing just `Ap`

and `p`

is a bit more costly. Then the tradeoff is, whether this pays off.

```
#Define the new functor
cache_g = CacheG(A, LRU{typeof(p0),typeof(p0)}(; maxsize=25))
# and cost and gradient that use this functor as
g4(M,p) = cache_g(Val(:Cost), M, p)
grad_g4!(M, X, p) = cache_g(Val(:Gradient), M, X, p)
@time s4 = gradient_descent(N, g4, grad_g4!, p0;
stopping_criterion = StopWhenGradientNormLess(1e-5),
evaluation=InplaceEvaluation(),
count=[:Cost, :Gradient],
cache=(:LRU, [:Cost, :Gradient], 25),
return_objective=true,
)
```

```
0.380845 seconds (328.84 k allocations: 24.083 MiB, 98.43% compilation time)
## Cache
* :Cost : 25/25 entries of type Float64 used
* :Gradient : 25/25 entries of type Vector{Float64} used
## Statistics on function calls
* :Gradient : 602
* :Cost : 1449
To access the solver result, call `get_solver_result` on this variable.
```

and for safety let’s verify that we are reasonably close

```
p4 = get_solver_result(s4)
g(N, p4) - f_star
```

`1.6049384043981263e-11`

For this example, or maybe even `gradient_descent`

in general it seems, this additional (second, inner) cache does not improve the result further, it is about the same effort both time and allocation-wise.

## Summary

While the second approach of `ManifoldCostGradientObjective`

is very easy to implement, both the storage and the (local) cache approach are more efficient. All three are an improvement over the first implementation without sharing interim results. The results with storage or cache have further advantage of being more flexible, since the stored information could also be reused in a third function, for example when also computing the Hessian.

## Technical details

This tutorial is cached. It was last run on the following package versions.

```
using Pkg
Pkg.status()
```

```
Status `~/work/Manopt.jl/Manopt.jl/tutorials/Project.toml`
[6e4b80f9] BenchmarkTools v1.5.0
[5ae59095] Colors v0.12.11
[31c24e10] Distributions v0.25.112
[26cc04aa] FiniteDifferences v0.12.32
[7073ff75] IJulia v1.25.0
[8ac3fa9e] LRUCache v1.6.1
[af67fdf4] ManifoldDiff v0.3.12
[1cead3c2] Manifolds v0.10.3
[3362f125] ManifoldsBase v0.15.17
[0fc0a36d] Manopt v0.5.2 `~/work/Manopt.jl/Manopt.jl`
[91a5bcdd] Plots v1.40.8
[731186ca] RecursiveArrayTools v3.27.0
```

```
using Dates
now()
```

`2024-10-05T12:26:21.944`