Speedup using in-place evaluation

Ronny Bergmann

When it comes to time-critical operations, one of the main tools Julia offers is mutating functions, that is, functions that compute in place without additional memory allocations. In the following, we illustrate how to use them with Manopt.jl.
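
As a small generic illustration (not part of the Manopt.jl code that follows), the difference between an allocating and a mutating function looks as follows; by Julia convention, mutating functions carry a ! in their name:

add(x, y) = x .+ y     # allocates a new array for the result
function add!(z, x, y) # overwrites the preallocated z instead
    z .= x .+ y
    return z
end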

Let's start with the same function as in Get Started: Optimize! and compute the mean of some points, except that here we use the sphere $\mathbb S^{30}$ and $n=800$ points.

Following the aforementioned example, we first load all necessary packages.

using Manopt, Manifolds, Random, BenchmarkTools
Random.seed!(42);

And set up our data

Random.seed!(42)
m = 30
M = Sphere(m)
n = 800
σ = π / 8
p = zeros(Float64, m + 1)
p[2] = 1.0
data = [exp(M, p, σ * rand(M; vector_at=p)) for i in 1:n];

Classical Definition

The variant from the previous tutorial defines a cost $f(p)$ and its gradient $\operatorname{grad} f(p)$.
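
Written out for the data points $x_1,\ldots,x_n \in \mathcal M$, these are

$$ f(p) = \frac{1}{2n}\sum_{i=1}^{n} d^2_{\mathcal M}(p, x_i), \qquad \operatorname{grad} f(p) = -\frac{1}{n}\sum_{i=1}^{n} \log_p x_i, $$

where $d_{\mathcal M}$ denotes the geodesic distance and $\log_p$ the logarithmic map on $\mathcal M$; up to the factor $1/n$, each summand of the gradient is what grad_distance computes. In the code below, the Ref wrappers keep M and p fixed while broadcasting over the data.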

f(M, p) = sum(1 / (2 * n) * distance.(Ref(M), Ref(p), data) .^ 2)
grad_f(M, p) = sum(1 / n * grad_distance.(Ref(M), data, Ref(p)))
grad_f (generic function with 1 method)

We further set the stopping criterion to be a little more strict. Then we obtain

sc = StopWhenGradientNormLess(3e-10)
p0 = zeros(Float64, m + 1); p0[1] = 1/sqrt(2); p0[2] = 1/sqrt(2)
m1 = gradient_descent(M, f, grad_f, p0; stopping_criterion=sc);

We can also benchmark this as

@benchmark gradient_descent($M, $f, $grad_f, $p0; stopping_criterion=$sc)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  48.285 ms … 56.649 ms  ┊ GC (min … max): 4.84% … 6.96%
 Time  (median):     49.552 ms              ┊ GC (median):    5.41%
 Time  (mean ± σ):   50.151 ms ±  1.731 ms  ┊ GC (mean ± σ):  5.56% ± 0.64%

   ▂▃ █▃▃▆    ▂
  ▅████████▅█▇█▄▅▇▁▅█▅▇▄▇▅▁▅▄▄▄▁▄▁▁▁▄▄▁▁▁▁▁▁▄▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▄ ▄
  48.3 ms         Histogram: frequency by time        56.6 ms <

 Memory estimate: 194.10 MiB, allocs estimate: 655347.

In-place Computation of the Gradient

We can reduce the memory allocations by implementing the gradient to be evaluated in place. We do this using a functor. The motivation is twofold: on the one hand, we want to avoid variables from the global scope, for example the manifold M or the data, being used within the function; on the other hand, doing the same for more complicated cost functions might also be worth pursuing.
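
As a sketch of that idea for the cost (the name FCost and this implementation are illustrative, not part of the tutorial), a functor can capture the data in the same way:

struct FCost{TD}
    data::TD
end
# evaluate f(p) = 1/(2n) Σᵢ d²(p, xᵢ) without touching global variables
function (f::FCost)(M, p)
    n = length(f.data)
    return sum(distance(M, p, di)^2 for di in f.data) / (2n)
end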

For the gradient, we store the data (as a reference) and introduce temporary memory in order to avoid reallocating memory in every grad_distance computation. We get

struct GradF!{TD,TTMP}
    data::TD   # the data points, stored by reference
    tmp::TTMP  # preallocated tangent vector, reused in every call
end
function (grad_f!::GradF!)(M, X, p)
    fill!(X, 0)
    for di in grad_f!.data
        # gradient of ½d²(p, di), computed in place into tmp
        grad_distance!(M, grad_f!.tmp, di, p)
        X .+= grad_f!.tmp
    end
    X ./= length(grad_f!.data)
    return X
end

For the actual call to the solver, we first have to generate an instance of GradF! and tell the solver that the gradient is provided as an InplaceEvaluation. We can furthermore use gradient_descent!, which works in place of the initial point we pass, overwriting it with the result.

grad_f2! = GradF!(data, similar(data[1]))
m2 = deepcopy(p0)
gradient_descent!(
    M, f, grad_f2!, m2; evaluation=InplaceEvaluation(), stopping_criterion=sc
);
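
Before benchmarking, a quick sanity check (a sketch; zero_vector and the tangent-vector isapprox are provided by ManifoldsBase.jl) confirms that the in-place gradient agrees with the allocating one at the start point:

X = zero_vector(M, p0)
grad_f2!(M, X, p0)
isapprox(M, p0, X, grad_f(M, p0))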

We can again benchmark this

@benchmark gradient_descent!(
    $M, $f, $grad_f2!, m2; evaluation=$(InplaceEvaluation()), stopping_criterion=$sc
) setup = (m2 = deepcopy($p0))
BenchmarkTools.Trial: 176 samples with 1 evaluation.
 Range (min … max):  27.419 ms … 34.154 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.001 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.412 ms ±  1.079 ms  ┊ GC (mean ± σ):  0.73% ± 2.24%

    β–β–…β–‡β–ˆβ–…β–‚β–„ ▁
  β–„β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–‡β–ˆβ–„β–†β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–ƒβ–β–β–ƒβ–β–ƒβ–ƒβ–β–„β–β–β–ƒβ–ƒβ–β–β–„β–β–β–ƒβ–…β–ƒβ–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–ƒβ–β–β–ƒ β–ƒ
  27.4 ms         Histogram: frequency by time        31.9 ms <

 Memory estimate: 3.76 MiB, allocs estimate: 5949.

which is faster by about a factor of 2 compared to the first solver call. Note that the results m1 and m2 are of course the same, as we can check by computing their distance.

distance(M, m1, m2)
2.0004809792350595e-10
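
As a final plausibility check (again a sketch, not part of the original tutorial), the gradient norm at the result should be below the tolerance 3e-10 we set in sc:

X = zero_vector(M, m2)
grad_f2!(M, X, m2)
norm(M, m2, X)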