Illustration of How to Use Mutating Gradient Functions
When it comes to time-critical operations, a main ingredient in Julia is the use of mutating functions, that is, functions that compute in place without additional memory allocations. In the following, we illustrate how to do this with Manopt.jl.
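As a small, Manopt.jl-independent illustration of the difference (the names double and double! are made up for this sketch), a mutating function writes its result into a preallocated argument instead of allocating a new one:

double(x) = 2 .* x                 # allocates a new array on every call
double!(y, x) = (y .= 2 .* x; y)   # writes into the preallocated y instead

By Julia convention, mutating functions carry a trailing ! and take the output as one of their first arguments.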
Let's start with the same function as in Get Started: Optimize! and compute the mean of some points, only that here we use the sphere $\mathbb S^{30}$ and $n=800$ points.
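The (Riemannian) mean of the points $p_1,\ldots,p_n$ is the minimizer of the sum of squared distances, that is, of the cost

$$F(x) = \frac{1}{2n}\sum_{i=1}^{n} d_{\mathcal M}^2(x, p_i),$$

which is exactly the cost function F we implement below.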
From the aforementioned example, the implementation looks like
using Manopt, Manifolds, Random, BenchmarkTools
begin
    Random.seed!(42)
    m = 30
    M = Sphere(m)
    n = 800
    σ = π / 8
    # base point: the second unit vector, a point on the sphere M
    x = zeros(Float64, m + 1)
    x[2] = 1.0
    # n points obtained by moving x along random Gaussian tangent vectors of standard deviation σ
    data = [exp(M, x, random_tangent(M, x, Val(:Gaussian), σ)) for i in 1:n]
end;
Classical Definition
The variant from the previous tutorial defines a cost $F(x)$ and its gradient $\operatorname{grad}F(x)$
F(M, x) = sum(1 / (2 * n) * distance.(Ref(M), Ref(x), data) .^ 2)
F (generic function with 1 method)
gradF(M, x) = sum(1 / n * grad_distance.(Ref(M), data, Ref(x)))
gradF (generic function with 1 method)
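For reference, and assuming the default exponent 2 of grad_distance, each term grad_distance(M, pᵢ, x) is the gradient of $\frac{1}{2}d_{\mathcal M}^2(x, p_i)$, which is $-\log_x p_i$ (within the injectivity radius). The gradient of the cost above is therefore

$$\operatorname{grad}F(x) = -\frac{1}{n}\sum_{i=1}^{n} \log_x p_i.$$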
We further set the stopping criterion to be a little more strict. Then we obtain
begin
    sc = StopWhenGradientNormLess(1e-10)
    x0 = random_point(M)
    m1 = gradient_descent(M, F, gradF, x0; stopping_criterion=sc)
    @benchmark gradient_descent($M, $F, $gradF, $x0; stopping_criterion=$sc)
end
BenchmarkTools.Trial: 312 samples with 1 evaluation.
 Range (min … max):  11.824 ms … 28.610 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time (median):      15.273 ms              ┊ GC (median):    0.00%
 Time (mean ± σ):    16.059 ms ±  3.840 ms  ┊ GC (mean ± σ):  11.50% ± 10.06%
 Histogram: log(frequency) by time, 11.8 ms … 21 ms (omitted)
 Memory estimate: 24.52 MiB, allocs estimate: 84722.
In-place Computation of the Gradient
We can reduce the memory allocations by implementing the gradient as a functor. The motivation is twofold: on the one hand, we want to avoid variables from the global scope, for example the manifold M or the data, being used within the function; on the other hand, a functor can carry its own preallocated memory. Doing the same for more complicated cost functions might also be worth considering.
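In Julia, a functor is simply a struct that can be called like a function. A minimal, generic sketch (the name ScaledSum is made up for this illustration):

struct ScaledSum
    c::Float64
end
(f::ScaledSum)(x) = f.c * sum(x)   # instances of ScaledSum are callable

f = ScaledSum(2.0)
f([1.0, 2.0, 3.0])                 # returns 12.0

The instance carries its own state (here the factor c), so nothing needs to be captured from global scope.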
Here, we store the data (by reference) and one temporary tangent vector, in order to avoid reallocating memory in every grad_distance computation. We have
begin
    struct grad!{TD,TTMP}
        data::TD   # reference to the data points
        tmp::TTMP  # preallocated memory for a single gradient term
    end
    function (gradf!::grad!)(M, X, x)
        fill!(X, 0)
        for di in gradf!.data
            # write the gradient of the distance term for di into tmp, in place
            grad_distance!(M, gradf!.tmp, di, x)
            X .+= gradf!.tmp
        end
        X ./= length(gradf!.data)
        return X
    end
end
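To see the calling convention in isolation, such a functor can also be used directly; a small sketch reusing the definitions from above (the names gradF_tmp! and X are only chosen for this illustration):

gradF_tmp! = grad!(data, similar(data[1]))
X = similar(data[1])    # preallocated output for the gradient
gradF_tmp!(M, X, x)     # fills X with the gradient of F at the base point x

This is exactly the signature (M, X, x) that the solver expects for an in-place gradient when we pass evaluation=MutatingEvaluation() below.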
Then we just have to initialize the functor and perform our final benchmark. Note that we also have to interpolate all variables passed to the benchmark with a $; only m2 is excluded, since it is recreated by the setup keyword for every run.
begin
    gradF2! = grad!(data, similar(data[1]))
    m2 = deepcopy(x0)
    gradient_descent!(
        M, F, gradF2!, m2; evaluation=MutatingEvaluation(), stopping_criterion=sc
    )
    @benchmark gradient_descent!(
        $M, $F, $gradF2!, m2; evaluation=$(MutatingEvaluation()), stopping_criterion=$sc
    ) setup = (m2 = deepcopy($x0))
end
BenchmarkTools.Trial: 904 samples with 1 evaluation.
 Range (min … max):  5.122 ms … 5.908 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time (median):      5.528 ms               ┊ GC (median):    0.00%
 Time (mean ± σ):    5.526 ms ± 127.220 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Histogram: frequency by time, 5.12 ms … 5.87 ms (omitted)
 Memory estimate: 57.77 KiB, allocs estimate: 940.
Note that the results m1 and m2 are of course (approximately) the same.
distance(M, m1, m2)
0.0