How to Run Stochastic Gradient Descent
Ronny Bergmann
This tutorial illustrates how to use the stochastic_gradient_descent solver and different DirectionUpdateRules to introduce the average or momentum variant, see Stochastic Gradient Descent.
Computationally, we look at a very simple but large scale problem, the Riemannian Center of Mass or Fréchet mean: for given points $p_i ∈ \mathcal M$, $i=1,…,N$ this optimization problem reads
\[\operatorname*{arg\,min}_{x ∈ \mathcal M} \frac{1}{2}\sum_{i=1}^{N} \operatorname{d}^2_{\mathcal M}(x,p_i),\]
which of course can be (and is) solved by a gradient descent, see the introductory tutorial or Statistics in Manifolds.jl. If $N$ is very large, evaluating the complete gradient might be quite expensive. A remedy is to evaluate only one of the terms at a time and choose a random order for these.
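The idea of evaluating one summand at a time in random order can be sketched without any solver. The following is a minimal, self-contained illustration on the unit sphere, not the implementation behind stochastic_gradient_descent: the helpers sphere_exp, sphere_log, sgd_mean_sketch, and random_point as well as the step size 1/k are assumptions made only for this sketch.

```julia
using LinearAlgebra, Random

# Exponential map on the unit sphere: follow the geodesic from p in direction X.
function sphere_exp(p, X)
    t = norm(X)
    t < eps() && return p
    return cos(t) .* p .+ (sin(t) / t) .* X
end

# Logarithmic map on the unit sphere: tangent vector at p pointing towards q.
function sphere_log(p, q)
    c = clamp(dot(p, q), -1.0, 1.0)
    V = q .- c .* p                      # part of q orthogonal to p
    nV = norm(V)
    nV < eps() && return zero(p)
    return (acos(c) / nV) .* V
end

# One pass over the data in random order, using a single summand per step.
function sgd_mean_sketch(points, x0; rng=Random.default_rng())
    x = copy(x0)
    for (k, i) in enumerate(randperm(rng, length(points)))
        X = -sphere_log(x, points[i])    # gradient of the i-th summand at x
        x = sphere_exp(x, -(1 / k) .* X) # decreasing step size 1/k (assumption)
        x = x ./ norm(x)                 # guard against round-off drift
    end
    return x
end

# Small demo: points scattered around a known center.
rng = MersenneTwister(42)
center = [1.0, 0.0, 1.0] ./ sqrt(2)
function random_point(rng, center; scale=0.1)
    v = randn(rng, 3)
    v = v .- dot(v, center) .* center    # make v tangent at center
    return sphere_exp(center, scale .* v)
end
points = [random_point(rng, center) for _ in 1:1000]
x_hat = sgd_mean_sketch(points, [1.0, 1.0, 1.0] ./ sqrt(3); rng=rng)
```

Each step touches exactly one data point, so a full pass costs as many logarithmic maps as a single evaluation of the complete gradient.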
We first initialize the packages
using Manifolds, Manopt, Random, BenchmarkTools, ManifoldDiff
using ManifoldDiff: grad_distance
Random.seed!(42);
We next generate a (little) large(r) data set
n = 5000
σ = π / 12
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
data = [exp(M, p, σ * rand(M; vector_at=p)) for i in 1:n];
Note that due to the construction of the points as zero mean tangent vectors, the mean should be very close to our initial point p.
In order to use the stochastic gradient, we now need a function that returns the vector of gradients. There are two ways to define it in Manopt.jl: either as a single function that returns a vector, or as a vector of functions.
The first variant is of course easier to define, but the second is more efficient when only evaluating one of the gradients.
For the mean, the gradient is
\[\operatorname{grad}f(x) = \sum_{i=1}^N \operatorname{grad}f_i(x) \quad \text{where } \operatorname{grad}f_i(x) = -\log_x p_i\]
which we define in Manopt.jl in two different ways: either as one function returning all gradients as a vector (see gradF), or, maybe more fitting for a large scale problem, as a vector of small gradient functions (see gradf)
F(M, p) = 1 / (2 * n) * sum(map(q -> distance(M, p, q)^2, data))
gradF(M, p) = [grad_distance(M, q, p) for q in data]
gradf = [(M, p) -> grad_distance(M, q, p) for q in data];
p0 = 1 / sqrt(3) * [1.0, 1.0, 1.0]
3-element Vector{Float64}:
0.5773502691896258
0.5773502691896258
0.5773502691896258
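As a plausibility check for the summand gradient $-\log_x p_i$ used above, consider the Euclidean special case $\mathcal M = \mathbb R^d$ (an assumption made only for this illustration), where the distance is the norm of the difference and the logarithmic map is $\log_x p_i = p_i - x$:
\[f_i(x) = \frac{1}{2}\lVert x - p_i\rVert^2 \quad\Rightarrow\quad \operatorname{grad}f_i(x) = x - p_i = -\log_x p_i.\]
Setting the sum of these gradients to zero recovers the arithmetic mean $x = \frac{1}{N}\sum_{i=1}^N p_i$.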
The calls are only slightly different, but notice that accessing the second gradient element requires evaluating all logs in the first function, while we only call one of the functions in the second array of functions. So while you can use both gradF and gradf in the following call, the second one is (much) faster:
p_opt1 = stochastic_gradient_descent(M, gradF, p0)
3-element Vector{Float64}:
-0.19197815360666842
0.14005182005854327
0.026342223307325087
@benchmark stochastic_gradient_descent($M, $gradF, $p0)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 6.815 s (3.81% GC) to evaluate,
with a memory estimate of 7.83 GiB, over 100148148 allocations.
p_opt2 = stochastic_gradient_descent(M, gradf, p0)
3-element Vector{Float64}:
0.09661685716733948
0.30673827339026233
0.9468774020688563
@benchmark stochastic_gradient_descent($M, $gradf, $p0)
BenchmarkTools.Trial: 1327 samples with 1 evaluation.
Range (min … max): 3.528 ms … 6.679 ms ┊ GC (min … max): 0.00% … 25.43%
Time (median): 3.615 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.764 ms ± 466.808 μs ┊ GC (mean ± σ): 3.01% ± 7.77%
Histogram: log(frequency) by time, from 3.53 ms to 5.38 ms
Memory estimate: 3.13 MiB, allocs estimate: 40028.
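The gap between the two benchmarks above comes from how many summands are evaluated per access. A minimal sketch with a toy gradient that counts its own calls makes this visible; toy_grad, toy_data, all_grads, and single_grads are hypothetical names used only here and stand in for grad_distance, data, gradF, and gradf.

```julia
# Toy stand-in for an expensive per-term gradient; counts how often it runs.
evaluations = Ref(0)
function toy_grad(q, p)
    evaluations[] += 1
    return p - q
end

toy_data = collect(1.0:1000.0)

# Variant 1: one function returning the vector of all gradients.
all_grads(p) = [toy_grad(q, p) for q in toy_data]
# Variant 2: a vector of small gradient functions, one per data point.
single_grads = [p -> toy_grad(q, p) for q in toy_data]

evaluations[] = 0
all_grads(0.0)[2]            # computes every term to obtain the second one
count_vector_function = evaluations[]

evaluations[] = 0
single_grads[2](0.0)         # evaluates only the requested term
count_function_vector = evaluations[]
```

With 1000 toy data points, the first variant performs 1000 evaluations to return one entry, the second variant performs one.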
This result is reasonably close. But we can improve it by using a DirectionUpdateRule, namely:
On the one hand MomentumGradient, which requires both the manifold and the initial value, in order to keep track of the iterate and parallel transport the last direction to the current iterate. The necessary vector_transport_method keyword is set to a suitable default on every manifold, see default_vector_transport_method. We get
p_opt3 = stochastic_gradient_descent(
M, gradf, p0; direction=MomentumGradient(M, p0; direction=StochasticGradient(M))
)
3-element Vector{Float64}:
0.5011647464914469
-0.8387050653644194
-0.2130908496539228
MG = MomentumGradient(M, p0; direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$MG)
BenchmarkTools.Trial: 399 samples with 1 evaluation.
Range (min … max): 11.733 ms … 14.962 ms ┊ GC (min … max): 0.00% … 19.50%
Time (median): 12.007 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.524 ms ± 963.691 μs ┊ GC (mean ± σ): 4.38% ± 6.80%
Histogram: frequency by time, from 11.7 ms to 14.6 ms
Memory estimate: 10.75 MiB, allocs estimate: 229514.
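Why a momentum rule has to move the last direction to the current iterate can be seen in a small sketch on the unit sphere. It is an illustration under stated assumptions, not the update used by MomentumGradient: the projection project_tangent serves as a simple vector transport, and momentum_direction with its weight momentum is a hypothetical helper.

```julia
using LinearAlgebra

# Project a vector onto the tangent space at p on the unit sphere.
# Used here as a simple vector transport (an assumption of this sketch).
project_tangent(p, X) = X .- dot(p, X) .* p

# Combine the transported previous direction with the new gradient.
# `momentum` is an illustrative weight, not a library default.
function momentum_direction(p_new, d_old, grad_new; momentum=0.2)
    d_transported = project_tangent(p_new, d_old)
    return momentum .* d_transported .+ grad_new
end

p_old = [0.0, 0.0, 1.0]
d_old = [0.3, 0.0, 0.0]                     # tangent at p_old
p_new = [sin(0.3), 0.0, cos(0.3)]           # iterate after a step along d_old
grad_new = project_tangent(p_new, [0.0, 0.1, 0.0])
d_new = momentum_direction(p_new, d_old, grad_new)
```

The old direction d_old is not tangent at p_new, so adding it to grad_new directly would leave the tangent space; after the transport the combined direction d_new is a valid tangent vector at the current iterate.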
And on the other hand the AverageGradient computes an average of the last n gradients. This is done by
p_opt4 = stochastic_gradient_descent(
M, gradf, p0; direction=AverageGradient(M, p0; n=10, direction=StochasticGradient(M))
)
3-element Vector{Float64}:
0.6718677562271265
0.036703234240620684
0.7397611714185921
AG = AverageGradient(M, p0; n=10, direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$AG)
BenchmarkTools.Trial: 126 samples with 1 evaluation.
Range (min … max): 37.586 ms … 54.377 ms ┊ GC (min … max): 0.00% … 4.34%
Time (median): 40.086 ms ┊ GC (median): 5.58%
Time (mean ± σ): 39.968 ms ± 1.701 ms ┊ GC (mean ± σ): 4.35% ± 2.43%
Histogram: frequency by time, from 37.6 ms to 43.8 ms
Memory estimate: 33.64 MiB, allocs estimate: 549514.
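Averaging the last n gradients amounts to keeping a sliding window of directions. The following sketch shows that bookkeeping on the unit sphere; it is not the code behind AverageGradient, and the helper averaged_direction! together with the projection used as a vector transport are assumptions of this illustration.

```julia
using LinearAlgebra

# Projection onto the tangent space at p, used as a simple vector transport.
project_tangent(p, X) = X .- dot(p, X) .* p

# Keep at most `n` recent gradients, move them to the current point,
# and return their average as the new direction.
function averaged_direction!(history, p, grad; n=10)
    push!(history, grad)
    length(history) > n && popfirst!(history)
    # move all stored vectors into the tangent space at p
    map!(X -> project_tangent(p, X), history, history)
    return sum(history) ./ length(history)
end

history = Vector{Vector{Float64}}()
p_cur = [0.0, 0.0, 1.0]
# Five incoming gradients (all tangent at p_cur), window length 3.
directions = [
    averaged_direction!(history, p_cur, [cos(k), sin(k), 0.0]; n=3) for k in 1:5
]
```

Every call moves all stored vectors, so the work per iteration grows with the window length, which fits the longer run times in the benchmark above.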
Note that the default StoppingCriterion is a fixed number of iterations, which helps the comparison here.
For both update rules we have to internally specify that we are still in the stochastic setting, since both rules can also be used with the IdentityUpdateRule within gradient_descent.
For this not-that-large-scale example we can of course also use a gradient descent with ArmijoLinesearch,
fullGradF(M, p) = sum(grad_distance(M, q, p) for q in data)
p_opt5 = gradient_descent(M, F, fullGradF, p0; stepsize=ArmijoLinesearch(M))
3-element Vector{Float64}:
0.7212810407545467
-0.08714530556591063
0.6871385274934468
but it will usually be a little slower
AL = ArmijoLinesearch(M);
@benchmark gradient_descent($M, $F, $fullGradF, $p0; stepsize=$AL)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 640.268 ms … 659.038 ms ┊ GC (min … max): 4.11% … 4.28%
Time (median): 647.474 ms ┊ GC (median): 4.37%
Time (mean ± σ): 648.395 ms ± 6.735 ms ┊ GC (mean ± σ): 4.36% ± 0.15%
Histogram: frequency by time, from 640 ms to 659 ms
Memory estimate: 749.00 MiB, allocs estimate: 12025142.
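Part of the extra cost of the full gradient variant is that every step needs all logarithmic maps and, with a backtracking line search, several cost evaluations as well. A minimal sketch of one such step on the unit sphere follows; it is a generic Armijo backtracking illustration, not the ArmijoLinesearch implementation, and the names armijo_step, initial, contraction, and sufficient are hypothetical.

```julia
using LinearAlgebra

function sphere_exp(p, X)
    t = norm(X)
    return t < eps() ? p : cos(t) .* p .+ (sin(t) / t) .* X
end

function sphere_log(p, q)
    c = clamp(dot(p, q), -1.0, 1.0)
    V = q .- c .* p
    nV = norm(V)
    return nV < eps() ? zero(p) : (acos(c) / nV) .* V
end

# Cost and full gradient of the mean problem, both scaled by 1/length(pts).
cost(x, pts) = sum(acos(clamp(dot(x, q), -1.0, 1.0))^2 for q in pts) / (2 * length(pts))
full_grad(x, pts) = -sum(sphere_log(x, q) for q in pts) ./ length(pts)

# One gradient step with backtracking: shrink s until the
# sufficient decrease condition holds.
function armijo_step(x, pts; initial=1.0, contraction=0.5, sufficient=1e-4)
    g = full_grad(x, pts)
    f0 = cost(x, pts)
    s = initial
    while s > 1e-10 && cost(sphere_exp(x, -s .* g), pts) > f0 - sufficient * s * dot(g, g)
        s *= contraction
    end
    return sphere_exp(x, -s .* g), s
end

pts = [normalize([1.0, 0.0, 1.0]), normalize([1.0, 0.2, 1.0]), normalize([0.9, -0.1, 1.0])]
x0 = normalize([1.0, 1.0, 1.0])
x1, s = armijo_step(x0, pts)
```

Each call evaluates one logarithmic map per data point for the gradient and at least one full cost evaluation per trial step size, whereas a stochastic step uses a single summand.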
Note that all 5 runs are very close to each other; here we check the distance to the first