How to Run Stochastic Gradient Descent

Ronny Bergmann

This tutorial illustrates how to use the stochastic_gradient_descent solver together with different DirectionUpdateRules to obtain its averaging and momentum variants; see Stochastic Gradient Descent.

Computationally, we look at a very simple but large-scale problem, the Riemannian Center of Mass or Fréchet mean: for given points $p_i ∈\mathcal M$, $i=1,…,N$, this optimization problem reads

\[\operatorname*{arg\,min}_{x∈\mathcal M} \frac{1}{2}\sum_{i=1}^{N} \operatorname{d}^2_{\mathcal M}(x,p_i),\]

which of course can be (and is) solved by a gradient descent, see the introductory tutorial or Statistics in Manifolds.jl. If $N$ is very large, evaluating the complete gradient might be quite expensive. A remedy is to evaluate only one of the terms at a time and choose a random order for these.
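To make the idea concrete, one step of such a scheme can be written as below. This is a generic sketch rather than the solver's exact rule: the index $i_k$ is taken from a random ordering of $1,…,N$, and the step size $s_k>0$ and the retraction $\operatorname{retr}$ (for example the exponential map) are notation assumed here for illustration.

\[x_{k+1} = \operatorname{retr}_{x_k}\bigl(-s_k \operatorname{grad} f_{i_k}(x_k)\bigr), \qquad f_i(x) = \frac{1}{2}\operatorname{d}^2_{\mathcal M}(x,p_i).\]

Each step thus costs one gradient evaluation instead of $N$.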

We first initialize the packages

using Manifolds, Manopt, Random, BenchmarkTools, ManifoldDiff
using ManifoldDiff: grad_distance
Random.seed!(42);

Next we generate a somewhat larger data set

n = 5000
σ = π / 12
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
data = [exp(M, p, σ * rand(M; vector_at=p)) for i in 1:n];

Note that due to the construction of the points as zero mean tangent vectors, the mean should be very close to our initial point p.
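The construction above can be mimicked without any packages. The following sketch is not how Manifolds.jl implements these maps: sphere_exp and sphere_log are hypothetical helper names, and drawing the tangent vector by projecting a Gaussian sample is an assumption made for illustration. It generates one point by following a geodesic from the base point and checks that the logarithmic map recovers the tangent vector.

```julia
using LinearAlgebra, Random

# Exponential map on the unit sphere: walk along the great circle
# starting at p in the direction of the tangent vector X.
function sphere_exp(p, X)
    t = norm(X)
    t == 0 && return copy(p)
    return cos(t) .* p .+ (sin(t) / t) .* X
end

# Logarithmic map on the unit sphere: the tangent vector at p that
# points towards q and whose length is the geodesic distance.
function sphere_log(p, q)
    c = clamp(dot(p, q), -1.0, 1.0)
    V = q .- c .* p            # part of q orthogonal to p
    nV = norm(V)
    nV == 0 && return zero(p)  # q == p (antipodal points are not handled)
    return (acos(c) / nV) .* V
end

Random.seed!(1)
p_demo = [1.0, 0.0, 1.0] ./ sqrt(2)
Z = randn(3)
# Project the Gaussian sample onto the tangent space at p_demo and scale it.
X_demo = (π / 12) .* (Z .- dot(Z, p_demo) .* p_demo)
q_demo = sphere_exp(p_demo, X_demo)
```

Since the logarithmic map inverts the exponential map for tangent vectors shorter than π, sphere_log(p_demo, q_demo) returns X_demo up to rounding.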

In order to use the stochastic gradient, we now need a function that returns the vector of gradients. There are two ways to define it in Manopt.jl: either as a single function that returns a vector, or as a vector of functions.

The first variant is of course easier to define, but the second is more efficient when only evaluating one of the gradients.
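The cost difference between the two variants can be seen in a small toy sketch. It uses a Euclidean stand-in problem instead of the sphere, and all names (data_toy, calls, single_grad, grad_all, grad_each) are hypothetical and not part of Manopt.jl; the counter only records how many single gradients are evaluated.

```julia
# Toy Euclidean stand-in: f_i(x) = 0.5 * ‖x - q_i‖², so grad f_i(x) = x - q_i.
data_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
calls = Ref(0)  # counts how many single gradients are evaluated

single_grad(x, q) = (calls[] += 1; x - q)

# Variant 1: one function returning the vector of all gradients.
grad_all(x) = [single_grad(x, q) for q in data_toy]

# Variant 2: a vector of functions, one per data point.
grad_each = [x -> single_grad(x, q) for q in data_toy]

x = [0.0, 0.0]

calls[] = 0
g1 = grad_all(x)[2]      # evaluates every gradient, then picks the second
calls_variant1 = calls[]

calls[] = 0
g2 = grad_each[2](x)     # evaluates only the second gradient
calls_variant2 = calls[]
```

Both calls return the same vector, but the first variant evaluates one gradient per data point while the second evaluates exactly one.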

For the mean, the gradient is

\[\operatorname{grad}f(x) = \sum_{i=1}^N \operatorname{grad}f_i(x) \quad \text{where} \quad \operatorname{grad}f_i(x) = -\log_x p_i,\]

which we define in Manopt.jl in two different ways: either as one function returning all gradients as a vector (see gradF), or, maybe more fitting for a large scale problem, as a vector of small gradient functions (see gradf)

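# cost: scaled sum of squared distances to the data points
F(M, p) = 1 / (2 * n) * sum(map(q -> distance(M, p, q)^2, data))
# variant 1: one function returning all single gradients, each evaluated at p
gradF(M, p) = [grad_distance(M, q, p) for q in data]
# variant 2: a vector of functions, one per data point q
gradf = [(M, p) -> grad_distance(M, q, p) for q in data];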
p0 = 1 / sqrt(3) * [1.0, 1.0, 1.0]
3-element Vector{Float64}:
 0.5773502691896258
 0.5773502691896258
 0.5773502691896258

The two calls differ only slightly, but note that accessing, say, the second gradient through gradF requires evaluating all n logarithmic maps, while with gradf we call only one of the functions in the vector. So while you can use both gradF and gradf in the following call, the second one is (much) faster:

p_opt1 = stochastic_gradient_descent(M, gradF, p)
3-element Vector{Float64}:
 -0.19197815360666842
  0.14005182005854327
  0.026342223307325087
@benchmark stochastic_gradient_descent($M, $gradF, $p0)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 6.815 s (3.81% GC) to evaluate,
 with a memory estimate of 7.83 GiB, over 100148148 allocations.
p_opt2 = stochastic_gradient_descent(M, gradf, p0)
3-element Vector{Float64}:
 0.09661685716733948
 0.30673827339026233
 0.9468774020688563
@benchmark stochastic_gradient_descent($M, $gradf, $p0)
BenchmarkTools.Trial: 1327 samples with 1 evaluation.
 Range (min … max):  3.528 ms …   6.679 ms  ┊ GC (min … max): 0.00% … 25.43%
 Time  (median):     3.615 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.764 ms ± 466.808 μs  ┊ GC (mean ± σ):  3.01% ±  7.77%

  3.53 ms      Histogram: log(frequency) by time      5.38 ms <

 Memory estimate: 3.13 MiB, allocs estimate: 40028.

This result is reasonably close. But we can improve it by using a DirectionUpdateRule, namely:

On the one hand, there is MomentumGradient, which requires both the manifold and the initial value in order to keep track of the iterate and to parallel transport the last direction to the current iterate. The necessary vector_transport_method keyword is set to a suitable default on every manifold, see default_vector_transport_method. We get

p_opt3 = stochastic_gradient_descent(
    M, gradf, p0; direction=MomentumGradient(M, p0; direction=StochasticGradient(M))
)
3-element Vector{Float64}:
  0.5011647464914469
 -0.8387050653644194
 -0.2130908496539228
MG = MomentumGradient(M, p0; direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$MG)
BenchmarkTools.Trial: 399 samples with 1 evaluation.
 Range (min … max):  11.733 ms …  14.962 ms  ┊ GC (min … max): 0.00% … 19.50%
 Time  (median):     12.007 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.524 ms ± 963.691 μs  ┊ GC (mean ± σ):  4.38% ±  6.80%

  11.7 ms         Histogram: frequency by time         14.6 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 229514.
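In formulas, a momentum rule of this kind can be sketched as follows. The notation is introduced here for illustration only: $m$ is a momentum factor, $\mathcal T$ a vector transport, $s_k$ a step size, and $i_k$ the index of the term chosen in step $k$. The exact scaling and default values used by Manopt.jl should be taken from its documentation.

\[\eta_k = m\,\mathcal T_{x_k \leftarrow x_{k-1}}(\eta_{k-1}) - s_k \operatorname{grad} f_{i_k}(x_k), \qquad x_{k+1} = \operatorname{retr}_{x_k}(\eta_k).\]

The vector transport is needed because $\eta_{k-1}$ lives in the tangent space at $x_{k-1}$ and has to be moved to the tangent space at $x_k$ before the two vectors can be added.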

On the other hand, AverageGradient computes an average of the last n gradients (n=10 in the call below; this keyword is unrelated to the data set size n defined earlier). This is done by

p_opt4 = stochastic_gradient_descent(
    M, gradf, p0; direction=AverageGradient(M, p0; n=10, direction=StochasticGradient(M))
)
3-element Vector{Float64}:
 0.6718677562271265
 0.036703234240620684
 0.7397611714185921
AG = AverageGradient(M, p0; n=10, direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$AG)
BenchmarkTools.Trial: 126 samples with 1 evaluation.
 Range (min … max):  37.586 ms … 54.377 ms  ┊ GC (min … max): 0.00% … 4.34%
 Time  (median):     40.086 ms              ┊ GC (median):    5.58%
 Time  (mean ± σ):   39.968 ms ±  1.701 ms  ┊ GC (mean ± σ):  4.35% ± 2.43%

  37.6 ms         Histogram: frequency by time        43.8 ms <

 Memory estimate: 33.64 MiB, allocs estimate: 549514.
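An averaging rule of this kind can be sketched as below. The notation is an assumption made for illustration: $n$ is the window length (10 in the call above), $\mathcal T$ a vector transport, $s_k$ a step size, and $i_k$ the index of the term chosen in step $k$. Uniform weights are assumed; the precise rule implemented in Manopt.jl should be taken from its documentation.

\[\eta_k = -\frac{s_k}{n}\sum_{j=0}^{n-1} \mathcal T_{x_k \leftarrow x_{k-j}}\bigl(\operatorname{grad} f_{i_{k-j}}(x_{k-j})\bigr), \qquad x_{k+1} = \operatorname{retr}_{x_k}(\eta_k).\]

Transporting and summing $n$ stored gradients in every step is consistent with the higher run time and memory use compared to the plain stochastic run.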

Note that the default StoppingCriterion is a fixed number of iterations, which helps the comparison here.

For both update rules we have to specify internally, via direction=StochasticGradient(M), that we are still in the stochastic setting, since both rules can also be used with the IdentityUpdateRule within gradient_descent.

For this not-that-large-scale example we can of course also use a gradient descent with ArmijoLinesearch,

# full gradient of F; the factor 1 / n matches the scaling used in F
fullGradF(M, p) = 1 / n * sum(grad_distance(M, q, p) for q in data)
p_opt5 = gradient_descent(M, F, fullGradF, p0; stepsize=ArmijoLinesearch(M))
3-element Vector{Float64}:
  0.7212810407545467
 -0.08714530556591063
  0.6871385274934468

but it is usually slower; here a run takes about 650 ms, compared to roughly 4 to 40 ms for the stochastic runs with gradf

AL = ArmijoLinesearch(M);
@benchmark gradient_descent($M, $F, $fullGradF, $p0; stepsize=$AL)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  640.268 ms … 659.038 ms  ┊ GC (min … max): 4.11% … 4.28%
 Time  (median):     647.474 ms               ┊ GC (median):    4.37%
 Time  (mean ± σ):   648.395 ms ±   6.735 ms  ┊ GC (mean ± σ):  4.36% ± 0.15%

  640 ms           Histogram: frequency by time          659 ms <

 Memory estimate: 749.00 MiB, allocs estimate: 12025142.

Note that all five runs are very close to each other. Here we check the distance to the first