How to run stochastic gradient descent

Ronny Bergmann

This tutorial illustrates how to use the stochastic_gradient_descent solver and how different DirectionUpdateRules yield its averaging and momentum variants, see Stochastic Gradient Descent.

Computationally, we look at a very simple but large scale problem, the Riemannian Center of Mass or Fréchet mean: for given points $p_i ∈\mathcal M$, $i=1,…,N$ this optimization problem reads

\[\operatorname*{arg\,min}_{x∈\mathcal M} \frac{1}{2}\sum_{i=1}^{N} \operatorname{d}^2_{\mathcal M}(x,p_i),\]

which of course can be (and is) solved by a gradient descent, see the introductory tutorial or Statistics in Manifolds.jl. If $N$ is very large, evaluating the complete gradient might be quite expensive. A remedy is to evaluate only one of the terms at a time and choose a random order for these.
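Written as an update rule, one such step can be sketched as follows. This is the generic form of a stochastic gradient step; the step size $s_k > 0$, the retraction $\operatorname{retr}$ (for example the exponential map), and the randomly chosen index $i_k$ are notational assumptions and not necessarily the solver's defaults.

\[x_{k+1} = \operatorname{retr}_{x_k}\bigl(-s_k \operatorname{grad} f_{i_k}(x_k)\bigr), \qquad f_i(x) = \frac{1}{2}\operatorname{d}^2_{\mathcal M}(x,p_i), \quad i_k ∈ \{1,…,N\}.\]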

We first initialize the packages

using Manifolds, Manopt, Random, BenchmarkTools, ManifoldDiff
using ManifoldDiff: grad_distance
Random.seed!(42);

We next generate a moderately large data set of n = 5000 points on the sphere

n = 5000
σ = π / 12
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
data = [exp(M, p,  σ * rand(M; vector_at=p)) for i in 1:n];

Note that since the points are constructed from zero-mean tangent vectors at p, their mean should be very close to the base point p.
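This expectation can be checked with the mean function from the statistics part of Manifolds.jl. The following is a minimal sketch that repeats the construction with only 1000 points (an arbitrary choice to keep the check cheap) and prints the geodesic distance between the estimated mean m and p; the names m and d are introduced here for illustration.

```julia
using Manifolds, Random

Random.seed!(42)
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
σ = π / 12
# same construction as in the tutorial, with fewer points to keep the check cheap
data = [exp(M, p, σ * rand(M; vector_at=p)) for _ in 1:1000]

# Riemannian center of mass as provided by Manifolds.jl
m = mean(M, data)

# geodesic distance between the estimated mean and the base point
d = distance(M, m, p)
println("distance(mean, p) = ", d)
```

A small value of d indicates that p is a reasonable reference point for judging the solver results.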

In order to use the stochastic gradient, we now need a function that returns the vector of gradients. There are two ways to define it in Manopt.jl: either as a single function that returns a vector, or as a vector of functions.

The first variant is of course easier to define, but the second is more efficient when only evaluating one of the gradients.

For the mean, the gradient is

\[\operatorname{grad}f(x) = \sum_{i=1}^N \operatorname{grad}f_i(x) \quad \text{where} \quad \operatorname{grad}f_i(x) = -\log_x p_i\]

which we define in Manopt.jl in two different ways: either as one function returning all gradients as a vector (see gradF), or, maybe more fitting for a large scale problem, as a vector of small gradient functions (see gradf)

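# cost: mean of the halved squared distances to all data points
F(M, p) = 1 / (2 * n) * sum(map(q -> distance(M, p, q)^2, data))
# grad_distance(M, q, p) is the gradient of p ↦ d²(p, q) / 2, a tangent vector at p
gradF(M, p) = [grad_distance(M, q, p) for q in data]
gradf = [(M, p) -> grad_distance(M, q, p) for q in data];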
p0 = 1 / sqrt(3) * [1.0, 1.0, 1.0]

3-element Vector{Float64}:
 0.5773502691896258
 0.5773502691896258
 0.5773502691896258
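To make the role of the vector of functions concrete, the following hand-written sketch performs one pass over all terms in random order. It is not Manopt.jl's implementation: the helper one_pass, the step size 1/k, the use of the exponential map as retraction, and the smaller data set of 1000 points are assumptions made for illustration.

```julia
using Manifolds, ManifoldDiff, Random
using ManifoldDiff: grad_distance

Random.seed!(42)
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
data = [exp(M, p, (π / 12) * rand(M; vector_at=p)) for _ in 1:1000]

# one small gradient function per data point; grad_distance(M, q, x) is the
# gradient of x ↦ d²(x, q) / 2, that is -log_x q, a tangent vector at x
gradf = [(M, x) -> grad_distance(M, q, x) for q in data]

# hypothetical helper: one pass over all terms in random order
function one_pass(M, gradf, x0)
    x = copy(x0)
    for (k, i) in enumerate(randperm(length(gradf)))
        X = gradf[i](M, x)      # evaluates exactly one term
        x = exp(M, x, -X / k)   # assumed step size 1/k, exponential map as retraction
    end
    return x
end

x0 = 1 / sqrt(3) * [1.0, 1.0, 1.0]
X0 = gradf[1](M, x0)
x1 = one_pass(M, gradf, x0)
println("gradient is tangent at x0: ", is_vector(M, x0, X0))
println("distance to p after one pass: ", distance(M, x1, p))
```

Each step touches a single entry of gradf, which is the access pattern the vector of functions is designed for.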

The definitions differ only slightly, but note that accessing a single gradient, say the second one, requires evaluating all n logarithmic maps with gradF, whereas with gradf only one of the functions is called. So while both gradF and gradf can be used in the following calls, the second one is (much) faster:

p_opt1 = stochastic_gradient_descent(M, gradF, p0)

3-element Vector{Float64}:
 -0.19197815360666842
  0.14005182005854327
  0.026342223307325087

@benchmark stochastic_gradient_descent($M, $gradF, $p0)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 6.953 s (5.52% GC) to evaluate,
 with a memory estimate of 7.83 GiB, over 100148141 allocations.

p_opt2 = stochastic_gradient_descent(M, gradf, p0)

3-element Vector{Float64}:
 0.09661685716733948
 0.30673827339026233
 0.9468774020688563

@benchmark stochastic_gradient_descent($M, $gradf, $p0)

BenchmarkTools.Trial: 1259 samples with 1 evaluation.
 Range (min … max):  3.605 ms … 163.533 ms  ┊ GC (min … max): 0.00% … 97.42%
 Time  (median):     3.674 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.971 ms ±   4.521 ms  ┊ GC (mean ± σ):  5.33% ±  6.97%

  ▁▇█▆▅▄▁▂▁                                            ▁▃▁     
  █████████▇▅▅▅▅▁▁▁▄▄▄▄▁▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▅▇█████▇▇ █
  3.6 ms       Histogram: log(frequency) by time      5.09 ms <

 Memory estimate: 3.13 MiB, allocs estimate: 40021.

The result p_opt2 is only roughly close to p, at a geodesic distance of about 0.74. We can try to improve this by using a DirectionUpdateRule, namely:

On the one hand there is MomentumGradient, which requires both the manifold and the initial value in order to keep track of the iterate and to parallel transport the last direction to the current iterate. The necessary vector_transport_method keyword is set to a suitable default on every manifold, see default_vector_transport_method. We get

p_opt3 = stochastic_gradient_descent(
    M, gradf, p0; direction=MomentumGradient(M, p0; direction=StochasticGradient(M))
)

3-element Vector{Float64}:
 0.3379790946398705
 0.3387798770292505
 0.8780651037972409

MG = MomentumGradient(M, p0; direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$MG)

BenchmarkTools.Trial: 389 samples with 1 evaluation.
 Range (min … max):  11.514 ms … 173.887 ms  ┊ GC (min … max): 0.00% … 92.04%
 Time  (median):     11.710 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.852 ms ±   8.303 ms  ┊ GC (mean ± σ):  6.57% ±  7.36%

  ▆█▆▃                  ▃▄▂                               ▁     
  ████▇█▅▄▄▁▁▁▁▁▄▁▁▁▁▁▁█████▇▄▄▁▅▁▁▁▁▁▁▁▁▄▁▁▁▄▄▁▁▁▁▁▁▄▁▁▁▄█▇▄▄ ▆
  11.5 ms       Histogram: log(frequency) by time      16.8 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 229507.

On the other hand, AverageGradient computes an average of the last n gradients, where n is the keyword of the rule (here n=10) and not the number of data points. This is done by

p_opt4 = stochastic_gradient_descent(
    M, gradf, p0; direction=AverageGradient(M, p0; n=10, direction=StochasticGradient(M)), debug=[],
)

3-element Vector{Float64}:
 0.7715614094892072
 0.026687369554185804
 0.6355948203795436

AG = AverageGradient(M, p0; n=10, direction=StochasticGradient(M));
@benchmark stochastic_gradient_descent($M, $gradf, $p0; direction=$AG, debug=[])

BenchmarkTools.Trial: 118 samples with 1 evaluation.
 Range (min … max):  38.040 ms … 202.791 ms  ┊ GC (min … max): 0.00% … 80.12%
 Time  (median):     40.218 ms               ┊ GC (median):    5.02%
 Time  (mean ± σ):   42.412 ms ±  15.494 ms  ┊ GC (mean ± σ):  6.71% ±  7.40%

  ▅   █▄                                                        
  █▆▆▆███▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁█▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  38 ms         Histogram: log(frequency) by time      66.3 ms <

 Memory estimate: 33.64 MiB, allocs estimate: 549509.

Note that the default StoppingCriterion is a fixed number of iterations, so all stochastic runs perform the same number of steps, which helps the comparison here.
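The iteration budget can also be set explicitly. The following is a minimal sketch, assuming the stopping_criterion keyword and the StopAfterIteration criterion of Manopt.jl; the smaller data set of 500 points and the budgets 200 and 2000 are arbitrary choices for illustration.

```julia
using Manifolds, Manopt, ManifoldDiff, Random
using ManifoldDiff: grad_distance

Random.seed!(42)
M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]
data = [exp(M, p, (π / 12) * rand(M; vector_at=p)) for _ in 1:500]
gradf = [(M, x) -> grad_distance(M, q, x) for q in data]
p0 = 1 / sqrt(3) * [1.0, 1.0, 1.0]

# fix the iteration budget explicitly instead of relying on the default
q_short = stochastic_gradient_descent(
    M, gradf, p0; stopping_criterion=StopAfterIteration(200)
)
q_long = stochastic_gradient_descent(
    M, gradf, p0; stopping_criterion=StopAfterIteration(2000)
)

println("distance to p after  200 iterations: ", distance(M, q_short, p))
println("distance to p after 2000 iterations: ", distance(M, q_long, p))
```

Using the same budget for every variant keeps timings and results comparable.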

For both update rules we have to pass StochasticGradient as the inner direction to indicate that we are still in the stochastic setting, since both rules can also be used with the IdentityUpdateRule within gradient_descent.

For this not-that-large-scale example we can of course also use a gradient descent with ArmijoLinesearch,

# full gradient of F, including the factor 1 / n used in the cost
fullGradF(M, p) = 1 / n * sum(grad_distance(M, q, p) for q in data)
p_opt5 = gradient_descent(M, F, fullGradF, p0; stepsize=ArmijoLinesearch(M))

3-element Vector{Float64}:
  0.7864767902648482
 -0.06253844040250767
  0.6144454425306826

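but since every iteration evaluates all n gradient terms, and the line search additionally evaluates the cost, it is in general expected to be slower.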

AL = ArmijoLinesearch(M);
@benchmark gradient_descent($M, $F, $fullGradF, $p0; stepsize=$AL)

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  685.182 ms … 958.779 ms  ┊ GC (min … max): 3.83% … 20.25%
 Time  (median):     704.015 ms               ┊ GC (median):    4.65%
 Time  (mean ± σ):   739.568 ms ±  97.531 ms  ┊ GC (mean ± σ):  7.29% ±  6.03%

  ▁▁  █▁  ▁                                                   ▁  
  ██▁▁██▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  685 ms           Histogram: frequency by time          959 ms <

 Memory estimate: 748.99 MiB, allocs estimate: 12025136.

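Note that the printed results differ in quality: p_opt4 and p_opt5 lie within a geodesic distance of roughly 0.10 and 0.14 of p, while p_opt2 and p_opt3 are at roughly 0.74 and 0.54. The printed p_opt1 has a norm of about 0.24 and hence is not a point on the sphere, so it should be left out of this comparison.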
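One way to compare the runs quantitatively is the geodesic distance of each result to the base point p. The following minimal sketch uses the vectors copied from the printed output above; the container results is introduced here for illustration only.

```julia
using Manifolds, LinearAlgebra

M = Sphere(2)
p = 1 / sqrt(2) * [1.0, 0.0, 1.0]

# results copied from the printed output above
results = [
    "p_opt1" => [-0.19197815360666842, 0.14005182005854327, 0.026342223307325087],
    "p_opt2" => [0.09661685716733948, 0.30673827339026233, 0.9468774020688563],
    "p_opt3" => [0.3379790946398705, 0.3387798770292505, 0.8780651037972409],
    "p_opt4" => [0.7715614094892072, 0.026687369554185804, 0.6355948203795436],
    "p_opt5" => [0.7864767902648482, -0.06253844040250767, 0.6144454425306826],
]

for (name, q) in results
    if is_point(M, q)
        println(name, ": distance to p = ", distance(M, p, q))
    else
        # a vector that is not of unit norm is not a point on the sphere
        println(name, ": not a point on the sphere, norm = ", norm(q))
    end
end
```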

Technical details

This tutorial is cached. It was last run on the following package versions.

using Pkg
Pkg.status()

Status `~/work/Manopt.jl/Manopt.jl/tutorials/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [5ae59095] Colors v0.12.11
  [31c24e10] Distributions v0.25.109
  [26cc04aa] FiniteDifferences v0.12.32
  [7073ff75] IJulia v1.25.0
  [8ac3fa9e] LRUCache v1.6.1
  [af67fdf4] ManifoldDiff v0.3.10
  [1cead3c2] Manifolds v0.9.20
  [3362f125] ManifoldsBase v0.15.10
  [0fc0a36d] Manopt v0.4.67 `..`
  [91a5bcdd] Plots v1.40.5

using Dates
now()

2024-07-25T08:13:59.500