To this point, all `torch`

use instances we’ve mentioned right here have been in deep studying. Nonetheless, its computerized differentiation characteristic is helpful in different areas. One outstanding instance is numerical optimization: We will use `torch`

to search out the minimal of a perform.

In reality, perform minimization is *precisely* what occurs in coaching a neural community. However there, the perform in query usually is much too advanced to even think about discovering its minima analytically. Numerical optimization goals at build up the instruments to deal with simply this complexity. To that finish, nonetheless, it begins from features which are far much less deeply composed. As an alternative, they’re hand-crafted to pose particular challenges.

This submit is a primary introduction to numerical optimization with `torch`

. Central takeaways are the existence and usefulness of its L-BFGS optimizer, in addition to the affect of operating L-BFGS with line search. As a enjoyable add-on, we present an instance of constrained optimization, the place a constraint is enforced by way of a quadratic penalty perform.

To heat up, we take a detour, minimizing a perform “ourselves” utilizing nothing however tensors. This may grow to be related later, although, as the general course of will nonetheless be the identical. All modifications will likely be associated to integration of `optimizer`

s and their capabilities.

## Operate minimization, DYI method

To see how we will reduce a perform “by hand”, let’s attempt the long-lasting Rosenbrock function. This can be a perform with two variables:

[

f(x_1, x_2) = (a – x_1)^2 + b * (x_2 – x_1^2)^2

]

, with (a) and (b) configurable parameters usually set to 1 and 5, respectively.

In R:

Its minimal is positioned at (1,1), inside a slender valley surrounded by breakneck-steep cliffs:

Our purpose and technique are as follows.

We wish to discover the values (x_1) and (x_2) for which the perform attains its minimal. We have now to start out someplace; and from wherever that will get us on the graph we observe the damaging of the gradient “downwards”, descending into areas of consecutively smaller perform worth.

Concretely, in each iteration, we take the present ((x1,x2)) level, compute the perform worth in addition to the gradient, and subtract some fraction of the latter to reach at a brand new ((x1,x2)) candidate. This course of goes on till we both attain the minimal – the gradient is zero – or enchancment is beneath a selected threshold.

Right here is the corresponding code. For no particular causes, we begin at `(-1,1)`

. The educational price (the fraction of the gradient to subtract) wants some experimentation. (Attempt 0.1 and 0.001 to see its affect.)

```
num_iterations <- 1000
# fraction of the gradient to subtract
lr <- 0.01
# perform enter (x1,x2)
# that is the tensor w.r.t. which we'll have torch compute the gradient
x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)
for (i in 1:num_iterations) {
if (i %% 100 == 0) cat("Iteration: ", i, "n")
# name perform
worth <- rosenbrock(x_star)
if (i %% 100 == 0) cat("Worth is: ", as.numeric(worth), "n")
# compute gradient of worth w.r.t. params
worth$backward()
if (i %% 100 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "nn")
# handbook replace
with_no_grad({
x_star$sub_(lr * x_star$grad)
x_star$grad$zero_()
})
}
```

```
Iteration: 100
Worth is: 0.3502924
Gradient is: -0.667685 -0.5771312
Iteration: 200
Worth is: 0.07398106
Gradient is: -0.1603189 -0.2532476
...
...
Iteration: 900
Worth is: 0.0001532408
Gradient is: -0.004811743 -0.009894371
Iteration: 1000
Worth is: 6.962555e-05
Gradient is: -0.003222887 -0.006653666
```

Whereas this works, it actually serves for instance the precept. With `torch`

offering a bunch of confirmed optimization algorithms, there isn’t any want for us to manually compute the candidate (mathbf{x}) values.

## Operate minimization with `torch`

optimizers

As an alternative, we let a `torch`

optimizer replace the candidate (mathbf{x}) for us. Habitually, our first attempt is *Adam*.

### Adam

With Adam, optimization proceeds quite a bit sooner. Reality be informed, although, selecting a very good studying price *nonetheless* takes non-negligeable experimentation. (Attempt the default studying price, 0.001, for comparability.)

```
num_iterations <- 100
x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)
lr <- 1
optimizer <- optim_adam(x_star, lr)
for (i in 1:num_iterations) {
if (i %% 10 == 0) cat("Iteration: ", i, "n")
optimizer$zero_grad()
worth <- rosenbrock(x_star)
if (i %% 10 == 0) cat("Worth is: ", as.numeric(worth), "n")
worth$backward()
optimizer$step()
if (i %% 10 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "nn")
}
```

```
Iteration: 10
Worth is: 0.8559565
Gradient is: -1.732036 -0.5898831
Iteration: 20
Worth is: 0.1282992
Gradient is: -3.22681 1.577383
...
...
Iteration: 90
Worth is: 4.003079e-05
Gradient is: -0.05383469 0.02346456
Iteration: 100
Worth is: 6.937736e-05
Gradient is: -0.003240437 -0.006630421
```

It took us a couple of hundred iterations to reach at an honest worth. This can be a lot sooner than the handbook method above, however nonetheless quite a bit. Fortunately, additional enhancements are potential.

### L-BFGS

Among the many many `torch`

optimizers generally utilized in deep studying (Adam, AdamW, RMSprop …), there’s one “outsider”, a lot better identified in basic numerical optimization than in neural-networks house: L-BFGS, a.ok.a. Limited-memory BFGS, a memory-optimized implementation of the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm (BFGS).

BFGS is maybe probably the most extensively used among the many so-called Quasi-Newton, second-order optimization algorithms. Versus the household of first-order algorithms that, in deciding on a descent path, make use of gradient info solely, second-order algorithms moreover take curvature info under consideration. To that finish, actual Newton strategies really compute the Hessian (a expensive operation), whereas Quasi-Newton strategies keep away from that price and, as a substitute, resort to iterative approximation.

Trying on the contours of the Rosenbrock perform, with its extended, slender valley, it isn’t troublesome to think about that curvature info would possibly make a distinction. And, as you’ll see in a second, it actually does. Earlier than although, one observe on the code. When utilizing L-BFGS, it’s essential to wrap each perform name and gradient analysis in a closure (`calc_loss()`

, within the beneath snippet), for them to be callable a number of instances per iteration. You possibly can persuade your self that the closure is, in reality, entered repeatedly, by inspecting this code snippet’s chatty output:

```
num_iterations <- 3
x_star <- torch_tensor(c(-1, 1), requires_grad = TRUE)
optimizer <- optim_lbfgs(x_star)
calc_loss <- perform() {
optimizer$zero_grad()
worth <- rosenbrock(x_star)
cat("Worth is: ", as.numeric(worth), "n")
worth$backward()
cat("Gradient is: ", as.matrix(x_star$grad), "nn")
worth
}
for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer$step(calc_loss)
}
```

```
Iteration: 1
Worth is: 4
Gradient is: -4 0
Worth is: 6
Gradient is: -2 10
...
...
Worth is: 0.04880721
Gradient is: -0.262119 -0.1132655
Worth is: 0.0302862
Gradient is: 1.293824 -0.7403332
Iteration: 2
Worth is: 0.01697086
Gradient is: 0.3468466 -0.3173429
Worth is: 0.01124081
Gradient is: 0.2420997 -0.2347881
...
...
Worth is: 1.111701e-09
Gradient is: 0.0002865837 -0.0001251698
Worth is: 4.547474e-12
Gradient is: -1.907349e-05 9.536743e-06
Iteration: 3
Worth is: 4.547474e-12
Gradient is: -1.907349e-05 9.536743e-06
```

Regardless that we ran the algorithm for 3 iterations, the optimum worth actually is reached after two. Seeing how nicely this labored, we attempt L-BFGS on a harder perform, named *flower*, for fairly self-evident causes.

## (But) extra enjoyable with L-BFGS

Right here is the *flower* perform. Mathematically, its minimal is close to `(0,0)`

, however technically the perform itself is undefined at `(0,0)`

, for the reason that `atan2`

used within the perform shouldn’t be outlined there.

```
a <- 1
b <- 1
c <- 4
flower <- perform(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}
```

We run the identical code as above, ranging from `(20,20)`

this time.

```
num_iterations <- 3
x_star <- torch_tensor(c(20, 0), requires_grad = TRUE)
optimizer <- optim_lbfgs(x_star)
calc_loss <- perform() {
optimizer$zero_grad()
worth <- flower(x_star)
cat("Worth is: ", as.numeric(worth), "n")
worth$backward()
cat("Gradient is: ", as.matrix(x_star$grad), "n")
cat("X is: ", as.matrix(x_star), "nn")
worth
}
for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer$step(calc_loss)
}
```

```
Iteration: 1
Worth is: 28.28427
Gradient is: 0.8071069 0.6071068
X is: 20 20
...
...
Worth is: 19.33546
Gradient is: 0.8100872 0.6188223
X is: 12.957 14.68274
...
...
Worth is: 18.29546
Gradient is: 0.8096464 0.622064
X is: 12.14691 14.06392
...
...
Worth is: 9.853705
Gradient is: 0.7546976 0.7025688
X is: 5.763702 8.895616
Worth is: 2635.866
Gradient is: -0.7407354 -0.6717985
X is: -1949.697 -1773.551
Iteration: 2
Worth is: 1333.113
Gradient is: -0.7413024 -0.6711776
X is: -985.4553 -897.5367
Worth is: 30.16862
Gradient is: -0.7903821 -0.6266789
X is: -21.02814 -21.72296
Worth is: 1281.39
Gradient is: 0.7544561 0.6563575
X is: 964.0121 843.7817
Worth is: 628.1306
Gradient is: 0.7616636 0.6480014
X is: 475.7051 409.7372
Worth is: 4965690
Gradient is: -0.7493951 -0.662123
X is: -3721262 -3287901
Worth is: 2482306
Gradient is: -0.7503822 -0.6610042
X is: -1862675 -1640817
Worth is: 8.61863e+11
Gradient is: 0.7486113 0.6630091
X is: 645200412672 571423064064
Worth is: 430929412096
Gradient is: 0.7487153 0.6628917
X is: 322643460096 285659529216
Worth is: Inf
Gradient is: 0 0
X is: -2.826342e+19 -2.503904e+19
Iteration: 3
Worth is: Inf
Gradient is: 0 0
X is: -2.826342e+19 -2.503904e+19
```

This has been much less of a hit. At first, loss decreases properly, however all of a sudden, the estimate dramatically overshoots, and retains bouncing between damaging and optimistic outer house ever after.

Fortunately, there’s something we will do.

### L-BFGS with line search

Taken in isolation, what a Quasi-Newton methodology like L-BFGS does is decide the perfect descent path. Nonetheless, as we simply noticed, a very good path shouldn’t be sufficient. With the flower perform, wherever we’re, the optimum path results in catastrophe if we keep on it lengthy sufficient. Thus, we’d like an algorithm that fastidiously evaluates not solely the place to go, but in addition, how far.

Because of this, L-BFGS implementations generally incorporate *line search*, that’s, a algorithm indicating whether or not a proposed step size is an effective one, or needs to be improved upon.

Particularly, `torch`

’s L-BFGS optimizer implements the Strong Wolfe conditions. We re-run the above code, altering simply two strains. Most significantly, the one the place the optimizer is instantiated:

`optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe")`

And secondly, this time I discovered that after the third iteration, loss continued to lower for some time, so I let it run for 5 iterations. Right here is the output:

```
Iteration: 1
...
...
Worth is: -0.8838741
Gradient is: 3.742207 7.521572
X is: 0.09035123 -0.03220009
Worth is: -0.928809
Gradient is: 1.464702 0.9466625
X is: 0.06564617 -0.026706
Iteration: 2
...
...
Worth is: -0.9991404
Gradient is: 39.28394 93.40318
X is: 0.0006493925 -0.0002656128
Worth is: -0.9992246
Gradient is: 6.372203 12.79636
X is: 0.0007130796 -0.0002947929
Iteration: 3
...
...
Worth is: -0.9997789
Gradient is: 3.565234 5.995832
X is: 0.0002042478 -8.457939e-05
Worth is: -0.9998025
Gradient is: -4.614189 -13.74602
X is: 0.0001822711 -7.553725e-05
Iteration: 4
...
...
Worth is: -0.9999917
Gradient is: -382.3041 -921.4625
X is: -6.320081e-06 2.614706e-06
Worth is: -0.9999923
Gradient is: -134.0946 -321.2681
X is: -6.921942e-06 2.865841e-06
Iteration: 5
...
...
Worth is: -0.9999999
Gradient is: -3446.911 -8320.007
X is: -7.267168e-08 3.009783e-08
Worth is: -0.9999999
Gradient is: -3419.361 -8253.501
X is: -7.404627e-08 3.066708e-08
```

It’s nonetheless not good, however quite a bit higher.

Lastly, let’s go one step additional. Can we use `torch`

for constrained optimization?

### Quadratic penalty for constrained optimization

In constrained optimization, we nonetheless seek for a minimal, however that minimal can’t reside simply wherever: Its location has to satisfy some variety of further situations. In optimization lingo, it needs to be *possible*.

As an instance, we stick with the flower perform, however add on a constraint: (mathbf{x}) has to lie outdoors a circle of radius (sqrt(2)), centered on the origin. Formally, this yields the inequality constraint

[

2 – {x_1}^2 – {x_2}^2 <= 0

]

A approach to reduce *flower* and but, on the similar time, honor the constraint is to make use of a penalty perform. With penalty strategies, the worth to be minimized is a sum of two issues: the goal perform’s output and a penalty reflecting potential constraint violation. Use of a *quadratic* *penalty*, for instance, ends in including a a number of of the sq. of the constraint perform’s output:

```
# x^2 + y^2 >= 2
# 2 - x^2 - y^2 <= 0
constraint <- perform(x) 2 - torch_square(torch_norm(x))
# quadratic penalty
penalty <- perform(x) torch_square(torch_max(constraint(x), different = 0))
```

A priori, we will’t understand how huge that a number of needs to be to implement the constraint. Due to this fact, optimization proceeds iteratively. We begin with a small multiplier, (1), say, and enhance it for so long as the constraint remains to be violated:

```
penalty_method <- perform(f, p, x, k_max, rho = 1, gamma = 2, num_iterations = 1) {
for (ok in 1:k_max) {
cat("Beginning step: ", ok, ", rho = ", rho, "n")
reduce(f, p, x, rho, num_iterations)
cat("Worth: ", as.numeric(f(x)), "n")
cat("X: ", as.matrix(x), "n")
current_penalty <- as.numeric(p(x))
cat("Penalty: ", current_penalty, "n")
if (current_penalty == 0) break
rho <- rho * gamma
}
}
```

`reduce()`

, referred to as from `penalty_method()`

, follows the same old proceedings, however now it minimizes the sum of the goal and up-weighted penalty perform outputs:

```
reduce <- perform(f, p, x, rho, num_iterations) {
calc_loss <- perform() {
optimizer$zero_grad()
worth <- f(x) + rho * p(x)
worth$backward()
worth
}
for (i in 1:num_iterations) {
cat("Iteration: ", i, "n")
optimizer$step(calc_loss)
}
}
```

This time, we begin from a low-target-loss, however unfeasible worth. With one more change to default L-BFGS (specifically, a lower in tolerance), we see the algorithm exiting efficiently after twenty-two iterations, on the level `(0.5411692,1.306563)`

.

```
x_star <- torch_tensor(c(0.5, 0.5), requires_grad = TRUE)
optimizer <- optim_lbfgs(x_star, line_search_fn = "strong_wolfe", tolerance_change = 1e-20)
penalty_method(flower, penalty, x_star, k_max = 30)
```

```
Beginning step: 1 , rho = 1
Iteration: 1
Worth: 0.3469974
X: 0.5154735 1.244463
Penalty: 0.03444662
Beginning step: 2 , rho = 2
Iteration: 1
Worth: 0.3818618
X: 0.5288152 1.276674
Penalty: 0.008182613
Beginning step: 3 , rho = 4
Iteration: 1
Worth: 0.3983252
X: 0.5351116 1.291886
Penalty: 0.001996888
...
...
Beginning step: 20 , rho = 524288
Iteration: 1
Worth: 0.4142133
X: 0.5411959 1.306563
Penalty: 3.552714e-13
Beginning step: 21 , rho = 1048576
Iteration: 1
Worth: 0.4142134
X: 0.5411956 1.306563
Penalty: 1.278977e-13
Beginning step: 22 , rho = 2097152
Iteration: 1
Worth: 0.4142135
X: 0.5411962 1.306563
Penalty: 0
```

## Conclusion

Summing up, we’ve gotten a primary impression of the effectiveness of `torch`

’s L-BFGS optimizer, particularly when used with Sturdy-Wolfe line search. In reality, in numerical optimization – versus deep studying, the place computational velocity is far more of a problem – there’s rarely a purpose to *not* use L-BFGS with line search.

We’ve then caught a glimpse of the right way to do constrained optimization, a process that arises in lots of real-world purposes. In that regard, this submit feels much more like a starting than a stock-taking. There’s a lot to discover, from basic methodology match – when is L-BFGS nicely suited to an issue? – by way of computational efficacy to applicability to completely different species of neural networks. Evidently, if this evokes you to run your personal experiments, and/or in case you use L-BFGS in your personal tasks, we’d love to listen to your suggestions!

Thanks for studying!

## Appendix

### Rosenbrock perform plotting code

```
library(tidyverse)
a <- 1
b <- 5
rosenbrock <- perform(x) {
x1 <- x[1]
x2 <- x[2]
(a - x1)^2 + b * (x2 - x1^2)^2
}
df <- expand_grid(x1 = seq(-2, 2, by = 0.01), x2 = seq(-2, 2, by = 0.01)) %>%
rowwise() %>%
mutate(x3 = rosenbrock(c(x1, x2))) %>%
ungroup()
ggplot(information = df,
aes(x = x1,
y = x2,
z = x3)) +
geom_contour_filled(breaks = as.numeric(torch_logspace(-3, 3, steps = 50)),
present.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(path = -1) +
theme(facet.ratio = 1)
```

### Flower perform plotting code

```
a <- 1
b <- 1
c <- 4
flower <- perform(x) {
a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}
df <- expand_grid(x = seq(-3, 3, by = 0.05), y = seq(-3, 3, by = 0.05)) %>%
rowwise() %>%
mutate(z = flower(torch_tensor(c(x, y))) %>% as.numeric()) %>%
ungroup()
ggplot(information = df,
aes(x = x,
y = y,
z = z)) +
geom_contour_filled(present.legend = FALSE) +
theme_minimal() +
scale_fill_viridis_d(path = -1) +
theme(facet.ratio = 1)
```

Picture by Michael Trimble on Unsplash