Stochastic Gradient Descent (SGD)
This post explores stochastic gradient descent (SGD), an optimization technique (optimizer) commonly used to train neural networks.
```python
from fastai.basics import *
from fastai.torch_core import tensor
import torch
from torch import nn
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('dark_background')
```
```python
n = 100
x = torch.ones(n, 2)
len(x), x[:5]
```
Randomize the first column with a uniform distribution from -1 to 1:

```python
x[:,0].uniform_(-1., 1)
x[:5], x.shape
```
- Any linear model is `y = mx + b`
- `m`, `x`, and `b` are matrices
- We have `x`

```python
m = tensor(3., 2); m, m.shape
```

- `b` is a random bias

```python
b = torch.rand(n); b[:5], b.shape
```
Now we can make our `y`:

- Matrix multiplication is denoted with `@`

```python
y = x@m + b
```

We'll know if we got a size wrong because the multiplication will fail:

```python
m@x + b  # RuntimeError: the inner dimensions don't match
```
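To see the shape rules concretely, here is a standalone sketch (with a smaller `x` than above, purely for illustration) showing that `x @ m` works while `m @ x` fails:

```python
import torch

x = torch.ones(5, 2)        # 5 rows, 2 columns
m = torch.tensor([3., 2.])  # 2 coefficients

print((x @ m).shape)        # torch.Size([5]) -- one prediction per row
try:
    m @ x                   # inner dimensions (2 and 5) don't line up
except RuntimeError as e:
    print("shape error:", e)
```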
Plot our results:

```python
plt.scatter(x[:,0], y)
```
Our weights from last lesson should minimize the distance between the points and our line.

- Mean squared error: take the difference between `pred` and `y`, square it, then average

```python
def mse(y_hat, y): return ((y_hat-y)**2).mean()
```
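To make the formula concrete, here's `mse` on a tiny hand-checkable example (the values are made up for illustration):

```python
import torch

def mse(y_hat, y): return ((y_hat - y)**2).mean()

y_hat = torch.tensor([1., 2., 3.])
y     = torch.tensor([1., 2., 5.])
# Differences are (0, 0, -2); squared: (0, 0, 4); mean: 4/3
print(mse(y_hat, y))  # tensor(1.3333)
```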
When we run our model, we are trying to predict `m`. For example, say `a = (0.5, 0.75)`. We will:

- Make a prediction
- Calculate the error

```python
a = tensor(.5, .75)
```

Make a prediction:

```python
y_pred = x@a
```

Calculate the error:

```python
mse(y_pred, y)
```

```python
plt.scatter(x[:,0], y)
plt.scatter(x[:,0], y_pred)
```
The model doesn't seem to quite fit. What's next? Optimization.

```python
a = nn.Parameter(a); a
```
Next let's create an `update` function that makes a prediction with the current `a`, measures the loss, and nudges `a` in the direction that reduces it. We'll print the loss every 10 iterations to see how we are doing.

```python
def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if i % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
```
- `torch.no_grad`: disables gradient tracking, so the weight update itself isn't recorded in the computation graph
- `sub_`: subtracts a value (`lr * a.grad`) in place
- `grad.zero_`: zeros our gradients so they don't accumulate across iterations
```python
lr = 1e-1
for i in range(100): update()
```
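Putting the pieces together, here is a self-contained version of the whole fit. This is a sketch: it uses a constant bias of 0.5 rather than the random one above, so the expected answer, roughly `(3, 2.5)`, is easy to verify.

```python
import torch
from torch import nn

torch.manual_seed(0)
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + 0.5   # true m = (3, 2), constant bias 0.5

def mse(y_hat, y): return ((y_hat - y)**2).mean()

a = nn.Parameter(torch.tensor([0.5, 0.75]))
lr = 1e-1
for i in range(100):
    loss = mse(x @ a, y)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)  # gradient descent step
        a.grad.zero_()

print(a)  # close to tensor([3.0, 2.5])
```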
Now let's see how this new `a` compares.

- `detach` returns the tensor without its gradient history, so we can plot it

```python
plt.scatter(x[:,0], y)
plt.scatter(x[:,0], (x@a).detach())
plt.scatter(x[:,0], y_pred)
```
The line fits much better now.
We can animate the training to watch the line fit:

```python
from matplotlib import animation, rc
rc('animation', html='jshtml')

a = nn.Parameter(tensor(0.5, 0.75)); a

def animate(i):
    update()
    line.set_ydata((x@a).detach())
    return line,

fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], (x@a).detach())
plt.close()

animation.FuncAnimation(fig, animate, np.arange(0, 100), interval=20)
```
Ideally we split the data into batches, compute the gradient and update on one batch at a time, and cycle through all of the batches (otherwise we'd run out of memory!). This mini-batching is what puts the "stochastic" in stochastic gradient descent.
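A sketch of what that mini-batching could look like for our linear fit (the batch size of 25 is an arbitrary choice for illustration, and a constant bias is used again so the target `(3, 2.5)` is known):

```python
import torch

torch.manual_seed(0)
n, bs = 100, 25                       # bs = batch size (arbitrary for illustration)
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + 0.5

a = torch.tensor([0.5, 0.75], requires_grad=True)
lr = 1e-1
for epoch in range(50):
    perm = torch.randperm(n)          # shuffle each epoch
    for i in range(0, n, bs):
        idx = perm[i:i+bs]            # gradient from one mini-batch only
        loss = ((x[idx] @ a - y[idx])**2).mean()
        loss.backward()
        with torch.no_grad():
            a -= lr * a.grad
            a.grad.zero_()

print(a)  # close to tensor([3.0, 2.5])
```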
If this were a classification problem, we would want to use cross-entropy loss, which penalizes confident incorrect predictions as well as unconfident correct predictions. It's also called negative log likelihood.
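For instance, a standalone sketch using PyTorch's `F.cross_entropy` with made-up logits for a 3-class problem:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])                         # true class is 0
confident_right = torch.tensor([[4.0, 0.0, 0.0]])  # high score on the true class
confident_wrong = torch.tensor([[0.0, 4.0, 0.0]])  # high score on a wrong class

print(F.cross_entropy(confident_right, target))  # small loss (~0.04)
print(F.cross_entropy(confident_wrong, target))  # large loss (~4.0)
```

The confident wrong prediction is punished far more heavily than the confident right one is rewarded, which is exactly the behavior we want from a classification loss.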