Stochastic Gradient Descent (SGD)
This post explores stochastic gradient descent (SGD), an optimization technique (optimizer) commonly used to train neural networks.
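In a nutshell, and we'll build this up step by step below, gradient descent repeatedly nudges the model's parameters in the direction that lowers the loss: `a = a - lr * gradient`, where `lr` is the learning rate. The "stochastic" part comes from estimating that gradient on small batches of data rather than the whole dataset.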
from fastai.basics import *
from fastai.torch_core import tensor
import torch
from torch import nn
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('dark_background')
n = 100
x = torch.ones(n, 2)
len(x), x[:5]
Randomize the first column with a uniform distribution from -1 to 1:
x[:,0].uniform_(-1., 1)
x[:5], x.shape
- Any linear model is `y = mx + b`
- `m`, `x`, and `b` are matrices
- We have `x`
m = tensor(3.,2); m, m.shape
- `b` is a random bias
b = torch.rand(n); b[:5], b.shape
Now we can make our y
- Matrix multiplication is denoted with `@`
y = x@m + b
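As a quick sanity check on `@` (a minimal sketch, separate from the model itself): for our shapes, `x@m` is the same as multiplying element-wise and summing across the columns.
check = (x * m).sum(dim=1)  # element-wise product, then sum over the 2 columns
torch.allclose(x@m, check)  # True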
We'll know if we got a size wrong, because the mismatched order raises a shape error:
m@x + b
Plot our results
plt.scatter(x[:,0], y)
We want to find weights that minimize the distance between the points and our line.
- Mean squared error: take the distance between `pred` and `y`, square it, then average
def mse(y_hat, y): return ((y_hat-y)**2).mean()
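To see what `mse` does, here's a tiny worked example with made-up numbers: the errors are 1 and -2, the squared errors are 1 and 4, and their mean is 2.5.
mse(tensor(1., 2.), tensor(0., 4.))  # tensor(2.5000)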
When we run our model, we are trying to recover the true parameters; we'll call our current guess `a`. For example, say `a = (0.5, 0.75)`. Then we:
- Make a prediction
- Calculate the error
a = tensor(.5, .75)
Make prediction
y_pred = x@a
Calculate error
mse(y_pred, y)
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],y_pred)
The model doesn't seem to quite fit. What's next? Optimization.
a = nn.Parameter(a); a
Next let's create an update function: make a prediction with the current `a`, measure the loss, backpropagate to get the gradients, and nudge `a` in the direction that reduces the loss. We'll print the loss every 10 iterations to see how we are doing.
def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if i % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
- `torch.no_grad()`: disables gradient tracking inside the block, so the weight update itself isn't recorded for backpropagation
- `sub_()`: subtracts a value in place (here `lr * a.grad`)
- `grad.zero_()`: zeros out the gradients so they don't accumulate across iterations (see the short sketch below)
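Here's a minimal standalone sketch (using a throwaway tensor `w`, not our model) of why zeroing matters: PyTorch accumulates gradients across `backward()` calls instead of replacing them.
w = torch.tensor(3., requires_grad=True)
(w * 2).backward()
w.grad          # tensor(2.)
(w * 2).backward()
w.grad          # tensor(4.), accumulated rather than replaced
w.grad.zero_()
w.grad          # tensor(0.)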
lr = 1e-1
for i in range(100): update()
Now let's see how this new a compares.
- `detach()` returns a tensor detached from the computation graph (no gradients), so we can plot it
plt.scatter(x[:,0],y)
plt.scatter(x[:,0], (x@a).detach())
plt.scatter(x[:,0],y_pred)
We fit our line much better here
from matplotlib import animation, rc
rc('animation', html='jshtml')
a = nn.Parameter(tensor(0.5, 0.75)); a
def animate(i):
    update()
    line.set_ydata((x@a).detach())
    return line,
fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], (x@a).detach())
plt.close()
animation.FuncAnimation(fig, animate, np.arange(0,100), interval=20)
Ideally we split the data into mini-batches, fit on one batch at a time, and loop over all of them (otherwise we'd run out of memory!). Taking each gradient step on a small random batch rather than the full dataset is what puts the "stochastic" in stochastic gradient descent.
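Here's a minimal sketch of that idea, reusing `x`, `y`, `n`, and `mse` from above; the batch size and number of epochs are arbitrary choices for illustration.
bs = 25
a = nn.Parameter(tensor(0.5, 0.75))
lr = 1e-1
for epoch in range(10):
    idx = torch.randperm(n)               # shuffle the data each epoch
    for start in range(0, n, bs):
        batch = idx[start:start + bs]     # indices for one mini-batch
        loss = mse(x[batch]@a, y[batch])  # loss on this batch only
        loss.backward()
        with torch.no_grad():
            a.sub_(lr * a.grad)           # step against the gradient
            a.grad.zero_()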
If this were a classification problem, we would want to use cross entropy loss, which heavily penalizes confident incorrect predictions and also penalizes unconfident correct predictions. It's also called negative log likelihood.
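A tiny sketch of that behaviour with made-up logits for a two-class problem (true class is 0): the confident wrong prediction gets a much larger loss than the mildly confident correct one.
loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([0])                   # the true class is 0
confident_wrong = torch.tensor([[-3., 3.]])  # strongly predicts class 1
unsure_right = torch.tensor([[0.5, -0.5]])   # mildly predicts class 0
loss_fn(confident_wrong, target)             # large loss (about 6.0)
loss_fn(unsure_right, target)                # small loss (about 0.3)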