Stochastic Gradient Descent (SGD)
This post explores SGD, an optimization technique (optimizer) that is commonly used to train neural networks.
import matplotlib.pyplot as plt
plt.style.use('dark_background')
from fastai.basics import *
import torch
from torch import nn
import numpy as np
from fastai.torch_core import tensor
n = 100
x = torch.ones(n, 2)
len(x), x[:5]
Randomize the first column with a uniform distribution from -1 to 1. The second column stays at 1, so it acts as the intercept term.
x[:,0].uniform_(-1., 1)
x[:5], x.shape
- Any linear model is y=mx+b
- m, x, and b are matrices
- We have x
m = tensor(3.,2); m, m.shape
- b is a random bias
b = torch.rand(n); b[:5], b.shape
Now we can make our y
- Matrix multiplication is denoted with @
y = x@m + b
We'll know we got a size wrong if the shapes don't line up. For example, swapping the operands raises an error:
m@x + b
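The shape rule behind that error can be checked directly. Below is a minimal sketch that uses plain PyTorch (`torch.tensor` rather than fastai's `tensor` helper, which is a substitution made here so the snippet stands alone). It assumes the same sizes as above: x is (100, 2) and m is (2,).

```python
import torch

n = 100
x = torch.ones(n, 2)            # 100 rows, 2 columns
m = torch.tensor([3., 2.])      # 2 weights
b = torch.rand(n)               # one random offset per row

# (100, 2) @ (2,) -> (100,): the inner dimensions match
y = x @ m + b
print(y.shape)

# (2,) @ (100, 2): the inner dimensions are 2 and 100, so this fails
try:
    m @ x
    failed = False
except RuntimeError as err:
    failed = True
    print("size mismatch:", err)
```

The order matters because `@` pairs the last dimension of the left operand with the first dimension of the right one.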
Plot our results
plt.scatter(x[:,0], y)
Our weights from last lesson should minimize the distance between points and our line.
- mean squared error: Take the distance between pred and y, square it, then average
def mse(y_hat, y): return ((y_hat-y)**2).mean()
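To make the arithmetic concrete, here is a tiny worked example in plain Python. The numbers are made up for illustration, and this list-based `mse` is only a stand-in for the tensor version above.

```python
def mse(y_hat, y):
    # mean of the squared differences, same idea as the tensor version
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

preds   = [1.0, 2.0, 3.0]
targets = [1.0, 4.0, 6.0]

# differences: 0, -2, -3 -> squares: 0, 4, 9 -> mean: 13 / 3
loss = mse(preds, targets)
print(loss)
```

Squaring makes every error positive and punishes large misses more than small ones.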
When we fit our model, we are trying to estimate m. We'll call our estimate a.
For example, say a = (0.5, 0.75).
- Make a prediction
- Calculate the error
a = tensor(.5, .75)
Make prediction
y_pred = x@a
Calculate error
mse(y_pred, y)
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],y_pred)
The model doesn't seem to quite fit. What's next? Optimization.
a = nn.Parameter(a); a
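Next let's create an update function. It computes the loss for the current a, backpropagates to get the gradient of the loss with respect to a, and then moves a a small step in the direction that lowers the loss.
We'll print out the loss every 10 iterations to see how we are doing.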
def update():
    y_hat = x@a
    loss = mse(y_hat, y)
    if i % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
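- torch.no_grad: Turns off gradient tracking inside the block, so the weight update itself is not recorded for backpropagation
- sub_: Subtracts some value (lr * our gradient) in place
- grad.zero_: Zeros our gradients in place, because backward() adds to whatever gradient is already stored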
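The three calls listed above can be tried in isolation. This is a small sketch on a made-up two-element tensor `w` (not the post's `a`), assuming plain PyTorch. It shows that gradients accumulate across `backward()` calls until they are zeroed, and that the in-place step inside `torch.no_grad()` changes the values directly.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w**2) = 2*w
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(w ** 2).sum().backward()
accumulated = w.grad.clone()

with torch.no_grad():
    w.sub_(0.1 * w.grad)   # in-place step, not tracked by autograd
    w.grad.zero_()         # reset so the next backward() starts fresh

print(first, accumulated, w)
```

Skipping `grad.zero_()` in a training loop would make every step use the sum of all past gradients, which is rarely what we want.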
lr = 1e-1
for i in range(100): update()
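To see what the loop is doing without autograd, here is a rough plain-Python sketch that writes the gradient of the mean squared error by hand. Everything in it is an assumption made for illustration: the helper names (`predict`, `mse_grad`, `gd_step`), the evenly spaced noise-free data generated from slope 3 and intercept 2, and the 500 steps. Like the loop above, it uses all of the data on every step.

```python
def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def predict(xs, a):
    # a[0] is the slope, a[1] is the intercept
    return [a[0] * x + a[1] for x in xs]

def mse_grad(xs, ys, a):
    # d(loss)/d(slope)     = 2/n * sum(error * x)
    # d(loss)/d(intercept) = 2/n * sum(error)
    n = len(xs)
    errs = [p - t for p, t in zip(predict(xs, a), ys)]
    g_slope = 2 / n * sum(e * x for e, x in zip(errs, xs))
    g_bias = 2 / n * sum(errs)
    return g_slope, g_bias

def gd_step(xs, ys, a, lr):
    # move each parameter a small step against its gradient
    g = mse_grad(xs, ys, a)
    return (a[0] - lr * g[0], a[1] - lr * g[1])

n = 100
xs = [-1 + 2 * i / (n - 1) for i in range(n)]   # evenly spaced in [-1, 1]
ys = [3 * x + 2 for x in xs]                    # noise-free targets

a = (0.5, 0.75)
start_loss = mse(predict(xs, a), ys)
for _ in range(500):
    a = gd_step(xs, ys, a, lr=0.1)
end_loss = mse(predict(xs, a), ys)
print(start_loss, end_loss, a)
```

The two lines in `mse_grad` are what `loss.backward()` works out for us automatically in the PyTorch version.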
Now let's see how this new a compares.
- detach returns a copy of the tensor that is cut off from gradient tracking, so it can be plotted
plt.scatter(x[:,0],y)
plt.scatter(x[:,0], (x@a).detach())
plt.scatter(x[:,0],y_pred)
We fit our line much better here.
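The detach call in the plotting code can be looked at on its own. This is a minimal sketch with a made-up 4-row input, assuming plain PyTorch. The result of `x@a` still tracks gradients because a is a parameter, and converting such a tensor to a NumPy array raises an error, which is why we detach before plotting.

```python
import torch

a = torch.nn.Parameter(torch.tensor([0.5, 0.75]))
x = torch.ones(4, 2)

pred = x @ a              # still attached to the autograd graph
plain = pred.detach()     # same values, no graph attached

print(pred.requires_grad, plain.requires_grad)
print(plain)
```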
from matplotlib import animation, rc
rc('animation', html='jshtml')
a = nn.Parameter(tensor(0.5, 0.75)); a
def animate(i):
    update()
    line.set_ydata((x@a).detach())
    return line,
fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], (x@a).detach())
plt.close()
animation.FuncAnimation(fig, animate, np.arange(0,100), interval=20)
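Ideally we split the data into batches, fit on one batch at a time, and work through all of those batches (else we'd run out of memory on large datasets!). Updating on a random batch instead of the whole dataset is where the "stochastic" in SGD comes from.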
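Splitting into batches can be sketched in a few lines of plain Python. The helper name `minibatches`, the batch size of 32, and the fixed seed are all choices made here for illustration; in practice a data loader from the framework does this job.

```python
import random

def minibatches(n_items, batch_size, seed=0):
    # shuffle the row indices once, then hand them out in chunks
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    for start in range(0, n_items, batch_size):
        yield idxs[start:start + batch_size]

batches = list(minibatches(100, 32))
sizes = [len(batch) for batch in batches]
print(sizes)
```

Each list of indices would select the rows of x and y for one update step, so only one batch needs to be in memory at a time. The last batch is smaller when the data does not divide evenly.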
If this were a classification problem, we would want to use Cross Entropy Loss, which heavily penalizes confident incorrect predictions and also penalizes correct but unconfident predictions. It's also called negative log likelihood.
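The penalty pattern can be seen with a few numbers. This sketch assumes the per-example form of the loss, the negative log of the probability the model gave to the correct class. The helper name `nll` and the probabilities are made up for illustration.

```python
import math

def nll(prob_of_true_class):
    # loss for one example: negative log of the probability
    # the model assigned to the correct class
    return -math.log(prob_of_true_class)

confident_correct   = nll(0.99)
unconfident_correct = nll(0.60)
confident_wrong     = nll(0.01)   # only 1% on the true class

print(confident_correct, unconfident_correct, confident_wrong)
```

The loss grows without bound as the probability on the true class approaches zero, which is what makes confident mistakes so expensive.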