Cross-Validation with fastai v2
This notebook walks through training convolutional neural networks with k-fold cross-validation on a subsection of the MAMMOSET dataset.
from fastai2.vision.all import *
import random  # used below to shuffle the training file list
path = Path('./DDSM_NOBARS/'); path.ls()
batch_tfms = [IntToFloatTensor(),
              *aug_transforms(size=224, max_warp=0, min_scale=0.75),
              Normalize.from_stats(*imagenet_stats)]
item_tfms = [ToTensor(), Resize(460)]
bs=64
train_imgs = get_image_files(path/'train')
tst_imgs = get_image_files(path/'test')
random.shuffle(train_imgs)
len(train_imgs)
Here we do an 80/20 train/validation split.
start_val = len(train_imgs) - int(len(train_imgs)*.2) # last 20% validation
idxs = list(range(start_val, len(train_imgs)))
splits = IndexSplitter(idxs)
split = splits(train_imgs)
split_list = [split[0], split[1]]
split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
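To make the three-subset indexing concrete, here is a toy sketch in plain Python (illustrative sizes, not the real dataset counts) of how the train/valid/test index lists line up when the test images are appended after the training images:

```python
# Toy sketch of the three-way split used above (sizes are made up).
n_train, n_test = 10, 4                    # pretend 10 training files, 4 test files
start_val = n_train - int(n_train * .2)    # last 20% of training -> validation

train_idxs = list(range(0, start_val))               # subset 0: training
valid_idxs = list(range(start_val, n_train))         # subset 1: validation
test_idxs  = list(range(n_train, n_train + n_test))  # subset 2: test, appended last

split_list = [train_idxs, valid_idxs, test_idxs]
print([len(s) for s in split_list])  # -> [8, 2, 4]
```

Because the test file list is concatenated after the training files, its indices always start at `len(train_imgs)`, which is exactly what the `L(range(...))` expression above produces.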
We have 992 training images, 247 validation images and 142 test images
split_list
dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                splits=split_list)
show_at(dsrc.train, 3)
dsrc.n_subsets
dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
dls.show_batch()
learn = cnn_learner(dls, resnet34, pretrained=True, metrics=accuracy).to_fp16()
learn.fit_one_cycle(5)
The baseline model reached an accuracy of ~61.27% on the test set.
learn.validate(ds_idx=2)[1]
from sklearn.model_selection import StratifiedKFold
train_labels = L()
for i in range(len(dsrc.train)):
    train_labels.append(dsrc.train[i][1])
for i in range(len(dsrc.valid)):
    train_labels.append(dsrc.valid[i][1])
train_labels
# Do not reshuffle train_imgs here: train_labels was built in the same order
# as train_imgs, so shuffling now would misalign images and labels.
# StratifiedKFold(shuffle=True) below already randomizes the fold assignment.
Our training loop repeats the process from the DDSM Mass Dataset and splits section above: StratifiedKFold partitions the training set into 10 folds, we train a model on each fold, and once training is done we average the test-set predictions across all 10 models.
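Before the full loop, it helps to see what `StratifiedKFold` actually yields. This toy example (made-up labels, not our dataset) shows that each call to `split` produces `(train_idx, val_idx)` index arrays, and that stratification keeps the class proportions of each validation fold:

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Toy data: 5 samples of class 0 and 5 of class 1, split into 5 folds.
X = np.arange(10).reshape(-1, 1)
y = np.array([0]*5 + [1]*5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Stratification: every validation fold holds one example of each class.
    print(sorted(y[val_idx]))  # -> [0, 1] on every fold
```

In our loop we only keep `val_idx` (the `_, val_idx` unpacking) because `IndexSplitter` just needs the validation indices; everything not in `val_idx` becomes the training split.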
val_pct = []
tst_preds = []
skf = StratifiedKFold(n_splits=10, shuffle=True)
for _, val_idx in skf.split(np.array(train_imgs), train_labels):
    splits = IndexSplitter(val_idx)
    split = splits(train_imgs)
    split_list = [split[0], split[1]]
    split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
    dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                    splits=split_list)
    dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
    learn = cnn_learner(dls, resnet34, pretrained=True, metrics=accuracy).to_fp16()
    learn.fit_one_cycle(5)
    val_pct.append(learn.validate()[1])
    a,b = learn.get_preds(ds_idx=2)
    tst_preds.append(a)
How do we combine all our predictions? We sum the per-fold prediction tensors element-wise, then divide by the number of folds.
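As a toy illustration of that averaging (NumPy, with made-up class probabilities rather than our real fold outputs), note that the ensemble's argmax can differ from any single fold's:

```python
import numpy as np

# Made-up per-fold class probabilities for two test images, three folds.
fold_preds = [
    np.array([[0.6, 0.4], [0.2, 0.8]]),  # fold 0 alone picks class 0 for image 0
    np.array([[0.3, 0.7], [0.1, 0.9]]),
    np.array([[0.4, 0.6], [0.3, 0.7]]),
]

hat = fold_preds[0].copy()        # copy so += doesn't mutate fold_preds[0]
for pred in fold_preds[1:]:
    hat += pred
hat /= len(fold_preds)            # element-wise mean over folds

print(hat.argmax(axis=1))         # -> [1 1]: the ensemble picks class 1 for both
```

The same mean is applied below to the real `tst_preds` tensors; `accuracy` then takes the argmax of the averaged probabilities.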
tst_preds_copy = tst_preds.copy()
accuracy(tst_preds_copy[0], b)
First, let's look at each fold's individual accuracy on the test set:
for i in tst_preds_copy:
    print(accuracy(i, b))
hat = tst_preds[0].clone()  # clone so the in-place += below doesn't mutate tst_preds[0]
for pred in tst_preds[1:]:
    hat += pred
hat[:5]
len(hat)
hat /= len(tst_preds)
This accuracy, ~65.49%, comes from averaging the test-set predictions of all the models trained on the splits.
accuracy(hat, b)
This is the process we'll be using to conduct our experiments on several CNN models.