# How to write a decent training loop with enough flexibility.

Posted on Sat 15 June 2019 in Posts

In this post, I briefly describe my experience in setting up training with PyTorch.

## Introduction

PyTorch is an extremely useful and convenient framework for deep learning. When it comes to working on a deep learning project, I am more comfortable with PyTorch than with TensorFlow.

In this quick post, I would like to show how one can go about building a custom training loop, something that I struggled with when I was getting started. Being able to build the training loop on your own is a useful skill, because it helps you understand what happens under the hood of a deep learning package that abstracts a lot of nuts and bolts away from the end user.

## The Overview of Training

When training a network, we need to follow a certain paradigm.

First, set the model into training mode.

Second, start iterating through the training set.

For every batch we must:

• compute the output of the network
• compute the loss
• backpropagate the loss to get the gradients
• take a descent step with the optimizer

In this last step we acknowledge that our model is optimized via gradient descent. The optimizer can be either Adam, SGD, or any other variant (RAdam is a brand-new one that seems to beat the state of the art).
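Whatever optimizer you pick, the core of the descent step is the same update rule. Here is a minimal sketch of plain SGD in pure Python; the parameter and gradient values are made up for illustration:

```python
# Plain-SGD update rule: each parameter moves against its gradient.
def sgd_step(params, grads, lr=0.1):
    # w <- w - lr * dL/dw, applied elementwise
    return [w - lr * g for w, g in zip(params, grads)]

params = [1.0, -2.0]
grads = [0.5, -1.0]
params = sgd_step(params, grads)  # roughly [0.95, -1.9]
```

Optimizers like Adam and RAdam replace the raw gradient `g` with a running, rescaled statistic of past gradients, but the shape of the update is the same.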

In code, we can put it in the form like this:

```python
def train(epoch):
    model.train()  # preparing model for training
    for batch in training_set:
        x, y = batch  # unpack the batch
        # clear old gradients so the update reflects only the current batch
        optimizer.zero_grad()
        # compute the output
        output = model(x.cuda())
        # calculate the loss function
        loss = criterion(output, y.cuda())
        # calculate the gradient using backpropagation
        loss.backward()
        # take a step with the optimizer
        optimizer.step()
```
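To see the loop end to end, here is a toy, CPU-only sketch with all the surrounding pieces filled in. The model, data, and hyperparameters are all stand-ins I made up for the example (and the `.cuda()` calls are dropped so it runs anywhere):

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(3, 1)                      # toy model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# synthetic regression data, split into batches of 8
x_all = torch.randn(64, 3)
y_all = x_all @ torch.tensor([[1.0], [-2.0], [0.5]])
training_set = [(x_all[i:i + 8], y_all[i:i + 8]) for i in range(0, 64, 8)]

def train(epoch):
    model.train()                            # training mode
    for x, y in training_set:
        optimizer.zero_grad()                # clear old gradients
        output = model(x)                    # forward pass
        loss = criterion(output, y)          # compute the loss
        loss.backward()                      # backpropagate
        optimizer.step()                     # descent step
    return loss.item()                       # last batch's loss

first = train(0)
last = first
for epoch in range(1, 10):
    last = train(epoch)
```

After a few epochs the loss drops well below its starting value, which is a quick sanity check that the loop is wired up correctly.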

### A Trick for Better Training with Lower Memory

A small batch can result in a small gradient. This, in turn, can lead to a vanishing-gradient problem: the value is so small that the computer simply treats it as zero (underflow). To avoid it, there is a trick of accumulating the gradient over several batches as you iterate through the dataset, which also simulates a larger batch without the extra memory cost. I saw a practical implementation in this discussion.
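The arithmetic behind the trick: for a loss that sums over examples, the gradients of several small batches add up to exactly the gradient of the combined batch, so delaying the optimizer step changes nothing mathematically. A plain-Python sketch with a single scalar weight (all values made up):

```python
# Scalar toy example: the loss is 0.5 * (w*x - y)^2 summed over a batch.
def grad(w, batch):
    # dL/dw summed over the batch's examples
    return sum((w * x - y) * x for x, y in batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

one_big_batch = grad(w, data)
two_small_batches = grad(w, data[:2]) + grad(w, data[2:])
# the two sums are identical, so the delayed step sees the same gradient
```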

```python
def train_accumulate(epoch, accumulation_step=1):
    model.train()  # preparing model for training
    optimizer.zero_grad()
    for idx, batch in enumerate(training_set):
        x, y = batch  # unpack the batch
        # compute the output
        output = model(x.cuda())
        # calculate the loss, averaged over the accumulation window
        loss = criterion(output, y.cuda()) / accumulation_step
        # calculate (and accumulate) the gradient using backpropagation
        loss.backward()
        if (idx + 1) % accumulation_step == 0:
            # take a step with the optimizer once per accumulation_step batches
            optimizer.step()
            # clear the accumulated gradients for the next window
            optimizer.zero_grad()
```