Demystifying PyTorch: Understanding interaction between various PyTorch abstractions.

Ritesh Agrawal
Dec 11, 2020

PyTorch is one of the most widely used deep learning libraries, but it is also one of the harder ones to understand because of the many side effects that one object can have on another. For instance, calling the “step” method of an optimizer updates the module object’s parameters. While trying to better understand PyTorch objects and how they interact with each other, I found this Coursera course very helpful. It shows how different abstractions (such as Dataset, DataLoader, Module, and optim) interact with each other. This post has a similar motivation.

I start with an explicit implementation of linear regression using minimal PyTorch abstractions. After that, each iteration introduces a new PyTorch abstraction that simplifies the code and adds flexibility.
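To make the kind of side effect mentioned above concrete, here is a minimal sketch. The tensor `w` and the toy loss are made up for illustration and are not part of the versions that follow. The optimizer holds a reference to `w`, so `backward()` fills in `w.grad` and `step()` changes `w` in place even though neither call takes `w` as an argument.

```python
import torch

# A single learnable tensor standing in for a model parameter (illustrative values).
w = torch.tensor([1.0, 2.0], requires_grad=True)

# The optimizer keeps a reference to w, not a copy.
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # d(loss)/dw = 2 * w
loss.backward()         # side effect: w.grad now holds [2.0, 4.0]

before = w.detach().clone()
optimizer.step()        # side effect: w <- w - lr * w.grad, done in place
after = w.detach().clone()

print(before, after)    # w moved from [1.0, 2.0] to [0.8, 1.6]
```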

Version 1: Implementing Linear Regression The Hard Way

import numpy as np
import torch

# fix seed
np.random.seed(10)
torch.manual_seed(10)

# Randomly generate 1000 x values and compute Y as -1 + 3 * X . The goal
# is to learn parameters -1 (bias) and +3 (w1). We will fold bias in X
# to help simplify parameter updates

samples=1000

X = torch.hstack([
    # Insert ones to simplify computing bias
    torch.ones(samples, requires_grad=False).view(-1, 1),
    # Randomly generate X
    torch.randn(samples, requires_grad=False).view(-1, 1)
])  # generates a 2D tensor of shape (1000, 2)

# Compute Y. The goal is to learn the bias (-1) and the coefficient for X, i.e. 3
Y = -1 * X[:,0] + 3 * X[:,1]  # generates a vector of shape (1000,)

# Helper Array: Generate index ids. We will shuffle this and split it to generate random batches
idx = np.arange(X.size()[0])

# Generate parameter tensor. Note to set requires_grad to True over here
w = torch.randn(X.size()[1], requires_grad=True)

# Prediction function
def forward(w, x):
    return (w * x).sum(axis=1)

# we are using mean square error as the cost function.
def mse(y, yhat):
    return torch.mean((y - yhat) ** 2)

# learning rate
lr = 0.01

# run 100 epochs through the data
for epoch in range(100):

    # Randomize indexes and split into 10 batches of size 100.
    # (np.split takes the number of sections, not the batch size.)
    np.random.shuffle(idx)
    for batch in np.split(idx, 10):
        curX = X[batch, :]
        curY = Y[batch]

        # compute predicted value
        yhat = forward(w, curX)

        # compute cost
        cost = mse(curY, yhat)

        # compute gradient -- this will update grad variable of w tensor
        cost.backward()

        # update parameters
        w.data = w.data - lr * w.grad.data

        # reset grad to zero for w.
        w.grad.data.zero_()

# detach() returns a tensor cut off from the autograd graph; numpy() cannot
# be called directly on a tensor that requires grad
print(w.detach().numpy())
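The reason for resetting the gradient after every update is that backward() adds to whatever is already stored in the grad attribute instead of overwriting it. A small sketch with made-up values (not taken from the training data above) shows the accumulation:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

# The gradient of sum(w * x) with respect to w is simply x.
(w * x).sum().backward()
first = w.grad.clone()          # [3.0, 4.0]

# A second backward pass adds to the existing gradient.
(w * x).sum().backward()
accumulated = w.grad.clone()    # [6.0, 8.0]

# This is what the training loop does after each parameter update.
w.grad.zero_()
reset = w.grad.clone()          # [0.0, 0.0]

print(first, accumulated, reset)
```

Without the reset, every batch would be updated with the sum of all gradients seen so far rather than the gradient of the current batch.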

Version 2: Encapsulating data management Using Dataset and DataLoader

PyTorch provides the Dataset abstraction to hide how data is stored and accessed, which gives better encapsulation of the data. It also provides DataLoader to split a dataset into batches. We will use both to hide how our data is stored and organized. Below is the second version of the code. Using Dataset and DataLoader makes some of the data-management code from version 1 unnecessary; those lines are commented out in the code below.

import numpy as np
import torch

# fix seed
np.random.seed(10)
torch.manual_seed(10)

# Randomly generate 1000 x values and compute Y as -1 + 3 * X . The goal
# is to learn parameters -1 (bias) and +3 (w1). We will fold bias in X
# to help simplify parameter updates

samples=1000

class RandomData(torch.utils.data.Dataset):

    def __init__(self, samples):
        self.X = torch.hstack([
            # Insert ones to simplify computing bias
            torch.ones(samples, requires_grad=False).view(-1, 1),
            # Randomly generate X
            torch.randn(samples, requires_grad=False).view(-1, 1)
        ])

        # Compute Y. The goal is to learn the bias (-1) and the coefficient for X, i.e. 3
        self.Y = -1 * self.X[:,0] + 3 * self.X[:,1]

    def __getitem__(self, index):
        return (self.X[index, :], self.Y[index])

    def __len__(self):
        return self.X.size()[0]


data = RandomData(samples)
dataloader = torch.utils.data.DataLoader(dataset=data, batch_size=100, shuffle=True)

# Helper Array: Generate index ids. We will shuffle this and split it to generate random batches
# idx = np.arange(X.size()[0])

# Generate parameter tensor. Note to set requires_grad to True over here.
# X now lives inside the dataset, so read the number of columns from data.X
w = torch.randn(data.X.size()[1], requires_grad=True)

# Prediction function
def forward(w, x):
    return (w * x).sum(axis=1)

# we are using mean square error as the cost function.
def mse(y, yhat):
    return torch.mean((y - yhat) ** 2)

# learning rate
lr = 0.01

# run 100 epochs through the data
for epoch in range(100):

    # Randomize indexes and split into batch size of 100.
    # np.random.shuffle(idx)
    # for batch in np.split(idx, 10):
    #     curX = X[batch, :]
    #     curY = Y[batch]
    for (curX, curY) in dataloader:

        # compute predicted value
        yhat = forward(w, curX)

        # compute cost
        cost = mse(curY, yhat)

        # compute gradient -- this will update grad variable of w tensor
        cost.backward()

        # update parameters
        w.data = w.data - lr * w.grad.data

        # reset grad to zero for w.
        w.grad.data.zero_()

# detach() returns a tensor cut off from the autograd graph; numpy() cannot
# be called directly on a tensor that requires grad
print(w.detach().numpy())
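To see what DataLoader actually hands back, it helps to look at the shapes of the batches it produces. The sketch below uses a tiny hypothetical dataset, ToyData, which is not part of the code above. It implements the same two methods as RandomData, and the loader calls them to assemble batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyData(Dataset):
    """Hypothetical 10-row dataset, used only to inspect batch shapes."""

    def __init__(self, n):
        self.X = torch.arange(n, dtype=torch.float32).view(-1, 1)
        self.Y = 2 * self.X[:, 0]

    def __getitem__(self, index):
        # Called once per sample; the loader stacks the results into a batch.
        return self.X[index], self.Y[index]

    def __len__(self):
        # Tells the loader how many samples exist.
        return self.X.shape[0]

loader = DataLoader(ToyData(10), batch_size=4, shuffle=False)

shapes = [(tuple(bx.shape), tuple(by.shape)) for bx, by in loader]
print(shapes)  # [((4, 1), (4,)), ((4, 1), (4,)), ((2, 1), (2,))]
```

The last batch is smaller because 10 is not a multiple of 4. In the code above, 1000 samples with a batch size of 100 divide evenly into 10 batches.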

Version 3: Encapsulating Model using nn.Module class

Version 2 encapsulated data management using Dataset and DataLoader. In Version 3 we use the nn.Module class to encapsulate everything related to the model. That covers two things: the coefficient parameter (w) and the forward function that computes predictions for a given batch of data points.

import numpy as np
import torch

# fix seed
np.random.seed(10)
torch.manual_seed(10)

# Randomly generate 1000 x values and compute Y as -1 + 3 * X . The goal
# is to learn parameters -1 (bias) and +3 (w1). We will fold bias in X
# to help simplify parameter updates

samples=1000

class RandomData(torch.utils.data.Dataset):

    def __init__(self, samples):
        self.X = torch.hstack([
            # Insert ones to simplify computing bias
            torch.ones(samples, requires_grad=False).view(-1, 1),
            # Randomly generate X
            torch.randn(samples, requires_grad=False).view(-1, 1)
        ])

        # Compute Y. The goal is to learn the bias (-1) and the coefficient for X, i.e. 3
        self.Y = -1 * self.X[:,0] + 3 * self.X[:,1]

    def __getitem__(self, index):
        return (self.X[index, :], self.Y[index])

    def __len__(self):
        return self.X.shape[0]

data = RandomData(samples)
dataloader = torch.utils.data.DataLoader(dataset=data, batch_size=100, shuffle=True)

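class CustomLinearModel(torch.nn.Module):

    def __init__(self, num_parameters):
        # Initialize the nn.Module machinery before assigning attributes
        super().__init__()
        # Generate parameter tensor. Note to set requires_grad to True over here.
        # This is still a plain tensor, so the module does not track it as a parameter.
        self.w = torch.randn(num_parameters, requires_grad=True)

    def forward(self, x):
        return (self.w * x).sum(axis=1)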

model = CustomLinearModel(2)

# Helper Array: Generate index ids. We will shuffle this and split it to generate random batches
# idx = np.arange(X.size()[0]) # NOT RELEVANT

# learning rate
lr = 0.01

# we are using mean square error as the cost function.
def mse(y, yhat):
    return torch.mean((y - yhat) ** 2)


# run 100 epochs through the data
for epoch in range(100):

    # NOT RELEVANT -- Randomize indexes and split into batch size of 100.
    # np.random.shuffle(idx) -- not required anymore
    # for batch in np.split(idx, 10): # not required anymore

    for (curX, curY) in dataloader:
        # curX = X[batch, :] # NOT RELEVANT
        # curY = Y[batch] # NOT RELEVANT

        # compute predicted value
        yhat = model.forward(curX)

        # compute cost
        cost = mse(curY, yhat)

        # compute gradient -- this will update grad variable of w tensor
        cost.backward()

        # update parameters
        model.w.data = model.w.data - lr * model.w.grad.data

        # reset grad to zero for w.
        model.w.grad.data.zero_()

# detach() returns a tensor cut off from the autograd graph; numpy() cannot
# be called directly on a tensor that requires grad
print(model.w.detach().numpy())

Version 4: Using Optimizer

Above, we update the parameters (w) manually, which limits us to plain gradient descent. There are many other variants of gradient descent, such as momentum and Adam. PyTorch provides these variants as part of the “optim” module. To use this module, we also need to make a minor change to our “CustomLinearModel” class: the “w” tensor must be wrapped in “nn.Parameter” so that model.parameters() returns it (see the __init__ method below).
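Why the wrapper matters is easier to see by checking what parameters() returns in each case. The two classes below are hypothetical stand-ins for the Version 3 and Version 4 models, stripped down to the one attribute that differs.

```python
import torch

class PlainTensorModel(torch.nn.Module):
    """Hypothetical stand-in for the Version 3 style: w is a plain tensor."""

    def __init__(self):
        super().__init__()
        self.w = torch.randn(2, requires_grad=True)   # not registered with the module

class ParameterModel(torch.nn.Module):
    """Hypothetical stand-in for the Version 4 style: w is an nn.Parameter."""

    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(2))   # registered under the name "w"

plain_count = len(list(PlainTensorModel().parameters()))
param_names = [name for name, _ in ParameterModel().named_parameters()]

print(plain_count)   # 0
print(param_names)   # ['w']
```

Since an optimizer only updates the tensors it is given, a model whose parameters() is empty leaves it with nothing to update.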

import numpy as np
import torch

# fix seed
np.random.seed(10)
torch.manual_seed(10)

# Randomly generate 1000 x values and compute Y as -1 + 3 * X . The goal
# is to learn parameters -1 (bias) and +3 (w1). We will fold bias in X
# to help simplify parameter updates

samples=1000

class RandomData(torch.utils.data.Dataset):

    def __init__(self, samples):
        self.X = torch.hstack([
            # Insert ones to simplify computing bias
            torch.ones(samples, requires_grad=False).view(-1, 1),
            # Randomly generate X
            torch.randn(samples, requires_grad=False).view(-1, 1)
        ])

        # Compute Y. The goal is to learn the bias (-1) and the coefficient for X, i.e. 3
        self.Y = -1 * self.X[:,0] + 3 * self.X[:,1]

    def __getitem__(self, index):
        return (self.X[index, :], self.Y[index])

    def __len__(self):
        return self.X.shape[0]

data = RandomData(samples)
dataloader = torch.utils.data.DataLoader(dataset=data, batch_size=100, shuffle=True)

class CustomLinearModel(torch.nn.Module):

    def __init__(self, num_parameters):
        super(CustomLinearModel, self).__init__()
        # Generate parameter tensor and wrap it in nn.Parameter so that the module
        # registers it and model.parameters() returns it
        self.w = torch.nn.Parameter(torch.randn(num_parameters, requires_grad=True))

    def forward(self, x):
        return (self.w * x).sum(axis=1)

model = CustomLinearModel(2)

# Helper Array: Generate index ids. We will shuffle this and split it to generate random batches
# idx = np.arange(X.size()[0]) # NOT RELEVANT

# learning rate
# lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# we are using mean square error as the cost function.
def mse(y, yhat):
    return torch.mean((y - yhat) ** 2)



# run 100 epochs through the data
for epoch in range(100):

    # NOT RELEVANT -- Randomize indexes and split into batch size of 100.
    # np.random.shuffle(idx) -- not required anymore
    # for batch in np.split(idx, 10): # not required anymore

    for (curX, curY) in dataloader:
        # curX = X[batch, :] # NOT RELEVANT
        # curY = Y[batch] # NOT RELEVANT

        # compute predicted value (calling the module invokes its forward method)
        yhat = model(curX)

        # compute cost
        cost = mse(curY, yhat)

        # compute gradient -- this will update grad variable of w tensor
        cost.backward()

        # update parameters
        # model.w.data = model.w.data - lr * model.w.grad.data
        optimizer.step() # this will update model parameters

        # reset grad to zero for w.
        # model.w.grad.data.zero_()
        optimizer.zero_grad() # reset gradients

# detach() returns a tensor cut off from the autograd graph; numpy() cannot
# be called directly on a tensor that requires grad
print(model.w.detach().numpy())
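Once the update rule lives behind the optimizer, trying a different variant of gradient descent is a one-line change. The sketch below is only an illustration under a few assumptions of mine: the helper name fit, full-batch updates instead of the DataLoader, 500 steps, and a learning rate of 0.1 are all choices made to keep the example short. It fits the same y = -1 + 3x problem with SGD plus momentum and with Adam.

```python
import torch

torch.manual_seed(0)

# Same synthetic problem as in the post: y = -1 + 3 * x, with the bias folded into X.
X = torch.hstack([torch.ones(1000, 1), torch.randn(1000, 1)])
Y = -1 * X[:, 0] + 3 * X[:, 1]

def fit(make_optimizer, steps=500):
    """Hypothetical helper: full-batch training with whichever optimizer is passed in."""
    w = torch.nn.Parameter(torch.randn(2))
    optimizer = make_optimizer([w])
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.mean((Y - (w * X).sum(dim=1)) ** 2)
        loss.backward()
        optimizer.step()
    return w.detach()

# Only the optimizer construction differs between the two runs.
w_momentum = fit(lambda params: torch.optim.SGD(params, lr=0.1, momentum=0.9))
w_adam = fit(lambda params: torch.optim.Adam(params, lr=0.1))

print(w_momentum, w_adam)  # both should end up close to [-1, 3]
```

The training loop itself never changes. Only the object that turns gradients into parameter updates is swapped.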

What’s next

There are other abstractions that you can use. For instance, PyTorch already implements most of the common loss functions, so we don’t need to implement the “mse” function above; we can use torch.nn.MSELoss instead. We also don’t need to manage the parameters ourselves: many models are already implemented for you, and a custom module class can build upon them. Check out the list of already implemented models over here.
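As a rough sketch of where this leads, the version below swaps the hand-written pieces for built-in ones. The choices of torch.nn.Linear and TensorDataset are mine, not something the versions above use. Because nn.Linear owns both a weight and a bias, the column of ones is no longer needed.

```python
import torch

torch.manual_seed(10)

# Same target as before: y = -1 + 3 * x, but without folding the bias into X.
x = torch.randn(1000, 1)
y = -1 + 3 * x                     # shape (1000, 1), matching the model output

dataset = torch.utils.data.TensorDataset(x, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=100, shuffle=True)

model = torch.nn.Linear(in_features=1, out_features=1)   # holds weight and bias
loss_fn = torch.nn.MSELoss()                              # replaces the mse function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    for cur_x, cur_y in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(cur_x), cur_y)
        loss.backward()
        optimizer.step()

# The learned values should be close to -1 (bias) and 3 (weight).
print(model.bias.item(), model.weight.item())
```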

Originally published at http://ragrawal.wordpress.com on December 11, 2020.

Ritesh Agrawal

Senior Machine Learning Engineer, Varo Money; Contributor and Maintainer of sklearn-pandas library