Data

Outline: Working with Data in PyTorch for Deep Learning

Analyze Data

Understanding the structure and content of your data is paramount to building good data-driven models.

import pandas as pd

df = pd.read_pickle("_dlcourse/code/RegressionData.pkl") #This could be any file format

pd.set_option('display.colheader_justify', 'center') #pretty print option
pd.set_option('display.precision',3) #pretty print option

print(f"Total number of samples: {len(df)}")
print("Top 5 rows:")
print(df.head())
Total number of samples: 100
Top 5 rows:
     A      B      C      D      E      F    Target_0
0  0.730  5.540  2.244 -1.601 -1.691  0.625   30.720 
1  0.228 -8.955  2.642  7.802 -0.855 -0.304  -54.216 
2 -0.835 -1.549  2.271  0.314  0.071 -5.181  -27.621 
3  0.628 -0.536 -0.007  5.097  1.917 -2.232  -23.083 
4  0.474  7.877 -0.879  2.787  1.869 -5.179    3.191 

In this example we have 6 inputs (A through F) and 1 output (Target_0), all numeric data. While not required, it is useful to create a torch Dataset object to manage it. My preferred data flow is shown below, but it is up to you how to structure and handle your own projects.

flowchart LR
  A((Data File)) --> B(Dataset)
  B --> C((Prepared <br/>Tensors))
  C --> D(Dataloader)
  D --> E((Batches))
  E --> F(Model)
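To make that flowchart concrete before we build the custom pieces, here is a minimal sketch of the same pipeline using only torch built-ins (TensorDataset stands in for the custom Dataset we will write below, and the random tensors stand in for real file contents):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Illustrative stand-ins for the file contents: 100 samples, 6 inputs, 1 target
inputs = torch.randn(100, 6)
targets = torch.randn(100, 1)

dataset = TensorDataset(inputs, targets)     # Dataset holding the prepared tensors
loader = DataLoader(dataset, batch_size=10)  # Dataloader that serves batches

for batch_inputs, batch_targets in loader:   # Batches -> Model
    print(batch_inputs.shape, batch_targets.shape)  # torch.Size([10, 6]) torch.Size([10, 1])
    break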

Create a Dataset

Whenever you want to create a custom PyTorch Dataset, there are three requirements: inherit from the Dataset class, define a __len__ method, and define a __getitem__ method.

from torch.utils.data import Dataset

class SomeDataset(Dataset):
    def __init__(self, ...):
        super().__init__()  # calls the init function of the parent Dataset class
        ...

    def __len__(self):
        ...

    def __getitem__(self, index):
        ...

You can include any other functions and pass any number of arguments to __init__, but these three requirements must be met.
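As a concrete (if trivial) illustration of those three requirements, here is a minimal Dataset that just wraps a tensor; the class name is made up for this sketch:

import torch
from torch.utils.data import Dataset

class TensorWrapperDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, data:torch.Tensor):
        super().__init__()
        self._data = data

    def __len__(self):
        return len(self._data)  # number of samples along axis 0

    def __getitem__(self, index):
        return self._data[index]

ds = TensorWrapperDataset(torch.arange(10).float())
print(len(ds))  # 10
print(ds[3])    # tensor(3.)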

For this example we are going to build the RegressionDataset class, walking through it piece by piece.

import torch

class RegressionDataset(Dataset):
    def __init__(self,
                 data_file:str,
                 normalize:bool=False,
                 ):
        super().__init__()

Here we are creating the class RegressionDataset, a subclass of torch.utils.data.Dataset. It takes the argument data_file, the path to the regression data, and optionally normalize, which will scale the data to the range 0 to 1.

df_data:pd.DataFrame = pd.read_pickle(data_file) #opens base PKL file
target_labels = [c for c in df_data.columns if c.startswith('Target')] #Finds all target columns
df_inputs = df_data.drop(target_labels,axis=1) # Creates DF of only inputs
df_targets = df_data[target_labels] # Creates DF of only targets

Here we have loaded all the data and split up the inputs and the targets.

self._tensor_inputs = torch.tensor(df_inputs.values).float()
self._tensor_targets = torch.tensor(df_targets.values).view(-1,len(target_labels)).float()

Here is an easy pitfall in data management. This chunk converts the data into tensors and reshapes them to the appropriate size.

df.values → converts the DataFrame into a NumPy array
.float() → converts the Double (float64) tensor into a Float (float32) tensor
.view(-1,len(target_labels)) → reshapes the 1D target tensor into a 2D tensor with one column per target
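You can see each conversion on a toy column (the Series here is just a stand-in for a single target column):

import pandas as pd
import torch

s = pd.Series([1.0, 2.0, 3.0])  # toy stand-in for one target column

t = torch.tensor(s.values)  # numpy float64 -> Double tensor, shape torch.Size([3])
t = t.float()               # now torch.float32
t = t.view(-1, 1)           # now 2D: torch.Size([3, 1]), one row per sample
print(t.dtype, t.shape)     # torch.float32 torch.Size([3, 1])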

One thing we can check is the documentation of the layer we are going to be using, Linear. This informs the structure of the data that needs to be fed into the network. We see that the layer takes inputs of shape (*, H_in), so all of the "action" will be taking place on the last dimension of your inputs.
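A quick shape check against nn.Linear makes this concrete:

import torch
import torch.nn as nn

layer = nn.Linear(in_features=6, out_features=1)  # H_in = 6, H_out = 1

x = torch.randn(10, 6)  # (*, H_in): here a batch of 10 samples
y = layer(x)
print(y.shape)          # torch.Size([10, 1]): only the last dimension changes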

def __len__(self):
    return len(self._tensor_targets)

def __getitem__(self, index):
    inputs = self._tensor_inputs[index]
    targets = self._tensor_targets[index]
    return inputs,targets

Skipping ahead, these are the other two required parts of a Dataset. The __len__ method returns the total number of samples (indexed along axis 0), and the __getitem__ method returns the data your network will use for a given index.

if normalize:
    self.normalize(self._tensor_inputs)
    self.normalize(self._tensor_targets)

def normalize(self, tensor:torch.Tensor) -> None:
    tensor[:] = (tensor - tensor.amin(0)) / (tensor.amax(0) - tensor.amin(0))

This last chunk of code does a basic 0-to-1 normalization of the data (i.e. min-max scaling; it does not fit the data to a Gaussian). I prefer to have my dataset class handle transformations, normalizations, and augmentations.
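Putting it all together, the finished class behaves like any other Dataset (this assumes the same RegressionData.pkl file from the top of the section):

ds = RegressionDataset("_dlcourse/code/RegressionData.pkl", normalize=True)

print(len(ds))           # 100 samples, from __len__
inputs, targets = ds[0]  # __getitem__ returns an (inputs, targets) pair
print(inputs.shape)      # torch.Size([6])
print(targets.shape)     # torch.Size([1])
# With normalize=True, every column now lies in the range [0, 1]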

Create a DataLoader

I prefer to make a wrapper class that manages my model data and hands out DataLoaders. This is not required, but it tends to clean up my code and make life easier in the long run.


import random
from torch.utils.data import DataLoader, Subset

class MLPDataLoader():
    def __init__(
            self,
            dataset,
            train_split:float = .8,
            batch_size:int = 10,
        ):
        indices = list(range(len(dataset)))
        random.shuffle(indices)                # shuffle once so the split is random
        split = int(len(dataset)*train_split)  # number of training samples
        self._batch_size = batch_size
        self._train_idx = indices[:split]
        self._val_idx = indices[split:]
        self._trainset = Subset(dataset, self._train_idx)
        self._valset = Subset(dataset, self._val_idx)

    def train_dataloader(self):
        return DataLoader(
            self._trainset,
            batch_size=self._batch_size,
            shuffle=True,
        )

    def val_dataloader(self):
        return DataLoader(
            self._valset,
            batch_size=self._batch_size,
            shuffle=False,
        )
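With the wrapper in place, pulling train and validation batches looks like this (the shapes assume the regression dataset built above):

data = MLPDataLoader(ds, train_split=0.8, batch_size=10)

for batch_inputs, batch_targets in data.train_dataloader():
    print(batch_inputs.shape, batch_targets.shape)  # torch.Size([10, 6]) torch.Size([10, 1])
    break  # only peek at the first batch

print(len(data.train_dataloader().dataset))  # 80 training samples
print(len(data.val_dataloader().dataset))    # 20 validation samples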