Outline: Working with Data in PyTorch for Deep Learning
Analyze Data
Understanding the structure and content of your data is paramount to building good data-driven models.
```python
import pandas as pd

df = pd.read_pickle("_dlcourse/code/RegressionData.pkl")  # This could be any file format
pd.set_option('display.colheader_justify', 'center')  # pretty print option
pd.set_option('display.precision', 3)  # pretty print option
print(f"Total number of samples: {len(df)}")
print("Top 5 rows:")
print(df.head())
```
```
Total number of samples: 100
Top 5 rows:
     A      B      C      D      E      F    Target_0
0  0.730  5.540  2.244 -1.601 -1.691  0.625   30.720
1  0.228 -8.955  2.642  7.802 -0.855 -0.304  -54.216
2 -0.835 -1.549  2.271  0.314  0.071 -5.181  -27.621
3  0.628 -0.536 -0.007  5.097  1.917 -2.232  -23.083
4  0.474  7.877 -0.879  2.787  1.869 -5.179    3.191
```
In this example we have 6 inputs (A through F) and 1 output (Target_0), all numeric data. While not required, it is useful to create a torch `Dataset` object to manage it. My preferred data flow is shown below, but it is up to you how to structure and handle your own projects.
```mermaid
flowchart LR
    A((Data File)) --> B(Dataset)
    B --> C((Prepared <br/>Tensors))
    C --> D(Dataloader)
    D --> E((Batches))
    E --> F(Model)
```
Create a Dataset
Whenever you want to create a custom PyTorch `Dataset`, there are three requirements: inherit the `Dataset` class, create a `__len__` function, and create a `__getitem__` function.
```python
from torch.utils.data import Dataset

class SomeDataset(Dataset):
    def __init__(self, ...):
        super().__init__()  # calls the init function of the super class Dataset
        ...

    def __len__(self):
        ...

    def __getitem__(self, index):
        ...
```
You can include any other function and pass any number of initial class inputs, but these three requirements must be met.
For this example we are going to use the RegressionDataset class.
```python
class RegressionDataset(Dataset):
    def __init__(self, data_file: str, normalize: bool = False):
        super().__init__()
```
Here we are creating the class `RegressionDataset`, which is a subclass of `torch.utils.data.Dataset`. It takes the argument `data_file` as the path to the regression data, and optionally `normalize`, which will normalize the data from 0 to 1.
```python
df_data: pd.DataFrame = pd.read_pickle(data_file)  # opens base PKL file
target_labels = [c for c in df_data.columns if c.startswith('Target')]  # finds all target columns
df_inputs = df_data.drop(target_labels, axis=1)  # creates DF of only inputs
df_targets = df_data[target_labels]  # creates DF of only targets
```
Here we have loaded all the data and split up the inputs and the targets.
Here is an easy pitfall in data management. This chunk converts the data into tensors and shapes them into the appropriate size.
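A sketch of that chunk, consistent with the bullet points below; the attribute names `_tensor_inputs` and `_tensor_targets` are my assumption:

```python
import torch

# assumed reconstruction: build float32 tensors from the two DataFrames
self._tensor_inputs = torch.tensor(df_inputs.values).float()
self._tensor_targets = torch.tensor(df_targets.values).float().view(-1, len(target_labels))
```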
- `df.values` → converts the DataFrame into a NumPy array
- `.float()` → converts the Double (float64) tensor into a Float (float32) tensor
- `.view(-1, len(target_labels))` → converts the 1D target tensor into a 2D tensor
One thing we can check is the documentation of the layer we are going to be using, `nn.Linear`. This informs the structure of the data that needs to be fed into the network. We see that the layer takes `(*, H_in)`, so all of the “action” will take place on the last dimension of your inputs.
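To make that shape contract concrete, here is a quick check (the layer sizes are chosen to match the six inputs and one target in this example):

```python
import torch
from torch import nn

layer = nn.Linear(6, 1)    # H_in = 6 matches the six input columns A through F
x = torch.randn(100, 6)    # (N, H_in): samples on axis 0, features on the last axis
print(layer(x).shape)      # torch.Size([100, 1])
```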
Skipping ahead, these are the other two required parts of a `Dataset`: the `__len__` function needs to return the total number of samples (which will be going across axis 0), and the `__getitem__` function returns the data that your network is utilizing based on some index.
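A minimal sketch of those two methods, assuming the `_tensor_inputs`/`_tensor_targets` attributes from the conversion chunk above:

```python
def __len__(self) -> int:
    # the total number of samples lives along axis 0
    return self._tensor_inputs.shape[0]

def __getitem__(self, index):
    # return one (input, target) pair for the requested index
    return self._tensor_inputs[index], self._tensor_targets[index]
```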
```python
## Data Representation in PyTorch
if normalize:
    self.normalize(self._tensor_inputs)
    self.normalize(self._tensor_targets)

def normalize(self, tensor: torch.Tensor) -> None:
    # in-place min-max scaling: each column's (axis 0) min maps to 0 and max to 1
    tensor[:] = (tensor - tensor.amin(0)) / (tensor.amax(0) - tensor.amin(0))
```
This last chunk of code does a basic 0-to-1 normalization of the data (i.e., it does not fit it to a Gaussian). I prefer to have my dataset class handle transformations, normalizations, and augmentations.
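Putting the pieces together, a quick sanity check of the assembled class might look like this (the expected output follows from the 100-sample, six-input data above):

```python
dataset = RegressionDataset("_dlcourse/code/RegressionData.pkl", normalize=True)
print(len(dataset))        # 100
x, y = dataset[0]
print(x.shape, y.shape)    # torch.Size([6]) torch.Size([1])
print(x.min(), x.max())    # both within [0, 1] after normalization
```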
Create a Dataloader
I prefer to make a DataLoader class to handle the management of my model data. This is not required, but it tends to clean up my code and make life easier in the long run.
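As a starting point, the stock `torch.utils.data.DataLoader` already covers batching and shuffling; a minimal sketch using the dataset built above (the batch size here is an arbitrary choice):

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_inputs, batch_targets in loader:
    # each iteration yields one batch stacked along axis 0
    print(batch_inputs.shape, batch_targets.shape)  # torch.Size([16, 6]) torch.Size([16, 1])
    break
```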