I am working with multiple CSV files, each containing several 1D signals stored as columns. I have about 9000 such files, and the combined data is about 40 GB.
I have written a Dataset class like this:
import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.files = files
        # Parse the file once here, only to record how many columns (samples) it holds
        my_data = np.genfromtxt('/data/' + files, delimiter=',')
        self.dim = my_data.shape[1]
        self.data = []

    def __getitem__(self, i):
        file1 = self.files
        # The whole CSV is parsed again on every call
        my_data = np.genfromtxt('/data/' + file1, delimiter=',')
        self.dim = my_data.shape[1]
        for j in range(my_data.shape[1]):
            # Each column becomes one (1, n_rows) float tensor
            tmp = np.reshape(my_data[:, j], (1, my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            self.data.append(tmp)
        return self.data[i]

    def __len__(self):
        return self.dim
The way I am loading the whole dataset is by building a new DataLoader for each file inside a for loop:
for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:
        ...
But this is terribly slow. I was wondering if I could store all of the data in a single file instead, but I don't have enough RAM to hold it all at once. Is there a way around this? Let me know if there is one.
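For reference, this is roughly the kind of thing I had in mind: convert everything once into a single on-disk array with numpy.memmap, and index into it lazily so nothing has to fit in RAM. This is only a sketch of the idea, not code I have working; build_memmap, MemmapDataset, the output path, and the n_rows/total_cols shape arguments are placeholders, and it assumes every CSV has the same number of rows.

import numpy as np
import torch

# One-off conversion: stream each CSV into a single float32 array stored on disk.
# n_rows is the (assumed common) number of rows per file, total_cols the combined
# number of columns across all files.
def build_memmap(files, out_path, n_rows, total_cols):
    out = np.memmap(out_path, dtype='float32', mode='w+', shape=(total_cols, n_rows))
    col = 0
    for f in files:
        arr = np.genfromtxt('/data/' + f, delimiter=',')       # shape (n_rows, n_cols)
        out[col:col + arr.shape[1]] = arr.T.astype('float32')  # one sample per row
        col += arr.shape[1]
    out.flush()

class MemmapDataset(torch.utils.data.Dataset):
    def __init__(self, path, n_rows, total_cols):
        # mode='r' keeps the array on disk; rows are paged in only when indexed
        self.data = np.memmap(path, dtype='float32', mode='r', shape=(total_cols, n_rows))

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, i):
        # Copy one row out of the memmap and add the leading channel dimension
        return torch.from_numpy(np.array(self.data[i])).unsqueeze(0)

The idea would be to pay the CSV-parsing cost once and then train from a single DataLoader over MemmapDataset instead of rebuilding one per file, but I'm not sure this is the right approach.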