
I am working with multiple CSV files, each containing multiple 1D data series. I have about 9000 such files, and the total combined data is about 40 GB.

I have written a dataloader like this:

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        
        self.files = files
        my_data = np.genfromtxt('/data/'+files, delimiter=',')
        self.dim = my_data.shape[1]
        self.data = []
        
    def __getitem__(self, i):

        file1 = self.files
        my_data = np.genfromtxt('/data/'+file1, delimiter=',')
        self.dim = my_data.shape[1]

        for j in range(my_data.shape[1]):
            tmp = np.reshape(my_data[:,j],(1,my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            self.data.append(tmp)        
        
        return self.data[i]

    def __len__(self): 
        
        return self.dim

The way I am loading the whole dataset into the dataloader is through a for loop:

for x_train in tqdm(train_files):
    train_dl_spec = data_gen(x_train)
    train_loader = torch.utils.data.DataLoader(
        train_dl_spec, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)
    for data in train_loader:

But this is working terribly slowly. I was wondering if I could store all of that data in one file, but I don't have enough RAM. So is there a way around it?

Let me know if there’s a way.


1 Answer


I've never used PyTorch before, and I confess I don't really know what's going on. Nonetheless, I'm almost certain you're using Dataset wrong.

As I understand it, a Dataset is an abstraction over all of your data that returns one sample per index. Assuming each of your 9000 files has 10 rows (samples), index 21 would refer to the 3rd file and its 2nd row (using 0-indexing).
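
For instance, with 10 rows per file that mapping is just integer division; a quick sketch (the 10-rows-per-file figure is purely illustrative):

# Map a global sample index to (file number, row number), assuming
# 10 rows per file purely for illustration.
rows_per_file = 10
file_idx, row_idx = divmod(21, rows_per_file)
print(file_idx, row_idx)  # 2 1 -> the 3rd file, its 2nd row (0-indexed)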

Because you have so much data, you don't want to load everything into memory. So the Dataset should fetch exactly one value, and the DataLoader takes care of building a batch of values.

There are almost certainly optimizations that could be applied to what I've done, but maybe this will get you started. I created a csvs directory with these files:

❯ cat csvs/1.csv
1,2,3
2,3,4
3,4,5

❯ cat csvs/2.csv
21,21,21
34,34,34
66,77,88

Then I created this Dataset class. It takes a directory as input (where all the CSVs are stored). The only things kept in memory are the name of each file and its row count. When an item is requested, we work out which file contains that index and return the tensor for that row.

By only iterating over the files, we never store a file's contents in memory. An improvement here would be to avoid walking the list of files to work out which one is relevant, and to use a generator and some state when consecutive indices are accessed.

(Because when accessing index 8 in a 10-row file we uselessly read over the first 7 rows, and there isn't much we can do about that. But when index 9 is then accessed, it would be better to realise we can just return the next row rather than reading over the first 8 rows again.)
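
As an illustration of the first idea only (the full class below keeps the simpler linear scan), the file lookup could use precomputed cumulative row counts and bisect; a rough sketch, with a made-up files list mirroring the (path, row count) pairs used below:

from bisect import bisect_right
from itertools import accumulate

# Hypothetical (path, row_count) pairs, mirroring self.files in the class below.
files = [("csvs/1.csv", 3), ("csvs/2.csv", 3)]

# Cumulative counts, computed once: here [3, 6].
cumulative = list(accumulate(count for _, count in files))

def locate(idx: int):
    # Return (position in `files`, row within that file) for a global sample index.
    file_pos = bisect_right(cumulative, idx)
    rows_before = cumulative[file_pos - 1] if file_pos > 0 else 0
    return file_pos, idx - rows_before

print(locate(4))  # (1, 1) -> the second file, its second row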

import numpy as np
import torch
from functools import lru_cache
from pathlib import Path
from pprint import pprint
from torch.utils.data import Dataset, DataLoader

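# Count the lines (samples) in a file; lru_cache ensures each file is only scanned once.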
@lru_cache()
def get_sample_count_by_file(path: Path) -> int:
    c = 0
    with path.open() as f:
        for line in f:
            c += 1
    return c


class CSVDataset(Dataset):
    def __init__(self, csv_directory: str, extension: str = ".csv"):
        self.directory = Path(csv_directory)
        self.files = sorted((f, get_sample_count_by_file(f)) for f in self.directory.iterdir() if f.suffix == extension)
        self._sample_count = sum(f[-1] for f in self.files)

    def __len__(self):
        return self._sample_count

    def __getitem__(self, idx):
        current_count = 0
        for file_, sample_count in self.files:
            if current_count <= idx < current_count + sample_count:
                # stop when the index we want is in the range of the sample in this file
                break  # now file_ will be the file we want
            current_count += sample_count

        # now file_ has sample_count samples
        file_idx = idx - current_count  # the index we want to access in file_
        with file_.open() as f:
            for i, line in enumerate(f):
                if i == file_idx:
                    data = np.array([float(v) for v in line.split(",")])
                    return torch.from_numpy(data)

Now we can use the DataLoader as I think it was intended:

dataset = CSVDataset("csvs")
loader = DataLoader(dataset, batch_size=4)

pprint(list(enumerate(loader)))

"""
[(0,
  tensor([[ 1.,  2.,  3.],
        [ 2.,  3.,  4.],
        [ 3.,  4.,  5.],
        [21., 21., 21.]], dtype=torch.float64)),
 (1, tensor([[34., 34., 34.],
        [66., 77., 88.]], dtype=torch.float64))]
"""

You can see that this correctly returns batches of data. Rather than printing it out, you can process each batch and keep only that batch in memory.
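
For instance, a minimal sketch of consuming the loader one batch at a time (the model, loss and optimizer here are placeholders, not something from the question):

import torch
from torch import nn

# Placeholder model/optimizer just to show how batches are consumed;
# substitute your real training step here.
model = nn.Linear(3, 3)  # 3 matches the number of columns in the toy CSVs
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for batch in loader:                  # only one batch lives in memory at a time
    batch = batch.float()             # rows come back as float64; the model expects float32
    optimizer.zero_grad()
    loss = criterion(model(batch), batch)  # e.g. an autoencoder-style objective
    loss.backward()
    optimizer.step()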

For more information, see the docs: https://pytorch.org/tutorials/recipes/recipes/custom_dataset_transforms_loader.html#part-3-the-dataloader

Answered 2021-06-12T22:34:30.090