python - Speeding up writing hundreds of 3D numpy arrays to an hdf5 file

Question

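The application I am developing converts a proprietary tiff-based file format (Nikon nd2 files) containing multiple images (fields of view), planes (Z), and fluorescence channels into numpy arrays, which are then saved in an HDF5 file. A typical dataset has 50 fields of view (fov), each with 5 channels, and each channel with 40 z planes. The whole file is about 6 GB.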
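
To make the scale concrete, here is a rough count based on the numbers above. It is only a sketch: the file size is approximate and is treated here as 6 × 10^9 bytes, which is an assumption.

```python
# Back-of-the-envelope size of one dataset, using the numbers quoted above.
N_FOV = 50         # fields of view per dataset
N_CHANNELS = 5     # channels per field of view
N_Z = 40           # z planes per channel
FILE_BYTES = 6e9   # "about 6 GB", taken as 6 * 10**9 bytes (assumption)

n_stacks = N_FOV * N_CHANNELS            # 3D arrays written to the HDF5 file
n_planes = n_stacks * N_Z                # individual 2D images in the nd2 file
bytes_per_plane = FILE_BYTES / n_planes  # average, ignoring metadata

print(n_stacks, "3D stacks,", n_planes, "planes")
print("about", round(bytes_per_plane / 1e3), "kB per plane on average")
```

So each file yields a few hundred separate HDF5 datasets, each holding one 40-plane stack.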

Here is the code I wrote:

Steps:

0) Import all the required libraries

import nd2reader as nd2
from matplotlib import pyplot as plt
import numpy as np
import h5py as h5
import itertools
import ast
import glob as glob
from joblib import Parallel, delayed
import time

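1) Function that runs the conversion of the nd2 file. The conversion to numpy arrays is done with nd2reader, a Python package, and it is fast. To reduce the number of loops and use a list comprehension, I build a list of tuples, each containing a channel and a fov, for example [('DAPI', 0), ('DAPI', 1)], where DAPI is the channel and the integer is the fov number.

Note: the experiment channel list is a file containing a dictionary that maps each channel (key) to the gene of interest (value).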

def ConvertND2File(ND2file):

    ChannelFileName=ND2file.replace('.nd2','ChannelsInfo.txt')


    # Read the file with the channels and raise an error if the file is missing
    try:
        ExperimentChannelList = ast.literal_eval(open(ChannelFileName).read())
    except IOError:
        print("The file:", ChannelFileName, "with the channels dictionary is missing")
        raise

    DataFileName=ND2file.replace('.nd2','.h5')
    with h5.File(DataFileName, 'w') as DataFile:
        ImgRef=nd2.Nd2(ND2file)
        Channels_Fields=itertools.product(ImgRef.channels,ImgRef.fields_of_view)

        # Create the empty array that will contain the 3D image
        ImgStack=np.empty([len(ImgRef.z_levels),ImgRef.height,ImgRef.width])

        # Use list comprehension to save the 3D arrays of the fov for each channel
        _=[SaveImg(DataFile,ImgRef,ExperimentChannelList,ImgStack,*x) for x in Channels_Fields]
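
As a small illustration of the pairing and of the channels dictionary used in the function above, here is a sketch using the standard library only. The channel names, gene names, and fov numbers are hypothetical stand-ins for what ImgRef.channels, ImgRef.fields_of_view, and the ChannelsInfo.txt file would provide.

```python
import ast
import itertools

# Hypothetical stand-ins for ImgRef.channels and ImgRef.fields_of_view
channels = ['DAPI', 'FITC']
fields_of_view = [0, 1]

# Hypothetical content of a ChannelsInfo.txt file: channel -> gene
channels_text = "{'DAPI': 'Nuclei', 'FITC': 'GeneA'}"
experiment_channels = ast.literal_eval(channels_text)

# Every (channel, fov) combination; the channel varies slowest
pairs = list(itertools.product(channels, fields_of_view))
print(pairs)

# Each pair is what gets unpacked into SaveImg via *x
for channel, fov in pairs:
    print(channel, fov, '->', experiment_channels[channel])
```

Note that itertools.product returns an iterator, so it can be consumed only once unless it is wrapped in list() as done here.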

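2) Function that assembles the images into a 3D array and then writes it to the HDF5 file. I use h5py. Each 3D numpy array is written to disk immediately after it is generated.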

def SaveImg(DataFile,ImgRef,ExperimentChannelList,ImgStack,*args):
        channel=args[0]
        fov=args[1]
        for idx,image in enumerate(ImgRef.select(channels=channel,z_levels=ImgRef.z_levels,fields_of_view=fov)):
            ImgStack[idx,:,:]=image
        gene=ExperimentChannelList[channel]
        ChannelGroup=DataFile.require_group(gene)
        FovDataSet=ChannelGroup.create_dataset(str(fov), data=ImgStack,dtype=np.float64,compression="gzip")
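
The resulting file layout is one group per gene and one dataset per fov inside it. The sketch below only builds the dataset paths, without touching h5py, to show that layout. The mapping and the dataset_path helper are hypothetical and exist only for illustration.

```python
# Hypothetical channel -> gene mapping, as read from the ChannelsInfo file
experiment_channels = {'DAPI': 'Nuclei', 'FITC': 'GeneA'}

def dataset_path(channel, fov):
    """Return the HDF5 path of the dataset written for one (channel, fov) pair."""
    gene = experiment_channels[channel]   # group name, as in require_group(gene)
    return '/{}/{}'.format(gene, fov)     # dataset name is str(fov)

paths = [dataset_path(c, f) for c in ('DAPI', 'FITC') for f in (0, 1)]
print(paths)
```

With these hypothetical names, a stack could later be read back from an open file with DataFile['/Nuclei/0'].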

3) Body of the script, with the joblib call to process all the files in the directory in parallel.

if __name__=='__main__':

    # Run the
    # Directory where ND2 file is stored (Ex. User/Data/)
    WorkingDirectory=input('Enter the directory with the files to process (ex. /User/):  ')
    #WorkingDirectory='/Users/simone/Box Sync/test/ND2conversion/'
    NumberOfProcesses=int(input('Enter the number of processes to use:  '))
    #NumberOfProcesses=2
    FileExt='nd2'
    # Iterator with the name of the files to process
    FilesIter=glob.iglob(WorkingDirectory+'*.'+FileExt)  

    now = time.time()
    Parallel(n_jobs=NumberOfProcesses,verbose=5)(delayed(ConvertND2File)(ND2file) for ND2file in FilesIter)
    print("Finished in", time.time()-now , "sec")

Run time

Total time to convert two 5.9 GB files
[Parallel(n_jobs=2)]: Done 1 out of 2 | elapsed: 7.4min remaining: 7.4min
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 7.4min finished
Finished in 444.8717038631439 sec
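
For reference, the timing above can be turned into a rough throughput figure. This is a sketch that assumes 5.9 GB means 5.9 × 10^9 bytes and measures the rate at which nd2 input is consumed, not the number of bytes actually written to the HDF5 files (which differs because of the float64 dtype and gzip compression).

```python
N_FILES = 2                    # files converted in parallel (n_jobs=2)
FILE_BYTES = 5.9e9             # size of each nd2 file (assumed decimal GB)
ELAPSED_S = 444.8717038631439  # wall-clock time reported by the script

total_bytes = N_FILES * FILE_BYTES
aggregate_mb_s = total_bytes / ELAPSED_S / 1e6   # both processes together
per_process_mb_s = aggregate_mb_s / N_FILES      # one file per process

print("aggregate: %.1f MB/s" % aggregate_mb_s)
print("per process: %.1f MB/s" % per_process_mb_s)
```

That works out to roughly 26.5 MB/s of input in aggregate, or about 13 MB/s per process.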

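Question:

I am wondering whether there is a better way to handle the IO to the hdf5 file in order to speed up the conversion. Keep in mind that if I scale up the process, I will not be able to hold all the 3D numpy arrays (fov) in memory and write them only after each channel has been processed. Thanks!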
