memory-leaks - Memory leak with the en_core_web_trf model, spaCy

Question

There is a memory leak when I use a pipeline with the en_core_web_trf model. I run the model on a GPU with 16GB of RAM. Here is a code sample.

!python -m spacy download en_core_web_trf

import en_core_web_trf
nlp = en_core_web_trf.load()

# data is just an array of 100K sentences.
data = dataload()

for index, review in enumerate(nlp.pipe(data, batch_size=100)):
    # doing some processing here
    # print progress once every 1000 documents
    if index % 1000 == 0:
        print(index)

This code crashes when it reaches about 31K sentences and raises an OOM error.

CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 11.17 GiB total capacity; 10.44 GiB already allocated; 832.00 KiB free; 10.72 GiB reserved in total by PyTorch)

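I am only using the pipeline for prediction, not training on any data or anything else. I tried different batch sizes, but nothing changed and it still crashes.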

Your Environment

spaCy version: 3.0.5
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
Pipelines: en_core_web_trf (3.0.0)

score 1 · Accepted Answer

Lucky you, having a GPU. I'm still fighting my way through (Torch GPU) DLL hell on Windows :-). But it looks like spaCy 3 uses more GPU memory than spaCy 2, so my 6GB GPU may already be useless.

That said, have you tried running your case without the GPU (and watching the memory usage)?
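A rough sketch of how the memory usage could be watched during a CPU run, using only the standard library. The helper names `peak_rss_mib` and `watch_memory` are made up for this illustration and are not part of spaCy; the `resource` module is Unix-only, and `ru_maxrss` is reported in KiB on Linux (the asker's platform). Peak RSS never goes down, so what matters is whether it keeps climbing as more documents are processed.

```python
import resource
from typing import Iterable, List, Tuple


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (ru_maxrss is in KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def watch_memory(docs: Iterable, every: int = 1000) -> List[Tuple[int, float]]:
    """Consume `docs` and record (index, peak RSS in MiB) once every `every` items."""
    samples = []
    for index, _doc in enumerate(docs):
        # the per-document processing would go here
        if index % every == 0:
            samples.append((index, peak_rss_mib()))
    return samples
```

With the code from the question this would be called as `watch_memory(nlp.pipe(data, batch_size=100))`, assuming the pipeline was loaded without calling `spacy.prefer_gpu()` or `spacy.require_gpu()`, so that it runs on the CPU.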

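The "leak" in spaCy 2 on large datasets was (mainly) due to vocabulary growth: each row of data may add more words to the vocab, and the suggested "solution" was to reload the model and/or the vocab every nnn rows. GPU usage may have the same problem...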
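The "reload every nnn rows" workaround could look roughly like the sketch below. It is untested against en_core_web_trf and only shows the idea: `process_with_reload`, `chunked` and `free_gpu_cache` are hypothetical helper names, and whether emptying the PyTorch cache actually cures the OOM here is an assumption, not something I have verified.

```python
import gc
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def free_gpu_cache() -> None:
    """Release PyTorch's cached GPU blocks, if PyTorch and CUDA are present."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def process_with_reload(
    texts: Iterable[str],
    load_model: Callable[[], object],
    handle_doc: Callable[[object], None],
    reload_every: int = 10_000,
    batch_size: int = 100,
) -> int:
    """Run `texts` through a pipeline, loading a fresh pipeline for every chunk."""
    processed = 0
    for chunk in chunked(texts, reload_every):
        nlp = load_model()  # fresh pipeline, so the vocab starts small again
        for doc in nlp.pipe(chunk, batch_size=batch_size):
            handle_doc(doc)
            processed += 1
        # drop the old pipeline before the next one is loaded
        del nlp
        gc.collect()
        free_gpu_cache()
    return processed
```

For the question's setup this would be called as `process_with_reload(data, en_core_web_trf.load, handle_doc)`, where `handle_doc` stands for whatever processing is done per review. Reloading a transformer model is slow, so `reload_every` should be as large as the memory allows.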
