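I am working with Caffe, a framework for convolutional neural networks that runs on GPUs (or CPUs). It mainly uses CUDA 6.0. I am training a CNN on a large image dataset (ImageNet, 1.2 million images), which requires a lot of memory. For now, though, I am running small experiments on a subset of the original dataset (which also requires a lot of memory). I am working on a GPU cluster. This is the output of the command $ nvidia-smi: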
+------------------------------------------------------+
| NVIDIA-SMI 331.62     Driver Version: 331.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050         Off  | 0000:08:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1585MiB /  2687MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050         Off  | 0000:09:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050         Off  | 0000:0A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2050         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M2050         Off  | 0000:16:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M2050         Off  | 0000:19:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M2050         Off  | 0000:1A:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M2050         Off  | 0000:1B:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     10242  ../../../build/tools/train_net.bin                  1577MiB |
+-----------------------------------------------------------------------------+
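However, when I try to run several of these processes at once (for example, the same train_net.bin on different datasets), they fail because they all run on the same GPU. How can I force a process to use a different GPU? I would appreciate any help.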
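To make the question concrete: the output above shows the single train_net.bin process on GPU 0 while GPUs 1 to 7 sit idle. One approach that might work is the CUDA runtime's CUDA_VISIBLE_DEVICES environment variable, which restricts the devices a process can see. The helper name run_on_gpu below is hypothetical, and this is only a sketch, not something confirmed to work with this Caffe build.

```shell
# Hypothetical helper (not part of Caffe): run a command so that it sees
# only one GPU. CUDA_VISIBLE_DEVICES is read by the CUDA runtime when the
# process starts; the chosen GPU then appears as device 0 inside it.
run_on_gpu() {
    gpu="$1"   # GPU index to expose to the child process
    shift      # the remaining arguments are the command to run
    env CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

# Assumed usage; the solver file names are placeholders:
#   run_on_gpu 1 ../../../build/tools/train_net.bin solver_a.prototxt &
#   run_on_gpu 2 ../../../build/tools/train_net.bin solver_b.prototxt &
```

Two caveats. The index that CUDA assigns to a device is not guaranteed to match the index nvidia-smi prints, so it is worth checking nvidia-smi again after launching to confirm which GPU each process landed on. Also, if the Caffe build selects a device itself (for example through a device_id setting in the solver definition, assuming this version has one), that value is interpreted relative to the devices left visible, so it should stay at 0 when only one GPU is exposed.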