我正在尝试在计算集群中同时运行一段代码的多个实例(2000 个左右)。它的工作方式是我提交作业,集群将在节点经常打开时运行它们,每个节点有几个作业。这似乎在使用时间种子的随机数生成中为大量实例生成相同的值。
我可以使用一个简单的替代方案吗?可重复性和安全性并不重要,快速生成独特的种子才是。什么是最简单的方法,如果可能的话,跨平台方法会很好。
我正在尝试在计算集群中同时运行一段代码的多个实例(2000 个左右)。它的工作方式是我提交作业,集群将在节点经常打开时运行它们,每个节点有几个作业。这似乎在使用时间种子的随机数生成中为大量实例生成相同的值。
我可以使用一个简单的替代方案吗?可重复性和安全性并不重要,快速生成独特的种子才是。什么是最简单的方法,如果可能的话,跨平台方法会很好。
该rdtsc
指令是一个非常可靠(和随机)的种子。
在 Windows 中,它可以通过__rdtsc()
内在函数访问。
在 GNU C 中,它可以通过以下方式访问:
unsigned long long rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((unsigned long long)hi << 32) | lo;
}
该指令测量自处理器上电以来的总伪周期。鉴于当今机器的高频率,即使两个处理器同时启动并以相同的速度运行,它们也极不可能返回相同的值。
我假设您有一些进程启动其他进程。让它通过种子使用。然后,您可以让该主进程为每个进程传入一个随机数以用作其种子。这样一来,实际上只选择了一个任意种子……您可以为此花时间。
If you don't have a master process launching the others, then if each process at least has a unique index, then what you can do is have one process generate a series of random numbers in memory (if shared memory) or in a file (if shared disk) and then have each process pull the index'th random number out to use as their seed.
Nothing will give you a more even distribution of seeds than a series of random numbers from a single seed.
A combination of the PID and the time should be enough to get a unique seed. It's not 100% cross-platform, but getpid(3)
on *nix platforms and GetProcessId
on Windows will get you 99.9% of the way there. Something like this should work:
srand((time(NULL) & 0xFFFF) | (getpid() << 16));
You could also read data from /dev/urandom
on *nix systems, but there's no equivalent to that on Windows.
unsigned seed;
read(open("/dev/urandom", O_RDONLY), &seed, sizeof seed);
srand(seed); // IRL, check for errors, close the fd, etc...
I would also recommend a better random number generator.
If C++11 can be used then consider std::random_device
. I would suggest you to watch link for a comprehensive guide.
Extracting the essential message from the video link : You should never use srand
& rand
, but instead use std::random_device
and std::mt19937
-- for most cases, the following would be what you want:
#include <iostream>
#include <random>
int main() {
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_int_distribution<int> dist(0,99);
for (int i = 0; i < 16; i++) {
std::cout << dist(mt) << " ";
}
std::cout << std::endl;
}
而不是从 C std lib time() 函数以秒为单位测量的直接时间,您可以改用处理器的计数器吗?大多数处理器都有一个自由运行的滴答计数,例如在 x86/x64 中有时间戳计数器:
时间戳计数器是一个 64 位寄存器,自 Pentium 以来所有 x86 处理器都存在。它计算自重置以来的刻度数。
(那个页面也有很多方法可以在不同平台上访问这个计数器——gcc/ms visual c/etc)
请记住,时间戳计数器并非没有缺陷,它可能不会跨处理器同步(您可能不关心您的应用程序)。省电功能可能会增加或减少处理器的时钟(同样你可能不在乎)。
Just an idea... generate a GUID (which is 16 bytes) and sum its 4-byte or 8-byte chunks (depending on the expected width of the seed), allowing integer wrap-around. Use the result as a seed.
GUIDs typically encapsulate characteristics of the computer that generated them (such as MAC address), which should make it rather improbable that two different machines will end-up generating the same random sequence.
This is obviously not portable, but finding appropriate APIs/libraries for your system should not be too hard (e.g. UuidCreate
on Win32, uuid_generate
on Linux).
Provides CryptGenRandom()
and RtlGenRandom()
. They will give you an array of random bytes, which you can use as seeds.
You can find the docs on the msdn pages.
You can use Openssl's RAND_bytes()
to get a random number of bytes on linux. It will use /dev/random
by default.
#ifdef _WIN32
#include <NTSecAPI.h>
#else
#include <openssl/rand.h>
#endif
uint32_t get_seed(void)
{
uint32_t seed = 0;
#ifdef _WIN32
RtlGenRandom(&seed, sizeof(uint32_t) );
#else
RAND_bytes(&seed, sizeof(uint32_t) );
#endif
return seed;
}
Note that openssl provides a Cryptographically secure PRNG by default, so you could use it directly. More info here.
Assuming you're on a reasonably POSIX-ish system, you should have clock_gettime
. This will give the current time in nanoseconds, which means for all practical purposes it's impossible to ever get the same value twice. (In theory bad implementations could have much lower resolution, e.g. just multiplying milliseconds by 1 million, but even half-decent systems like Linux give real nanosecond results.)
If uniqueness is important, you need to arrange for each node to know what IDs have been claimed by others. You could do this with a protocol asking "anyone claimed ID x?" or arranging in advance for each node to have a selection of IDs which have not been allocated to others.
(GUIDs use the machine's MAC, so would fall into the "arrange in advance" category.)
Without some form of agreement, you'll risk two nodes climing the same ID.