c++ - 蛮力，单线程素数分解

Question

需要考虑的是以下函数，该函数可用于（相对快速）将 64 位无符号整数分解为其素数。请注意，因式分解不是概率的（即，它是精确的）。在现代硬件上，该算法已经足够快，可以在几秒钟内找到一个数是素数或几乎没有非常大的因数。

问题：可以对所提出的算法进行任何改进，同时保持单线程，以便它可以更快地分解（任意）非常大的无符号 64 位整数，最好不使用概率方法（例如 Miller-Rabin）确定素数？

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  // Num has to be at least 2 to contain "prime" factors
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2; // Will be used to skip multiples of 3, later

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  // If all of the factors were 2s and 3s, done...
  if (workingNum==1)
    return;

  // sqrtNum is the (inclusive) upper bound of our search for factors
  ulong sqrtNum=(ulong) sqrt(double(workingNum+0.5));

  // Factor out potential factors that are greate than or equal to 5
  // The variable n represents the next potential factor to be tested
  for (ulong n=5;n<=sqrtNum;)
  {
    // Is n a factor of the current working number?
    if (workingNum%n==0)
    {
      // n is a factor, so add it to the list of factors
      factors.push_back(n);

      // Divide current working number by n, to get remaining number to factor
      workingNum/=n;

      // Check if we've found all factors
      if (workingNum==1)
        return;

      // Recalculate the new upper bound for remaining factors
      sqrtNum=(ulong) sqrt(double(workingNum+0.5));

      // Recheck if n is a factor of the new working number, 
      // in case workingNum contains multiple factors of n
      continue;
    }

    // n is not or is no longer a factor, try the next odd number 
    // that is not a multiple of 3
    n+=nextOffset;
    // Adjust nextOffset to be an offset from n to the next non-multiple of 3
    nextOffset=(nextOffset==2UL ? 4UL : 2UL);
  }

  // Current workingNum is prime, add it as a factor
  factors.push_back(workingNum);
}

谢谢

编辑：我添加了更多评论。通过引用传入向量的原因是允许在调用之间重用向量并避免动态分配。向量没有在函数中清空的原因是允许将当前“num's”因子附加到向量中已经存在的因子的奇怪要求。

函数本身并不漂亮，可以重构，但问题是如何让算法更快。所以，请不要提出关于如何使函数更漂亮、更易读或 C++ish 的建议。那是儿戏。改进这个算法，使其能够更快地找到（证明）因子是困难的部分。

更新：到目前为止， Potatoswatter有一些出色的解决方案，请务必在底部附近查看他的 MMX 解决方案。

score 19 · Accepted Answer

将这种方法与（预先生成的）筛子进行比较。取模很昂贵，所以这两种方法本质上都做两件事：生成潜在因子，并执行取模运算。任何一个程序都应该在比模数更少的周期内合理地生成一个新的候选因子，因此任何一个程序都是模数限制的。

给定的方法过滤掉所有整数的恒定比例，即 2 和 3 的倍数，即 75%。四分之一（如给定的）数字用作模运算符的参数。我将其称为跳过过滤器。

另一方面，筛子仅使用素数作为模运算符的参数，并且连续素数之间的平均差由素数定理控制为 1/ln(N)。例如，e ^20 略低于 5 亿，因此超过 5 亿的数字有不到 5% 的机会成为素数。如果考虑到 2^32 以内的所有数字，则 5% 是一个很好的经验法则。

因此，筛子的操作时间将比div跳过过滤器少 5 倍。下一个要考虑的因素是筛子产生素数的速度，即从内存或磁盘读取它们的速度。如果获取一个素数比 4div秒快，则筛子更快。根据我的表格div，Core2 上的吞吐量最多为每 12 个周期一个。这些将是硬除法问题，所以让我们保守地预算每个素数 50 个周期。对于 2.5 GHz 处理器，这是 20 纳秒。

在 20 ns 内，一个 50 MB/秒的硬盘可以读取大约一个字节。简单的解决方案是每个素数使用 4 个字节，因此驱动器会更慢。但是，我们可以更聪明。如果我们想按顺序编码所有素数，我们可以只编码它们的差异。同样，预期的差异是 1/ln(N)。而且，它们都是偶数，这节省了额外的一点。而且它们永远不会为零，这使得对多字节编码的扩展是免费的。因此，每个素数使用一个字节，最多可以将 512 的差异存储在一个字节中，根据该 Wikipedia 文章，这使我们最多可以达到 303371455241 。

因此，根据硬盘驱动器，存储的素数列表在验证素数时的速度应该大致相同。如果它可以存储在 RAM 中（它是 203 MB，因此后续运行可能会遇到磁盘缓存），那么问题就完全消失了，因为 FSB 速度通常与处理器速度的差异小于 FSB 宽度（以字节为单位） — 即，FSB 每个周期可以传输多个素数。那么改进的因素是除法操作的减少，即五倍。下面的实验结果证实了这一点。

当然，还有多线程。可以将素数或跳过过滤的候选者的范围分配给不同的线程，从而使这两种方法都令人尴尬地并行。没有不涉及增加并行除法器电路数量的优化，除非您以某种方式消除模数。

这是这样一个程序。它是模板化的，因此您可以添加 bignums。

/*
 *  multibyte_sieve.cpp
 *  Generate a table of primes, and use it to factorize numbers.
 *
 *  Created by David Krauss on 10/12/10.
 *
 */

#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;

char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };

template< typename It >
unsigned decode_gap( It &stream ) {
    unsigned gap = static_cast< unsigned char >( * stream ++ );

    if ( gap ) return 2 * gap; // only this path is tested

    gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
    return gap + decode_gap( stream ); // shallow recursion
}

template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
    unsigned len = 0, bytes[4];

    gap /= 2;
    do {
        bytes[ len ++ ] = gap % encoding_base;
        gap /= encoding_base;
    } while ( gap );

    while ( -- len ) { // loop not tested
        * stream ++ = 0;
        * stream ++ = bytes[ len + 1 ];
    }
    * stream ++ = bytes[ 0 ];
}

template< size_t lim >
void generate_primes() {
    auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
    bitset< lim / 2 > &sieve = * sieve_p;

    ofstream out_f( primes_filename, ios::out | ios::binary );
    ostreambuf_iterator< char > out( out_f );

    size_t count = 0;

    size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
    for ( ; x != last; ++ x ) {
        if ( sieve[ x ] ) continue;
        size_t n = x * 2 + 1; // translate index to number
        for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    for ( ; x != lim / 2; ++ x ) {
        if ( sieve[ x ] ) continue;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    cout << prev * 2 + 1 << endl;
}

template< typename I >
void factorize( I n ) {
    ifstream in_f( primes_filename, ios::in | ios::binary );
    if ( ! in_f ) {
        cerr << "Could not open primes file.\n"
                "Please generate it with 'g' command.\n";
        return;
    }

    while ( n % 2 == 0 ) {
        n /= 2;
        cout << "2 ";
    }
    unsigned long factor = 1;

    for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
        factor += decode_gap( in );

        while ( n % factor == 0 ) {
            n /= factor;
            cout << factor << " ";
        }

        if ( n == 1 ) goto finish;
    }

    cout << n;
finish:
    cout << endl;
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) goto print_help;

    unsigned long n;

    if ( argv[1][0] == 'g' ) {
        generate_primes< (1ul<< 32) >();
    } else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit )
        factorize( n );
    } else goto print_help;

    return 0;

print_help:
    cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
            "\t" << argv[0] << " g -- generate primes file in current directory.\n";
}

2.2 GHz MacBook Pro 上的性能：

dkrauss$ time ./multibyte_sieve g
4294967291

real    2m8.845s
user    1m15.177s
sys    0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279 

real    0m5.405s
user    0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real    0m25.147s
user    0m24.170s
sys 0m0.096s

score 9 · Accepted Answer

我的另一个答案相当长并且与这个答案完全不同，所以这里有别的东西。

这个程序不只是过滤掉前两个素数的倍数，或者将所有相关的素数编码到一个字节中，而是过滤掉所有适合八位的素数的倍数，特别是 2 到 211。所以不是通过 33%数字，这会将大约 10% 传递给除法运算符。

素数保存在三个 SSE 寄存器中，它们与运行计数器的模数保存在另外三个中。如果任何素数与计数器的模数为零，则计数器不能是素数。此外，如果任何模数等于 1，则 (counter+2) 不能是素数，依此类推，直到 (counter+30)。偶数会被忽略，并且像 +3、+6 和 +5 这样的偏移会被跳过。矢量处理允许一次更新 16 个字节大小的变量。

在对厨房水槽进行了全面的微优化（但没有比内联指令更具体的平台）之后，我的性能提升了 1.78 倍（我的笔记本电脑上是 24 秒对 13.4 秒）。如果使用 bignum 库（甚至是非常快的库），则优势更大。请参阅下文以获得更易读的预优化版本。

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 47 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;

inline void remove_factor( unsigned long &n, unsigned long factor ) __attribute__((always_inline));
void remove_factor( unsigned long &n, unsigned long factor ) {
    while ( n % factor == 0 ) {
        n /= factor;
        cout << factor << " ";
    }
}

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
                     53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127,
                     131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        remove_factor( n, * p );
    }

    //int histo[8] = {}, total = 0;

    enum { bias = 15 - 128 };
    __m128i const prime1 =       _mm_set_epi8( 21, 21, 21, 22, 22, 26, 26, 17, 19, 23, 29, 31, 37, 41, 43, 47 ),
            prime2 =             _mm_set_epi8( 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127 ),
            prime3 =             _mm_set_epi8( 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 ),
            vbias = _mm_set1_epi8( bias ),
            v3 = _mm_set1_epi8( 3+bias ), v5 = _mm_set1_epi8( 5+bias ), v6 = _mm_set1_epi8( 6+bias ), v8 = _mm_set1_epi8( 8+bias ),
            v9 = _mm_set1_epi8( 9+bias ), v11 = _mm_set1_epi8( 11+bias ), v14 = _mm_set1_epi8( 14+bias ), v15 = _mm_set1_epi8( 15+bias );
    __m128i mod1 = _mm_add_epi8( _mm_set_epi8(  3, 10, 17,  5, 16,  6, 19,  8,  9, 11, 14, 15, 18, 20, 21, 23 ), vbias ),
            mod2 = _mm_add_epi8( _mm_set_epi8( 26, 29, 30, 33, 35, 36, 39, 41, 44, 48,  50,  51,  53,  54,  56,  63 ), vbias ),
            mod3 = _mm_add_epi8( _mm_set_epi8(  65,  68,  69,  74,  75,  78,  81,  83,  86,  89,  90,  95,  96,  98,  99, 105 ), vbias );

    for ( unsigned long factor = 1, limit = sqrtl( n ); factor <= limit + 30; factor += 30 ) {
        if ( n == 1 ) goto done;

        // up to 2^32, distribution of number candidates produced (0 up to 7) is
        // 0.010841     0.0785208   0.222928    0.31905     0.246109    0.101023    0.0200728   0.00145546 
        unsigned candidates[8], *cand_pen = candidates;
        * cand_pen = 6;
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v3 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v3 ), _mm_cmpeq_epi8( mod3,  v3 ) ) ) );
        * cand_pen = 10;                                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v5 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v5 ), _mm_cmpeq_epi8( mod3,  v5 ) ) ) );
        * cand_pen = 12;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v6 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v6 ), _mm_cmpeq_epi8( mod3,  v6 ) ) ) );
        * cand_pen = 16;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v8 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v8 ), _mm_cmpeq_epi8( mod3,  v8 ) ) ) );
        * cand_pen = 18;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v9 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v9 ), _mm_cmpeq_epi8( mod3,  v9 ) ) ) );
        * cand_pen = 22;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v11 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v11 ), _mm_cmpeq_epi8( mod3, v11 ) ) ) );
        * cand_pen = 28;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v14 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v14 ), _mm_cmpeq_epi8( mod3, v14 ) ) ) );
        * cand_pen = 30;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v15 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v15 ), _mm_cmpeq_epi8( mod3, v15 ) ) ) );

        /*++ total;
        ++ histo[ cand_pen - candidates ];

        cout << "( ";
        while ( cand_pen != candidates ) cout << factor + * -- cand_pen << " ";
        cout << ")" << endl; */

        mod1 = _mm_sub_epi8( mod1, _mm_set1_epi8( 15 ) ); // update residuals
        __m128i mask1 = _mm_cmplt_epi8( mod1, _mm_set1_epi8( 1+bias ) );
        mask1 = _mm_and_si128( mask1, prime1 ); // residual goes to zero or negative?
        mod1 = _mm_add_epi8( mask1, mod1 ); // combine reset into zero or negative

        mod2 = _mm_sub_epi8( mod2, _mm_set1_epi8( 15 ) );
        __m128i mask2 = _mm_cmplt_epi8( mod2, _mm_set1_epi8( 1+bias ) );
        mask2 = _mm_and_si128( mask2, prime2 );
        mod2 = _mm_add_epi8( mask2, mod2 );

        mod3 = _mm_sub_epi8( mod3, _mm_set1_epi8( 15 ) );
        __m128i mask3 = _mm_cmplt_epi8( mod3, _mm_set1_epi8( 1+bias ) );
        mask3 = _mm_and_si128( mask3, prime3 );
        mod3 = _mm_add_epi8( mask3, mod3 );

        if ( cand_pen - candidates == 0 ) continue;
        remove_factor( n, factor + candidates[ 0 ] );
        if ( cand_pen - candidates == 1 ) continue;
        remove_factor( n, factor + candidates[ 1 ] );
        if ( cand_pen - candidates == 2 ) continue;
        remove_factor( n, factor + candidates[ 2 ] );
        if ( cand_pen - candidates == 3 ) continue;
        remove_factor( n, factor + candidates[ 3 ] );
        if ( cand_pen - candidates == 4 ) continue;
        remove_factor( n, factor + candidates[ 4 ] );
        if ( cand_pen - candidates == 5 ) continue;
        remove_factor( n, factor + candidates[ 5 ] );
        if ( cand_pen - candidates == 6 ) continue;
        remove_factor( n, factor + candidates[ 6 ] );
    }

    cout << n;
done:
    /*cout << endl;
    for ( int hx = 0; hx < 8; ++ hx ) cout << (float) histo[hx] / total << " ";*/
    cout << endl;
}

.

dkrauss$ /usr/local/bin/g++ main.cpp -o factorize_sse -O3 --profile-use
dkrauss$ time ./factorize_sse 18446743721522234449
4294967231 4294967279 

real    0m13.437s
user    0m13.393s
sys 0m0.011s

以下是上述内容的初稿。包括优化

使循环计数器无条件合并（避免分支）。
通过将循环展开 15 倍来获得 ILP，将步幅增加到 30。
- 灵感来自您的优化。
- 30 似乎是一个甜蜜点，因为它可以免费删除 2、3 和 5 的倍数。
- 5 到 15 之间的素数可能有多个倍数，因此在向量中以不同的相位放置多个副本。
排除因素remove_factor。
将有条件的、不可预测remove_factor的调用更改为非分支数组写入。
使用 call 完全展开最终循环，remove_factor并确保函数始终是内联的。
- 消除最终展开的迭代，因为候选中总是有 7 的倍数。
添加另一个向量，其中包含所有剩余的足够小的素数。
通过向计数器添加偏差来腾出更多空间，并添加另一个向量。现在只有 6 个素数可以在不增加 16 位的情况下被过滤掉，而且我的寄存器也用完了：循环需要 3 个素数向量、3 个模数向量、8 个要搜索的常数，以及每个常数递增并进行范围检查。这使得 16。
- 在此应用中增益最小（但为正），但该技术的最初目的是在另一个答案中过滤筛子的素数。敬请期待……</li>

可读版本：

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 17 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        while ( n % * p == 0 ) {
            n /= * p;
            cout << * p << " ";
        }
    }

    if ( n != 1 ) {
        __m128i       mod   = _mm_set_epi8( 1, 2, 3,  5,  6,  8,  9, 11, 14, 15, 18, 20, 21, 23, 26, 29 );
        __m128i const prime = _mm_set_epi8( 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 ),
                      one = _mm_set1_epi8( 1 );

        for ( unsigned long factor = 1, limit = sqrtl( n ); factor < limit; ) {
            factor += 2;
            __m128i mask = _mm_cmpeq_epi8( mod, one ); // residual going to zero?
            mod = _mm_sub_epi8( mod, one ); // update other residuals
            if ( _mm_movemask_epi8( mask ) ) {
                mask = _mm_and_si128( mask, prime ); // reset cycle if going to zero
                mod = _mm_or_si128( mask, mod ); // combine reset into zeroed position

            } else while ( n % factor == 0 ) {
                n /= factor;
                cout << factor << " ";
                if ( n == 1 ) goto done;
            }
        }
        cout << n;
    }
done:
    cout << endl;
}

score 5 · Accepted Answer

费马的因式分解方法简单快速地找到成对的大素因数，只要在它走得太远和变慢之前停止它。然而，在我对随机数的测试中，这种情况很少见，看不到任何改进。

...不使用概率方法（例如 Miller-Rabin）来确定素数

在均匀分布的情况下，您 75% 的输入将需要十亿次循环迭代，因此即使您得到一个不确定的答案并且不得不返回到试验部门，也值得首先在不太确定的技术上花费一百万次操作。

我发现 Pollard 的 Rho 方法的 Brent 变体非常好，尽管编码和理解起来更复杂。我见过的最好的例子来自这个论坛讨论。该方法依赖于运气，但通常有足够的帮助是值得的。

Miller-Rabin 素数检验实际上是确定性的，最高可达 10^15 左右，可以省去搜索无果的麻烦。

我尝试了几十种变体，并确定了以下用于分解 int64 值的方法：

小因素试划分。（我使用前 8000 个预先计算的素数。）
使用 Pollard 的 Rho 进行 10 次尝试，每次使用 16 次迭代
试除以 sqrt(n)。

请注意，Pollard 的 Rho 找到不一定是素数的因子，因此可以使用递归来分解这些因子。

score 2 · Accepted Answer

这段代码比较慢，我很确定我明白为什么。它并不是慢得令人难以置信，但在 10-20% 的范围内肯定会慢一些。除法不应该为每个循环进行一次，但唯一的方法是调用sqrt或类似的方法。

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef std::vector<ulong> ULongVector;

void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2;

  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  ulong n = 5;
  while ((workingNum != 1) && ((workingNum / n) >= n)) {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;
    } else {
      n+=nextOffset;
      nextOffset=(nextOffset==2UL ? 4UL : 2UL);
    }
  }

  if (workingNum != 1) {
    // workingNum is prime, add it to the list of factors        
    factors.push_back(workingNum);
  }
}

score 2 · Accepted Answer

融合了 Omnifarious 的一些想法，以及其他改进：

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  if (workingNum==1)
    return;

  // Factor out factors >=5
  ulong nextOffset=2;
  char nextShift = 1;
  ulong n = 5;
  ulong nn = 25;
  do {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;

      // Test for done...
      if (workingNum==1)
        return;

      // Try n again
    }  
    else {
      nn += (n << (nextShift+1)) + (1<<(nextShift*2)); // (n+b)^2 = n^2 + 2*n*b + b*2
      n += nextOffset;
      nextOffset ^= 6;
      nextShift ^= 3;
      // invariant: nn == n*n
      if (n & 0x100000000LL) break; // careful of integer wraparound in n^2
    }
  } while (nn <= workingNum);

  // workingNum is prime, add it to the list of factors        
  factors.push_back(workingNum);
}

score 2 · Accepted Answer

自然的概括是使用比 2 和 3 更多的已知素数来预先计算跳页。比如 2、3、5、7、11，模式周期为 2310（嗯，不错的数字）。也许更多，但它的收益递减 - 运行时间图可以准确确定预计算开始产生负面影响的位置，但当然它取决于要考虑的数字数量......

哈，我把编码细节留给你们。:-)

干杯&hth.，

– 阿尔夫

score 2 · Accepted Answer

我不确定这些会有多有效，但不是

while (workingNum%2==0)

你可以做

while (workingNum & 1 == 0)

我不确定 gcc 或 msvc（或您正在使用的任何编译器）是否足够聪明以更改 workingNum%2 表达式，但很可能它正在执行除法并查看 edx 中的模数......

根据您的编译器，我的下一个建议可能完全没有必要，但您可以尝试将 workingNum /= 3 放在方法调用之前。G++ 可能足够聪明，可以看到不必要的除法，只需使用 eax 中的商（您也可以在更大的循环中执行此操作）。或者，一种更彻底（但很痛苦）的方法是内联汇编以下代码。

while (workingNum%3==0)
{
  factors.push_back(3);
  workingNum/=3;
}

编译器可能正在将模运算转换为除法，然后查看 edx 中的模数。问题是，您正在再次执行除法（我怀疑编译器是否看到您只是在循环条件下隐式执行了除法）。所以，你可以内联组装它。这提出了两个问题：

1) push_back(3) 的方法调用。这可能会弄乱寄存器，使这完全没有必要。

2) 获取workingNum 的寄存器，但这可以通过进行初始模块化检查（强制它进入%eax）来确定，或者在当前时刻，它将/应该在eax 中。

您可以将循环编写为（假设 workingNum 在 eax 中，这是 32 位 AT&T 语法，只是因为我不知道 64 位汇编或 Intel 语法）

asm( "
     movl     $3, %ebx
  WorkNumMod3Loop: 
     movl     %eax, %ecx    # just to be safe, backup workingNUm
     movl     $0, %edx      # zero out edx
     idivl    $3            # divide by 3. quotient in eax, remainder in edx
     cmpl     $0, %edx      # compare if it's 0
     jne      AfterNumMod3Loop    # if 0 is the remainder, jump out

     # no need to perform division because new workingNum is already in eax
     #factors.push_back(3) call

     je       WorkNumMod3Loop
  AfterNumMod3Loop: 
     movl     %ecx, %eax"
);

您应该查看这些循环的程序集输出。您的编译器可能已经在进行这些优化，但我对此表示怀疑。如果在某些情况下将 workingNum /= n 放在方法调用之前可以稍微提高性能，我不会感到惊讶。

c++ - 蛮力，单线程素数分解

7 回答 7

Related

Reference