
I'm impressed with Intel Threading Building Blocks. I like how it has me write task code rather than thread code, and I like how it works, given my limited understanding of it: tasks go into a pool, so you won't get 100 threads on 4 cores; a task isn't guaranteed to run, because it doesn't sit on its own thread and may be waiting deep in the pool. But a task may run together with another related task, so you can't do the bad things that typically thread-unsafe code does.

I'd like to learn more about writing tasks. I like the "Task-based Multithreading - How to Program for 100 Cores" video here: http://www.gdcvault.com/sponsor.php?sponsor_id=1 (currently the second-to-last link; fair warning, it's not "great"). My favorite part is "mazes are better solved in parallel", at around the 48-minute mark (you can click the links on the left; that part is really all you need to watch, if anything).

But I'd like to see more code examples, as well as some APIs for writing tasks. Does anyone have a good resource? I have no idea how a class or a piece of code might look after pushing it onto a pool, or how odd the code gets when you need to make a copy of everything, and how much of everything ends up being pushed onto the pool.


2 Answers


Java has a parallel task framework similar to Threading Building Blocks: the Fork-Join framework. It's available for use with the current Java SE 6 and is slated for inclusion in the upcoming Java SE 7.

There are resources available for getting started with the framework, in addition to the javadoc class documentation. The jsr166 page mentions that "There is also a wiki containing additional documentation, notes, advice, examples, and so on for these classes."

The fork-join examples, such as matrix multiplication, are a good place to start.

I used the fork-join framework to solve some of Intel's 2009 threading challenges. The framework is lightweight and low-overhead; mine was the only Java entry for the Knight's Tour problem, and it outperformed other entries in the competition. The Java sources and writeup are available for download from the challenge site.

EDIT:

I have no idea how a class or pieces of code may look after pushing it onto a pool [...]

You can make your own task by subclassing one of the ForkJoinTask subclasses, such as RecursiveTask. Here's how to compute the Fibonacci sequence in parallel. (Taken from the RecursiveTask javadocs; comments are mine.)

 // declare a new task that itself spawns subtasks.
 // The task returns an Integer result.
 class Fibonacci extends RecursiveTask<Integer> {
   final int n;      // the n'th number in the fibonacci sequence to compute
   Fibonacci(int n) { this.n = n; } // constructor
   protected Integer compute() {  // this method is the main work of the task
     if (n <= 1)         // 1 or 0, base case to end recursion
        return n;
     Fibonacci f1 = new Fibonacci(n - 1);  // create a new task to compute n-1
     f1.fork();                            // schedule to run asynchronously
     Fibonacci f2 = new Fibonacci(n - 2);  // create a new task to compute n-2
     return f2.invoke() + f1.join();       // wait for both tasks to compute
       // f2 is run as part of this task, f1 runs asynchronously.
       // (You could create two separate tasks and wait for them both, but running
       // f2 as part of this task is a little more efficient.)
   }
 }

You then run this task and get the result:

// default parallelism is the number of cores
ForkJoinPool pool = new ForkJoinPool();
Fibonacci f = new Fibonacci(30); // note: 100 would overflow int and spawn an impractical number of tasks
int result = pool.invoke(f);

This is a trivial example to keep things simple. In practice, performance would not be so good, since the work executed by each task is trivial compared to the overhead of the task framework. As a rule of thumb, a task should perform some significant computation - enough to make the framework overhead insignificant, yet not so much that one core ends up running one large task at the end of the problem. Splitting large tasks into smaller ones ensures that one core isn't left doing lots of work while other cores sit idle; smaller tasks keep more cores busy, but tasks shouldn't be so small that they do no real work.
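To make that granularity rule concrete, here is a sketch of the common pattern: split the problem until a chunk falls below a cutoff, then do the work sequentially in the task itself. The class name and the THRESHOLD value are mine, not from the answer, and the cutoff is a tuning knob you'd profile for your own workload.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: sum a long[] by splitting until chunks are
// large enough to amortize the framework overhead.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // tuning knob: profile to pick it
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // small enough: do the real work directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                          // run the left half asynchronously
        SumTask right = new SumTask(data, mid, hi);
        return right.compute() + left.join(); // compute the right half in this task
    }
}
```

As in the Fibonacci example, one half is computed in the current task rather than forked, which avoids scheduling one task per split.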

[...] or how weird code may look when you need to make a copy of everything and how much of everything is pushed onto a pool.

Only the tasks themselves are pushed into a pool. Ideally you don't want to be copying anything: to avoid interference and the need for locking, which would slow down your program, your tasks should ideally work with independent data. Read-only data can be shared amongst all tasks and doesn't need to be copied. If threads need to co-operate in building some large data structure, it's best they build the pieces separately and then combine them at the end. The combining can be done as a separate task, or each task can add its piece of the puzzle to the overall solution. This often does require some form of locking, but it's not a considerable performance issue if the work of the task is much greater than the work of updating the solution. My Knight's Tour solution takes this approach to update a common repository of tours on the board.
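One way to avoid copying and locking altogether is to give each task its own disjoint slice of a shared output array, with the input shared read-only. A hypothetical sketch (the class and all names are mine):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical example: each task writes only to its own [lo, hi) slice of
// `out`, so no locking or copying is needed. `in` is shared read-only.
class SquareTask extends RecursiveAction {
    private static final int THRESHOLD = 1_000;
    private final int[] in;   // read-only shared input
    private final int[] out;  // shared output; slices are disjoint per task
    private final int lo, hi;

    SquareTask(int[] in, int[] out, int lo, int hi) {
        this.in = in; this.out = out; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++) out[i] = in[i] * in[i];
            return;
        }
        int mid = (lo + hi) >>> 1;
        invokeAll(new SquareTask(in, out, lo, mid),   // fork both halves
                  new SquareTask(in, out, mid, hi));  // and wait for both
    }
}
```

Because no two tasks ever touch the same index of `out`, there is no interference and nothing to lock.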

Working with tasks and concurrency is quite a paradigm shift from regular single-threaded programming. There are often several possible designs for solving a given problem, but only some of them will be suitable for a threaded solution. It can take a few attempts to get a feel for how to recast familiar problems in a multi-threaded way. The best way to learn is to look at the examples, and then try it for yourself. Always profile, and measure the effects of varying the number of threads. You can explicitly set the number of threads (cores) to use in the pool constructor. When tasks are broken up linearly, you can expect near-linear speedup as the number of threads increases.
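As a minimal illustration of setting the thread count explicitly (class and variable names are mine), fixing the parallelism makes scaling experiments with 1, 2, 4, ... threads repeatable on any machine:

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizes {
    public static void main(String[] args) {
        // No-arg constructor sizes the pool to the number of available cores.
        ForkJoinPool defaultPool = new ForkJoinPool();
        // The int argument fixes the worker-thread count explicitly.
        ForkJoinPool twoThreads = new ForkJoinPool(2);
        System.out.println("default: " + defaultPool.getParallelism()
                + ", explicit: " + twoThreads.getParallelism());
        defaultPool.shutdown();
        twoThreads.shutdown();
    }
}
```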

Answered 2010-05-29T15:11:51.683

Using a "framework" that claims to solve the unsolvable (optimal task scheduling is NP-hard) won't help you at all; reading books and articles on concurrent algorithms will. So-called "tasks" are nothing more than a fancy name for defining the separability of a problem (the parts that can be computed independently of one another). The class of separable problems is very small, and they are already covered in old books.

For non-separable problems, you have to plan phases, and data barriers between phases, to exchange data. The optimal orchestration of data barriers for simultaneous data exchange is not just NP-hard; it is in principle impossible to solve in a general way: you would have to examine the histories of all possible interleavings, which is like taking the power set of an already exponential set (like going from N to R in math). The reason I mention this is to make clear that no software can do it for you, that how to do it depends intrinsically on the actual algorithm, and that it makes or breaks the feasibility of parallelization (even when parallelization is theoretically feasible).

When you get to high degrees of parallelism, you can't even maintain a queue, and you no longer even have a memory bus: imagine 100 CPUs trying to synchronize on a single shared int, or trying to arbitrate the memory bus. You have to plan and pre-configure everything that will run, and essentially prove it correct on a whiteboard. Intel's Threading Building Blocks is a small kid in that world. It works for small core counts that can still share a memory bus. Running separable problems is a no-brainer, and you can do it without any "framework".

So, again, you have to read up on as many different parallel algorithms as you can. It usually takes 1-3 years of research to find a near-optimal data-barrier layout for a single problem. Once you go past about 16 cores on a single chip, it becomes a layout problem, since only first neighbors can exchange data efficiently (within one data-barrier cycle). So you will actually learn more by looking at CUDA, and at the papers and results for IBM's experimental 30-core CPU, than from Intel's sales pitch or some Java toy.

Beware of demo problems whose wasted resources (number of cores and amount of memory) are much larger than the speedup they achieve. If it takes 4 cores and 4x the RAM to solve a problem 2x faster, the solution doesn't scale with parallelization.
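That last point is the standard parallel-efficiency measure: speedup divided by processor count. A trivial sketch of the arithmetic for the numbers above (names are mine):

```java
public class Efficiency {
    public static void main(String[] args) {
        // Parallel efficiency = speedup / processors.
        // The example above: 2x speedup on 4 cores means only half
        // of the added hardware is doing useful work.
        double speedup = 2.0;
        double processors = 4.0;
        double efficiency = speedup / processors;
        System.out.println("efficiency = " + efficiency);
    }
}
```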

Answered 2010-08-27T12:22:07.030