Parallel Numerical Methods
A number of specific numerical problems arise in parallel programming, including use of halos in computation, parallel linear algebra, long range forces, and random number generation.
(To read this with no frames type this location into your web browser:
training/numan_par.html)High Performance Computing
High Performance Computing (HPC) is an engineering discipline which develops the technology (hardware, algorithms, and software) to deliver cost-effective computational results.
Parallel Numerical Methods
Selection of the appropriate algorithm often requires some knowledge of the likely platform to be used.
Bibliography
http://www.cscs.ch/Official/TechReports/1996/TR-96-08.ps.gz and http://random.mat.sbg.ac.at/news/
Parallel Computing
A parallel computer is a set of processors that work together to solve a computational problem. But first: get to know the beast.
Reasons to use a parallel machine
Why use a large number of processors?
What sort of parallel machine?
A generic parallel computer consists of a number of processors connected together.
However this covers a multitude of sins:
Increasingly we need to focus on "cost-effective" use of computational resources. It may not be cost effective to use an expensive parallel machine with a fast interconnection network to perform independent simulations on each node if there a cluster of machines available which will perform the same task.
Scalability
The card example demonstrates:
With 6 decks and 6 people, better to simply give each person a deck: "task farming".
With 1000 decks and 6 people, we would need a queue: "scheduling" (or hungry puppies). Would each person take a single deck, or several decks at each visit to the supervisor? What if the people worked at different speeds?
Speedup
(1) |
In parallel processing "superlinear" effects may occur when a speedup of >P is obtained on P processors. This may be for a number of reasons
Speedup is of academic interest. It is possible to obtain perfect speedup curves with a poor parallel algorithm if the single node time is slow.
Speed is what counts. When do I get the answer? How large a simulation can I run overnight?
Part of the job of a scheduler is to maximise the number of computational tasks to run subject to the constraints imposed by the user. This is a non-trivial task.
General Tricks
Partitioning
Divide up the work into a large number of tasks.
Communication
How and when do the tasks need to communicate with other? Do they communicate globally, or locally?
Agglomeration
Gather the tasks together. Can tasks which need to communicate often be brought together? Are there load balancing issues?
Mapping
Map the tasks onto the processors.
Data Layout
In this simple example, we have 16 tasks and 4 (and 8) processors.
Column |
Row |
Block |
Cyclic |
|||||||||||||||
1 |
2 |
3 |
4 |
1 |
1 |
1 |
1 |
1 |
1 |
2 |
2 |
1 |
2 |
3 |
4 |
|||
1 |
2 |
3 |
4 |
2 |
2 |
2 |
2 |
1 |
1 |
2 |
2 |
5 |
6 |
7 |
8 |
|||
1 |
2 |
3 |
4 |
3 |
3 |
3 |
3 |
3 |
3 |
4 |
4 |
1 |
2 |
3 |
4 |
|||
1 |
2 |
3 |
4 |
4 |
4 |
4 |
4 |
3 |
3 |
4 |
4 |
5 |
6 |
7 |
8 |
Halos
A number of tricks exist to reduce the communication in addition to the agglomeration. In this case some of the data is replicated in a "buffer zone" to decrease the amount of communication required.
1 |
1 |
1,2 |
1,2 |
2 |
2 |
1 |
1 |
1,2 |
1,2 |
2 |
2 |
1,3 |
1,3 |
1,2,3,4 |
1,2,3,4 |
2,4 |
2,4 |
1,3 |
1,3 |
1,2,3,4 |
1,2,3,4 |
2,4 |
2,4 |
3 |
3 |
3,4 |
3,4 |
4 |
4 |
3 |
3 |
3,4 |
3,4 |
4 |
4 |
Other Layouts
1 |
1 |
1 |
1 |
1 |
1 |
1 |
2 |
2 |
2 |
2 |
2 |
1 |
2 |
3 |
3 |
3 |
3 |
1 |
2 |
3 |
4 |
4 |
4 |
1 |
2 |
3 |
4 |
5 |
5 |
1 |
2 |
3 |
4 |
5 |
6 |
This one suffers from load balancing problems
Techniques
Parallel Linear Algebra
Good libraries exist for most common linear algebra tasks.
(see references in
training/package.html )The hardest part is that the optimal layout for one part of a complex piece of code may be the worst in another part. Compromises have to be made!
Though it is bad practice, this sometimes means a global data transfer midway through a run, or replicated data sets on each node.
Time complexity of algorithms
Scalable parallel computing requires O(N) algorithms (where N is some number of grid points, for example).
This is because the most general constraint on running a code is how long you are prepared to wait for results. Given a larger number of nodes, this usually remains the same, and so the question becomes "how large a problem can I now run given the additional nodes."
An O(N2) algorithm would only solve a problem twice as large given 4 nodes, if it were 100% efficient.
Long Range Forces
A naïve implementation of an N body problem scales with N2. Techniques exist which scale with O(N log N) and O(N).
http://www.epcc.ed.ac.uk/~mario/nbody.htmlComputational Geometry
Considerable effort has been put into designing efficient O(N log N) and O(N) algorithms to perform e.g. convex hull, Delaunay triangulation.
O’ Rourke, J., 1993. Computational Geometry in C. Cambridge: Cambridge University Press.
Parallel Random Number Generation
Pseudo-Random number generators exhibits small fluctuations from truly random behaviour.
A random number generator typically has a seed, from which subsequent random numbers are generated.
Generating random numbers on parallel machines requires:
The previous generation of random number generators does not measure up on most of these accounts.
The new generation of generators, pioneered by Marsaglia and others meet all of these requirements.
It is worth noting that the new generation of random number generators were only designed to fix the quality issue on sequential processors, but happen simultaneously to fix all the problems of parallel random number generation. Oh yes, and they are also faster by up to a factor of 2 (or more) than other generators. There may not be such a thing as a free lunch, but…
A new generation random number generator typically has an immense period, a seed vector, from which any number in the sequence may be precomputed.
As an example of the incredible periods available recall that those in Numerical Recipes in Fortran 77 have a period of 2^31-1 (~2e9). Generators now exist with a period of 2^19937 - 1, a number with about 6000 decimal digits; the 32-bit random numbers exhibit best possible equidistribution properties in dimensions up to 623. A DEC Alpha Station would require about 10^5983 millennia to exhaust the period.
When any number in the sequence can be precomputed in advance, parallelism is trivial. Simply start each processor on the random number m p / P, where p = processor number p of P in total, m = period.
Summary
Parallel programming for scientific and engineering problems is (or should be) "the norm."
Parallel computing can be summed up by single node performance and scalability.
Last updated 10-Mar-98. Maintained by SJ Cox