== coordination ==
processors working together to solve a problem
== synchronization ==
protection of a single object (locks)
coordination of processors (barriers)
== granularity ==
size of a unit of work done by a processor
need to manage two issues
:* load balance - processors have equal work
:* coordination overhead - communication and synchronization
often called "grain" size - coarse vs. fine
with more processors, it is harder to keep the workload balanced
== sources of parallelism ==
statements
: called "control parallel"
: can perform a series of steps in parallel
: basis of dataflow computers
loops
: called "data parallel"
: most common source of parallelism for most programs
: each processor gets one (or more) iterations to perform
: dependencies between loop iterations make parallelization harder
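To make the data-parallel loop idea concrete, here is a minimal C sketch (not from the original notes; the thread count, array size, and loop body are arbitrary illustration values) in which each thread is handed a contiguous chunk of independent iterations:
<pre>
/* Minimal data-parallel sketch: each thread gets a contiguous chunk
   of the loop iterations.  NTHREADS and N are illustrative values. */
#include <pthread.h>
#include <stdio.h>

#define N        100000
#define NTHREADS 4

static double a[N], b[N], c[N];

struct range { int lo, hi; };

static void *add_chunk(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)   /* iterations are independent */
        c[i] = a[i] + b[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * chunk;
        r[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, add_chunk, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)    /* join acts like a barrier */
        pthread_join(tid[t], NULL);

    printf("c[0] = %f\n", c[0]);
    return 0;
}
</pre>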
= examples =
== easy ==
multiple independent jobs (different simulations)
== scientific ==
dense linear algebra (divide up the matrix; requires coordination)
physical system simulations (divide physical space)
== databases ==
biggest success of parallel computing (divide tuples)
exploits semantics of relational algebra
== AI ==
search problems (divide search space)
pattern recognition and image processing (divide image)
= metrics in application performance =
== speedup ==
: ratio of time on one node to time on <i>n</i> nodes
: hold problem size fixed
: should really compare to best serial time
: goal is linear speedup
: super-linear speedup is possible due to:
::* adding more memory/cache
::* search problems
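: worked example (illustrative numbers): if a run takes 100 s on one node and 15 s on 8 nodes, speedup = 100/15 ≈ 6.7, compared to an ideal linear speedup of 8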
== iso-speedup (or scaled speedup) ==
: scale data size up with the number of nodes
: goal is a flat horizontal curve
== amdahl's law ==
: speedup = 1 / ((1-f) + f/s)
: as s -> infinity, the maximum speedup approaches 1/(1-f), i.e. one over the serial fraction of the time
: f = parallel fraction of the execution time
: s = speedup of the parallel part
: t_m = (1-f) + f/s (parallel execution time, normalized to a serial time of 1)
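: worked example (illustrative numbers): with f = 0.9 and s = 10, t_m = 0.1 + 0.9/10 = 0.19, so speedup = 1/0.19 ≈ 5.3; even as s -> infinity the speedup is capped at 1/(1-f) = 10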
== computation to communication ratio ==
: goal is to maximize this ratio
= how to write parallel programs =
== use old serial code ==
: compiler automatically converts it to parallel code
: called the dusty deck problem
== serial language plus communication library ==
: no compiler changes required
: PVM and MPI use this approach
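A minimal sketch of the library approach in C with MPI (the two-rank exchange and the value sent are illustrative, not from the notes); the program is ordinary serial C plus calls into the message-passing library, built with an MPI compiler wrapper such as mpicc and launched with mpirun:
<pre>
/* Minimal MPI sketch: serial C plus library calls, no compiler changes.
   Rank 0 sends one value to rank 1.  Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
</pre>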
== new language for parallel computing ==
: requires all code to be re-written
: hard to create a language that provides high performance on different platforms
== hybrid approach -- old language(s), new constructs ==
: HPF - adds data distribution directives to Fortran code
: add parallel loops and synchronization operations
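As an illustration of the hybrid idea in C (the notes mention HPF, which plays the same role for Fortran), OpenMP adds a parallel-loop construct to otherwise unchanged serial code; the function below is a sketch with made-up names:
<pre>
/* Illustrative only: an OpenMP directive added to ordinary C code.
   The loop itself is unchanged serial code; the pragma parallelizes it. */
void scale(double *x, int n, double alpha)
{
    #pragma omp parallel for          /* parallel loop construct */
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
    /* implicit barrier: all threads finish before the function returns */
}
</pre>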
= application example - weather =
== typical of many scientific codes ==
: computes results for a three-dimensional space
: computes results at multiple time steps
: uses equations to describe the physics/chemistry of the problem
: grids are used to discretize continuous space
::* granularity of the grid is important to both speed and accuracy
== simplifications (in this example, not in the real code) ==
: earth is flat (no mountains)
: earth is a perfect sphere (really it is flattened at the poles and bulges at the equator)
: second order properties are ignored
= grid points =
== divide continuous space into discrete parts ==
: for this code, grid size is fixed and uniform
::* possible to change grid size or use multiple grids
: use three dimensional grid
::* two for latitude and longitude
::* one for elevation
::* total of M * N * L points
== design choice: where is the grid point? ==
: left, right, or center of the interval for a grid element
: in multiple dimensions the number of choices multiplies
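A small sketch (illustrative values and names, not from the notes) of how an M x N x L grid can be stored in one flat array and how a cell-centered coordinate is computed:
<pre>
/* Illustrative layout for an M x N x L grid stored in one flat array,
   with values taken at the center of each grid interval.
   M, N, L and the domain extents are arbitrary example values. */
#include <stdlib.h>

#define M 64   /* latitude  */
#define N 64   /* longitude */
#define L 16   /* elevation */

/* flat index for grid point (i, j, k) */
static inline int idx(int i, int j, int k)
{
    return (i * N + j) * L + k;
}

/* coordinate of the center of interval i when an axis of width
   "extent" is divided into "cells" equal intervals */
static inline double center(int i, double extent, int cells)
{
    return (i + 0.5) * (extent / cells);
}

int main(void)
{
    double *grid = malloc(sizeof(double) * M * N * L);  /* M*N*L points */
    grid[idx(3, 5, 2)] = center(3, 180.0, M);           /* example use  */
    free(grid);
    return 0;
}
</pre>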
= variables =
== one dimensional ==
: m - geo-potential (gravitational effects)
== two dimensional ==
: pi - shifted surface pressure
: sigmadot - vertical component of the wind velocity
== three dimensional (primary variables) ==
= shared memory version =
: in each loop nest, iterations are independent
: use a parallel for-loop for each loop nest (see the sketch below)
: synchronize (barrier) after each loop nest
::* this is overly conservative, but works
::* could use a single sync variable per element, but would incur excessive overhead
: potential parallelism is M * N * L
: private variables: D, i, j, k
: advantages of shared memory
::* easier to get something working (ignoring performance)
: disadvantage: hard to debug
::* other processors can modify shared data
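A sketch of the shared-memory structure described above, written with OpenMP in C; the array names and loop body are placeholders, not the real weather physics, but the private variables (i, j, k, D) and the barrier after the loop nest follow the notes:
<pre>
/* Sketch of one parallelized loop nest from the shared-memory version.
   The loop indices and the temporary D are private to each thread;
   the grid arrays u and v are shared.  The update shown is a
   placeholder, not the real weather physics. */
#define IDX(i, j, k) ((((i) * N) + (j)) * L + (k))

void step(double *u, const double *v, int M, int N, int L)
{
    int i, j, k;
    double D;

    #pragma omp parallel for private(i, j, k, D)
    for (i = 0; i < M; i++)               /* iterations are independent */
        for (j = 0; j < N; j++)
            for (k = 0; k < L; k++) {
                D = v[IDX(i, j, k)];      /* per-thread temporary */
                u[IDX(i, j, k)] = 0.5 * D;
            }
    /* implicit barrier at the end of the parallel for: every thread
       finishes this loop nest before the next loop nest starts */
}
</pre>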
= distributed memory version =
== decompose data to specific processors ==
: assign a cube of grid points to each processor
::* maximize volume to surface ratio
::* which minimizes communication/computation ratio
: called a <block,block,block> distribution
== need to communicate {i,j,k}{+,-}{1,2} terms at boundaries ==
: use send/receive to move the data
: no need for barriers; send/receive operations provide the synchronization
::* do sends earlier in the computation to hide communication time (see the sketch below)
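A sketch of the boundary (halo) exchange in one dimension using nonblocking MPI calls (neighbor ranks, buffer names, and counts are assumed for illustration); posting the sends and receives early lets interior computation overlap the communication:
<pre>
/* Sketch of a one-dimensional boundary (halo) exchange with nonblocking
   MPI calls; "left" and "right" are the neighbor ranks and "count" is
   the number of boundary values.  Real code would repeat this for every
   face of the <block,block,block> decomposition. */
#include <mpi.h>

void exchange(double *send_lo, double *send_hi,
              double *recv_lo, double *recv_hi,
              int count, int left, int right)
{
    MPI_Request req[4];

    /* post receives and sends early ... */
    MPI_Irecv(recv_lo, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_hi, count, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_lo, count, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_hi, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ... interior points could be computed here to hide communication ... */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);  /* completion = synchronization */
}
</pre>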
== advantage ==
: easier to debug? maybe
: consider data locality explicitly with data decomposition
::* better performance/scaling
== problems ==
: hard to get the code running