User:Szha/Notes/CMSC714/note0904


coordination

processors working together to solve a problem

synchronization

protection of a single object (locks)
coordination of processors (barriers)
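
A minimal sketch of the two primitives using POSIX threads (the thread count and the shared counter are just for illustration): the lock serializes updates to one shared object, and the barrier makes every thread wait until all have arrived.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock: protects one shared object */
  pthread_barrier_t barrier;                         /* barrier: coordinates all threads */
  long counter = 0;

  void *worker(void *arg) {
      (void)arg;
      pthread_mutex_lock(&lock);       /* only one thread updates the counter at a time */
      counter++;
      pthread_mutex_unlock(&lock);

      pthread_barrier_wait(&barrier);  /* no thread continues until all have reached here */
      return NULL;
  }

  int main(void) {
      pthread_t t[NTHREADS];
      pthread_barrier_init(&barrier, NULL, NTHREADS);
      for (int i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      printf("counter = %ld\n", counter);
      pthread_barrier_destroy(&barrier);
      return 0;
  }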

size of the unit of work done by a processor -- need to manage two issues

  • load balance - processors have equal work
  • coordination overhead - communication and synchronization

often called "grain" size -- coarse vs. fine
with more processors it is harder to balance the workload


sources of parallelism

statements

called "control parallel"
can perform a series of steps in parallel
basis of dataflow computers
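
A small control-parallel sketch with OpenMP sections (the two step functions are placeholders, not from any real code); each section is an independent statement that may run on a different processor.

  #include <stdio.h>

  static void step_a(void) { printf("step A\n"); }   /* placeholder independent step */
  static void step_b(void) { printf("step B\n"); }   /* placeholder independent step */

  int main(void) {
      /* the two sections do not depend on each other, so they may run
         concurrently (compile with -fopenmp; runs serially otherwise) */
      #pragma omp parallel sections
      {
          #pragma omp section
          step_a();

          #pragma omp section
          step_b();
      }
      return 0;
  }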

loops

called "data parallel"
most common source of parallelism for most programs
each processor gets one (or more) iterations to perform

dependencies between iterations make this hard
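
A data-parallel loop sketch in C with OpenMP (array size and contents are made up): each processor gets a chunk of the iterations, which is only safe because iteration i touches nothing but element i.

  #include <stdio.h>

  #define N 1000000

  static double a[N], b[N], c[N];

  int main(void) {
      /* independent iterations: c[i] reads only a[i] and b[i], so the
         iteration space can be divided among processors
         (compile with -fopenmp to enable the pragma) */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      printf("c[0] = %f\n", c[0]);
      return 0;
  }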


examples

easy

multiple independent jobs (different simulations)

scientific

dense linear algebra (divide up the matrix; requires coordination)
physical system simulations (divide up physical space)

databases

biggest success of parallel computing (divide up the tuples)
exploits the semantics of relational algebra

AI

search problems (divide up the search space)
pattern recognition and image processing (divide up the image)


metrics in application performance

speedup

ratio of the time on one node to the time on n nodes (see the example below)
hold problem size fixed
should really compare to best serial time
goal is linear speedup
super-linear speedup is possible due to:
  • adding more memory/cache
  • search problems
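
A quick numeric illustration with made-up timings: a run that takes 100 s on one node and 14 s on 8 nodes gives

  speedup(8) = T(1) / T(8) = 100 / 14, about 7.1   (linear speedup would be 8)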

iso-speedup (or scaled speedup)

scale the data size up with the number of nodes
goal is a flat horizontal curve

amdahl's law

f = parallel fraction of the time
s = speedup of the parallel part
t_m = (1 - f) + f/s  (execution time with parallelism, serial time normalized to 1)
speedup = 1 / t_m = 1 / (1 - f + f/s)
max speedup is 1 / (1 - f), i.e. 1 / (serial fraction of time), as s -> infinity
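
A worked instance with illustrative numbers: if f = 0.95 of the time is parallelizable and the parallel part is sped up by s = 100,

  t_m = (1 - 0.95) + 0.95/100 = 0.05 + 0.0095 = 0.0595
  speedup = 1 / 0.0595, about 16.8
  limit as s -> infinity: 1 / (1 - 0.95) = 20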

computation to communication ratio

goal is to maximize this ratio


how to write parallel programs

use old serial code

compiler converts it to parallel code
called the dusty deck problem

serial language plus communication library

no compiler changes required
PVM and MPI use this approach
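
A minimal sketch of the library approach with MPI (the payload value and tag are arbitrary): the language and compiler are unchanged, and all parallelism goes through library calls. Run with at least two processes, e.g. mpirun -np 2 ./a.out.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                       /* arbitrary payload */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }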

new language for parallel computing

requires all code to be re-written
hard to create a language that provides high performance on different platforms

hybrid approach -- old language(s), new constructs

HPF - add data distribution commands to code
add parallel loops and synchronization operations


application example - weather

typical of many scientific codes

computes results for a three dimensional space
computes results at multiple time steps
uses equations to describe physics/chemistry of the problem
grids are used to discretize continuous space
  • granularity of the grids is important for speed/accuracy

simplifications (for this example; not in the real code)

earth's surface is flat (no mountains)
earth is round (in reality the poles are flattened and the earth bulges at the equator)
second order properties are ignored


grid points

divide continuous space into discrete parts

for this code, grid size is fixed and uniform
  • possible to change grid size or use multiple grids
use three dimensional grid
  • two for latitude and longitude
  • one for elevation
  • total of M * N * L points (see the indexing sketch below)

design choice: where is the grid point?

left, right, or center of the interval for a grid element
in multiple dimensions the number of choices multiplies
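
A sketch of how one M * N * L grid variable might be stored and indexed in C (the sizes and the flat row-major layout are assumptions, not the course code):

  #include <stdio.h>
  #include <stdlib.h>

  /* illustrative grid dimensions: latitude, longitude, elevation */
  enum { M = 64, N = 128, L = 20 };

  /* flat index of grid point (i, j, k) */
  static size_t idx(size_t i, size_t j, size_t k) {
      return (i * N + j) * L + k;
  }

  int main(void) {
      double *u = calloc((size_t)M * N * L, sizeof *u);   /* one variable on the grid */
      if (!u) return 1;

      u[idx(3, 5, 2)] = 1.0;                /* value at grid point (3, 5, 2) */
      printf("%f\n", u[idx(3, 5, 2)]);

      free(u);
      return 0;
  }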


variables

one dimensional

m - geo-potential (gravitational effects)

two dimensional

pi - shifted surface pressure
sigmadot - vertical component of the wind velocity

three dimensional (primary variables)

shared memory version

in each loop nest, iterations are independent
use a parallel for-loop for each loop nest (sketched below)
synchronize (barrier) after each loop nest
  • this is overly conservative, but works
  • could use a single sync variable per element, but would incur excessive overhead
potential parallelism is M * N * L
private variables: D, i, j, k
advantages of shared memory
  • easier to get something working (ignoring performance)
disadvantage: hard to debug
  • other processors can modify shared data
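
A minimal shared memory sketch of one loop nest in OpenMP (the update itself is a stand-in for the real physics, and the sizes are made up): the parallel for divides the M * N * L iterations among processors, the private clause covers D, j, k as in the notes, and the implicit barrier at the end of the loop is the conservative per-nest synchronization.

  #include <stdlib.h>

  enum { M = 64, N = 128, L = 20 };
  #define IDX(i, j, k) (((i) * N + (j)) * L + (k))

  /* one parallel loop nest: every iteration is independent */
  void step(double *u, const double *v) {
      double D;
      int i, j, k;
      #pragma omp parallel for private(D, j, k)
      for (i = 0; i < M; i++)
          for (j = 0; j < N; j++)
              for (k = 0; k < L; k++) {
                  D = v[IDX(i, j, k)];          /* stand-in for the real update */
                  u[IDX(i, j, k)] = 0.5 * D;
              }
      /* implicit barrier here: all iterations finish before the next loop nest */
  }

  int main(void) {
      double *u = calloc((size_t)M * N * L, sizeof *u);
      double *v = calloc((size_t)M * N * L, sizeof *v);
      if (!u || !v) return 1;
      step(u, v);
      free(u);
      free(v);
      return 0;
  }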

distributed memory version

decompose data to specific processors

assign a cube to each processor
  • maximize volume to surface ratio
  • which minimizes communication/computation ratio
called a <block,block,block> distribution

need to communicate {i,j,k}{+,-}{1,2} terms at boundaries

use send/receive to move the data (sketched below)
no need for barriers, send/receive operations provide sync
  • do sends earlier in computation to hide communication time
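
A much reduced sketch of the boundary exchange in MPI, in one dimension only (ranks, tags, and the ghost-cell layout are assumptions): each processor trades a boundary value with its neighbors, and the send/receive pair itself provides the synchronization. Nonblocking sends (MPI_Isend) posted earlier in the computation would hide communication time, as the point above suggests.

  #include <mpi.h>

  /* local[0] and local[n+1] are ghost cells; local[1..n] is owned data */
  void halo_exchange(double *local, int n, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      /* send first owned cell left, receive right neighbor's into the right ghost */
      MPI_Sendrecv(&local[1],     1, MPI_DOUBLE, left,  0,
                   &local[n + 1], 1, MPI_DOUBLE, right, 0,
                   comm, MPI_STATUS_IGNORE);
      /* send last owned cell right, receive left neighbor's into the left ghost */
      MPI_Sendrecv(&local[n], 1, MPI_DOUBLE, right, 0,
                   &local[0], 1, MPI_DOUBLE, left,  0,
                   comm, MPI_STATUS_IGNORE);
  }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      enum { NLOCAL = 8 };                  /* made-up local slab size */
      double local[NLOCAL + 2];             /* owned cells plus two ghost cells */
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (int i = 0; i < NLOCAL + 2; i++)
          local[i] = rank;                  /* fill with something recognizable */

      halo_exchange(local, NLOCAL, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }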

advantage

easier to debug? maybe
consider data locality explicitly with data decomposition
  • better performance/scaling

problems

hard to get the code running