User:Szha/Notes/CMSC714/note0904


coordination

processors working together to solve a problem

synchronization

  • protection of a single object (locks)
  • coordination of processors (barriers); both are sketched below

the size of the unit of work done by each processor; need to manage two issues:

  • load balance - processors should have equal amounts of work
  • coordination overhead - communication and synchronization

often called "grain" size (coarse vs. fine); with more processors it is harder to balance the workload
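
A minimal C sketch of these two primitives (my example, not from the notes) using POSIX threads: a mutex protects a single shared counter, and a barrier keeps any thread from moving on until every thread has finished its update. The thread count and the work done are arbitrary.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  static long counter = 0;                                  /* shared object */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */
  static pthread_barrier_t barrier;                         /* coordinates threads */

  static void *worker(void *arg)
  {
      long id = (long)arg;

      /* lock: only one thread at a time updates the shared counter */
      pthread_mutex_lock(&lock);
      counter += id;
      pthread_mutex_unlock(&lock);

      /* barrier: wait here until all NTHREADS threads have updated the counter */
      pthread_barrier_wait(&barrier);

      if (id == 0)
          printf("after the barrier, counter = %ld\n", counter);
      return NULL;
  }

  int main(void)
  {
      pthread_t t[NTHREADS];
      pthread_barrier_init(&barrier, NULL, NTHREADS);
      for (long i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, (void *)i);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&barrier);
      return 0;
  }

(Compile with something like cc -pthread; the same lock/barrier pattern shows up again in the shared memory weather code below.)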


sources of parallelism

statements

called "control parallel"
can perform a series of steps in parallel
basis of dataflow computers

loops

called "data parallel"
most common source of parallelism for most programs
each processor gets one (or more) iterations to perform

loop-carried dependencies between iterations make this hard (see the example below)
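
A short C illustration (mine, not from the notes) of the difference: the first loop's iterations are independent, so each processor can take some of them (written here with an OpenMP parallel for), while the second loop carries a dependency from one iteration to the next and cannot simply be split up.

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      static double a[N], b[N];

      /* data parallel: iterations are independent, so each processor
         can be given one (or more) iterations of this loop */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = 2.0 * i;

      /* not data parallel as written: iteration i needs the result of
         iteration i - 1, so the iterations cannot simply be handed out */
      b[0] = a[0];
      for (int i = 1; i < N; i++)
          b[i] = b[i - 1] + a[i];

      printf("%f %f\n", a[N - 1], b[N - 1]);
      return 0;
  }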


examples

easy

multiple independent jobs (different simulations)

scientific

  • dense linear algebra (divide up the matrix; requires coordination)
  • physical system simulations (divide physical space)

databases

Biggest success of parallel computing (divide tuples); exploits the semantics of relational algebra

AI

  • search problems (divide the search space)
  • pattern recognition and image processing (divide the image)


metrics in application performance

speedup

ratio of time on one node to time on n nodes
hold problem size fixed
should really compare to best serial time
goal is linear speedup
super-linear speedup is possible due to:
  • adding more memory/cache
  • search problems
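
A worked example with illustrative numbers (not from the notes): if a job takes 100 seconds on one node and 20 seconds on 8 nodes, speedup(8) = 100 / 20 = 5, versus an ideal linear speedup of 8.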

iso-speedup (or scaled speedup)

scale the data size up with the number of nodes
goal is a flat horizontal curve

Amdahl's law

f = parallel fraction of the work
s = speedup of the parallel part
t_m = (1 - f) + f/s, the time on the parallel machine with the serial time normalized to 1

speedup = 1 / ((1 - f) + f/s); as s -> infinity this approaches the maximum 1/(1 - f), i.e. 1/(serial fraction of the time)
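
A worked example with illustrative numbers (not from the notes): if f = 0.9 and s = 10, speedup = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3; even as s -> infinity, the speedup is capped at 1 / 0.1 = 10.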

computation to communication ratio

goal is to maximize this ratio
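
One way to see this (my illustration, anticipating the block decomposition discussed later): if each processor owns a cube of b x b x b grid points, it does work proportional to b^3 per step but only communicates the roughly 6b^2 points on the cube's faces, so the computation-to-communication ratio grows like b/6 - coarser grain means a better ratio.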


how to write parallel programs

use old serial code

compiler converts it to parallel code
called the "dusty deck" problem

serial language plus communication library

no compiler changes required
PVM and MPI use this approach
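
A minimal sketch of this approach in C with MPI (my example; the value sent and the message tag are arbitrary): ordinary serial C, with explicit library calls for all communication and no compiler changes. Run it with at least two processes (e.g. mpirun -np 2).

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      double value = 0.0;

      MPI_Init(&argc, &argv);                /* library calls only; no new language features */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 3.14;
          /* send one double to process 1 with message tag 0 */
          MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("process 1 received %f from process 0\n", value);
      }

      MPI_Finalize();
      return 0;
  }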

new language for parallel computing

requires all code to be re-written
hard to create a language that provides high performance on different platforms

hybrid approach -- old language(s), new constructs

HPF - add data distribution directives to the code
add parallel loops and synchronization operations


application example - weather

typical of many scientific codes

computes results for a three dimensional space
computes results at multiple time steps
uses equations to describe the physics/chemistry of the problem
grids are used to discretize continuous space
  • granularity of the grid is important to speed/accuracy

simplifications (for this example, not in the real code)

earth is flat (no mountains)
earth is perfectly round (really the poles are flattened and the earth bulges at the equator)
second-order properties are ignored


grid points

divide continuous space into discrete parts

for this code, the grid size is fixed and uniform
  • possible to change the grid size or use multiple grids
use a three dimensional grid
  • two dimensions for latitude and longitude
  • one for elevation
  • total of M * N * L points
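
A small C sketch (mine, with hypothetical dimensions) of how such an M * N * L grid is commonly stored and indexed; the flattening order is just one reasonable choice, not something the notes specify.

  #include <stdlib.h>

  /* hypothetical grid dimensions */
  #define M 64    /* latitude  */
  #define N 128   /* longitude */
  #define L 20    /* elevation */

  /* flatten grid point (i, j, k) into a single array index */
  static size_t idx(size_t i, size_t j, size_t k)
  {
      return (i * N + j) * L + k;
  }

  int main(void)
  {
      /* one value per grid point: M * N * L points in total */
      double *t = malloc(sizeof(double) * M * N * L);
      if (!t) return 1;

      for (size_t i = 0; i < M; i++)
          for (size_t j = 0; j < N; j++)
              for (size_t k = 0; k < L; k++)
                  t[idx(i, j, k)] = 0.0;   /* visit every grid point once */

      free(t);
      return 0;
  }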

design choice: where is the grid point?

left, right, or center of the interval for a grid element
in multiple dimensions the choices multiply (e.g., three choices per dimension gives 3 * 3 * 3 = 27 combinations in three dimensions)


variables

one dimensional

m - geo-potential (gravitational effects)

two dimensional

pi - shifted surface pressure
sigmadot - vertical component of the wind velocity

three dimensional (primary variables)

shared memory version

in each loop nest, iterations are independent
use a parallel for-loop for each loop nest (sketched below)
synchronize (barrier) after each loop nest
  • this is overly conservative, but works
  • could use a single sync variable per element, but that would incur excessive overhead
potential parallelism is M * N * L
private variables: D, i, j, k
advantages of shared memory
  • easier to get something working (ignoring performance)
hard to debug
  • other processors can modify shared data
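
A minimal OpenMP sketch in C of this pattern (my example; the array names, dimensions, and stencil are hypothetical, and D stands in for the notes' private temporary): one parallel loop nest with private loop indices, and the implicit barrier at the end of the parallel for provides the synchronization after the nest.

  #include <stdio.h>

  #define M 64
  #define N 128
  #define L 20

  static double u[M][N][L], unew[M][N][L];

  void step(void)
  {
      int i, j, k;
      double D;   /* per-thread temporary, as in the notes' private variable list */

      /* each processor gets some of the i iterations; i, j, k, D are private */
      #pragma omp parallel for private(j, k, D)
      for (i = 1; i < M - 1; i++)
          for (j = 1; j < N - 1; j++)
              for (k = 0; k < L; k++) {
                  D = u[i - 1][j][k] + u[i + 1][j][k]
                    + u[i][j - 1][k] + u[i][j + 1][k];
                  unew[i][j][k] = 0.25 * D;
              }
      /* implicit barrier here: every iteration finishes before the next loop nest runs */
  }

  int main(void)
  {
      u[M / 2][N / 2][L / 2] = 1.0;   /* arbitrary initialization */
      step();
      printf("%f\n", unew[M / 2 + 1][N / 2][L / 2]);
      return 0;
  }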

distributed memory version

decompose data to specific processors

assign a cube to each processor
  • maximize the volume to surface ratio
  • which minimizes the communication/computation ratio
called a <block,block,block> distribution (see the sketch below)
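
A small sketch (mine; the 2 x 2 x 2 processor grid and the block_start helper are hypothetical) of which sub-cube of the M * N * L grid each processor owns under a <block,block,block> distribution.

  #include <stdio.h>

  #define M 64
  #define N 128
  #define L 20

  /* hypothetical 2 x 2 x 2 arrangement of 8 processors */
  #define PX 2
  #define PY 2
  #define PZ 2

  /* first index owned by block b out of p blocks covering n points */
  static int block_start(int b, int p, int n) { return b * n / p; }

  int main(void)
  {
      for (int px = 0; px < PX; px++)
          for (int py = 0; py < PY; py++)
              for (int pz = 0; pz < PZ; pz++)
                  printf("proc (%d,%d,%d) owns i=[%d,%d) j=[%d,%d) k=[%d,%d)\n",
                         px, py, pz,
                         block_start(px, PX, M), block_start(px + 1, PX, M),
                         block_start(py, PY, N), block_start(py + 1, PY, N),
                         block_start(pz, PZ, L), block_start(pz + 1, PZ, L));
      return 0;
  }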

need to communicate {i,j,k}{+,-}{1,2} terms at boundaries

use send/receive to move the data
no need for barriers; the send/receive operations provide the synchronization
  • issue sends earlier in the computation to hide communication time (see the sketch below)
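
A sketch (mine) of moving boundary data with MPI send/receive, simplified to one ghost plane exchanged with each neighbor along a single dimension; a real <block,block,block> code would exchange faces (and, for the {+,-}2 terms, two planes) in all three dimensions. MPI_Sendrecv is used here for brevity; nonblocking MPI_Isend/MPI_Irecv would let the sends be issued early to overlap with computation, as the notes suggest.

  #include <mpi.h>
  #include <stdio.h>

  #define NX  16          /* local block size along the decomposed dimension */
  #define NYZ (32 * 10)   /* points in one boundary plane (hypothetical)     */

  /* local block plus one ghost plane on each side: planes 0 and NX + 1 */
  static double u[NX + 2][NYZ];

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      /* send my first interior plane left, receive my right ghost plane */
      MPI_Sendrecv(u[1],      NYZ, MPI_DOUBLE, left,  0,
                   u[NX + 1], NYZ, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* send my last interior plane right, receive my left ghost plane */
      MPI_Sendrecv(u[NX],     NYZ, MPI_DOUBLE, right, 1,
                   u[0],      NYZ, MPI_DOUBLE, left,  1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* ghost planes now hold the neighbors' boundary values; no barrier was needed */
      if (rank == 0) printf("halo exchange done on %d processes\n", size);

      MPI_Finalize();
      return 0;
  }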

advantage

easier to debug? maybe
consider data locality explicitly with data decomposition
  • better performance/scaling

problems

hard to get the code running