User:Szha/Notes/CMSC714/note0904


coordination

processors working together to solve a problem

synchronization

protection of a single object (locks)
coordination of processors (barriers)
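
A minimal sketch of the two primitives using POSIX threads (the thread count and the shared counter are just for illustration): the lock serializes updates to one shared object, and the barrier makes every thread wait until all have arrived.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock: protects one shared object */
  pthread_barrier_t barrier;                         /* barrier: coordinates all threads */
  long counter = 0;

  void *worker(void *arg) {
      (void)arg;
      pthread_mutex_lock(&lock);       /* only one thread updates the counter at a time */
      counter++;
      pthread_mutex_unlock(&lock);

      pthread_barrier_wait(&barrier);  /* no thread continues until all have reached here */
      return NULL;
  }

  int main(void) {
      pthread_t t[NTHREADS];
      pthread_barrier_init(&barrier, NULL, NTHREADS);
      for (int i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      printf("counter = %ld\n", counter);
      pthread_barrier_destroy(&barrier);
      return 0;
  }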

size of the unit of work done by a processor -- need to manage two issues

  • load balance - processors have equal work
  • coordination overhead - communication and synchronization

often called "grain" size -- coarse vs. fine
with more processors it is harder to balance the workload


sources of parallelism

statements

called "control parallel"
can perform a series of steps in parallel
basis of dataflow computers
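
A small control-parallel sketch with OpenMP sections (the two step functions are placeholders, not from any real code); each section is an independent statement that may run on a different processor.

  #include <stdio.h>

  static void step_a(void) { printf("step A\n"); }   /* placeholder independent step */
  static void step_b(void) { printf("step B\n"); }   /* placeholder independent step */

  int main(void) {
      /* the two sections do not depend on each other, so they may run
         concurrently (compile with -fopenmp; runs serially otherwise) */
      #pragma omp parallel sections
      {
          #pragma omp section
          step_a();

          #pragma omp section
          step_b();
      }
      return 0;
  }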

loops

called "data parallel"
most common source of parallelism for most programs
each processor gets one (or more) iterations to perform

dependencies between iterations make this hard
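
A data-parallel loop sketch in C with OpenMP (array size and contents are made up): each processor gets a chunk of the iterations, which is only safe because iteration i touches nothing but element i.

  #include <stdio.h>

  #define N 1000000

  static double a[N], b[N], c[N];

  int main(void) {
      /* independent iterations: c[i] reads only a[i] and b[i], so the
         iteration space can be divided among processors
         (compile with -fopenmp to enable the pragma) */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      printf("c[0] = %f\n", c[0]);
      return 0;
  }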


examples

easy

multiple independent jobs (different simulations)

scientific

dense linear algebra (divide up the matrix; requires coordination)
physical system simulations (divide up physical space)

databases

biggest success of parallel computing (divide up the tuples)
exploits the semantics of relational algebra

AI

search problems (divide up the search space)
pattern recognition and image processing (divide up the image)


metrics in application performance

speedup

ratio of the time on one node to the time on n nodes (see the example below)
hold problem size fixed
should really compare to best serial time
goal is linear speedup
super-linear speedup is possible due to:
  • adding more memory/cache
  • search problems
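
A quick numeric illustration with made-up timings: a run that takes 100 s on one node and 14 s on 8 nodes gives

  speedup(8) = T(1) / T(8) = 100 / 14, about 7.1   (linear speedup would be 8)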

iso-speedup (or scaled speedup)

scale the data size up with the number of nodes
goal is a flat horizontal curve

amdahl's law

f = parallel fraction of the time
s = speedup of the parallel part
t_m = (1 - f) + f/s  (execution time with parallelism, serial time normalized to 1)
speedup = 1 / t_m = 1 / (1 - f + f/s)
max speedup is 1 / (1 - f), i.e. 1 / (serial fraction of time), as s -> infinity
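
A worked instance with illustrative numbers: if f = 0.95 of the time is parallelizable and the parallel part is sped up by s = 100,

  t_m = (1 - 0.95) + 0.95/100 = 0.05 + 0.0095 = 0.0595
  speedup = 1 / 0.0595, about 16.8
  limit as s -> infinity: 1 / (1 - 0.95) = 20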

computation to communication ratio

goal is to maximize this ratio


how to write parallel programs

use old serial code

compiler converts it to parallel code
called the dusty deck problem

serial language plus communication library

no compiler changes required
PVM and MPI use this approach
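
A minimal sketch of the library approach with MPI (the payload value and tag are arbitrary): the language and compiler are unchanged, and all parallelism goes through library calls. Run with at least two processes, e.g. mpirun -np 2 ./a.out.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                       /* arbitrary payload */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }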

new language for parallel computing

requires all code to be re-written
hard to create a language that provides high performance on different platforms

hybrid approach -- old language(s), new constructs

HPF - add data distribution commands to code
add parallel loops and synchronization operations


application example - weather

typical of many scientific codes

computes results for a three dimensional space
computes results at multiple time steps
uses equations to describe physics/chemistry of the problem
grids are used to discretize continuous space
  • granularity of the grids is important for speed/accuracy

simplifications (for this example; not in the real code)

earth's surface is flat (no mountains)
earth is round (in reality the poles are flattened and the earth bulges at the equator)
second order properties are ignored


grid points

divide continuous space into discrete parts

for this code, grid size is fixed and uniform
  • possible to change grid size or use multiple grids
use three dimensional grid
  • two for latitude and longitude
  • one for elevation
  • total of M * N * L points (see the indexing sketch below)

design choice: where is the grid point?

left, right, or center of the interval for a grid element
in multiple dimensions the number of choices multiplies
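
A sketch of how one M * N * L grid variable might be stored and indexed in C (the sizes and the flat row-major layout are assumptions, not the course code):

  #include <stdio.h>
  #include <stdlib.h>

  /* illustrative grid dimensions: latitude, longitude, elevation */
  enum { M = 64, N = 128, L = 20 };

  /* flat index of grid point (i, j, k) */
  static size_t idx(size_t i, size_t j, size_t k) {
      return (i * N + j) * L + k;
  }

  int main(void) {
      double *u = calloc((size_t)M * N * L, sizeof *u);   /* one variable on the grid */
      if (!u) return 1;

      u[idx(3, 5, 2)] = 1.0;                /* value at grid point (3, 5, 2) */
      printf("%f\n", u[idx(3, 5, 2)]);

      free(u);
      return 0;
  }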


variables

one dimensional

m - geo-potential (gravitational effects)

two dimensional

pi - shifted surface pressure
sigmadot - vertical component of the wind velocity

three dimensional (primary variables)

shared memory version

in each loop nest, iterations are independent
use a parallel for-loop for each loop nest (sketched below)
synchronize (barrier) after each loop nest
  • this is overly conservative, but works
  • could use a single sync variable per element, but would incur excessive overhead
potential parallelism is M * N * L
private variables: D, i, j, k
advantages of shared memory
  • easier to get something working (ignoring performance)
disadvantage: hard to debug
  • other processors can modify shared data
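
A minimal shared memory sketch of one loop nest in OpenMP (the update itself is a stand-in for the real physics, and the sizes are made up): the parallel for divides the M * N * L iterations among processors, the private clause covers D, j, k as in the notes, and the implicit barrier at the end of the loop is the conservative per-nest synchronization.

  #include <stdlib.h>

  enum { M = 64, N = 128, L = 20 };
  #define IDX(i, j, k) (((i) * N + (j)) * L + (k))

  /* one parallel loop nest: every iteration is independent */
  void step(double *u, const double *v) {
      double D;
      int i, j, k;
      #pragma omp parallel for private(D, j, k)
      for (i = 0; i < M; i++)
          for (j = 0; j < N; j++)
              for (k = 0; k < L; k++) {
                  D = v[IDX(i, j, k)];          /* stand-in for the real update */
                  u[IDX(i, j, k)] = 0.5 * D;
              }
      /* implicit barrier here: all iterations finish before the next loop nest */
  }

  int main(void) {
      double *u = calloc((size_t)M * N * L, sizeof *u);
      double *v = calloc((size_t)M * N * L, sizeof *v);
      if (!u || !v) return 1;
      step(u, v);
      free(u);
      free(v);
      return 0;
  }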

distributed memory version

decompose data to specific processors

assign a cube to each processor
  • maximize volume to surface ratio
  • which minimizes communication/computation ratio
called a <block,block,block> distribution

need to communicate {i,j,k}{+,-}{1,2} terms at boundaries

use send/receive to move the data (sketched below)
no need for barriers, send/receive operations provide sync
  • do sends earlier in computation to hide communication time
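
A much reduced sketch of the boundary exchange in MPI, in one dimension only (ranks, tags, and the ghost-cell layout are assumptions): each processor trades a boundary value with its neighbors, and the send/receive pair itself provides the synchronization. Nonblocking sends (MPI_Isend) posted earlier in the computation would hide communication time, as the point above suggests.

  #include <mpi.h>

  /* local[0] and local[n+1] are ghost cells; local[1..n] is owned data */
  void halo_exchange(double *local, int n, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      /* send first owned cell left, receive right neighbor's into the right ghost */
      MPI_Sendrecv(&local[1],     1, MPI_DOUBLE, left,  0,
                   &local[n + 1], 1, MPI_DOUBLE, right, 0,
                   comm, MPI_STATUS_IGNORE);
      /* send last owned cell right, receive left neighbor's into the left ghost */
      MPI_Sendrecv(&local[n], 1, MPI_DOUBLE, right, 0,
                   &local[0], 1, MPI_DOUBLE, left,  0,
                   comm, MPI_STATUS_IGNORE);
  }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      enum { NLOCAL = 8 };                  /* made-up local slab size */
      double local[NLOCAL + 2];             /* owned cells plus two ghost cells */
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (int i = 0; i < NLOCAL + 2; i++)
          local[i] = rank;                  /* fill with something recognizable */

      halo_exchange(local, NLOCAL, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }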

advantage

easier to debug? maybe
consider data locality explicitly with data decomposition
  • better performance/scaling

problems

hard to get the code running