User:Szha/Notes/CMSC714/note0904

coordination

processors working together to solve a problem

synchronization

protection of a single object (locks)
coordination of processors (barriers)

size of the unit of work done by a processor
need to manage two issues:

  • load balance - processors have equal work
  • coordination overhead - communication and synchronization

often called "grain" size - coarse vs. fine
with more processors, the workload is harder to balance
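
a minimal sketch of these two mechanisms, using POSIX threads in C; the thread count, the shared counter, and the final print are illustrative choices, not from the lecture:

  /* a mutex protects a single shared object (a lock); a barrier makes all
   * threads wait for each other (coordination of processors).
   * compile with: gcc -pthread sync_sketch.c -o sync_sketch */
  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_barrier_t barrier;
  static long shared_count = 0;          /* the single object protected by the lock */

  static void *worker(void *arg)
  {
      long id = (long)arg;

      pthread_mutex_lock(&lock);         /* lock: serialize updates to shared_count */
      shared_count += id;
      pthread_mutex_unlock(&lock);

      pthread_barrier_wait(&barrier);    /* barrier: everyone arrives before anyone continues */
      if (id == 0)
          printf("after barrier, shared_count = %ld\n", shared_count);
      return NULL;
  }

  int main(void)
  {
      pthread_t t[NTHREADS];

      pthread_barrier_init(&barrier, NULL, NTHREADS);
      for (long i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, (void *)i);
      for (long i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&barrier);
      return 0;
  }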


sources of parallelism

statements

called "control parallel"
can perform a series of steps in parallel
basis of dataflow computers

loops

called "data parallel"
most common source of parallelism for most programs
each processor gets one (or more) iterations to perform

dependences between loop iterations make a loop hard to parallelize
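
a minimal sketch of loop-level data parallelism in C using OpenMP; the array size and the saxpy-style computation are illustrative, not from the lecture:

  /* each thread gets one (or more) iterations of the loop to perform;
   * the iterations are independent, so no synchronization is needed
   * inside the loop. compile with: gcc -fopenmp saxpy.c -o saxpy */
  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static double x[N], y[N];
      double a = 2.0;

      for (int i = 0; i < N; i++) {      /* serial initialization */
          x[i] = i;
          y[i] = 1.0;
      }

      #pragma omp parallel for           /* data parallel: split iterations across threads */
      for (int i = 0; i < N; i++)
          y[i] = a * x[i] + y[i];

      printf("y[N-1] = %f\n", y[N - 1]);
      return 0;
  }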


examples

easy

multiple independent jobs (different simulations)

scientific

dense linear algebra (divide up the matrix; requires coordination)
physical system simulations (divide physical space)

databases

biggest success of parallel computing (divide tuples)
exploits the semantics of relational algebra

AI

search problems (divide the search space)
pattern recognition and image processing (divide the image)


metrics in application performance

speedup

ratio of time on one node to time on n nodes
hold problem size fixed
should really compare to best serial time
goal is linear speedup
super-linear speedup is possible due to:
  • adding more memory/cache
  • search problems
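
in symbols (a standard formulation; the notation is assumed, not written out in the notes):

  \[ S(n) = \frac{T(1)}{T(n)}, \qquad \text{linear speedup: } S(n) = n \]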

iso-speedup (or scaled speedup)

scale the data size up with the number of nodes
goal is a flat horizontal curve

amdahl's law

f = parallel fraction of the time
s = speedup of the parallel part
normalized run time: t = (1 - f) + f/s
overall speedup = 1/t = 1/((1 - f) + f/s)
max speedup is 1/(1 - f) = 1/(serial fraction of time), reached as s -> infinity
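
a worked instance (the 90% parallel fraction is chosen only for illustration):

  \[ S(s) = \frac{1}{(1-f) + f/s}, \qquad f = 0.9 \;\Rightarrow\; \lim_{s \to \infty} S = \frac{1}{1 - 0.9} = 10 \]

so even with unlimited processors, a code that is 10% serial cannot speed up by more than a factor of 10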

computation to communication ratio

goal is to maximize this ratio


how to write parallel programs

use old serial code

compiler converts it to parallel
called the dusty deck problem

serial language plus communication library

no compiler changes required
PVM and MPI use this approach
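
a minimal sketch of this approach: ordinary C plus MPI library calls, no compiler changes; the rank count, tag, and message contents are illustrative:

  /* rank 0 sends one integer to rank 1 over MPI.
   * compile with: mpicc hello_mpi.c -o hello_mpi
   * run with:     mpirun -np 2 ./hello_mpi */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }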

new language for parallel computing

requires all code to be re-written
hard to create a language that provides high performance on different platforms

hybrid approach -- old language(s), new constructs

HPF - add data distribution commands to code
add parallel loops and synchronization operations


application example - weather

typical of many scientific codes

computes results for three dimensional space
compute results at multiple time steps
uses equations to describe physics/chemistry of the problem
grids are used to discretize continuous space
  • granularity of grids is important to speed/accuracy

simplifications (made for this example, not in the real code)

earth is flat (no mountains)
earth is perfectly round (in reality the poles are flattened and the earth bulges at the equator)
second-order properties are ignored


grid points

divide continuous space into discrete parts

for this code, grid size is fixed and uniform
  • possible to change grid size or use multiple grids
use three dimensional grid
  • two for latitude and longitude
  • one for elevation
  • total of M * N * L points
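
a minimal sketch of storing the M x N x L grid as one flat array in C; the dimensions and the row-major index formula are illustrative choices, not taken from the actual weather code:

  #include <stdio.h>
  #include <stdlib.h>

  #define M 64   /* latitude points   (assumed size) */
  #define N 64   /* longitude points  (assumed size) */
  #define L 16   /* elevation levels  (assumed size) */

  /* map (i, j, k) to a position in the flat array, row-major order */
  static size_t idx(size_t i, size_t j, size_t k)
  {
      return (i * N + j) * L + k;
  }

  int main(void)
  {
      double *grid = calloc((size_t)M * N * L, sizeof *grid);

      grid[idx(3, 5, 2)] = 1.0;               /* set one grid point */
      printf("total grid points: %d\n", M * N * L);

      free(grid);
      return 0;
  }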

design choice: where is the grid point?

left, right, or center of the interval for a grid element
in multiple dimensions the number of choices multiplies (three options per dimension gives 3^3 = 27 possible placements in 3D)


variables

one dimensional

m - geo-potential (gravitational effects)

two dimensional

pi - shifted surface pressure
sigmadot - vertical component of the wind velocity

three dimensional (primary variables)

shared memory version

in each loop nest, iterations are independent
use a parallel for-loop for each loop nest
synchronize (barrier) after each loop nest
  • this is overly conservative, but works
  • could use a single sync variable per element, but would incur excessive overhead
potential parallelism is M * N * L
private variables: D, i, j, k
advantages of shared memory
  • easier to get something working (ignoring performance)
hard to debug
  • other processors can modify shared data
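
a minimal sketch of the shared-memory version in C with OpenMP: one parallel loop nest with a private temporary D and private loop indices, and an implicit barrier at the end of the loop; the grid sizes, arrays, and averaging formula are illustrative, not the real weather equations:

  /* compile with: gcc -fopenmp shared_mem.c -o shared_mem */
  #include <stdio.h>

  #define M 32
  #define N 32
  #define L 8

  static double u[M][N][L], unew[M][N][L];

  int main(void)
  {
      double D;                              /* private temporary, as in the notes */

      for (int i = 0; i < M; i++)            /* serial initialization */
          for (int j = 0; j < N; j++)
              for (int k = 0; k < L; k++)
                  u[i][j][k] = i + j + k;

      /* iterations are independent: split the i loop across threads;
         i, j, k, and D are private to each thread */
      #pragma omp parallel for private(D)
      for (int i = 1; i < M - 1; i++)
          for (int j = 1; j < N - 1; j++)
              for (int k = 0; k < L; k++) {
                  D = 0.25 * (u[i - 1][j][k] + u[i + 1][j][k]
                            + u[i][j - 1][k] + u[i][j + 1][k]);
                  unew[i][j][k] = D;
              }
      /* implicit barrier here: all threads finish this loop nest
         before the program moves on to the next one */

      printf("unew[1][1][0] = %f\n", unew[1][1][0]);
      return 0;
  }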

distributed memory version

decompose data to specific processors

assign a cube to each processor
  • maximize volume to surface ratio
  • which minimizes communication/computation ratio
called a <block,block,block> distribution

need to communicate {i,j,k}{+,-}{1,2} terms at boundaries

use send/receive to move the data
no need for barriers, send/receive operations provide sync
  • do sends earlier in computation to hide communication time
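
a minimal sketch of the boundary exchange in C with MPI, simplified to a 1-D block decomposition along i (the notes use a <block,block,block> decomposition); array sizes, tags, and neighbor layout are illustrative, and the blocking MPI_Sendrecv used here does not overlap communication with computation as suggested above:

  /* each rank owns LOCAL_M interior i-planes plus one ghost plane on each
   * side (indices 0 and LOCAL_M + 1), and swaps boundary planes with its
   * left and right neighbors.
   * compile with: mpicc halo.c -o halo; run with: mpirun -np 4 ./halo */
  #include <mpi.h>
  #include <stdio.h>

  #define LOCAL_M 8
  #define N 16
  #define L 4

  static double u[LOCAL_M + 2][N][L];

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
      int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
      int plane = N * L;                     /* doubles per i-plane */

      /* send my last owned plane right, receive my left ghost plane */
      MPI_Sendrecv(&u[LOCAL_M][0][0], plane, MPI_DOUBLE, right, 0,
                   &u[0][0][0],       plane, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* send my first owned plane left, receive my right ghost plane */
      MPI_Sendrecv(&u[1][0][0],           plane, MPI_DOUBLE, left,  1,
                   &u[LOCAL_M + 1][0][0], plane, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      if (rank == 0)
          printf("halo exchange done on %d ranks\n", size);

      MPI_Finalize();
      return 0;
  }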

advantage

easier to debug? maybe
consider data locality explicitly with data decomposition
  • better performance/scaling

problems

hard to get the code running