User:Szha/Notes/CMSC714/note0904
coordination
processors working together to solve a problem
synchronization
- protection of a single object (locks)
- coordination of processors (barriers)
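A minimal sketch of both kinds of synchronization in C with POSIX threads (not from the lecture; the counter, thread count, and names are made up): a mutex protects a single shared object, and a barrier holds every thread until all of them arrive.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects the shared counter */
    static pthread_barrier_t barrier;                         /* coordinates all threads */
    static long counter = 0;

    static void *worker(void *arg) {
        long id = (long)arg;

        /* lock: protect a single shared object */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);

        /* barrier: nobody continues until every thread has updated the counter */
        pthread_barrier_wait(&barrier);

        if (id == 0)
            printf("counter after barrier = %ld\n", counter);  /* always NTHREADS */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }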
granularity
size of the unit of work done by a processor; need to manage two issues:
- load balance - processors should have equal work
- coordination overhead - communication and synchronization
often called "grain" size - coarse vs. fine; with more processors it is harder to balance the workload
sources of parallelism
statements
- called "control parallel"
- can perform a series of steps in parallel
- basis of dataflow computers
loops
- called "data parallel"
- most common source of parallelism for most programs
- each processor gets one (or more) iterations to perform
dependences between loop iterations make this hard (see the sketch below)
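A sketch of loop-level (data) parallelism in C with OpenMP; the array names and size are hypothetical. Each iteration is independent, so each processor can be handed one or more iterations; a loop like a[i] = a[i-1] + b[i] could not be split up this way because of the dependence between iterations.

    #include <omp.h>

    #define N 1000000

    void scale(double a[N], const double b[N], double c) {
        /* iterations are independent: each thread gets one or more of them */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = c * b[i];
        /* implicit barrier here before the serial code continues */
    }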
examples
easy
multiple independent jobs (different simulations)
scientific
dense linear algebra (divide up the matrix; requires coordination)
physical system simulations (divide up physical space)
databases
biggest success of parallel computing (divide up the tuples); exploits the semantics of relational algebra
AI
search problems (divide up the search space)
pattern recognition and image processing (divide up the image)
metrics in application performance
speedup
- ratio of time on one node to time on n nodes
- hold problem size fixed
- should really compare to best serial time
- goal is linear speedup
- super-linear speedup is possible due to:
- adding more memory/cache
- search problems
iso-speedup (or scaled speedup)
- scale data size up with the number of nodes
- goal is a flat horizontal curve
amdahl's law
- f = parallel fraction of the time, s = speedup of the parallel part
- parallel time (normalized to a serial time of 1): t_m = (1 - f) + f/s
- speedup = 1 / ((1 - f) + f/s)
- max speedup is 1 / (1 - f), i.e. 1 / (serial fraction of time), as s -> infinity
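A small sketch in C of the formula above, with made-up numbers: a code that is 90% parallel (f = 0.9) with the parallel part sped up 8x gives a speedup of about 4.7, and no matter how large s gets the speedup can never pass 1/(1 - f) = 10.

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/s) */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("f = 0.9, s = 8    -> speedup = %.2f\n", amdahl(0.9, 8.0));  /* ~4.71 */
        printf("f = 0.9, s -> inf -> speedup = %.2f\n", amdahl(0.9, 1e12)); /* ~10.0 */
        return 0;
    }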
computation to communication ratio
- goal is to maximize this ratio
how to write parallel programs
use old serial code
- compiler converts it to parallel
- called the dusty deck problem
serial language plus communication library
- no compiler changes required
- PVM and MPI use this approach
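A minimal sketch of the library approach in C with MPI (the message contents and tag are arbitrary): the program is ordinary serial C, and parallelism comes only from the library's send/receive calls, so no compiler changes are needed.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                          /* ordinary serial code */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* library call moves data */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }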
new language for parallel computing
- requires all code to be re-written
- hard to create a language that provides high performance on different platforms
hybrid approach -- old language(s), new constructs
- HPF - add data distribution commands to code
- add parallel loops and synchronization operations
application example - weather
typical of many scientific codes
- computes results for a three-dimensional space
- computes results at multiple time steps
- uses equations to describe physics/chemistry of the problem
- grids are used to discretize continuous space
- granularity of the grid is important for speed/accuracy
simplifications (made for this example, not in the real code)
- earth's surface is flat (no mountains)
- earth is a perfect sphere (really the poles are flattened and the earth bulges at the equator)
- second order properties
grid points
divide continuous space into discrete parts
- for this code, grid size is fixed and uniform
- possible to change grid size or use multiple grids
- use three dimensional grid
- two for latitude and longitude
- one for elevation
- total of M * N * L points
design choice: where is the grid point?
- left, right, or center of the interval for a grid element
- in multiple dimensions the number of choices multiplies
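A sketch in C of how such a grid is usually stored (the dimensions here are made up): one contiguous array of M * N * L points with the (i, j, k) index flattened by hand.

    #include <stdlib.h>

    /* hypothetical grid dimensions: latitude x longitude x elevation */
    enum { M = 64, N = 64, L = 16 };

    /* flatten a 3D (i, j, k) index into the contiguous array */
    static inline size_t idx(int i, int j, int k) {
        return ((size_t)i * N + j) * L + k;
    }

    /* one value per grid point: M * N * L points in total */
    double *alloc_grid(void) {
        return calloc((size_t)M * N * L, sizeof(double));
    }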
variables
one dimensional
- m - geo-potential (gravitational effects)
two dimensional
- pi - shifted surface pressure
- sigmadot - vertical component of the wind velocity
three dimensional (primary variables)
shared memory version
- in each loop nest, iterations are independent
- use a parallel for-loop for each loop nest (see the sketch after this list)
- synchronize (barrier) after each loop nest
- this is overly conservative, but works
- could use a single sync variable per element, but would incur excessive overhead
- potential parallelism is M * N * L
- private variables: D, i, j, k
- advantages of shared memory
- easier to get something working (ignoring performance)
- disadvantage - hard to debug (other processors can modify shared data)
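A sketch (in C with OpenMP, not the actual weather code) of the shared-memory pattern above: one parallel for-loop per loop nest, an implicit barrier at the end of the nest, and D, i, j, k kept private to each thread; the arrays and the per-point work are placeholders.

    #include <omp.h>

    enum { M = 64, N = 64, L = 16 };

    void one_loop_nest(double u[M][N][L], double f[M][N][L]) {
        int i, j, k;
        double D;

        /* parallel for-loop over the nest; i, j, k, D are private per thread
           (could add collapse(3) to expose all M * N * L iterations) */
        #pragma omp parallel for private(j, k, D)
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < L; k++) {
                    D = f[i][j][k];       /* placeholder per-point work */
                    u[i][j][k] += D;
                }
        /* implicit barrier here: every iteration finishes before the next loop nest */
    }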
distributed memory version
decompose data to specific processors
- assign a cube to each processor
- maximize volume to surface ratio
- which minimizes communication/computation ratio
- called a <block,block,block> distribution
need to communicate {i,j,k}{+,-}{1,2} terms at boundaries
- use send/receive to move the data
- no need for barriers, send/receive operations provide sync
- do sends earlier in computation to hide communication time
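A simplified sketch in C of the boundary exchange, hedged heavily: it uses a 1-D slab decomposition and MPI_Sendrecv rather than the <block,block,block> cubes and separate sends/receives described above, and the buffer layout is made up. The paired send/receive itself provides the synchronization, so no barrier is needed, and MPI_Sendrecv avoids deadlock between neighbors.

    #include <mpi.h>

    /* slab holds (nlocal + 2) planes of plane_size doubles:
       plane 0 = left ghost, planes 1..nlocal = interior, plane nlocal+1 = right ghost */
    void exchange_boundaries(double *slab, int plane_size, int nlocal,
                             int rank, int nprocs) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send first interior plane left, receive right neighbor's plane into right ghost */
        MPI_Sendrecv(slab + plane_size, plane_size, MPI_DOUBLE, left, 0,
                     slab + (size_t)(nlocal + 1) * plane_size, plane_size, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send last interior plane right, receive left neighbor's plane into left ghost */
        MPI_Sendrecv(slab + (size_t)nlocal * plane_size, plane_size, MPI_DOUBLE, right, 1,
                     slab, plane_size, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }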
advantages
- easier to debug? maybe
- consider data locality explicitly with data decomposition
- better performance/scaling
problems
- hard to get the code running