User:Szha/Notes/CMSC714/note0904
coordination[edit | edit source]
processors working together to solve a problem
synchronization[edit | edit source]
protection of a single object (locks) coordination of processors (barriers)
size of a unit of work by a processor need to manage two issues
- load balance - processors have equal work
- coordination overhead - communication and synchronization
ofen called "grain" size -coarse vs fine more processors are hard to balance in workload
sources of parallelism[edit | edit source]
statments
- called "control parallel"
- can perform a series of steps in parallel
- basis of dataflow computers
loops
- called "data parallel"
- most common source of parallelism for most programs
- each processor gets one (or more) iterations to perform
dependency between loops makes it hard
examples[edit | edit source]
easy[edit | edit source]
multiple independent jobs (different simulations)
scientific[edit | edit source]
dense linear algebra (divide up matrix) (require coordination) physical system simulations (divide physical space)
databases[edit | edit source]
Biggest success of parallel computing (divide tuples) exploits semantics of relational algebra
AI[edit | edit source]
search problems (divide search space) pattern recognition and image processing (divide image)
metrics in application performance[edit | edit source]
speedup[edit | edit source]
- ratio of time on one node to time on n nodes
- hold problem size fixed
- should really compare to best serial time
- goal is linear speedup
- super-linear speedup is possible due to:
- adding more memory/cache
- search problems
iso-speedup (or scaled speedup)[edit | edit source]
- scale data sisze up with number of nodes
- goal is a flat horizontal curve
amdahl's law[edit | edit source]
- max speed up is 1/(serial fraction of time), or 1/(1-f+f/s) as s -> \inf
f = parallel fraction s = speed up of parallel part t_m = (1-f) + f/s
computation to communication ratio[edit | edit source]
- goal is to maximize this ratio
how to write parallel programs[edit | edit source]
use old serial code[edit | edit source]
- compiler converts it to parallel
- called the dusty deck problem
serial language plus communication library[edit | edit source]
- no compiler changes required
- PVM and MPI use this approach
new language for parallel computing[edit | edit source]
- requires all code to be re-written
- hard to create a language that provides high performance on different platforms
hybrid approach -- old language(s), new constructs[edit | edit source]
- HPF - add data distribution commands to code
- add parallel loops and synchronization operations
application example - weather[edit | edit source]
typical of many scientific codes[edit | edit source]
- computes results for three dimensional space
- compute results at multiple time steps
- uses equations to describe physics/chemistry of the problem
- grids are used to discretize continuous space
- * granularity of grids is important to speed/accuracy
simplifications (for example, not in real code)[edit | edit source]
- earth is flat (no mountains)
- earth is round (poles are really flat, earth bulges at equator)
- second order properties
grid points[edit | edit source]
divide continuous space into discrete parts[edit | edit source]
- for this code, grid size is fixed and uniform
- possible to change grid size or use multiple grids
- use three dimensional grid
- two for latitude and longitude
- one for elevation
- total of M * N * L points
design choice: where is the grid point?[edit | edit source]
- left, right, or cener of the interval for a grid element
- in multiple dimensions this multiplies:
variables[edit | edit source]
one dimensional[edit | edit source]
- m geo-potential (gravitational effects)
two dimensional[edit | edit source]
- pi shifted surface pressure
- sigmadot vertical component of the wind velocity
three dimensional (primary variables[edit | edit source]
[edit | edit source]
- in each loop nest, interations are independent
- use a parallel for-loop for each loop nest
- synchronize (barrier) after each loop nest
- this is overly conservative, but works
- could use a single sync variable per element, but would incur excessive overhead
- potential parallelism is M * N * L
- private variables: D, i, j, k
- advantages of shared memeory
- easier to get something working (ignoring performance)
- hard to debug
- other processors can modify shared data
distributed memory version[edit | edit source]
decompose data to specific processors[edit | edit source]
- assign a cube to each processsor
- maximize volume to surface ratio
- which minimizes communication/computation ratio
- called a <block,block,block> distribution
need to communicate {i,j,k}{+,-}{1,2} terms at boundaries[edit | edit source]
- use send/receive to move the data
- no need for barriers, send/receive operations provide sync
- do sends earlier in computation to hide communication time
advantage[edit | edit source]
- easier to debug? maybe
- consider data locality explicitly with data decomposition
- better performance/scaling
problems[edit | edit source]
- hard to get the code running