After learning the GPU programming model and writing a basic CUDA program, Week 2 introduces some concepts for writing efficient parallel algorithms.
Communication is the first issue: how do threads communicate efficiently? This is easier in some problems (e.g. Map) than in others (e.g. Scan/Sort). The memory read and write patterns of the key algorithm types are discussed in terms of their input-to-output relationships (a short sketch follows the list):
- Map: one-to-one
- Gather: many-to-one
- Stencil: several-to-one
- Reduce: all-to-one
- Scatter: one-to-many
- Scan: all-to-all
- Sort: all-to-all
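To make the first two patterns concrete, here is a minimal CUDA sketch (the kernel names are mine, not from the course): a map kernel where each thread reads and writes exactly one element, and a gather kernel where each thread reads several inputs and writes one output.

```cuda
// Map: one-to-one. Each thread squares its own element.
__global__ void map_square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];                            // one input -> one output
}

// Gather: many-to-one. Each thread reads three neighbors and writes one average.
__global__ void gather_avg3(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // three inputs -> one output
}
```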
The basic problem of concurrent memory access was illustrated with a scatter example. With each input trying to write a third of its value to its neighboring elements, we can see that trouble will result from independent thread execution: several threads end up writing to the same output location at the same time.
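A rough sketch of that scatter, assuming the three-neighbor averaging example and a zero-initialized output (the kernel name is hypothetical). Each `+=` is a non-atomic read-modify-write, and three different threads update each output element, so the result is undefined.

```cuda
// Scatter: one-to-many, written the race-prone way.
// Thread i adds a third of its value to out[i-1], out[i], and out[i+1];
// threads i-1 and i+1 also touch out[i], so the updates collide.
__global__ void scatter_avg3_racy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float third = in[i] / 3.0f;
        out[i - 1] += third;
        out[i]     += third;   // non-atomic read-modify-write: a data race
        out[i + 1] += third;
    }
}
// One common fix is atomicAdd(&out[i - 1], third) and so on, at some performance cost.
```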
To overcome these hazards, threads must communicate in some way. Shared memory and synchronization points (barriers) were described as the tools for this job.
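A minimal sketch of those two tools together, assuming a block-local shift as the task (the kernel name is mine): each block stages its slice of the input in shared memory, then every thread waits at a `__syncthreads()` barrier before reading a value that a different thread wrote.

```cuda
// Shared memory + barrier. Assumes n is a multiple of blockDim.x for brevity.
__global__ void shift_left(const float *in, float *out, int n) {
    extern __shared__ float tile[];                // sized in the launch configuration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();        // barrier: the whole tile is written before anyone reads it

    // Read the value the *next* thread in this block staged.
    // (The last thread of each block has no in-block neighbor, so it is skipped.)
    if (i < n && threadIdx.x < blockDim.x - 1)
        out[i] = tile[threadIdx.x + 1];
}

// Launch with the shared-memory size as the third configuration argument, e.g.:
//   shift_left<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_in, d_out, n);
```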
Some important points about what CUDA guarantees (and does not guarantee) about thread execution were also touched on:
- No guarantees about when and where thread blocks will run – this is an enabler for massive parallelism. It also brings some limitations: no assumptions can be made about which blocks will run on which SM, and there can be no direct communication between blocks.
- Does guarantee – all threads in a block run on the same SM at the same time.
- Does guarantee – all blocks in a kernel finish before any block from the next kernel runs.
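That last guarantee is what makes cross-block dependencies workable: instead of trying to synchronize blocks inside one kernel, split the work into two launches. A small sketch under assumed names (not from the course):

```cuda
// Step 1: each thread produces one intermediate value.
__global__ void step1_double(const float *in, float *mid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mid[i] = 2.0f * in[i];
}

// Step 2: reads mid[i + 1], which may have been written by a *different* block
// of step1_double. The launch-ordering guarantee makes that read safe.
__global__ void step2_neighbor_sum(const float *mid, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) out[i] = mid[i] + mid[i + 1];
}

// On the same (default) stream no explicit synchronization is needed between them:
//   step1_double<<<blocks, threads>>>(d_in, d_mid, n);
//   step2_neighbor_sum<<<blocks, threads>>>(d_mid, d_out, n);
```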
A good source for some of the hardware diagrams and acronyms: www.cs.nyu.edu/manycores/cuda_many_cores.pdf
Finally, coalesced memory access was discussed as a key practice for efficiency.
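A rough sketch of the difference (kernel names are mine): in the coalesced copy, adjacent threads read adjacent addresses, so a warp's loads can be served by a few wide memory transactions; in the strided version, neighboring threads read far-apart addresses and the same warp needs many transactions.

```cuda
// Coalesced: thread i reads element i, so a warp touches one contiguous chunk.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided (illustrative): neighboring threads read elements `stride` apart,
// scattering each warp's loads across memory. The writes are still coalesced;
// only the reads are the problem here.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}
```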