# Lecture 16: Introduction to GPU Computing

### Vivek Kumar, Computer Science and Engineering, IIIT Delhi (vivekk@iiitd.ac.in)


### Last Lecture (Recap)

- SIMD vector extensions
  - Special registers at each core that support instructions operating on vector values
- Limitations
  - Loop size should be countable at runtime
  - Loop iterations should not have different control flow
  - Loop iterations should be independent
  - Loop should only use basic math functions
  - Only a single arithmetic type operation
  - Should not have non-contiguous memory accesses
  - Unsupported data-dependencies
    - Read-After-Write: A[i] = A[i-1] + 1
    - Write-After-Write: A[i%2] = B[i] + C[i]
  - Supported data-dependencies
    - Write-After-Read: A[i-1] = A[i] + 1
    - Read-After-Read: A[i] = B[i%2] + C[i]
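
To see why the pattern `A[i] = A[i-1] + 1` blocks vectorization while `A[i-1] = A[i] + 1` does not, the sketch below compares each scalar loop with a hand-rolled 8-wide version that reads a whole chunk before writing any result. This is a plain C++ emulation, not real SIMD intrinsics; the names `kWidth`, `prev_*` and `next_*` are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 8;  // emulated SIMD width (assumption: 8 lanes, like Vec8i)

// Unsupported pattern: A[i] = A[i-1] + 1, one element at a time.
std::vector<int> prev_scalar(std::vector<int> a) {
    for (std::size_t i = 1; i < a.size(); ++i) a[i] = a[i - 1] + 1;
    return a;
}

// Same loop in chunks: every lane is read before any lane is written.
// Assumes a.size() == 1 + a multiple of kWidth, so no remainder loop is needed.
std::vector<int> prev_chunked(std::vector<int> a) {
    for (std::size_t i = 1; i + kWidth <= a.size(); i += kWidth) {
        std::array<int, kWidth> lanes;
        for (std::size_t l = 0; l < kWidth; ++l) lanes[l] = a[i + l - 1] + 1;  // "vector" load + add
        for (std::size_t l = 0; l < kWidth; ++l) a[i + l] = lanes[l];          // "vector" store
    }
    return a;
}

// Supported pattern: A[i-1] = A[i] + 1, one element at a time.
std::vector<int> next_scalar(std::vector<int> a) {
    for (std::size_t i = 1; i < a.size(); ++i) a[i - 1] = a[i] + 1;
    return a;
}

// Chunked version of the supported pattern (same size assumption as above).
std::vector<int> next_chunked(std::vector<int> a) {
    for (std::size_t i = 1; i + kWidth <= a.size(); i += kWidth) {
        std::array<int, kWidth> lanes;
        for (std::size_t l = 0; l < kWidth; ++l) lanes[l] = a[i + l] + 1;      // "vector" load + add
        for (std::size_t l = 0; l < kWidth; ++l) a[i + l - 1] = lanes[l];      // "vector" store
    }
    return a;
}
```

On a 17-element input, the chunked version of the first loop differs from the scalar result because its lanes read stale values of `A[i-1]`. For the second loop both versions agree, since every element is read before the iteration that overwrites it.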



```
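#include "vectorclass.h"   // Vector Class Library: provides Vec8i (8 x 32-bit int)
int A[1024], B[1024], C[1024];
// Computes A = B + C, eight integers per loop iteration.
void sum() {
    Vec8i Av;
    for (int i=0; i<1024; i+=8) {
        Vec8i Bv = Vec8i().load(B+i);  // load B[i..i+7] into a vector register
        Vec8i Cv = Vec8i().load(C+i);  // load C[i..i+7]
        Av = Bv + Cv;                  // one SIMD add across all 8 lanes
        Av.store(A+i);                 // write the 8 sums to A[i..i+7]
    }
}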
```



# **Today's Class**

- ➡ GPU architecture
- GPU programming

This lecture gives a high-level overview of GPU architecture and of a platform-neutral, library-based programming model for writing GPU programs that compile with standard C++ compilers.



CSE513: Parallel Runtimes for Modern Processors

# **Multicore CPUs with SIMD Support**

Multicore processors are latency oriented!
 How?



- Modern multicore processors have sophisticated cores to support general purpose computing
  - High core frequency for low latency operations
  - Large caches and a prefetcher unit for improving memory access latency
    - The prefetcher dynamically predicts future memory accesses based on the current access pattern to reduce CPU stalls
  - Superscalar capabilities that exploit Instruction Level Parallelism (ILP)
- They also support data parallel execution
  - Each physical core has a bunch of ALUs and wide vector registers for SIMD operations



### **CPU Stalls in SIMD Execution**



# **Using SMT for Hiding Stalls**

*Figure: two cores below a shared DRAM; each core has its own cache, four ALUs, and two hardware thread contexts.*

| Core-1 | Core-2 |
|--------|--------|
| Cache | Cache |
| ALU ALU ALU ALU | ALU ALU ALU ALU |
| Thread State + PC (x2) | Thread State + PC (x2) |

- Two-way SMT (Simultaneous Multithreading) at each core
  - Each SMT core (hardware thread) has its own PC register, thereby allowing each one to execute a completely different execution stream simultaneously
  - Each SMT core has its own set of vector registers
  - Each SMT core pair shares ALUs
  - Each SMT core pair can execute a different set of SIMD operations (as they don't share a PC register)



## **CPU Stalls in SIMD Execution**



- Using SMT for hiding stalls
  - Thread-1 on Core-1 and Thread-3 on Core-2 complete the first iteration, and then stall for a memory fetch
  - The memory fetches of Thread-2 on Core-1 and Thread-4 on Core-2 have completed, hence they start their first iteration while Thread-1 and Thread-3 are blocked on their memory fetches
  - The key idea is to increase the number of hardware threads to hide CPU stalls
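
The effect of adding hardware threads can be illustrated with a small cycle-level model. This is an idealized sketch under stated assumptions: one ALU pipeline per core, a fixed compute phase followed by a fixed memory stall, zero-cost switching between threads, and a scheduler that always picks the lowest-numbered ready thread. The function name `alu_utilization` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Fraction of cycles in which the core's ALU does useful work.
// Each thread repeats: compute for `compute_cycles`, then wait `stall_cycles` on memory.
double alu_utilization(int threads, int compute_cycles, int stall_cycles, int total_cycles) {
    assert(threads > 0 && compute_cycles > 0 && stall_cycles >= 0 && total_cycles > 0);
    std::vector<int> ready_at(threads, 0);                // cycle at which each thread may run again
    std::vector<int> remaining(threads, compute_cycles);  // compute cycles left in the current phase
    int busy = 0;
    for (int cycle = 0; cycle < total_cycles; ++cycle) {
        for (int t = 0; t < threads; ++t) {
            if (ready_at[t] <= cycle) {                   // first ready thread gets the ALU
                ++busy;
                if (--remaining[t] == 0) {                // compute phase done: issue memory fetch
                    remaining[t] = compute_cycles;
                    ready_at[t] = cycle + 1 + stall_cycles;
                }
                break;                                    // only one thread uses the ALU per cycle
            }
        }
    }
    return static_cast<double>(busy) / total_cycles;
}
```

With a 1-cycle compute phase and a 3-cycle stall over 400 cycles, this model keeps the ALU busy 25% of the time with one thread, 50% with two threads, and 100% with four threads.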


#### How to Further Optimize SIMD Execution?

*Figure: two cores below a shared DRAM, each now holding many hardware thread contexts.*

| Core | Core |
|------|------|
| Cache | Cache |
| ALU ALU ALU ALU | ALU ALU ALU ALU |
| Multiple rows of Thread State + PC (one per hardware thread) | Multiple rows of Thread State + PC (one per hardware thread) |

- Increase the number of hardware threads supported on each core
  - CPU stalls are significantly reduced
  - Improves performance, as the hardware schedules the threads instead of the OS
- Improve the memory bandwidth
  - Large chunks of memory are fetched from DRAM because of the large number of threads
- But won't these enhancements increase the complexity and cost of the multicore processor?

# How to Design a Processor for SIMD?

- If we only have to run SIMD applications on a processor, how can we cut down the complexity of the processor?
  - Reduce core frequency and increase the number of cores
  - Support a large number of hardware threads at each core
    - Requires a large amount of data, but stalls are hidden due to the large number of threads
  - Cores have smaller caches
    - A large number of threads per core would operate on a large amount of data, thereby requiring frequent DRAM accesses
  - Increase the number of ALUs per core and the width of SIMD registers
  - A group of threads could share a single PC register
    - Single Instruction Multiple Thread (SIMT)
    - Shared instruction cache
  - Support high bandwidth data transfer

#### This is the design of a throughput oriented processor or a GPU


### **Mechanical Equivalent of a GPU**



Slide credit: https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/gpu101.1pp.pdf



## Intel GPU Architecture

*Figure (block diagram): shared functions, a copy engine, and a media engine sit at the top, followed by the geometry, raster, and pixel dispatch stages. Below them are six subslices, each with a thread dispatcher, a grid of EUs, a sampler, a media sampler, a load/store unit, an L1/texture cache, and shared local memory (SLM). Pixel backends, the L3 cache, GTI, and GAM sit at the bottom.*

#### Intel's Iris® Xe single Slice

Source: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html




- Execution Unit (EU) is the smallest building block (same as a core in the CPU)
  - Operates at MHz level instead of GHz
  - Each EU supports 7-way SMT
  - Supports one 8-wide SIMD operation
- Each slice contains 6 subslices
  - 16 EUs at each subslice
  - Total FP32 SIMD operations per slice per cycle are 7x8x16x6 (=5376)
- An Intel GPU can contain multiple slices

### **NVIDIA GPU Architecture**

*Figure (block diagram): an instruction cache at the top feeds two processing blocks. Each block has an instruction buffer, a warp scheduler, two dispatch units, a register file (32,768 x 32-bit), and eight rows of execution units, each row holding Core, Core, DP Unit, Core, Core, DP Unit, LD/ST, SFU. A texture / L1 cache and four Tex units sit at the bottom.*

Pascal GP100 single SM (Streaming Multiprocessor)

- CUDA-core is the smallest building block (akin to EU in Intel)
  - Operates at 1126 MHz
  - CUDA-cores execute threads in groups of 32 called warps; a warp processes 32 data elements (FP32) simultaneously, similar to a 32-wide vector operation
    - A warp has a common PC (SIMT)
- Each SM (akin to subslice in Intel) has 32x2 CUDA-cores
  - An SM can operate on 64 warps, i.e., each SM can process 32x64 FP32 data elements simultaneously
- GP100 has 56 SMs per GPU
  - Total FP32 data elements that can be processed simultaneously are 32x64x56 (=114688)
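
The two totals quoted on these slides are plain products of the per-level widths. The sketch below redoes the arithmetic at compile time; the constant and function names are made up for illustration. The figures follow the slides' counting convention (threads or warps in flight multiplied by SIMD width), so they should be read as upper bounds on data elements in flight rather than measured throughput.

```cpp
#include <cstdint>

// Intel Iris Xe slice, using the numbers from the slide.
constexpr std::uint64_t kSmtWaysPerEu      = 7;   // hardware threads per EU
constexpr std::uint64_t kSimdWidthPerEu    = 8;   // FP32 lanes per SIMD operation
constexpr std::uint64_t kEusPerSubslice    = 16;
constexpr std::uint64_t kSubslicesPerSlice = 6;

constexpr std::uint64_t intel_xe_fp32_per_slice() {
    return kSmtWaysPerEu * kSimdWidthPerEu * kEusPerSubslice * kSubslicesPerSlice;
}

// NVIDIA Pascal GP100, using the numbers from the slide.
constexpr std::uint64_t kWarpSize   = 32;  // FP32 elements per warp
constexpr std::uint64_t kWarpsPerSm = 64;
constexpr std::uint64_t kSmsPerGpu  = 56;

constexpr std::uint64_t gp100_fp32_per_gpu() {
    return kWarpSize * kWarpsPerSm * kSmsPerGpu;
}

static_assert(intel_xe_fp32_per_slice() == 5376, "7 x 8 x 16 x 6");
static_assert(gp100_fp32_per_gpu() == 114688, "32 x 64 x 56");
```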



# **Today's Class**

- GPU architecture
- ➡ GPU programming



# **GPU Programming Template**

1. Set up inputs on the host CPU
2. Allocate memory on the host CPU
3. Allocate memory on the GPU
4. Copy inputs from the host to the GPU
5. Start the GPU kernel
6. Copy output from the GPU to the host
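
The six steps can be sketched in plain C++ with the GPU replaced by a stand-in. `DeviceBuffer`, `copy_to_device`, `launch_kernel`, and `copy_to_host` are hypothetical helpers invented for this sketch and are not part of any real GPU API. A real program would make CUDA, OpenCL, or Boost.Compute calls at the same points.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for GPU memory; on real hardware this storage lives in device DRAM.
struct DeviceBuffer {
    std::vector<float> data;
    explicit DeviceBuffer(std::size_t n) : data(n) {}
};

// Stand-ins for host <-> device transfers.
void copy_to_device(const std::vector<float>& host, DeviceBuffer& dev) { dev.data = host; }
void copy_to_host(const DeviceBuffer& dev, std::vector<float>& host) { host = dev.data; }

// Stand-in for a kernel launch: runs the kernel body once per element index.
template <typename Kernel>
void launch_kernel(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

std::vector<float> vector_add(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();          // step 1: inputs a and b were set up by the caller
    std::vector<float> c(n);                 // step 2: allocate host memory for the output
    DeviceBuffer d_a(n), d_b(n), d_c(n);     // step 3: allocate memory on the "GPU"
    copy_to_device(a, d_a);                  // step 4: copy inputs from host to device
    copy_to_device(b, d_b);
    launch_kernel(n, [&](std::size_t i) {    // step 5: start the kernel
        d_c.data[i] = d_a.data[i] + d_b.data[i];
    });
    copy_to_host(d_c, c);                    // step 6: copy the output back to the host
    return c;
}
```

Keeping transfers and launches as separate calls mirrors the explicit data movement that GPU programming models require, even though everything runs on the CPU here.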

# **GPU Programming Model**

- Vendor-supported programming models
  - CUDA on NVIDIA GPUs
  - oneAPI on Intel GPUs
  - Provide high performance
  - Cannot compile with standard compilers (lack portability)
- OpenCL is vendor neutral
  - Does not require any special compiler or compiler extensions
    - Works with standard C/C++ compilers
  - Provides direct access to underlying hardware (CPU, GPU, FPGA)
  - High portability
    - Same program can run on multiple device types
      - Although performance may not be optimal without device-specific tuning
  - Requires some serious effort for writing OpenCL programs



### **OpenCL Platform Model**



- One host is connected to one or more OpenCL compute devices
  - A compute device is a processor (e.g., multicore processor or GPU)
- Each compute device is composed of one or more compute units (a.k.a. work groups)
  - A compute unit is analogous to a "core" in a multicore processor, or a SIMD vector register in a CPU, or a CUDA-core in a GPU
- Each compute unit is divided into one or more processing elements (a.k.a. work items)
  - A processing element is analogous to a thread that executes code in SIMD fashion

#### **OpenCL Memory Model**

- Private Memory
  - Per work-item
- Local Memory
  - Shared within a workgroup
- Global/Constant Memory
  - Visible to all workgroups
- Host Memory
  - On the CPU



#### Memory management is Explicit

You must move data from host -> global -> local ... and back

© Copyright Khronos Group, 2012 - Page 2

#### **OpenCL Execution Model**

- OpenCL application runs on a host which submits work to the compute devices
  - **Context**: The environment within which work-items execute ... includes devices and their memories and command queues
  - **Program**: Collection of kernels and other functions (analogous to a dynamic library)
  - **Kernel**: The code for a work item; basically a C function
  - **Work item**: The basic unit of work on an OpenCL device

#### Applications queue kernel execution

- Executed in-order or out-of-order

Allows independent kernels to execute simultaneously whenever possible, and thus keeps the GPU fully utilized

© Copyright Khronos Group, 2012 - Page 3

*Figure: one context containing a GPU and a CPU, each fed by its own command queue.*


1. Query host for OpenCL devices
2. Create a context to associate OpenCL devices
3. Create programs for execution on one or more associated devices
4. Select kernels to execute from the programs
5. Create memory objects accessible from the host and/or the device
6. Copy memory data to the device as needed
7. Provide kernels to command queue for execution
8. Copy results from the device to the host

Similar to the GPU programming template listed in Slide #13



© Copyright Khronos Group, 2013 - Page 9

## **OpenCL Kernel Example**





- Vector addition using OpenCL
  - The complete OpenCL program to compute vector addition spans around 143 lines of **low-level code**, compared to the few lines of **simple code** in a traditional C/C++ program
    - See: https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/
    - Low productivity!
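
The kernel itself is usually the short part of such a program. The sketch below shows a typical OpenCL C vector-addition kernel held as a string, the way a host program would hand it to the OpenCL runtime, next to a plain C++ emulation of how that kernel body runs once per work-item. The emulation functions `vecadd_work_item` and `run_ndrange` are assumptions for illustration and make no OpenCL API calls; the kernel string is not compiled or executed here.

```cpp
#include <cstddef>
#include <vector>

// OpenCL C source for the kernel, kept as a string for the runtime to compile.
// It is shown for reference only and is not executed by this sketch.
const char* kVecAddSource = R"CLC(
__kernel void vecadd(__global const float* a,
                     __global const float* b,
                     __global float* c,
                     const unsigned int n) {
    size_t gid = get_global_id(0);          // index of this work-item
    if (gid < n) c[gid] = a[gid] + b[gid];
}
)CLC";

// Host-side emulation of the same kernel body for a single work-item.
void vecadd_work_item(std::size_t gid, const float* a, const float* b, float* c, std::size_t n) {
    if (gid < n) c[gid] = a[gid] + b[gid];  // guard: the global size may be padded past n
}

// Emulates a 1-D index space split into work-groups of local_size work-items,
// where global id = group id * local size + local id.
void run_ndrange(std::size_t n, std::size_t local_size, const float* a, const float* b, float* c) {
    const std::size_t num_groups = (n + local_size - 1) / local_size;  // round up
    for (std::size_t group = 0; group < num_groups; ++group)
        for (std::size_t local = 0; local < local_size; ++local)
            vecadd_work_item(group * local_size + local, a, b, c, n);
}
```

With `n = 10` and a work-group size of 4, the emulation launches 12 work-items in 3 groups; the bounds check makes the last two do nothing.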



# **Boost.Compute for GPU Computing**

- A header-only C++ library for GPU computing
  - Easy to use GPU programming APIs → High Productivity!
  - Provides a thin C++ wrapper over OpenCL APIs
  - Works with standard C++ compilers
  - Provides several ready-to-use optimized kernel implementations (e.g., `binary_search`, `reduce`, `sort_by_key`, etc.)
- Supports a variety of GPUs (Intel, NVIDIA, and AMD), as well as CPUs
- Caches OpenCL programs
  - Each OpenCL program (kernel) requires compilation and incurs overheads
  - Boost.Compute stores frequently used kernels in a global cache
    - Reduces overheads by avoiding repeated compilation of the same kernel
- May not match the performance of a natively supported GPU programming model (e.g., CUDA on an NVIDIA GPU) without tuning

# **Vector Addition using Boost.Compute**

- A demo of the program is available in the course GitHub repository
  - https://github.com/hipec/cse513/blob/main/lec16/tests/vecadd.cpp



# **Reading Materials**

- OpenCL
  - https://sites.google.com/site/csc8820/opencl-basics/opencl-concepts#TOC-Kernel-and-compute-kernel
  - https://www.khronos.org/assets/uploads/developers/library/2012-pan-pacific-road-show-June/OpenCL-Details-Taiwan_June-2012.pdf
- Boost.Compute
  - https://www.boost.org/doc/libs/1_80_0/libs/compute/doc/html/index.html#boost_compute.introduction



## **Next Lecture**

- Heterogeneous parallel programming
- Quiz-3 on Nov 11th
  - Syllabus: Lectures 15-17

