# Lecture 15: Parallel Programming using SIMD Vector Units

#### Vivek Kumar, Computer Science and Engineering, IIIT Delhi, vivekk@iiitd.ac.in

CSE513: Parallel Runtimes for Modern Processors

# **Today's Class**

- → Flynn's classification
- SIMD vector extensions
- SIMD programming techniques
- Limitations of vectorization
- Vector Class Library for SIMD programming



#### **Flynn's Classification of Parallel Computers**






#### **Exploiting Parallelism on Modern Processors**

- Modern processors support three different kinds of parallelism
  - Instruction-level parallelism
    - Done automatically by the hardware
  - Thread (task) level parallelism (multicore)
    - Achieved with the help of the compiler or the programmer
  - Vector (data) level parallelism
    - Achieved with the help of the compiler (automatic) or the programmer (manual)



# **Today's Class**

- Flynn's classification
- → SIMD vector extensions
- SIMD programming techniques
- Limitations of vectorization
- Vector Class Library for SIMD programming



#### SIMD Vector Extensions

- What are they?
  - Extensions of the ISA
  - Special registers, with instructions that operate on **vectors** rather than **scalar** values
    - Each core has its own SIMD execution units
    - Parallel computation on short vectors (length 2, 4, 8, ...) of integers or floats
  - Names: SSE, SSE2, AVX, AVX2, AVX512, etc.
- What are they used for?
  - "Free" data-parallel units capable of providing (theoretical) speedups equal to the **vector width**
    - A single instruction operates on multiple data elements simultaneously
- Where do they exist?
  - On almost all modern processors, e.g., Intel and AMD






#### Picture source: Prof. Patterson's Lecture on vector processing




#### **Architectural Support for SIMD**





**SISD** operation on scalars

**SIMD** operation on vectors

Multicore processor supporting SIMD operations




- SIMD operations are supported by adding more ALUs to each core and by using wide registers (wider than 32 bits)
  - Thanks to Moore's law, smaller transistors leave ample space for additional functionality
- Each CPU cycle can now operate on more than one 32-bit value
- Increasing the vector register width requires adding new instructions

#### **History of SIMD Vector Support in Intel Chips**

| Year released    | Name                             | Register width (bits)   | Width (floats)   |
|------------------|----------------------------------|-------------------------|------------------|
| 1996             | MMX (Multimedia Extension)       | 64                      | 2                |
| 1999             | SSE (Streaming SIMD Extension)   | 128                     | 4                |
|                  | SSE2                             | 128                     | 4                |
|                  | SSE3                             | 128                     | 4                |
|                  | SSE4                             | 128                     | 4                |
| 2011             | AVX (Advanced Vector Extensions) | 256                     | 8                |
|                  | AVX2                             | 256                     | 8                |
| 2013             | AVX-512                          | 512                     | 16               |

- Every new generation of SSE or AVX adds a new and improved set of instructions
- Each new generation remains backward compatible with the earlier ones




Source: https://en.wikipedia.org/wiki/Streaming\_SIMD\_Extensions

#### **Check Supported SIMD Instructions**

 Use the following command to check the SIMD instructions supported by your processor

\$ cat /proc/cpuinfo | grep flag | tail -1

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant\_tsc arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm abm cpuid\_fault epb invpcid\_single pti ssbd ibrs ibpb stibp tpr\_shadow vnmi flexpriority ept vpid fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md\_clear flush\_l1d
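The same check can also be done from inside a program. The sketch below is only an illustration: it assumes GCC or Clang on an x86 machine and uses their `__builtin_cpu_supports` built-in; the helper name `supported_simd` is hypothetical. On any other compiler or architecture it simply reports "unknown".

```cpp
#include <string>

// Assumption: the CPU-feature built-ins are only available with GCC/Clang on x86.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

// Hypothetical helper: returns a space-separated list of the SIMD extensions
// (from the history table above) that the running CPU reports.
std::string supported_simd() {
    std::string out;
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();  // initialise the CPU feature data
    if (__builtin_cpu_supports("mmx"))     out += "mmx ";
    if (__builtin_cpu_supports("sse"))     out += "sse ";
    if (__builtin_cpu_supports("sse2"))    out += "sse2 ";
    if (__builtin_cpu_supports("sse4.2"))  out += "sse4.2 ";
    if (__builtin_cpu_supports("avx"))     out += "avx ";
    if (__builtin_cpu_supports("avx2"))    out += "avx2 ";
    if (__builtin_cpu_supports("avx512f")) out += "avx512f ";
#else
    out = "unknown ";  // no portable way to ask on this platform
#endif
    return out;
}
```

Printing `supported_simd()` on the machine whose flags are listed above would be expected to include `avx2` but not `avx512f`, since the flags line contains no AVX-512 entry.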

# **Today's Class**

- Flynn's classification
- SIMD vector extensions
- → SIMD programming techniques
- Limitations of vectorization
- Vector Class Library for SIMD programming



# **SIMD Programming Techniques**

- Applied either at compile time or at link time
  - Compiler-based auto-vectorization
  - Compiler pragmas (e.g., OpenMP simd)
  - Calls to the Vector Class Library (VCL)
  - Hand-coded compiler intrinsics
  - Inline assembly code
- The list runs from "easy to use, but low performance" (top) to "ninja level, but best performance" (bottom)
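To make the spectrum concrete, here is a minimal sketch of one element-wise addition written in three of these styles. The function names are hypothetical. The pragma version only has an effect when the compiler is asked to honour OpenMP simd directives (for example with `-fopenmp-simd` on GCC), and the intrinsics version assumes an x86 target with SSE; on other targets it falls back to the scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// 1. Plain loop: vectorization is left entirely to the compiler.
void add_auto(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 2. OpenMP simd pragma: the programmer asserts the loop is safe to vectorize.
void add_pragma(const float* a, const float* b, float* c, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 3. Hand-coded SSE intrinsics: four floats per 128-bit register.
void add_intrinsics(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // add and store 4 results
    }
#endif
    for (; i < n; ++i) c[i] = a[i] + b[i];         // scalar remainder
}
```

All three produce the same results; they differ in how much control (and how much work) is left to the programmer.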

# **Compiler Perspective**

- Vectorization is similar to loop unrolling
  - Unroll by "N" iterations, where "N" is the vector width

- How do we tell the compiler to vectorize?
  - The Intel compiler starts vectorizing at the -O2 optimization level
  - The GCC compiler starts vectorizing at the -O3 optimization level (GCC 12 and later also enable a conservative form of vectorization at -O2)
  - By default, both compilers use SSE instructions and a 128-bit vector width
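The unrolling analogy can be sketched in plain C++. The example below assumes a vector width of N = 4 (four 32-bit integers in a 128-bit SSE register); the function names are hypothetical. The four statements in the unrolled body correspond to what a single vector add would do, and the trailing loop handles the leftover iterations when the trip count is not a multiple of four.

```cpp
#include <cstddef>

// Original scalar loop: one element per iteration.
void add_scalar(const int* b, const int* c, int* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c[i];
}

// Unrolled by 4: each trip through the main loop handles four elements,
// which is the work one 128-bit vector instruction would perform.
void add_unrolled4(const int* b, const int* c, int* a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    for (; i < n; ++i) a[i] = b[i] + c[i];  // remainder loop
}
```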



# **Today's Class**

- Flynn's classification
- SIMD vector extensions
- SIMD programming techniques
- → Limitations of vectorization
- Vector Class Library for SIMD programming



# **Limitations of Auto-Vectorization**

- Loop iterations should not have different control flow
  - "if" or "switch" statements cannot be used for selective calculation of data elements
    - However, "if" or "switch" statements may be used as masked statements, i.e., the calculation is performed for all elements, but the result is stored selectively

for (i=0; i<N; i++) { int s = B[i] + C[i]; if (s>10) A[i] = s; else A[i] = 0; }

- Loop iterations should be independent, e.g., indirect indexing such as a[b[i]] is not allowed
- The loop should only use basic math functions, e.g., pow, sqrt, ...
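The "masked statement" idea can be illustrated in scalar code. This is a sketch, not the compiler's actual output: the branching function mirrors the loop above, while the masked function computes `s` for every element and then uses a bit mask to decide what is stored. The function names are hypothetical.

```cpp
#include <cstddef>

// Branching version, as in the loop above.
void threshold_branch(const int* B, const int* C, int* A, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int s = B[i] + C[i];
        if (s > 10) A[i] = s; else A[i] = 0;
    }
}

// Masked version: every iteration executes the same instructions.
void threshold_masked(const int* B, const int* C, int* A, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int s = B[i] + C[i];
        int mask = -static_cast<int>(s > 10);  // all ones if s > 10, else 0
        A[i] = s & mask;                       // keeps s or yields 0
    }
}
```

Because both paths now perform the same operations on every element, the loop body has uniform control flow, which is what a vector unit requires.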



# **General Limitations of Vectorization (1/6)**

- General restrictions that apply to both compiler-based auto-vectorization and manual vectorization
  - Loop size should be countable at runtime
    - The loop size need not be known at compile time, but it must not change during execution (runtime)
      - Implies a single entry and a single exit for the loop (no break statements)
  - Only a single type of arithmetic operation, e.g., cannot intermix "+", "\*", "-", etc. within a vector operation
  - Should not have non-contiguous memory accesses

for (i=0; i<N; i+=2) Scalar\_A[i] = Scalar\_B[i] + Scalar\_C[i];

# **General Limitations of Vectorization (2/6)**

- Data dependency (1/4)
  - Read-After-Write (RAW) or Flow Dependency
    - Happens when a variable written in one iteration is read in a later iteration

for (i=1; i<5; i++) { A[i] = A[i-1] + 1; }

- Unrolled iterations:
  - $A[1] = A[0] + 1$ $\checkmark$
  - $A[2] = A[1] + 1$ $\times$ (needs the $A[1]$ written by the previous iteration)
  - $A[3] = A[2] + 1$ $\times$ (needs the $A[2]$ written by the previous iteration)
  - $A[4] = A[3] + 1$ $\times$ (needs the $A[3]$ written by the previous iteration)

- Unsafe for any type of parallel execution of loop iterations including vectorization
  - Imagine each iteration being executed simultaneously using separate cores
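A small simulation makes the hazard visible. The sketch below is not real SIMD code: `raw_vector_like` (a hypothetical name) imitates a 4-wide vector instruction by loading all four inputs before storing any result, which is how a vector load, add, and store sequence behaves.

```cpp
#include <array>

constexpr int N = 5;

// Sequential execution: each iteration sees the value written by the previous one.
std::array<int, N> raw_sequential(std::array<int, N> A) {
    for (int i = 1; i < N; ++i) A[i] = A[i - 1] + 1;
    return A;
}

// Simulated 4-wide vector execution: all inputs A[0..3] are loaded first,
// so iterations 2..4 read stale values instead of the freshly written ones.
std::array<int, N> raw_vector_like(std::array<int, N> A) {
    int loaded[N - 1];
    for (int i = 1; i < N; ++i) loaded[i - 1] = A[i - 1];  // "vector load"
    for (int i = 1; i < N; ++i) A[i] = loaded[i - 1] + 1;  // "vector add + store"
    return A;
}
```

Starting from an array of five zeros, the sequential loop yields {0, 1, 2, 3, 4}, whereas the vector-like execution yields {0, 1, 1, 1, 1}.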




# **General Limitations of Vectorization (3/6)**

- Data dependency (2/4)
  - Write-After-Read (WAR) or Anti Dependency
    - Happens when a variable read in one iteration is written in a later iteration

for (i=1; i<5; i++) { A[i-1] = A[i] + 1; }

- Unrolled iterations:
  - $A[0] = A[1] + 1$ $\checkmark$
  - $A[1] = A[2] + 1$ $\checkmark$
  - $A[2] = A[3] + 1$ $\checkmark$
  - $A[3] = A[4] + 1$ $\checkmark$

- Unsafe for general parallel execution of loop iterations but totally safe for vectorization
  - During vectorization, iterations with higher value of "i" will complete only after iterations with lower value of "i" have completed
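The difference between the two cases can be checked with a small simulation. This is a sketch under simplifying assumptions: `war_vector_like` imitates a 4-wide vector instruction that loads all inputs before storing, and `war_reversed` stands in for one possible ordering that unconstrained parallel execution could produce. All function names are hypothetical.

```cpp
#include <array>

constexpr int N = 5;

// Sequential reference.
std::array<int, N> war_sequential(std::array<int, N> A) {
    for (int i = 1; i < N; ++i) A[i - 1] = A[i] + 1;
    return A;
}

// Simulated 4-wide vector execution: A[1..4] are all read before
// A[0..3] are written, so every read still sees the original value.
std::array<int, N> war_vector_like(std::array<int, N> A) {
    int loaded[N - 1];
    for (int i = 1; i < N; ++i) loaded[i - 1] = A[i];          // "vector load"
    for (int i = 1; i < N; ++i) A[i - 1] = loaded[i - 1] + 1;  // "vector store"
    return A;
}

// Iterations executed in reverse order: later iterations overwrite values
// that earlier iterations were supposed to read.
std::array<int, N> war_reversed(std::array<int, N> A) {
    for (int i = N - 1; i >= 1; --i) A[i - 1] = A[i] + 1;
    return A;
}
```

For the input {10, 20, 30, 40, 50}, the sequential and vector-like versions both give {21, 31, 41, 51, 50}, while the reversed order gives {54, 53, 52, 51, 50}.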




# **General Limitations of Vectorization (4/6)**

- Data dependency (3/4)
  - Write-After-Write (WAW) or Output Dependency
    - Happens when the same variable is written in more than one iteration
- Unsafe for any type of parallel execution of loop iterations, including vectorization
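As a hypothetical illustration (this example is not taken from the slides), consider a loop in which every iteration writes the same variable. The final value depends on which iteration writes last, so any reordering of the iterations can change the result.

```cpp
#include <cstddef>

// Every iteration writes x; sequentially, the last iteration wins.
int last_written_forward(const int* B, std::size_t n) {
    int x = 0;
    for (std::size_t i = 0; i < n; ++i) x = B[i];
    return x;
}

// The same iterations in reverse order, standing in for one schedule
// that a parallel execution might produce: now the first iteration wins.
int last_written_reversed(const int* B, std::size_t n) {
    int x = 0;
    for (std::size_t i = n; i-- > 0;) x = B[i];
    return x;
}
```

With B = {1, 2, 3, 4}, the forward loop returns 4 and the reversed loop returns 1.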



# **General Limitations of Vectorization (5/6)**

- Data dependency (4/4)
  - Read-After-Read (RAR)
    - Totally safe for both general parallelization and vectorization

# **General Limitations of Vectorization (6/6)**

- Pointer aliasing
  - Aliasing refers to a situation where two different expressions or symbols refer to the same object
  - Pointer aliasing may lead to data dependencies
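A short sketch shows how aliasing creates a dependency. The function names are hypothetical, and `__restrict__` is a GCC/Clang extension rather than standard C++ (other compilers spell it differently), so treat the second function as an assumption about the toolchain.

```cpp
#include <cstddef>

// The compiler cannot tell whether dst and src overlap, so it must either
// prove independence, add a runtime overlap check, or stay scalar.
void add_one(int* dst, const int* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] + 1;
}

// Here the programmer promises that dst and src never overlap, which removes
// the possible dependency. Calling it with overlapping pointers is undefined.
void add_one_restrict(int* __restrict__ dst, const int* __restrict__ src,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] + 1;
}
```

Calling `add_one(A + 1, A, 4)` on a five-element array makes `dst[i]` and `src[i + 1]` the same object, so the loop becomes exactly the RAW pattern `A[i] = A[i-1] + 1`.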



#### **Amdahl's Law for Vectorized Code**



- Assume some work takes time $W$ on a scalar CPU
- Time taken on a CPU with vector width $N$, when a fraction $f$ of that work is vectorized
  - $\text{Time}_{\text{scalar}} + \text{Time}_{\text{vector}} = (1-f)W + fW/N$
- Hence, the maximum possible speedup is
  - $W / \{(1-f)W + fW/N\} = 1 / \{(1-f) + f/N\}$

Picture source: https://cvw.cac.cornell.edu/vector/performance\_amdahl


- Linear speedup is possible only for perfectly parallel code
- The exact upper bound depends significantly on the percentage of code that is vectorized
  - At a vector width of 16, code that is 60% vectorized runs only about 2.3 times as fast as non-vectorized code, since $1 / (0.4 + 0.6/16) \approx 2.29$
- Sequential or scalar code would limit the performance
  - What about memory access pattern?
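The speedup bound $1 / \{(1-f) + f/N\}$ is easy to evaluate for other cases. The helper below is a minimal sketch; the name `vector_speedup` is hypothetical.

```cpp
// Amdahl's law for vectorized code:
//   f = fraction of the work that is vectorized (0.0 to 1.0)
//   n = vector width (number of elements processed per instruction)
double vector_speedup(double f, double n) {
    return 1.0 / ((1.0 - f) + f / n);
}
```

For example, `vector_speedup(0.6, 16)` is about 2.29, while `vector_speedup(1.0, 16)` reaches the full factor of 16 and `vector_speedup(0.0, 16)` stays at 1.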

#### **Memory Access Pattern Affects Performance**



- Moving data into and out of vector registers involves several levels of the memory hierarchy
- Make use of temporal and spatial locality for getting best performance
  - True for all kinds of parallelization
  - Avoid using bad loop stride for vectorization
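The effect of loop stride can be sketched with two traversals of the same row-major matrix. Both functions (hypothetical names) compute the same sum; only the order of memory accesses differs. The first touches consecutive addresses in its inner loop, which suits both the cache and vector loads, while the second jumps by `cols` elements on every step.

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector: element (r, c) lives at m[r * cols + c].

// Unit stride: the inner loop walks consecutive memory locations.
long sum_row_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Stride of `cols`: the inner loop jumps a whole row between accesses.
long sum_column_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```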

Picture source: https://cvw.cac.cornell.edu/vector/performance\_memory




### **Today's Class**

- Flynn's classification
- SIMD vector extensions
- SIMD programming techniques
- Limitations of vectorization
- → Vector Class Library for SIMD programming

Excerpt from the "Current Methods" section of an article by Marissa du Bois, Pete Brubaker, and Dominic Milano (published 08/02/2019, last updated 08/02/2019):

SIMD programming techniques are applied at compile time or link time. These techniques include:

- Compiler-based auto-vectorization
- Calls to vector-class libraries
- Hand-coded intrinsics
- Inline assembly

Each approach has its pros and cons. Hand-coded intrinsics, when handled by an intrinsics ninja, typically deliver excellent results. But hand-coding for a wide range of SIMD instruction sets can increase complexity, is time-consuming, and increases maintenance costs. For example, if you want to target multiple ISAs, you need to write multiple algorithms. This decreases productivity and increases code complexity.

Auto-vectorizing compilers can reduce the complexity of targeting multiple ISAs, but they are far from perfect. The compiler is no substitute for an experienced programmer. The programmer is often left to optimize manually, using SIMD intrinsics or inline assembly.


#### **Programming using Vector Class Library (VCL)**

- C++17 library for writing vector code without using assembly language or compiler intrinsics
- Header-only implementation, i.e., no installation required
- The programmer picks a vector class of the appropriate width and compiles with a native compiler (GNU, Clang, Intel icc, etc.)
  - A compiler flag specifies the desired SIMD instruction set (SSE4, AVX2, AVX512, etc.)
    - Must be supported by the processor
- Supported on Windows, Linux, and Mac, 32-bit and 64-bit, with Intel, AMD, etc.



# VCL Usage

- 1. Constructing vectors using VCL
  - Vec4i a; (elements are left uninitialized)
  - Vec4i a(5); (all four elements are set to 5)
  - Vec4i a = 6; (all four elements are set to 6)
  - Vec4i a(1, 4, 9, 0); (one value per element)
- 2. Loading data into vectors
  - Vec4i a(0); a.insert(/\*index\*/ 2, /\*value\*/ 9);
  - Vec4i a; a.load(array + index);
- 3. Getting data from vectors
  - Vec4i a(1, 4, 9, 0); int array[SIZE]; a.store(array + index);
  - Vec4i a(1, 4, 9, 0); int element\_index2 = a[2];
- 4. Arithmetic operations on vectors
  - +, -, \*, /, ++, +=, -=, \*=, /=, /\*many more\*/
- 5. Logical (comparison) operations on vectors
  - ==, !=, >, <, <=, >=, /\*many more\*/
- 6. Functions operating on single vectors
  - horizontal\_add, horizontal\_min, horizontal\_max, abs, /\*many more\*/
- 7. Functions operating on two vectors
  - min, max, /\*many more\*/



g++ -std=c++17 -O3 -msse4 -fopt-info-vec -I/path\_to/VCL/version2 sum.cpp
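The operations listed above can be combined into a small array-sum routine. This is a sketch and not the `sum.cpp` from the course: the function name `sum_array` and the `HAVE_VCL` macro are hypothetical, and the code assumes that `vectorclass.h` from VCL version 2 is on the include path. When the header is not found, it falls back to a plain scalar loop so that the example still compiles.

```cpp
// Use VCL only when its header can be found (assumption: VCL version 2).
#if defined(__has_include)
#  if __has_include("vectorclass.h")
#    include "vectorclass.h"
#    define HAVE_VCL 1
#  endif
#endif
#ifndef HAVE_VCL
#  define HAVE_VCL 0
#endif

// Sums n integers, four at a time when VCL is available.
int sum_array(const int* data, int n) {
    int i = 0;
    int total = 0;
#if HAVE_VCL
    Vec4i acc(0);                 // four partial sums, all starting at 0
    for (; i + 4 <= n; i += 4) {
        Vec4i chunk;
        chunk.load(data + i);     // load four consecutive ints
        acc += chunk;             // element-wise add into the partial sums
    }
    total = horizontal_add(acc);  // add the four partial sums together
#endif
    for (; i < n; ++i) total += data[i];  // scalar remainder (or whole array)
    return total;
}
```

Keeping four partial sums and reducing them once at the end avoids a horizontal operation inside the loop, which is the usual way to vectorize a reduction.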

#### **Performance Benefits using SIMD**



Code available in CSE513 GitHub repo:

https://github.com/hipec/cse513/blob/main/lec15/tests/par\_matmul.cpp


Four variants of 1024x1024 matrix multiplication of floats:

- Sequential (**Baseline**)
- Sequential, vectorized with Vec8f
- Recursive task parallel using Argolib, without vectorization (20 threads)
- Recursive task parallel using Argolib, with leaf tasks vectorized with Vec8f (20 threads)

Experimental setup:

- Two sockets of 10-core Intel Xeon E5-2650 v3 processors
- GNU compiler 7.5.0 using -O3 and the SSE4 instruction set
- VCL version2, commit id 08959eb
- Ubuntu 16.04.7 LTS

### **Reference Materials**

- Intel guide for auto vectorization
  - https://www.intel.com/content/dam/www/public/us/en/documents/guides/compiler-auto-vectorization-guide.pdf
- Cornell virtual workshop for vectorization
  - https://cvw.cac.cornell.edu/vector/
- VCL: Vector Class Library
  - https://www.agner.org/optimize/vcl\_manual.pdf
  - https://github.com/vectorclass/version2



#### **Next Lecture**

- GPU programming

