# Lecture 20: False Sharing

### Vivek Kumar, Computer Science and Engineering, IIIT Delhi, vivekk@iiitd.ac.in

© Vivek Kumar

## **Today's Class**

- False sharing
- Runtime solutions for detecting/repairing false sharing
  - Sheriff
  - Featherlight

Acknowledgement: Today's lecture slides are adapted from several conference presentation slides on false sharing that are available online.






CSE513: Parallel Runtimes for Modern Processors

## **False Sharing**

```
int count[8]; // Global array shared by all threads

void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++; // Each thread increments only its own element
}
```
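Why does this loop suffer from false sharing even though no two threads touch the same element? A minimal sketch of the memory layout is given below. It assumes a 64-byte cache line (the `CACHE_LINE` constant is an assumption, not something stated in the slides), and the names `padded_counter`, `padded_count`, `line_of`, and `lines_touched` are hypothetical helpers introduced only for illustration. It counts how many distinct cache lines the eight counters occupy in the original layout and in a padded layout, which is one common way to avoid the problem.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size; check the value for your CPU */
#define NTHREADS 8

/* Original layout: 8 ints = 32 bytes, so all counters fit in one line. */
alignas(CACHE_LINE) int count[NTHREADS];

/* Padded layout: the alignment rounds each element up to a full line. */
struct padded_counter {
    alignas(CACHE_LINE) int value;
};
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each padded counter should occupy exactly one cache line");

struct padded_counter padded_count[NTHREADS];

/* Index of the cache line that holds address p. */
static uintptr_t line_of(const void *p) {
    return (uintptr_t)p / CACHE_LINE;
}

/* Number of distinct cache lines covered by n elements that start at
 * base and are spaced stride bytes apart. */
size_t lines_touched(const void *base, size_t stride, size_t n) {
    const char *b = base;
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++)
            if (line_of(b + i * stride) == line_of(b + j * stride))
                seen = 1;
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Calling `lines_touched(count, sizeof count[0], NTHREADS)` reports one line for the original array, so every `count[id]++` invalidates the line in the other cores' caches. The padded array spreads the counters over eight lines, at the cost of 64 bytes per counter instead of 4.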



### False Sharing vs. True Sharing










Sources: <https://people.umass.edu/tongping/pubs/dthreads-final.pptx> and <https://people.umass.edu/tongping/pubs/sheriff-final.pptx>


#### **Resource Contention at Cache Line Level**





#### False Sharing is Everywhere




Sources: <https://people.umass.edu/tongping/pubs/dthreads-final.pptx> and <https://people.umass.edu/tongping/pubs/sheriff-final.pptx>

## **Detecting / Removing False Sharing**

- Solutions based on instrumenting memory accesses and the O.S.
  - Sheriff
    - Can both detect and resolve false sharing at runtime
    - Replaces threads with processes
    - Uses the page protection mechanism to track false sharing
  - and many more...
- Solutions based on hardware Performance Monitoring Units (PMUs)
  - Featherlight
    - Uses lightweight profiling with PMUs and debug registers
    - Addresses several shortcomings of prior implementations
      - Doesn't require instrumenting memory accesses or the O.S.
      - Extremely low overheads
  - and many more...
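The page protection mechanism listed under Sheriff can be illustrated with standard POSIX calls. The sketch below is not Sheriff's code: it only shows, under the assumption of a Linux system, how a write to a read-only page raises `SIGSEGV`, how a handler can record that first write, and how the page is then made writable so that later writes run at full speed. The names `track_first_write`, `on_fault`, and `first_writes` are hypothetical. Calling `mprotect` from a signal handler is not guaranteed to be async-signal-safe by POSIX, although it is a common practice on Linux.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *tracked_page;        /* page being watched */
static size_t page_size;
static volatile sig_atomic_t first_writes; /* faults seen on the page */

/* Fault handler: record the first write, then unprotect the page so the
 * faulting store is retried and succeeds. */
static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    unsigned char *addr = info->si_addr;
    if (addr < tracked_page || addr >= tracked_page + page_size)
        _exit(1); /* a genuine crash somewhere else */
    first_writes++;
    if (mprotect(tracked_page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
}

/* Returns the number of faults caused by two writes to one protected
 * page, or -1 on setup failure. */
int track_first_write(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    tracked_page = mmap(NULL, page_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tracked_page == MAP_FAILED)
        return -1;

    struct sigaction sa = {0}, old;
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(tracked_page, page_size);
        return -1;
    }

    first_writes = 0;
    volatile unsigned char *p = tracked_page;
    p[0] = 1; /* faults once; the handler unprotects the page */
    p[8] = 2; /* same page, already writable: no fault */
    int faults = first_writes;

    sigaction(SIGSEGV, &old, NULL); /* restore the previous handler */
    munmap(tracked_page, page_size);
    return faults;
}
```

Only the first write to a page pays the cost of a fault, which is why page-level tracking of this kind is much cheaper than instrumenting every memory access.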


### Walkthrough of Sheriff Execution






### **Sheriff Execution: Process Creation**






- In Linux, both pthreads and processes are essentially kernel-level threads (KLTs), and both are created through the same kernel routine (`do_fork`)
- Threads are created on the same CPU to improve locality, whereas processes are created on different CPUs








### **Sheriff Execution: Initialization**

- Advantages of converting threads into processes
  - Enables per-thread page protection, allowing Sheriff to track memory accesses by different threads (processes)
  - Each thread's (process's) memory accesses are isolated, so two threads never update the same cache line
    - No false sharing!
- Memory-mapped files are used to share globals and the heap across processes
- Twin copies of the pages that store globals and the heap
  - Shared mapping for holding shared state
    - Pages storing this shared state are marked copy-on-write
  - Private mapping for per-process updates
    - A private copy of a shared page is created the first time a process attempts to update that page


#### Sheriff Execution: Execution

(Figure: execution on Core 1 and Core 2)



## **Sheriff Execution: Synchronization**

- There are two different types of synchronization points
  - Thread termination
  - End of the critical section (mutex unlock), barriers, etc.
- At each synchronization point, Sheriff commits changes from private pages to the shared pages
  - It commits only the differences between the twin and the modified pages
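The commit step described above can be sketched as a comparison against the twin. The code below is a minimal illustration that assumes a byte-granularity diff and a hypothetical function name `commit_diff`; it is not taken from Sheriff's source. Only bytes that the process actually changed relative to its twin are written to the shared page, so updates made by other processes to other bytes of the same page are preserved.

```c
#include <assert.h>
#include <stddef.h>

/* Copies to shared only those bytes of priv that differ from twin,
 * where twin is the snapshot taken before the process first wrote to
 * the page. Returns the number of bytes committed. */
size_t commit_diff(unsigned char *shared, const unsigned char *twin,
                   const unsigned char *priv, size_t n) {
    assert(shared != NULL && twin != NULL && priv != NULL);
    size_t committed = 0;
    for (size_t i = 0; i < n; i++) {
        if (priv[i] != twin[i]) { /* this process modified byte i */
            shared[i] = priv[i];
            committed++;
        }
    }
    return committed;
}
```

If two processes start from the same twin and one writes byte 0 while the other writes byte 4, committing both private pages leaves both updates in the shared page. Copying whole pages instead would let the second commit overwrite the first process's update.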






### Featherlight: High Level Overview



Debug registers enable trapping CPU execution for debugging when the PC (program counter) reaches an address (breakpoint) or an instruction accesses a designated address (watchpoint)

- PMUs sample memory address **M** accessed by each thread (or process)
- A thread publishes its sampled address at a common location visible to other threads
- Other threads use hardware debug registers to monitor the addresses sharing the same cache line as M, excluding M itself
- If a thread accesses another variable in the same cache line, the debug register traps, which indicates false sharing
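The decision rule in the steps above can be written as a small classification function. This sketch leaves out the PMU sampling and the debug-register setup entirely and shows only the address arithmetic. It assumes a 64-byte cache line, accesses that do not straddle a line boundary, and that the two accesses come from different threads with at least one of them being a write. The names `sharing_kind` and `classify_access` are hypothetical and do not come from Featherlight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size */

enum sharing_kind { NO_SHARING, TRUE_SHARING, FALSE_SHARING };

/* m, m_len: address and size of the access sampled by the PMU (M).
 * t, t_len: address and size of the access that another thread made. */
enum sharing_kind classify_access(uintptr_t m, size_t m_len,
                                  uintptr_t t, size_t t_len) {
    assert(m_len > 0 && t_len > 0);
    if (m / CACHE_LINE != t / CACHE_LINE)
        return NO_SHARING;    /* different cache lines */
    if (m < t + t_len && t < m + m_len)
        return TRUE_SHARING;  /* byte ranges overlap: same data */
    return FALSE_SHARING;     /* same line, disjoint bytes */
}
```

For example, two 4-byte accesses at `0x1000` and `0x1004` fall in the same line without overlapping, so they are classified as false sharing. Watching the rest of the line while excluding M itself corresponds to trapping only the third case.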



### **Reading Materials**

- Sheriff
  - <https://people.umass.edu/tongping/pubs/sheriff-oopsla11.pdf>
- Featherlight
  - <https://dl.acm.org/doi/10.1145/3178487.3178499>



### Next Lecture (L #21)

- Data race detection in task parallel programs

