



# **Energy-Aware Runtime Resource Harmonizer for Co-running Applications**

Vanshika Jain<sup>1</sup>, Varun Parashar<sup>1</sup>, Vivek Kumar<sup>1</sup>, Chiranjib Sur<sup>2</sup>

<sup>1</sup> IIIT Delhi, New Delhi, India; <sup>2</sup> Shell India Markets Pvt. Ltd, India





# Outline

- Introduction
- Motivation
- Existing Approaches
- Contributions
- Implementation
- Results
- Conclusion



#### Resource Utilization in the Exascale Era

Increasing number of sockets and cores per node

| Rank of Top500<br>(November 2025) | Sockets<br>Per Node | Cores<br>Per Node |
|-----------------------------------|---------------------|-------------------|
| 1                                 | 4                   | 96                |
| 2                                 | 1                   | 64                |
| 3                                 | 2                   | 104               |
| 4                                 | 4                   | 288               |
| 5                                 | 2                   | 96                |






#### Power usage at supercomputers








It is critical to improve resource utilization for achieving energy efficiency



2. Patki et al. [ICS2025]



Applications A, B & C to be executed on a quad-socket system





Applications B & C waiting for the CPUs





Application C waiting for the CPUs





**Batch Execution** 

Each application completes its execution one at a time





**Batch Execution** 



Threads of Application A, B & C running in parallel

Co-running Execution























**Type: Block-Cyclic**



















**Type: Interleaved** 













**Type: Block-Interleaved** 







Choosing optimal thread placement over Batch execution improves EDP by up to 81%
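The EDP (energy-delay product) metric used here multiplies the energy consumed by the execution time, so lower is better. The following is a minimal sketch of how an EDP improvement could be computed. The function names and the sample numbers are hypothetical and are not measurements from the talk.

```python
def edp(energy_joules: float, time_seconds: float) -> float:
    """Energy-delay product: energy consumed multiplied by execution time."""
    return energy_joules * time_seconds


def edp_improvement(baseline_edp: float, new_edp: float) -> float:
    """Percentage reduction in EDP relative to a baseline (positive means better)."""
    return 100.0 * (baseline_edp - new_edp) / baseline_edp


# Hypothetical measurements, for illustration only.
batch = edp(energy_joules=1000.0, time_seconds=10.0)
corun = edp(energy_joules=800.0, time_seconds=5.0)
improvement = edp_improvement(batch, corun)
```

Because both energy and time enter the product, a configuration that saves energy but runs much longer can still score worse than the baseline.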









**Achieving Energy Efficiency on Multicores**

- Dynamic Voltage and Frequency Scaling (DVFS)
  - Core-level
- Uncore Frequency Scaling (UFS)
  - Socket-level










- Dynamic Concurrency Throttling (DCT)
  - Adjusts application-level parallelism by controlling core allocation
  - The thread packing and unpacking technique provides runtime independence

The heatmap shows the change in EDP as the core count varies, relative to the default of maximum core allocation
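One way to picture core allocation control is through CPU affinity: shrinking the set of cores a thread may run on packs work onto fewer cores without involving the application's parallel runtime. The sketch below assumes Linux and uses Python's `os.sched_setaffinity`. The helper names `pack_cores` and `throttle_thread` are hypothetical, and nothing here is taken from Harmonizer's actual implementation.

```python
import os


def pack_cores(available_cores, n_cores):
    """Return the n_cores lowest-numbered cores from available_cores."""
    if not 1 <= n_cores <= len(available_cores):
        raise ValueError("n_cores must be between 1 and the number of available cores")
    return set(sorted(available_cores)[:n_cores])


def throttle_thread(tid, full_core_set, n_cores):
    """Restrict the thread `tid` to n_cores cores chosen from full_core_set.

    On Linux the affinity mask is per thread, so a real runtime would have
    to apply this to every worker thread. Passing the full core count
    "unpacks" the thread again.
    """
    target = pack_cores(full_core_set, n_cores)
    os.sched_setaffinity(tid, target)
    return target
```

Keeping the full core set as an explicit argument matters: once a mask has been shrunk, querying the current affinity only returns the shrunken set, so unpacking needs the original.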






## Insights

- Choosing optimal thread placement improves resource utilization for co-running applications
- There is a strong correlation between application behavior and resource requirement

#### Cache-Sensitive and Neutral



#### **Memory-bound**





# **Existing Approaches for Co-Running Applications**

| Categories of Resource<br>Management Techniques | DCT<br>only | DVFS<br>only | UFS<br>only | DVFS+<br>UFS | DCT+<br>DVFS+<br>UFS |
|-------------------------------------------------|-------------|--------------|-------------|--------------|----------------------|
| Thread Placement for contention reduction       | ✓           | ×            | ×           | ×            | ×                    |
| Runtime Oblivious                               | ✓           | ✓            | ×           | ×            | ×                    |
| Model Free                                      | ✓           | ✓            | ✓           | ✓            | ×                    |



**Our Focus:** the combined DCT+DVFS+UFS category



# Contributions

- Harmonizer: a library-based resource management framework for co-running applications on multicore, multi-socket servers
  - Model-free and runtime oblivious
- Dynamically manages thread placement, core frequency, uncore frequency, and core allocation
  - Uses a lightweight daemon for online profiling of hardware PMCs
- Experimental evaluation on a quad-socket, 72-core Intel Xeon processor
  - Uses several exascale proxy applications (OpenMP, Kokkos, and HCLib)
- Results
  - Demonstrates substantial energy savings and performance gains









































- Classify each application's Memory Access Pattern (MAP)
  - Core-level PMCs
    - Cache misses
    - Cache accesses
  - Uncore PMCs (Socket-level)
    - Integrated Memory Controller (IMC) accesses
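The counters listed above can be combined into a simple classifier. The sketch below is only one plausible decision rule: the thresholds, the `PmcSample` type, and the order of the checks are all assumptions made for illustration, since the slides do not state how Harmonizer maps counter values to the three categories (cache-sensitive, memory-bound, neutral).

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MISS_RATIO_THRESHOLD = 0.30          # cache misses / cache accesses
IMC_ACCESS_RATE_THRESHOLD = 1.0e8    # IMC accesses per second


@dataclass
class PmcSample:
    cache_misses: int        # core-level PMC
    cache_accesses: int      # core-level PMC
    imc_accesses: int        # uncore (socket-level) PMC
    interval_seconds: float  # length of the profiling interval


def classify_map(sample: PmcSample) -> str:
    """Assign a Memory Access Pattern label using the assumed thresholds."""
    if sample.cache_accesses:
        miss_ratio = sample.cache_misses / sample.cache_accesses
    else:
        miss_ratio = 0.0
    imc_rate = sample.imc_accesses / sample.interval_seconds
    if imc_rate >= IMC_ACCESS_RATE_THRESHOLD:
        return "memory-bound"
    if miss_ratio >= MISS_RATIO_THRESHOLD:
        return "cache-sensitive"
    return "neutral"
```

Note that IMC counters are per socket, so attributing them to one application is only straightforward when that application has the socket to itself.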





























### Harmonizer Policy










Optimal placement for a particular mix

Cache Sensitive – Neutral – Cache Sensitive

(Block-Cyclic to minimize LLC sharing)







UFS is used to explore the optimal uncore frequency (UF)

In this mix, UFS exploration is possible on only two sockets, because UFS applies at the socket level



Harmonizer rearranges threads across sockets to maximize application isolation while retaining the behaviour of Block-Cyclic placement








UFS exploration is now possible on three sockets instead of two

The exploration space is reduced based on the MAP identified on each socket
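The effect of rearranging threads can be illustrated by counting the sockets that host a single application, since those are the sockets where a socket-wide uncore frequency affects only one application. The two layouts below are hypothetical: they assume three applications with 24 threads each on four 18-core sockets and are not the exact placements shown on the slides.

```python
# Each placement lists, per socket, how many cores each application occupies.
# Both layouts are assumptions made for illustration.
before = [
    {"A": 18},
    {"A": 6, "B": 12},
    {"B": 12, "C": 6},
    {"C": 18},
]
after = [
    {"A": 18},
    {"B": 18},
    {"C": 18},
    {"A": 6, "B": 6, "C": 6},
]


def ufs_explorable_sockets(placement):
    """Indices of sockets whose occupied cores all belong to one application."""
    return [i for i, apps in enumerate(placement) if len(apps) == 1]


isolated_before = ufs_explorable_sockets(before)
isolated_after = ufs_explorable_sockets(after)
```

In this sketch the rearranged layout gathers the leftover threads of every application onto one shared socket, which leaves three sockets isolated rather than two.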







DVFS is used to explore the optimal core frequency (CF) for each application



The exploration space is reduced based on the MAP identified for each application
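A model-free exploration can be pictured as trying candidate frequencies online and keeping the one with the lowest measured EDP, with the MAP used to prune the candidates first. The sketch below assumes a simple pruning rule (lower half of the range for memory-bound applications, upper half otherwise) and a caller-supplied `measure_edp` callback. Both are hypothetical and are not taken from the talk.

```python
def reduced_range(all_freqs_mhz, map_class):
    """Prune the candidate core frequencies using an assumed MAP-based rule."""
    freqs = sorted(all_freqs_mhz)
    mid = len(freqs) // 2
    if map_class == "memory-bound":
        return freqs[: mid + 1]   # assumed: high core frequency helps little
    return freqs[mid:]            # assumed: compute benefits from higher CF


def explore(candidates_mhz, measure_edp):
    """Measure each candidate once and return the one with the lowest EDP."""
    return min(candidates_mhz, key=measure_edp)


# Hypothetical frequency steps and a synthetic EDP curve with a minimum at 1500 MHz.
FREQS_MHZ = [1000, 1500, 2000, 2500]


def synthetic_edp(freq_mhz):
    return (freq_mhz - 1500) ** 2 + 1


best_memory_bound = explore(reduced_range(FREQS_MHZ, "memory-bound"), synthetic_edp)
best_neutral = explore(reduced_range(FREQS_MHZ, "neutral"), synthetic_edp)
```

Pruning trades optimality for fewer online trials: in the synthetic example the neutral case never visits 1500 MHz and settles on the best value inside its reduced range.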



Uniform DVFS settings on every core of a socket that hosts a single application

Non-uniform DVFS settings on a socket that hosts threads from multiple applications
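The uniform and non-uniform cases follow naturally if each core simply receives the frequency chosen for the application whose thread it hosts. The following sketch makes that explicit. The core numbering, application names, and frequencies are hypothetical.

```python
def per_core_frequencies(core_to_app, app_to_freq_mhz):
    """Give each core the frequency selected for the application running on it."""
    return {core: app_to_freq_mhz[app] for core, app in core_to_app.items()}


def is_uniform(core_freqs, socket_cores):
    """True when every core of the socket ends up with the same frequency."""
    return len({core_freqs[c] for c in socket_cores}) == 1


# Hypothetical layout: socket 0 (cores 0-17) hosts only A,
# socket 3 (cores 54-71) hosts six threads each of A, B, and C.
core_to_app = {c: "A" for c in range(0, 18)}
core_to_app.update({c: "A" for c in range(54, 60)})
core_to_app.update({c: "B" for c in range(60, 66)})
core_to_app.update({c: "C" for c in range(66, 72)})

app_to_freq_mhz = {"A": 2000, "B": 1500, "C": 2500}  # assumed exploration results
core_freqs = per_core_frequencies(core_to_app, app_to_freq_mhz)

socket0_uniform = is_uniform(core_freqs, range(0, 18))
socket3_uniform = is_uniform(core_freqs, range(54, 72))
```

This works because DVFS applies per core, whereas the uncore frequency is shared by the whole socket.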











#### **Exascale proxy applications**

| Application Type | Application                                                 |
|------------------|-------------------------------------------------------------|
| Cache Sensitive  | SimpleMOC (OpenMP)<br>MinTally (OpenMP)<br>XSBench (OpenMP) |
| Memory Bound     | HPCCG (OpenMP)<br>MiniFE (<b>Kokkos</b>)                    |
| Neutral          | CoHMM (<b>HCLib</b>)<br>CoMD (OpenMP)                       |




| Number of<br>Applications in a Mix | Number of Mixes |
|------------------------------------|-----------------|
| 3                                  | 6               |
| 4                                  | 3               |




#### **Hardware Platform**

- Quad-socket Intel Xeon 5318H (Cooper Lake)
- 18 cores per socket, 72 cores in total (144 CPUs)




#### State-of-the-Art used for comparison

- Mapper (TACO'22)
- NuPoCo (PACT'18)




#### Evaluation

#### **EDP of Harmonizer Relative to Default**

Improvement in EDP from individual policies

| Harmonizer<br>Policy | Mean EDP<br>Improvement<br>(Mix1-Mix3) | Mean EDP<br>Improvement<br>(Mix4-Mix6) | Mean EDP<br>Improvement<br>(Mix7-Mix9) |
|----------------------|----------------------------------------|----------------------------------------|----------------------------------------|
| Thread Placement     | 7.3%                                   | 26.6%                                  | 14.6%                                  |
| UFS                  | 3%                                     | N/A                                    | N/A                                    |
| DVFS                 | N/A                                    | 3%                                     | 3.2%                                   |
| DCT                  | N/A                                    | 13.7%                                  | 9.3%                                   |



### System Throughput Relative to Default



#### **System Throughput**

Geometric mean of speedup of each application
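The throughput metric defined above can be computed as follows. This is a minimal sketch: the function names and the sample timings are hypothetical, and speedup is taken as the Default execution time divided by the new execution time.

```python
import math


def speedup(default_time_s: float, new_time_s: float) -> float:
    """Speedup of one application relative to its Default execution time."""
    return default_time_s / new_time_s


def system_throughput(speedups):
    """Geometric mean of the per-application speedups."""
    if not speedups:
        raise ValueError("at least one speedup is required")
    return math.prod(speedups) ** (1.0 / len(speedups))


# Hypothetical timings (seconds) for a three-application mix.
mix_speedups = [speedup(20.0, 10.0), speedup(10.0, 20.0), speedup(15.0, 15.0)]
throughput = system_throughput(mix_speedups)
```

The geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, so one application cannot dominate the score the way it could with an arithmetic mean.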



#### Conclusion

# **Summary**

- Effective system utilization is key to improving energy efficiency in the exascale era
  - Co-running applications can improve system utilization by complementing each other's resource requirements
- Harmonizer dynamically profiles the core and uncore PMCs to characterize the behaviour of co-running applications
  - It applies optimal thread placement to improve system utilization
  - It dynamically tunes each socket's core and uncore frequencies, and application-level core allocation, to enhance energy efficiency
- Future Work
  - We plan to extend Harmonizer to handle dynamically varying memory access patterns in applications and scale it to cluster-level environments



# Thank You



Scan to access the Harmonizer artifact