# Mutually Exclusive Device Allocations in DRA

## Summary

This document proposes an extension to the Dynamic Resource Allocation (DRA) API to support mutually exclusive device allocation constraints.

## Motivation

Hardware devices often support multiple partitioning or virtualization schemes that provide different trade-offs in terms of isolation, performance, and resource sharing. However, these schemes are frequently mutually exclusive at the hardware level: once a physical device is partitioned or configured using one scheme, it cannot be reconfigured to use a different scheme until all existing allocations are released.

### Goals

- Allow DRA drivers to specify compatibility between virtual devices within a single physical device
- Allow the scheduler to make informed allocation decisions that respect compatibility rules
- Provide a generic mechanism applicable to any hardware with partitioning constraints
- Maintain backward compatibility with existing ResourceSlice specifications

### Non-Goals

- Allow DRA drivers to specify compatibility between physical or virtual devices across different physical devices or device classes

### Problem Statement

The current Partitionable Devices API does not provide a mechanism to express mutual exclusivity constraints between devices. Without this capability:

1. **Late Failure Detection**: Incompatible allocations are only detected during resource preparation (after scheduling decisions are made)
2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices, leading to pod startup failures
3. **Poor User Experience**: Users receive cryptic preparation failures instead of clear scheduling feedback
4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible allocations

**Current Workaround Limitations:** DRA drivers must fail resource preparation when incompatible allocations are attempted.
### Use Case

**Generic Example:** Consider a physical accelerator device that supports four distinct operational modes. In all cases, **Partitionable Devices** is used:

1. **Exclusive Mode**: The entire physical device is allocated to a single consumer
2. **Software-Partitioned Mode A**: Multiple consumers share the physical device through virtual devices
3. **Software-Partitioned Mode B**: Multiple consumers share the physical device through virtual devices
4. **Hardware-Partitioned Mode**: The device is divided into distinct isolated hardware partitions

These modes have compatibility constraints:

- **Exclusive Mode** is incompatible with all other modes
- **Software-Partitioned Mode A** may be compatible with **Software-Partitioned Mode B**, but not with the exclusive and hardware-partitioned modes
- **Hardware-Partitioned Mode** creates fixed partitions that cannot coexist with other partitioning modes

The constraint is bidirectional and transitive: if partition mode A excludes partition mode B, then allocating A must prevent B from being allocated, and vice versa.

#### GPU Example

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
  sharedCounters:
  # This counter set represents a specific physical device.
  - name: gpu-0-cs
    counters:
      multiprocessors:
        value: "152"
  devices:
  # Incompatible with any other partitioning schemes
  - name: gpu-0
    bindsToNode: true
    consumesCounters:
    - counterSet: gpu-0-cs
      counters:
        multiprocessors:
          value: "152"
    attributes:
      partitioningMode:
        string: None
  # Incompatible with MIGSlicing and None partitioning modes,
  # but compatible with MPSSharing mode
  - name: gpu-0-fraction-0
    bindsToNode: true
    allowMultipleAllocations: true
    consumesCounters:
    - counterSet: gpu-0-cs
      counters:
        multiprocessors:
          value: "76"
    capacity: ...
    attributes:
      partitioningMode:
        string: GPUFractioning
  # Incompatible with any other partitioning modes, only compatible with devices
  # partitioned with the same mode (MIGSlicing)
  - name: gpu-0-mig-1g.5gb-0
    bindsToNode: true
    consumesCounters:
    - counterSet: gpu-0-cs
      counters:
        multiprocessors:
          value: "2"
    attributes:
      partitioningMode:
        string: MIGSlicing
  # Incompatible with any other partitioning modes, only compatible with devices
  # partitioned with the same mode (MIGSlicing)
  - name: gpu-0-mig-1g.5gb-1
    bindsToNode: true
    consumesCounters:
    - counterSet: gpu-0-cs
      counters:
        multiprocessors:
          value: "2"
    attributes:
      partitioningMode:
        string: MIGSlicing
  # Incompatible with the MIGSlicing and None
  # partitioning modes, but compatible with GPUFractioning
  - name: gpu-0-mps-0
    bindsToNode: true
    allowMultipleAllocations: true
    consumesCounters:
    - counterSet: gpu-0-cs
      counters:
        multiprocessors:
          value: "15"
    capacity: ...
    attributes:
      partitioningMode:
        string: MPSSharing
```

## Proposal 1 - CompatibilityGroups Assignment

### API Changes

Add the `device.consumesCounters[].compatibilityGroups` field, which lists the device groups this device is compatible with. Another device consuming from the same counter set must list at least one group from this list to be considered compatible.

#### Field Structure

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
  sharedCounters:
  # This counter set represents a specific, physical device.
  - name: gpu-1-cs
    counters:
      multiprocessors:
        value: "152"
  devices:
  # Full, physical device. Consumes full counter set `gpu-1-cs`.
  - name: gpu-1
    attributes:
      type:
        string: gpu
    consumesCounters:
    - counterSet: gpu-1-cs
      counters:
        multiprocessors:
          value: "152"
  # MIG partition. This cannot be allocated
  # - when device `gpu-1` is allocated
  #   (reason: counters exhausted)
  # - when device `gpu-1-foo-part` is allocated
  #   (reason: mismatching compatibilityGroups)
  - name: gpu-1-mig1
    attributes:
      type:
        string: mig
    consumesCounters:
    - counterSet: gpu-1-cs
      # Can only consume from the same counter set when
      # all existing consumers also list compatibilityGroup "mig".
      compatibilityGroups:
      - mig
      counters:
        multiprocessors:
          value: "2"
  # FOO partition. This cannot be allocated
  # - when device `gpu-1` is allocated
  #   (reason: counters exhausted).
  # - when device `gpu-1-mig1` is allocated
  #   (reason: mismatching compatibilityGroups).
  #
  # This can generally still be allocated
  # - when `gpu-1-bar-part` is allocated
  #   (reason: shared compatibilityGroups "bar").
  #
  # The relationship between the foo and bar type
  # partitions on the same physical device is
  # modeled by counter consumption.
  - name: gpu-1-foo-part
    attributes:
      type:
        string: foo
    consumesCounters:
    - counterSet: gpu-1-cs
      compatibilityGroups:
      - foo
      - bar
      counters:
        multiprocessors:
          value: "17"
  # BAR partition. Similar considerations as
  # described for FOO partition.
  - name: gpu-1-bar-part
    attributes:
      type:
        string: bar
    consumesCounters:
    - counterSet: gpu-1-cs
      compatibilityGroups:
      - bar
      counters:
        multiprocessors:
          value: "2"
```

### Semantics

#### Device Groupings

1. **Group Declaration**: Devices must declare which groups they are compatible with; otherwise they are assumed to be compatible with all groups.
2. **Scope**: Grouping rules apply:
   - To all devices within a device class that specify `compatibilityGroups`
   - Across all resource claims
3. **Scheduler Enforcement**: The scheduler must:
   - Evaluate exclusion constraints during device selection
   - Skip device candidates that would violate existing allocations
   - Track allocated devices and their exclusion rules

## Proposal 2 - Attribute-based Compatibility with CEL

### API Changes

Add an optional `compatibleOnlyWith` field to device objects within the ResourceSlice specification. This field allows devices to declare which other devices can be allocated alongside them. If the field is not provided, a device is deemed compatible with all other devices, which preserves backward compatibility.

#### Field Structure

```yaml
devices:
- name: device-name
  # ... existing device fields ...
  # New field: compatibleOnlyWith
  # Specifies a CEL expression that the scheduler filters devices with when attempting
  # a device allocation.
  # This field is optional. If not specified, the device has no compatibility constraints.
  compatibleOnlyWith:
    expression: "cel exp"
```

### Semantics

#### Exclusion Rules

1. **Mutual Exclusivity**: If device A specifies a compatibility expression, the scheduler must:
   - Evaluate the expression against already allocated devices when device A is considered for allocation
   - Evaluate the expression against devices that are considered for allocation once device A is already allocated
2. **Scope**: Compatibility expressions apply:
   - To all devices within a device class
   - Across all resource claims

#### Example Exclusion Patterns

**Pattern 1: Device-Level Exclusivity**

```yaml
- name: device-full
  attributes:
    physicalDevice: dev-0
  # Excludes all devices whose underlying device is dev-0
  compatibleOnlyWith:
    expression: 'device.attributes["device.example.com"].physicalDevice != "dev-0"'
```

**Pattern 2: Mode-Based Exclusivity**

```yaml
- name: dev-0-partition-1
  attributes:
    physicalDevice: dev-0
    mode: hardware-partitioned
  # Only compatible with specific partitioning modes
  compatibleOnlyWith:
    expression: 'device.attributes["device.example.com"].mode == "hardware-partitioned"'
```

## Proposal Comparison

**Attribute-based Compatibility with CEL**

- **Higher degree of freedom**:
  - Device compatibility can be defined along multiple dimensions, not only physical device placement
  - Can be extended to support additional use cases in the future (possibly across device classes)

**CompatibilityGroups Assignment**

- **Cleaner and simpler implementation**
  - Minimal additions to the API and codebase that solve the problem at hand

## Implementation Considerations

### Scheduler Changes

The DRA scheduler plugin must be enhanced to:

1. **Track Allocated Devices**: Maintain a cache of allocated devices per node with their attributes and compatibility expressions, or their group mapping
2. **Evaluate Exclusions**: For each candidate device:
   - Check if all allocated devices are compatible with this candidate
   - Check if this candidate is compatible with all allocated devices
3. **Filter Candidates**: Remove devices from consideration if they violate compatibility constraints
4. **Handle Allocation Failures**: If a device cannot be allocated because it is incompatible with existing allocations, provide clear feedback in scheduling events

### Driver Responsibilities

Resource drivers should:

1. **Declare Constraints**: Populate `compatibleOnlyWith` or `compatibilityGroups` for all devices with compatibility requirements
2. **Validation**: Ensure compatibility rules are symmetric and consistent across devices
3. **Documentation**: Document their compatibility matrix

### Backward Compatibility

- Both approaches are opt-in
- Devices without `compatibleOnlyWith` or `compatibilityGroups` behave exactly as they do today
- No changes to existing API fields or semantics
- Older schedulers will ignore the new field but may allocate incompatible devices (same as current behavior)
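
#### Illustrative Sketch: Group-Based Candidate Filtering

The group semantics of Proposal 1, together with the "Evaluate Exclusions" and "Filter Candidates" steps under Scheduler Changes, can be sketched roughly as follows. This is a minimal Python model, not scheduler code. The names `Consumption`, `Device`, `entries_compatible`, `devices_compatible`, and `filter_candidates` are hypothetical. The sketch assumes a pairwise reading of the rule (two consumers of the same counter set are compatible if either declares no groups or they share at least one group), and it deliberately ignores counter accounting, which remains a separate check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumption:
    """One consumesCounters entry (hypothetical model).

    `groups` is empty when compatibilityGroups is not set.
    """
    counter_set: str
    groups: frozenset = frozenset()


@dataclass(frozen=True)
class Device:
    name: str
    consumes: tuple = ()


def entries_compatible(a, b):
    # Entries on different counter sets never constrain each other.
    if a.counter_set != b.counter_set:
        return True
    # Assumption: an entry without groups is compatible with all groups.
    if not a.groups or not b.groups:
        return True
    # Otherwise the two entries must share at least one group.
    return bool(a.groups & b.groups)


def devices_compatible(a, b):
    # Set intersection is symmetric, so this check is bidirectional.
    return all(entries_compatible(x, y) for x in a.consumes for y in b.consumes)


def filter_candidates(candidates, allocated):
    # Keep only candidates compatible with every already allocated device.
    return [c for c in candidates
            if all(devices_compatible(c, d) for d in allocated)]


# Devices modeled on the gpu-1 example from Proposal 1.
mig = Device("gpu-1-mig1", (Consumption("gpu-1-cs", frozenset({"mig"})),))
foo = Device("gpu-1-foo-part", (Consumption("gpu-1-cs", frozenset({"foo", "bar"})),))
bar = Device("gpu-1-bar-part", (Consumption("gpu-1-cs", frozenset({"bar"})),))

# With the FOO partition allocated, only the BAR partition remains a candidate.
allowed = [d.name for d in filter_candidates([mig, bar], allocated=[foo])]
```

Under this reading, a device that declares no groups (such as the full `gpu-1` device) is never rejected by the group check. It is kept away from partitions only by counter exhaustion, which matches the comments in the Proposal 1 example.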
