# PP HW5 Report
### 110062305 謝諺緯
## 1. Overview
1. Identify how UCP Objects (`ucp_context`, `ucp_worker`, `ucp_ep`) interact through the API, including at least the following functions:
- `ucp_init`
This function initializes a UCP application context (`ucp_context`), the global context that holds resources such as memory registrations and device contexts. It reads the configuration and discovers the devices and transports that are available for communication.
**Example**:
```c
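ucp_context_h ucp_context;                    /* receives the new context handle */
ucp_params_t ucp_params = { ... };            /* requested features and options */
ucp_init(&ucp_params, config, &ucp_context);  /* config: obtained from ucp_config_read() */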
```
- `ucp_worker_create`
A UCP worker (`ucp_worker`) is created to manage transport-related resources like connection establishment and completion events. Each worker operates within a `ucp_context` and can progress independently.
**Example**:
```c
ucp_worker_h ucp_worker;                                      /* receives the new worker handle */
ucp_worker_params_t worker_params = { ... };                  /* e.g. thread mode */
ucp_worker_create(ucp_context, &worker_params, &ucp_worker);  /* worker belongs to ucp_context */
```
- `ucp_ep_create`
Creates a communication endpoint (`ucp_ep`) that connects a local worker to a remote worker. Send operations are issued on this endpoint, while tag receives are posted on the worker.
**Example**:
```c
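ucp_ep_h ucp_ep;                                 /* receives the new endpoint handle */
ucp_ep_params_t ep_params = { ... };             /* e.g. the remote worker's address */
ucp_ep_create(ucp_worker, &ep_params, &ucp_ep);  /* endpoint belongs to ucp_worker */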
```
2. UCX abstracts communication into three layers as below. Please provide a diagram illustrating the architectural design of UCX.
UCX abstracts communication into three layers, each represented by a key object. The layers and their interactions are explained below:
- **`ucp_context` (Application Context)**:
  - Serves as the global context for the application, managing resources such as memory registrations, network interfaces, and device contexts.
  - All communication operations derive from the `ucp_context`, which acts as the foundation of the UCX application.
- **`ucp_worker` (Communication and Progress Engine)**:
  - Represents the transport-related resources and event management for communication.
  - Handles tasks such as connection establishment, polling for completion events, and progressing communication operations.
  - Multiple `ucp_worker` instances can be created per `ucp_context` to enable concurrent operations across threads.
- **`ucp_ep` (Endpoint)**:
  - Defines a communication link between two workers (`ucp_worker`).
  - Data transfers toward the peer, such as sends, are issued through the `ucp_ep`; tag receives are posted on the worker.
Below is a diagram of this architecture for the command `srun -N 2 ./send_recv.out`. Each of the two processes creates its own `ucp_context` and `ucp_worker`, and the two endpoints connect the workers to each other:
```plaintext
+---------------------------+          +---------------------------+
|     Process 0 (node 0)    |          |     Process 1 (node 1)    |
|                           |          |                           |
|     +---------------+     |          |     +---------------+     |
|     |  ucp_context  |     |          |     |  ucp_context  |     |
|     +---------------+     |          |     +---------------+     |
|             |             |          |             |             |
|     +---------------+     |          |     +---------------+     |
|     |  ucp_worker   |     |          |     |  ucp_worker   |     |
|     +---------------+     |          |     +---------------+     |
|             |             |          |             |             |
|     +---------------+     |          |     +---------------+     |
|     |    ucp_ep     |<----+----------+---->|    ucp_ep     |     |
|     +---------------+     |          |     +---------------+     |
+---------------------------+          +---------------------------+
```
3. Based on the description in HW5, where do you think the following information is loaded/created?
- **`UCX_TLS`**:
- I think the `UCX_TLS` variable would be read and stored during the initialization of the global configuration in `ucp_context`.
- This allows the variable to be accessible globally within the UCX framework, aligning with the user-defined or default transport protocols.
- **TLS Selected by UCX**:
- I think the selection of the actual TLS happens primarily at the **`ucp_worker`** level.
- The `ucp_worker` object is responsible for managing communication resources, including the selection of transport protocols based on the user-defined `UCX_TLS` variable and system hardware capabilities.
- The worker chooses the most suitable transport protocols by consulting the configuration (`UCX_TLS`) and the system's resources (e.g., network devices, RDMA support).
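
To make the first point concrete, here is a minimal sketch in plain C (it does not link against UCX) of reading a comma-separated `UCX_TLS` value from the environment and splitting it into transport names. The helpers `split_tls_list` and `read_ucx_tls` and the two size limits are hypothetical names chosen for illustration; the real parsing lives in UCX's configuration parser and is more general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TLS      16   /* hypothetical limit on listed transports */
#define TL_NAME_LEN  32   /* hypothetical limit on one name's length */

/* Split a comma-separated list such as "ud_verbs,sm" into names[].
 * Returns the number of names stored. Empty items and items that do
 * not fit into TL_NAME_LEN are skipped. */
int split_tls_list(const char *value, char names[MAX_TLS][TL_NAME_LEN])
{
    int count = 0;

    assert(value != NULL);
    while (*value != '\0' && count < MAX_TLS) {
        size_t len = strcspn(value, ",");      /* length up to next comma */
        if (len > 0 && len < TL_NAME_LEN) {
            memcpy(names[count], value, len);
            names[count][len] = '\0';
            count++;
        }
        value += len;
        if (*value == ',') {
            value++;                           /* step over the separator */
        }
    }
    return count;
}

/* Read UCX_TLS from the environment. When it is unset, fall back to
 * "all", which is the default value UCX uses for UCX_TLS. */
int read_ucx_tls(char names[MAX_TLS][TL_NAME_LEN])
{
    const char *value = getenv("UCX_TLS");
    return split_tls_list(value != NULL ? value : "all", names);
}
```

With the module file used in this homework (`UCX_TLS=ud_verbs`), `read_ucx_tls` would store the single name `ud_verbs`.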
## 2. Implementation
### 1. Which files did you modify, and where did you choose to print Line 1 and Line 2?
#### **Files Modified**
1. **`ucp_worker.c`**: Added modifications to the function `ucp_worker_print_used_tls()`.
- **Purpose**: To retrieve the `UCX_TLS` configuration and print both configured and actively used transport protocols.

2. **`parser.c`**: Updated the function `ucs_config_parser_print_opts()` to print all available UCX transport protocols.
- **Purpose**: Responsible for iterating and printing all transport layer configurations stored in `UCX_TLS`.

3. **`types.h`**: Added a new flag `UCS_CONFIG_PRINT_TLS` in the `ucs_config_print_flags_t` enumeration.
- **Purpose**: This flag is used to enable the printing of transport protocol configurations in `parser.c`.

#### **Line 1: Configured `UCX_TLS`**
- **Implementation Details**:
1. In `ucp_worker.c`, used `ucp_config_read()` to load the `UCX_TLS` configuration into the `ucp_config_t` structure.
2. Called `ucp_config_print()` to print the retrieved `UCX_TLS` configuration.
3. Inside `ucp_config_print()`, the function `ucs_config_parser_print_opts()` from `parser.c` was invoked to handle the actual printing of protocols.
- **Changes in `parser.c`**:
- Added a conditional check using the new `UCS_CONFIG_PRINT_TLS` flag.
- Called `ucs_config_parser_get_value()` to extract the `TLS` value into a variable.
- Printed the `UCX_TLS` value using `fprintf()`.
#### **Line 2: Active Transport Protocols**
- **Implementation Details**:
1. In `ucp_worker.c`, the `ucp_worker_print_used_tls()` function already computes the active transport protocols.
2. Printed the selected transport protocols directly by appending the print logic in this function.
### 2. How do the functions in these files call each other? Why is it designed this way?
To meet the requirement of outputting all available UCX transport protocols and the currently selected ones, we need to follow the process of transport establishment in UCX. Here is the step-by-step explanation:
#### 1. **`ucp_wireup_init_lanes()`**
- **File**: `wireup.c`
- **Purpose**:
- Initializes the communication lanes for a UCP endpoint (`ep`).
- Establishes connections to the remote peer.
- Configures transport protocols for communication.
- **Key Steps**:
1. Selects the available transport lanes.
2. Validates and matches local and remote configurations.
3. Creates a new endpoint configuration if necessary.
4. Establishes connections for each selected lane.
- **Function Call**:
- This function calls `ucp_worker_get_ep_config()` to manage and retrieve the endpoint configuration. (L:1541)

---
#### 2. **`ucp_worker_get_ep_config()`**
- **File**: `ucp_worker.c`
- **Purpose**:
- Manages and retrieves endpoint configurations.
- Creates a new configuration if it does not already exist.
- **Key Functions**:
1. Searches for existing configurations.
2. Creates a new configuration if needed.
3. Initializes protocol thresholds.
4. Sets parameters for short message transfers.
- **Function Call**:
- Once the endpoint configuration is ready, it calls `ucp_worker_print_used_tls()` to print the currently selected protocols. (L:2112)

---
#### 3. **`ucp_worker_print_used_tls()`**
- **File**: `ucp_worker.c`
- **Purpose**:
- Computes the currently used transport protocols.
- Prints the selected protocols for debugging or verification purposes.
- **Enhancement**:
- Before printing the selected protocols (Line 2), it calls `ucp_config_print()` to output the configured `UCX_TLS` value (Line 1). (L:1862)

---
#### 4. **`ucp_config_print()`**
- **File**: `ucp_context.c`
- **Purpose**:
- Prints the configured transport protocols (`UCX_TLS`) and other configuration details.
- **Function Call**:
- This function calls `ucs_config_parser_print_opts()` to handle the actual printing of the transport protocols. (L:755)

---
#### 5. **`ucs_config_parser_print_opts()`**
- **File**: `parser.c`
- **Purpose**:
- Iterates through and prints the configuration options, including `UCX_TLS`.
- **Enhancement**:
- Prints the transport protocols configured in `UCX_TLS` (Line 1) when the new `UCS_CONFIG_PRINT_TLS` flag is set.
---
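
The call chain described in steps 1 to 5 can be sketched with stand-in functions. Everything below is a toy model in plain C: each `toy_*` function only mimics the call order of the UCX function named in its comment, the `UCX_TLS=...` text is an assumed format for Line 1, and Line 2 is passed in as an opaque string because its exact content is computed by UCX.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in stubs: none of this is UCX code. */

static char trace[256];                 /* records what was "printed" */

static void emit(const char *line)
{
    /* +2 leaves room for the newline and the terminating NUL */
    assert(strlen(trace) + strlen(line) + 2 <= sizeof(trace));
    strcat(trace, line);
    strcat(trace, "\n");
}

/* ucs_config_parser_print_opts(): prints Line 1 when the TLS flag is set */
static void toy_parser_print_opts(const char *ucx_tls, int print_tls_flag)
{
    char line[64];

    if (print_tls_flag) {
        snprintf(line, sizeof(line), "UCX_TLS=%s", ucx_tls);
        emit(line);
    }
}

/* ucp_config_print(): forwards to the parser with the new flag enabled */
static void toy_config_print(const char *ucx_tls)
{
    toy_parser_print_opts(ucx_tls, 1);
}

/* ucp_worker_print_used_tls(): Line 1 first, then Line 2 */
static void toy_print_used_tls(const char *ucx_tls, const char *selected)
{
    toy_config_print(ucx_tls);          /* Line 1: configured UCX_TLS */
    emit(selected);                     /* Line 2: transports in use  */
}

/* ucp_worker_get_ep_config(): prints once the ep config is ready */
static void toy_get_ep_config(const char *ucx_tls, const char *selected)
{
    toy_print_used_tls(ucx_tls, selected);
}

/* ucp_wireup_init_lanes(): entry point of the chain */
void toy_wireup_init_lanes(const char *ucx_tls, const char *selected)
{
    trace[0] = '\0';
    toy_get_ep_config(ucx_tls, selected);
}

/* Accessor for the recorded output */
const char *toy_trace(void)
{
    return trace;
}
```

Calling `toy_wireup_init_lanes("ud_verbs", "...")` records Line 1 before Line 2, which is the ordering the modified `ucp_worker_print_used_tls()` is meant to produce.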
#### **Why Is It Designed This Way?**
1. **Modularity**: Each function has a distinct role, such as initialization, configuration management, and debugging, which ensures clean separation of concerns.
2. **Flexibility**: The design allows for dynamic adjustments to configurations and supports a wide range of transport protocols, ensuring adaptability to different hardware setups.
3. **Debugging and Verification**: By printing both available and selected protocols, it becomes easier to debug and verify the transport configuration, which is critical in high-performance computing environments.
### 3. Observe when Line 1 and 2 are printed during the call of which UCP API?
#### **Line 1: Configured `UCX_TLS`**
- **When**: Printed inside `ucp_worker_print_used_tls()`, which `ucp_worker_get_ep_config()` calls on behalf of `ucp_wireup_init_lanes()`, that is, while an endpoint is being set up by `ucp_ep_create`.
- **Call Flow**:
1. `ucp_wireup_init_lanes()` calls `ucp_worker_get_ep_config()`.
2. `ucp_worker_get_ep_config()` calls `ucp_worker_print_used_tls()`.
3. `ucp_worker_print_used_tls()` invokes `ucp_config_print()`, which calls `ucs_config_parser_print_opts()` to print the configured `UCX_TLS`.
#### **Line 2: Active Transport Protocols**
- **When**: Also printed during `ucp_worker_get_ep_config()`, right after Line 1.
- **Call Flow**:
1. After configuration, `ucp_worker_get_ep_config()` calls `ucp_worker_print_used_tls()`.
2. `ucp_worker_print_used_tls()` computes and prints the active protocols.
### 4. Does it match your expectations for questions **1-3**? Why?
Partially. The behavior for `UCX_TLS` matches my expectation, but the behavior for the TLS selected by UCX differs from what I initially expected.
#### **`UCX_TLS`**
- **Expectation**: I expected the `UCX_TLS` variable to be read and stored during the initialization of the global configuration at the `ucp_context` level.
- **Observation**: This is accurate. The `UCX_TLS` variable is indeed loaded during the initialization phase in `ucp_context`, where it becomes globally accessible and aligns with user-defined or default transport configurations.
#### **TLS Selected by UCX**
- **Expectation**: I initially anticipated that the transport protocol selection would primarily happen at the `ucp_worker` level during its initialization.
- **Observation**: This is different. The actual selection of the transport protocols occurs when a `ucp_ep` is establishing a connection. Specifically, the function `ucp_worker_get_ep_config()` is called before establishing the connection to retrieve the transport protocols selected by UCX. This ensures the selected protocols are optimal for the current endpoint configuration and connection.
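
The idea that the configured list is only an input, and that the actual choice is made later per endpoint, can be illustrated with a small sketch. It is a plain-C toy, not UCX's algorithm: the function `toy_select_tls` is hypothetical, and it only filters the available transports by the configured names, whereas the real selection also weighs transport capabilities and performance.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: return a bitmask in which bit i is set when the
 * i-th available transport is allowed by the configured list. The name
 * "all" in the configured list allows every available transport. */
uint64_t toy_select_tls(const char *const avail[], int num_avail,
                        const char *const allowed[], int num_allowed)
{
    uint64_t selected = 0;
    int i, j;

    assert(num_avail >= 0 && num_avail <= 64 && num_allowed >= 0);
    for (i = 0; i < num_avail; i++) {
        for (j = 0; j < num_allowed; j++) {
            if (strcmp(allowed[j], "all") == 0 ||
                strcmp(allowed[j], avail[i]) == 0) {
                selected |= (uint64_t)1 << i;   /* resource i may be used */
                break;
            }
        }
    }
    return selected;
}
```

Because the mask depends on what is available at the time of the call, running the filter when an endpoint is created, rather than once at start-up, lets each endpoint end up with a different set of transports.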
### 5. In implementing the features, we see variables like lanes, tl_rsc, tl_name, tl_device, bitmap, iface, etc., used to store different Layer's protocol information. Please explain what information each of them stores.
#### **1. `lanes`**
`lanes` describe the communication channels of a UCP endpoint. Each lane is bound to one transport resource and is assigned the kinds of operations it carries, such as active messages or remote memory access.
**Examples**:
- An active message lane (`UCP_LANE_TYPE_AM`) carries active messages, for example over a reliable transport such as `rc`.
- A remote memory access lane (`UCP_LANE_TYPE_RMA`) carries RDMA-style put and get operations.

#### **2. `tl_rsc`**
`tl_rsc` describes one transport resource, that is, one transport paired with one device. It records the transport name, the device name, and the device type.

#### **3. `tl_name`**
The name of the transport layer, e.g. `ud_verbs`.
#### **4. `tl_device`**
`tl_device` describes the device that a transport runs on, such as its name and type.
**Examples**:
- An InfiniBand port (e.g. `ibp3s0:1`, the device configured on this system)
- An Ethernet interface
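
The relation between `tl_rsc`, `tl_name`, and the device can be sketched as a small record. This is a hypothetical, simplified mirror written in plain C; the field sizes, the `toy_device_type` values, and the `tl_name/dev_name` display format are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical device kinds */
enum toy_device_type { TOY_DEVICE_NET, TOY_DEVICE_SHM, TOY_DEVICE_SELF };

/* Hypothetical, simplified transport resource descriptor */
struct toy_tl_rsc {
    char                 tl_name[16];   /* transport name, e.g. "ud_verbs" */
    char                 dev_name[32];  /* device name, e.g. "ibp3s0:1"    */
    enum toy_device_type dev_type;      /* kind of device                  */
};

/* Fill a descriptor; names that do not fit are truncated. */
void toy_rsc_init(struct toy_tl_rsc *rsc, const char *tl, const char *dev,
                  enum toy_device_type type)
{
    assert(rsc != NULL && tl != NULL && dev != NULL);
    snprintf(rsc->tl_name, sizeof(rsc->tl_name), "%s", tl);
    snprintf(rsc->dev_name, sizeof(rsc->dev_name), "%s", dev);
    rsc->dev_type = type;
}

/* Nonzero when the resource belongs to the named transport. */
int toy_rsc_uses_tl(const struct toy_tl_rsc *rsc, const char *tl)
{
    assert(rsc != NULL && tl != NULL);
    return strcmp(rsc->tl_name, tl) == 0;
}

/* Format a resource as "tl_name/dev_name" into buf.
 * Returns 0 on success, -1 if buf is too small. */
int toy_rsc_to_string(const struct toy_tl_rsc *rsc, char *buf, size_t size)
{
    int n;

    assert(rsc != NULL && buf != NULL);
    n = snprintf(buf, size, "%s/%s", rsc->tl_name, rsc->dev_name);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

For the system in this homework, a resource built from `ud_verbs` and `ibp3s0:1` would be displayed as `ud_verbs/ibp3s0:1` by this sketch.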

#### **5. `bitmap`**
`bitmap` is used to represent the availability or selection of transport resources or features in a compact, efficient format. Each bit corresponds to a specific resource or capability.
#### **6. `iface`**
`iface` represents a transport interface in UCX: a communication resource opened on one transport and device. It is the point through which UCX sends and receives data and manages connections on that transport.
## 3. Optimize System
1. Below are the current configurations for OpenMPI and UCX in the system. Based on your learning, what methods can you use to optimize single-node performance by setting UCX environment variables?
```
-------------------------------------------------------------------
/opt/modulefiles/openmpi/ucx-pp:
module-whatis {OpenMPI 4.1.6}
conflict mpi
module load ucx/1.15.0
prepend-path PATH /opt/openmpi-4.1.6/bin
prepend-path LD_LIBRARY_PATH /opt/openmpi-4.1.6/lib
prepend-path MANPATH /opt/openmpi-4.1.6/share/man
prepend-path CPATH /opt/openmpi-4.1.6/include
setenv UCX_TLS ud_verbs
setenv UCX_NET_DEVICES ibp3s0:1
-------------------------------------------------------------------
```
#### **Current Configuration**
- **UCX_TLS**: `ud_verbs`
- Uses the Unreliable Datagram (UD) protocol for communication over RDMA-capable devices.
- **UCX_NET_DEVICES**: `ibp3s0:1`
- Selects port 1 of the RDMA-capable device `ibp3s0` as the active device.
---
#### **Solution: Switch to Shared Memory Transport**
- **Reason**: For single-node communication, shared memory (`sm`) transport is faster and more efficient than network-based transports like `ud_verbs`.
- **Command**:
```bash
mpiucx -n 2 -x UCX_TLS=sm $HOME/UCX-lsalab/test/mpi/osu/pt2pt/osu_latency
mpiucx -n 2 -x UCX_TLS=sm $HOME/UCX-lsalab/test/mpi/osu/pt2pt/osu_bw
```

- **Observation (latency, `osu_latency`)**:
- `sm` (Shared Memory) consistently shows lower latency compared to `ud_verbs` (Unreliable Datagram over Verbs).
- The difference is particularly significant for smaller message sizes (<1KB), where `sm` has sub-microsecond latencies, while `ud_verbs` remains above 10 microseconds.
- **Hypothesis**:
- `sm` operates within the same node using shared memory, avoiding network overhead. This results in minimal software stack traversal and near-instant communication.
- `ud_verbs`, being a network transport, incurs additional overhead from RDMA device interactions, network stack traversal, and packetization.

- **Observation (bandwidth, `osu_bw`)**:
- `sm` achieves significantly higher bandwidth for large message sizes (>1KB), peaking around 10 GB/s.
- `ud_verbs` maintains relatively low bandwidth (~2 GB/s) with minimal variation across message sizes.
- **Hypothesis**:
- `sm` leverages direct memory access and avoids network bottlenecks, making it highly efficient for bulk data transfers.
- `ud_verbs` has inherent limitations due to protocol overhead, network congestion, and RDMA resource contention.
## 4. Experience & Conclusion
### **1. What have you learned from this homework?**
This homework provided valuable insights into UCX's transport protocols and their configurations. I learned how to test and optimize performance for different communication scenarios using tools like OSU Micro-Benchmarks. The process also deepened my understanding of single-node and multi-node communication, as well as the impact of environment variables on latency and bandwidth.
### **2. How long did you spend on the assignment?**
Approximately 2 days, including tracing code, debugging, running tests, and analyzing results.
### **3. Feedback (optional)**
I would like to thank the professor and TAs for designing this challenging yet rewarding assignment. It not only enhanced my technical skills but also improved my problem-solving abilities. I appreciate the opportunity to explore UCX in depth and apply theoretical knowledge to practical scenarios.