# PP HW 4 Report
> - Please include both brief and detailed answers.
> - The report should be based on the UCX code.
> - Describe the code using the 'permalink' from [GitHub repository](https://github.com/NTHU-LSALAB/UCX-lsalab).
## 1. Overview
> In conjunction with the UCP architecture mentioned in the lecture, please read [ucp_hello_world.c](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/ce5c5ee4b70a88ce7c15d2fe8acff2131a44aa4a/examples/ucp_hello_world.c)
1. Identify how UCP Objects (`ucp_context`, `ucp_worker`, `ucp_ep`) interact through the API, including at least the following functions:
- `ucp_config` is a descriptor holding the configuration of a `UCP application context`, populated from environment variables.
- `ucp_context` holds a UCP communication instance's global information. The information includes communication resources, endpoints, memory, temporary file storage, and other communication information directly associated with a specific UCP instance.
- `ucp_worker` represents an instance of a local communication resource and the progress engine associated with it; communication routines are progressed through a worker.
- `ucp_ep` is used to address a remote `ucp_worker_h` "worker". It provides a description of the source, the destination, or both.
First, `ucp_config_read` reads the environment variables and sets the parameters related to the UCP context. Then `ucp_init` takes the config and `ucp_params` and initializes the network resources required for the application scope, producing the `ucp_context`. By setting `UCP_WORKER_PARAM_FIELD_THREAD_MODE` in `worker_params`, we ensure that `ucp_worker_create` creates a single-threaded worker, so no locking is needed.
To obtain more information about the worker, `ucp_worker_query` is used with the field of interest, `UCP_WORKER_ATTR_FIELD_ADDRESS`, to retrieve the worker address.
With this information in hand, the server and the client establish the connection through out-of-band (OOB) communication.
### Client side
1. In `connect_common`, obtain the server's address with `getaddrinfo`, create the communication endpoint with `socket`, and then `connect` to the first address that accepts the connection.
2. Use `connect` in `connect_common` to reach the server over a TCP socket ([code](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/examples/ucp_hello_world.c#L600-L626)).
3. After the connection is established, receive `peer_addr_length` from the server side, followed by the `peer_addr` itself.
4. Finally, enter the client-side routine [`run_ucx_client`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/examples/ucp_hello_world.c#L224).
First, the endpoint-related settings are stored in `ep_params`, including `ep_addr`, i.e., the server's worker address. `ucp_ep_create` is then called to create the endpoint toward the server side.
Next, prepare the message buffer, include the client-side address in it, and send it with `ucp_tag_send_nbx`. This routine is non-blocking and returns immediately, so we need a waiting function to check whether the message has actually been sent out.
The function `ucx_wait` is responsible for this. It contains a while loop that keeps executing another routine, `ucp_worker_progress`, as long as `!request->completed` holds. `ucp_worker_progress` explicitly progresses all communication operations on a worker; blocking routines advance the state of communication by calling it repeatedly. In this case, it waits for the `ucp_tag_send_nbx` operation to complete.
Next, the client waits to receive the test string from the server. A for-loop repeatedly calls `ucp_tag_probe_nb` to check whether a message matching the current tag has arrived. If so, the loop ends; otherwise, `ucp_worker_progress` is called again to advance outstanding communication (i.e., poll CQs or invoke callbacks). If there is no outstanding communication at all, the CPU can sleep instead: `ucp_worker_wait` waits (blocking) until an event has happened, as part of the wake-up mechanism.
Once a message is confirmed, `ucp_tag_msg_recv_nbx` receives it in a non-blocking manner. Combined with `ucx_wait`, this gives the same effect as on the sending side: we wait for the receive to complete. When setting up the receive parameters, `ucp_request_param_t` is used to specify the callback function and the datatype to pass in.
5. If everything goes well, free the socket connection, the message memory, and the other resources used along the way.
### Server side
1. Use a NULL node and the AI_PASSIVE flag when **obtaining suitable socket descriptions** with `getaddrinfo`.
2. Use `bind` in `connect_common` to bind to the specific service requested by the client side ([code](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/examples/ucp_hello_world.c#L600-L626)).
3. Mirroring the client side, send `local_addr_len` first and then `local_addr`.
4. The procedure is basically the reverse of the client's: first receive the client's address, then send out the testing string. Enter [`run_ucx_server`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/examples/ucp_hello_world.c#L378):
It keeps calling `ucp_worker_progress` and `ucp_tag_probe_nb` to continuously check whether there is an incoming event with the specific tag `0x1337a880u`. If so, it allocates memory for the message based on the length provided by `info_tag`. It then uses `ucp_tag_msg_recv_nbx` to perform the receive and `ucx_wait` to check whether the receive has completed. After filling `msg` with the client address, the server starts sending the test string to the client.
Based on the `peer_addr`, it creates the endpoint and connects to the remote destination with `ucp_ep_create`.
After setting up the endpoint connection, it sends the testing message to the client with `ucp_tag_send_nbx` and `ucx_wait`. Once the loop in `ucx_wait` breaks, the server has sent out the message; if the status from `ucx_wait` is `UCS_OK`, the communication succeeded. The last thing it needs to do is flush the endpoint with `flush_ep`.
Inside `flush_ep` is a call to `ucp_ep_flush_nbx`. This routine flushes all outstanding AMO and RMA communications on the endpoint: all AMO and RMA operations issued on the ep prior to this call are completed both at the origin and at the target endpoint when the flush finishes. In ucp_hello_world, the flushing is implemented in a [blocking way](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/9fb399958f535682ee8bd61e6a526bf95e255803/examples/ucp_hello_world.c#L367-L374): while waiting for the current ep's flush to finish, it keeps calling `ucp_worker_progress` to handle communication of the other eps on the same worker.
2. What is the specific significance of the division of UCP Objects in the program? What important information do they carry?
The division of UCP objects serves to organize and manage different aspects of the communication infrastructure: the context manages global settings, workers run the communication and progress engine, and endpoints represent specific communication peers and carry out all the send operations. In particular, note that [`ucp_tag_msg_recv_nbx`](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/9fb399958f535682ee8bd61e6a526bf95e255803/examples/ucp_hello_world.c#L412C31-L412C31) takes a worker rather than an ep, in contrast to [`ucp_tag_send_nbx`](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/9fb399958f535682ee8bd61e6a526bf95e255803/examples/ucp_hello_world.c#L483).
- `ucp_context`
The type is defined [here](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_context.h#L276-L390) in `ucp_context.h`. It includes the memory-domain resources, the UCT components, and a struct [request](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_context.h#L338-L342) that can be set through [ucp_params.request_init](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/9fb399958f535682ee8bd61e6a526bf95e255803/examples/ucp_hello_world.c#L572). It also contains the memory registrations for buffers and the global device.
- `ucp_worker`
The struct is defined [here](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_worker.h#L273-L367). It contains the `ucp_context` handle, the worker's host name, and a struct for [counting the number of endpoints](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_worker.h#L357-L366) and their status. Normally, one thread has one worker.
- `ucp_ep`
It carries the [address](https://github.com/NTHU-LSALAB/UCX-lsalab/blob/9fb399958f535682ee8bd61e6a526bf95e255803/examples/ucp_hello_world.c#L448C3-L448C3) of the remote endpoint; the struct itself is defined [here](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_ep.h#L542-L575). It also includes the worker handle and the remote key handle. All send operations are performed on endpoints.
3. Based on the description in HW4, where do you think the following information is loaded/created?
- `UCX_TLS`
`UCX_TLS` is most likely loaded while parsing the configuration for `ucp_context`, so it should be somewhere in `ucp_context.c` or code related to `ucp_context`.
- TLS selected by UCX
I think the TLs are selected during `ucp_ep_create` -> `ucp_wireup_init_lanes` -> `ucp_wireup_select_lanes` -> [ucp_wireup_search_lanes](https://github.com/openucx/ucx/blob/f93e4d582add344e4134891a017396c3ff2cee1f/src/ucp/wireup/select.c#L2413-L2420). The search function tries the protocols starting from the fastest one and, as soon as one can connect successfully, selects it. Through this process the best transports are chosen based on their scores.
## 2. Implementation
> Describe how you implemented the two special features of HW4.
1. Which files did you modify, and where did you choose to print Line 1 and Line 2?
There are four files in total that I modified: `ucp/core/ucp_worker.c`, `ucs/config/parser.c`, `ucs/config/parser.h`, and `ucs/config/type.h`.
I chose to print Line 1 in `parser.c` and print Line 2 in `ucp_worker.c`.
In `parser.c`, the function `ucs_config_parser_print_opts` is implemented to print all the configurations from the environment variables. When `flags` is set to `UCS_CONFIG_PRINT_TLS`, which I added in `type.h`, it invokes `ucs_config_parser_print_env_vars`. That function walks the global environment variable list `environ` to find `UCX_TLS`. Originally, this function is internal and can only be invoked through `ucs_config_parser_print_env_vars_once`, which, as the name says, calls it only once and therefore cannot satisfy the requirement. So I also modified `parser.h` to make the function accessible to `ucs_config_parser_print_opts`.
As for Line 2, I followed the hint from `UCX_LOG_LEVEL` and found that `ucp_worker_print_used_tls` prints the information for each transport layer, so Line 2 is printed there.
2. How do the functions in these files call each other? Why is it designed this way?
Firstly, for Line 2:
- `ucp_ep_create` -> `ucp_ep_create_to_sock_addr` -> `ucp_ep_init_create_wireup` -> `ucp_worker_get_ep_config` -> `ucp_worker_print_used_tls` -> `printf`
The `ucp_worker_print_used_tls` function has already collected the information for each transport layer in `strb`, so I printed `strb`.
For Line 1:
- `ucp_worker_print_used_tls` -> `ucp_config_print` -> `ucs_config_parser_print_opts` -> `ucs_config_parser_print_env_vars`
Since all the environment variables are contained in `environ`, I could find `UCX_TLS` there and print it.
`ucp_config_print` is an API that prints the configuration obtained from the run-time environment by `ucp_config_read`. It is defined in `ucp_context.c`, since the context needs all the configurations. `UCS` provides service functions for both `UCP` and `UCT`, such as `ucs_config_parser_print_opts`.
3. Observe when Line 1 and 2 are printed during the call of which UCP API?
Based on the previous answer, both Line 1 and Line 2 are printed during the call of `ucp_ep_create`.
4. Does it match your expectations for questions **1-3**? Why?
It doesn't match my expectation for the `UCX_TLS` parsing part, but it matches the TLS-selection part. This is because UCX separates the collection of configuration from the global environment from `ucp_context` itself: although some of the configurations are used by `ucp_context`, the parsing lives in UCS, the service layer for the whole UCX system.
5. In implementing the features, we see variables like lanes, tl_rsc, tl_name, tl_device, bitmap, iface, etc., used to store different Layer's protocol information. Please explain what information each of them stores.
- `lanes`
In UCX, a lane is a communication pathway associated with a specific transport resource. Each lane typically corresponds to a specific communication channel or resource, and multiple lanes can exist within a single endpoint; lanes are used to parallelize communication operations. The lane array is defined in [`ucp_ep.h`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_ep.h#L218), specifying the active lanes in the endpoint.
Besides, the endpoint also has the following variables indicating specific communication channels.
- `ucp_lane_index_t am_lane; /* Lane for AM (can be NULL) */`
- `ucp_lane_index_t tag_lane; /* Lane for tag matching offload (can be NULL) */`
- `ucp_lane_index_t wireup_msg_lane; /* Lane for wireup messages (can be NULL) */`
- `ucp_lane_index_t cm_lane; /* Lane for holding a CM connection (can be NULL) */`
- `ucp_lane_index_t keepalive_lane; /* Lane for checking a connection state (can be NULL) */`
The information contained in each lane is shown below. The `rsc_index` lets a lane look up the transport-layer information in `tl_rscs[rsc_index]`, such as `tl_name`. My guess is that the lanes are chosen during wireup (`ucp_wireup_select_lanes` in `select.c`).
```
typedef struct ucp_ep_config_key_lane {
ucp_rsc_index_t rsc_index; /* Resource index */
...
uint8_t path_index; /* Device path index */
ucp_lane_type_mask_t lane_types; /* Which types of operations this lane
was selected for */
...
} ucp_ep_config_key_lane_t;
```
An ep has multiple lanes, and each lane, depending on its transport-layer resource, has different `lane_types` (e.g., UCP_LANE_TYPE_AM, UCP_LANE_TYPE_TAG).
- `tl_rsc`
It's an instance of [`uct_tl_resource_desc_t`](https://github.com/openucx/ucx/blob/487cac0d790c66b7459ade8346e3ef921d2e670a/src/uct/api/uct.h#L328-L335). This is a description of network resources, including `tl_name`, `dev_name`, `dev_type`, and `sys_device`. The `tl_rsc` is obtained from `ucp_tl_resource_desc_t *resource = &context->tl_rscs[tl_id]`, so the original resources should be set by the context-related config.
- `tl_name`
This string specifies the name of the transport layer in `tl_rsc`.
- `tl_device`
This is a low-level UCT data structure, [uct_tl_device_resource_t](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/uct/base/uct_iface.h#L377-L386), describing a transport device, including the hardware device name and the device type.
- `bitmap`
This is a bitmap type representing which TL resources are in use. The underlying data type is defined as [`ucs_bitmap_t(_length)`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucs/datastruct/bitmap.h#L209C1-L211) in UCS and is used in many other places as well. For example, in `ucp_context` it indicates the resources opened under this context. If the [bitmap](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/ucp/core/ucp_context.h#L314) is already set, it serves as a cache of the best transport layers left by a previously opened ucp_worker.
- `iface`
This is the struct [`uct_iface`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/uct/api/tl.h#L390-L392), designed as the communication interface context. It contains a single struct, [`uct_iface_ops_t`](https://github.com/openucx/ucx/blob/b44cd453d50b455d01c8ded2f8b703bd55c7aae0/src/uct/api/tl.h#L295-L375), which holds all the possible operations of a transport interface.
Each worker has a `wifaces` array, with one wiface per TL resource. A wiface is an encapsulating struct for a worker iface: it contains the original uct iface plus its attributes, and each iface in turn contains the functions it can execute.
To trace the above objects, I checked the `ucp_worker_create` routine. During worker creation, it invokes `ucp_worker_add_resource_ifaces`. If the transport-layer bitmap has already been set up, the worker simply uses this bitmap and opens the corresponding ifaces. Otherwise, it opens all the possible resources on the node and finds the best interfaces covering all the devices via `ucp_worker_select_best_ifaces`. (Path: ucp_worker_create -> ucp_worker_add_resource_ifaces -> ucp_worker_iface_open & ucp_worker_select_best_ifaces)
## 3. Optimize System
1. Below are the current configurations for OpenMPI and UCX in the system. Based on your learning, what methods can you use to optimize single-node performance by setting UCX environment variables?
```
-------------------------------------------------------------------
/opt/modulefiles/openmpi/4.1.5:
module-whatis {Sets up environment for OpenMPI located in /opt/openmpi}
conflict mpi
module load ucx
setenv OPENMPI_HOME /opt/openmpi
prepend-path PATH /opt/openmpi/bin
prepend-path LD_LIBRARY_PATH /opt/openmpi/lib
prepend-path CPATH /opt/openmpi/include
setenv UCX_TLS ud_verbs
setenv UCX_NET_DEVICES ibp3s0:1
-------------------------------------------------------------------
```
> Please use the following commands to test different data sizes for latency and bandwidth, to verify your ideas:
```bash
module load openmpi/4.1.5
mpiucx -n 2 $HOME/UCX-lsalab/test/mpi/osu/pt2pt/standard/osu_latency
mpiucx -n 2 $HOME/UCX-lsalab/test/mpi/osu/pt2pt/standard/osu_bw
```
The available net devices are `enp4s0f0`(tcp), `ibp3s0`(tcp), `ibp3s0:1`(ib), `lo`(tcp).
- Default: the default transport layer on Apollo31 is `ud_verbs` and the default network device is `ibp3s0:1`, which indicates the InfiniBand connection.
Use `ucx_info -d` to check the available transport layers along with their compatible net devices. The following output shows that the transport layer can be `tcp` with the net devices `lo` and `ibp3s0`.
```
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg,cache)
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: ibp3s0
# Type: network
# System device: ibp3s0 (0)
```
There are several possible combinations of TL and NET_DEVICES:
- TL=ud_verbs & NET_DEVICE = ibp3s0:1
- TL=rc_verbs & NET_DEVICE = ibp3s0:1
- TL=tcp & NET_DEVICE=enp4s0f0
- TL=tcp & NET_DEVICE=lo
- TL=tcp & NET_DEVICE=ibp3s0
- TL=sysv & DEVICE=shared memory(not net device)
- TL=posix & DEVICE=shared memory(not net device)
- TL=cma & DEVICE=shared memory(not net device):
Since the `cma` shared-memory transport cannot be used independently, it is tested together with `sysv` or `posix`. I am not sure why it cannot work on its own, but my guess is that `cma` does not set up a remote key.






From the experimental results above, with larger data sizes the fastest setting for communication over net devices is rc_verbs + InfiniBand, while for memory-based communication the fastest settings are sysv and posix. Since this is single-node communication, shared memory is faster than going through the network: network communication carries connection overhead, e.g., RDMA needs to initialize a connection first (i.e., rc_verbs, ud_verbs). Accessing shared memory within a single node can also achieve very high bandwidth, because data transfer happens directly between memory locations without serialization, deserialization, and network transmission (i.e., TCP).
### Advanced Challenge: Multi-Node Testing
This challenge involves testing the performance across multiple nodes. You can accomplish this by utilizing the sbatch script provided below. The task includes creating tables and providing explanations based on your findings. Notably, writing a comprehensive report on this exercise can earn you up to 5 additional points.
- For information on sbatch, refer to the documentation at [Slurm's sbatch page](https://slurm.schedmd.com/sbatch.html).
- To conduct multi-node testing, use the following command:
```
cd ~/UCX-lsalab/test/
sbatch run.batch
```
In multi-node computing, the `enp4s0f0` is not available:
`network device 'enp4s0f0' is not available, please use one or more of: 'br0'(tcp), 'ibp3s0'(tcp), 'ibp3s0:1'(ib), 'lo'(tcp)`. The `lo` entry is the loopback device; it can only communicate within a node, so it is not applicable for this test. Moreover, multi-node communication requires inter-node network configuration, so the transport protocol can only be `ud_verbs`, `tcp`, or `rc_verbs`, with no shared-memory transport.
##### Question: Why does `rc_verbs` need help from `tcp` but `ud_verbs` doesn't?
When I execute `-x UCX_TLS=rc_verbs`, it shows the following
```
/home/pp23/pp23s01/hw4/lib/libucm.so.0:/home/pp23/pp23s01/hw4/lib/libucs.so.0:/home/pp23/pp23s01/hw4/lib/libuct.so.0:/home/pp23/pp23s01/hw4/lib/libucp.so.0
[1704123935.641801] [apollo32:2172391:0] select.c:629 UCX ERROR no auxiliary transport to <no debug data>: Unsupported operation
[1704123935.654289] [apollo33:320987:0] select.c:629 UCX ERROR no auxiliary transport to <no debug data>: Unsupported operation
```
This shows that `rc_verbs` needs an auxiliary protocol to help it build the connection or set up the communication environment. I guess that is why UCX cannot run with only `UCX_TLS=rc_verbs` in a multi-node setting.
Testing combinations:
- TL=ud_verbs & NET_DEVICE = ibp3s0:1
- TL=rc_verbs(TCP) & NET_DEVICE = ibp3s0:1
- TL=TCP & NET_DEVICE = ibp3s0
- TL=TCP & NET_DEVICE = br0 ([br0 definition](https://access.redhat.com/documentation/zh-tw/red_hat_enterprise_linux/6/html/deployment_guide/s2-networkscripts-interfaces_network-bridge): the `br0` here is a Link Layer device)
The `br0` runs took too long, so not all of them could be completed.


From the above experiments, the best choice is the combination of rc_verbs (with TCP as auxiliary) and the InfiniBand network. ud_verbs also performs well, but is still slightly worse than rc_verbs. In comparison, using TCP as the protocol gives much worse bandwidth and latency than the verbs variants. However, running TCP over InfiniBand performs better than over bridge mode. This is presumably because InfiniBand already provides a large hardware speedup, and pairing it with the verbs protocols optimized for InfiniBand gives the best result.
As for why `ud_verbs` is slightly slower than `rc_verbs`, the reason may be error-handling overhead: the simplicity of UD verbs comes at the cost of no built-in reliability or ordering guarantees, which must then be handled at a higher level by the application.
## 4. Experience & Conclusion
1. What have you learned from this homework?
While studying UCX, I also reviewed material from my computer networks course, since networking concepts come up throughout. In addition, I became much more familiar with how the UCX APIs are used.