###### tags: `Profiling` # Function Tracing :::info - uftrace: function tracing in userspace - trace-cmd: front end for ftrace in Linux ::: ## uftrace uftrace 為 userspace 的函式追蹤器,使用方法跟輸出都類似 Linux 的 ftrace,以下筆記 uftrace 的使用方法。 ### Prerequisite 確認被追蹤的函式庫和執行檔都有使用 `-pg` 編譯及連結,若使用 cmake 可以修改以下參數: - CMAKE_C_FLAGS - CMAKE_CXX_FLAGS - CMAKE_EXE_LINKER_FLAGS 若使用 ROS2 的組建工具 colcon 則可以使用以下命令,自動組建可用 uftrace 追蹤的執行檔。 ```bash $ colcon build --symlink-install \ --cmake-args -DCMAKE_C_FLAGS="-pg" \ -DCMAKE_CXX_FLAGS="-pg" \ -DCMAKE_EXE_LINKER_FLAGS="-pg" \ ``` ### Usage > 常用指令的 man pages: > [uftrace-record(1)](https://github.com/namhyung/uftrace/blob/master/doc/uftrace-record.md), [uftrace-report(1)](https://github.com/namhyung/uftrace/blob/master/doc/uftrace-report.md) 紀錄一個命令 : ```bash $ uftrace record -a -k <command> ``` - `-a`: auto arguments recording - `-k`: 追蹤 kernel 函數 (需要有 root 權限) 列出效能報表,依照函數花費的時間排序 ```bash $ uftrace report ``` Ouput example (紀錄 ROS2 應用程式的熱點): ```cpp $ uftrace report Total time Self time Calls Function ========== ========== ========== ==================== 2.036 m 2.036 m 580 linux:schedule 2.016 m 163.070 us 54 std::__invoke 2.016 m 85.230 us 54 std::__invoke_impl 2.016 m 32.664 us 6 std::thread::_State_impl::_M_run 2.016 m 5.664 us 6 std::thread::_Invoker::operator() 2.016 m 22.143 us 6 std::thread::_Invoker::_M_invoke 1.008 m 493.544 us 3 eprosima::fastrtps::rtps::UDPChannelResource::perform_listen_operation 1.007 m 392.896 us 113 eprosima::fastrtps::rtps::UDPChannelResource::Receive 1.007 m 333.519 us 113 asio::basic_datagram_socket::receive_from 1.007 m 84.633 us 113 asio::datagram_socket_service::receive_from 1.007 m 548.420 us 113 asio::detail::reactive_socket_service::receive_from 1.007 m 88.103 us 113 asio::detail::socket_ops::sync_recvfrom 1.007 m 2.466 ms 113 asio::detail::socket_ops::recvfrom 44.130 s 2.902 ms 404 rclcpp::sleep_for 22.766 s 4.360 ms 1 main 22.707 s 1.428 us 1 eprosima::fastrtps::rtps::ResourceEvent::run_io_service 22.707 s 1.405 us 1 asio::io_service::run 22.707 s 179.629 us 1 asio::detail::task_io_service::run 22.707 s 325.524 us 49 asio::detail::task_io_service::do_run_one ... ``` 印出函式呼叫關係,適合用在追蹤大型的軟體架構: ``` $ uftrace replay ``` 單純使用 `replay` 可能會印出太多函式,以 ROS2 為例一個簡單的 talker (不斷的發送資料給其他節點) 會印出約近千萬筆函式呼叫的紀錄,因此需搭配 uftrace 提供的過濾功能: 只列出幾個常用的選項,以下旗標可以在 uftrace 的任何子命令使用 - `-F FUNC`: 函式白名單,只列出 FUNC 函式 - `-N FUNC`: 函式黑名單,排除 FUNC 函式 - `-C FUNC`: 列出所有呼叫 FUNC 的函式,也就是列出 FUNC 的 caller - <mark>`-t TIME`: 以函式的執行時間過濾,只列出執行時間大於 TIME 的函式 (最好用)</mark> 使用範例,列出所有執行時間大於 50us 且函式名稱開頭不是 std 的所有函式: ``` $ uftrace replay -N ^std -t 50us ``` 輸出結果,只列出其中一部份: ``` $ uftrace replay -N ^std -t 50us ... [ 30561] | rclcpp::Publisher::publish(0x562cc5ba4150, 0x7ffdcdd97ab0) { [ 30561] | rclcpp::Publisher::do_inter_process_publish(0x562cc5ba4150, 0x7ffdcdd97ab0) { [ 30561] | rcl_publish() { [ 30561] | rcl_publish(0x562cc5ba4168, 0x7ffdcdd97ab0, 0) { [ 30561] | rmw_publish(0x562cc5bf8230, 0x7ffdcdd97ab0, 0) { [ 30561] | rmw_publish() { [ 30561] | rmw_fastrtps_shared_cpp::__rmw_publish() { [ 30561] | eprosima::fastrtps::Publisher::write() { [ 30561] | eprosima::fastrtps::PublisherImpl::create_new_change() { [ 30561] | eprosima::fastrtps::PublisherImpl::create_new_change_with_params() { [ 30561] | rmw_fastrtps_shared_cpp::TypeSupport::serialize() { [ 30561] | rmw_fastrtps_cpp::TypeSupport::serializeROSmessage() { 50.561 us [ 30561] | std_msgs::msg::typesupport_fastrtps_cpp::_String__cdr_serialize(); 64.284 us [ 30561] | } /* rmw_fastrtps_cpp::TypeSupport::serializeROSmessage */ 72.783 us [ 30561] | } /* rmw_fastrtps_shared_cpp::TypeSupport::serialize */ [ 30561] | eprosima::fastrtps::PublisherHistory::add_pub_change() { [ 30561] | eprosima::fastrtps::rtps::WriterHistory::add_change_() { [ 30561] | eprosima::fastrtps::rtps::StatefulWriter::unsent_change_added_to_history() { 66.144 us [ 30561] | eprosima::fastrtps::rtps::ChangeForReader_t::ChangeForReader_t(); 55.405 us [ 30561] | eprosima::fastrtps::rtps::ReaderProxy::add_change(); 51.435 us [ 30561] | eprosima::fastrtps::rtps::AsyncWriterThread::wakeUp(); 211.351 us [ 30561] | } /* eprosima::fastrtps::rtps::StatefulWriter::unsent_change_added_to_history */ 245.287 us [ 30561] | } /* eprosima::fastrtps::rtps::WriterHistory::add_change_ */ 247.500 us [ 30561] | } /* eprosima::fastrtps::PublisherHistory::add_pub_change */ 429.102 us [ 30561] | } /* eprosima::fastrtps::PublisherImpl::create_new_change_with_params */ 440.351 us [ 30561] | } /* eprosima::fastrtps::PublisherImpl::create_new_change */ 445.366 us [ 30561] | } /* eprosima::fastrtps::Publisher::write */ 446.717 us [ 30561] | } /* rmw_fastrtps_shared_cpp::__rmw_publish */ 447.939 us [ 30561] | } /* rmw_publish */ 450.116 us [ 30561] | } = 0; /* rmw_publish */ 457.967 us [ 30561] | } = 0; /* rcl_publish */ 459.788 us [ 30561] | } /* rcl_publish */ 460.571 us [ 30561] | } /* rclcpp::Publisher::do_inter_process_publish */ 461.153 us [ 30561] | } /* rclcpp::Publisher::publish */ ``` ### Set Record Whitelist 指定所有要追蹤的函式,以下為 ROS2 使用 fastrtps 當作底層 DDS middleware 時,傳輸和接收一個 ros message 所呼叫的函式 (按照呼叫順序)。 ``` // publish side rcl_publish eprosima::fastrtps::Publisher::write eprosima::fastrtps::PublisherHistory::add_pub_change eprosima::fastrtps::rtps::UDPTransportInterface::ShrinkLocatorLists eprosima::fastrtps::rtps::NetworkFactory::ShrinkLocatorLists eprosima::fastrtps::rtps::RTPSMessageGroup::add_data eprosima::fastrtps::rtps::RTPSMessageGroup::~RTPSMessageGroup // subscription side rmw_wait eprosima::fastrtps::rtps::ReceiverResource::OnDataReceived eprosima::fastrtps::SubscriberHistory::add_received_change rclcpp::executor::Executor::execute_subscription rmw_take_with_info ``` 如果直接使用 uftrace 過濾函式名稱,使用過後覺得有些困難,不僅是因為 mcount 會造成 overhead,造成函式之間經過的時間失真,而且過濾規則可能要套用 uftrace 的 TRIGGER 選項,指定呼叫的深度,因此改用 bpftrace 簡單寫一個腳本觀察呼叫時間。 以下為輸出範例,最左邊為 timestamp (單位為 us),lat 代表函式進入和回傳經過的時間 ``` ... 29043337757, rmw_wait entry 29043436718, ReceiverResource::OnDataReceived 29043436796, add_received_change 29043436937, rmw_wait exit, lat: 99219 us 29043437122, rmw_take_with_info, lat: 58 us ... ``` ## trace-cmd trace-cmd 為 ftrace 提供一個簡易使用的前端命令,追蹤 Linux kernel 中的函式呼叫。 ### 1. Trace ROS2 Applications Function Calls ftrace 會追蹤整個系統上的核心的函式呼叫,第一個欄位會顯示此呼叫所屬的程式為何 (格式為 comm-pid),因 ROS2 使用 DDS 進行傳輸,DDS 的通訊協定 DDSI-RTPS 是建立在 UDP 之上,因此觀察 `udp_sendmsg` 了解系統行為及效能,以下為追蹤使用的命令: ```bash $ trace-cmd record -p function_graph -c -g udp_sendmsg <command> ``` - `-p`: 選擇要使用的 plugin,function_graph 會紀錄某個函式及其所呼叫的函式 - `-c`: 表示要紀錄 child process - `-g`: 此選項提供給 function_graph,指定要紀錄的函式,這裡紀錄 `udp_sendmsg` :::info plugin 總共有 function, function_graph, preemptirqsoff, irqsoff, preemptoff, and wakeup 請參考 [trace-cmd-record(1)](https://linux.die.net/man/1/trace-cmd-record) ::: ### Output Example ``` $ trace-cmd report ``` 單行範例,每一行皆代表一個事件,funcgraph_entry 代表函式進入。 ``` talker-22190 [002] 567757.342223: funcgraph_entry: | udp_sendmsg() { ``` 以下為修改過後的輸出結果,拿掉 plugin 名稱 (funcgraph_entry, funcgraph_exit)。 ``` $ trace-cmd report | cut --complement -c1-10,44-65 | less ... talker-22196 [009] 567758.376036: | udp_sendmsg() { talker-22196 [009] 567758.376042: | lock_sock_nested() { talker-22196 [009] 567758.376044: | _cond_resched() { talker-22196 [009] 567758.376046: 1.667 us | rcu_all_qs(); talker-22196 [009] 567758.376050: 5.500 us | } talker-22196 [009] 567758.376052: 1.724 us | _raw_spin_lock_bh(); talker-22196 [009] 567758.376055: 1.666 us | __local_bh_enable_ip(); talker-22196 [009] 567758.376058: + 16.475 us | } talker-22196 [009] 567758.376060: 2.093 us | __cgroup_bpf_run_filter_sock_addr(); talker-22196 [009] 567758.376064: | release_sock() { talker-22196 [009] 567758.376065: 1.662 us | _raw_spin_lock_bh(); talker-22196 [009] 567758.376069: 1.848 us | ip4_datagram_release_cb(); talker-22196 [009] 567758.376072: | _raw_spin_unlock_bh() { talker-22196 [009] 567758.376074: 1.666 us | __local_bh_enable_ip(); talker-22196 [009] 567758.376077: 4.750 us | } talker-22196 [009] 567758.376079: + 15.150 us | } talker-22196 [009] 567758.376081: 1.896 us | security_sk_classify_flow(); talker-22196 [009] 567758.376085: | ip_route_output_flow() { talker-22196 [009] 567758.376086: | ip_route_output_key_hash() { talker-22196 [009] 567758.376088: | ip_route_output_key_hash_rcu() { talker-22196 [009] 567758.376090: 4.407 us | fib_table_lookup(); talker-22196 [009] 567758.376097: 1.615 us | find_exception(); talker-22196 [009] 567758.376101: + 13.135 us | } talker-22196 [009] 567758.376103: + 16.453 us | } ... talker-22196 [009] 567758.376329: | udp_send_skb.isra.47() { talker-22196 [009] 567758.376331: 1.708 us | udp4_hwcsum(); talker-22196 [009] 567758.376334: | ip_send_skb() { talker-22196 [009] 567758.376336: | ip_local_out() { talker-22196 [009] 567758.376337: | __ip_local_out() { talker-22196 [009] 567758.376339: 1.635 us | ip_send_check(); talker-22196 [009] 567758.376342: 4.588 us | } talker-22196 [009] 567758.376344: | ip_output() { talker-22196 [009] 567758.376345: | ip_finish_output() { talker-22196 [009] 567758.376347: 2.084 us | __cgroup_bpf_run_filter_skb(); talker-22196 [009] 567758.376351: | __ip_finish_output() { talker-22196 [009] 567758.376353: 1.640 us | ipv4_mtu(); talker-22196 [009] 567758.376356: | ip_finish_output2() { talker-22196 [009] 567758.376360: | dev_queue_xmit() { talker-22196 [009] 567758.376362: | __dev_queue_xmit() { talker-22196 [009] 567758.376364: 1.771 us | netdev_core_pick_tx(); talker-22196 [009] 567758.376368: | validate_xmit_skb() { talker-22196 [009] 567758.376369: | netif_skb_features() { talker-22196 [009] 567758.376372: 1.682 us | skb_network_protocol(); talker-22196 [009] 567758.376375: 5.624 us | } talker-22196 [009] 567758.376377: 1.694 us | skb_csum_hwoffload_help(); talker-22196 [009] 567758.376380: 1.734 us | validate_xmit_xfrm(); talker-22196 [009] 567758.376384: + 15.748 us | } talker-22196 [009] 567758.376387: | dev_hard_start_xmit() { ... ``` :::info 使用 function_graph 回報的函式執行時間,會受到其內部函式的呼叫次數影響,呼叫次數越多或層數越多,overhead 就越大,若要精準的執行時間要使用以下命令: ```bash $ trace-cmd record -p function -l <func_name> ``` ::: ### 2. Trace a BPF Helper Calls TC (Traffic Control) 在 Linux 是用來過濾並排程封包的子系統,在 TC 中也有一個 eBPF hook,可以讓我們掛載自訂 eBPF 程式過濾封包。在 Linux Network Stack 之中呼叫 `cls_bpf_classify` 時就會執行目前註冊的 eBPF 程式。 ``` cpus=12 ... ssh-15986 [002] 2882.242765: funcgraph_entry: | cls_bpf_classify() { ssh-15986 [002] 2882.242767: funcgraph_entry: 0.213 us | bpf_skb_load_bytes(); ssh-15986 [002] 2882.242767: funcgraph_exit: 2.816 us | } <idle>-0 [002] 2882.243697: funcgraph_entry: | cls_bpf_classify() { <idle>-0 [002] 2882.243697: funcgraph_entry: 0.150 us | bpf_skb_load_bytes(); <idle>-0 [002] 2882.243698: funcgraph_exit: 1.330 us | } talker-17381 [003] 2882.290333: funcgraph_entry: | cls_bpf_classify() { talker-17381 [003] 2882.290339: funcgraph_entry: 1.255 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290342: funcgraph_entry: 1.022 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290344: funcgraph_entry: 1.031 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290346: funcgraph_entry: 0.978 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290348: funcgraph_entry: 0.959 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290349: funcgraph_entry: 0.974 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290351: funcgraph_entry: 0.958 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290353: funcgraph_entry: 0.963 us | bpf_skb_load_bytes(); talker-17381 [003] 2882.290355: funcgraph_entry: | bpf_ktime_get_ns() { talker-17381 [003] 2882.290356: funcgraph_entry: 1.271 us | ktime_get_mono_fast_ns(); talker-17381 [003] 2882.290358: funcgraph_exit: 3.032 us | } talker-17381 [003] 2882.290359: funcgraph_entry: | bpf_skb_event_output() { talker-17381 [003] 2882.290361: funcgraph_entry: | perf_misc_flags() { talker-17381 [003] 2882.290362: funcgraph_entry: 0.923 us | kvm_is_in_guest(); talker-17381 [003] 2882.290364: funcgraph_exit: 3.681 us | } talker-17381 [003] 2882.290366: funcgraph_entry: 1.791 us | perf_output_begin(); talker-17381 [003] 2882.290369: funcgraph_entry: 1.083 us | perf_output_copy(); talker-17381 [003] 2882.290371: funcgraph_entry: 0.948 us | perf_output_copy(); talker-17381 [003] 2882.290373: funcgraph_entry: | perf_output_end() { talker-17381 [003] 2882.290374: funcgraph_entry: | perf_output_put_handle() { ``` #### Dump Kernel Stack 印出呼叫函式時的堆疊。 ``` talker-30890 [010] 6385.274551: function: cls_bpf_classify talker-30890 [010] 6385.274552: kernel_stack: <stack trace> => cls_bpf_classify (ffffffffc0f38585) => tcf_classify (ffffffff82f597c0) => __dev_queue_xmit (ffffffff82efcc16) => dev_queue_xmit (ffffffff82efd1c0) => ip_finish_output2 (ffffffff82f786e1) => __ip_finish_output (ffffffff82f78a7a) => ip_finish_output (ffffffff82f78b6c) => ip_mc_output (ffffffff82f7a15d) => ip_local_out (ffffffff82f79aeb) => ip_send_skb (ffffffff82f7ac99) => udp_send_skb.isra.47 (ffffffff82fa9d76) => udp_sendmsg (ffffffff82faa6b7) => inet_sendmsg (ffffffff82fb9dad) => sock_sendmsg (ffffffff82ed79dc) => ___sys_sendmsg (ffffffff82ed8b19) => __sys_sendmsg (ffffffff82eda363) => __x64_sys_sendmsg (ffffffff82eda3bf) => do_syscall_64 (ffffffff8260441a) => entry_SYSCALL_64_after_hwframe (ffffffff8320008c) talker-30890 [010] 6385.274553: function: bpf_ktime_get_ns talker-30890 [010] 6385.274554: kernel_stack: <stack trace> => bpf_ktime_get_ns (ffffffff827e63d5) ``` --- ## References: ### uftrace uftrace-replay: https://github.com/namhyung/uftrace/blob/master/doc/uftrace-replay.md ### ftrace trace-cmd tutorial: https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/