# LibDX: A Cross-Platform and Accurate System to Detect Third-Party Libraries in Binary Code
- Published: 2020
- Author: Wei Tang, School of Software, Tsinghua University

#### Abstract
1. Misusing third-party libraries carries hidden risks such as license violations and security vulnerabilities.
2. The paper proposes LibDX, a platform-independent and fully automated system for detecting reused libraries in binary files.
3. LibDX introduces the logic feature block concept, which is applied to deal with the feature duplication challenge in a large-scale feature database.

Summary: open-source third-party libraries carry vulnerability risks, so the author proposes this automated system to detect reused libraries. Compiling C and C++ hides many features.

#### Intro
1. Lists the sources of third-party libraries used on different operating systems, e.g. SourceForge, Google Code, and numerous package managers for different platforms, such as APT and RPM for Linux, Homebrew and CocoaPods [7] for macOS, and NuGet [8] for Windows.
2. To reduce development cost, vendors reuse parts of the code or individual functions from these third-party libraries. The stated benefit is that this can minimize security risk, but vendors may run into license violations and end up deploying vulnerabilities to users' machines.
3. The rest of this part describes related third-party vulnerabilities, using OpenSSL as the example.
4. Most features, such as function names, variable names, and function call graphs, are stripped or changed when binary files are generated.
5. Compared with source code, it is hard to extract features from binary code and turn them into an identifying mark, so it still comes back to ...
6. The paper proposes the logic feature block concept, which represents logical characteristics of code.
7. LibDX identifies logic feature blocks shared between the target application and third-party libraries, then generates a gene map of the target.
8. The gene map still contains some errors, so LibDX groups the libraries that are identified as positives in the gene map and have similar logic feature blocks, and selects the optimal match from each group as the detection result.

---

Summary:
- We construct a file processor and a feature extractor to analyze applications in various packaging formats and binary formats for different operating systems.
- We use the binary files in packages to extract features and build a third-party library database with our cross-platform feature extractor.
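- We propose the logic feature block concept and use it to deal with the code and feature duplication challenge in a large-scale library database.
- We design and build a third-party library detection system, LibDX, and evaluate it with a large and comprehensive ground-truth data set. Results show that LibDX achieves high accuracy and recall.
- LibDX is applied to analyze desktop applications in a software download centre, and we find some GPL license violations.

#### Related work
1. Java-based work: Java bytecode is not a native binary format. It retains many features that can be used for clone detection and is even easy to convert back to source code, so these approaches do not generalize to C and similar languages.
2. Third-party libraries based on C/C++, split into two categories:
   - Source-to-Source Comparison: relatively simple; fine-grained clone detection can be done line by line.
   - Binary-to-Source Comparison: several tools are discussed, such as OSSPolice, a state-of-the-art tool for finding third-party libraries in Android apps. It uses a hierarchical indexing scheme to achieve high accuracy.
3. Another tool that closely resembles the author's system is the Binary Analysis Tool (BAT), which finds third-party libraries in firmware and specifically targets the RPM packages released by Fedora.
4. BAT uses a direct feature matching method and produces one text document per library.
5. The remaining detection methods extract features, such as the number of instructions and the size of arguments in bytes, and search for them through the APIs of code hosting websites (this is what my ACFG does).

#### Design:
1. File Processor
   Builds a file-type identification model that uses the MIME type to identify each file, and finally sorts files into three classes: compressed files, binary files, and other.
   Formats such as macOS dmg and Fedora rpm need dedicated tools to unpack.
   Executables are further distinguished as PE, ELF, or Mach-O.
   Once classified, the files are passed to the Feature Extractor.
2. Feature Extractor
   Uses the read-only data segment of the executable, which holds static constants, as the source of features.
   Candidates must also meet stricter criteria:
   - The sequence starts and ends with \x00.
   - All characters are printable characters.
   - The length of the sequence is greater than 5.

   Only byte sequences that satisfy all of the above count as usable birthmark features.
   The features extracted from each file are stored as a string list.
   The reason for this choice is that compilation rarely alters static constants, so differences in the build process can be ignored.
   In practice some executables rely on dynamic linking; in that case LibDX uses fuzzy file names as an auxiliary identification feature.

   ![](https://i.imgur.com/lBTkzMg.png)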
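The three Feature Extractor criteria can be sketched in Python as follows. This is a minimal sketch, not LibDX's actual extractor: it assumes the read-only data segment has already been read into a `bytes` object, treats "printable" as ASCII 0x20 to 0x7E, and counts the length without the enclosing `\x00` bytes. The function name `extract_string_features` is hypothetical.

```python
# Assumption: "printable" means ASCII 0x20 (space) through 0x7E (~).
PRINTABLE = set(range(0x20, 0x7F))
# "Greater than 5" is read here as at least 6 characters, NUL bytes excluded.
MIN_LENGTH = 6


def extract_string_features(data: bytes) -> list[str]:
    """Return NUL-delimited printable sequences longer than 5 characters."""
    features = []
    # Splitting on \x00 yields the runs between NUL bytes. The first and
    # last chunks are skipped because they are not enclosed by \x00 on
    # both sides.
    chunks = data.split(b"\x00")
    for chunk in chunks[1:-1]:
        if len(chunk) >= MIN_LENGTH and all(b in PRINTABLE for b in chunk):
            features.append(chunk.decode("ascii"))
    return features


if __name__ == "__main__":
    sample = b"\x00hello world\x00abc\x00\x01binary!\x00version 1.2.3\x00tail"
    print(extract_string_features(sample))
```

On the sample input, `abc` is dropped for being too short, `\x01binary!` for containing a non-printable byte, and `tail` for lacking a closing `\x00`, which leaves `hello world` and `version 1.2.3`.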
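3. Logic Feature Block
   String constants are a syntactic feature and cannot directly express the overall program logic.
   LibDX examines the context of every string in the target and groups the strings into multiple logic feature blocks.
   In short, after extracting features from a file, it compares them against the features of every library in the database and looks for the blocks that match each other.
   The paper notes that when the compiler builds the syntax tree, data used by neighbouring logic is stored at adjacent addresses in the binary file, so even after a vendor refactors the code the same features can still be found close together.
4. Matching Method
   Uses TF-IDF weights, with the following conditions:
   - The matching ratio is larger than 0.25.
   - The binary file in our database has more than 20 features.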
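The two matching conditions can be illustrated with a short Python sketch. The notes only say that TF-IDF weights are used, so the details here are assumptions: the matching ratio is taken to be the IDF-weighted share of a library's features that also appear in the target, the IDF uses a common smoothed form, and logic feature blocks are not modelled. The names `idf_weights` and `match_libraries` are hypothetical.

```python
import math

MIN_RATIO = 0.25    # "The matching ratio is larger than 0.25."
MIN_FEATURES = 20   # "... has more than 20 features."


def idf_weights(library_features: dict[str, set[str]]) -> dict[str, float]:
    """Smoothed IDF per feature: features seen in fewer libraries weigh more."""
    n = len(library_features)
    doc_freq: dict[str, int] = {}
    for feats in library_features.values():
        for f in feats:
            doc_freq[f] = doc_freq.get(f, 0) + 1
    return {f: math.log((1 + n) / (1 + c)) + 1.0 for f, c in doc_freq.items()}


def match_libraries(
    target: set[str], library_features: dict[str, set[str]]
) -> dict[str, float]:
    """Return libraries passing both thresholds, with their matching ratio."""
    weights = idf_weights(library_features)
    results = {}
    for name, feats in library_features.items():
        # Skip libraries with too few features to be a reliable signature.
        if len(feats) <= MIN_FEATURES:
            continue
        total = sum(weights[f] for f in feats)
        matched = sum(weights[f] for f in feats & target)
        ratio = matched / total  # assumed definition of "matching ratio"
        if ratio > MIN_RATIO:
            results[name] = ratio
    return results


if __name__ == "__main__":
    libs = {
        "libA": {f"a{i}" for i in range(30)},
        "libB": {f"b{i}" for i in range(30)},
        "libSmall": {f"s{i}" for i in range(10)},
    }
    target = {f"a{i}" for i in range(15)} | {"b0", "b1", "b2"} | libs["libSmall"]
    print(match_libraries(target, libs))
```

In this toy database only `libA` is reported: `libB` matches 3 of 30 equally weighted features, which is below the ratio threshold, and `libSmall` is skipped for having too few features even though every one of them matches.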