# BinaryFixer: Transforming Vulnerable Binaries into Secure Constructs
(Project name: BinaryFixer; for the ICSE 2025 early submission, due 03/15)

## 0. Project Pitch

### 0.1 Note for 12/18/2023

A. Concrete traces vs. abstract traces (critiques: way too abstract, false positives -> our tool can incorporate them in a different manner; the model itself is not guaranteed to be sound, but we accept the trade-off) -> abstract traces will be very helpful -> making this kind of connection, i.e., showing the model is useful under the current setting, will make sense.

B. Not only how well it works: people also really want the model to already make sense theoretically, but we have to go deeper.

C. (Kevin) Intuition works for something specific -> we need to enumerate the specific types that would be required. Stay away from weaponizing tasks, though not necessarily for a paper. Also focus on one task (i.e., code optimization suggestions, and investigate everything for that task).

D. For advanced obfuscation techniques there is no simple compiler-based method -> virtual machines, completely random constructs, not readable anymore -> it makes sense to decrypt these binaries -> argue from execution: run this piece of obfuscated code (on a virtual machine, on the fly) to understand what it is doing -> if there were a model that could understand it, that would be new (one USENIX paper -> MDP): https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-blazytko.pdf

E. If we do de-obfuscation -> what is the evaluation? -> transform the obfuscated code and see if it can still be executed -> the transformation will change the program, but obfuscated code is also capable of detecting a specific run-time environment -> deobfuscate the code and change it so that it runs in a debugger or virtual machine without detecting the environment; if that works, it also counts -> not only perfect deobfuscation but also (side goal), when we cannot deobfuscate perfectly, making the code easier for a reverse engineer.

F. Program transformation -> insecure to secure code -> vulnerability program repair.

### 0.2 Note for 01/08/2024

A. Taint analysis -> given insecure code, let the LLM generate a taint analysis or execution trace first -> reveal the root cause of the vulnerability -> might achieve better results. Integrate information from execution traces or taint analysis. Even fancier: let the model learn the taint-analysis rules before predicting the secure code from the insecure code. E.g., for `A = B + C`: if B (or C) is tainted, then A is tainted (see the sketch after this list).

B. Binary hardening -> less trivial than fixing source code -> the most popular way is to lift the binary to IR, perform the fix, and recompile back to binary code; otherwise it breaks a lot of assumptions. Very large context, different places in the binary need to be fixed accordingly, and the fix should be very deterministic rather than produced by an iterative RL framework.

C. Source-code level, but incorporating additional information from the compiler or other sources? Noisy channel (not a problem at the binary level); search space (not an issue in binary hardening, since it is a purely rule-based process -> only focusing on localizing the position at the binary level).

D. Learning security practices -> certain environments satisfied, certain constructions at the source-code level -> LLMs have been generating insecure code most of the time -> if we have a tool to fix it, that is aligned.

E. Documentation about security practices should be helpful -> derive rules from the documentation (i.e., extract common rules for writing secure code into the model). See recent literature about how LLMs generate insecure code.
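To make the taint-rule example in 0.2.A concrete, here is a minimal sketch (an illustration written for these notes, not an existing component): forward taint propagation over three-address statements of the form `A = B + C`, i.e., the kind of rule the model could be asked to state before proposing a fix.

```python
# Minimal sketch (assumption, not the project's implementation): forward taint
# propagation over three-address statements like "A = B + C".
def propagate_taint(statements, initially_tainted):
    """statements: list of (dst, src1, src2); returns the set of tainted variables."""
    tainted = set(initially_tainted)
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for dst, src1, src2 in statements:
            # Rule: the destination is tainted if any source operand is tainted.
            if (src1 in tainted or src2 in tainted) and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted

# Example: A = B + C with B tainted  ->  A becomes tainted.
print(propagate_taint([("A", "B", "C")], {"B"}))   # {'A', 'B'}
```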
Notes:
- https://ee.stanford.edu/dan-boneh-and-team-find-relying-ai-more-likely-make-your-code-buggier
- https://github.com/angr/angr -> value-set analysis (VSA)
- Ghidra

### 0.3 Note for 01/15/2024

A. VSA is applicable to a limited set of vulnerability types -> discuss them separately -> VSA: one function calls another, and some functions may be indirectly called; it is quite useful for propagating data flow across functions when the vulnerability requires tracking information flow across functions. (1) Dump the information produced by VSA and see if it helps LLMs, i.e., whether it can indirectly benefit the model; (2) build the whole pipeline, where VSA is indeed useful in some cases.

B. Fine-tune the model not for the input but for the output -> prompt it, before it actually performs the fix on the code, to generate in natural language the steps it would take to fix the code -> in this case the documentation serves as the intermediate (logical) steps for fixing it. Prompt the LLM to follow the rules by providing them in natural language.

C. Vulnerability type, concrete type of documentation, how to repair -> design choices, e.g., combining sensitive statements or further approaches.

D. The DeepVSA paper performs something like backward taint analysis, starting from a sink such as a string-copy function and tracking how the parameters got passed to it. From simple patterns you can build complicated patterns -> potentially vulnerable properties of function calls; show that the values given to the function actually do not overflow -> know the bounds of those variables and how to propagate the value set.

E. During modeling, save some intermediate computation (consider multiple passes and merge operations until a fixed point where the value sets in the function no longer grow). There are patterns where you can take shortcuts (not sound) to produce reasonably accurate vulnerability repair. I.e., if prompting, run one or two cases and show the model "the VSA result looks like this" (a small value-set propagation sketch appears after 0.5 below).

F. Find one or two concrete case studies (e.g., buffer overflow) -> VSA propagates the value set along data flow, and the two are interdependent -> people use these tools in many different ways, e.g., compute signatures of programs and use the signatures to match similar program snippets; alternatively, use symbolic execution / dynamic analysis for the same task, or, for matching, use an SMT solver to check equivalence (VSA flows over data flow, and data flow sometimes depends on VSA).

G. (Kevin) Given a known buggy analysis, what should the VSA look like to match the patch/pattern -> another type of data-flow analysis / taint analysis.

### 0.4 Note for 01/22/2024

A. KG + prompt -> call-graph retrieval direction -> given a bug report, localize which function has the bug, then fix it -> treat a caller/callee's children as closer neighbors (how the function is called and how it calls other functions -> just like using a KG to guide retrieval of the source) -> an execution trace can likewise serve as a KG to direct retrieval.

B. Once the execution tool runs successfully -> we can use VSA instead of concrete values: interpret the emulation of execution (using Unicorn) -> since VSA is an abstract domain, abstract several Unicorn executions (1 -> 0, 2 -> 3, 3 -> 5) into one value-set trace -> let the LLM decide which values to execute, or sample meaningful values for the LLM to execute (cf. the Unicorn sketch after 0.8 below).

C. Abstract interpretation -> better than VSA for search.

### 0.5 Note for 02/05/2024

A. Cross-architecture knowledge graph?

B. Ground-truth data

C. CodeQL
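Referenced from 0.3.E above: a minimal, hypothetical sketch of what a dumped value-set result could look like when shown to the model. It uses a toy domain (plain sets of possible integer values, not angr's strided intervals) and checks whether a length flowing into a copy can exceed a buffer bound; the variable names and the bound are made up for illustration.

```python
# Minimal sketch (assumption, not angr's API): a toy value-set domain that
# propagates sets of possible values through straight-line statements and
# checks the bound of a value passed to a length-sensitive call.
def vsa_straight_line(statements, env):
    """statements: list like ("n", "add", "i", 8); env: var -> set of ints."""
    for dst, op, a, b in statements:
        av = env.get(a, {a}) if isinstance(a, str) else {a}
        bv = env.get(b, {b}) if isinstance(b, str) else {b}
        if op == "add":
            env[dst] = {x + y for x in av for y in bv}
        elif op == "mul":
            env[dst] = {x * y for x in av for y in bv}
    return env

# "i" comes from input and may be 0..3; copy length n = i * 8 + 8.
env = vsa_straight_line([("t", "mul", "i", 8), ("n", "add", "t", 8)],
                        {"i": {0, 1, 2, 3}})
BUF_SIZE = 24
print(env["n"])                                  # {8, 16, 24, 32}
print(any(v > BUF_SIZE for v in env["n"]))       # True -> possible overflow
```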
### 0.6 Note for 02/12/2024

A. Granularity of individual instructions -> the LLM can compose them to imagine what behavior an arbitrary code piece exhibits -> try different strategies (function level, block level).

B. CodeQL -> check whether a new vulnerability is introduced (and, if so, of what type, so we can cross-check) -> basic evaluation plus a realistic evaluation.

C. CVEFixes (MSR'20) + VulRepair (FSE'22).

D. I.e., the CPU **formal specification** could be a form of documentation -> check the formal description of individual instructions plus concrete execution examples -> Ghidra CPU specification, IDA Pro.

E. Composite repair -> half of the patch correct (e.g., a lock) -> how to inform the LLM to repair both parts.

F. About execution traces -> it would be nice to have a categorization/quantification of the difficulty of fixing different types of vulnerabilities (i.e., as Kevin mentioned, the types that require more effort).

G. Strongly connected components in the CFG -> more detail within functions, more than single instructions.

### 0.7 Note for 02/26/2024

A. Provide the trend across different window sizes -> two potential decisions: (1) the context window, (2) the actually highlighted statements of the code -> show a sweet spot (i.e., where the performance goes up and then down) -> research question? (It is a big result if the experiment shows that the context window makes a big difference.)

B. Along with the numbers, it would be good to have several failure cases (i.e., version 1 didn't work but version 2 did, demonstrating the key insight of the design) and success cases -> motivating example + case study.

C. Replace the BLEU score -> use the ground truth of token repair (i.e., how many tokens are repaired) -> a proof-of-concept way of showing the result -> the zero-shot vulnerability repair paper (S&P'23) probably contains metrics for the evaluation.

D. Compile a single project -> just compile OpenSSL -> individual instructions/statements are neutral, more generic, knowledge-based -> knowledge system.

E. Approach to condense execution traces -> use heuristics -> run a program many times and see which elements change during execution (dycon?); Jeremy(?) on variable renaming -> compile GitHub projects -> there are scripts for the build systems: https://github.com/huzecong/ghcc; follow the build system -> heuristics?

F. Just execute a single instruction using the emulator -> obtain execution traces for that single instruction, then freely compose them in the prompt (see the Unicorn sketch after 0.8 below).

### 0.8 Note for 03/04/2024

A.

B.

C.

D.

E.
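Referenced from 0.4.B and 0.7.F above: a minimal sketch of single-instruction emulation with Unicorn's Python bindings. The instruction bytes (`add rax, rbx`) and the sampled register values are arbitrary examples, and abstracting the concrete outputs into one set is only meant to illustrate the value-set-trace idea, not a finished pipeline.

```python
# Minimal sketch (assumption): emulate one x86-64 instruction with Unicorn and
# record its effect for several concrete inputs, e.g., to build per-instruction
# execution traces that a prompt can later compose.
from unicorn import Uc, UC_ARCH_X86, UC_MODE_64
from unicorn.x86_const import UC_X86_REG_RAX, UC_X86_REG_RBX

CODE_ADDR = 0x1000
CODE = b"\x48\x01\xd8"                       # add rax, rbx

def run_single_insn(rax, rbx):
    mu = Uc(UC_ARCH_X86, UC_MODE_64)
    mu.mem_map(CODE_ADDR, 0x1000)            # one page for the instruction
    mu.mem_write(CODE_ADDR, CODE)
    mu.reg_write(UC_X86_REG_RAX, rax)
    mu.reg_write(UC_X86_REG_RBX, rbx)
    mu.emu_start(CODE_ADDR, CODE_ADDR + len(CODE))
    return mu.reg_read(UC_X86_REG_RAX)

# Three concrete runs, abstracted into one value-set-style observation (cf. 0.4.B).
outputs = {run_single_insn(a, b) for a, b in [(1, 2), (2, 3), (3, 5)]}
print(outputs)                               # {3, 5, 8}
```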
## 0.X Idea summary

### 0.X.1 Overview
- Instruction: vulnerable code repair
- Context: external knowledge (1) + context information (2)
- Input: vulnerable code
- Output: invulnerable code

### 0.X.2 Summary
Motivation: instruct the LLM by telling it **what** (2) and **how** (1) to repair vulnerable code.

Assumption A: all vulnerability type labels are known for each source code snippet.
Assumption B: we do not need to execute the source code (either the vulnerable or the invulnerable counterpart) again at model inference time.

- RAG(1)(2) + CoT(2+1+2)
- For RAG(1), collect documents for vulnerability repair as an external KG
- For RAG(2), collect a specific number of execution traces as an internal KG
- For CoT(2+1+2), all three steps query information from the **same** KG that connects execution traces and documentation, because different types of vulnerabilities may need different types of external documentation to repair

### 0.X.3 RAG
- Run the code to obtain execution traces
- Establish the KG
- Obtain knowledge from the KG

### 0.X.4 CoT
- By default, set k = 3
- Step 1: provide k traces of the vulnerable code's behavior
- Step 2: provide related documentation
- Step 3: provide k traces of the repaired code's behavior
- (A prompt-assembly sketch appears at the very end of this note.)

### Update 02/18/2024
![image](https://hackmd.io/_uploads/S1ivVvg26.png)

<!--
## 1. Motivation (Binary Code)

### 1.1 Research Background
* Difficult to detect/repair vulnerabilities at the source level (i.e., relatively low detection accuracy even using LLMs, emerging new vulnerabilities in real-world scenarios)
* Discrepancy between source code and assembly code, and between static and dynamic assembly code (i.e., diversity in source-code writing style, register values change during execution, etc.)
* Some LMs/LLMs systematically repair vulnerabilities at the source level. Some examples:
  - VulRepair (T5-based model + source-code level, FSE 2022)
    ![image](https://hackmd.io/_uploads/SyW5A8wOp.png)
    ![image](https://hackmd.io/_uploads/B1zsLIP_p.png)
  - Zero-Shot Vul Repair (OpenAI Codex & AI21 Jurassic-1 + source-code level, S&P 2023)
    ![image](https://hackmd.io/_uploads/S123pIP_T.png)
    ![image](https://hackmd.io/_uploads/By4ba8Pda.png)
* Only a few repair vulnerabilities at the binary level, and under certain limitations. Some examples:
  - Almost all of them are non-AI methods

### 1.2 Common Types of Vulnerabilities
Buffer overflow, instruction misinterpretation, memory corruption, improper handling of privileges, race conditions, etc.
![image](https://hackmd.io/_uploads/r1QTLBPdT.png)

### 1.3 Inspiration
* A few LMs/LLMs consider binary code as a modality (i.e., Kexin's papers); only a small number of models consider whether the generated code can execute (i.e., CompCoder, PPOCoder, etc.); (maybe) no paper considers specifically repairing vulnerabilities with AI at the binary level
* Given the observation claimed by Trex, StateFormer, and NeuDep (LLMs spuriously memorize common patterns of source code rather than the causes of vulnerabilities), it is better to repair vulnerable code at the binary level before execution, rather than at the source level

### 1.4 Draft of BinaryFixer
![image](https://hackmd.io/_uploads/H1vCj1KuT.png)

### 1.5 Angr VSA
(See code)

## 2. Motivation (Source Code)

### 2.1 Insecure Code with AI Agent
Neil Perry, et al. Do Users Write More Insecure Code with AI Assistants? https://arxiv.org/pdf/2211.03622.pdf
Info: Misunderstanding design-level security concepts rather than implementation mistakes, which static analysis tools (e.g., SpotBugs, Infer, etc.) are more likely to focus on.
Q1: Using VSA to enhance the LLM's understanding of vulnerabilities (i.e., the regular design in other papers)
Q2: The connection between documentation and a binary-enhanced LLM (i.e., which methodology)
Q3: The format of source-level repair (i.e., used in alignment with the LLM, or as an individual tool)

### 2.2 Different Types of Vulnerabilities & Program Analysis Tools
Q1-1: Is VSA the only analysis tool needed for all kinds of vulnerabilities?
Benjamin Steenhoek, et al. An Empirical Study of Deep Learning Models for Vulnerability Detection (ICSE'23). https://arxiv.org/pdf/2212.08109.pdf
![image](https://hackmd.io/_uploads/rJAFxgGYp.png)
Q1-2: Is VSA only capable of making the LM learn variable-value-related vulnerabilities (i.e., buffer overflow, integer overflow, etc.)?
Benjamin Steenhoek, et al. Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection (ICSE'24). https://arxiv.org/pdf/2311.04109.pdf
![image](https://hackmd.io/_uploads/rkiluJztp.png)

### 2.3 Methodology?
Q2-1: Should APR be associated with (continuous) prompting or with fine-tuning?
Q2-2: Should we use LMs (reviewers may challenge whether they are SOTA) or LLMs (training speed might be too slow)?

## 3. Pre-Processing

### 3.1 Dataset
* CVEFixes
![image](https://hackmd.io/_uploads/SkGtibiKp.png)

### 3.2 DeepVSA
* Unable to obtain the tailored libdasm
![image](https://hackmd.io/_uploads/B1PLjbsta.png)

### References
[1] IEEE TDSC: Pre-Trained Model-Based Automated Software Vulnerability Repair: How Far Are We? https://www.computer.org/csdl/journal/tq/5555/01/10232867/1PZ32Zm1isU
[2] ISSRE 2021: A Comparative Study of Automatic Program Repair Techniques for Security Vulnerabilities. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9700400
[3] ESEC/FSE 2022: VulRepair: A T5-Based Automated Software Vulnerability Repair. https://dl.acm.org/doi/pdf/10.1145/3540250.3549098
[4] S&P 2023: Examining Zero-Shot Vulnerability Repair with Large Language Models. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10179324
[5] ISSTA 2023: How Effective Are Neural Networks for Fixing Security Vulnerabilities. https://arxiv.org/pdf/2305.18607.pdf
[6] ICML 2021: Break-It-Fix-It: Unsupervised Learning for Program Repair. https://arxiv.org/pdf/2106.06600.pdf
[7] ACL Findings 2022: Compilable Neural Code Generation with Compiler Feedback. https://aclanthology.org/2022.findings-acl.2.pdf
-->
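Referenced from 0.X.4 above: a minimal sketch of how the three CoT steps (k traces of vulnerable behavior, related documentation, k traces of expected repaired behavior) could be assembled into one repair prompt. Every name here (`retrieve_traces`, `retrieve_docs`, the prompt wording, k = 3) is a placeholder assumption rather than an implemented interface.

```python
# Minimal sketch (all names and wording are placeholder assumptions, cf. 0.X.4):
# assemble the three CoT steps into one repair prompt. retrieve_traces and
# retrieve_docs are caller-supplied callables returning lists of strings.
K = 3

def build_repair_prompt(vuln_code, vuln_type, retrieve_traces, retrieve_docs):
    vuln_traces = retrieve_traces(vuln_code, kind="vulnerable", k=K)   # RAG(2)
    docs = retrieve_docs(vuln_type)                                    # RAG(1), external KG
    fixed_traces = retrieve_traces(vuln_code, kind="repaired", k=K)    # RAG(2)
    parts = [
        "You are repairing a vulnerable code snippet.",
        f"Vulnerability type: {vuln_type}",
        "Step 1: observed behavior of the vulnerable code:",
        *vuln_traces,
        "Step 2: documentation relevant to repairing this vulnerability type:",
        *docs,
        "Step 3: expected behavior after repair:",
        *fixed_traces,
        "Now output the repaired code only.",
        vuln_code,
    ]
    return "\n".join(parts)
```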