## CodeLoRA

### 0. Background
* Lack of context (i.e., syntax or semantics) during fine-tuning of LLMs
* Performance decrease on domain-specific tasks
* Therefore -> our method (CodeLoRA): syntax- and semantics-based diversity -> enhanced Code LLMs during PEFT

### 1. Dataset
* Code2Text (6 PLs and corresponding comments)
* Code2Text (modified by selecting only Java and Python)
* Code Translation (Java -> C#)
* Code Repair (Java)
* Vulnerability Repair (Java)

![image](https://hackmd.io/_uploads/rk00ptUSa.png)

* Code smells -> bad code smells, the type of things that could be repaired, e.g., a less severe buffer overflow vs. a strong buffer overflow; use the CVSS score to group them as weak or strong
* Develop rules to help decide attention between two nodes -> the definition of one variable is tied back to other nodes in the AST; in the construction of one of the maps, also represent data flow (control-flow graph)
* Value numbering -> a type of dataflow analysis; the elements fed to value-number analysis can correspond to elements in the AST, and a path could be one way of combining everything together (see the sketch at the end of these notes)

### 2. Model
* Background information

#### 2.1 PEFT for training LLMs
* (Q)LoRA -> where the adjustment happens (a minimal from-scratch sketch is given at the end of these notes)

![image](https://hackmd.io/_uploads/S1BNNOUB6.png)
![image](https://hackmd.io/_uploads/BkOf7_UB6.png)

### 3.
A. Concrete traces vs. abstract traces (i.e., critiques: way too abstract, false positives -> our tool can incorporate them in a different manner; the model itself is not guaranteed to be sound, but we enjoy the trade-off, etc.) -> abstract traces will be very helpful -> having this kind of connection, i.e., proving the model is useful under the current setting, will make sense.

B. Not only how well it works; people also really want it to make sense theoretically. The model already makes sense, but we have to go deeper.

C. (Kevin) Intuition for something specific -> either enumerate the specific types that would be required. Stay away from weaponizable tasks, though that is less of a concern for a paper. Also focus on one task (i.e., code-optimization suggestions) and investigate everything for that task.

D. For advanced obfuscation techniques there is no simple compiler method -> a virtual machine, completely random stuff, not readable anymore -> it makes sense to decrypt these binaries -> argue from execution: run this piece of obfuscated code (on a virtual machine, on the fly) to understand what it is doing -> if there is a model that can understand it, that would be new (one USENIX paper -> MDP): https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-blazytko.pdf

E. If de-obfuscation -> what is the evaluation? -> transform the obfuscated code and see if it can be executed -> the transformation will change the program, but obfuscated code is also capable of detecting a specific run-time environment -> if we deobfuscate the code and change the obfuscated version so that it runs in a debugger or virtual machine without detecting the environment, that also works -> not only perfect deobfuscation but also (side goal), if that is not possible, making it easier for a reverse engineer.

F. Program transformation -> insecure to secure code -> vulnerability program repair.
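
Sketch for the value-numbering point in §1: a minimal local value-numbering pass over three-address statements. This is illustrative only, not the project's implementation; the per-statement value numbers are the kind of dataflow fact that could be tied back to AST nodes when deciding which node pairs deserve attention.

```python
# Minimal local value-numbering sketch (illustrative, not the project's analysis).
# Each (op, operand value numbers) key gets one value number, so syntactically
# different expressions that compute the same value share a number.

def local_value_numbering(block):
    """block: list of (dest, op, arg1, arg2) three-address statements."""
    vn_of_var = {}    # variable name -> value number
    vn_of_expr = {}   # (op, vn1, vn2) -> value number
    next_vn = 0
    numbering = []    # (dest, value number) per statement

    def vn(var):
        nonlocal next_vn
        if var not in vn_of_var:
            vn_of_var[var] = next_vn
            next_vn += 1
        return vn_of_var[var]

    for dest, op, a, b in block:
        key = (op, vn(a), vn(b))
        if key not in vn_of_expr:      # first time this computation is seen
            vn_of_expr[key] = next_vn
            next_vn += 1
        vn_of_var[dest] = vn_of_expr[key]
        numbering.append((dest, vn_of_var[dest]))
    return numbering

# 'c' and 'd' receive the same value number because both compute a + b.
print(local_value_numbering([
    ("c", "+", "a", "b"),
    ("d", "+", "a", "b"),
]))
```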
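
Sketch for §2.1, showing where the (Q)LoRA adjustment happens: the pretrained weight is frozen and only the low-rank factors A and B are trained. This is a minimal from-scratch PyTorch sketch; the layer size and hyperparameters (r, alpha) are illustrative, not CodeLoRA's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update.

    Forward: y = base(x) + (alpha / r) * x @ A^T @ B^T
    Only A and B are updated during PEFT; the base weight stays frozen.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # freeze the pretrained weight
        self.scaling = alpha / r
        # B starts at zero so training begins exactly at the base model.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap one projection of an attention block (sizes are hypothetical).
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, r=8, alpha=16)
y = lora_proj(torch.randn(2, 16, 768))   # (batch, seq, hidden)
```

QLoRA follows the same pattern, except the frozen base weight is additionally kept in a quantized (e.g., 4-bit) format while the low-rank factors remain in higher precision.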