## K-ASTRO: Making LLMs more generalizable in Neural Code Analysis



References:
https://arxiv.org/abs/2211.08411
### 0. Background
* Lack of structural context (i.e., syntax and semantics) during fine-tuning of LLMs
* Performance drops on cross-domain tasks
* Therefore -> our method (K-ASTRO): semantics-based syntax diversity -> enhanced Code LLMs
### 1. Dataset
* Code2Text (6 PLs and corresponding comments)
* Code2Text (modified to include only Java and Python)
* Code Translation (Java -> C#)
* Code Repair (Java)
* Vulnerability Repair (Java)

Code smells -> bad code smells, i.e., the kind of issues that could be repaired
e.g., a less severe buffer overflow vs. a severe one; CVSS scores can be used to group vulnerabilities into weak vs. strong
Develop rules that help decide the attention between two nodes -> e.g., the definition of one variable, tied back to the other AST nodes that use it. In the construction of one of these maps, also represent data flow (control flow graph); see the sketch below
Value numbering -> a type of dataflow analysis; the elements given value numbers can correspond to elements of the AST, which could be a path toward combining everything together
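As a rough illustration of such an attention rule (the node indices and edge list below are hypothetical, not K-ASTRO's actual pipeline): given def-use edges recovered from a dataflow analysis over the AST, one can build an additive bias matrix that encourages attention between a variable's definition and its uses.

```python
import torch

def build_dataflow_bias(num_nodes, def_use_edges, on_bias=1.0, off_bias=0.0):
    """Build a [num_nodes, num_nodes] additive attention-bias matrix.

    def_use_edges: iterable of (def_node, use_node) index pairs taken from a
    dataflow analysis over the AST (e.g., value-numbering results). Node pairs
    linked by a def-use edge get a larger bias so attention between them is
    encouraged.
    """
    bias = torch.full((num_nodes, num_nodes), off_bias)
    for d, u in def_use_edges:
        bias[d, u] = on_bias
        bias[u, d] = on_bias  # symmetric: a use can also attend back to its definition
    return bias

# Toy example: node 0 defines a variable that nodes 2 and 3 use.
print(build_dataflow_bias(4, [(0, 2), (0, 3)]))
```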
### 2. Model
* Background Information

#### 2.0 Old Design
* Model Design

* Dataset


* Processed Dataframe

* Detail A: Augmentation & Upper and Lower Diagonal Matrix

* Detail B: Binning Techniques

* Detail C: Attention Bias (see the sketch after this list)

* Baseline result
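A minimal sketch of how Details B and C could fit together, assuming the binning is over pairwise AST node distances and the resulting bias is added to raw attention scores; the bin edges, tensor shapes, and injection point are illustrative assumptions, not the recorded old design.

```python
import torch
import torch.nn as nn

class BinnedAttentionBias(nn.Module):
    """Bucket pairwise AST node distances and learn one scalar bias per
    (head, bucket), added to the raw attention scores."""

    def __init__(self, num_heads, bin_edges=(1, 2, 4, 8, 16)):
        super().__init__()
        self.register_buffer("bin_edges", torch.tensor(bin_edges))
        # One learnable bias per head for each of len(bin_edges) + 1 buckets.
        self.bias_table = nn.Embedding(len(bin_edges) + 1, num_heads)

    def forward(self, distances, scores):
        # distances: [n, n] integer AST distances; scores: [heads, n, n]
        buckets = torch.bucketize(distances, self.bin_edges)  # [n, n]
        bias = self.bias_table(buckets).permute(2, 0, 1)      # [heads, n, n]
        return scores + bias

# Toy usage: 2 heads, 3 AST nodes.
bias_mod = BinnedAttentionBias(num_heads=2)
dist = torch.tensor([[0, 1, 5], [1, 0, 3], [5, 3, 0]])
scores = torch.zeros(2, 3, 3)
print(bias_mod(dist, scores).shape)  # torch.Size([2, 3, 3])
```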


#### 2.1 PEFT for training LLMs
* (Q)LoRA -> where the low-rank adjustment happens (see the sketch below)
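A minimal pure-PyTorch sketch of a LoRA-style linear layer, only to show where the low-rank adjustment sits relative to the frozen pretrained weight; in practice the adapters would be attached through a PEFT library (LoRA/QLoRA), and the rank and alpha values here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # start as a no-op adjustment
        self.scaling = alpha / r

    def forward(self, x):
        # Only lora_A / lora_B receive gradients -> this is where adjustment happens.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```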


#### 2.2 Automated Bias for LoRA
* One global (parameter-sharing) structure
* 1 x Transformer block -> local adjustment for the bias (see the combined sketch after 2.3)
#### 2.3 Dimension Alignment
* Diagonal matrix
* 1 x Transformer block -> dimension alignment for bias injection
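One possible reading of 2.2 and 2.3 taken together, written as a sketch: a single, globally shared transformer block generates the automated bias, and a per-layer learned diagonal matrix (stored as a vector) aligns its dimension before injection into each LoRA-adapted layer. The module names, shapes, and injection point are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SharedLoRABias(nn.Module):
    """One globally shared transformer block produces a bias from token states;
    a per-layer diagonal matrix (kept as a vector) aligns dimensions before the
    bias is injected into that layer's LoRA output."""

    def __init__(self, hidden_dim, num_layers, num_heads=4):
        super().__init__()
        # Global structure: a single transformer block whose parameters are shared.
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        # Per-layer diagonal alignment: elementwise multiply by diag(d_l).
        self.diag = nn.Parameter(torch.ones(num_layers, hidden_dim))

    def forward(self, hidden_states, layer_idx):
        # hidden_states: [batch, seq, hidden_dim] from the current layer.
        bias = self.block(hidden_states)          # local adjustment via one block
        return bias * self.diag[layer_idx]        # dimension-aligned bias term

# Toy usage: the same module serves every layer of a 12-layer model.
shared = SharedLoRABias(hidden_dim=768, num_layers=12)
h = torch.randn(2, 16, 768)
print(shared(h, layer_idx=3).shape)  # torch.Size([2, 16, 768])
```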
### 3. Experiments
#### 3.1 Research Questions
RQ1: Code2Text -> general performance
RQ2: Code2Text -> generalization
RQ3: Code Translation + Code Repair -> general performance
RQ4: Vulnerability Repair (Big-Vul, CWEs) -> specialized generalization (weak vulnerabilities -> strong vulnerabilities)
#### 3.2
TBD
#### 3.3
TBD
### 4. Results
TBD