GNN Features

# Node Features:

* **Method Count:** The number of methods associated with a class. This helps the GNN understand the size and complexity of a class.
* **Description Length:** The length of the class or method description. A longer description might indicate a more complex entity, giving the GNN a hint about the entity’s importance.
* **Is Class / Is Method:** Two binary flags that mark the node type: Is Class is 1 when the node represents a class, and Is Method is 1 when it represents a method.
* **Node Degree:** The number of connections (edges) a node has. A class that is inherited by several other classes or calls multiple methods has a higher degree.

# Edge Features:

**Relationship Type:**

* **Inheritance (0):** Edges between parent and child classes in the inheritance hierarchy.
* **Method Call/Definition (1):** Edges representing method calls or definitions within or between classes.

**Why These Features Are Important for Code Generation:**

These features give the GNN structural information about the code:

* **Method Count:** Indicates how many methods a class has, which can reflect its complexity.
* **Description Length:** A longer description may hint at more detailed or critical functionality.
* **Node Degree:** Indicates how connected a node is, which might reflect how central or important it is in the codebase.

----------------------

# Visualization of the Graph: What It Shows

**Graph Structure Visualization:** During the construction of the graph, we visualized the relationships between classes and methods using a node-link diagram. This diagram shows the structure and interactions within the codebase.

**Node Sizes:** Nodes are sized based on their degree centrality. Classes or methods with more connections (edges) are drawn as larger nodes, indicating their importance or central role in the codebase.

**Example:** A class that is inherited by many other classes will appear larger, as it has more connections.
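
The node features, edge relationship types, and degree-based node sizing described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the toy class and method names, the choice to measure description length in characters, and the node-size scaling constants are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy codebase: names and descriptions are illustrative only.
nodes = {
    "BaseHandler": {"kind": "class", "description": "Base class for request handlers."},
    "AuthHandler": {"kind": "class", "description": "Handles user authentication and login requests."},
    "FormHandler": {"kind": "class", "description": "Handles user input and processes form data."},
    "AuthHandler.login": {"kind": "method", "description": "Validates credentials."},
    "AuthHandler.logout": {"kind": "method", "description": "Ends the session."},
}

# Relationship type codes as listed under Edge Features.
INHERITANCE, METHOD_CALL_OR_DEF = 0, 1

# Each edge is (source, target, relationship type).
edges = [
    ("AuthHandler", "BaseHandler", INHERITANCE),
    ("FormHandler", "BaseHandler", INHERITANCE),
    ("AuthHandler", "AuthHandler.login", METHOD_CALL_OR_DEF),
    ("AuthHandler", "AuthHandler.logout", METHOD_CALL_OR_DEF),
]
edge_types = [rel for _, _, rel in edges]

# Node degree counts every edge touching a node, regardless of direction.
degree = Counter()
method_count = Counter()
for src, dst, rel in edges:
    degree[src] += 1
    degree[dst] += 1
    # Count a method for a class when a type-1 edge links the class to a method node.
    if rel == METHOD_CALL_OR_DEF and nodes[src]["kind"] == "class" and nodes[dst]["kind"] == "method":
        method_count[src] += 1


def node_features(name):
    """Return [method count, description length, is_class, is_method, degree]."""
    info = nodes[name]
    return [
        method_count[name],
        len(info["description"]),  # assumption: length measured in characters
        1 if info["kind"] == "class" else 0,
        1 if info["kind"] == "method" else 0,
        degree[name],
    ]


features = {name: node_features(name) for name in nodes}

# Degree centrality: degree divided by the maximum possible degree (n - 1).
n = len(nodes)
centrality = {name: degree[name] / (n - 1) for name in nodes}

# Assumed scaling for drawing: more central nodes get larger markers.
node_sizes = {name: 300 + 2000 * c for name, c in centrality.items()}
```

In this toy graph, `AuthHandler` has the highest degree (one inheritance edge plus two method edges), so it would be drawn as the largest node, followed by `BaseHandler`.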

**Edge Colors:** We used different colors to indicate the type of relationship between nodes:

* **Red Edges:** Represent inheritance relationships (e.g., one class inherits from another).
* **Blue Edges:** Represent method call/definition relationships (e.g., a method belongs to a class or a class calls a method).

**What the Visualization Indicates:**

**Large Nodes:** These represent classes or methods that have many connections, likely serving as central components in the codebase.

**Example:** A base class in an inheritance hierarchy might appear as a large node, indicating that several other classes depend on it.

**Dense Areas:** Areas where many edges connect to a few nodes indicate parts of the codebase with high interactivity or complexity.

**Example:** A class that calls many methods or is involved in multiple inheritance relationships may appear in a dense part of the graph.

**Sparse Areas:** Conversely, sparse areas (with fewer edges) might indicate isolated or less interactive parts of the codebase.

**Example:** A utility method or a standalone class with limited interaction might appear in these regions.

----------------

# How BERT Embeddings Work:

**Example:**

* **Description:** "Handles user authentication and login requests."
* **BERT Embedding:** A 768-dimensional vector representing the meaning of this description. The embedding encodes not just the words but also the context and relationships between the words.
* Two descriptions like "Handles user login" and "Manages authentication" will have similar embeddings, which helps the GNN understand that these classes have similar purposes.

**Updated Node Features with Embeddings:**

In the updated pipeline, each node includes:

* **Method Count:** Same as before, representing how many methods a class has.
* **Is Class / Is Method:** Binary flags indicating whether the node is a class or a method.
* **Node Degree:** Same as before, representing how interconnected the node is.
* **BERT Description Embedding:** A 768-dimensional vector that captures the meaning of the class or method description.

**Example:** Let’s consider two classes:

```
Class A: "Handles user input and processes form data."
Class B: "Manages database connections and queries."
```

With traditional features, both classes might appear similar (e.g., they both have 5 methods), but with BERT embeddings, the GNN understands that Class A is about input handling, while Class B is about database management. These embeddings provide more depth, allowing the GNN to differentiate between classes based on what they do, not just how they are structured.
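
To make the updated feature layout concrete, the sketch below concatenates the structural features with a 768-dimensional description vector. It rests on loud assumptions: `placeholder_embedding` is a hypothetical stand-in that produces deterministic pseudo-values from a hash and carries no semantic meaning, so it only illustrates the shape of the feature vector. A real pipeline would replace it with the output of a BERT encoder. The method count of 5 comes from the example above; the degree value of 3 is invented for illustration.

```python
import hashlib

EMBEDDING_DIM = 768  # dimensionality stated in the text


def placeholder_embedding(description, dim=EMBEDDING_DIM):
    """Hypothetical stand-in for a BERT encoder.

    Produces a deterministic vector of the right size from a hash of the
    text. It does NOT capture meaning; swap in a real encoder in practice.
    """
    values = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{description}".encode("utf-8")).digest()
        values.extend(byte / 255.0 for byte in digest)
        counter += 1
    return values[:dim]


def build_node_features(method_count, is_class, is_method, degree, description,
                        embed=placeholder_embedding):
    """Concatenate the structural features with the description embedding."""
    structural = [float(method_count), float(is_class), float(is_method), float(degree)]
    return structural + embed(description)


# Two classes with identical structural features (degree 3 is an assumed value).
class_a = build_node_features(5, 1, 0, 3, "Handles user input and processes form data.")
class_b = build_node_features(5, 1, 0, 3, "Manages database connections and queries.")

# The first four entries match; only the embedding part tells the classes apart.
same_structure = class_a[:4] == class_b[:4]
different_embedding = class_a[4:] != class_b[4:]
```

Each node vector here has 4 + 768 = 772 entries. With a real encoder in place of the placeholder, the embedding portion is what would let the GNN separate Class A from Class B even though their structural features are identical.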