# GraphTK Codebase Structure
###### tags: `IBM`
## Important Links
| Technology | Paper | Codebase | Demo Dataset |
| ---------- | ----- | -------- | ------------ |
| FastGCN | [linked](https://jiechenjiechen.github.io/pub/fastgcn.pdf) |https://github.com/gmancino/fastgcn-pytorch | Cora |
| EvolveGCN | [linked](https://jiechenjiechen.github.io/pub/evolvegcn.pdf) |https://github.com/IBM/EvolveGCN | elliptic |
| SALIENT |[linked](https://jiechenjiechen.github.io/pub/salient.pdf) |https://github.com/MITIBMxGraph/SALIENT | ogbn-arxiv |
| DAG-GNN |[linked](https://jiechenjiechen.github.io/pub/DAG-GNN.pdf) |https://github.com/fishmoon1234/DAG-GNN | ECOLI70 |
| GANF |[linked](https://jiechenjiechen.github.io/pub/ganf.pdf) |https://github.com/EnyanDai/GANF | SWaT |
| GTS |[linked](https://jiechenjiechen.github.io/pub/gts.pdf) |https://github.com/chaoshangcs/GTS | METR-LA |
> All codebases seem to be structured similarly except SALIENT.
## Summary of codebase structure
### [FastGCN](https://github.com/gmancino/fastgcn-pytorch)
- data folder
- model folder
- running script
### [EvolveGCN](https://github.com/IBM/EvolveGCN)
- data folder
- model files (egcn_h, egcn_o)
- running script
- other files (purpose not yet clear)
### [SALIENT](https://github.com/MITIBMxGraph/SALIENT)
- fast sampler folder
- fast trainer folder
- driver (which contains dataset.py, main.py, models.py, parser.py)
### [DAG-GNN](https://github.com/fishmoon1234/DAG-GNN)
- model file
- train.py
- utils.py
### [GANF](https://github.com/EnyanDai/GANF)
- models folder
- dataset.py
- training script
- evaluation script
- utils.py
### [GTS](https://github.com/chaoshangcs/GTS)
- data folder (includes data zip file, model hyperparameter yaml file)
- model folder
- scripts folder (generate data, adj mx, evaluation)
## General Structure
Comments on `base`:
- some codebases deal with dynamic graphs, others with static graphs
- we have both node features and time series
- Idea: use a single representation
    - can have some tools to transform graph representations
- Idea: inherit from PyG
    - PyG has loaders, samplers, and datasets
    - Q: does PyG handle dynamic graphs and time series data?
    - For dynamic graphs, write our own class that inherits from PyG `datasets`; internally it holds a `data` object.
- TODO:
- o
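The dynamic-graph dataset idea above can be sketched with a plain-Python stand-in (no PyG dependency; `Snapshot` and `DynamicGraphDataset` are hypothetical names mirroring `torch_geometric.data.Data` and the `Dataset.len()`/`get()` interface). Note that core PyG targets static graphs; dynamic/temporal support lives in the separate PyTorch Geometric Temporal package, as far as I can tell.

```python
class Snapshot:
    """One timestep: node features plus an edge list (stand-in for PyG `Data`)."""
    def __init__(self, x, edge_index):
        self.x = x                    # node features, e.g. list of feature lists
        self.edge_index = edge_index  # list of (src, dst) pairs

class DynamicGraphDataset:
    """Wraps T snapshots; mirrors the PyG Dataset len()/get() interface."""
    def __init__(self, snapshots):
        self._snapshots = list(snapshots)

    def len(self):
        return len(self._snapshots)

    def get(self, t):
        return self._snapshots[t]

# usage: two timesteps of the same 2-node graph
snaps = [
    Snapshot(x=[[1.0], [2.0]], edge_index=[(0, 1)]),
    Snapshot(x=[[1.5], [2.5]], edge_index=[(0, 1), (1, 0)]),
]
ds = DynamicGraphDataset(snaps)
```

A real version would subclass `torch_geometric.data.Dataset` so existing PyG loaders work unchanged.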
```
GraphTK/
├── base/
│   └── dataset.py
├── data/
│   └── some zip file for data, or data in appropriate format
├── utils/
│   ├── FastGCN/
│   │   └── ... (all the utils that are unique to the model)
│   ├── SALIENT/
│   │   └── ...
│   ├── EvolveGCN/
│   │   └── ...
│   ├── GTS/
│   │   └── ...
│   ├── GANF/
│   │   └── ...
│   ├── DAG_GNN/
│   │   └── ...
│   ├── dataloader.py (note: unsure if it's straightforward yet;
│   │                  it could go into examples or every utils folder)
│   └── metrics.py
├── models/
│   ├── FastGCN/ (note: sampler goes in here instead)
│   │   └── ...
│   ├── SALIENT/ (note: sampler goes in here instead)
│   │   └── ...
│   ├── EvolveGCN/
│   │   └── ...
│   ├── GTS/ (no or one graph)
│   │   └── ...
│   ├── GANF/ (assume no graph)
│   │   └── ...
│   └── DAG_GNN/ (no graph, learn the graph)
│       └── ...
├── examples/
│   ├── ...
│   └── ...
├── tests/ (automated tests for individual functions)
│   ├── test_...
│   └── ...
├── docs/ (or on a website)
│   ├── installation.md
│   └── usage.md
├── .gitignore
├── README.md
└── requirements.txt
```
> Ignore all sections below for now. They are not finalized.
## TODO
1. Make a unified environment and put up requirements.txt [Done]
2. Is it most intuitive to write down all models first?
3. Find a unified data format and write data processing (w/ test cases): `dataset.py`, or if needed also `preprocessing.py`
4. Do training (w/ test cases)
5. Do evaluation
## 1. Base Classes
These are the fundamental classes that will be used to implement the various models. They should be abstract enough to be used across different models but specific enough to provide useful functionality.
* `Graph`: This class should represent a graph. It should contain nodes and edges, and methods to add/remove nodes and edges. It should also support both directed and undirected graphs.
* `Node`: This class should represent a node in a graph. It should contain a list of its neighbors and any other node-specific data.
* `Edge`: This class should represent an edge in a graph. It should contain the two nodes it connects and any other edge-specific data.
* `Model`: This is the base class for all GNN models. It should define the interface that all models must implement. This might include methods like `train`, `predict`, `save`, and `load`.
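The `Graph`/`Node`/`Edge` trio above could look roughly like this. A minimal sketch only; the real classes would carry tensors, features, and PyG interop, and all names here are placeholders.

```python
class Node:
    """A node: its id, neighbor ids, and arbitrary node-specific data."""
    def __init__(self, node_id, data=None):
        self.id = node_id
        self.data = data or {}
        self.neighbors = []           # ids of adjacent nodes

class Edge:
    """An edge: the two node ids it connects plus edge-specific data."""
    def __init__(self, src, dst, data=None):
        self.src, self.dst = src, dst
        self.data = data or {}

class Graph:
    """Directed or undirected graph with add-node/add-edge methods."""
    def __init__(self, directed=False):
        self.directed = directed
        self.nodes = {}               # node_id -> Node
        self.edges = []

    def add_node(self, node_id, data=None):
        self.nodes[node_id] = Node(node_id, data)

    def add_edge(self, src, dst, data=None):
        self.edges.append(Edge(src, dst, data))
        self.nodes[src].neighbors.append(dst)
        if not self.directed:          # undirected: record both directions
            self.nodes[dst].neighbors.append(src)

# usage: a 3-node path graph 0 - 1 - 2
g = Graph(directed=False)
for i in range(3):
    g.add_node(i)
g.add_edge(0, 1)
g.add_edge(1, 2)
```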
## 2. Utility Classes/Functions
These are helper classes or functions that provide common functionality needed by the models.
* `DataLoader`: This class should handle loading and preprocessing of graph data.
* `Sampler`: This class should handle the sampling of nodes or edges, as required by models like FastGCN and SALIENT.
* `Metrics`: This class should provide methods to compute various evaluation metrics like accuracy, precision, recall, etc.
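A sketch of the `Metrics` helper for the binary case, kept dependency-free for illustration (a real version would likely wrap `sklearn.metrics` or torch ops):

```python
class Metrics:
    """Binary-classification metrics over parallel label lists."""

    @staticmethod
    def accuracy(y_true, y_pred):
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    @staticmethod
    def precision(y_true, y_pred):
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        return tp / (tp + fp) if (tp + fp) else 0.0

    @staticmethod
    def recall(y_true, y_pred):
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        return tp / (tp + fn) if (tp + fn) else 0.0

# usage
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 0, 1]
```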
## 3. Model Classes
These are the classes that implement the specific GNN models. Each paper's model should have its own class that inherits from the `Model` base class.
* `FastGCN`: This class should implement the FastGCN model.
* `SALIENT`: This class should implement various SALIENT models.
* `EvolveGCN`: This class should implement the two EvolveGCN models.
* `GTS`: This class should implement the GTS model.
* `GANF`: This class should implement the GANF model.
* `DAG_GNN`: This class should implement the DAG-GNN model.
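The shared interface could be an abstract base class; each model above then subclasses it. `DummyModel` below is a hypothetical stand-in just to show the contract.

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Base class for all GNN models; defines the required interface."""

    @abstractmethod
    def train(self, graph, labels):
        """Fit the model on a graph."""

    @abstractmethod
    def predict(self, graph):
        """Return predictions for the graph's nodes."""

    def save(self, path):
        raise NotImplementedError  # e.g. torch.save(self.state_dict(), path)

    def load(self, path):
        raise NotImplementedError

class DummyModel(Model):
    """Hypothetical subclass: predicts the majority training label."""

    def train(self, graph, labels):
        self.majority = max(set(labels), key=labels.count)

    def predict(self, graph):
        return [self.majority] * len(graph)

# usage
m = DummyModel()
m.train(graph=[0, 1, 2], labels=[1, 1, 0])
```

Making `train`/`predict` abstract means `Model` itself cannot be instantiated, so every concrete model is forced to implement them.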
## 4. Demo (and Data)
These are scripts that demonstrate how to use the models. There should be one script for each model that shows how to load data, train the model, and make predictions.
> Refer to Jie's slides for sample dataset.
## 5. Tests
These are scripts or functions that test the various parts of the codebase to ensure they work as expected.
## 6. Documentation
The documentation should explain how to install and use the toolkit, and it should provide detailed information about each class and function.