# Assignment 2: MLOps & PCAM Pipeline Journal

**MLOps & ML Programming (2026)**

## Group Information

* **Group Number:** 8
* **Team Members:** Aram Amiry (15656675), Fien de Boer (15502422), Thomas Jansen (15863824), Delaram Zaker (15717208)
* **GitHub Repository:** [MLops group repository](https://github.com/dela888/MLops)
* **Base Setup Chosen from Assignment 1:** Delaram Zaker

---

## Question 1: Reproducibility Audit

1. **Sources of Non-Determinism:** & 2. **Control Measures:**

**train_pcam_simple.py**

```python
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Create trainer
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    config=config,
    device=device,
)
```

Different systems will produce different outputs (GPU vs. CPU, different GPU models). If CUDA is available, the output will depend on the specific GPU that is installed. This is currently not controlled. We cannot fully control it, but recording the hardware used ensures results remain reproducible on each individual device.

**loader.py**

```python
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(train_dataset),
    replacement=True
)

train_loader = DataLoader(
    train_dataset,
    batch_size=data_cfg["batch_size"],
    sampler=sampler,  # Use sampler instead of shuffle
    num_workers=data_cfg.get("num_workers", 0),
    pin_memory=data_cfg.get("pin_memory", False),
    drop_last=True
)
```

In this snippet, the default PyTorch RNG is used because no generator argument with a fixed seed is passed, making the DataLoader for the training set nondeterministic. This is currently not controlled in the pipeline, but a generator with a fixed seed will be implemented for task 1.3.

**train_loss.job**

```sh
#!/bin/bash
#SBATCH --job-name=train_loss_test
#SBATCH --partition=gpu_course
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=train_loss_%j.out
#SBATCH --error=train_loss_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

echo "=== Job started at $(date) ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "User: $USER"

# Load 2025 module stack
echo "=== Loading modules ==="
module load 2025

# Load specific modules (using bare Python to avoid conflicts)
module load Python/3.13.1-GCCcore-14.2.0
module load CUDA/12.8.0
```

In this snippet, there is no `module purge` command inside the train_loss.job SLURM job. This causes nondeterminism in terms of version control, as everyone has their own set of modules and packages installed in their virtual environment. We decided not to add `module purge` because the MLOps teaching team provided us with a set of required modules, and the essential modules are still reloaded in each SLURM job.

**mlp.py**

```python
layers.append(nn.Dropout(dropout_rate))
```

The torch dropout function is nondeterministic because it draws its masks from the built-in torch RNG, so this line makes the MLP nondeterministic. We are not controlling this by fixing the dropout mask, as we do not want to create permanent "dead" neurons that the network learns to ignore.
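Beyond the per-component fixes above, the remaining RNG sources (Python's `random`, NumPy, and the global torch RNG that also drives dropout) could be pinned with a single helper. A minimal sketch (the names `set_seed` and `seed_worker` are ours and not part of the current pipeline):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches (Python, NumPy, PyTorch CPU and CUDA)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)


def seed_worker(worker_id: int) -> None:
    """Re-seed NumPy and random inside each DataLoader worker process."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

`seed_worker` would be passed to the DataLoader as `worker_init_fn=seed_worker`, which matters once `num_workers > 0`; with a single worker, the seeded generator added in task 1.3 already covers the sampling order.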
3. **Code Snippets for Reproducibility:**

**train_pcam_simple.py**

```python
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# NEW: Enable deterministic algorithms
if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)

# Create dataloaders
print("\nCreating dataloaders...")
train_loader, val_loader = get_dataloaders(config)
print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
```

**loader.py**

```python
# NEW: Generator with a fixed seed
generator = torch.Generator()
generator.manual_seed(42)

# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=data_cfg["batch_size"],
    sampler=sampler,  # Use sampler instead of shuffle
    num_workers=data_cfg.get("num_workers", 0),
    pin_memory=data_cfg.get("pin_memory", False),
    drop_last=True,
    # NEW: the seeded generator is passed to the DataLoader, making the sampling deterministic
    generator=generator
)
```

4. **Twin Run Results:**

We got similar results across the twin runs, with only small variations; compared to the runs before manual seeding and the other fixes, determinism increased significantly.

---

## Question 2: Data, Partitioning, and Leakage Audit

1. **Partitioning Strategy:**

We used the pre-defined H5 splits from the PCAM SURFDrive directly. No shuffling or re-splitting has been done.

| Train | Validation | Test |
| ----- | ---------- | ---- |
| 80%   | 10%        | 10%  |

2. **Leakage Prevention:**

Looking at pcam.py, we do have a data leakage issue: `_calculate_statistics()` computes normalization parameters (mean/std) that are then applied to both the training and the validation/test data.

```python
def _calculate_statistics(self, sample_size: int = 1000) -> Tuple[float, float]:
    """Calculate mean and std from a sample of the data."""
    with h5py.File(self.x_path, 'r') as f:
        indices = np.random.choice(f['x'].shape[0],
                                   size=min(sample_size, f['x'].shape[0]),
                                   replace=False)
        sample = f['x'][indices].astype(np.float32)
        ...
        mean = np.mean(sample) / 255.0
        std = np.std(sample) / 255.0
```

The proper approach would be to compute the mean/std from the training data only and reuse those statistics for the validation and test sets.

3. **Cross-Validation Reflection:**

Our current approach is reasonable given the dataset size. If more robust hyperparameter tuning were needed, we should use k-fold CV on the training set for hyperparameter selection and then evaluate on the validation set. The current hold-out approach, however, is best for computational efficiency.

4. **The Dataset Size Mystery:**

The poisoned dataset is smaller in file size, even though the difference in the number of datapoints does not explain this; possible causes are compression differences, data types, or image complexity. We could (see the inspection sketch below):

* Check compression settings
* Compare actual image complexity
* Check for duplicate images in the poisoned set
* Verify data types match (uint8 vs float32)
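A quick way to check several of these points is to inspect the H5 files directly with h5py. A minimal sketch, assuming the standard PCAM key `'x'`; the poisoned-file path is a placeholder:

```python
import hashlib

import h5py
import numpy as np

files = [
    "camelyonpatch_level_2_split_train_x.h5",  # clean training images
    "poisoned_train_x.h5",                     # placeholder name for the poisoned file
]

for path in files:
    with h5py.File(path, "r") as f:
        dset = f["x"]
        print(path, "shape:", dset.shape, "dtype:", dset.dtype,
              "compression:", dset.compression)

        # Hash a spread-out sample of images to spot exact duplicates
        step = max(1, dset.shape[0] // 1000)
        hashes = {hashlib.md5(np.ascontiguousarray(dset[i]).tobytes()).hexdigest()
                  for i in range(0, dset.shape[0], step)}
        print("unique hashes in sample:", len(hashes))
```

Comparing shapes, dtypes, and compression settings between the two files would confirm or rule out the compression and dtype hypotheses, while a drop in unique hashes points to duplicated images.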
5. **Poisoning Analysis:**

![pcam_class_pixel_dist](https://hackmd.io/_uploads/BJ3XOVCrbe.png)

* Class 0 (Negative): strong peak at 220-250 (very bright), an unusual brightness bias.
* Class 1 (Positive): a more natural distribution centered around 150-200.

Suspected poisoning methods:

* **Brightness/Washout Poisoning:** Class 0 has a strong peak at 220-250. Negative samples may have been artificially brightened, creating a spurious correlation between brightness and label.
* **Label Flipping:** random labels swapped to confuse learning.
* **Black/White Patch Insertion:** the tests check for mean=0 and mean=255 outliers. Some images may be entirely black or white patches that are easy for the model to memorize.
* **Data Duplication:** the smaller file size is not fully explained by the sample count; duplicate images could explain both the smaller file size and the poisoning effect.

Most likely poisoning: brightness manipulation of Class 0 samples. The pixel distribution shows that Class 0 (negative) samples are significantly brighter than Class 1. This creates a shortcut where the model can achieve good accuracy by simply predicting based on image brightness rather than learning actual cancer features.

---

## Question 3: Configuration Management

1. **Centralized Parameters:**

* In loader.py, the random seed is fixed. This should be configurable for reproducibility across runs.

```python
generator.manual_seed(42)
```

* In the _MLP_ class in mlp.py, the dropout rate is fixed. This should be configurable because it is an important hyperparameter for model regularization.

```python
dropout_rate: float = 0.2
```

* In the _calculate_statistics_ function in pcam.py, the sample size for the statistics is fixed. The performance vs. accuracy trade-off should be configurable.

```python
def _calculate_statistics(self, sample_size: int = 1000)
```

* In loader.py, the weighted random sampler uses a fixed `replacement=True`. This should be configurable because sampling with or without replacement affects the training data distribution and model generalization.

```python
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(train_dataset),
    replacement=True  # Fixed sampling strategy
)
```

2. **Loading Mechanism:**

I used the config.yaml file to centralize the configuration. I added the following lines to it:

```yaml
dropout_rate: 0.3
activation: "relu"
random_seed: 42
sampling:
  replacement: true
statistics:
  sample_size: 1000
```

And loaded the configuration in Python using:

```python
config_path = Path(__file__).parent / "config.yaml"
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)
```

Then called it wherever necessary, e.g.:

```python
sample_size = config["data"]["statistics"]["sample_size"]

lr=config["training"]["learning_rate"],
weight_decay=config["training"]["weight_decay"]

data_cfg = config["data"]
sampling_config = config.get("data", {}).get("sampling", {})
replacement = sampling_config.get("replacement", True)
random_seed = config.get("data", {}).get("random_seed", 42)

input_shape = model_config["input_shape"]
hidden_units = model_config["hidden_units"]
num_classes = model_config["num_classes"]
dropout_rate = model_config["dropout_rate"]

model = MLP(config["model"])
```

Note that I added defaults because, for example, the tests use simpler configs that would otherwise fail.

3. **Impact Analysis:**

* Reproducibility
  * Missing parameters don't crash the system (see the helper sketch below)
  * Works with both complete configs (real experiments) and minimal configs (tests)
  * Code can handle configs from different versions
* Experiment Comparison
  * Can compare experiments even if some used different config versions
  * Easy to add new parameters without breaking old experiments
  * Start with a minimal config, add parameters as needed
* Collaboration
  * New collaborators can start with simple configs
  * Can understand the config structure gradually
  * Less likely to break others' work with config changes
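As an illustration of the first point ("missing parameters don't crash the system"), the scattered `config.get(...).get(...)` calls could be centralized in one small helper. A minimal sketch (the helper name `cfg_get` is ours and not part of the current codebase):

```python
from typing import Any


def cfg_get(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Look up a nested config value such as 'data.statistics.sample_size' with a fallback."""
    node: Any = config
    for key in dotted_key.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Using the `config` dict loaded above; works for both full and minimal test configs
sample_size = cfg_get(config, "data.statistics.sample_size", default=1000)
dropout_rate = cfg_get(config, "model.dropout_rate", default=0.2)
```

This keeps every default in one place, which also addresses the risk listed in point 4 of different files silently using different defaults for the same parameter.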
4. **Remaining Risks:**

* Different code files might use different default values for the same parameter, leading to inconsistent behavior.
* Adding new parameters to the config structure can break compatibility with old config files.
* Multiple similar config files with unclear naming create confusion about which configuration was actually used.
* Code behavior changes over time while config files might remain unchanged, leading to different results with the same config.
* Team members don't understand what certain parameters do or why specific values were chosen.
* Config files might accidentally contain sensitive paths or credentials that shouldn't be committed to version control.

---

## Question 4: Gradients & LR Scheduler

1. **Internal Dynamics:**

![gradient_norms_3seeds](https://hackmd.io/_uploads/H1VEU5LSbe.png)

* Per step: high variance within epochs; individual batches have wildly different gradient magnitudes. Per epoch: a smoothed trend that would mask the spikes and make training appear stable. Gradient norms fluctuate significantly within each epoch, and some batches produce gradients 10-20x larger than the mean. The moving average (black line) shows the underlying trend is stable, but individual batches are noisy. Step-level granularity exposes that the batches are not uniformly stable and that some batches contain data that causes disproportionately large gradients.
* Yes, significant gradient spikes are visible (marked as red dots in the plots). They point to outlier samples within batches, which is consistent with data poisoning and has implications for training stability.

2. **Learning Rate Scheduling:**

ReduceLROnPlateau was implemented as the scheduler.

![lr_schedule_reduce_on_plateau](https://hackmd.io/_uploads/BJ6F_c8r-l.png)

The learning rate controls the "step size" of gradient descent: `new_weights = old_weights - learning_rate × gradient`. Reducing the learning rate in the late stages allows small, precise adjustments to converge:

* A high LR near a minimum causes the optimizer to "bounce" back and forth across the valley.
* Smaller steps allow the model to settle into the minimum rather than overshooting.
* Gradual descent into flatter minima (which generalize better) rather than sharp minima.
* A lower LR makes updates less sensitive to noisy gradients from individual batches.

---

## Question 5: Part 1 - Experiment Tracking

1. **Metrics Choice:**

- $F_{\beta}$ ($\beta$ = 2): The $F_{\beta}$ formula allows us to weight recall and precision differently. By setting $\beta=2$, we mathematically state that recall is twice as important as precision (see the formula below). This forces the model to prioritize "catching every tumor," even if it means being a bit more "cautious" and triggering more false alarms.
- ROC AUC: A model might have 90% accuracy but only because it predicts "healthy" for everything (if the dataset is imbalanced). ROC AUC tests the model at every possible threshold (0.1, 0.5, 0.9, etc.).
- PR AUC: The area under the precision-recall curve. It is especially useful when the dataset is imbalanced, since it focuses on how well positives are retrieved (true positives versus false positives and false negatives).
- Loss & Accuracy: To track how well the model is fitting and how accurately it is predicting overall.
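For reference, the general definition of the $F_{\beta}$ score in terms of precision $P$ and recall $R$, and its form for $\beta = 2$:

$$
F_{\beta} = (1 + \beta^{2})\,\frac{P \cdot R}{\beta^{2} P + R}
\qquad\Longrightarrow\qquad
F_{2} = \frac{5\,P R}{4 P + R}
$$

With $\beta = 2$, recall enters this weighted harmonic mean with weight $\beta^{2} = 4$ relative to precision, which encodes the "catch every tumor" priority described above.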
2. **Results (Average of 3 Seeds):**

Per-seed validation results:

| Seed | Validation Loss | Validation Accuracy | F-beta | ROC AUC | PR AUC |
|------|-----------------|---------------------|--------|---------|--------|
| 0    | 0.6071          | 0.6849              | 0.6501 | 0.7639  | 0.7280 |
| 42   | 0.6347          | 0.6356              | 0.4396 | 0.7365  | 0.7029 |
| 80   | 0.5841          | 0.6060              | 0.3331 | 0.7432  | 0.7051 |

```
==================================================
Final Training Loss: 0.5848
Final Validation Loss: 0.6071
Final Validation Accuracy: 0.6849
Final Validation F-beta: 0.6501
Final Validation ROC AUC: 0.7639
Final Validation PR AUC: 0.7280
```

3. **Logging Scalability:**

Simple 'ad hoc' logging has several scalability issues. Firstly, you get no structured storage: print() output gets buried in terminal logs and is hard to query or filter. Also, with ad-hoc logging you often forget to log key hyperparameters or seeds, leading to non-reproducibility. Finally, other people cannot easily browse, compare, and reproduce your "printed" runs.

4. **Tracker Initialization:**

```python
timestamp = time.strftime("%Y%m%d_%H%M%S")
experiment_name = f"{config['model']['name']}_{timestamp}"
self.tracker = ExperimentTracker(experiment_name=experiment_name, config=config)
```

```python
# During training:
metrics = {
    "val_loss": val_loss,
    "val_accuracy": val_acc,
    "val_f_beta": fbeta_score(
        all_targets, all_preds, beta=self.fbeta_beta
    ),
    "val_roc_auc": roc_auc_score(all_targets, all_probs),
}
metrics.update(grad_norm_summary)
self.tracker.log_epoch_metrics(epoch=epoch, metrics=metrics)
```

```python
# Sending metrics to MLflow
mlflow.log_metrics({k: float(v) for k, v in metrics.items()}, step=step)
```

```python
if val_loss < self.best_val_loss:
    # ...
    save_dir = Path(self.config["training"]["save_dir"]) / "checkpoints"
    checkpoint_path = save_dir / "best_model.pt"
    torch.save(checkpoint, checkpoint_path)
    self.tracker.log_model(self.model, name=f"model_epoch_{epoch}")
    print(f" Saved best checkpoint to: {checkpoint_path}")
```

5. **Evidence of Logging:**

Config:
![image](https://hackmd.io/_uploads/r1DhJsir-e.png)

Git commit:
![image](https://hackmd.io/_uploads/HkamZsor-e.png)

Environment information:
![image](https://hackmd.io/_uploads/S1ZyvsiHZe.png)

Metrics and plots:
![image](https://hackmd.io/_uploads/SkGZDjjrbl.png)
![image](https://hackmd.io/_uploads/ByPfvsiBZl.png)

Checkpoints:
![image](https://hackmd.io/_uploads/H10DujiBbx.png)

6. **Reproduction & Checkpoint Usage:**

1. Get all the metadata you need to reproduce the experiment from the MLflow UI, under the overview and then "view parameters". Also fetch the environment information and the config.yaml.
2. Use `git checkout <hash>` to ensure that you are using the exact same codebase.
3. Recreate the environment by creating a new virtual environment, loading all necessary modules, and installing all Python packages. Finally, using the exact same codebase, execute the job with the config file on the train.py script.

7. **Deployment Issues:**

The data distribution in production changes over time, for example through new user behaviour. This can be mitigated by continuous monitoring and periodic retraining.

If you are not careful with feature engineering, the (pre)processing of the data for users can differ from the preprocessing we have done, leading to incorrect use of the training program. This can be prevented by providing the entire pipeline, including the preprocessing scripts and files.

The model may be too large or too slow for the deployment environment. This can be prevented by quantization and efficiency optimization (see the sketch below).
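As an illustration of the last point, post-training dynamic quantization in PyTorch converts the linear layers of an MLP to int8 weights with a few lines. A minimal sketch (the `nn.Sequential` below is only a stand-in for our trained MLP, and the flattened 96×96×3 input shape is an assumption; this is not code from our repository):

```python
import torch
import torch.nn as nn

# Stand-in for the trained MLP (the real model would be loaded from best_model.pt)
model = nn.Sequential(
    nn.Linear(96 * 96 * 3, 512), nn.ReLU(),
    nn.Linear(512, 2),
)

# Convert the weights of the listed layer types to int8 after training
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,
)

# The quantized model is used exactly like the original at inference time
with torch.no_grad():
    logits = quantized_model(torch.randn(1, 96 * 96 * 3))
```

Since dynamic quantization changes the numerics of the linear layers, validation metrics should be re-checked before shipping the quantized checkpoint.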
## Question 5: Part 2 - Hyperparameter Optimization

1. **Search Space:**

We chose to perform a grid search on the following parameters:

* training.learning_rate $\rightarrow$ to analyse whether it is more effective to have a higher learning rate (which overfits or destabilizes more easily) or a lower, more conservative one.
* training.batch_size $\rightarrow$ to test the effect of gradient noise versus stability and training speed.
* model.hidden_units $\rightarrow$ to test whether more parameters help or hurt given PCAM's size and regularization.

For the learning rate, we picked two values smaller than our baseline (1e-4) and one value larger, keeping the baseline itself: {1e-4, 1e-5, 3e-4, 3e-5}.

For the training batch size we decided to test only the baseline and larger values, as the baseline (32) is already relatively small: {32, 64, 128}.

For the hidden units we picked one configuration smaller and one bigger than our baseline ([512, 256, 128]), resulting in {[512, 256, 128], [256, 128], [1024, 512, 256]}.

2. **Visualization:**

![image](https://hackmd.io/_uploads/HyEOT6orWe.png)
![image](https://hackmd.io/_uploads/S1hKoairZx.png)
![image](https://hackmd.io/_uploads/SkEoiasH-e.png)
![image](https://hackmd.io/_uploads/HkupiTsBZx.png)

3. **The "Champion" Model:**

The model with the following config has the best ROC and PR scores: {hidden dims: [1024, 512, 256], learning rate: 1e-05, batch size: 32}.

* Average Loss: 0.6005
* Accuracy @0.5: 0.6661
* F2-score @0.5: 0.6491
* ROC AUC: 0.7447
* PR AUC: 0.7088

Confusion matrix @ threshold 0.5:

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN = 11297         | FP = 5102          |
| Actual Positive | FN = 5840          | TP = 10529         |

4. **Thresholding Logic:**

The 0.5 threshold is not appropriate at all. When diagnosing cancer, it is unacceptable to have almost 6000 false negatives. We need to accept a lower accuracy to make sure we almost never produce false negatives, as these can result in unnecessary deaths.

We picked threshold = 0.2949, which we obtained by maximizing recall over the candidate thresholds; in other words, we tried to minimize the number of false negatives relative to the number of true positives.

Confusion matrix @ threshold 0.2949:

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN = 6057          | FP = 10342         |
| Actual Positive | FN = 818           | TP = 15551         |

* Recall: 0.95
* Precision: 0.6

5. **Baseline Comparison:**

Given that we introduced class weights to combat class imbalance, we built a model that is only mediocre: our accuracy is 0.667, compared to the roughly 0.5 we would obtain by always guessing cancerous or always guessing non-cancerous.

---

## Question 6: Model Slicing & Error Analysis

1. **Visual Error Patterns:**

![image](https://hackmd.io/_uploads/Bk91yNAH-e.png)

These images are mostly dark.

![image](https://hackmd.io/_uploads/SybMJ4CBZe.png)

These images are mostly bright with some dark spots, and also a bit blurry.

2. **The "Slice":**

Our slice consists of 91 dark images (mean pixel intensity below 100). Our best model reaches 29.67% accuracy on this slice, while it scores around 67% on the full dataset. This is a big difference, making it clear that dark images are hard for the model to read (see the sliced-evaluation sketch below).

3. **Risks of Silent Failure:**

Global metrics can hide subgroup failures: even if we created a model with an excellent average F1, it could still fail catastrophically on underrepresented groups of people or on poorly acquired data. In the PCAM dataset this can be caused by systematic differences between positives and negatives, such as colour and brightness settings.
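A sliced evaluation like the one in point 2 boils down to masking the validation set by a property of the input and recomputing the metric. A minimal sketch, assuming `images`, `preds`, and `targets` are NumPy arrays collected from the validation loop (the variable names are ours):

```python
import numpy as np

# images: (N, 96, 96, 3) uint8 patches; preds, targets: (N,) label arrays
mean_intensity = images.reshape(len(images), -1).mean(axis=1)
dark_slice = mean_intensity < 100  # same slice definition as above

overall_acc = (preds == targets).mean()
slice_acc = (preds[dark_slice] == targets[dark_slice]).mean()

print(f"Images in dark slice: {dark_slice.sum()}")
print(f"Overall accuracy:     {overall_acc:.4f}")
print(f"Dark-slice accuracy:  {slice_acc:.4f}")
```

Logging slice metrics next to the global ones in MLflow would make the silent-failure risk from point 3 visible in every run.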
## Question 7: Team Collaboration and CI/CD

1. **Consolidation Strategy:**

Our group selected Delaram's repository as the foundation because it had the most complete initial structure and the core functionality needed for the project. She cloned her repository into a new group repository and added all members as collaborators.

We used git merge via Pull Requests to integrate individual work. Each member forked the new group repo and worked on feature branches in their forks. In the end they submitted Pull Requests to the main group repository, and we used GitHub's "Create a merge commit" option for integration. We used `git merge` because:

* It preserves the full history of who contributed what
* It is non-destructive: the original work remained intact
* It is team-friendly and avoided the risks of rebasing shared code
* It leaves a clear audit trail with merge commits linking to issues

2. **Collaborative Flow:**

![Screenshot 2026-01-18 at 13.24.26](https://hackmd.io/_uploads/r1jWZ8cSbx.png)
![Screenshot 2026-01-18 at 13.26.46](https://hackmd.io/_uploads/BJQ5-U5H-g.png)

3. **CI Audit:**

* GitHub runners don't have GPUs. Standard GitHub Actions runners are CPU-only virtual machines with no GPUs, so GPU-enabled PyTorch would be useless anyway. Furthermore, the size difference is massive: GPU-enabled PyTorch (CUDA) is 2-3 GB, while CPU-only PyTorch is around 200-300 MB. Runners have limited storage, so the GPU version could cause the runner to run out of space, especially with other dependencies. The large download and installation can also cause memory issues, and downloading 2 GB+ on every CI run wastes time and bandwidth.
* The CI pipeline ensures that no one can merge code that breaks the PCAMDataset or the MLP architecture by automatically running the full test suite. If a change introduces a bug, such as forgetting to flatten the MLP input or breaking the dataset filtering logic, the tests fail and GitHub shows a red ❌ on the PR. With branch protection rules enabled, merging is completely blocked until all tests pass, making it impossible for any teammate to accidentally merge broken code into the main branch (a minimal example of such a test is sketched below).

4. **Merge Conflict Resolution:**

The most difficult merge conflict we encountered was around the implementation of Q5 with MLflow and fixing the overall model infrastructure. The colleague working on this part had some trouble working in different git branches, so they couldn't push their intermediate steps to GitHub, leaving them around 70 commits behind. There were five merge conflicts that could not be resolved within GitHub, so they used the built-in conflict resolver in VS Code, which shows both versions and lets you choose between the main branch's code and the feature branch's code. This did not work out perfectly, because some combinations of code failed the tests; this was resolved by writing the fixes by hand. Finally, the CI pipeline for the teammate's commit passed and the changes could be merged into the main branch.
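To make the CI guarantee from point 3 concrete, the kind of check the test suite runs can be sketched as a small pytest test. This is only a sketch: the import path and the exact MLP config keys (taken from the config shown in Question 3) are assumptions, not our real test code:

```python
import torch

from ml_core.models.mlp import MLP  # assumed import path


def test_mlp_output_shape():
    """The MLP must map a batch of flattened PCAM patches to one logit per class."""
    model_config = {
        "input_shape": [96, 96, 3],
        "hidden_units": [512, 256, 128],
        "num_classes": 2,
        "dropout_rate": 0.2,
    }
    model = MLP(model_config)
    batch = torch.randn(4, 96 * 96 * 3)  # assumes the model expects pre-flattened input
    out = model(batch)
    assert out.shape == (4, model_config["num_classes"])
```

A forgotten flatten or a broken layer configuration changes the output shape, so an assertion like this fails and the PR is blocked before it can be merged.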
5. **Branching Discipline:**

```
(pytorch_gpu_env) scur2408@int4:~/MLOps_2026$ git log --graph --oneline --all --max-count=15
* 908d414 (HEAD -> dev, origin/main, origin/HEAD, main) Q4 making plots
* 808a68e Q4 compleet
* 5d5e425 question 4 complete
*   773cd89 Merge pull request #2 from dela888/master
|\
| * 1f0366f (origin/master) Reduced nondeterminism
| * 296d53e Reduced nondeterminism
* | 56af1cd Implement Gradient Norm Tracking Q4
* | 8ff8c04 Implement Gradient Norm Tracking Q4
* | 4d7f68c new training script
* | 64ad1da reproducable results with this training script
* | fd1e6c9 no error file
* | 05c7688 readme changed
|/
*   07d7f6d Merge pull request #1 from dela888/master
|\
| * 3a21986 Improved reproducibility
|/
* 9dea594 Delete train_18232577.err
```

The history does show clear branching and merging; it is not a single flat line of commits. For team collaboration, a non-linear graph with clear merges like ours is preferred.

---

## Question 8: Benchmarking Infrastructure

1. **Throughput Logic:**

It is critical to perform these benchmarks separately from the standard training loop to isolate the hardware's raw computational capability from external bottlenecks. While measuring during training provides a realistic view of end-to-end performance, it introduces interference from operations such as HDF5 file reading and CPU-based data augmentation that can mask the true speed of the GPU.

Within the PyTorch framework, we use `torch.cuda.synchronize()` to account for the asynchronous nature of GPU execution, ensuring the timer captures the actual completion of the operation rather than just the task submission.

Regarding numerical precision, utilizing `float16` instead of `float32` would likely increase throughput because half-precision tensors reduce memory bandwidth requirements and allow the use of specialized hardware like Tensor Cores.

If throughput were to drop unexpectedly in a subsequent session on Snellius, the issue could likely be attributed to node congestion.

Code snippet:

```python
if args.device == "cuda":
    torch.cuda.synchronize()

start_time = time.perf_counter()
with torch.no_grad():
    for _ in range(args.num_iterations):
        _ = model(dummy_input)

if args.device == "cuda":
    torch.cuda.synchronize()
end_time = time.perf_counter()

total_time = end_time - start_time
total_images = args.num_iterations * args.batch_size
throughput = total_images / total_time
```

2. **Throughput Table (Batch Size 1):**

| Partition | Node Type | Throughput (img/s) | Job ID |
| :--- | :--- | :--- | :--- |
| `thin_course` | CPU Only | 351.98 | 18465667 |
| `gpu_course` | GPU (Multi-Instance) | 2836.50 | 18465884 |

3. **Scaling Analysis:**

To analyze how parallelization impacts performance, a Slurm array job was executed across batch sizes of `8, 16, 32, 128, 256, and 512` (see the sketch below for how the array index selects the batch size). As the batch size increases, the throughput generally scales upward because the GPU can saturate its thousands of cores more effectively. However, this increase eventually plateaus once the hardware reaches its maximum utilization and the per-batch overhead of launching kernels has been amortized.

For the scaling test, we compared our standard 3-layer MLP with a large model featuring 5 layers and wider hidden dimensions. When scaling the large model to a batch size of 512, by monitoring the infrastructure with nvidia-smi, we observed that the large model utilized approximately 1180 MiB of VRAM out of the 40,960 MiB available on the NVIDIA node.
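Each array task selects its batch size from the `SLURM_ARRAY_TASK_ID` environment variable. A minimal sketch of how the benchmark script can read it (the variable names are ours; it assumes the job was submitted with `--array=0-5`):

```python
import os

# Batch sizes covered by the Slurm array job
BATCH_SIZES = [8, 16, 32, 128, 256, 512]

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
batch_size = BATCH_SIZES[task_id]
print(f"Array task {task_id}: benchmarking with batch size {batch_size}")
```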
4. **Bottleneck Identification:**

We suspect that HDF5 file reading is the slowest part of the pipeline on the node, because the PCAM dataset involves fetching a large number of small images from a compressed file on a shared filesystem.

---

## Question 9: Documentation & README

1. **README Link:**

[README](https://github.com/dela888/MLops/blob/55fb7e4b015652d9057069835ae2dc7d997249f7/README.md)

2. **README Sections:**

Installation, Data Setup, Training, and Inference are present. We even included an overview of the structure of the repository.

3. **Offline Handover:**

All PCAM data files, i.e.:

- `camelyonpatch_level_2_split_train_x.h5`
- `camelyonpatch_level_2_split_valid_x.h5`
- `train_y.h5`
- `valid_y.h5`

All files needed to run the model:

- the entire `src/ml_core` folder
- `train.py` + `config.yaml`
- `inference.py`: for running from a checkpoint
- a virtual environment with all the libraries we use installed

---

## Final Submission Checklist

- [x] Group repository link provided?
- [x] Best model checkpoint pushed to GitHub? Our model was too big for GitHub, so we added a WeTransfer link to the README.
- [x] inference.py script included and functional?
- [x] All Slurm scripts included in the repository?
- [x] All images use relative paths (assets/)?
- [x] Names and IDs of all members on the first page?