# HDARTS Background Research
## Theory-Inspired Path-Regularized Differential Network Architecture Search
https://proceedings.neurips.cc/paper/2020/file/5e1b18c4c6a6d31695acbae3fd70ecc6-Paper.pdf
- Tackles the problem of too many skip connections leading to poor performance of the learned networks
- Analyzes the theoretical reason behind this bias and proposes a fix
- Proves that when optimizing $F_{train}(W, \beta)$, the convergence rate at each iteration depends much more heavily on the weights of skip connections than on those of other operation types ($\beta$ in this paper is $\alpha$ from DARTS)
- Proposes Path-Regularized DARTS, which uses **group-structured sparsity that penalizes the skip-connection group more heavily than the other operation groups to rectify the competitive advantage of skip connections** (don't understand this yet, need to read further)
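A minimal sketch of what such a group-structured penalty could look like; the softmax parameterization, the per-operation grouping, and the coefficient names (`lambda_skip`, `lambda_other`) are my assumptions for illustration, not the paper's exact formulation:

```python
import torch

def group_sparsity_penalty(beta, skip_idx, lambda_skip=1e-2, lambda_other=1e-3):
    """Hypothetical group-lasso-style penalty on architecture parameters.

    beta: (num_edges, num_ops) architecture parameters (alpha in DARTS).
    skip_idx: column index of the skip-connection operation.
    The skip-connection group gets a larger coefficient (lambda_skip >
    lambda_other) to offset its optimization advantage.
    """
    probs = torch.softmax(beta, dim=-1)        # per-edge mixing weights
    group_norms = probs.norm(p=2, dim=0)       # one L2 norm per operation group
    weights = torch.full_like(group_norms, lambda_other)
    weights[skip_idx] = lambda_skip
    return (weights * group_norms).sum()

# usage: add to the architecture-level objective before updating beta, e.g.
# loss = val_loss + group_sparsity_penalty(beta, skip_idx=1)
```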
## Differentiable Neural Architecture Search in Equivalent Space with Exploration Enhancement
- Theoretical approach to ensuring that the differentiable (continuous) search space remains equivalent to the discrete architecture space
- Fixes the rich-get-richer problem: architectures that perform better early on are trained more frequently, and their updated weights raise their probability of being sampled, which often leads to local optima
- Uses a **variational graph autoencoder** (need to read about this) to injectively transform the discrete architecture space into an equivalent continuous latent space
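A rough sketch of the kind of variational graph autoencoder encoder that could map a discrete cell (adjacency matrix plus one-hot op labels) into a continuous latent code; the layer sizes, the plain GCN propagation step, and the mean readout are my own assumptions, not the paper's actual model:

```python
import torch
import torch.nn as nn

class VGAEEncoder(nn.Module):
    """Encode an architecture graph (adjacency + node op one-hots) into a
    Gaussian latent code, so discrete cells live in a continuous space."""

    def __init__(self, num_ops, hidden_dim=32, latent_dim=16):
        super().__init__()
        self.gcn1 = nn.Linear(num_ops, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def propagate(self, adj, x, layer):
        # simple (non-normalized) GCN step: aggregate neighbors, then transform
        return torch.relu(layer(adj @ x + x))

    def forward(self, adj, ops):
        h = self.propagate(adj, ops, self.gcn1)
        h = self.propagate(adj, h, self.gcn2)
        g = h.mean(dim=0)                                  # graph-level readout
        mu, logvar = self.mu_head(g), self.logvar_head(g)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar

# usage on a toy 4-node cell with 3 candidate ops:
# adj = torch.zeros(4, 4); adj[0, 1] = adj[1, 2] = adj[2, 3] = 1.0
# ops = torch.eye(3)[torch.tensor([0, 1, 2, 1])]
# z, mu, logvar = VGAEEncoder(num_ops=3)(adj, ops)
```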
## Other Papers
- https://openreview.net/forum?id=PKubaeJkw3
- https://arxiv.org/pdf/2010.13501.pdf
- https://openaccess.thecvf.com/content_ICCVW_2019/papers/NeurArch/Yan_HMNAS_Efficient_Neural_Architecture_Search_via_Hierarchical_Masking_ICCVW_2019_paper.pdf
## Ideas So Far
- Argue that weight sharing for alpha is useful while not running into the same problems as conventional weight sharing -> avoids catastrophic forgetting
- Issue of lack of correspondence between rich-get-richer and
- Hessian norm - related to argmax selection - argue that the hierarchy ends up doing perturbation-based selection (https://openreview.net/forum?id=PKubaeJkw3); see the sketch after this list
- Larger search space
- Argument for wider networks that converge faster (why is this the case?)
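A toy sketch of the perturbation-based selection idea referenced above, contrasted with plain argmax over alpha; the supernet interface (`mask_op` / `unmask_op`) and `val_loss_fn` are hypothetical placeholders I made up for illustration, not an API from that paper:

```python
import torch

def select_by_argmax(alpha):
    """DARTS-style selection: pick the op with the largest architecture weight."""
    return torch.softmax(alpha, dim=-1).argmax().item()

def select_by_perturbation(supernet, edge, ops, val_loss_fn):
    """Perturbation-based selection: keep the op whose removal hurts
    validation loss the most, i.e. the op the supernet actually relies on."""
    base = val_loss_fn(supernet)
    scores = []
    for op in ops:
        supernet.mask_op(edge, op)        # hypothetical: temporarily drop this op
        scores.append(val_loss_fn(supernet) - base)
        supernet.unmask_op(edge, op)      # hypothetical: restore it
    return int(torch.tensor(scores).argmax())
```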
## Ideas So Far 2
- Interpretability to help us understand how to argue that the same convolution / higher-level operation can find both lower-level and higher-level features
- Lottery ticket hypothesis - towards understanding hierarchical learning - finding repeated substructures that do the work and contribute most of the performance (see the sketch after this list) - is there a biological equivalent?
- Weight sharing - multi-task learning
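For reference on the lottery ticket idea above, a minimal sketch of iterative magnitude pruning with weight rewinding, assuming a generic PyTorch model and a user-supplied `train_fn` that applies the masks during training; the pruning fraction and round count are arbitrary, not tied to any specific paper's setup:

```python
import copy
import torch

def find_lottery_ticket(model, train_fn, prune_frac=0.2, rounds=3):
    """Iterative magnitude pruning with rewinding: repeatedly train, prune the
    smallest-magnitude surviving weights, and rewind the survivors to their
    initial values. The sparse subnetwork is the candidate 'winning ticket'."""
    init_state = copy.deepcopy(model.state_dict())        # weights to rewind to
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model, masks)                             # placeholder training loop
        for name, param in model.named_parameters():
            alive = param[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            threshold = alive.quantile(prune_frac)         # prune smallest fraction
            masks[name] *= (param.abs() > threshold).float()
        # rewind surviving weights to their original initialization
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                param.mul_(masks[name])
    return masks
```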