# Final Project Report: Embeddings and Labeling Schemes for A*

## AMLDS, Fall 2022

## Sundeep Chenreddy & Benjamin Feuer, NYU

## Part I: Background and Problem Formulation

The paper we are investigating is called *Embeddings and Labeling Schemes for A\** (https://arxiv.org/abs/2111.10041). It proves feasibility results for methods that learn shortest paths on graphs. In Part I, we provide a brief review of the facts most relevant to this paper.

**Graphs** are pairs $G = (V, E)$ of **vertices** $u, v \in V$, any pair of which may be joined by an **edge** $e_{u,v} \in E$. We evaluate the weight of an edge with a function $d : E \rightarrow \mathbb{R}^+$.
The **Single-Pair Shortest Path (SPSP) problem** consists of finding the shortest path between a single pair of vertices. SPSP can be solved on any finite, non-negatively weighted ($d(e) \geq 0$ for all $e \in E$), **directed** graph (cycles are allowed) using **Dijkstra's algorithm**, with time complexity $O(|E| \log |V|)$. Technically, Dijkstra's algorithm solves the more general **single-source shortest path (SSSP)** problem; for SPSP, we solve SSSP and then discard the paths to all nodes except our query.

**A\* Search**

**A\*** (pronounced "A-star") search is a goal-directed SPSP algorithm for arbitrarily large graphs which uses a heuristic function $\pi$ to guide the search. In A\*, the frontier is ordered not by the distance $d(s, u)$ traveled so far, but by the sum $d_\pi(u) = d(s, u) + \pi(u, t)$, where $\pi$ is some heuristic function and $t$ is the target node.

**Correctness Analysis of A\***

A heuristic function $\pi(s, t)$ which estimates the true shortest-path distance $d(s, t)$ is **consistent** iff $\pi(t, t) = 0$ for any destination $t$ and, for any edge $(u, v)$, $\pi(u, t) - \pi(v, t) \leq d(e_{u,v})$. Without a consistent heuristic, A* may fail. We also require EITHER that the graph is finite OR that $t$ exists in the graph -- if we fail both conditions, A* will not terminate. If $\pi$ is consistent (along with the graph assumptions from earlier), then shortest paths are unchanged, and $\pi$ is a lower bound on the true distance: $\pi(u, v) \leq d(u, v)$ for any $u, v \in V$. This latter property is sometimes called **admissibility**. A heuristic function $\pi(s, t)$ which estimates $d(s, t)$ is **sub-additive** if for any three vertices $(u, v, w)$, we have $\pi(u, v) + \pi(v, w) \geq \pi(u, w)$. All norm heuristics are sub-additive.

**Learned A\***

Traditionally, heuristics have been handcrafted by domain experts. Over the last few years, however, there has been growing interest in learning heuristic functions.
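To make these mechanics concrete, the following is a minimal sketch of A\* (our own illustration, not code from the paper): with $\pi \equiv 0$ it reduces to Dijkstra's algorithm, and with a consistent heuristic each vertex needs to be scanned only once.

```python
import heapq

def a_star(graph, s, t, pi):
    """A* search on a weighted digraph given as {u: {v: weight, ...}}.

    The frontier is ordered by g(u) + pi(u, t), where g(u) is the best
    known distance from s to u and pi is the heuristic function.
    """
    g = {s: 0.0}
    frontier = [(pi(s, t), s)]
    scanned = set()
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == t:
            return g[u]
        if u in scanned:        # stale entry; already finalized
            continue
        scanned.add(u)
        for v, w in graph.get(u, {}).items():
            cand = g[u] + w
            if cand < g.get(v, float("inf")):
                g[v] = cand
                heapq.heappush(frontier, (cand + pi(v, t), v))
    return float("inf")         # t unreachable
```

With `pi = lambda u, t: 0` this is exactly Dijkstra's algorithm restricted to a single query.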
In **Learned** A\*:

* A model $m$ is trained on a dataset of graphs; the inputs, $x$, are per-node features, and the output prediction, $y$, is an edge weight
* The function learned by the model, $m(u, t)$, is used as $\pi(u, t)$ at inference time

**Paper Review**

We provide a brief recapitulation of the key results which shaped our investigation. For a graph $G = (V, E, W)$ and a heuristic $\pi$, let $\delta(s,t)$ denote the set of vertices on the shortest path between $s$ and $t$ with the maximal number of hops, and let $S_\pi(s, t)$ be the set of vertices scanned by A\* given $s, t, \pi$. Further, let $d(s,t) = |\delta(s,t)|$. We say that $\pi$ has an **average additive overhead** $c$ if

$$ E_{s,t \in V}[|S_\pi(s,t)| - d(s,t)] \leq c $$

where $s$ and $t$ are chosen independently and uniformly at random from $V$.

A heuristic function $\pi : V \times V \rightarrow \mathbb{R}$ is a **norm heuristic** induced by an $l_p$ norm of a function $m : V \rightarrow \mathbb{R}^d$ if, for any pair of vertices $s, t \in V$, $\pi_{s, t} = \|m(s) - m(t)\|_p$. The dimension $d$ of $m$ is referred to as the dimension of $\pi$. The paper we investigate provides novel upper and lower bounds on the average additive overhead of learned A\* norm heuristics, both $l_p$ and $l_\infty$.

**Thm. 1.6: Lower bound for norm heuristics:** There exists a $v$-vertex unweighted graph $G = (V,E)$ of constant diameter such that for any consistent norm heuristic induced by an $l_p$ norm of dimension $d$, where $d = o(v / \log^2 v)$ or, when $p = \infty$, $d = o(\log v)$, the average query complexity is lower bounded by $\Omega(\frac{v}{\log^3 v})$.

**Research Questions**

We investigate two research questions in this report:

1. Can we improve the dependence $d = o(v / \log^2 v)$ shown in the original paper by focusing on directed, rather than undirected, graphs?
2. Can we apply A\* search techniques successfully to domains other than road networks?
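To ground the definition, here is a small sketch (our own, with a hand-picked embedding $m$) of a norm heuristic; its sub-additivity follows from the triangle inequality of the norm.

```python
def lp_norm_heuristic(m, p):
    """Return the norm heuristic pi(s, t) = ||m(s) - m(t)||_p induced by
    an embedding m: V -> R^d, given here as a dict of coordinate tuples.
    Use p = float("inf") for the l-infinity norm."""
    def pi(s, t):
        diff = [abs(a - b) for a, b in zip(m[s], m[t])]
        if p == float("inf"):
            return max(diff)
        return sum(x ** p for x in diff) ** (1 / p)
    return pi
```

For example, with the embedding `m = {"u": (0, 0), "v": (3, 4), "w": (3, 0)}` the $l_2$ heuristic gives $\pi(u, v) = 5$, and sub-additivity $\pi(u, v) \leq \pi(u, w) + \pi(w, v)$ holds since $5 \leq 3 + 4$.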
In Part III, we present a novel algorithm which utilizes techniques derived from Learned A\* search to solve the architecture selection problem.

## Part II: Learned A* on Directed Graph Search

The paper proves that when we use consistent norm heuristics, if the dimension $d$ of the heuristic is not sufficiently large, there is a graph on which the expected overhead is large. We now prove that for consistent norm heuristics on **directed** graphs, no matter how large $d$ grows, there is a graph on which the heuristic fails, i.e., incurs a large overhead. We prove this using the example graph shown below.

![](https://imgur.com/0xvQZim.png)

**Theorem:** There exists a $v$-vertex unweighted **directed** graph $G = (V,E)$ of constant diameter such that for any consistent norm heuristic induced by an $l_p$ norm of ANY dimension $d$, the average query complexity is lower bounded by $\Theta(v)$; since we can scan at most $v$ vertices, we scan $\Theta(v)$ vertices on average.

**Proof:** The fact below was stated in the paper, and it is fairly straightforward to see that it holds for directed graphs as well.

![](https://imgur.com/OyClYPg.png)

Now consider the example graph in the figure above. Here, we have $|V| = \Theta(n)$ vertices. The weights are all 1, since the graph is unweighted. For $i, j \in \{1, 2, \ldots, n\}$, the shortest distance from vertex $a_i$ to vertex $e_j$ is 6 ($a_i \rightarrow p \rightarrow b_k \rightarrow c_k \rightarrow d_k \rightarrow q \rightarrow e_j$). Below is the main part of the proof.
Note that, for any vertex $b_k$, $k \in \{1, 2, \ldots, n\}$, we have

$$
\begin{aligned}
d(a_i, b_k) + h(b_k, e_j) &= 2 + h(b_k, e_j) \\
&\leq 2 + h(b_k, d_k) + h(d_k, q) + h(q, e_j) && \text{(sub-additivity)} \\
&\leq 2 + h(b_k, d_k) + 1 + 1 && \text{(admissibility on edges } d_k \rightarrow q \text{ and } q \rightarrow e_j\text{)} \\
&\leq 2 + 1 + 2 && \text{(admissibility on edge } d_k \rightarrow b_k\text{, since norm heuristics are symmetric)} \\
&= 5 < 6 = d(a_i, e_j).
\end{aligned}
$$

Therefore, whenever the source vertex is $a_i$ and the target vertex is $e_j$, we will scan all of the vertices $\{b_1, b_2, \ldots, b_n\}$; i.e., for $n^2$ pairs of source and target vertices, we incur an overhead of $\Theta(n)$ vertices. This proves that the average overhead for this graph is $\Theta(v)$, irrespective of the dimension $d$.
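As a sanity check on the distances used in the argument, the snippet below builds the gadget for small $n$ (our reconstruction of the figure's edge set: $a_i \rightarrow p \rightarrow b_k \rightarrow c_k \rightarrow d_k \rightarrow q \rightarrow e_j$, plus the back-edges $d_k \rightarrow b_k$) and verifies by BFS that $d(a_i, e_j) = 6$ and $d(a_i, b_k) = 2$.

```python
from collections import deque

def build_gadget(n):
    """Directed, unweighted counterexample graph, as we read it from the
    figure. Vertices are labeled ("a", i), "p", ("b", i), etc."""
    g = {}
    def add(u, v):
        g.setdefault(u, []).append(v)
    for i in range(n):
        add(("a", i), "p")
        add("p", ("b", i))
        add(("b", i), ("c", i))
        add(("c", i), ("d", i))
        add(("d", i), ("b", i))   # back-edge used in the proof
        add(("d", i), "q")
        add("q", ("e", i))
    return g

def bfs_dist(g, s, t):
    """Hop distance from s to t in an unweighted digraph (None if unreachable)."""
    dist = {s: 0}
    dq = deque([s])
    while dq:
        u = dq.popleft()
        if u == t:
            return dist[u]
        for v in g.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                dq.append(v)
    return None
```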
**Bounding the Search Space**

NAT search takes place over a space bounded by a pre-trained "supernetwork"; the supernetwork is a carefully designed convolutional neural network whose weights were trained so as to simultaneously optimize subnetwork accuracy and supernetwork accuracy. The method by which this is accomplished is described in (https://arxiv.org/abs/1908.09791). The supernetwork search space is encoded as 22-integer strings, where the first two integers correspond to resolution and width multiplier, and the rest correspond to the layers. Each value indicates a choice; e.g., the third integer taking a value of "1" corresponds to using an expansion ratio of 3 and a kernel size of 3 in layer 1 of stage 1. See Fig. 2 of the paper for an illustration. This space, although finite, is still vast: its volume is approximately $3.5 \times 10^{19}$ for each combination of image resolution and width multiplier.

**Evaluating Architectures**

Given the large search space in NAT, it is highly desirable to have an accurate and efficient predictor of which architectures are likely to be successful. In NAT, an ensemble of radial basis functions is used as the accuracy predictor. The performance of this ensemble is enhanced by restricting the surrogate model to evaluating the Pareto frontier of the current objective trade-off. Please see Fig. 4 and Fig. 6 of the NAT paper for illustrations of these concepts.

**Selecting Architectures**

How can we add new possible architectures to the search space? One common approach is to design an evolutionary algorithm (EA). EAs use mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and a fitness function (compare the loss function in deep learning) determines the quality of the solutions.
Evolution of the population then proceeds by repeated application of the above operators. The specifics of the evolutionary algorithm used in NAT can be found in (https://arxiv.org/abs/2005.05859).

In an evolutionary space of model architectures, there will necessarily be a trade-off frontier, or Pareto front, consisting of all the models which are not dominated by any other model in the population. (**Domination** is a widely-used partial ordering for comparing two objective vectors: $a_1$ dominates $a_2$ if $a_1$ is no worse than $a_2$ on all objectives and better on at least one.) When the number of objectives is large, there can be many such frontiers, and many such models; yet our time and resource budget is limited, and it is not immediately obvious how we should select models for survival.

**NSGA-III**

NAT solves the model-selection problem with a well-known evolutionary algorithm called NSGA-III. (https://dl.acm.org/doi/10.1145/1143997.1144112) Illustrations and a detailed explanation of the algorithm can be found in (https://arxiv.org/abs/2005.05859) as well. Conceptually, NSGA-III works as follows. We start with a set of architectures $R$, each of which is uniquely associated with objective scores $F$. We are also given so-called reference directions $Z$, which we can think of as rays emanating from some starting point, usually the ideal point on the frontier we would like to reach in this iteration. The number of reference directions is a hyperparameter. We first perform a non-dominated sort of the points in $F$. This produces an ordered sequence of fronts: $F_1$ is the set of models not dominated by any other model; $F_2$ is the set of models not dominated once $F_1$ is removed; and so on. Another hyperparameter, $N$, sets the number of architectures the algorithm is to return.
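The two primitives just described, domination and non-dominated sorting, can be sketched as follows (minimization assumed; this is a simple $O(n^2)$ version of our own, not the bookkeeping-heavy variant used in production NSGA implementations):

```python
def dominates(a, b):
    """a dominates b (minimization): a is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Peel off successive Pareto fronts F1, F2, ... by repeatedly
    extracting the currently non-dominated points."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

For instance, on the objective vectors `[(1, 2), (2, 1), (2, 2), (3, 3)]` the first front is `[(1, 2), (2, 1)]`, since neither point dominates the other, and `(2, 2)` and `(3, 3)` fall into the second and third fronts respectively.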
Assuming $|R| > N$, at some point one of these fronts will be too large to include in its entirety. Within the front to be split, we select architectures by their proximity to some $Z_i \in Z$ which does not yet have any points associated with it; loosely speaking, this heuristic is intended to encourage an even distribution of points along the frontier. Pseudocode can be found below.

```python
# Input: a set of archs R, their objectives F, number of archs to
# select N, reference directions Z.
# Put archs into different fronts (rank levels) based on domination.
(F1, F2, ...) ← non_dominated_sort(F)
S ← ∅, i ← 1
while |S| + |Fi| < N:
    S ← S ∪ Fi; i ← i + 1
FL ← Fi  # the split front, where we cannot accommodate all archs
if |S| + |FL| = N:
    S ← S ∪ FL
else:
    (S̃, F̃L) ← Normalize(S, FL)  # normalize the objectives based on the
                                 # ideal and nadir points derived from R
    d ← orthogonal distance to Z_i, for each i
    ρ ← number of solutions associated with Z_i based on d, for each i
    # fill up S with the remaining archs from FL
    S ← S ∪ Niching(F̃L, N − |S|, ρ, d)
return S
```

**Our Proposed Algorithm**

We observe that when the number of objectives is small, $|Z|$ can be low, but as the number of objectives we wish to co-optimize increases, $|Z|$ must grow exponentially to provide equivalent coverage of the search space (the curse of dimensionality guarantees this). While this limitation is not particularly severe for simple problems, it poses a major challenge if we wish to apply learned NAS to more complex tasks, such as motion planning for a robotic arm with six degrees of freedom, or evaluating many trade-offs when planning a course of action. We present an algorithm that uses A\* heuristic methods to encourage a diverse selection of points even when the number of dimensions grows very large, but which also strongly optimizes for the best solution.
Where we typically conceptualize A\* search as finding the shortest path between two points, we can also view A\* as assembling the minimum-size set of points which joins source $s$ and target $t$. If $t$ does not exist, A\* search is still guaranteed to terminate as long as the number of steps is finite, although it may not attain a solution. We translate A\* to NAS as follows. Inspired by the theoretical results presented in the Learned A\* paper, we design an algorithm in which the true distance from a point $s$ is estimated as the $l_p$-norm distance to the ideal point $t$, such that $d(s, t) = \|s - t\|_p$. At all times, we seek a path to the ideal point on our objective trade-off frontier.

Minimizing our distance measure is akin to optimizing for "exploitation": we greedily select points close to our ideal. However, we do not know whether such points are good paths or dead ends. One strong approach to a heuristic in this setting is to optimize for "exploration"; NSGA-III accomplishes this through reference directions. We choose a different approach: our exploration heuristic is the largest $l_p$-norm distance between the points already in the set and $s$, the point we are now adding. That is, for any point $p$ already in the set, $h_{p,s} = \|p - s\|_p$ and $h_s = \max_p h_{p,s}$. Unlike NSGA-III's, our heuristic's time complexity scales not exponentially with the number of objectives, but linearly with the size of the frontier. We minimize $d + h$, with $h$ subject to a scaling hyperparameter $-\frac{d}{h} \leq \alpha \leq \frac{d}{h}$, $\alpha \neq 0$. When $\alpha$ is negative, our heuristic discourages clustering of similar architectures. However, our experimental results suggested that it was more effective in practice to encourage weak clustering, so in our experiments we fixed $\alpha = 1$. That said, we feel that additional ablations would be helpful to justify the selection of $\alpha$.
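A minimal greedy sketch of this selection rule (our illustration; the seeding of the first point and the function names are our own choices, not necessarily those of our full implementation):

```python
def lp_dist(a, b, p=2):
    """l_p distance between two objective vectors given as tuples."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def select_architectures(candidates, n, ideal, alpha=1.0, p=2):
    """Greedily select n points: exploitation term d = distance to the
    ideal point; exploration term h = largest distance from a candidate
    to the points already selected; pick the candidate minimizing
    d + alpha * h. Positive alpha encourages weak clustering."""
    # seed with the candidate closest to the ideal point
    selected = [min(candidates, key=lambda c: lp_dist(c, ideal, p))]
    pool = [c for c in candidates if c != selected[0]]
    while len(selected) < n and pool:
        def score(c):
            d = lp_dist(c, ideal, p)
            h = max(lp_dist(s, c, p) for s in selected)
            return d + alpha * h
        best = min(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

For example, with candidates `[(0.1, 0.1), (0.9, 0.9), (0.2, 0.3), (0.5, 0.5)]`, ideal point `(0, 0)`, and $\alpha = 1$, the first two selections are `(0.1, 0.1)` followed by `(0.2, 0.3)`: the point nearest the ideal, then the point with the best combined score.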
Pseudocode can be found below.

```python
# Input: a set of archs R, their objectives F, number of archs to
# select N.
# Put archs into different fronts (rank levels) based on domination.
(F1, F2, ...) ← non_dominated_sort(F)
S ← ∅, i ← 1
while |S| + |Fi| < N:
    S ← S ∪ Fi; i ← i + 1
FL ← Fi  # the split front, where we cannot accommodate all archs
if |S| + |FL| = N:
    S ← S ∪ FL
else:  # our contribution
    Normalize(S, FL); k ← |S| + |FL| − N
    P ← ∅; ID ← ideal point
    for each FL_j in FL:
        d_j ← lp_norm(FL_j − ID)  # p is the number of objectives
        for each s_k in S:
            h_jk ← lp_norm(s_k − FL_j)
        h_j ← min_k h_jk  # minimum distance from FL_j to a selected point
        P ← P ∪ {(d_j, h_j)}
    # grow S by the N − |S| points in P with the smallest combined score
    S ← S ∪ min over P of (d_j / max_j d_j + h_j / max_j (α · h_j))
return S
```

**Experimental Setup**

In this series of experiments, we sample $500$ points uniformly at random in $d$-dimensional space, where $d$ is the number of objectives we are attempting to optimize. The points range in value from $0$ to $1$. We assume objective minimization, i.e., that the ideal point is situated at the origin. Objectives are normalized and presumed to be scaled so that they all contribute equally to the loss. In the plots, the size of a point represents its position in the third dimension (smaller is more optimal). We use the standard NSGA-III algorithm as implemented in [ENCAS](https://github.com/AwesomeLemon/ENCAS), a follow-up paper to NAT which reproduced the NAT experiments in close collaboration with the original authors; unfortunately, there is no code-base associated with the NAT paper itself.
We vary the approach in ENCAS slightly: we generate reference points using the [Das-Dennis](https://www.researchgate.net/figure/Illustration-of-reference-points-generated-by-the-Das-and-Dennis-method-43-The-black_fig1_328881123) method, as the method used in ENCAS appears to have been deprecated in the most recent version of the Python library on which the implementation relies. We do not believe this substitution had much effect on the results.

**Experiments with 4-objective optimization**

Our experimental code can be found at [this link](https://colab.research.google.com/drive/11y3342dDoDKPFVTEBDLVTD6QTUFbFBqg?usp=sharing).

![](https://i.imgur.com/Vhm91OW.png)

NSGA-III: When $d = 4$, $N = 20$, $R = 100$, NSGA-III selects points which are close to optimal in the $x$ and $y$ dimensions, but does not select as strongly for $z$. RUNTIME: $<1s$

![](https://i.imgur.com/512G37W.png)

OURS: The distribution is concentrated in the lower-left corner (the ideal) and is more balanced between objectives. Our algorithm selects more points which are optimal in the $z$ dimension than NSGA-III does. RUNTIME: $<1s$

**Experiments with 5-objective optimization**

To provide a more concrete measure of point concentration, we measure the polygon area of the points selected by each algorithm.

![](https://i.imgur.com/Egp4mj6.png)

NSGA-III: When $d = 5$, $N = 25$, $R = 100$, NSGA-III selects non-optimal points in one or more dimensions. 3D area of selected points (lower values indicate tighter concentration): $0.4849$. RUNTIME: $30s$

![](https://i.imgur.com/6RxnNLb.png)

OURS: With $\alpha = 1$, our algorithm is more successful at balancing the objective trade-offs; this is particularly noticeable in the $z$ dimension. The point spread is approximately one tenth of what we see with NSGA-III, and our method is more than an order of magnitude faster.
3D Area: $0.0428$. RUNTIME: $<1s$

**Experiments with 6+ objectives**

The NAT paper cites a claim that NSGA-III has been successfully tested on up to $15$ objectives. As such, we decided to attempt to extend our experiments above $5$ dimensions. However, we were unable to do so; when we attempted to run the experiment with $d = 6$, NSGA-III ran for about a minute and then exceeded Colab's memory limits. This result suggests that the curse of dimensionality can make NSGA-III non-viable even for relatively small sample sizes, and for considerably fewer than $15$ objectives. Our method ran successfully in $<1s$.

**CIFAR Experiments**

Following the methodology of the original NAT paper, rather than training a supernetwork from scratch, we fine-tune a model to fit both the architecture and the weights to a particular dataset, in this case CIFAR-10 and CIFAR-100. (https://www.cs.toronto.edu/~kriz/cifar.html) We evaluate algorithm performance primarily with the **hypervolume metric** (illustrated in Fig. 14 of https://arxiv.org/abs/2005.05859). In the bi-objective case, hypervolume measures the area dominated by the solutions a multi-objective algorithm finds; a larger hypervolume indicates a better Pareto front. We also include the Spearman rank correlation of the accuracy predictor's scores with the evaluated set of models. This metric measures how well the accuracy predictor ranked models relative to their true measured performance, which naturally plays an outsized role in heuristic search performance.
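In the bi-objective minimization case, the hypervolume can be computed as a sum of rectangle areas relative to a reference point (which the metric requires, though we leave it implicit above); a small sketch of our own:

```python
def hypervolume_2d(front, ref):
    """Area dominated by a bi-objective minimization front relative to a
    reference point ref (larger is better). Points that do not dominate
    ref contribute nothing useful, so ref should be chosen beyond the front."""
    # keep only the non-dominated points, sorted by the first objective
    pareto, best_f2 = [], float("inf")
    for f1, f2 in sorted(front):
        if f2 < best_f2:
            pareto.append((f1, f2))
            best_f2 = f2
    # sweep left to right, adding one rectangle per Pareto point
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in pareto:
        area += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area
```

For example, the front `[(1, 3), (2, 2), (3, 1)]` with reference point `(4, 4)` has hypervolume $3 + 2 + 1 = 6$, while adding the dominated point `(3, 3)` changes nothing.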
* Experiments on CIFAR-10

| Epoch | Hypervolume: NSGA-III | Hypervolume: Ours | Spearman: NSGA-III | Spearman: Ours |
|-------|-----------------------|-------------------|--------------------|----------------|
| 1 | 0.79 | 0.80 | 0.71 | 0.77 |
| 2 | 0.83 | 0.83 | 0.88 | 0.95 |
| 3 | 0.84 | 0.85 | 0.79 | 0.96 |
| 4 | 0.85 | 0.85 | 0.72 | 0.89 |
| 10 | 0.87 | 0.87 | 0.30 | 0.34 |
| 30 | 0.89 | 0.89 | 0.42 | 0.28 |

We observe that when the accuracy predictor is reliable, our method converges slightly faster than NSGA-III, and it performs comparably even when the accuracy predictor is relatively weak. In our implementation, the most accurate model was comparable for both methods, achieving an accuracy of ~96.4%. The search was extremely fast for both methods, taking less than a second per search.

* Experiments on CIFAR-100

| Epoch | Hypervolume: NSGA-III | Hypervolume: Ours | Spearman: NSGA-III | Spearman: Ours |
|-------|-----------------------|-------------------|--------------------|----------------|
| 1 | 0.02 | 0.10 | 0.84 | 0.74 |
| 2 | 0.18 | 0.24 | 0.79 | 0.90 |
| 3 | 0.26 | 0.33 | 0.95 | 0.96 |
| 4 | 0.33 | 0.36 | 0.91 | 0.89 |
| 10 | 0.45 | 0.44 | 0.70 | 0.34 |
| 30 | 0.50 | 0.49 | 0.87 | 0.20 |

Again, when the accuracy predictor is reliable, our method converges in fewer iterations than NSGA-III, and it again performs comparably when the accuracy predictor is relatively weak. In our implementation, the most accurate model was comparable for both methods, achieving an accuracy of ~79.6%. The search was extremely fast for both methods, taking less than a second per search.

**Next Steps**

Our CIFAR experiments use only two objectives and a dataset which is extremely easy to optimize, and therefore do not illustrate our algorithm to its best effect. With additional time and computational power, we hope to explore performance with a larger number of objectives and on more challenging datasets.
One possibility is experiments in pose estimation or optimal path planning for complex robotics problems.