# A Deep Value-network Based Approach for Multi-Driver Order Dispatching ###### tags: `論文翻譯` `deeplearning` `rl` `Reinforcement Learning` [TOC] ## 說明 區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院 :::info 原文 ::: :::success 翻譯 ::: :::warning 任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/pdf/2106.04493.pdf) ::: ## ABSTRACT :::info Recent works on ride-sharing order dispatching have highlighted the importance of taking into account both the spatial and temporal dynamics in the dispatching process for improving the transportation system efficiency. At the same time, deep reinforcement learning has advanced to the point where it achieves superhuman performance in a number of fields. In this work, we propose a deep reinforcement learning based solution for order dispatching and we conduct large scale online A/B tests on DiDi’s ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics. In particular, we model the ride dispatching problem as a Semi Markov Decision Process to account for the temporal aspect of the dispatching actions. To improve the stability of the value iteration with nonlinear function approximators like neural networks, we propose Cerebellar Value Networks (CVNet) with a novel distributed state representation layer. We further derive a regularized policy evaluation scheme for CVNet that penalizes large Lipschitz constant of the value network for additional robustness against adversarial perturbation and noises. Finally, we adapt various transfer learning methods to CVNet for increased learning adaptability and efficiency across multiple cities. We conduct extensive offline simulations based on real dispatching data as well as online AB tests through the DiDi’s platform. Results show that CVNet consistently outperforms other recently proposed dispatching methods. We finally show that the performance can be further improved through the efficient use of transfer learning. ::: :::success 最近對於共乘這種訂單分派的研究強調了在分派過程中把[空間](http://terms.naer.edu.tw/detail/17625075/)與[時間](http://terms.naer.edu.tw/detail/3203010/)考慮進去對於提高[運輸系統](http://terms.naer.edu.tw/detail/17630536/)效率的重要性。同一時間,深度強化學習已經進展到在多個領域有著超人般表現的地步。在這個研究中,我們提出一個基於深度強化學習的解決方案(用於訂單分派),而且我們還在滴滴叫車平台上做了大規模的online A/B tests,這是為了展示出這個方法在司機的總收入與使用者經驗相關的指標都有著明顯的改善。特別是我們把這個叫車分派的問題建構成一個Semi Markov Decision Process,用這樣的方式將分派動作(dispatching actions)的時間相關因素給考慮進來。為了能夠提高這些使用非線性函數逼近器(nonlinear function approximators)的值迭代(value iteration)過程的穩定性(像是採用神經網路),我們提出一個擁有著創新的分佈式狀態表示層(distributed state representation layer)的Cerebellar Value Networks (CVNet)。我們進一步的推導出一個用於CVNet的正規化的策略評估方案(regularized policy evaluation scheme),這個方法它是懲罰一個value network的large Lipschitz constant,用這構的方式來增加對於對抗性[擾動](http://terms.naer.edu.tw/detail/3189850/)與噪點的魯棒性。最後,我們還在CVNet上弄了很多種的transfer learning的方法,以此提高跨多個城市的學習適應性以及效率。我們拿著這些實際的分派資料做了廣泛的離線模擬,也通過滴滴的平台做了online A/B tests。結果說明了,CVNet始終優於其它近期所提的分派方法。我們最後還說明了,通過有效地使用transfer learning可以進一步的提高效能。 ::: ## 1 INTRODUCTION :::info In recent years, the advent of large scale online ride hailing services such as Uber and DiDi Chuxing have substantially transformed the transportation landscape, offering huge opportunities for improving the current transportation efficiency and leading to a surge of interest in numerous research fields such as driving route planning, demand prediction, fleet management and order dispatching (see, e.g., [10, 18, 21, 22]). 
One of the key enablers of this revolution lies in the ubiquitous use of Internet connected mobile devices which collect data and provide computation in real time. How to make use of the real-time rich information to bridge the once significant gap between supply and demand and improve traffic congestion, however, remains an active applied research topic. ::: :::success 近年來,Uber、滴滴打車等大型線上叫車服務的出現大大的改變了交通生態(transportation landscape),為改善當前的交通效率提供了一個極大的機會,這也誘發人們對許多研究領域的興趣,像是駕駛路線規劃、需求預測、車隊管理以及訂單分派(see, e.g., [10, 18, 21, 22])。推動這場革命的關鍵因素之一就在於無所不在的互聯移動設備,這些設備可以即時的收集資料並提供計算。不過,如何利用這些即時(real-time)豐富的信息來消除供需之間明顯的落差以改善交通擁塞問題仍然是一個活躍的應用研究議題。 ::: :::info In this work we consider the problem of driver-passenger dispatching [8, 17, 19, 21, 22]. In a ride-sharing platform the platform must make decisions for assigning available drivers to nearby unassigned passengers over a large spatial decision-making region (e.g., a city). An optimal decision-making policy requires taking into account both the spatial extent and the temporal dynamics of the dispatching process since such decisions can have long-term effects on the distribution of available drivers across the city. Previous work [8, 22] ignores the global optimality in both the spatial and temporal dimensions, e.g., either assign the nearest driver to a passenger in a local greedy manner or match them on a first-comefirst-serve basis. In [21] the dispatching process is formulated as a combinatorial optimization problem. While only optimizing over the current time step, [21] demonstrates that by accounting for the spatial optimality alone a higher success rate of global order matches can be achieved. ::: :::success 在這項研究中,我們考慮了司機與乘客的分派問題[8, 17, 19, 21, 22]。在共享搭車的平台中,這平台必需要在一個很大的空間決策區域(像是城市)上幫還沒有被分配到司機的乘客分配一位可用的司機。一個最佳的決策策略必需要能夠把分派過程中的空間範圍與時間動態性給考慮進去,因為這類的決策可能會對整個城市裡面可以用的司機的分佈產生長期性的影響。之前的研究,[8, 22],就忽略掉空間與時間維度的[總體最佳化](http://terms.naer.edu.tw/detail/3477280/),舉例來說,要嘛就是以局部性貪婪(local greedy)的方式把最近一個司機分配給乘客,不然就是先到先贏的方式來匹配。在[21]的研究中,分派過程弄成組合最佳化問題。儘管只是在當前的時步(time step)上最佳化,不過[21]的研究還是說明了,就算你單純的考慮空間最佳化,還是可以實現更高的全域順序匹配的成功率。 ::: :::info Two of the most recent work [17, 19] stand out by explicitly considering the spatiotemporal optimality in the order dispatching model. The work in [19] takes a learning and planning approach, where the offline learning step performs Temporal-Difference (TD) update in a discrete tabular spatiotemporal space using dynamic programming and the online planning step uses the tabular value from the learning step to compute real-time matching by solving a combinatorial optimization. While [19] has successfully deployed the algorithm in the production system of Didi Chuxing and reported remarkable improvements over the baselines, the limitations of a tabular approach are obvious, e.g., unable to generalize beyond the historical training data and impossible to respond in real-time to different supply and demand conditions. Another work in [17] adopts a deep reinforcement learning approach based on Q-learning, where a deep neural network is employed to estimate the state-action value function of the driver. The reinforcement learning agent is trained, from a single driver perspective, to maximize the total revenue throughout the day based on the historical data. To improve the sample complexity of reinforcement learning, a novel transfer learning method is also proposed for order dispatching to leverage knowledge transfer across multiple cities. 
[17] demonstrates the successful use of a deep neural network in a single driver’s long term income maximization, but the approach is intrinsically inadequate to be used in a production dispatching system which demands coordinations among multiple agents ::: :::success 最近的兩項研究,[17, 19],因為明確的考慮了訂單分派模型中的[時空](http://terms.naer.edu.tw/detail/3222914/)[最佳性](http://terms.naer.edu.tw/detail/57617/)(spatiotemporal optimality)而亮眼。[19]裡面的研究則是採取學習(learning)與規劃(planning)的方法,其中offline learning step是使用動態規劃在離散的表格時空空間做TD learning,而online planning step則是使用從learning step中學來的表格值(tabular value)透過解組合最佳化的方式來計算即時匹配。雖然[19]的研究已經成功地佈署在滴滴打車的生產系統中,而且比起基線(baseline)有著顯著的改進,不過這種表格型的方法的限制是顯而易見的,像是無法泛化至歷史訓練資料以外的地方,而且也無法即時的去回應不同的供應條件。[17]裡面的另一項研究採用基於Q-learning的深度強化學習方法,其中的深度神經網路用來估測司機的state-action value function。這個強化學習的agent是從單一個司機的角度來看的,做的就是根據歷史資料來最大化一整天的收入。為了提升強化學習的樣本複雜度,他還提出一種新穎的transfer learning method用在訂單分派上,可以用在跨多個城市的知識轉移(knowledge transfer)。[17]證明了深度神經網路在單一司機的長期收入最大化中的成功應用,不過這個方法的本質上是不足以用於那種需要多個agents之間協調的生產分派系統。 ::: :::info Our contribution in this paper is a deep neural network based approach for order dispatching that achieves a significant improvement in large-scale online A/B test results through DiDi’s ridesharing platform. The approach is proposed to maximize the long term total driver income for the multi-driver order dispatching environment. In particular, we build upon and extend the learning and planning approach in [19]. Unlike [17, 19] that formulate the dispatching problem as a standard MDP, we base our algorithmic framework on a Semi-Markov Decision Process (SMDP) formulation (e.g., see [3, 14]). In fact, we show that the order dispatching decision process naturally involves choices among the temporally extended courses of action over a broad range of time scales, e.g., passenger assignments trigger driver transitions that take multiple time steps to finish depending on the duration of the trips. The SMDP formulation accounts for such temporal effect of dispatching actions in a systematic way. We show that this leads to an update rule that coincides with a heuristic for handling temporal extended actions used in [19]. In the learning phase the key of our method is the evaluation of a value function that estimates the expected earning potential of a driver assuming that he or she follows the online policy (unknown) till the end of a day. The use of a neural network to parameterize the value function allows us to extend beyond simply the spatiotemporal status and incorporates contextual information that reflects the real time supply/demand conditions. This provides more accurate and contextually sensitive driver income estimations. This value network is used as the input to the next planning phase, where a combinatorial problem similar to [19] is solved to compute the actual dispatching decisions while resolving coordinations among multiple drivers and passengers. 
::: :::success 這篇論文中我們的貢獻是提出一種基於深度神經網路的訂單分派方法,這方法通過滴滴打車平台在大規模online A/B test上有著非常顯著的進步。提出的這個方法可以在多司機的訂單分派環境(multi-driver order dispatching environment)上最大化長期總司機收入。特別是,我們建立並擴展[19]裡面的learning與planning的方法。不同於[17, 19]把分派問題建構成一個標準的MDP,我們的演算法框架是基於Semi-Markov Decision Process(SMDP)([3, 14])。事實上,訂單分派的決策過程很自然地涉及在廣泛的時間範圍(time scale)內的時間延伸(temporally extended)方法中做選擇,舉例來說,乘客的分配觸發了司機的轉換,根據行程的持續時間,需要多個時步來完成。SMDP公式以系統性的方式說明這類分派行為(dispatching actions)的時間效應。這導致一個新的更新規則(update rule),這規則跟[19]裡面所採用的以啟發式方法處理時間延伸的行為相吻合。在學習階段中,我們方法的關鍵在於評估一個價值函數(value function),這個價值函數估測一個司機預期的可能收入(我們假設司機是依循著online policy直到一天結束)。採用神經網路把價值函數參數化讓我們能夠擴張簡單的空間狀態(spatiotemporal status),並結合反映即時供需狀況的上下文信息。這提供更準確且與上下文相關的司機收入的估測。這個價值網路(value network)做為下一個規劃(planning)階段的輸入,在該階段會解決掉類似於[19]的組合問題,以計算實際的分派決策,同時解決司機與乘客之間的協調問題。 ::: :::info The development of a contextual value network in the multidriver order dispatching environment poses several challenges that desire novel solutions in both the learning and planning phases. In fact, a straightforward substitution of function approximators for lookup tables in value iteration is not robust and may diverge, resulting in useless value estimates. This, of course, has been known for a long time and over the years numerous methods have been proposed to remedy the adverse effects. A common theme is to build certain ‘invariance’ either into the update process [2, 16] or into the function approximator itself, e.g., constrain the changes in the output across a broad region of the input domain [13, 20]. ::: :::success 在多司機訂單分派環境(multidriver order dispatching environment)中開發上下文相關的價值網路(contextual value network)構成一些挑戰,需要在學習(learning)與規劃(planning)階段都採用新穎的解決方案。事實上,在value iteration(值迭代)中直接拿function approximators來取代[查表法](http://terms.naer.edu.tw/detail/19127045/)並不是那麼魯棒性,而且可能會發散,進而導致無效的值估計(value estimates)。當然啦,這是眾所皆知的問題,而且多年來也提出多種方法來補救這種不好的影響。一個共通性就是在更新的過程[2, 16]或是在函數逼近器(function approximator)本身建構某些"[不變性](http://terms.naer.edu.tw/detail/3529608/)",像是在input domain(輸入[域](http://terms.naer.edu.tw/detail/2115051/)?)的一個寬廣的區域內限制輸出的變化[13, 20]。 ::: :::info In this paper, we introduce Cerebellar Value Networks (CVNet), which is based on a type of memory-based neural networks known as CMAC (Cerebellar Model Arithmetic Computer) [1]. CMAC uses multiple overlapping tilings/quantizations of the state space to produce feature representations. It is an associative memory with a built-in generalization and has been successfully applied to reinforcement learning problems as a sparse-coarse-coded function approximator which is constrained to be robust [13, 20]. CVNet extends CMAC to a ‘wide’ form, where tiles are associated with embedding vectors that are concatenated into the final feature representations. The representation capacity can thus be adjusted through the choices of embedding dimensions and the number of (overlapping) tiling/quantization functions. The use of CMAC ensures the built-in invariance against small perturbations as long as the current state is in the interior of all tiles. However, the value output can still suffer from abrupt changes if the perturbations result in a tile boundary crossing. Following recent works on defending against adversarial attack [4, 15] and improving the generalizability of neural networks [11], we derive the Lipschitz of the cerebellar embedding operator and formulate Lipschitz regularization for CVNet during training. Together we show that CVNet with a regularized Lipschitz constant constitutes a robust and stable value function approximator with a strong generalization. 
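:::

:::warning
**譯者補充(示意用,非論文內容)**:下面是譯者自己補的一個極簡示意(**不是**論文的實作;tiling 的切法、hashing、所有參數都是隨意假設的),單純把「多個重疊的 tiling 把同一個狀態對應到多個 tile、每個 tile 對應一個 embedding 向量、最後取平均」這個 CMAC / cerebellar embedding 的概念用 numpy 寫出來,方便對照後面 Section 3.1 的正式定義:

```python
import numpy as np

# 極簡的 CMAC / cerebellar embedding 示意(參數皆為譯者假設,非論文實作)
n_tilings = 4          # 重疊 tiling 的數量 n
actual_memory = 512    # actual memory 的大小 A(經 hashing 後的 tile 數)
embed_dim = 5          # 每個 tile 對應的 embedding 維度 m

rng = np.random.default_rng(0)
theta_M = rng.normal(size=(actual_memory, embed_dim))  # embedding matrix(訓練時會更新)

def tile_ids(x, y):
    """把連續的 (x, y) 狀態量化成 n_tilings 個 tile id(每個 tiling 偏移一點點以造成重疊)。"""
    ids = []
    for i in range(n_tilings):
        offset = i / n_tilings
        qx, qy = int(np.floor(x + offset)), int(np.floor(y + offset))
        ids.append(hash((i, qx, qy)) % actual_memory)   # 'hashing trick':壓進 actual memory
    return ids

def cerebellar_embed(x, y):
    """對被啟動的 tiles 的 embedding 取平均,得到最終的 dense 表示。"""
    rows = theta_M[tile_ids(x, y)]
    return rows.mean(axis=0)

print(cerebellar_embed(3.20, 7.10))
print(cerebellar_embed(3.25, 7.12))
```

相鄰的兩個點會共用大部分被啟動的 tiles,所以表示向量也會相近,概念上就對應文中說的 built-in generalization。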
::: :::success 在這篇論文中,我們引入Cerebellar Value Networks (CVNet),這是基於一種人稱CMAC (Cerebellar Model Arithmetic Computer)[1]的memory-based neural networks。CMAC使用狀態空間(state space)多個重疊(overlapping)的tilings(磚、瓦、切片)/[quantizations](http://terms.naer.edu.tw/detail/1254544/)(量化)來生成特徵的表示。這是一種具有內置泛化性的[連結記憶](http://terms.naer.edu.tw/detail/3271131/),也已經成功的應用到強化學習的問題上(做為一種sparse-coarse-coded function approximator),不過其魯棒性是受到限制的[13, 20]。CVNet是把CMAC擴展為"寬(wide)"形式,其tiles與串連至最終的特徵表示的嵌入向量(embedding vectors)相關聯,因此,我們可以透過選擇embedding的維度以及(重壘的,overlapping)tilings(磚、瓦、切片)/[quantizations](http://terms.naer.edu.tw/detail/1254544/)(量化)函數的數量來調整這個representation的能力。只要當前的狀態(current state)是在所有的tiles的內部,那CMAC的使用就可以確保對那些小小[擾動](http://terms.naer.edu.tw/detail/838760/)的不變性。不過,如果[擾動](http://terms.naer.edu.tw/detail/838760/)的部份導致了tile的邊界(boundary)有了交叉(crossing),那值的輸出就還是會受到突變的影響。繼近期於防禦對抗性攻擊[4, 15]與提高神經網路泛化性[11]的研究之後,我們推導出一種cerebellar(小腦) embedding(嵌入) operator(操作)的Lipschitz(一種約束式),而且還在CVNet訓練期間弄了一個Lipschitz regularization。這邊我們一起說明了一個具有正規化Lipschitz constant的CVNet所構成的一個魯棒性、穩定,並且有著絕佳泛化性的值函數近似器(value function approximator) ::: :::info Finally, we overcome multiple practical issues and test the method both in simulator built with real-world data and in the large-scale production system of DiDi, which serves tens of millions of passengers and drivers on a daily basis. In particular, to account for the temporal variance often found in real-time contextual features, during training we employ a data argumentation procedure called context randomization; to compute temporal difference in real time planning phase, we learn a separate CVNet with only the spatiotemporal inputs by knowledge distillation [5] from the original CVNet with the additional contextual inputs; and finally to facilitate learning across multiple cities, we adapt the transfer learning method proposed in [17] for single driver order dispatching and apply it to CVNet in a multi-driver environment. We obtain state-of-arts improvement on key metrics including Total Driver Income (TDI) and user experience related metrics both in extensive simulations and in real world online AB test environment. ::: :::success 最後,我們克服很多實務上會遇到的問題,並且在以實際資料建構起來的模擬器與滴滴的大型生產系統中皆測試這個方法(滴滴系統每天為數千萬的乘客與司機提供服務)。特別是,為了解釋在real-time contextual features(即時上下文特徵?)常看到的時間變化,在訓練期間,我們弄了一個稱為context randomization(上下文隨機化?)的資料增強程序;為了計算即時規劃階段的時差(temporal difference),我們利用知識蒸餾(knowledge distillation)[5]從原始的CVNet中(有著額外的上下文輸入)學習一個只有[時空](http://terms.naer.edu.tw/detail/3222914/)做為輸入的CVNet;最後,為了能夠簡化跨多個城市的學習,我們採用[17]所提出的transfer learning method用於單一個司機的訂單分派(single driver order dispatching),並將其應用於多司機環境下的CVNet。我們在大規模的模擬以及實際的online A/B testing的環境中,其關鍵的指標,包括總司機收入(TDI)與使用者經驗,都有著很棒棒的改進。 ::: :::warning * 知識蒸餾(knowledge distillation),值得再追下去是什麼。 ::: :::info In what follows, we describe our SMDP formulation in Section 2 and highlight the difference between a standard MDP. The Lipschitz regularized policy evaluation and CVNet structure are detailed in Section 3, along with the context randomization technique we use for feature learning that generalizes. Section 4 discuss how to embed this neural network into a combinatorial problem for policy improvement in the online multi-agent environment with thousands of drivers. In Section 5 we discuss the application of transfer learning in CVNet dispatching system. Experiment results are presented in Section 6. And finally Section 7 concludes the paper. 
::: :::success 在下文中,我們會在Section 2說明SMDP公式,調強跟標準MDP不同的地方。然後,Lipschitz regularized policy的評估與CVNet架構的細節,還有我們用來做泛化特徵學習的context randomization會在Section 3說明。Section 4會討論如何把一個神經網路嵌路到組合問題中,這用於有著上千司機的在線[多代理人](http://terms.naer.edu.tw/detail/18976900/)(online multi-agent)環境中。Section 5會討論在CVNet dispatching system裡面關於transfer learning的應用。實驗結果在Section 6說明。最後在Secton 7來個漂亮的總結。 ::: ## 2 A SEMI-MDP FORMULATION :::info We model the system as a Markov decision process endowed with a set of temporally extended actions. Such actions are also known as options and the corresponding decision problem is known as a semi-Markov decision process, or SMDP (e.g., see [14]). In this framework a driver interacts episodically with an environment at some discrete time scale, $t \in \tau := \left\{0,1,2,\cdots,T \right\}$ until the terminal time step $𝑇$ is reached. On each time step, $𝑡$, the driver perceives the state of the environment, described by the feature vector $s_t \in \mathcal{S}$, and on that basis chooses an option $o_t \in \mathcal{O}_{s_t}$ that terminates in $s_{t'}$ where $t'=t + k_{o_t}$. As a response, the environment produces a numerical reward $r_{t+i}$ for each intermediate step, e.g., $i=1,\cdots,k_{o_t}$. We denote the expected rewards of the option model by $r^o_{s_t}:=E\left\{r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k_{o_t} - 1} r_{t+k_{o_t}} \vert s_t=s, o_t=o \right\}$ where $1 \geq \gamma > 0$ is the discount factor for the future reward. In the context of order dispatching, we highlight the following specifics: ::: :::success 我們把系統建構成具有一組時間跨度(temporally extended)動作(action)的Markov decision process。這類的actions也稱為options,相對應的決策問題就是semi-Markov decision process,或是SMDP(參閱[14])。在這個框架中,司機會在某個離散的時間尺度上偶發性的與環境互動,$t \in \tau := \left\{0,1,2,\cdots,T \right\}$,一直到來到終止的時步(terminal time step)$T$。在每個time step,$t$,司機會接收到環境的狀態(state),以特徵向量$s_t \in \mathcal{S}$來表示,然後在這個基礎上選擇一個option $o_t \in \mathcal{O}_{s_t}$,這個option會終止於$s_{t'}$,其中$t'=t + k_{o_t}$。做為回應,環境會為每個中間的step產生一個數值型的報酬(reward) $r_{t+i}$,像是$i=1,\cdots,k_{o_t}$。我們用$r^o_{s_t}:=E\left\{r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k_{o_t} - 1} r_{t+k_{o_t}} \vert s_t=s, o_t=o \right\}$來表示這個option model的expected rewards,其中$1 \geq \gamma > 0$指的是對未來報酬的折扣因子(discount factor)。在訂單分派的部份,我們強調下面的細節: ::: :::info **State**, 𝑠𝑡 consists of the geographical status of the driver $l_t$ , the raw time stamp $\mu_t$ as well as the contextual feature vector given by $v_t$, i.e., $s_t:=(l_t, \mu_t, v_t)$. The raw time stamp $\mu_t$ reflects the time scale in the real world and is independent of the discrete time $𝑡$ that is defined for algorithmic purposes. We use $v_t$ to represent the contextual feature vector at location $l_t$ and time $\mu_t$ . We split contextual features into two categories, the dynamic features $v_{d_t}$ such as real-time characteristics of supplies and demands within the vicinity of the given spatiotemporal point, and the static features $v_{s_t}$ containing static properties such as dayofweek, driver service statics, holiday indicator, etc. When the discussion is focused on one particular time step we may ignore the subscript $𝑡$ and directly write $v_d$ and $v_s$. 
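:::

:::warning
**譯者補充(示意用,非論文內容)**:補一個很簡單的資料結構示意(欄位名稱與內容都是譯者假設的,不是論文或滴滴系統的定義),幫助對照 $s_t:=(l_t, \mu_t, v_t)$ 以及靜態特徵 $v_s$ / 動態特徵 $v_d$ 的切分:

```python
from dataclasses import dataclass, field
from typing import List

# 把論文中的 state s_t = (l_t, mu_t, v_t) 寫成一個簡單的資料結構(欄位皆為假設)
@dataclass
class DriverState:
    lat: float                 # l_t:司機目前的地理位置(緯度)
    lng: float                 # l_t:司機目前的地理位置(經度)
    timestamp: float           # mu_t:真實世界的時間戳記
    static_ctx: List[float] = field(default_factory=list)   # v_s:星期幾、假日指示、司機服務統計等
    dynamic_ctx: List[float] = field(default_factory=list)  # v_d:附近即時的供需統計等

s_t = DriverState(lat=31.23, lng=121.47, timestamp=1554105600.0,
                  static_ctx=[3.0, 0.0], dynamic_ctx=[12.0, 35.0])
print(s_t)
```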
:::

:::success
**State**,$s_t$包含司機的[地理](http://terms.naer.edu.tw/detail/867365/)狀態$l_t$,原始的[時間戳記](http://terms.naer.edu.tw/detail/1061316/) $\mu_t$,以及由$v_t$所給定的上下文特徵向量(contextual feature vector),即$s_t:=(l_t, \mu_t, v_t)$。原始的時間戳記$\mu_t$反映的是在真實世界中的時間尺度(time scale),這跟為了演算目的所做的離散時間$t$是無關的。我們用$v_t$來表示在位置$l_t$與時間$\mu_t$的上下文特徵向量。我們把上下文特徵拆分成兩個類別,動態特徵$v_{d_t}$,像是給定一個[時空](http://terms.naer.edu.tw/detail/3222914/)點附近供給與需求的[即時](http://terms.naer.edu.tw/detail/185640/)[特徵](http://terms.naer.edu.tw/detail/704997/),以及靜態特徵$v_{s_t}$包含靜態屬性像是星期幾啊、司機的服務狀態、holiday indicator(假日指示器?)…等等。當我們關注在某個特定的時步討論的時候,也許我們會忽略掉下標$t$,直接寫成$v_d$與$v_s$。
:::

:::info
**Option**, denoted as $o_t$ , represents the transition of the driver to a particular spatiotemporal status in the future, i.e., $o_t:=l_{t+k_t}$ where $k_t=0,1,2,\cdots$ is the duration of the transition which finishes once the driver reaches the destination. Executing option $o_t$ from state $s_t$ means starting the transition from origin $l_t$ to the destination specified by $o_t$ . This transition can happen due to either a trip assignment or an idle movement. In the first case the option results in a nonzero reward, while in the latter case an idle option leads to a zero-reward transition that terminates at the place where the next trip option is activated. Note that different $o_t$ takes different time steps to finish and the time extension is often larger than 1, e.g., $k_t > 1$, which is one of the main differences from standard MDP.
:::

:::success
**Option**,以$o_t$表示,表示司機在未來向特定的時空狀態(spatiotemporal status)的轉變,也就是$o_t:=l_{t+k_t}$,其中$k_t=0,1,2,\cdots$指的是司機在完成到達目的地的轉變的持續時間。從state $s_t$執行option $o_t$意味著從原始的$l_t$開啟這個轉變的過程,然後一路到$o_t$指定的目的地。這種轉變可能是因為一個行程的分配,或是閒置移動所引起的。如果是第一種情況(行程分配),那option就會有非零的報酬(nonzero reward)產生,如果是第二種(閒置移動)的話,那idle option就會產生zero-reward的轉變,這idle option會在下一個trip option被啟動(activated)的地方終止。注意到,不同的$o_t$需要不同的time steps(時步)來完成,時間的跨度(time extension)通常會大於1,像是$k_t > 1$,這是跟標準的MDP最大的不同之一。
:::

:::info
**Reward**, $R_t$ is the total fee collected from a trip with a driver transition from $s_t$ to $s_{t'}$ by executing option $o_t$ . $R_t$ is zero if the trip is generated from an idle movement. Conceptually $R_t$ can be considered as the sum of a sequence of immediate rewards received at each unit time step while executing the option $o_t$ , e.g., $\hat{R}_t=\sum_{i=1}^{k_t}r_{t+i}$. We use $\hat{R}_t$ to denote the discounted total reward over the duration of the option $o_t$ induced by the discount factor $\gamma$, e.g., $\hat{R}_t=r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k_{t}-1}r_{t+k_t}$.
:::

:::success
**Reward**,$R_t$是司機透過執行option $o_t$從$s_t$到$s_{t'}$這趟行程中所收取的總費用。如果行程是由閒置移動所生成的,那$R_t$就會是零。從概念上來說,$R_t$可以看成是在執行option $o_t$時,在每個單位時步(unit time step)所收到的一系列即時報酬的總和,也就是$\hat{R}_t=\sum_{i=1}^{k_t}r_{t+i}$。這邊我們用$\hat{R}_t$來做為option $o_t$期間由折扣因子$\gamma$所引起的折扣總報酬(discounted total reward),即$\hat{R}_t=r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k_{t}-1}r_{t+k_t}$。
:::

:::info
**Policy**, $\pi(o \vert s)$ specifies the probability of taking option $o$ in state $s$ regardless of the time step $t$. Executing $\pi$ in the environment generates a history of driver trajectories denoted as $\left\{\tau_i \right\}_{i \in \mathcal{H}}:=\left\{(s_{i_0}, o_{i_0}, r_{i_1}, s_{i_1}, o_{i_1}, r_{i_2}, \cdots, r_{iT_i}, s_{iT_i}) \right\}_{i \in \mathcal{H}}$ where $\mathcal{H}$ denotes the index set of the historical driver trajectories.
Associated with the policy $\pi$ is the state value function $V^\pi(s):=E\left\{\sum^T_{i=t+1} \gamma^{i-t-1} r_i \vert s_t = s \right\}$ which specifies the value of a state $s \in \mathcal{S}$ under the policy $\pi$ as the expected cumulative reward that the driver will gain starting from $s$ and following $\pi$ till the end of an episode.
:::

:::success
**Policy**,$\pi(o \vert s)$,指的是在state $s$選擇option $o$的機率,不考慮time step $t$。在環境中執行$\pi$會產生司機軌跡(trajectory)的歷史記錄,以$\left\{\tau_i \right\}_{i \in \mathcal{H}}:=\left\{(s_{i_0}, o_{i_0}, r_{i_1}, s_{i_1}, o_{i_1}, r_{i_2}, \cdots, r_{iT_i}, s_{iT_i}) \right\}_{i \in \mathcal{H}}$來表示,其中$\mathcal{H}$表示為歷史的司機軌跡的索引集。與policy $\pi$相關的是state value function $V^\pi(s):=E\left\{\sum^T_{i=t+1} \gamma^{i-t-1} r_i \vert s_t = s \right\}$,這指的是在使用policy $\pi$的情況下,state $s \in \mathcal{S}$的值做為司機從state $s$一直到episode結束的期望累積報酬(expected cumulative reward)。
:::

:::info
Given the above SMDP and the history trajectories $\mathcal{H}$, our goal is to estimate the value of the underlying policy. Similar to the standard MDP, we can write Bellman equations for general policies and options [14],

$$
\begin{align}
V^\pi(s) &= E \left\{r_{t+1} + \cdots + \gamma^{k_{o_t}-1}r_{t+k_{o_t}} \vert s_t=s \right\} \\
&= E \left\{r^o_{s_t} + \gamma^{k_{o_t}}V^\pi(s_{t+k_{o_t}}) \vert s_t = s \right\} \tag{1}
\end{align}
$$

where $k_{o_t}$ is the duration of the option selected by $\pi$ at time $t$ and $r^o_{s_t}$ is the corresponding accumulative discounted reward received through the course of the option.
:::

:::success
有鑒於上述SMDP與歷史軌跡$\mathcal{H}$的觀點,我們的目標就是去估測underlying policy的價值。跟標準的MDP類似,我們可以把一般的general policies與options[14]寫成Bellman equations(貝爾曼方程式),

$$
\begin{align}
V^\pi(s) &= E \left\{r_{t+1} + \cdots + \gamma^{k_{o_t}-1}r_{t+k_{o_t}} \vert s_t=s \right\} \\
&= E \left\{r^o_{s_t} + \gamma^{k_{o_t}}V^\pi(s_{t+k_{o_t}}) \vert s_t = s \right\} \tag{1}
\end{align}
$$

其中$k_{o_t}$指的就是在時間$t$透過policy $\pi$選擇的option的持續期間,而$r^o_{s_t}$則是option在過程中所得到相對應的累積折扣報酬。
:::

:::info
**Discussion**. The Bellman equations (1) can be used as update rules in dynamic-programming-like planning methods for finding the value function. The main divergence from the standard MDP transition is that the update rules need to reflect the fact that the temporal extension from state to state spans different time horizons. As an example, consider one transition from $s_t$ to $s_{t+k_t}$ resulting from executing option $o_t$ . We can update the value function in this case as follows,

$$
V^{k+1}(s_t) \leftarrow r_{t+1} + \cdots + \gamma^{k_{t}-1} r_{t+k_t} + \gamma^{k_t}V^k(s_{t+k_t})
$$

In the case of order dispatching, the total fee collected from the transition is $R_t$. Assuming that $R_t$ is spread uniformly across the trip duration, we can then compute the discounted accumulative reward $\hat{R}_t$ as

$$
\begin{align}
\hat{R}_t &= \dfrac{R_t}{k_t} + \gamma \dfrac{R_t}{k_t} + \cdots + \gamma^{k_t-1} \dfrac{R_t}{k_t}\\
&= \dfrac{R_t(\gamma^{k_t} - 1)}{k_t(\gamma - 1)}, \text{ where $0 < \gamma < 1$, $k \geq 1$}
\end{align}
$$

And the update rule for $V$ becomes

$$
V^{K+1}(s_t) \leftarrow \dfrac{R_t(\gamma^{k_t}-1)}{k_t(\gamma - 1)} + \gamma^{k_t}V^K(s_{t+k_t}) \tag{2}
$$

Note that compared to a standard MDP update rule without reward discount $R_t + \gamma^{k_t}V^K(s_{t+k_t})$, (2) acts effectively like a smooth version of reward clipping that is commonly used to improve performance in reinforcement learning [9].
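:::

:::warning
**譯者補充(示意用,非論文內容)**:補一個小小的數值例子(車資、時長、$\gamma$ 都是隨意假設的),驗證「把 $R_t$ 平均分攤到每個時步再逐步折扣」跟式 (2) 的封閉式 $\frac{R_t(\gamma^{k_t}-1)}{k_t(\gamma-1)}$ 是同一件事,然後做一次 value 更新:

```python
import numpy as np

gamma = 0.9   # 折扣因子(假設值)
R     = 30.0  # 整趟行程收到的總車資 R_t(假設值)
k     = 6     # 行程佔用的時步數 k_t(假設值)

# 假設 R_t 平均分攤在 k 個時步上,逐步折扣後加總
R_hat = sum((gamma ** i) * (R / k) for i in range(k))

# 等價的封閉式:R_t * (gamma^k - 1) / (k * (gamma - 1))
R_hat_closed = R * (gamma ** k - 1) / (k * (gamma - 1))
assert np.isclose(R_hat, R_hat_closed)

# 更新規則 (2):V(s_t) <- R_hat + gamma^k * V(s_{t+k})
V_next = 80.0                                  # 假設的 V(s_{t+k_t})
V_new  = R_hat_closed + (gamma ** k) * V_next
print(R_hat_closed, V_new)
```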
::: :::success **Discussion** Bellman equations (1)可以拿來做為dynamic-programming-like planning methods的更新規則,這可以用來尋找value function。與標準MDP轉移(transition)最主要的差異在於,更新規則需要反映一個事實,那就是從state到state的時間跨度(temporal extension)跨越不同的[時程](http://terms.naer.edu.tw/detail/578137/)(time horizons)。舉例來說,我們考慮一個因為執行option $o_t$所導致的從$s_t$到$s_{t+k_t}$的轉移。這種情況下,我們可以用下面的方式來更新value function, $$ V^{k+1}(s_t) \leftarrow r_{t+1} + \cdots + \gamma^{k_{t}-1} r_{t+k_t} + \gamma^{k_t}V^k(s_{t+k_t}) $$ 在訂單分派的情況下,這個轉移收到的總費用為$R_t$。假設,$R_t$在整個行程的期間是均勻分佈的,那我們就可以把折扣累積報酬$\hat{R}_t$像下面這樣計算為: $$ \begin{align} \hat{R}_t &= \dfrac{R_t}{k_t} + \gamma \dfrac{R_t}{k_t} + \cdots + \gamma^{k_t-1} \dfrac{R_t}{k_t}\\ &= \dfrac{R_t(\gamma^{k_t} - 1)}{k_t(\gamma - 1)}, \text{ where $0 < \gamma < 1$, $k \geq 1$} \end{align} $$ 這樣,這個$V$的更新規則就變成 $$ V^{K+1}(s_t) \leftarrow \dfrac{R_t(\gamma^{k_t}-1)}{k_t(\gamma - 1)} + \gamma^{k_t}V^K(s_{t+k_t}) \tag{2} $$ 注意,對比沒有獎勵折扣$R_t + \gamma^{k_t}V^K(s_{t+k_t})$的標準MDP的更新規則,上述公式(2)的作用類似於一種reward clipping(報酬剪裁)的smooth version(平滑版本),通常是用在提高強化學習的效能的[9]。 ::: ## 3 DISPATCHING POLICY EVALUATION WITH NEURAL NETWORKS :::info We assume the online dispatching policy 𝜋 is unknown and the goal is to evaluate the value of the policy from the given historical trajectories data. We use a neural network to approximate this value function based on the historical trajectories. The network structure is illustrated in Figure 1. Later we will discuss how to embed this neural network into a combinatorial problem for policy improvement in the online multi-agent environment with thousands of drivers. ::: :::success 我們假設online dispatching policy $\pi$是未知的,然後目標是從給定的歷史的軌跡資料中去評估策略(policy)的價值。我們使用神經網路根據歷史軌跡資料來近似這個value function。網路的架構如Figure 1所說明。後面我們會來討論一下怎麼把這個神經網路塞進去一個組合問題中,用這樣的方式在一個有著上千司機的online multi-agent的環境中做策略改進(policy improvement)。 ::: :::info ![](https://hackmd.io/_uploads/B10ZQ92uc.png) Figure 1: Feature marginalization and network structure of the Lipschitz-regularized dispatching value function $V$ and its distilled network $\tilde{V}$. Figure 1: Lipschitz-regularized dispatching value function $V$與其distilled network(蒸餾網路?) $\tilde{V}$的特徵[邊緣化](http://terms.naer.edu.tw/detail/66972/)與網路架構。 ::: ### 3.1 Cerebellar Embedding :::info Learning a good state representation is usually the key step to solving a practical problem with neural networks. It is even more so for a large scale problem like order dispatching which requires the parse of complicated state information as the basis for longterm reasoning in a ever changing environment. Here we propose a method called cerebellar embedding that combines CMAC with embedding to obtain a distributed state representation [6] that is generalizable, extensible and robust. One way to view a CMAC is to consider a sparse, coarse-coded function approximator which uses multiple overlapping tilings of the state space to obtain a feature representation. Each input point to the CMAC activates as many tiles as the number of tilings used. The total number of tiles is referred to as the size of the conceptual memory. The mapping from input points to tiles is done such that points close together in the input space have considerable overlap between their set of activated tiles. Each tile in the conceptual memory is associated with a weight in the actual memory which is iteratively updated through training. And the output of CMAC is computed as the sum of the weights of the activated tiles. Note that the size of the actual memory does not need to match that of the conceptual memory. 
In fact, the so-called ‘hashing trick’ [13] is often employed to reduce the memory requirements – a consistent random collapsing of a large set of tiles into a smaller one. ::: :::success 學習一個好的狀態表示通常是解決神經網路問題的關鍵步驟。對於像訂單分派這種大型問題更是如此,這種大型問題需要解析複雜的狀態信息(state information)來做為在不斷變化的環境中做長期推論的基礎。這邊我們要來提出一種稱為cerebellar embedding(小腦嵌入)的方法,結合CMAC與embedding,用這樣的方式來獲得可泛化、可擴展而且魯棒棒的分佈式狀態表示(distributed state representation)[6]。觀察CMAC的一種方法就是考慮稀疏的coarse-coded(粗編碼?) function approximator,這方法使用狀態空間的多個重疊(overlapping)的tillings([磚式](http://terms.naer.edu.tw/detail/18708900/))來獲得特徵表示。每個輸入到CMAC的輸入點啟動的tiles([瓦片](http://terms.naer.edu.tw/detail/18708892/))數量與使用的tillings([磚式](http://terms.naer.edu.tw/detail/18708900/))數量一致。tiles([瓦片](http://terms.naer.edu.tw/detail/18708892/))的總數被稱為conceptual memory的大小。這樣就完成輸入點到tiles([瓦片](http://terms.naer.edu.tw/detail/18708892/))的映射,就可以讓在輸入空間中接近的點可以在其啟動的tiles([瓦片](http://terms.naer.edu.tw/detail/18708892/))集合中有相當大的重疊。conceptual memory中的每個tiles都與actual memory中的權重相關聯,這個權重是透過訓練迭代來更新的。然後,CMAC的輸出則是被計算為activated tiles的權重和。注意一點,actual memory的大小並不需要跟conceptual memory的大小相匹配。事實上,所謂的"hashing trick"[13]通常是拿來降低記憶體需求的,把一組比較大的tiles一致隨機的折疊成一個比較小的tiles。 ::: :::info The cerebellar embedding extends CMACs by using an embedding matrix as the actual memory and implements the mapping using a sparse representation. In particular, the cerebellar embedding defines multiple quantization (or tiling) functions $\left\{q_1, \cdots,q_n \right\}$. Each function maps the continuous input to a unique string id indicating one discretized region of the state space such that $q_i(s) \neq q_j(s), \forall{s}, i\neq j$. The set of activated tiles for a given input $s$ is given by $\left\{q_i(s) \right\}^n_{i=1}$ , and the set of all such strings constitutes the conceptual memory. The size of the actual memory is denoted as $A$ which does not have to equal to the size of the conceptual memory. Let $g(\cdot)$ denote a mapping function from the conceptual memory to the range $0,1,\cdots,A-1$. The perfect mapping is when no conflict occurs, e.g., $g(q_i(s)) \neq g(q_j(s)), \forall i \neq j$. Under the given set of quantization functions, we obtain the activation vector, denoted as $c(s) \in \mathbb{R}^A$, by iteratively adding 1 to the $g(q_i(s))$-th entry of $c(s)$ (initialized to 0) for each $q_i$ , e.g., $c_{g(q_i)}(s)+1, \forall{i}$. Hence $c(s)$ contains at most $n$ non-zero entries (exactly $n$ when it is perfect mappings) and is a sparse vector since $n \ll A$. ::: :::success cerebellar embedding透過使用embedding matrix做為actual memory來擴展CMACs,然後使用稀疏表示(sparse representataion)來實現映射。特別是,cerebellar embedding定義了多個quantization(或是tiling) functions $\left\{q_1, \cdots,q_n \right\}$。每個function都會把[連續輸入](http://terms.naer.edu.tw/detail/3381333/)映射到一個唯一的字串id,這id指向狀態空間中的一個離散區域,因此$q_i(s) \neq q_j(s), \forall{s}, i\neq j$。給定輸入$s$的activated tiles這個集合是由$\left\{q_i(s) \right\}^n_{i=1}$所給出,所有這類的字串就會構成conceptual memory。actual memory的大小以$A$來表示,這大小並不會跟conceptual memory相等。假設$g(\cdot)$代表從conceptual memory映射到$0,1,\cdots,A-1$的[映射函數](http://terms.naer.edu.tw/detail/2119347/)(mapping function)。沒有衝突發生就是一種完美的映射(perfect mapping),也就是$g(q_i(s)) \neq g(q_j(s)), \forall i \neq j$。在給定quantization functions集合的情況下,透過迭代地把1加到$c(s)$的第$g(q_i(s))$項的方式(初始為0)(對每個$q_i$),我們會獲得activation vector,以$c(s) \in \mathbb{R}^A$表示,也就是$c_{g(q_i)}(s)+1, \forall{i}$。因此,$c(s)$包含最多$n$個non-zero的項目(如果它是完美映射的話,那就會是$n$),而且會是一個稀疏向量,因為$n \ll A$。 ::: :::info Finally, we initiate a random embedding matrix $\theta^M \in \mathbb{R}^{A \times m}$ as the actual memory. 
Each tile in the conceptual memory is associated with a row in $\theta^M$ which is a dense $m$-dimensional vector. The sparse activation vector $c(s)$ is multiplied by the embedding matrix, yielding the final dense representation of the input point $x$, i.e., $\dfrac{c(s)^T\theta^M}{n}$ where $n$ is the number of used quantization functions and the embedding matrix $\theta^M$ is iteratively updated during training. Note that the dot product $c(s)^T\theta^M$ grows linearly in magnitude with respect to the number of tilings so we scale it by $\dfrac{1}{n}$ to prevent diminishing gradients.
:::

:::success
最後,我們初始化一個隨機的嵌入矩陣(random embedding matrix) $\theta^M \in \mathbb{R}^{A \times m}$來做為actual memory。conceptual memory中的每個tile都跟$\theta^M$裡面的一個row相關聯,也因此它是一個$m$維的向量。稀疏啟動向量(sparse activation vector)$c(s)$跟embedding matrix相乘,然後得到輸入點(input point)$x$的最終密集表示(dense representation),也就是$\dfrac{c(s)^T\theta^M}{n}$,其中$n$為所使用的quantization functions的數量,然後embedding matrix $\theta^M$則是在訓練期間會迭代地更新。注意一點,$c(s)^T\theta^M$這個內積的[規模](http://terms.naer.edu.tw/detail/439069/)會以相對於tilings數量呈現線性增長,因此我們會利用$\dfrac{1}{n}$來縮放它,避免梯度[遞減](http://terms.naer.edu.tw/detail/19087309/)。
:::

:::info
3.1.1 Hierarchical Coarse-coding in the location space. To quantize the geographical space, we use a hierarchical hexagon tiling system (illustrated in Figure 2). Using a hexagon as the tile shape is beneficial since hexagons have only one distance between a hexagon centerpoint and its neighbors. The hexagon tiling system we use supports multiple resolutions, with each finer resolution having tiles with one seventh the area of the coarser resolution. Having such hierarchical quantization with different resolutions enables the information aggregation (and, in turn, the learning) to happen at different abstraction levels automatically adaptive to the nature of the geographical district, e.g., downtown, suburbs, community parks, etc.
:::

:::success
位置空間中的Hierarchical Coarse-coding([階層式](http://terms.naer.edu.tw/detail/18640989/)粗略編碼?)。為了能夠量化地理空間,我們使用一個hierarchical hexagon tiling system(階層式六邊形磚式系統?)(如Figure 2所說明)。使用六邊形來做為tile(瓦片)是有好處的,因為六邊形的中心點跟其它相鄰點之間只有一個距離(one distance)。我們用的這種hexagon tiling system支援多種解析度,每個較為精細的解析度的面積都是解析度較粗糙的tiles的七分之一。這種不同解析度有著階層式的量化能夠讓信息聚合(進而能夠學習)在不同的抽象級別上自動地適應地理區域的性質,像是市中心、郊區、社區公園等。
:::

:::info
![](https://hackmd.io/_uploads/SkWdH10dc.png)
Figure 2: Coarse Coding with Hierarchical Hexagon Grid. The geo point (red) activates two grid cells (orange and blue). The final representation is the average of the two grid cells’ embedding vectors.
Figure 2:具有階層式六角網格的粗略編碼。地理點(紅色)啟動兩個網格單元(grid cells)(橘與藍)。最終的表示是兩個網格單元的嵌入向量的平均值。
:::

### 3.2 Robustness in Value Network
:::info
Enforcing a robust state value dynamic with respect to the spatiotemporal status of the driver is critical in a production dispatching system. Dramatic changes or irregular value estimations will be further augmented due to either long chain of downstream tasks or simply large scale of the inputs, which can cause instability and abnormal behavior at the system level. To obtain robustness against perturbations, mathematically we would like the output of the value function to be bounded, with respect to the $p-$norm of interest $\Vert \cdot \Vert_p$, by its input state for all state in $\mathcal{S}$, e.g.,

$$
\Vert V(s_1) - V(s_2) \Vert_p \leq L_p \Vert s_1 - s_2 \Vert_p, \forall{s_1, s_2} \in \mathcal{S} \tag{3}
$$

Here the value of $L_p$, known as the Lipschitz constant, represents the worst case variation of $V$ with respect to a change in its input $s$. In this case we would like to regularize $L_p$ during training for a robust value function.
:::

:::success
在生產分派系統中,針對司機的時空狀態(spatiotemporal status)去維持一個魯棒的state value dynamic(狀態價值動態?)是非常重要的。劇烈的變化或是不正常的價值估測,會因為下游任務的[長鏈](http://terms.naer.edu.tw/detail/451658/)(long chain)或單純是輸入的規模很大而被進一步放大,這是有可能會導致系統面的不穩定與異常行為。為了能夠獲得對這種擾動有更好的魯棒性,就數學上來說,我們會希望value function的輸出相對於感興趣的$p-$norm,$\Vert \cdot \Vert_p$,可以被其輸入狀態給限制住(bounded)(對$\mathcal{S}$中所有狀態都成立),也就是:

$$
\Vert V(s_1) - V(s_2) \Vert_p \leq L_p \Vert s_1 - s_2 \Vert_p, \forall{s_1, s_2} \in \mathcal{S} \tag{3}
$$

上面這個$L_p$的值就是Lipschitz constant,表示$V$相對於其輸入$s$變化的最壞情況變化。這種情況下,我們會希望在訓練期間能夠對$L_p$做個正規化,希望可以得到一個魯棒性棒棒的value function。
:::

:::info
An upper bound for the Lipschitz constant of a neural network can be computed as the product of the Lipschitz constant of each individual layer of the network. This is easy to show once we notice that neural networks can be expressed as a series of function compositions, e.g., $V(s) = (v_h \circ v_{h-1} \circ \cdots \circ v_1)(s).$

$$
L(V) \leq \prod^h_{i=1} L(v_i) \tag{4}
$$

Hence to control the neural network’s global Lipschitz constant it is sufficient to regularize the Lipschitz for each individual layer. The value network that we use, as depicted in Figure 1, consists of both the cerebellar embedding layer and the multilayer perceptron. We now give the Lipschitz constants of these two layers as a function of their parameters.
:::

:::success
對於神經網路的Lipschitz constant的[上界](http://terms.naer.edu.tw/detail/2127189/)(upper bound)可以計算成是網路的每個各別的層(individual layer)的Lipschitz constant的[乘積](http://terms.naer.edu.tw/detail/3190088/)。證明這個太簡單了,只要我們注意到,其實神經網路可以表示為一系列的函數組合,也就是$V(s) = (v_h \circ v_{h-1} \circ \cdots \circ v_1)(s).$

$$
L(V) \leq \prod^h_{i=1} L(v_i) \tag{4}
$$

因此,要去控制神經網路的global Lipschitz constant,就只需要去對每個individual layer的Lipschitz做正規化就好了。我們所使用的value network(如Figure 1所示)是由cerebellar embedding layer與multilayer perceptron所組成。我們現在把這兩層的Lipschitz constants視為其參數的函數。
:::

:::info
**Multilayer Perceptron**: Assume one linear layer followed by a ReLU activation. The Lipschitz of the ReLU operation is bounded by 1, e.g., $L_p^{relu}=1$, since the maximum absolute subgradient of ReLU is 1. For the linear layer, assuming it is parameterized by a weight matrix $\theta^l$ and a bias vector $b^l$ , we can derive its Lipschitz constant as follows,

$$
\begin{align}
& \Vert \theta^l s_1 + b^l - (\theta^l s_2 + b^l) \Vert_p \leq L^l_p \Vert s_1 - s_2 \Vert_p \\
& \Rightarrow L^l_p \geq \dfrac{\Vert \theta^l(s_1 - s_2) \Vert_p}{\Vert s_1 - s_2 \Vert_p} \\
& L^l_p = \sup_{s \neq 0} \dfrac{\Vert \theta^l s \Vert_p}{\Vert s \Vert_p}, s=s_1-s_2
\end{align}
$$

which is the operator norm of weight matrix $\theta^l$ . When $p=1$ the Lipschitz constant of the linear layer $L^l_p$ is given by the maximum absolute column sum of the weight matrix; when $p=\infty$ it is the maximum absolute row sum and when $p=2$ it is the spectral norm of $\theta^l$ which can be approximated using the power method.
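:::

:::warning
**譯者補充(示意用,非論文內容)**:補一段小程式(矩陣、維度、迭代次數皆為假設),分別算出線性層權重矩陣在 $p=1$、$p=\infty$、$p=2$ 下的 operator norm(也就是對應的 Lipschitz constant),其中 $p=2$ 用 power method 近似,並跟 numpy 內建的結果對照:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))   # 假設的權重矩陣 theta^l(輸出 64 維、輸入 32 維)

L_1   = np.abs(W).sum(axis=0).max()   # p=1:最大絕對 column sum
L_inf = np.abs(W).sum(axis=1).max()   # p=inf:最大絕對 row sum

# p=2:spectral norm,用 power method 近似最大奇異值
v = rng.normal(size=W.shape[1])
for _ in range(100):
    v = W.T @ (W @ v)
    v /= np.linalg.norm(v)
L_2 = np.linalg.norm(W @ v)

# 跟 numpy 內建的 induced norm 對照
assert np.isclose(L_1,   np.linalg.norm(W, 1))
assert np.isclose(L_inf, np.linalg.norm(W, np.inf))
assert np.isclose(L_2,   np.linalg.norm(W, 2), rtol=1e-3)
print(L_1, L_inf, L_2)
```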
:::

:::success
**Multilayer Perceptron**:假設一個線性層(linear layer)的後面是跟著一個ReLU的啟動函數。ReLU這個[運算](http://terms.naer.edu.tw/detail/2120972/)的Lipschitz就會以1為界限,也就是$L_p^{relu}=1$,因為ReLU的最大絕對[劣梯度](http://terms.naer.edu.tw/detail/2125609/)(subgradient)就是1。對線性層來說,假設它是由權重矩陣$\theta^l$與偏差向量$b^l$所參數化,那我們可以推導出它的Lipschitz constant就會是:

$$
\begin{align}
& \Vert \theta^l s_1 + b^l - (\theta^l s_2 + b^l) \Vert_p \leq L^l_p \Vert s_1 - s_2 \Vert_p \\
& \Rightarrow L^l_p \geq \dfrac{\Vert \theta^l(s_1 - s_2) \Vert_p}{\Vert s_1 - s_2 \Vert_p} \\
& L^l_p = \sup_{s \neq 0} \dfrac{\Vert \theta^l s \Vert_p}{\Vert s \Vert_p}, s=s_1-s_2
\end{align}
$$

這就是權重矩陣$\theta^l$的[算符](http://terms.naer.edu.tw/detail/2120981/)範數(operator norm)。當$p=1$的時候,那線性層(linear layer)$L^l_p$的Lipschitz constant就會由權重矩陣絕對值最大的那個行加總(column sum)所給出;當$p=\infty$的時候,就會是絕對值最大的那個列加總(row sum),當$p=2$的時候,就會是權重矩陣$\theta^l$的[譜範數](http://terms.naer.edu.tw/detail/2125058/)(spectral norm),這是可以利用[冪方法](http://terms.naer.edu.tw/detail/2122132/)(power method)來近似的。
:::

:::info
**Cerebellar Embedding**: Recall that in Section 3.1 the embedding process can be expressed as a sparse dot product $\dfrac{c(s)^T\theta^M}{n}$ where $c(s)$ is a sparse vector with at most $n$ non-zero entries. Since this operation is linear in $c(s)$, the Lipschitz can be computed similarly as that of the linear layer. In this case it is the operator norm of the transpose of the embedding matrix $\theta^M$. Note that because quantizations are used, there will be a sudden change in the output value at the boundary of the quantization. This will not be an issue in practice as long as the scale of the change is controlled. That is if we regularize the operator norm of $\theta^M$. In fact, note that the vector $c(s_1) - c(s_2)$ can have at most $2n$ non-zero entries for any $s_1, s_2$, e.g., when $s_1$ and $s_2$ have no overlap in the conceptual memory. Hence the output of the cerebellar embedding layer is bounded as follows,

$$
\begin{align}
& \Vert c(x_1)^T\theta^M - c(x_2)^T\theta^M \Vert_p / n \\
&= \Vert (c(x_1) - c(x_2))^T\theta^M \Vert_p /n \leq 2 \max_i \Vert \theta^M_i \Vert_p
\end{align}
$$

where $\theta^M_i$ is the $i$th row of $\theta^M$. When $p=1$, for example, $\max_i \Vert \theta^M_i \Vert_1 = \Vert \theta^M \Vert_{\infty}$ which is the infinity norm of the matrix $\theta^M$.
:::

:::success
**Cerebellar Embedding**:回想一下Section 3.1裡面說的,嵌入的過程可以表示為一種稀疏的點積(dot product) $\dfrac{c(s)^T\theta^M}{n}$,其中$c(s)$是一個稀疏向量,最多$n$個非零的元素(non-zero entries)。由於這個運算在$c(s)$裡面是線性的,也因此,計算Lipschitz就跟線性層的計算差不多。這種情況下,它就會是嵌入矩陣$\theta^M$轉置的算符範數(operator norm)。這邊也要注意一下,因為使用了[量化](http://terms.naer.edu.tw/detail/2122792/)(quantizations),所以在量化的邊界處的輸出值就會有著劇烈的變化。不過只要控制得宜,這就不會是什麼鳥不起的問題。也就是說,如果我們把$\theta^M$的算符範數(operator norm)給正規化了。事實上,注意看一下向量$c(s_1) - c(s_2)$最多會有$2n$個非零的元素(對於任意的$s_1, s_2$),舉例來說,當$s_1$與$s_2$在conceptual memory中沒有重疊的時候。因此,cerebellar embedding layer的輸出就會被如下公式給bounded:

$$
\begin{align}
& \Vert c(x_1)^T\theta^M - c(x_2)^T\theta^M \Vert_p / n \\
&= \Vert (c(x_1) - c(x_2))^T\theta^M \Vert_p /n \leq 2 \max_i \Vert \theta^M_i \Vert_p
\end{align}
$$

其中$\theta^M_i$是$\theta^M$的第$i$個row。舉例來說,當$p=1$的時候,那$\max_i \Vert \theta^M_i \Vert_1 = \Vert \theta^M \Vert_{\infty}$就會是矩陣$\theta^M$的無窮範數(infinity norm)。
:::

### 3.3 Policy Evaluation
:::info
Given the semi-MDP defined in Section 2, we want to solve for the value function under the unknown dispatching policy $\pi$.
We collect the historical driver trajectories and divide it into a set of tuples with each representing one driver transition spending $k$ time steps from $s$ to $s'$ during which the driver receives a total trip fee $R$, i.e., $(s, R, s')$. Training follows the Double-DQN structure [16] for better training stability. The main value network is denoted as $V^\pi(s\vert\theta)$ where $\theta$ representing all trainable weights in the neural network, and a target $V-$network $\hat{V}^\pi(s\vert \hat{\theta})$, maintained and synchronized periodically with the main network $V^\pi(s\vert\theta)$, is used to evaluate the update rule as given in (2). This update is converted into a loss to be minimized $L(\theta)$, most commonly the squared loss. Following the discussions in Section 3.2, we add a penalty term $\mathcal{R}(\theta)$ on global Lipschitz constant to the loss and introduce a penalty parameter $\lambda > 0$, $$ \begin{align} & \min_\theta \mathcal{L}(\theta) + \lambda \cdot \mathcal{R}(\theta) := \\ & \dfrac{1}{2} \left\{V^\pi(s\vert \theta) - (\dfrac{R(\gamma^k - 1)}{k(\gamma - 1)} + \gamma^k \hat{V}^\pi(s'\vert\hat{\theta})) \right\}^2 + \lambda \cdot \sum_{i=1}^h L(v_i) \tag{5} \end{align} $$ ::: :::success 來看Section 2的semi-MDP,我們想要解的是在未知的分派策略$\pi$下的那個價值函數(value function)。我們收集了歷史司機軌跡記錄,然後將之劃分為一組元組(a set of tuple),每一個tuple都表示一個司機花費$k$個時步從$s$到$s'$這個期間的轉移,這段期間司機會收到一份總的行程費用$R$,也就是$(s, R, s')$。訓練過程則是依著Double-DQN的架構[16],這可以得到較佳的訓練穩定度。主要的value network(價值網路)以$V^\pi(s\vert\theta)$來表示,其中$\theta$表示神經網路中所有可訓練的權重,然後target $V-$network $\hat{V}^\pi(s\vert \hat{\theta})$則是定期地維護、同步主要的網路$V^\pi(s\vert\theta)$,然後用公式2所給出的更新規則來評估(evaluate)。這個更新就是要轉換成把損失(loss)$L(\theta)$最小化,最常見的就是squared loss。根據Section 3.2的討論,我們在global Lipschitz constant加了一個懲罰項$\mathcal{R}(\theta)$,然後引入一個懲罰參數$\lambda > 0$: $$ \begin{align} & \min_\theta \mathcal{L}(\theta) + \lambda \cdot \mathcal{R}(\theta) := \\ & \dfrac{1}{2} \left\{V^\pi(s\vert \theta) - (\dfrac{R(\gamma^k - 1)}{k(\gamma - 1)} + \gamma^k \hat{V}^\pi(s'\vert\hat{\theta})) \right\}^2 + \lambda \cdot \sum_{i=1}^h L(v_i) \tag{5} \end{align} $$ ::: :::info **Context Randomization**: During training we augment each historical driver trajectory with contextual features $\left\{v_i \right\}$ extracted from the production logging system. Contextual features, especially real-time supply/demand statistics, often come with high variance, e.g., it is common to notice a $\pm 30$ minutes shift of the rush hour peak. Another issue is the scheduling bias in the logging system, e.g., logging triggered every 5 minutes, which can cause too many failed feature associations when matching using the exact value of the spatiotemporal states. To account for those bias in the training and to build temporal invariance into the system, we use context randomization (6) in the augmentation process. That is, instead of matching with the exact spatiotemporal status, we implement a procedure called hierarchical range query $\Upsilon(\cdot)$, which allows the specification of a range for the given query and returns a set of contextual features within that range, i.e., $\Upsilon(l,\mu,rg) \subseteq \left\{v_i \right\}$ where $rg$ specify the query range for time $\mu$ such that all contextual features within $[\mu - rg, \mu + rg]$ are returned. 
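:::

:::warning
**譯者補充(示意用,非論文原始碼)**:把式 (5) 的訓練流程(squared TD loss + $\lambda\cdot$Lipschitz penalty + target network)用 PyTorch 串成一個極簡的示意。注意這裡的網路只是一個普通 MLP 而不是完整的 CVNet,Lipschitz penalty 也只是用每個線性層 spectral norm 的總和當作 $\mathcal{R}(\theta)$ 的一個 surrogate,資料、維度、超參數全部是隨意假設的:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net(in_dim=8):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))

V     = make_net()                    # 主網路 V(s|theta)
V_tgt = make_net()                    # target 網路(定期與主網路同步)
V_tgt.load_state_dict(V.state_dict())
opt   = torch.optim.Adam(V.parameters(), lr=1e-3)

gamma, lam = 0.9, 0.1                 # 折扣因子與懲罰參數 lambda(假設值)

def lipschitz_penalty(net):
    # 各線性層 spectral norm 的總和,當作全域 Lipschitz 上界的 surrogate
    return sum(torch.linalg.matrix_norm(m.weight, ord=2)
               for m in net if isinstance(m, nn.Linear))

# 一個 batch 的 (s, R, k, s'):數值皆為隨機假設
s, s_next = torch.randn(16, 8), torch.randn(16, 8)
R, k = torch.rand(16, 1) * 30.0, torch.randint(1, 10, (16, 1)).float()

R_hat = R * (gamma ** k - 1) / (k * (gamma - 1))       # 式 (2) 的折扣累積報酬
with torch.no_grad():
    target = R_hat + (gamma ** k) * V_tgt(s_next)      # target 網路給出的 TD target

loss = 0.5 * ((V(s) - target) ** 2).mean() + lam * lipschitz_penalty(V)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```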
::: :::success **Context Randomization**:訓練過程中,我們會從生產日誌系統中去提取contextual features $\left\{v_i \right\}$(上下文特徵?),用這個來增強(augment)每個歷史的司機軌跡(historical driver trajectory)。contextual features,特別是即時的供需統計,通常會有很高的方差,舉例來說,我們經常會注意到高峰時段會有$\pm 30$分鐘的偏移(shift)。另一個問題就是日誌系統中的排程偏差(scheduling bias),舉例來說,5分鐘觸發一次日誌,當我們用這個時空狀態(spatiotemporal states)的[精確值](http://terms.naer.edu.tw/detail/18507680/)來匹配的時候,就會導致過多的失敗的特徵關聯。為了要解決這些訓練中的偏差,然後在系統建立時間不變性(temporal invariance),我們在增強過程中使用context randomization (6)。也就是說,我們不跟確切的時間狀態做匹配,我們做的是實現一個稱之為hierarchical range query $\Upsilon(\cdot)$(階層式範圍查詢?)的過程,這過程允許我們去指定查詢範圍,然後回傳這個範圍的contextual features,$\Upsilon(l,\mu,rg) \subseteq \left\{v_i \right\}$,其中$rg$就是對時間$\mu$的指定範圍,這樣就會回傳$[\mu - rg, \mu + rg]$內的所有contextual features。 ::: ## 4 PLANNING WITH MULTI-DRIVER DISPATCHING :::info The production environment is intrinsically multi-agent with multiple drivers fulfilling passengers orders at the same time. A matching problem [19] is usually formulated at this stage to optimally assign the orders collected within a dispatching window to a set of drivers, while also avoiding assignment conflicts such as matching one order with multiple drivers. A utility score $p_{ij}$ is used to indicate the value of matching each driver $i$ and order $j$ pair, and the objective of the matching problem is to maximize the total utilities of the assignments $\arg\max_{x\in C} \sum^m_{i=1}\sum_{j=1}^n p_{ij}x_{ij}$ where $\left\{x_{ij} \right\}$ are binary decision variables subject to a set of constraints C to ensure the feasibility of the final assignment solution, e.g., each order is at most assigned to one driver, etc. This problem can be solved by standard matching algorithms, such as the Hungarian Method (a.k.a. KM algorithm). ::: :::success 生產環境本質上是多代理(multi-agent),有多個司機在同一時間滿足乘客訂單。在這個階段通常都制定一個匹配問題[19],最佳化地將分派窗格(dispatching window)內所收集到的訂單分配給一組司機,同時要避免分配衝突,像是把一張訂單分配給多個司機。[效用得分](http://terms.naer.edu.tw/detail/19046870/)(utility score)$p_{ij}$是用來表明每個司機$i$與訂單$j$匹配出來的價值,匹配問題的目標就是最大化分配的總效用,也就是$\arg\max_{x\in C} \sum^m_{i=1}\sum_{j=1}^n p_{ij}x_{ij}$,其中$\left\{x_{ij} \right\}$是一組constraints $C$所影響的[二元決策](http://terms.naer.edu.tw/detail/18597758/)變數,這用來確保最終分配解決方案的可行性,像是,每個訂單最多分配給一位司機之類的。這問題可以用標準的匹配演算法來處理,像是Hungarian Method(匈牙利演算法,又名KM algorithm)。 ::: :::info Similar to the work in [19], we use the Temporal Difference error between order’s destination state $s_j$ and driver’s current state $s_i$ as the utility score $p_{ij}$ . Given the policy value function $V(s)$ as described above, this could be computed as below, $$ \rho_{ij} = R_{ij}\dfrac{(\gamma^{k_{ij}} - 1)}{k_{ij}(\gamma - 1)} + \gamma^{k_{ij}}V(s_j) - V(s_i) + \Omega \cdot U_{ij} \tag{7} $$ where $R_{ij}$ is the trip fee collected after the driver $i$ deliver order $j$; $k_{ij}$ is the time duration of the trip and $\gamma$ is the discount factor to account for the future uncertainty. Aside from the long term driver income captured in the first part of (7), we also add an additional term $\Omega \cdot U_{ij}, \Omega \geq 0$ where $U_{ij}$ characterizes the user experience from both the driver $i$ and the passenger $j$ so that we optimize not only the driver income but also the experience for both sides. As an example, setting $U_{ij}$ to be the negative of driver-passenger distance will have the effect of minimizing the waiting time for the passenger. 
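:::

:::warning
**譯者補充(示意用,非論文內容)**:把式 (7) 的 $\rho_{ij}$ 當作 driver-order 二部圖的邊權,然後用 `scipy.optimize.linear_sum_assignment`(Hungarian / KM)解一對一指派的小例子。裡面的 $V$ 值、車資、時長、距離都是隨意假設的數字,也不是線上系統真正的解法:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

gamma, Omega = 0.9, 0.5
V_driver = np.array([50.0, 42.0, 61.0])          # V(s_i):三位司機目前狀態的價值(假設)
V_dest   = np.array([55.0, 70.0])                # V(s_j):兩張訂單目的地的價值(假設)
R        = np.array([[20., 35.], [22., 33.], [18., 40.]])   # R_ij:預估車資
k        = np.array([[4,  8 ], [5,  7 ], [3,  9 ]])         # k_ij:預估行程時步數
U        = -np.array([[1.2, 3.0], [0.5, 2.1], [2.4, 0.8]])  # U_ij:這裡取負的接駕距離

rho = (R * (gamma ** k - 1) / (k * (gamma - 1))
       + (gamma ** k) * V_dest[None, :] - V_driver[:, None]
       + Omega * U)                               # 式 (7) 的 utility score

# linear_sum_assignment 解的是最小成本,取負號即為最大化總 utility
rows, cols = linear_sum_assignment(-rho)
for i, j in zip(rows, cols):
    print(f"driver {i} -> order {j}, score {rho[i, j]:.2f}")
```

實務上還要加上文中提到的各種 constraints(例如一張訂單最多指派給一位司機),這裡的一對一指派只是最簡單的情形。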
::: :::success 類似於[19]的研究,我們使用訂單的目地狀態$s_j$與司機當前狀態$s_i$之間的Temporal Difference error(TD error)做為效用得分$p_{ij}$。給定如上面所說的的策略價值函數(policy value function)$V(s)$,可以像下面這樣計算: $$ \rho_{ij} = R_{ij}\dfrac{(\gamma^{k_{ij}} - 1)}{k_{ij}(\gamma - 1)} + \gamma^{k_{ij}}V(s_j) - V(s_i) + \Omega \cdot U_{ij} \tag{7} $$ 其中$R_{ij}$是司機$i$在完成訂單$j$之後所收的行程費用;$k_{ij}$的話則是行程的[持續時間](http://terms.naer.edu.tw/detail/17629314/),$\gamma$則是考慮未來不確定性的折扣因子(discount factor)。除了公式7第一部份裡面所補抓到的長期性司機收入之外,我們還增加一個額外項目,$\Omega \cdot U_{ij}, \Omega \geq 0$,其中$U_{ij}$描述的是司機$i$與乘客$j$的[使用者經驗](http://terms.naer.edu.tw/detail/19303140/),所以,我們所優化的不僅僅是司機收入,還有雙方的互動經驗。舉例來說,把$U_{ij}$設置成是負的(司機乘客之間的距離),那就會讓乘客的等待時間最小化。 ::: ### 4.1 Feature Marginalization via Distillation :::info $V(s_j)$ in (7) represents the state value at order’s destination. The real time dynamic features $v_d$ at the order’s destination, however, is not available until the driver actually finishes the trip. In other words, we need a separate V-function that can evaluate the state value under the absence of those real time dynamic features. Let us call this V-function $\tilde{V}$ . Given $V^\pi$, $\tilde{V}$ can be obtained through the marginalization over those features, $\tilde{V}^\pi= E_{v_d}\left\{V^\pi(l,\mu,v_s,v_d) \right\}$. Here, the contextual features $v$ are split into two groups, the static features $v_s$ and those dynamic features that require real-time computation $v_d$ . The expectation is taken under the historical distribution of $v_d$ , e.g., $p(v_d \vert l,\mu,v_s)$. ::: :::success 公式7裡面的$V(s_j)$表示訂單目地的的狀態價值(state value)。不過,這個訂單目的地的即時動態特徵$v_d$在司機確實的完成這個行程之前是無法使用的。換句話說,我們需要一個單獨的V-function可以在沒有這些即時性動態特徵的情況下去評估狀態價值(state value)。我們把這個V-function稱為$\tilde{V}$。給定$V^\pi$,然後我們就可以透過從這些特徵做邊緣化來得到$\tilde{V}$,也就是$\tilde{V}^\pi= E_{v_d}\left\{V^\pi(l,\mu,v_s,v_d) \right\}$。在這邊,contextual features $v$拆分成兩個群組(groups),分別是靜態特徵$v_s$與需要即時計算的動態特徵$v_d$。期望值的部份則是取自於$v_d$的歷史的分佈(historical distribution),也就是$p(v_d \vert l,\mu,v_s)$ ::: :::info We make use of knowledge distillation to approximate this expectation, treating $V$ as the teacher network and training $\tilde{V}$ to mimic the output of $V$ . Figure 1 illustrates the network structures of $\tilde{V}$ which is built on top of the structure of $V$. $V$ and $\tilde{V}$ share the same state representation layers to encourage common knowledge transfer but distinguish from each other by having their own MLP and final output layers. We use the full original training set for $V$ as the transfer set and evaluate $V$ on 𝑙, 𝜇,𝜐 sampled from the transfer set to obtain the targets for $\tilde{V}$ . We activate distillation during the training of $V$ before each model checkpoint. The weights of $V$ , including the shared weights, are frozen and only the MLP layers of $\tilde{V}$ are updated during distillation. We find that the distillation usually converges in less than 5 epochs and the distillation becomes much faster at later stage of the training as $V$ training also converges. So we anneal the number of distillation epochs in the beginning of the training. Afterwards we only run one epoch of updating $\tilde{V}$ for every model checkpoint. We find that this helps prevent overfitting while reducing the computation overhead to the training. 
::: :::success 我們利用知識蒸餾(knowledge distillation)來近似這個期望值,我們把$V$視為teacher network,教導著$\tilde{V}$,讓它可以模仿$V$的輸出。Figure 1說明著$\tilde{V}$的網路架構,它是建立在$V$的架構之上。$V$跟$\tilde{V}$共享相同的狀態表示層(state representation layers),以此促進通用知識的轉移,不過在最終的輸出層(output layer)還是有分開就是了。我們使用$V$的完整原始訓練集當做轉移集(transfer set),然後在從這個transfer set採樣的$l,\mu,v$上評估$V$,以此獲得$\tilde{V}$的目標(target)。我們在$V$的訓練過程中,每個模型的檢查點之前會啟動[蒸餾](http://terms.naer.edu.tw/detail/19344401/)(distillation)。$V$的權重,包含共享的權重,都會被凍結住,只有$\tilde{V}$的MLP layers會在蒸餾被更新。我們發現到,這個蒸餾的過程通常會在不到5個epochs內收斂,後期會收斂的更快,因為$V$也隨著訓練過程收斂。也因此,我們會在訓練開始的時候對蒸餾的迭代次數進行退火(anneal,這邊感覺說明的就是次數減少)。之後,我們就只會為每個模型檢查點(model checkpoint)做一次的更新$\tilde{V}$的動作。我們發現到這有助於防止overfitting,同時也可以減少訓練的計算開銷。 ::: ## 5 MULTI-CITY TRANSFER :::info Order dispatching can be naturally formulated as a multi-task learning problem with each task targeting at one particular regional area (e.g., a city). Training a single agent for all tasks may not scale well and can even raise many practical concerns in both deployment and model serving. On the other hand, training each task independently is clearly suboptimal. To efficiently scale CVNet to multiple cities, in this work we employ a method called CFPT (correlated-feature progressive transfer) proposed by [17] for the single driver dispatching environment. In CFPT the state space (and its corresponding structure) is split into two groups based on their adaptivity across different cities and transfered using a parallel progressive structure [12]. The idea is to maximize the knowledge transferred from those adaptive inputs like contextual features and time, while letting target training focus on the nonadaptive part such as the absolute GPS locations that are specifically tied to the task/city. We include the specific network structure and implementation details in the supplement material and report in Section 6 experiment results comparing different transfer methods with CFPT. We find that the performance of CVNet can be further improved through the efficient use of knowledge transferred from another city. ::: :::success 訂單分派可以很自然地被表述為一個[多任務學習](http://terms.naer.edu.tw/detail/18978342/)問題(multi-task learning problem),每個任務都針對一個特定的區域(像是一個城市)。為所有的任務都各自訓練一個agent可能沒有辦法很好的擴展(sacle),而且在佈署與模型服務上可能也會有很多問題。另一方面,獨自的訓練每個任務可能也不是一個很好的想法。為了能夠有效的擴展CVNet到多個城市,在這研究中,我們採用一種稱為CFPT(correlated-feature progressive transfer)的方法。在CFPT中,狀態空間(state space)與其相關結構基於它們在不同城市的適應性拆分成兩組,然後利用平行[漸進](http://terms.naer.edu.tw/detail/19407309/)架構[12]來做轉移(transfer)。這個想法是最大限度的從那些應適性輸入(像是contextual features、time)做知識轉移,同時讓目標訓練關注在非適應性的部份,像是與任務/城市特別特別相關的絕對GPS位置。補充教材中會包含具體的網路架構與實做的細節,然後在Section 6中報告關於不同的轉移方法與CFPT之間比較的實驗結果。我們有發現到,透過有效地利用從另一個城市轉移過來的知識,CVNet的效能是可以再進一步的提升的。 ::: ## 6 EXPERIMENTS ### 6.1 Characteristics of CVNet :::info We design various experiments to illustrate the robustness and spatiotemporal effect of CVNet. In this set of experiments we train CVNet on City A with embedding dimension set to 5 and without the use of contextual features. The results are presented in Figure 4. ::: :::success 我們設計多個實驗來說明CVNet的魯棒性與時空效應(spatiotemporal effect)。在這組實驗中,我們在城市A訓練CVNet,然後embedding dimension(嵌入維度)設置為5,沒有使用contextual features。結果如Figure 4所示。 ::: :::info ![](https://hackmd.io/_uploads/SkEaOHpFq.png) Figure 4: (a). Temporal patterns in the learned value network and how it reacts against the time discount factor $\gamma$; (b). Comparison of value distributions at a given time between DQN [17] and CVNet; (c). The change of global Lipschitz during training under different regularization $\lambda$; (d). 
Comparison of robustness of CVNet w/o Lipschitz Regularization (net1 is trained with $\lambda=0.1$ and net2 is trained with $\lambda=0$).
Figure 4:(a),學習到的價值網路(value network)中的[時間模式](http://terms.naer.edu.tw/detail/2795403/)以及它對time discount factor $\gamma$(時間折扣因子)的反應;(b),給定時間情況下,其值分佈(value distributions)在DQN[17]與CVNet之間的比較;(c),在不同正規化$\lambda$情況下,於訓練期間global Lipschitz的變化;(d),比較CVNet有無Lipschitz Regularization時的魯棒性(net1是以$\lambda=0.1$訓練,net2是以$\lambda=0$訓練)
:::
:::info
Figure 4a plots the change of the mean and standard deviation of $V$ value, evaluated on a fixed set of sampled locations from City A, against the time id of the day. Three curves are plotted, each with a different time discount factor $\gamma$ used during training. We notice that for all curves the value of CVNet decreases towards zero as the time approaches the end of the day. This accords with the definition (ref. Section 2) of CVNet. Also note a higher averaged $V$ value as $\gamma$ increases to one (no discount). In general, a small $\gamma$ induces a short-sighted strategy, e.g., the earnings over a one-hour period from now, while a large one encourages long-term behaviors. This also has an effect on the shape of the temporal patterns, as can be seen in the figure that for a small $\gamma = 0.8$ the value curve moves upwards temporarily during the morning rush hour period while the curves with large $\gamma$ approach zero in a more monotonic manner.
:::
:::success
Figure 4a畫出$V$值的均值與標準差的變化,這是根據一天的time id在城市A的一組固定的採樣位置上做的評估。圖中有三條曲線,每一條曲線都是訓練期間使用不同的時間折扣因子$\gamma$繪製而成。我們注意到,就所有的曲線來看,當時間接近一天的尾端的時候,CVNet的值都會往零的方向一路減少。這符合CVNet的定義(參考Section 2)。另外也注意到,隨著$\gamma$增加到1(也就是沒有折扣),$V$值的平均值也變得更高。一般來說,較小的$\gamma$會導致一種短視近利的策略(像是從現在開始一小時內的收益),較大的話則是鼓勵長期行為。這也對時間模式的形狀有著一定的影響,從圖可以看的出來,對於一個小的$\gamma=0.8$,其價值曲線(value curve)在早[高峰時段](http://terms.naer.edu.tw/detail/19401412/)會暫時性的向上移動,而大的$\gamma$則是會以一種更單調的方式接近0。
:::
:::info
Figure 4c demonstrates the effectiveness of Lipschitz regularization with the parameter $\lambda$ introduced in Section 3.3. In Figure 4c we plot, for different values of $\lambda$, the change of the bound on global Lipschitz (4) as training progresses. As expected the Lipschitz value explodes when there is no regularization $\lambda=0$. To see how the use of Lipschitz regularization improves the robustness and the training stability of CVNet, we employ a technique called weight corruption which adds random noises to the weights of the hidden layers, analogous to, for example, the effect of a bad gradient descent step during training. In this case, we corrupt the first layer of CVNet – the embedding matrix $\theta^M$. We compute the output distribution against a fixed sampled set of locations in City A and compare the change of this distribution before and after corruption. As is shown in Figure 4d, the CVNet trained with Lipschitz regularization $\lambda=0.1$ is much more robust against such noises compared to the one trained without Lipschitz regularization (both are corrupted with the same noise). As we mentioned, any dramatic changes in the output distribution like the blue dashed curve shown in Figure 4d can have a detrimental effect on the training due to the recursive nature of the update rule, e.g., (2).
:::
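:::warning
下面是一個簡化的示意程式(譯者補充,並非論文的實作)。第一個函數用「各線性層 spectral norm 的乘積」做為全域 Lipschitz 常數的一個上界(這是常見的上界估法,不一定等同於論文中的公式(4)),訓練時可以把它乘上 $\lambda$ 加進損失;第二個函數則示意上一段所說的 weight corruption 測試:對第一層 embedding 的權重加上高斯噪聲,再比較同一批固定狀態在破壞前後的輸出差異。`sigma` 以及各函數、變數名稱都是假設。
```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(mlp: nn.Sequential) -> torch.Tensor:
    """以各線性層 spectral norm 的乘積估一個全域 Lipschitz 上界(常見做法,僅供示意)。"""
    bound = torch.ones(())
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            bound = bound * torch.linalg.matrix_norm(layer.weight, ord=2)
    return bound

# 訓練時可把上界當正則項:loss = td_loss + lam * lipschitz_upper_bound(value_mlp)

def corrupt_and_compare(value_net: nn.Module, embedding: nn.Embedding, states, sigma=0.1):
    """weight corruption 測試的示意:假設 embedding 是 value_net 的第一層(theta^M),
    對其權重加上高斯噪聲,比較破壞前後在同一批固定狀態上的輸出差異。"""
    with torch.no_grad():
        before = value_net(states)
        embedding.weight.add_(sigma * torch.randn_like(embedding.weight))  # 就地破壞權重
        after = value_net(states)
    return (after - before).abs().mean()
```
:::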
:::success
Figure 4c說明了Section 3.3所介紹的、帶有參數$\lambda$的Lipschitz regularization的有效性。Figure 4c中,我們針對不同的$\lambda$值,畫出global Lipschitz邊界(bound)(公式4)隨著訓練進行的變化。就跟我們所預期的那樣,當沒有正規化,也就是$\lambda=0$的時候,Lipschitz的值會爆炸。為了能夠看看Lipschitz regularization究竟是如何提升CVNet的魯棒性與訓練的穩定性,我們採用一種稱為weight corruption的技術,在隱藏層的權重上加入一些隨機噪點,舉例來說,這就類似於訓練過程中那些不好的梯度下降(gradient descent)步驟所造成的影響。這種情況下,我們corrupt(破壞)掉CVNet的第一層,也就是embedding matrix $\theta^M$。我們根據城市A的一組固定採樣位置來計算輸出分佈,然後比較在corrupt(破壞)前後這個分佈的變化。如Figure 4d所示,相較於訓練過程中沒有採用Lipschitz regularization的CVNet,使用Lipschitz regularization $\lambda=0.1$的CVNet在面對這類噪點上有著較佳的魯棒性(兩個模型都採用相同噪點做corrupt)。就像我們說的那樣,由於更新規則(像是公式2)的遞迴性質,輸出分佈中的任何劇烈變化(像是Figure 4d中的藍色虛線)都可能對訓練產生不良的影響。
:::
:::info
Finally, we compare the value distribution of CVNet with that of DQN [17] to show the "built-in" generalization of the cerebellar embedding. The two methods are trained on the same dataset and evaluated at a given time step using the same set of popular latitude/longitude pairs from City A. It can be seen from the results in Figure 4b that the value distribution of DQN not only exhibits a large variance, but also contains quite a few outliers that are clearly not correct (negative value). Most of those abnormal values are from unseen locations that require generalization. On the other hand, CVNet attains a distribution that is compact and robust against outliers. With the use of cerebellar embedding, CVNet can generalize well to most unseen data points, since they are mapped to tiles whose embeddings are collaboratively learned during training.
:::
:::success
最後,我們比較CVNet與DQN[17]的值分佈(value distribution),來說明cerebellar embedding內置(built-in)的泛化性。兩個方法用相同的資料集訓練,然後在給定的時步(time step)上使用來自城市A的同一組常用的緯度/經度資料做評估。從Figure 4b的結果可以看的出來,DQN的值分佈不僅有著較大的方差,還包含很多明顯不正確的[離群值](http://terms.naer.edu.tw/detail/2121271/)(負值)。這些[異常值](http://terms.naer.edu.tw/detail/2491741/)多數來自需要泛化的那些未曾見過的位置。另一方面,CVNet則是得到一個緊湊且面對離群值有著較佳魯棒性的分佈。透過cerebellar embedding,CVNet對多數沒看過的資料點有著較好的泛化性,因為它們會被映射到一些tiles,而這些tiles的embeddings在訓練期間是協同學習而得的。
:::

### 6.2 Order Dispatching Experiments

:::info
Experiments in this section validate the capabilities of CVNet to improve the total driver income in a dynamic and complicated multi-driver environment.
:::
:::success
這個section的實驗驗證了CVNet在一個動態且複雜的多司機(multi-driver)環境中提高總司機收入的能力。
:::

### 6.2.1

:::info
Results on Simulations With Real Data. We first use simulations with real data collected from DiDi's platform to validate CVNet, and more importantly as a level ground for comparing various order dispatching policies.
:::
:::success
以實際資料模擬而得的結果。我們首先使用從滴滴平台中收集而得的實際資料來進行模擬,用它來驗證CVNet,更重要的是,做為比較各種訂單分派策略的共同基準。
:::
:::info
We give a brief description below of the policies we compare with in the experiments. Note that all policies compute the final dispatching decisions based on the assignment problem described in Section 4. The variations come from the way the utility scores or edge weights $p_{ij}$ on the bipartite graph are computed, as we describe below.
* **Baseline**. A myopic method that maximizes the current batch rate of dispatching, with $p_{ij}$ being the negative distance between driver-order pairs.
* **Tabular Value function (TVal) [19]**. $p_{ij}$ is computed similarly as in (7) where the state evaluations only consider two variables, the time and location. The values are obtained by performing dynamic programming in a discrete tabular space. We ask authors of [19] to kindly provide the production-ready set of values in a lookup table format for us to compare in this experiment.
* **DQN [17]**.
A deep Q network is trained from a single driver's perspective. Here we compute $p_{ij}$ as the maximum Q value over all feasible action set, e.g., $\max_{a \in \tilde{\mathcal{A}}(s)}Q^*(s,a)$ where $s$ is approximated by the grid center and $\tilde{\mathcal{A}}(s)$ is computed from the set of historical trips originating from the vicinity of $s$. In this experiment we use the implementation and hyperparameters provided by the authors of [17] and train the deep Q networks using data listed in Table 1.
* **CVNet Basic**. The method proposed in this work with a simpler cerebellar embedding layer (flat hexagon grids with no hierarchy) and no contextual features. The final values are also stored in a lookup table format for use by the simulator. The main difference from TVal is then the use of a neural network as the function approximation. This serves the purpose of ablative analysis against CVNet.
* **CVNet**. The method proposed in this work with hexagon grid layers of three hierarchies. Four additional contextual features are used including one static feature, the dayofweek, and three dynamic ones including the trip query, order and empty driver count in the last one minute in the neighborhood of $s$.
:::
:::success
下面我們簡要的說明實驗中所比較的策略。注意到,所有的策略都是基於Section 4所述的分配問題來計算最終的分派決策,差異則來自於[二分圖](http://terms.naer.edu.tw/detail/18884776/)上的效用得分(也就是邊權重)$p_{ij}$的計算方式,如下所述。
* **Baseline**,一種最大化當前批次分派率的短視近利的方法,其中$p_{ij}$取為driver-order pair之間距離的負值。
* **Tabular Value function (TVal) [19]**,$p_{ij}$的計算類似於公式7,只是狀態的評估只考慮兩個變數,也就是時間與位置。價值(value)是在一個離散的表格空間中透過動態規劃計算而得。我們拜託了[19]的作者以[查表法](http://terms.naer.edu.tw/detail/19127045/)的格式給我們一份可用於生產的價值(set of values),這樣我們才有辦法在這個實驗中做比較。
* **DQN [17]**,從單一個司機的觀點訓練而得的deep Q network。這邊我們將$p_{ij}$計算成在所有可行的action set上的最大Q value,像是$\max_{a \in \tilde{\mathcal{A}}(s)}Q^*(s,a)$,其中$s$是由網格中心近似,然後$\tilde{\mathcal{A}}(s)$則是根據一組源自$s$附近的歷史行程計算而得。在這個實驗中,我們使用[17]的作者所提供的實作與超參數,然後使用Table 1所列出的資料來訓練deep Q networks。
* **CVNet Basic**,本研究所提出的方法,但使用較為簡單的cerebellar embedding layer(沒有層次結構的扁平六角網格,flat hexagon grids),而且沒有使用contextual features。最終的values同樣會以查表法的格式來保存,用以提供模擬器使用。與TVal最主要的差異在於,這是使用神經網路來做為函數的近似。這是為了對CVNet做[ablative analysis](https://blog.csdn.net/weixin_45884316/article/details/107895219)。
* **CVNet**,本研究所提出的方法,具有三個階層的六角網格層。使用四個額外的contextual features,包含一個靜態特徵(星期幾),以及三個動態特徵($s$附近在過去一分鐘內的行程查詢數、訂單數與空車司機數)。
:::
:::info
![](https://hackmd.io/_uploads/HJyUr9RFc.png)
Table 1: Stats of the training data consisting of one month of driver trajectories and contextual features collected from three Chinese cities. Features are stored in a <key, value> format with key being the name, time and location.
Table 1:訓練資料的[統計數字](https://dictionary.cambridge.org/zht/%E8%A9%9E%E5%85%B8/%E8%8B%B1%E8%AA%9E-%E6%BC%A2%E8%AA%9E-%E7%B9%81%E9%AB%94/stat?q=Stats),包括從中國三個城市中收集而得的一個月的司機軌跡與contextual features。特徵以`<key, value>`的格式保存,key為名稱、時間與位置。
:::
:::info
Figure 5 plots the simulation results averaged over three days including both weekdays and weekends. The Total Driver Income (TDI) of the Baseline is considered as one unit and the numbers from other methods are standardized accordingly. We notice in the results a consistent advantage of CVNet over the other methods across cities and days. In comparison with the Baseline, we notice that CVNet maintains an improvement ranging from 1% to 15% with an averaged improvement (across days) from 3% to 8%. CVNet Basic also performs quite well. While its advantage over TVal and DQN is not as significant as CVNet, it is considerably more robust and consistent in all situations.
Finally, note that DQN underperforms noticeably in City A. We suspect this is due to both the limitations of DQN's structure, e.g., no special treatment to ensure robustness, and the restriction of the single driver assumption, which make DQN vulnerable to both the noises in training and the changing dynamics in the multi-driver environment.
:::
:::success
Figure 5畫出三天(包含工作日與週末)的平均模擬結果。基線(Baseline)的總司機收入(TDI)被視為一個單位,其它方法的數值就做相對應的標準化。我們在結果中注意到,不論城市或日期,CVNet相較於其它方法都有著一致的優勢。與Baseline相比,我們看到CVNet維持1%到15%的提升,平均(跨天)來看是3%到8%。CVNet Basic的表現也是不錯的。儘管它相對於TVal與DQN的優勢不像CVNet那麼明顯,不過它在所有情況下都還是相當魯棒且一致。最後,可以看到DQN在城市A的表現是明顯不好的。猜測這是因為DQN這架構的局限性,像是沒有特別的處理來確保魯棒性,還有單一司機假設的這個限制,這都造成DQN容易受到訓練中的噪點與多司機環境中多變的動態性的影響。
:::
:::warning
這邊說的應該是,把Baseline視為1,其它的就做相對應的縮放。
:::
:::info
![](https://hackmd.io/_uploads/Bk6jZgk99.png)
Figure 5. CVNet achieves the highest Total Driver Income (TDI) improvements compared to all other methods.
Figure 5. 對比其它方法,CVNet得到最高的總司機收入(TDI)提升。
:::

### 6.2.2 *Results on the Real World*

:::info
We conduct real-world experiments through DiDi's platform and report online A/B testing results in Table 2. In the experiments CVNet Basic is compared with the online production dispatching policy on three cities across China. The experiment design and setup are similar to that used by [19]. Besides total driver income, we report two additional metrics including order answer rate and order finish rate. Results are presented in Table 2. We notice from the results a consistent 0.5% − 2% improvement over the production baseline in all metrics across all three experiment cities. An increase in order answer rate implies that over time the driver distribution has been optimized to align better with where orders might appear, given that an order request will not be answered if there is no empty driver found in its neighborhood (usually a radius of 2km), and that the number of drivers is not affected by the dispatching policy. An increase in finish rate indicates that there are fewer trip cancellations after the orders are answered. Together they show that CVNet improves both the driver income and user experiences for the platform.
:::
:::success
我們透過滴滴的平台進行真實世界的實驗,然後在Table 2中報告online A/B testing的結果。在這實驗中(中國的三個城市),我們把CVNet Basic與線上生產環境的分派策略做了比較。實驗的設計與設置與[19]類似。除了總司機收入之外,我們還報告兩個額外的指標,包含訂單的答覆率(answer rate)與訂單的完成率(finish rate)。結果如Table 2所示。從結果可以看的出來,在所有三個實驗城市中,所有指標相較於生產基線(production baseline)都有0.5%到2%的一致提升。訂單答覆率的提高意味著隨著時間的推移,司機的分佈已經得到優化,可以更好的與訂單可能出現的位置一致,因為如果附近(通常為半徑2公里)沒有閒置的司機的話,訂單需求就不會被回應,而且司機的數量並不會受到分派策略的影響。完成率的增加則說明訂單在得到答覆之後被取消的行程變少了。這些都顯示CVNet同時提升了平台上的司機收入與使用者經驗。
:::
:::info
![](https://hackmd.io/_uploads/rkcinly95.png)
:::

### 6.3 Results on Transfer Across Cities

:::info
In this set of experiments we study the effect of using transfer learning to scale CVNet across multiple cities. The experiments are conducted using the simulator. Four different transfer strategies are compared: CFPT [17], Finetuning [7], Progressive [12] and Baseline which simply trains a CVNet independently for each city without transfer. City A is used as the source city and the other three are used as the target cities. Figure 6 shows the scaled Total Driver Income (TDI) and pickup distance under different pickup distance penalties – $\Omega$ which is introduced in (7). It can be observed from the figure that by altering the value of $\Omega$ we obtain a trade-off between TDI and pickup distance. By using transfer learning methods, especially CFPT, with CVNet, the trade-off curve is shifted significantly upwards.
As a result, it is possible to attain a greater improvement on TDI while maintaining a short pickup distance.
:::
:::success
在這組實驗中,我們研究了使用transfer learning在多個城市中擴展CVNet的效果。實驗的部份是利用模擬器進行的。我們比較四種不同的遷移策略(transfer strategies):CFPT [17]、Finetuning [7]、Progressive [12]以及Baseline(單純地在每個城市各自獨立訓練一個CVNet,不做任何遷移)。城市A用來做為source city,其它三個城市則是target city。Figure 6畫出在公式7所引入的不同接人上車距離懲罰$\Omega$之下,縮放後的總司機收入(TDI)與接人上車距離。從圖表可以看的出來,透過改變$\Omega$的值,我們可以在TDI與接人上車距離之間取得一個權衡。透過在CVNet使用transfer learning,特別是CFPT,權衡曲線明顯的向上偏移。結果來看,可以在維持一個較短的接人上車距離的同時取得TDI更大的改進。
:::

## 7 CONCLUSIONS AND FUTURE WORK

:::info
This paper has proposed a deep reinforcement learning based solution for order dispatching. The method has been shown to achieve significant improvement on both total driver income and user experience related metrics in large scale online A/B tests through DiDi's ride-dispatching platform. First of all, a novel SMDP formulation has been proposed for the order dispatching problem to account for the temporally extended dispatching actions. Secondly, a new network structure, Cerebellar Value Networks (CVNet), and a novel Lipschitz regularization scheme based on that structure have been proposed to ensure both the robustness and the stability of the value iteration during policy evaluation. Experiments using real data demonstrate that CVNet is robust against outliers and generalizes well to unseen data. Results on extensive simulations and online A/B testing have shown that CVNet outperforms all the other dispatching policies. Finally, we show that using transfer learning can further improve on the previous results and facilitate the scaling of CVNet across cities.
:::
:::success
這篇論文提出一種基於深度強化學習的訂單分派解決方案。透過在滴滴打車的分派平台上做大規模的online A/B tests,這個方法被證明可以顯著改善總司機收入與使用者經驗的相關指標。首先,我們針對訂單分派問題提出一個創新的SMDP建模,以此考量時間上延伸(temporally extended)的分派動作。接下來,我們提出一個新的網路架構,也就是Cerebellar Value Networks (CVNet),以及一個基於該架構的創新Lipschitz regularization方案,以確保策略評估期間value iteration(值迭代)的魯棒性與穩定性。使用實際資料的實驗也說明著,CVNet對於離群值有著魯棒性,而且可以很好的泛化至沒看過的資料上。大量模擬以及online A/B testing的結果也說明著,CVNet優於其它的分派策略。最後,我們說明了使用transfer learning可以進一步的提升先前的結果,也有利於CVNet在各城市間的擴展。
:::
:::info
Our proposed approach consists of two separate steps, learning and planning. Ideally we would like to combine them into one step that enables learning from end to end. We would also like to explore ways of extending CVNet to other transportation applications like fleet management which has the similar goal of bridging the gap between supply and demand. We leave these ideas as future directions of research.
:::
:::success
我們提出的方法包括learning與planning兩個個別的步驟。理想上,我們希望可以把這兩個步驟結合成一個步驟,讓學習可以用端到端(end-to-end)的方式完成。我們還希望可以探索將CVNet擴展到其它的運輸應用上,像是[機隊管理](http://terms.naer.edu.tw/detail/202013/)(fleet management),其目標類似,就是縮小供需之間的差距。我們把這些想法留著做為未來的研究方向。
:::

## A TRAINING CONFIGURATION

:::info
To train the CVNet used in the experiments, we employ 3 cerebellar quantization functions and use a memory size $A$ of 20000. The embedding dimension $m$ is chosen to be 50. Following the cerebellar embedding layer are fully connected layers having [32, 128, 32] hidden units with ReLU activations. We maintain a target network which is updated every 100K steps. We use a batch size of 32 and run training for 20 epochs, with each epoch being one pass through the whole dataset. We apply the Adam optimizer with a constant step size $3e^{-4}$. The Lipschitz regularization parameter $\lambda$ is chosen to be $1e^{-4}$ since we find that a small $\lambda$ is already quite effective at bounding the Lipschitz, as demonstrated in Figure 4a. For context randomization we use a range $rg$ of 30 minutes.
:::
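:::warning
下面用一小段程式把附錄A的設定(3個quantization functions、記憶體大小$A=20000$、embedding維度50、後接[32, 128, 32]的全連接層與ReLU)畫成一個簡化的網路骨架(譯者補充,並非論文實作)。要特別注意:論文實際的cerebellar embedding是建立在階層式六角網格上,這裡的量化/雜湊方式、把多個tile的embedding取平均的聚合方式、`ctx_dim=4`等,都是為了示意而做的假設。
```python
import torch
import torch.nn as nn

class CerebellarValueNet(nn.Module):
    """依附錄A數字寫成的簡化骨架;quantize 的雜湊與聚合方式皆為假設,非論文實作。"""
    def __init__(self, num_quantizers=3, memory_size=20000, emb_dim=50, ctx_dim=4):
        super().__init__()
        self.num_quantizers = num_quantizers
        self.memory_size = memory_size
        self.embedding = nn.Embedding(memory_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + ctx_dim, 32), nn.ReLU(),
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def quantize(self, lat, lng, t):
        """把(經緯度, 時間)量化成 num_quantizers 個 tile 索引(用不同 offset 的網格加上簡單雜湊示意)。"""
        idx = []
        for q in range(self.num_quantizers):
            offset = q / self.num_quantizers
            cell = (torch.floor(lat * 100 + offset).long() * 7919
                    + torch.floor(lng * 100 + offset).long() * 104729
                    + (t // 600).long())
            idx.append(cell % self.memory_size)
        return torch.stack(idx, dim=-1)            # (batch, num_quantizers)

    def forward(self, lat, lng, t, ctx):
        tiles = self.quantize(lat, lng, t)
        emb = self.embedding(tiles).mean(dim=-2)   # 聚合多個 tile 的 embedding(此處假設取平均)
        return self.mlp(torch.cat([emb, ctx], dim=-1)).squeeze(-1)
```
:::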
:::success
為了訓練這個實驗中使用的CVNet,我們採用了三個cerebellar quantization functions,並且使用大小為20000的記憶體$A$。嵌入的維度$m$選擇為50。cerebellar embedding layer的後面接著全連接層(fully connected layers),hidden units為[32, 128, 32],使用ReLU做為啟動函數。我們維護一個target network,每100K個steps就更新一次。batch size為32,訓練20個epochs,每個epoch就是把整個資料集完整跑過一遍。我們使用Adam做最佳化,固定的step size為$3e^{-4}$。Lipschitz regularization parameter $\lambda$選擇為$1e^{-4}$,因為我們發現比較小的$\lambda$在限制Lipschitz上就已經十分有效,如Figure 4a所說明。上下文[隨機化](http://terms.naer.edu.tw/detail/2294702/)(context randomization)的部份,我們則是使用30分鐘的範圍$rg$。
:::
:::info
Finally, we illustrate the training progress under different discount factors $\gamma$ in Figure 7. During training we record the average $V$ for each input batch and we plot its change against the training steps. The average value the function $V$ converges to depends on the $\gamma$ being used. The value is smaller with a smaller $\gamma$, in which case the convergence also happens faster. Note that a smaller $\gamma$ implies a more aggressive discounting of future values, hence a shorter lookahead horizon beyond which any rewards are close to zero once brought to the present. The training becomes easier in this case since it does not need to look far into the future. In general $\gamma$ represents a trade-off between foresight and variance. The $\gamma$ used in the experiments is chosen to be 0.92 which is determined by a randomized search based on out-of-sample simulations.
:::
:::success
最後,我們在Figure 7中說明不同的discount factors $\gamma$情況下的訓練情形。訓練過程中,我們為每個輸入批次(input batch)記錄平均$V$,然後把它隨著訓練步驟的變化繪製出來。函數$V$收斂到的平均值取決於所使用的$\gamma$。$\gamma$愈小,這個值就愈小,收斂得也愈快。注意到,較小的$\gamma$意味著對未來的價值(future values)有著更積極的折扣,因此前瞻的時間範圍較短,超過這個範圍的獎勵折現到當下後都接近於零。這種情況下,訓練變得較為容易,因為它並不需要展望太遠的未來。一般來說,$\gamma$代表著[遠見](https://dictionary.cambridge.org/zht/%E8%A9%9E%E5%85%B8/%E8%8B%B1%E8%AA%9E-%E6%BC%A2%E8%AA%9E-%E7%B9%81%E9%AB%94/foresight)(先見之明)與方差之間的權衡。實驗中使用的$\gamma$選擇為0.92,這是基於樣本外模擬(out-of-sample simulations)的隨機搜尋所決定的。
:::
:::info
![](https://hackmd.io/_uploads/SkPOFUyc5.png)
Figure 7: The average $V$ as training progresses. Each curve is generated by using a different discount factor $\gamma$ during policy evaluation, e.g., $V$ converges in roughly 1M steps when $\gamma=0.8$.
Figure 7:隨著訓練的進行,平均$V$值的變化。每條曲線都是在策略評估期間使用不同的折扣因子$\gamma$所生成,舉例來說,當$\gamma=0.8$的時候,$V$大概是經過1M個steps之後收斂。
:::

## B TRANSFER NETWORK CONFIGURATION

:::info
In this supplement section, we will present the details of the three network models we implemented for knowledge transfer across multiple cities: finetuning [7], progressive network [12], and correlated feature progressive transfer (CFPT).
:::
:::success
在這個補充章節中,我們要介紹我們為了跨多個城市的知識轉移所實作的三個網路模型的細節:finetuning [7]、progressive network [12]以及correlated feature progressive transfer (CFPT)。
:::
:::info
Details of these algorithms could be found in [17]. Here we provide the network adaptation due to the change of training algorithms (from Q network to state-value network).
:::
:::success
這三個演算法的細節可以在[17]中找的到。這邊我們提供由於訓練演算法的變化(從Q network到state-value network)而做的網路[適應性](http://terms.naer.edu.tw/detail/708819/)調整。
:::
:::info
**Finetuning**: After training the network of the source city (using the original model in Figure 8), we initialize the “green blocks” of the target city network with the trained weights as in Figure 9 and continue to train on the new dataset.
:::
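:::warning
下面是finetuning這個轉移方式的一個極簡示意(譯者補充,非論文實作):把source city訓練好的「綠色區塊」權重複製到target網路,再視需要凍結部份層後繼續在target city的資料上訓練。`green_block_names`、`state_repr`、`mlp`等子模組名稱都是假設。
```python
import copy
import torch

def finetune_init(source_net, target_net, green_block_names=("state_repr", "mlp")):
    """把 source city 訓練好的「綠色區塊」權重複製到 target 網路(子模組名稱為假設)。"""
    src_state = source_net.state_dict()
    tgt_state = target_net.state_dict()
    for name, tensor in src_state.items():
        if name.startswith(green_block_names) and name in tgt_state:
            tgt_state[name] = copy.deepcopy(tensor)
    target_net.load_state_dict(tgt_state)
    # 若要仿照 Figure 9 中的 frozen layers,可把對應參數的 requires_grad 設為 False:
    # for n, p in target_net.named_parameters():
    #     if n.startswith("state_repr"):
    #         p.requires_grad_(False)
```
:::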
:::success
**Finetuning**:訓練完source city的網路之後(使用Figure 8中所述的原始模型),我們使用訓練後的權重來初始化target city network的「綠色區塊」(green blocks),如Figure 9所示,然後繼續在新的資料集上訓練。
:::
:::info
![](https://hackmd.io/_uploads/BkXFdLkcq.png)
:::
:::info
![](https://hackmd.io/_uploads/Sym9OUk95.png)
Figure 9: Structures of finetuning and progressive network. Green blocks are weights initialized with the trained network from the source city. Frozen layers would keep the transferred weights during the target training.
Figure 9:finetuning與progressive network的架構。綠色區塊是以source city訓練好的網路權重來初始化的。凍結層在target訓練期間會保持轉移過來的權重不變。
:::
:::info
**Progressive Network**: Instead of directly initializing the target network, lateral connection is used to leverage the trained weights as in Figure 9. The parallel network (green blocks) remains the trained weights from the source city. The connection is defined as:
$$
h^{(t)}_i = f\left(W^{(t)}_i h^{(t)}_{i-1} + U^{(c)}_i h^{(s)}_{i-1}\right) \tag{8}
$$
where $h^{(t)}_i$ and $h^{(s)}_i$ denote the outputs of layer $i$ in the target network and the source network, correspondingly. $W^{(t)}_i$ is the weight matrix of layer $i$ of the target network, and $U^{(c)}_i$ is the lateral connection weight matrix from the “green blocks” of the source tasks. $f(\cdot)$ is the activation function.
:::
:::success
**Progressive Network**:不直接初始化target network,我們採用的是像Figure 9所示那般的橫向連接的方式來利用訓練好的權重。parallel network(綠色區塊)的部份仍然是來自source city的訓練權重。而連接的定義為:
$$
h^{(t)}_i = f\left(W^{(t)}_i h^{(t)}_{i-1} + U^{(c)}_i h^{(s)}_{i-1}\right) \tag{8}
$$
其中$h^{(t)}_i$與$h^{(s)}_i$表示target network與source network裡面的layer $i$的輸出。$W^{(t)}_i$則是target network裡面layer $i$的權重矩陣,$U^{(c)}_i$是來自source tasks的「綠色區塊」的橫向連接權重矩陣。$f(\cdot)$就是啟動函數(activation function)。
:::
:::info
**CFPT**: Different from the first two methods, CFPT (Figure 10) already separates data "tunnels" during the training on the source city. Green blocks only process correlated features between cities, which are suitable for transfer. After training on the source city, we copy the green blocks to the target network (using the same structure). The lateral connection remains (8). Notice this structure is unique because for the first two transfer methods, the source city uses the original CVNet structure in Figure 8.
:::
:::success
跟前兩個方法不同,CFPT(Figure 10)在source city的訓練期間就已經把資料「通道」(tunnels)分開。綠色區塊只處理城市之間那些適合遷移(transfer)的相關特徵。在source city訓練之後,我們把綠色區塊複製到target network(使用相同的結構)。橫向連接的部份仍然是公式8。注意到,這個架構比較特別,因為就前兩個遷移方法來看,source city使用的是Figure 8中那個原始的CVNet架構。
:::
:::info
![](https://hackmd.io/_uploads/rJ54G1xcq.png)
Figure 10: For CFPT network, we separate the input space and green blocks are the data tunnel that would process those features suitable for transfer.
Figure 10:在CFPT network中,我們分離了輸入空間,綠色區塊就是用來處理那些適合遷移的特徵的資料通道(data tunnel)。
:::
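:::warning
下面用一小段程式示意公式(8)的單層橫向連接(譯者補充,非論文實作):target分支自己的權重$W^{(t)}_i$加上來自source分支(綠色區塊)前一層輸出的橫向連接$U^{(c)}_i$,再過啟動函數。維度、是否加bias、以ReLU做為$f(\cdot)$等都是假設。
```python
import torch
import torch.nn as nn

class LateralLayer(nn.Module):
    """公式(8)的單層示意:h_i^(t) = f(W_i^(t) h_{i-1}^(t) + U_i^(c) h_{i-1}^(s))。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)               # target 自己的權重 W_i^(t)
        self.U = nn.Linear(in_dim, out_dim, bias=False)   # 橫向連接權重 U_i^(c)
        self.act = nn.ReLU()                              # 假設 f 為 ReLU

    def forward(self, h_target_prev, h_source_prev):
        return self.act(self.W(h_target_prev) + self.U(h_source_prev))

# 使用示意:h_source_prev 來自已訓練好且凍結的 source 分支(綠色區塊),
# target 訓練時只更新 W 與 U。
```
:::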
## C ORDER DISPATCHING SIMULATOR

:::info
We evaluate the proposed algorithms using a complicated and realistic order dispatching simulator. The simulator is able to provide an intuitive assessment for different order dispatching policies. Based on the historical real data of particular dates, the simulator is first initialized by the drivers' status as well as the order information of this date at the beginning of the simulation. Afterwards, the drivers' status are totally determined by the simulator, either fulfilling orders assigned by a certain order dispatching policy, or random walking followed by offline (online) operation according to certain models. Specifically, with the help of a particular order dispatching policy, the simulator periodically performs the order-driver matching using the KM algorithm, where the drivers and orders form the bipartite graph. The meaning of the edge weights of the bipartite graph varies due to the difference of order dispatching policy (e.g., as to the distance-based policy, the weights between drivers and orders indicate the distance between orders' start location and drivers' location). Every time after order dispatching, drivers who are assigned orders (busy drivers) would go to the appointed locations, pick up passengers and carry passengers to the destination. Drivers who miss order assignment (idle drivers) would update their destinations according to a driver movement model. Moreover, before the next order dispatching round, idle drivers could go offline and new drivers may appear online on the platform. Therefore, an online/offline operation is performed according to the online/offline model. Both the driver movement and the online/offline models are generated on the basis of real-world historical data. The workflow of the simulator is depicted in Fig 11.
:::
:::success
我們用一個複雜且貼近真實的訂單分派模擬器來評估所提出的演算法。模擬器可以為不同的訂單分派策略提供一個直觀的[評估](http://terms.naer.edu.tw/detail/659106/)。根據特定日期的歷史實際資料,模擬器在模擬開始的時候會先利用司機的狀態與該特定日期的訂單資訊來初始化。接下來,司機的狀態會完全由模擬器來決定,看它是要透過某一種訂單分派策略來滿足訂單分配,還是根據某一種模型隨機走動,然後做offline(online)的操作。具體來說,在特定的訂單分派策略協助下,模擬器會使用KM演算法定期地執行訂單與司機的匹配,這裡面的司機與訂單會形成bipartite graph([二分圖](http://terms.naer.edu.tw/detail/19321021/))。bipartite graph的邊緣權重(edge weights)會因為不同的訂單分派策略而有不同的意義(舉例來說,如果是distance-based policy,那司機與訂單之間的權重就會是訂單的起始位置與司機的位置之間的距離)。在每次訂單分派之後,被分配到訂單的司機(忙碌的司機)就會到指定的位置,接客上車,然後把客人帶到目的地。沒有分配到訂單的司機(閒置的司機)就會根據司機移動模型(driver movement model)來更新它們的目的地。此外,在下一輪訂單分派之前,閒置的司機可能會offline(下線),新的司機也可能會出現在平台上。因此,這會根據online/offline model來做online/offline的操作。司機移動模型與online/offline模型都是基於實際的歷史資料所生成的。模擬器的工作流程如Figure 11所示。
:::
:::info
![](https://hackmd.io/_uploads/BklHg-ecc.png)
Figure 11: Composition and workflow of the order dispatching simulator.
Figure 11:訂單分派模擬器的組成與工作流程。
:::
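:::warning
最後補充一個簡化的批次派單示意(譯者補充,非論文實作):先依公式(7)計算每一組(司機$i$, 訂單$j$)的效用分數$\rho_{ij}$,再用`scipy.optimize.linear_sum_assignment`(匈牙利演算法,與上面所說的KM演算法等價)求解二分圖匹配。其中的輸入格式、變數命名以及「只保留效用為正的配對」這個門檻都是假設;$\gamma=0.92$則取自附錄A。
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch_batch(R, k, V_i, V_j, U, gamma=0.92, omega=0.0):
    """一個批次派單的示意:R[i, j]為行程費用、k[i, j]為行程時長(以時間步為單位)、
    V_i[i]為司機當前狀態的價值、V_j[j]為訂單目的地狀態的價值、
    U[i, j]為使用者體驗項(例如取負的接駕距離);這些命名都只是示意。"""
    # 公式(7):rho_ij = R_ij*(gamma^k - 1)/(k*(gamma - 1)) + gamma^k * V(s_j) - V(s_i) + omega * U_ij
    rho = (R * (gamma ** k - 1) / (k * (gamma - 1))
           + (gamma ** k) * V_j[None, :] - V_i[:, None]
           + omega * U)
    # linear_sum_assignment 預設求最小成本,取負號改成最大化總效用
    rows, cols = linear_sum_assignment(-rho)
    return [(i, j) for i, j in zip(rows, cols) if rho[i, j] > 0]  # 只保留效用為正的配對(假設性門檻)
```
:::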