# A Reinforcement Learning Environment For Job-Shop Scheduling
###### tags: `論文翻譯` `deeplearning` `rl` `Reinforcement Learning`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/2104.03760.pdf)
:::
## Abstract
:::info
Scheduling is a fundamental task occurring in various automated systems applications, e.g., optimal schedules for machines on a job shop allow for a reduction of production costs and waste. Nevertheless, finding such schedules is often intractable and cannot be achieved by Combinatorial Optimization Problem (COP) methods within a given time limit. Recent advances of Deep Reinforcement Learning (DRL) in learning complex behavior enable new COP application possibilities. This paper presents an efficient DRL environment for Job-Shop Scheduling – an important problem in the field.
Furthermore, we design a meaningful and compact state representation as well as a novel, simple dense reward function, closely related to the sparse make-span minimization criteria used by COP methods. We demonstrate that our approach significantly outperforms existing DRL methods on classic benchmark instances, coming close to state-of-the-art COP approaches.
:::
:::success
排程是各種自動化系統應用中的一項基本任務,舉例來說,生產機器(job shop)的最佳化排程可以有效降低生產成本與浪費。不過,要找出這樣的排程通常非常棘手,而且往往無法在給定的時間限制內透過組合最佳化問題(Combinatorial Optimization Problem (COP))的方法求得。深度強化學習(DRL)在學習複雜行為方面的最新進展為COP的應用提供了新的可能性。這篇論文為Job-Shop Scheduling(生產排程)提供一個高效的深度強化學習環境,這是該領域的一個重要問題。此外,我們設計了一個有意義且緊湊的狀態表示(state representation),以及一個創新、簡單的dense reward function,與COP方法所使用的稀疏排程總製造時間(makespan)最小化準則息息相關。我們證明了我們的方法在經典基準實例上顯著優於現有的DRL方法,而且也接近目前最好的COP方法。
:::
## Introduction
:::info
Deep Reinforcement Learning (DRL) has gained an increasing interest in the AI community, thanks to many recent successes like Atari (Mnih et al. 2013), AlphaGo (Silver et al. 2018), or AlphaStar (Vinyals et al. 2019). Most of the approaches in this field focus on well-established benchmarks from the OpenAI gym library or video games (Atari, Starcraft, Dota, etc.). Given these successful applications, researchers soon recognized the potential benefits of DRL in practical domains, such as Combinatorial Optimization Problems (COPs) (Bengio, Lodi, and Prouvost 2021).
:::
:::success
受惠於近期許多像是Atari (Mnih et al. 2013)、AlphaGo (Silver et al. 2018)或AlphaStar (Vinyals et al. 2019)的成功案例,深度強化學習(DRL)在人工智慧社群裡得到愈來愈多的關注。這個領域的多數方法都聚焦在OpenAI gym library或電玩遊戲(Atari、Starcraft、Dota等)這類公認的基準上。鑒於這些成功的應用,研究人員很快地意識到DRL在實務領域上的潛在優勢,像是Combinatorial Optimization Problems (COPs)(Bengio, Lodi, and Prouvost 2021)。
:::
:::info
In this paper, we present a DRL-based approach to Job-Shop Scheduling (JSS), an important problem that was among the first COPs ever studied (J.F. Muth 1966). Its application domains are wide, from the ordering of computation tasks to the scheduling of manufacturing systems. In its classic variant, each instance of JSS comprises two sets of constants representing jobs $\mathcal{J}$ and machines $\mathcal{M}$. Each job $J_i \in \mathcal{J}$ must go through each machine in $\mathcal{M}$ in a specific order denoted as $O_{i1} \to \cdots \to O_{in_i}$, where each element $O_{ij}(1 \leq j \leq n_i)$ is called an operation of $J_i$ with a processing time $p_{ij} \in \mathbb{N}$. The binary relation $\to$ is called the precedence constraint. There is a total of $\vert \mathcal{J} \vert \times \vert \mathcal{M} \vert$ operations to perform, and their executions are non-preemptive. Also, no machine can perform more than one operation simultaneously. Fig. 1 represents a solution for a small instance composed of three jobs and three machines.
:::
:::success
在這篇論文中,我們提出一種基於DRL的方法來處理Job-Shop Scheduling(JSS),這是一個很重要的問題,也是人類史上最早研究的COPs之一(J.F. Muth 1966)。它的應用領域很廣,從計算任務的排序到製造系統的排程調度。在其經典變體中,每個JSS實例(instance)都會包含兩組常數來表示jobs $\mathcal{J}$與machines $\mathcal{M}$。每個job $J_i \in \mathcal{J}$都必須按特定順序經過$\mathcal{M}$裡面的每一台機器,表示為$O_{i1} \to \cdots \to O_{in_i}$,這裡面的每一個元素$O_{ij}(1 \leq j \leq n_i)$都稱做$J_i$的一個操作(operation),其處理時間為$p_{ij} \in \mathbb{N}$。二元關係(binary relation)$\to$稱之為優先順序條件約束(precedence constraint)。總共會有$\vert \mathcal{J} \vert \times \vert \mathcal{M} \vert$個操作要執行,它們的執行模式是不可搶奪型的(non-preemptive)。此外,沒有任何機台可以同一時間執行多個操作。Fig. 1說明了三個工作與三台機器組成的小範例的一個解。
:::
:::warning
* operation:猜測這邊指的是一個製程、工藝,或是有些人說Routing
* non-preemptive:不可搶奪型,又稱為合作式排程。可參考[資訊人筆記](https://www.kshuang.xyz/doku.php/operating_system:course_concept:cpu_scheduling)。
:::
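:::warning
下面用一小段 Python 草稿示意上述 JSS 實例的資料結構(jobs、machines、operations 與處理時間);其中 `jss_instance` 等名稱與數值皆為說明用的假設,並非論文的原始實作。
```python
# 一個 3 jobs x 3 machines 的 JSS 實例草稿(規模對應 Fig. 1,處理時間為假設值)。
# 每個 job 是一串 (machine_id, processing_time) 的 operations,
# 串列順序即 precedence constraint:O_i1 -> O_i2 -> O_i3
jss_instance = {
    0: [(0, 3), (1, 2), (2, 2)],   # J_0
    1: [(1, 4), (0, 1), (2, 3)],   # J_1
    2: [(2, 2), (1, 3), (0, 4)],   # J_2
}

n_jobs = len(jss_instance)
n_machines = len({m for ops in jss_instance.values() for m, _ in ops})
total_operations = n_jobs * n_machines        # 共 |J| x |M| 個 operations
print(n_jobs, n_machines, total_operations)   # 3 3 9
```
:::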
:::info

Figure 1: Example solution for a small instance composed of three jobs and three machines. Each machine sees each job once, and the time spent on each operation varies from one job to another.
Figure 1:三個工作與三台機器組成的小範例的解。每台機器都會處理每個工作一次,每個操作所花費的時間會因工作而異。
:::
:::info
The application of Reinforcement Learning (RL) to JSS provides several advantages. First, it is more flexible than traditional priority dispatching rule heuristics, whose performance can vary significantly from instance to instance. In addition, the development of such heuristics is quite tedious since it requires a lot of specialized knowledge about a scheduling instance to be effective. Unlike classic COP methods, such as linear programming (LP) or constraint programming (CP), RL environments can model stochastic decisions, e.g., random factory outage, non-deterministic job re-entry, random processing time, etc., which simulate the conditions faced by real scheduling systems. Second, in contrast to traditional scheduling methods that focus on the given set of jobs only, RL provides a possibility to incrementally schedule the incoming jobs as they appear in the queue by considering the impact of a schedule for known jobs on the new ones. Finally, the most promising possibility offered by RL is the concept of lifelong learning (Chen and Liu 2018), where an agent will not only learn to optimize one specific JSS instance but reuse what it has learned from previous instances. This property is essential for industrial problems where instances usually share a lot of similarities.
:::
:::success
把Reinforcement Learning (RL)用在JSS上有幾個優點。首先,它比傳統的優先分派規則啟發法(priority dispatching rule heuristics)更為靈活,這類啟發法的效能可能因實例而有顯著差異。此外,這類啟發法的開發相當繁瑣,因為它需要大量關於排程實例的專業知識才能有效。不同於經典的COP方法(像是線性規劃(LP)或是約束規劃(CP)),強化學習的環境可以建模隨機決策,像是隨機的工廠停擺、非確定性的工作重入(re-entry)、隨機的處理時間等等,藉此模擬實際排程系統所面臨的狀況。第二,相對於只關注給定工作集合的傳統排程方法,強化學習提供了一種可能性:透過考慮已知工作(job)的排程對新工作的影響,在新工作出現在佇列時對其做增量排程。最後,強化學習最有前景的可能性是終身學習(lifelong learning)(Chen and Liu 2018)這個概念,agent不只是學會最佳化一個特定的JSS實例,還可以重用從先前的實例中所學到的東西。這樣的特性對於工業問題是不可或缺的,因為工業問題的實例之間通常有很多相似之處。
:::
:::info
In this paper we make the following contributions:
* We propose to model JSS as a single-agent RL problem, where the agent is a dispatcher that needs to choose which job to work on at each step. In particular, we use the actor-critic Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017) to learn a policy, i.e., a function that maps a state to an action probability distribution, and approximate the state-value function, i.e., the expected cumulative reward given a state.
* We design an environment with a meaningful and compact state representation, as well as a novel and simple dense reward function, closely related to the sparse make-span minimization objective of COP methods.
* Experiments on instances that are hard for COP methods under time constraints indicate that the application of RL algorithms in our environment results in solutions close to those of the best scheduling techniques on the market and far better than results reported in the RL literature before.
:::
:::success
這篇論文中,我們做出下面貢獻:
* 我們把JSS建模成single-agent的強化學習問題,agent在裡面扮演調度員的角色,在每個step選擇要處理的工作。特別是,我們使用actor-critic Proximal Policy Optimization(PPO)(Schulman et al. 2017)來學習policy,也就是把一個state映射到action機率分佈的函數,並且近似state-value function,也就是給定state預期得到的cumulative reward(累積報酬)。
* 我們設計出一個有著有意義且緊湊的狀態表示的環境,以及一個創新、簡單的dense reward function,與COP方法所使用的稀疏排程總製造時間(makespan)最小化目標息息相關。
* 在時間限制下對COP方法而言困難的實例上所做的實驗顯示,在我們的環境中應用強化學習演算法所得到的解,接近市場上最好的排程技術的解,而且遠遠優於先前強化學習文獻所回報的結果。
:::
## Background
:::info
In this section, we provide some mathematical background for Markov Decision Processes (MDPs) and RL, give details of the PPO algorithm, and discuss other RL scheduling approaches found in the literature.
:::
:::success
在這一節中,我們會給出一些Markov Decision Processes(MDPs)與強化學習的數學背景,然後詳細說明PPO,再來討論一些文獻中發現的其它強化學習排程方法。
:::
:::warning
這部份關於強化學習的知識並沒有翻譯,可參考在下另外翻譯的[Reinforcement Learning: An Introduction](https://hackmd.io/@shaoeChen/Syp8clcd_)
:::
### Related Work
:::info
Recently, DRL has gained attraction as an end-to-end approach to solve COPs. From the Traveling Salesman Problem over Graph Optimization to the Satisfiability problem, the number of DRL applications is increasing sharply. However, applications of DRL to scheduling problems are more recent and limited.
:::
:::success
最近,DRL作為一種解決COPs的端到端方法受到了廣泛的關注。從旅行推銷員問題(Traveling Salesman Problem)、圖最佳化(Graph Optimization)到[滿足性問題](http://terms.naer.edu.tw/detail/1285827/),DRL應用的數量正在急劇上升。然而,DRL在排程問題上的應用則較新且較為有限。
:::
:::info
One classic approach to model scheduling problems is to consider each machine as an agent. Waschneck et al. (2018) model JSS by a multi-agent system, where agents standing for machines are trained one after the other using Deep QNetwork (DQN) learning. An evaluation of their approach on a non-public instance set for a semiconductor factory shows that it is resilient to disruptions. It cannot beat heuristics but reaches expert-level performance after two days of training. Another multi-agent approach was suggested in (Liu, Chang, and Tseng 2020), where the authors applied Deep Deterministic Policy Gradient to train the agents. On small to midsize instances—maximum of 20 jobs and 15 machines with known optimal solutions—their approach performs better than dispatching rules, even though First-In-First-Out (FIFO) comes very close. Similar to Waschneck et al. (2018), Liu, Chang, and Tseng (2020) focus on flexibility and propose a scenario with random machine breakdowns.
:::
:::success
對這種排程問題建模的一種經典方法就是把每一台機台都視為一個agent。Waschneck et al. (2018)把JSS建模為multi-agent system,其中代表機台的agents用Deep QNetwork (DQN) learning一個接著一個做訓練。從他們在一家[半導體](http://terms.naer.edu.tw/detail/1059067/)工廠的非公開實例集上對該方法所做的評估來看,該方法對干擾(disruptions)是有韌性的。這方法並不能擊敗啟發式方法,不過在兩天的訓練之後可以達到專家級的效能。(Liu, Chang, and Tseng 2020)提出另一種multi-agent的方法,作者用了Deep Deterministic Policy Gradient來訓練agents。在已知最佳解的中小型實例上(最多20個jobs與15台機器),他們的方法比分派規則表現得更好,儘管First-In-First-Out (FIFO)非常接近。類似於Waschneck et al. (2018),Liu, Chang, and Tseng (2020)專注於靈活性,並提出一種隨機機台故障的情境。
:::
:::warning
disruptions:有中斷也有干擾的意思,感覺這邊比較是干擾的意思。
:::
:::info
While the multi-agent approach seems appealing, it can be hard to learn a policy where agents collaborate and share a global vision on the schedule to find good solutions. The disjunctive graph representation of JSS is a modeling approach allowing for such a global vision. Zhang et al. (2020) model the problem by a single agent and exploit the disjunctive graph of a JSS instance to design an embedding of states using a Graph Neural Network. Moreover, the authors use a Multi-Layer Perceptron (MLP) to compute a scalar score for each action based on the graph embedding and apply a softmax function to output an action probability distribution. The advantage of this approach is that the policy network is independent of the size of an instance. It can be trained on small instances and then generalize to bigger ones. Similarly, Han and Yang (2020) use the disjunctive graph to encode states as an image of size |J | × |M| representing the total number of operations in the instance. Channels of the image represent three features: processing time, the schedule at the current time-step, and machine utilization. Such an image is passed to a convolutional neural network that acts as a state-action value function approximator. The action space is defined as a set of 18 different dispatching rules that the agent can select. The reward function, like ours, reflects the impact of the job allocation on machine utilization. However, the disjunctive graph formulation makes computing a reward function, for which Han and Yang (2020) compare the average machine utilization before and after taking action, harder than our representation. Finally, Lin et al. (2019) use a multi-class DQN method to solve JSS instances. Their agent learns to assign one of 7 dispatching rules for each machine. Their evaluation results on small to mid-size instances—maximum of 20 jobs and 20 machines with known optimal solutions—show that, although the suggested approach found better solutions than the individual dispatching rules, they are still far from being optimal.
:::
:::success
雖然這種multi-agent的方法看似具有吸引力,不過它很難學到一種可以讓agents協作並且在排程上共享全域視野來找到好的解決方案的策略。JSS的disjunctive graph representation(分離圖表示)是一種能提供這種全域視野的建模方法。Zhang et al. (2020)就用單一個agent來對這個問題建模,然後利用JSS instance的disjunctive graph(分離圖),以GNN設計狀態的嵌入(embedding)。此外,作者使用Multi-Layer Perceptron (MLP)(多層感知器)基於圖嵌入(graph embedding)為每個action計算一個純量分數(scalar score),然後用softmax function來輸出action的機率分佈。這方法的優點在於,policy network跟實例的大小是無關的。它可以在很小的實例上訓練,然後再泛化到更大的實例上。類似地,Han and Yang (2020)用disjunctive graph(分離圖)把states編碼為大小為$\vert \mathcal{J} \vert \times \vert \mathcal{M} \vert$的影像,對應實例中operations的總數。影像的通道表示三個特徵:處理時間、當前時步的排程、以及機台的稼動率。這樣的影像丟到卷積神經網路,這個網路就做為state-action value function的近似器(approximator)。action space則定義成一組agent可以從中選擇的18種不同的分派規則。reward function跟我們的一樣,反映job的分配對機器稼動率的影響。不過,disjunctive graph(分離圖)的表述方式讓報酬函數的計算(Han and Yang (2020)比較採取action前後的平均機台稼動率)比我們的表示法更為困難。最後,Lin et al. (2019)用了multi-class DQN方法來解JSS instances。他們的agent學習為每台機器分配七種分派規則的其中一種。他們在已知最佳解的中小型實例上(最多20個jobs與20台機器)的評估結果說明了,儘管所提的方法找到比個別分派規則還要好的解,但它們離最佳解仍然很遠。
:::
## Job-Shop Scheduling Environment
:::info
In this section, we first introduce the basics of our JSS environment and explain its internal working. Then we detail our reward function and define the state space. Finally, we present some implementation details of our environment.
:::
:::success
在這一節中,我們首先介紹我們的JSS環境的基礎,並解釋它的內部運作方式。接著我們會細部說明我們的reward function並定義state space。最後,我們會介紹一些環境實作上的細節。
:::
### Basic Environment
:::info
We consider JSS as a single-agent problem, where a dispatcher chooses a job to work on at each time-step. We define the action space as the discrete set of jobs plus an extra action, called No-Op, to go to the next time-step without scheduling any operation: $A = \{J_0, \ldots, J_{\vert \mathcal{J} \vert - 1}, \text{No-Op}\}$.
:::
:::success
我們把JSS想成一個single-agent的問題,其中有一個調度員(dispatcher)在每個time step選擇一個要處理的工作。我們把action space定義成離散的工作集合再加上一個稱為No-Op的額外action,以便在不排定任何操作的情況下進到下一個time step:$A=\left\{J_0, \cdots, J_{\vert \mathcal{J} \vert - 1}, \text{No-Op} \right\}$。
:::
:::info
In some cases, we cannot allocate a job at each step since some jobs may already be allocated to a machine, the machine can be in use, or we have completed all the job’s operations. To comply with these constraints, our environment also uses a Boolean vector indicating which actions are legal. Each time the agent allocates a job, it adds the assigned job to the ordered stack of the next time-step we will visit. When we cannot allocate any further job or decide for No-Op at a certain time-step, we iteratively go to the next time-step in order to allocate jobs again (i.e., a machine becomes free).
:::
:::success
在某些情況下,我們沒有辦法在每個step都分配工作,因為有些工作可能已經分配給某台機器了,機器也可能正在使用中,或者我們已經完成該工作的所有製程(operations)。為了遵守這些約束,我們的環境使用一個布林向量(boolean vector)來指出哪些actions是合法(legal)的。每次agent分配工作的時候,它會把被分配的工作加到我們接下來要走訪的time step的有序堆疊(ordered stack)中。當我們在某個time step無法再分配工作或是決定採取No-Op的時候,我們就會迭代地進入下一個time step,以便再次分配工作(也就是直到有機台空出來)。
:::
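:::warning
以下是一個最小的 Python 草稿,示意 action space(|J| 個 jobs 加上 No-Op)與合法 action 的布林向量可以怎麼表示;變數名稱與數值皆為假設,並非論文的原始程式碼。
```python
import numpy as np
from gym import spaces

n_jobs = 3
NO_OP = n_jobs                                  # 最後一個 action 代表 No-Op
action_space = spaces.Discrete(n_jobs + 1)      # A = {J_0, ..., J_{|J|-1}, No-Op}

# 布林向量標記哪些 action 目前合法(數值為示意)
legal_actions = np.ones(n_jobs + 1, dtype=bool)
legal_actions[1] = False     # 例:J_1 已在某台機器上執行中,暫時不可再分配
legal_actions[NO_OP] = True  # No-Op 是否合法另由後述的規則決定

assert action_space.contains(2)  # J_2 是一個編號合法的 action
```
:::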
:::info
In Fig. 2, we illustrate the internal status of the environment. The vertical red line represents the current time-step. We can allocate job $J_2$ (in yellow) to machine $M_0$ (the machine we will allocate the job to is not given explicitly by the environment). Or, as we have only one legal action, we can go to the next time-step $T_8$ for machine $M_1$ to become free and allocate job $J_0$ to it.
:::
:::success
在Fig. 2中,我們說明環境的內部狀態。垂直的紅線代表當前的time step。我們可以把工作$J_2$(黃色)分配給機器$M_0$(工作要分配給哪一台機器並非由環境明確指定)。或者,由於我們只有一個合法的action,我們可以進到下一個time step $T_8$,讓機器$M_1$變為空閒,再把工作$J_0$分配給它。
:::
:::warning
* 由圖來看,action space是jobs,所以上圖來看,可用的action就剩一個job 2
* 上面提到進到time step $T_8$讓$M_1$變閒置狀態就可以把$J_0$分配給它,不是很明白這個意思,目前可用的action看起來應該是只有$J_2$
:::
:::info

Figure 2: Representation of the environment’s internal state comprising the information about the current time-step, the allocation of jobs to machines, the jobs that can be allocated now and the future time-steps.
Figure 2:環境的內部狀態表示,包含當前的時步(current time step),機台上分配到的工作,目前可分配的工作以及未來的time step等資訊。
:::
### Search-space Reduction
:::info
To achieve good performance in a short time, we need to guide our agent to avoid exploring sub-optimal solutions.
:::
:::success
為了在短時間內得到好的效能,我們要指導agent避免探索次優的解決方案(sub-optimal solutions)。
:::
:::info
Considering two jobs, $J_1$ and $J_2$, to allocate to the same machine, if $J_1$ is at its final operation (i.e., the job has already completed $\vert \mathcal{M} \vert - 1$ operations and needs only one more operation to be finished) and $J_2$ is not, it is always better to allocate $J_2$, as it will still need another machine afterwards. We refer to this optimization as non-final prioritization.
:::
:::success
考慮兩個要分配給同一台機器的工作$J_1$與$J_2$,如果$J_1$正處於它的最終製程(也就是已經完成$\vert \mathcal{M} \vert - 1$個製程,只需要再一個製程就可以完成),而$J_2$不是,這時候分配$J_2$總是比較好的,因為它之後仍然還需要另一台機器。我們將這樣的優化稱之為non-final prioritization。
:::
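:::warning
以下用一個簡單的函式草稿示意 non-final prioritization 的概念:針對同一台機器目前的候選 jobs,若存在非最終製程的 job,就暫時遮蔽已到最終製程的 job。函式與參數名稱皆為說明用的假設。
```python
def non_final_prioritization(candidate_jobs, ops_done, n_machines):
    """candidate_jobs:同一台機器目前可分配的 job 編號;
    ops_done[j]:job j 已完成的 operation 數。示意寫法。"""
    non_final = [j for j in candidate_jobs if ops_done[j] < n_machines - 1]
    return non_final if non_final else candidate_jobs

# 例:J_0 已完成 2/3 個製程(處於最終製程)、J_2 只完成 1/3
print(non_final_prioritization([0, 2], ops_done={0: 2, 2: 1}, n_machines=3))  # [2]
```
:::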
:::info
Compared to other actions, the No-Op action deserves particular attention. This action is complex to learn, as it often leads to worse results than greedy policies (disregarding No-Op and allocating jobs to machines whenever possible) if not used with care. If we do not guide the agent to learn how to use No-Op, it will stop using it and become a greedy agent because the weight of the No-Op’s logit will tend to 0.
:::
:::success
與其它的actions相比,No-Op值得特別的關注。這個action學起來很複雜,因為如果使用不當,它往往會導致比greedy policies(也就是完全不考慮No-Op,只要可以就把工作分配給機器)更糟的結果。如果我們沒有引導agent學會怎麼使用No-Op,agent就會停止使用它,然後變成一個greedy agent,因為No-Op的logit的權重會趨近於0。
:::
:::info
The first rule we apply marks jobs that could be allocated to a machine as illegal when No-Op is used instead, as long as we do not allocate a new job to the machine (i.e., a job that was not legal before the No-Op). We do this because it is always better (or equivalent) to allocate a job $J$ directly at time-step $T$ rather than waiting for $X > 0$ time-units and then allocating J at time-step $T + X$. This restricts the use of the No-Op action, as we insist on the availability of some new job to allocate in the future, or the considered machine will be blocked indefinitely otherwise.
:::
:::success
我們套用的第一條規則是:當使用No-Op的時候,只要我們還沒有分配新的工作給該機器(也就是一個在No-Op之前還不合法的工作),就把原本可以分配給該機器的工作標記為不合法(illegal)。我們這麼做是因為,在time-step $T$直接分配工作$J$總是比較好的(或等效的),而不是先等待$X > 0$個時間單位之後才在time-step $T+X$分配$J$。這限制了No-Op這個action的使用,因為我們要求未來必須有某個新的工作可供分配,否則該機器將會被無限期地阻塞。
:::
:::info
To restrict No-Op further, for each machine, we compute the minimum duration D of each legal job we can schedule. Next, we check whether we will have a new job to allocate to the machine in less than D time-units. This rule is based on the observation that, if we make a pause longer than D, it would have been better to allocate a job of duration D before. Additionally, the new job we need to allocate next should not be at its final operation because waiting for a job rejected by non-final prioritization (see above) is pointless.
:::
:::success
為了更進一步地限制No-Op,對於每一台機器,我們計算可排程的合法(legal)工作當中最短的持續時間$D$。接著,我們檢查在少於$D$個time-units(時間單位)內是否會有新的工作可以分配給該機器。這條規則是基於一個觀察:如果我們暫停的時間比$D$還要久,那麼不如在這之前就分配一個持續時間為$D$的工作。此外,下一個要分配的新工作不應該處於它的最終製程,因為等待一個已被上述non-final prioritization排除的工作是沒有意義的。
:::
:::info
To prevent almost certainly bad uses of No-Op and avoid time-consuming No-Op checks, we disallow the No-Op action if there are (1) four or more machines with some allocatable job, or (2) five or more allocatable jobs (more than 15% of the jobs/machines available in total). If the agent still takes the No-Op action, we iterate over the next time-steps until some new job becomes allocatable. In Fig. 3, we see that the agent learns the potentially deleterious effect of the No-Op action on the solution quality quickly and uses it sparingly towards the end of the training episodes.
:::
:::success
為了防止幾乎可以確定是不好的No-Op用法,也避免耗時的No-Op檢查,我們在下列情況下直接禁用No-Op:(1)有四台或更多機器存在可分配的工作,或(2)有五個或更多可分配的工作(超過總體可用工作/機器數的15%)。如果agent仍然採取No-Op,我們會逐步走訪接下來的time-steps,直到有新的工作變成可分配為止。在Fig. 3中,我們看到agent很快就學到No-Op對解的品質的潛在負面影響,並在訓練episodes的後期謹慎地使用它。
:::
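:::warning
下面把正文中限制 No-Op 的幾條經驗規則整理成一個函式草稿;門檻 4 與 5 取自原文,`next_job_arrives_within_D` 則代表「在 D 個時間單位內是否會有新的非最終製程 job 可分配」,是為了示意而假設的參數。
```python
def no_op_allowed(n_machines_with_legal_job, n_legal_jobs, next_job_arrives_within_D):
    """回傳目前是否允許 No-Op;示意寫法,並非論文的原始實作。"""
    if n_machines_with_legal_job >= 4:   # 規則 (1):太多機器有 job 可分配
        return False
    if n_legal_jobs >= 5:                # 規則 (2):太多 job 可分配
        return False
    # 若等待超過最短合法工作的持續時間 D 都等不到新 job,不如直接分配現有的 job
    return next_job_arrives_within_D
```
:::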
:::info

Figure 3: The agent learns to use the No-Op action sparingly without avoiding it completely. The number of No-Op actions selected was measured using a running average over the last 2000 episodes for each of Taillard’s instances. This plot represents the mean and 95% confidence interval.
Figure 3:agent學會謹慎使用No-Op,而不是完全避免使用它。No-Op被選擇的次數是以Taillard的每個實例(instance)於最後2000個episodes的移動平均來計算。圖表呈現的是平均值與95%的[信賴區間](http://terms.naer.edu.tw/detail/3647392/)。
:::
### Problem Symmetries
:::info
JSS comprises a lot of symmetries that can increase the search space unnecessarily. Our environment considers two of them: (1) interchangeability of allocations at a time-step, and (2) the execution of jobs and the No-Op action on a machine. In the first case, e.g., an agent might allocate a job J1 to machine M1, and then job J2 to machine M2 at the same time-step. This allocation order is symmetric to allocating J2 to M2 first and then J1 to M1. Such symmetry is less pronounced in our environment because the applicability of the No-Op action depends on the number of jobs and machines available, so that the order of allocation can matter. Note that we cannot break this symmetry by ordering the machines and first allocate some job to the machine with the smallest identifier. Such a model leads to a multi-agent system with policy sharing, thus losing the global view and specializing the No-Op action to each individual machine.
:::
:::success
JSS包含很多對稱性(symmetries),而這些對稱性會不必要地擴大搜尋空間。我們的環境考慮了其中兩種:(1)同一個time-step上分配的[可互換性](http://terms.naer.edu.tw/detail/3649892/),(2)在一台機器上工作的執行與No-Op這個action。在第一種情況中,舉例來說,agent也許會分配工作$J_1$給機器$M_1$,然後在同一個time-step把工作$J_2$分配給機器$M_2$。這種分配順序與先把$J_2$分配給$M_2$、再把$J_1$分配給$M_1$是對稱的。這類的對稱性在我們的環境中不那麼明顯,因為No-Op的適用性取決於可用的工作數與機器數,因此分配的順序可能會有影響。注意,我們並不能透過對機器排序、先把工作分配給[標誌符](http://terms.naer.edu.tw/detail/2117533/)最小的機器來打破這種對稱性。這樣的模型會變成一個具有策略共享(policy sharing)的[多代理系統](http://terms.naer.edu.tw/detail/16527512/)(multi-agent system),從而失去全域觀點(global view),也會讓No-Op這個action專門用於每個各別機器上。
:::
:::info
The No-Op action itself introduces some symmetry. If a job is allocated to a machine that needs to wait afterwards, we can either complete the job and wait, or first wait and then allocate the job. As explained in Section , we break such symmetry by marking allocatable jobs as (temporarily) illegal when No-Op is used, which forces either immediate job execution or future allocation of some new job instead.
:::
:::success
No-Op這個action本身引入一些對稱性。如果一個工作被分配到一台之後需要等待的機器上,我們要嘛先完成這個工作然後等待,要嘛先等待然後再分配這個工作。正如前面所述,當使用No-Op的時候,我們透過將可分配的工作標記為(暫時)不合法的方式來打破這種對稱性,這會強制要嘛立即執行工作,要嘛在未來改為分配某個新的工作。
:::
:::info
In addition, there may be intra-machine symmetries, when the execution of two non-final operations of different jobs can be swapped due to waiting times before allocating their respective successor operations. Such dynamic symmetries cannot easily be broken at the time-step of job allocation. Non-final prioritization (see Section ) breaks these symmetries partially, whenever it is equivalent to allocate a final operation first and a non-final job afterwards
:::
:::success
此外,當兩個不同工作的非最終製程,因為其各自的後繼製程分配前的等待時間而可以互換執行順序時,就可能存在機器內的對稱性(intra-machine symmetries)。這類動態對稱性在分配工作的time-step上是不容易被打破的。只要先分配最終製程(final operation)再分配非最終製程的工作(non-final job)是等價的,non-final prioritization(見前節)就能部分地打破這類對稱性。
:::
:::warning
上面說的應該是:當同一台機器上兩個製程在各自的後繼製程被分配之前的閒置時間夠長時,這兩個製程的執行順序其實是可以互換的。
:::
### Reward Function
:::info
The traditional metric to evaluate a solution is the maximum make-span across all the jobs, i.e., the time needed to have finished every job. However, this metric might lead to a sparse reward function, which provides feedback only at the end of the episode. As a result, the agent might struggle to determine the impact of each action it has taken on the global outcome.
:::
:::success
評估解決方案的一個傳統指標是所有工作的最大排程總製造時間(makespan),也就是完成所有工作所需要的時間。不過,這個指標可能會導致稀疏的報酬函數,只在episode結束的時候才給出反饋。因此,agent可能難以確定其所採取的每個action對全域結果的影響。
:::
:::info
To improve learning, we have designed a dense reward function based on the scheduled area. After each action, we compute the difference between the duration of the allocated operations and the introduced holes—the idle time of a machine:
$$
R(s, a) = p_{aj} - \sum_{m \in \mathcal{M}} \text{empty}_m(s, s')
$$
where $s$ is the current state; $a$ the $j^{th}$ operation of $J_a$ with a processing time $p_{aj}$ scheduled (i.e., the action); $s'$ the next state resulting from applying action $a$ to state $s$; $\text{empty}_m(s, s')$ a function returning the amount of time a machine $m$ is IDLE while transitioning from state $s$ to $s'$. Note, a re-scaled reward—$R(s, a)$ divided by the maximum operation length—turned out to be more suitable for training of neural networks in our experiments.
:::
:::success
為了改善學習,我們設計出一個基於排程面積(scheduled area,也就是甘特圖上已排定的區域)的密集報酬函數(dense reward function)。在每個action之後,我們計算所分配製程的持續時間,與新產生的空洞(也就是機台的閒置時間)之間的差值:
$$
R(s, a) = p_{aj} - \sum_{m \in \mathcal{M}} \text{empty}_m(s, s')
$$
其中$s$與$s'$分別為當前的state與下一個state;$a$是$J_a$被排定的第$j$個操作(operation),其處理時間為$p_{aj}$(也就是action);$s'$這個next state是在state $s$選擇action $a$所導致的結果;$\text{empty}_m(s, s')$則是一個函數,回傳機器$m$從state $s$轉移到$s'$期間的閒置時間(IDLE)。注意,在我們的實驗中,重新縮放過的報酬(re-scaled reward),也就是將$R(s,a)$除以最長的操作長度(operation length),更適合用於神經網路的訓練。
:::
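:::warning
以下用一個小函式草稿示意上述 reward 的計算與縮放;`idle_introduced` 表示從 state $s$ 轉移到 $s'$ 期間各機器「新增」的閒置時間,名稱與數值皆為假設。
```python
def reward(p_aj, idle_introduced, max_op_length):
    """R(s, a) = p_aj - sum_m empty_m(s, s'),再除以最長的 operation 長度做縮放。"""
    r = p_aj - sum(idle_introduced.values())
    return r / max_op_length

# 類似正文 Fig. 2 的情境:排定一個長度為 4 的操作、M_2 產生 1 個時間單位的空洞(數值為假設)
print(reward(p_aj=4, idle_introduced={"M_0": 0, "M_1": 0, "M_2": 1}, max_op_length=10))  # 0.3
```
:::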
:::info
For example, in Fig. 2, suppose we allocate job $J_2$ on machine $M_0$. In that case, we get as a positive reward—the length of the operation we have scheduled. Since there is no more available action, the environment will automatically jump to the next time-step. Nevertheless, this action creates a hole on machine M2. As we currently are at time-step T7, and we jump to time-step T8, the negative reward will be 1. We then sum the positive and negative rewards up to get the total reward. Therefore, the reward function results in the minimization of the schedule area on a Gantt chart, i.e., minimize the holes for each machine and maximize machine usage. This is different from minimizing the make-span, but both criteria are closely related—a good solution has as few holes as possible. This relation can be observed in Fig. 4 where we compare the evolution during the training of the mean make-span of a solution found and its cumulative reward. The figure represents an inverse variation relationship: the make-span decreases whenever the reward increases. This experiment provides sufficient evidence for the relation of these two criteria.
:::
:::success
舉例來說,在Fig. 2中,假設我們把工作$J_2$分配到機台$M_0$上。這種情況下,我們會得到一個正報酬(positive reward),也就是我們排定的操作(operation)的長度。因為已經沒有其它可用的action,所以環境就會自動地跳到下一個時步(time-step)。然而,這個action會在機台$M_2$上造成一個空洞(閒置時間)。因為我們現在的時步(time-step)是$T_7$,然後跳到$T_8$,因此,其負報酬(negative reward)就會是1。然後我們再把正報酬與負報酬加總,得到最終的總報酬。因此,這個reward function會讓甘特圖上的排程面積(scheduled area)最小化,也就是最小化每個機台的空洞(閒置)並最大化機台的使用率。這跟最小化makespan是不一樣的,不過兩個準則密切相關:一個好的解決方案會有盡可能少的空洞(閒置時間)。這種相關性可以在Fig. 4中看得出來,我們比較了訓練過程中所找到的解的平均makespan與其累積報酬的演變。這張圖呈現出一種反向變動的關係:只要reward增加,makespan就會減少。這個實驗為這兩個準則之間的關係給出充分的證據。
:::
:::info

Figure 4: There is a clear correlation between the real, sparse objective function (make-span) and our dense reward function. Both metrics were collected on Taillard’s instances using a running average over the last 2000 episodes. The lines represent the means over all instances, and surroundings give the 95% confidence interval.
Figure 4:在實際的稀疏目標函數(makespan)與我們的dense reward function之間有著明顯的相關性。兩個指標都是在[Taillard's instances](https://paperswithcode.com/dataset/taillard-instances)上以最後2000個episodes的移動平均收集而來。線條表示所有實例上的平均值,周圍的區域給出95%的信賴區間。
:::
:::info
Nevertheless, this reward function is not perfect because there is still a delay between an action and its impact. For example, if we can allocate multiple jobs at a certain time-step, then only the last action of this time-step will create holes. As a result, only this action will get negative feedback. We mitigate this issue by using the algorithm that does multiples steps to compute each action’s return. Hence, small delays in the reward will not affect it much as the number of actions at a time-step is relatively small. Another problem is that some choices done in the past can have a huge impact on the future outlook, e.g., force us to create a lot of holes. Unfortunately, this issue is inherent to the complexity of this problem.
:::
:::success
然而,這個reward function並不完美,因為在action與其影響之間仍然存在著延遲。舉例來說,如果我們可以在某個time step分配多個工作(jobs),那麼只有這個time step的最後一個action會產生空洞(閒置時間)。因此,只有這個action會得到負的反饋(feedback)。我們透過使用以多個steps來計算每個action的回報(return)的演算法來減輕這個問題。因此,由於一個time step內的action數量相對較少,報酬中的小延遲不會有太大的影響。另一個問題是,過去所做的一些選擇可能會對後續的局面有巨大的影響,舉例來說,迫使我們製造出很多的空洞(閒置時間)。不幸的是,這個問題源自於此問題本身的複雜性。
:::
### State Representation
:::info
Our state representation is a $(\mathcal{J}, 7)$ matrix, containing, for every row (i.e. every job) 7 attributes: $(a_1)$ A Boolean to represent if the job can be allocated. $(a_2)$ The left-over time for the currently performed operation on the job. This value is scaled by the longest operation in the schedule to be in the range $[0, 1]$. $(a_3)$ The percentage of operations finished for a job. $(a_4)$ The left-over time until total completion of the job, scaled by the longest job total completion time to be in the range $[0, 1]$. $(a_5)$ The required time until the machine needed to perform the next job’s operation is free, scaled by the longest duration of an operation to be in the range $[0, 1]$. $(a_6)$ The IDLE time since last job’s performed operation, scaled by the sum of durations of all operations to be in the range $[0, 1)$. $(a_7)$ The cumulative job’s IDLE time in the schedule, scaled as a6 to be in the range $[0, 1)$. This information is sufficient to reconstruct the current state of the schedule, and so, it respects the Markov condition. However, the reconstructed solution may not be unique but will be equivalent, i.e., leads to the same solution. We have considered other attributes, such as the number of other jobs that need the same next machine and more global information about the process like the number of next time-steps, the number of legal actions, etc. However, the number of attributes needs to stay short to avoid too much state computation time, i.e., the environment became slow, and avoid too much neural network computations, i.e., the agent became slow. After experimentation, we have select these seven attributes out of other possible attributes.
:::
:::success
我們的state representation是一個$(\mathcal{J}, 7)$的矩陣,每個row(也就是每個job)包含七個屬性:$(a_1)$表示該job是否可以被分配的布林值。$(a_2)$該job當前執行的操作(operation)的[剩餘時間](http://terms.naer.edu.tw/detail/18082298/)(left-over time),這個值會以排程中最長的那個操作(operation)縮放至$[0, 1]$之間。$(a_3)$該job已完成的操作百分比。$(a_4)$一直到該job全部完工的剩餘時間,這個值會以最長的工作總完工時間縮放至$[0, 1]$之間。$(a_5)$距離「執行該job下一個操作所需的機器」變為空閒所需的時間,這個值會以最長的操作持續時間縮放至$[0, 1]$之間。$(a_6)$該job自上一個執行的操作(operation)以來的閒置時間,這個值會以所有操作(operation)持續時間的總和縮放至$[0, 1)$之間。$(a_7)$排程中該job所累積的閒置時間,會像$(a_6)$一樣縮放到$[0, 1)$之間。這些資訊足以重構排程的當前狀態(current state),因此,它是符合馬可夫條件的(Markov condition)。然而,重構出來的解可能不是唯一的,但會是等價的,也就是會導向相同的解。我們有考慮過其它的屬性,像是需要同一台下一個機器的其它工作(job)數量,還有關於整個流程更為全域的資訊,像是後續time step的數量、合法的actions數量等等。不過,屬性的數量必須保持精簡,以避免狀態(state)的計算花太多時間(環境會變慢),也避免過多的神經網路計算(agent會變慢)。在實驗之後,我們從其它可能的屬性中挑選出這七個屬性。
:::
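:::warning
以下是一個 observation space 與 state 矩陣的 Python 草稿,對應上述 $(\vert\mathcal{J}\vert, 7)$ 的表示;屬性的數值只是示意,並非論文的原始程式碼。
```python
import numpy as np
from gym import spaces

n_jobs = 30
# 每一列是一個 job 的屬性 a1..a7,皆已縮放到 [0, 1] 或 [0, 1)
observation_space = spaces.Box(low=0.0, high=1.0, shape=(n_jobs, 7), dtype=np.float32)

state = np.zeros((n_jobs, 7), dtype=np.float32)
state[0, 0] = 1.0    # a1:J_0 目前可被分配(布林值)
state[0, 2] = 0.33   # a3:J_0 已完成的製程比例(示意值)
state[0, 3] = 0.85   # a4:J_0 距離完工的剩餘時間(已縮放,示意值)
assert observation_space.contains(state)
```
:::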
:::info
One attractive property of this representation is how it easily allows constructing dispatching rules. For example, the Most Work Remaining (MWKR) is equivalent to take the job with the largest value of the a4 attribute, i.e., the job that has the most left-over time until completion. Similarly, the First In First Out (FIFO) amounts to take the biggest value of the a6 attribute, i.e., the job which was idle for the most time since its last operation.
:::
:::success
這種表示的一個吸引人的特性在於,它可以很輕易地建構出分派規則(dispatching rules)。舉例來說,Most Work Remaining(MWKR,最大剩餘工作量)就相當於選擇$a_4$屬性值最大的那個job,也就是距離完工剩餘時間最多的那個工作(job)。同樣地,First In First Out (FIFO,先進先出)就相當於選擇$a_6$屬性值最大的job,也就是從上次的操作(operation)之後閒置最久的那個job。
:::
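:::warning
下面的草稿示意如何直接從 state 矩陣的屬性建構出 MWKR 與 FIFO 兩個分派規則;`legal_jobs` 假設是長度為 $\vert\mathcal{J}\vert$ 的布林向量(不含 No-Op),函式名稱為說明用的假設。
```python
import numpy as np

def mwkr(state, legal_jobs):
    """MWKR:在合法 job 中選 a4(距離完工的剩餘時間)最大者。"""
    scores = np.where(legal_jobs, state[:, 3], -np.inf)
    return int(np.argmax(scores))

def fifo(state, legal_jobs):
    """FIFO:在合法 job 中選 a6(自上一個 operation 後的閒置時間)最大者。"""
    scores = np.where(legal_jobs, state[:, 5], -np.inf)
    return int(np.argmax(scores))
```
:::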
### Problem Specification
:::info
An episode is composed of a lot of steps, i.e., $\mathcal{O}(\vert \mathcal{J} \vert \times \vert \mathcal{M} \vert)$ and because of the No-Op operation, this is a lower bound. As the choice at the first steps can significantly impact the solution’s outlook, the agent needs to use the experience it has learned during previous iterations to improve the next one. More formally, the current state contains a lot of information from the previous step, and one cannot recover from a bad state.
:::
:::success
一個episode是由很多個步驟所組成,即$\mathcal{O}(\vert \mathcal{J} \vert \times \vert \mathcal{M} \vert)$,而且因為No-Op這個action,這只是一個下限(lower bound)。由於前面幾個steps的選擇會顯著影響解的樣貌,所以agent需要利用先前迭代中學到的經驗來改進下一次的迭代。更正式地說,當前狀態(current state)包含大量來自先前步驟的資訊,而且沒有辦法從一個不好的狀態中復原。
:::
:::info
We have applied some optimizations to make the environment fast enough. On one hand, we tried to keep the state representation as small as possible. On the other, various code improvements were made to speed up the step computation. Thus, we keep the state representation in memory and update only the attributes that need to be updated at each step rather than recomputing them.
:::
:::success
我們做了一些優化,讓環境足夠快。一方面,我們試著讓狀態表示(state representation)盡可能地小。另一方面,我們也做了許多程式碼上的改進來加速每個step的計算。因此,我們把狀態表示保留在記憶體中,每個step只更新需要更新的屬性,而不是整個重新計算。
:::
## Method
:::info
In this section we provide details of our approach and describe its implementation.
:::
:::success
在本節中,我們會詳細介紹我們提出的方法並說明它的實作。
:::
### Action Selection
:::info
Our environment gives us a tabular state represented by a matrix. To compute the action distribution and the state-value estimation, we flatten this matrix into a vector and pass it to the agent MLPs.
:::
:::success
我們的環境給出一個由矩陣表示的表格型狀態(tabular state)。為了計算action distribution與state-value的估測值,我們把這個矩陣平展(flatten)成一個向量(vector),然後傳給agent的MLPs。
:::
:::info
As explained previously, our environment has an additional specificity compared to a pure MDP model because we also get a mask of legal actions our agent can perform. Different techniques can be applied to overcome this. One is to give a negative reward if the agent takes one illegal action and hopes it will learn to recognize them. This approach yields poor performance because it makes the problem even harder, as the agent has to learn at the same time to differentiate between a legal/illegal action and a good/bad action.
:::
:::success
如前所述,相比於pure MDP model,我們的環境有著額外的特性,因為我們還會得到一個agent可執行的legal actions的mask。這可以用不同的技術來克服。其中一種方法是在agent採取illegal action時給予負報酬(negative reward),並希望它能學會辨識這些action。這種方法的效能很糟,因為它讓問題變得更難,agent必須同時學會區分legal/illegal action與good/bad action。
:::
:::info
Our approach is to apply a mask on the output of a neural network to transform values of illegal action into a small negative number, i.e., the smallest representable number. It makes the illegal action probability very close to 0 when we apply the softmax function. This technique has already been studied in previous work and seems to yield the best results(Huang and Ontanon 2020).
:::
:::success
我們的方法是在神經網路的輸出上套用這個mask,將illegal action的值轉成一個極小的負值,也就是可表示的最小數值。如此一來,當我們套用softmax function時,illegal action的機率就會非常接近0。這種技術在先前的研究中已經被探討過,而且似乎能得到最好的結果(Huang and Ontanon 2020)。
:::
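:::warning
以下是 logit 遮罩的一個 TensorFlow 草稿:把 illegal action 的 logit 加上 float32 可表示的最小值,softmax 之後其機率即趨近於 0。這只是示意寫法,並非 RLLib 內部的原始實作。
```python
import tensorflow as tf

def mask_logits(logits, legal_actions):
    mask = tf.cast(legal_actions, tf.float32)                 # legal -> 1.0, illegal -> 0.0
    inf_mask = tf.maximum(tf.math.log(mask), tf.float32.min)  # legal -> 0, illegal -> float32 最小值
    return logits + inf_mask

probs = tf.nn.softmax(mask_logits(tf.constant([1.0, 2.0, 0.5]),
                                  tf.constant([True, False, True])))
# 第二個(illegal)action 的機率趨近於 0
```
:::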
### Training Process
:::info
Usually, the goal of reinforcement learning is to train an agent that can perform a specific task. For our setting, the problem is different, we still want to learn to schedule the production, but we are more interested in the agent’s best solution. To comply with this goal, we keep track of each solution’s make-span found by the agent at the end of the episode. Often in reinforcement learning, we train for a fixed number of steps or episodes, but, in our case, we have a time constraint. We have limited the training time to 10 minutes to have a realistic approach close to industrial constraints.
:::
:::success
通常來說,強化學習的目標是訓練一個可以執行特定任務的agent。對我們的設定而言,問題有所不同,我們仍然想學習排程生產,不過我們更感興趣的是agent找到的最佳解。為了達成這個目標,我們持續追蹤agent在每個episode結束時所找到的解的makespan。在強化學習中,我們通常會訓練固定數量的步驟(steps)或episodes,不過在我們的情況中,我們有的是時間限制。我們把訓練時間限制為10分鐘,以貼近工業上實際的限制。
:::
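:::warning
以下是「以 10 分鐘為上限、持續追蹤最佳 makespan」的訓練迴圈草稿;`trainer` 假設是已建好的 RLLib PPO trainer(設定見下一節的草稿),`custom_metrics` 中的 `make_span_min` 則假設由自訂 callback 記錄,並非 RLLib 內建的指標。
```python
import time

TIME_LIMIT = 10 * 60          # 正文的設定:訓練上限 10 分鐘
best_make_span = float("inf")
start = time.time()

while time.time() - start < TIME_LIMIT:
    result = trainer.train()  # 假設 trainer 是已建好的 RLLib PPO trainer
    # 假設透過自訂 callback 把每個 episode 結束時的 make-span 記到 custom_metrics
    episode_best = result.get("custom_metrics", {}).get("make_span_min", float("inf"))
    best_make_span = min(best_make_span, episode_best)

print("best make-span found:", best_make_span)
```
:::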
### Implementation
:::info
We have implemented our approach using RLLib (Liang et al. 2018) and Tensorflow (Abadi et al. 2015). The environment is implemented with OpenAI Gym toolkit.
:::
:::success
我們用RLLib (Liang et al. 2018)與Tensorflow (Abadi et al. 2015)來實現我們的方法。環境則是用OpenAI Gym toolkit來實現。
:::
:::info
We have used WandB (Biewald 2020) Bayesian optimization to perform the hyper-parameter search and log all results provided in this paper. Unfortunately, due to computing resources restriction, we have used only one instance (ta41) to perform our hyper-parameter search and used the best configuration on this unique instance for the whole group. We have not performed extra hyper-parameter searches for Demirkol’s instances and use the same ones we have used for Taillard’s.
:::
:::success
我們用WandB (Biewald 2020)的Bayesian optimization來做超參數搜尋,論文中的所有結果記錄在『[連結](https://wandb.ai/ingambe/PPOJss/sweeps)』。不幸的是,因為計算資源的限制,我們只用一個實例(**ta41**)來做超參數搜尋,然後把在這個實例上找到的最佳配置用於整組實例。我們並沒有對Demirkol's instances做額外的超參數搜尋,而是沿用與Taillard's instances相同的超參數。
:::
### Model and Configuration
:::info
The action selection network $MLP_{\theta_\pi}$ and state-value prediction network $MLP_{\theta_v}$ do not share any layers. Both networks have 2 hidden layers with hidden dimension 319 and ReLU as activation function. For PPO, we set the epochs of updating network to 12, the clipping parameter $\epsilon_{PPO}$ to 0.541, and the coefficient for policy loss and value function to 0.496 and 0.7918. We used a linear scheduler for the learning rate and the entropy coefficient who decay linearly each parameter from $6.831 \times 10^{-4}$ to $7.783 \times 10^{-5}$ and from $2.042 \times 10^{-3}$ to $2.458 \times 10^{-4}$ respectively. We set the discount factor $\gamma$ to 1, and use the Adam optimizer. We do rollouts of size 704, i.e., more than one episode per iteration, and train with mini-batches of the size 33 000.
:::
:::success
選擇action的網路$MLP_{\theta_\pi}$與預測state-value的網路$MLP_{\theta_v}$並沒有共享任何layers。兩個網路都有兩層hidden layers,維度為319,用ReLU做為activation function。PPO的部分,更新網路的epochs設為12,剪裁(clipping)參數$\epsilon_{PPO}$設為0.541,policy loss與value function的係數分別為0.496與0.7918。我們對learning rate與entropy coefficient使用linear scheduler,兩者分別從$6.831 \times 10^{-4}$線性衰減到$7.783 \times 10^{-5}$,以及從$2.042 \times 10^{-3}$線性衰減到$2.458 \times 10^{-4}$。我們把discount factor $\gamma$設為1,並使用Adam optimizer。rollouts的大小為704,也就是每次迭代會包含一個以上的episode,然後以大小為33000的mini-batches做訓練。
:::
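:::warning
下面把正文列出的超參數拼成一份 RLLib(1.x 版 API)的 PPO 設定草稿;`"jss_env"` 是假設已註冊的環境名稱,schedule 的結束步數與 mini-batch 的對應方式也是假設,policy loss 係數 0.496 在標準 RLLib 設定中沒有直接對應的鍵,因此未列出。
```python
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "env": "jss_env",                      # 假設已註冊的 JSS 環境名稱
    "num_sgd_iter": 12,                    # 更新網路的 epochs
    "clip_param": 0.541,                   # PPO 的 epsilon
    "vf_loss_coeff": 0.7918,               # value function 的係數
    "gamma": 1.0,                          # discount factor
    "rollout_fragment_length": 704,
    "train_batch_size": 33000,             # 對應正文的 mini-batch 大小(對應方式為假設)
    "lr_schedule": [[0, 6.831e-4], [10_000_000, 7.783e-5]],             # 線性衰減(結束步數為假設)
    "entropy_coeff_schedule": [[0, 2.042e-3], [10_000_000, 2.458e-4]],  # 線性衰減(結束步數為假設)
    "model": {
        "fcnet_hiddens": [319, 319],       # 兩層 hidden layers,維度 319
        "fcnet_activation": "relu",
        "vf_share_layers": False,          # policy 與 value 網路不共享 layers
    },
}
trainer = PPOTrainer(config=config)
```
:::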
## Experiments
:::info
This section presents our experimental results against public benchmarks.
:::
:::success
本節將呈現我們的實驗與公開基準測試比較的結果。
:::
### Benchmark Instances
:::info
We picked one classical set of benchmark instances provided by Taillard (1993). These well-studied instances are of different sizes with the upper and lower bound solution provided for each instance. We have selected the class of 10 instances with 30 jobs and 20 machines because they are considered as hard (i.e., harder than bigger instances). The optimal solutions are not yet known. Still, the literature provides us a lower and upper bound for each instance.
:::
:::success
我們選擇Taillard (1993)提供的一組經典的基準實例。這些被充分研究的實例有著不同的大小,每個實例都附有上界與下界的[解](http://jobshop.jjvh.nl/index.php)。我們選擇的是具有30個jobs與20台機器的10個實例的類別,因為它們被認為是困難的(也就是比更大的實例還難)。目前還不知道它們的最佳解是什麼。儘管如此,文獻還是給了我們每個實例的上、下界。
:::
:::info
To assess that our environment’s performance does not overfit Taillard’s instance, we also select another set of instances of Demirkol, Mehta, and Uzsoy (1998). We picked the 5 instances with, as Taillard’s instances, 30 jobs and 20 machines where the goal is also to minimize the make-span. Our approach is not constrained by the size of the problem and can be applied to smaller or bigger instances.
:::
:::success
為了評估我們的環境效能是不是會過度擬合於Taillard's instances,我們還選擇另一組實例(Demirkol, Mehta, and Uzsoy (1998))。我們挑選了5個與Taillard's instances一樣是30個jobs與20台機器的實例,目標同樣是最小化排程總製造時間(makespan)。我們的方法並不受問題大小的限制,可以用在更小或更大的實例上。
:::
### Baseline Selection
:::info
To ensure our agent is learning complex dispatching strategy, we compare it against two well-known dispatching rules from the literature. FIFO selects the job that has waited the most, and MWKR the job with the most left-over processing time until completeness. These dispatching rules usually perform quite well as they act greedy (i.e., do not intentionally create holes) and tend to balance job allocation. Dispatching rules are deterministic and run only once. Therefore, they give a solution in less than a second.
:::
:::success
為了確保我們的agent正在學習複雜的分派策略(dispatching strategy),我們將它與文獻中兩個著名的分派規則做比較。FIFO選擇等待最久的job,MWKR選擇剩餘處理時間最長的job。這些分派規則通常表現得還不錯,因為它們的行為是greedy的(也就是不會故意製造空洞(閒置)),而且傾向於平衡工作分配。分派規則是確定性的,而且只會執行一次。因此,它們可以在不到一秒鐘的時間內給出解。
:::
:::info
We have selected the two papers from the literature that evaluate their approaches on the same instances as us. In (Zhang et al. 2020), all the instances we have selected are benchmarked too, so we have added their results to our benchmark. Han and Yang (2020) have only selected 3 out of 15 of the instances we have selected. These results are included and only allow for a partial comparison.
:::
:::success
我們從文獻中選擇了兩篇在與我們相同的實例上評估其方法的論文。在(Zhang et al. 2020)中,我們所選的所有實例也都有被測試,因此我們把他們的結果加入我們的基準比較中。Han and Yang (2020)則只測試了我們所選15個實例中的3個。我們也納入這些結果,但只能做部分的比較。
:::
:::info
To have a good idea about our approach performance, we have include OR-Tools (Perron and Furnon 2019) the best tool on the market used to solve JSS problem (Da Col and Teppan 2019). We also give the upper solution bounds found in the literature to indicate the solution’s qualities.
:::
:::success
為了更好地瞭解我們方法的效能,我們納入了OR-Tools (Perron and Furnon 2019),這是市場上用來解JSS問題的最佳工具(Da Col and Teppan 2019)。我們也列出文獻中的解上界,以顯示解的品質。
:::
### Results
:::info
We conducted a number of experiments on a server with 2 Intel Xeon 6138 CPU and a single Nvidia Titan Xp GPU. Obtained results are presented in Table 1 which shows all found solutions by each approach for every instance of our two datasets. As one can see, our approach performs better than dispatching rules as it found better solutions on each instance. On average, our method finds solutions 11% better make-span than the best MWKR dispatching rule on Taillard’s instances and 12% better on Demirkol’s one.
:::
:::success
我們在配備2顆Intel Xeon 6138 CPU與單張Nvidia Titan Xp GPU的伺服器上做了大量的實驗。得到的結果如Table 1所示,這個表格列出每一種方法在兩個資料集的每個實例上找到的所有解。可以看得出來,我們的方法比分派規則表現得更好,因為它在每個實例上都找到更好的解。平均來看,我們的方法在Taillard's instances上找到的解,就makespan而言比最佳的MWKR分派規則好11%,在Demirkol's instances上則好12%。
:::
:::info
Table 1: Make-span of solutions found by different approaches on Taillard’s and Demirkol’s instances. While we do not achieve the same performance as the best tool on the market (OR-Tools), our approach performs better than dispatching rules and previous RL literature.

:::
:::info
These results are interesting as we have used the same hyper-parameters we have used for Taillard’s instances, and we have not used Demirkol’s instances during the hyperparameters search. However, these two instances share some similarities related to the problem size (i.e., they have the same number of operations to plan). It indicates that our approach generalizes well, as it exhibits similar performances on instances from different datasets.
:::
:::success
這些結果很有趣,因為我們使用了與Taillard's instances相同的超參數,而且在超參數搜尋過程中並沒有使用Demirkol's instances。然而,這兩組instances在問題大小上有一些相似之處(也就是需要規劃的operation數量相同)。這說明我們的方法有著很好的泛化性,因為它在來自不同資料集的實例上有著相似的效能。
:::
:::info
Even static dispatching rules perform quite well in our environment. This can be explained because they act greedy (i.e., they do not take the No-Op action). The dispatching is done with regard to the currently available jobs at a given time-step, not all the jobs, as it is generally the case with dispatching rules when using the disjunctive graph formulation. They also take advantage of the non-final prioritization rules we have implemented. Outside of our environment, the best dispatching rules MWKR gives an average solution make-span of 3193 (Zhang et al. 2020) on Taillard’s instances compared to the 2449 we have found.
:::
:::success
即使是靜態的分派規則在我們的環境中也可以表現得很好。這可以解釋成是因為它們的行為是greedy的(也就是它們不會採取No-Op這個action)。分派是針對給定time step當下可用的jobs來進行,而不是像使用disjunctive graph formulation的分派規則那樣通常針對所有的jobs。它們也受益於我們實作的non-final prioritization規則。在我們的環境之外,最佳的分派規則MWKR在Taillard's instances上給出的平均排程總製造時間為3193(Zhang et al. 2020),相較之下我們在我們的環境中得到的是2449。
:::
:::warning
disjunctive graph formulation:disjunctive有[析取](http://terms.naer.edu.tw/detail/2114933/)的意思,不過查詢disjunctive graph有人譯為分離圖。
:::
:::info
Although we have not reached the performance of cutting edge OR-Tools CP solver who found on average solutions 7% better on Taillard’s instances and 6% better on Demirkol’s one; this work focus only on designing an efficient environment and uses a standard DRL algorithm who can be further enhanced to tackle scheduling problems.
:::
:::success
雖然我們還沒有達到最先進的OR-Tools CP solver的效能(它找到的解在Taillard's instances上平均好7%,在Demirkol's上好6%),不過這個研究只專注於設計一個高效的環境,並使用標準的DRL演算法,而這個演算法還可以進一步強化以處理排程問題。
:::
:::info
We tried to run our algorithm for more than 10 minutes, but PPO was stuck in a local minimum (i.e., the average solution make-span and the best one converge). Even entropy regularization was not able to prevent this phenomenon.
:::
:::success
我們試著讓演算法執行超過十分鐘,不過PPO會陷入局部最小值(也就是平均解的makespan與最佳解收斂在一起)。即使用了entropy regularization還是無法避免這種現象。
:::
### Comparison with Literature Results
:::info
It is often difficult to compare different RL approaches as the goal/settings/algorithm used are different. We will put these results in perspective in this sub-section.
:::
:::success
由於使用的目標/設定/演算法不同,通常很難直接比較不同的RL方法。我們會在這一小節中對這些結果做比較與解讀。
:::
:::info
In (Zhang et al. 2020), authors have not trained their agents on the benchmark instances but on randomly generated ones, since their goal was to obtain an agent that generalizes beyond the presented instances. This is a different goal than ours, which is more traditional for combinatorial problems as it is instance-specific and does not reuse the knowledge gained from other instances. However, we obtain better results with static dispatching rules on our environment than their RL agents applied without training. This observation highlights the quality of our environment. In fact, our agents found 18% and 20% better solutions than of Zhang et al. on Taillard’s or Demirkol’s instances, respectively.
:::
:::success
在(Zhang et al. 2020)這篇論文中,作者並沒有在基準實例(benchmark instances)上訓練他們的agents,而是用隨機生成的實例來訓練,因為他們的目標是得到一個能夠泛化到所見過的實例以外的agent。這跟我們的目標不一樣,我們的目標對組合問題來說更為傳統,因為它是針對特定實例的(instance-specific),而且不會重用從其它實例獲得的知識。然而,在我們的環境中使用靜態分派規則,得到的結果比他們未經訓練直接套用的RL agents還要好。這個觀察突顯出我們環境的品質。事實上,我們的agents找出的解分別在Taillard's與Demirkol's instances上比Zhang et al.好18%與20%。
:::
:::info
The comparison with (Han and Yang 2020) is fairer as the authors also train their agent on the benchmark instances. In all three of our common instances, our approach outperforms theirs. For the two Taillard’s instances and the unique Demirkol’s instance, we obtain 10% and 5% better solutions, respectively. Nevertheless, they used different RL algorithms (Dueling Double DQN, an off-policy, value-based algorithm) with different settings. Their training is limited to 8000 episodes and not by a time limit.
:::
:::success
與(Han and Yang 2020)的比較則比較公平,因為作者同樣是在基準實例上訓練他們的agent。在三個共同的實例中,我們的方法都優於他們的方法。對於兩個Taillard's instances與唯一的Demirkol's instance,我們分別得到好10%與5%的解。不過,他們使用了不同設定的不同RL演算法(Dueling Double DQN,一種off-policy、value-based的演算法)。他們的訓練限制是8000個episodes,而不是時間限制(我們是十分鐘的時間限制)。
:::
## Conclusions and Future Work
:::info
In this paper, we have built an optimized environment and present an end-to-end DRL based method to automatically learn to dispatch jobs in a time constraint scenario that models industrial exigence. Our benchmark well confirms our method’s superiority to the traditional fixed priority dispatching rules and is state-of-the-art performance with respect to the current limited RL literature. The solutions we were able to find are close to the ones found by one of the best constraint solvers.
:::
:::success
這篇論文中,我們建立了一個最佳化過的環境,並提出一個end-to-end的DRL方法,在模擬工業實際需求的時間限制情境下自動學習分派工作。我們的基準測試充分證實了我們的方法優於傳統固定優先序的分派規則,而且就目前有限的RL文獻而言是state-of-the-art的效能。我們找出的解與市場上最好的約束求解器之一所找到的解相近。
:::
:::info
This paper focuses mainly on building an efficient environment. In future work, we plan to improve agent exploration to avoid PPO being stuck in a local optimum and explore different transfer learning approaches to enhance our method’s performance further.
:::
:::success
這篇論文關注在建構一個高效的環境。未來的工作中,我們計劃改進agent的探索,以避免PPO落入局部最佳,並探索不同的遷移學習(transfer learning)方法以進一步提高我們方法的效能。
:::