# Paper Review

[TOC]

:::success
:tada: This summary is complete :tada:
:::

***

## An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm

### Introduction

Q-Learning is one of the best-known and most widely used RL algorithms and belongs to the class of off-policy methods, since its convergence is guaranteed for any agent policy. At the beginning of the training process the Q-Matrix is initialized with random or zero values, and it is updated using (1).

![](https://i.imgur.com/WKIS6MP.png)

The variables in (1) refer to:
* st and st+1: current and next state of the environment.
* at and at+1: current and next action chosen by the agent (according to its policy, e.g. epsilon-greedy).
* γ: discount factor, γ ∈ [0, 1]; how much the agent has to take long-run rewards into account instead of immediate ones.
* α: learning rate, α ∈ [0, 1]; how much the newest piece of knowledge has to replace the older one.
* rt: current reward value.

The algorithm is based on the concept of a Quality Matrix, also known as the Q-Matrix. Its size is N × Z, where N is the number of possible states through which the agent senses the environment and Z is the number of possible actions the agent can perform.

> * throughput: units processed per unit of time
> * multi-agent: multiple interacting intelligent agents that solve a problem which cannot be solved by a single agent
> * off-policy: updates Q-values using the Q-value of the next state s′ and the greedy action a′
> * on-policy: updates Q-values using the Q-value of the next state s′ and the current policy's action a′
> * fixed-point analysis: in mathematics, a fixed-point theorem is a result saying that a function F has at least one fixed point

### Proposed Architecture

Two main blocks: a Policy Generator (PG, e.g. epsilon-greedy) and the Q-Learning accelerator.

![](https://i.imgur.com/nHphPvq.png)

The Delay block is a register used to obtain at and st. The PG is application-defined.

![](https://i.imgur.com/z9RqnEC.png)

The Q-Matrix is stored in the Action RAMs, one RAM (column) per action; each RAM contains one row per state, N in total. The read address is used for st and the write address for st+1. The decoder selects the Q-value to be updated through an enable signal, and the MUX chooses the action so that Q(st, at) is extracted from Q(st, A).

#### Max Block

![](https://i.imgur.com/MIuSA4T.png)

This block is the main speed limitation, therefore it uses a binary-tree architecture. If the propagation delay is too long, the tree can be pipelined.

#### Q-Updater Block

![](https://i.imgur.com/7uxiYq6.png)

It uses the more efficient formulation of
![](https://i.imgur.com/W7zad8F.png)
![](https://i.imgur.com/jdGeuYd.png)
which requires only 2 multipliers and 2 adders.

> Approximated multipliers are a method that replaces the full multipliers with barrel shifters. For z = x·y: ![](https://i.imgur.com/rN6UOAo.png). The method proved to be almost insensitive to the approximation error, since the convergence conditions of Q-Learning are still satisfied (α, γ ≤ 1).
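As a rough software sketch of the Q-Updater idea above (a sketch, not the authors' RTL): the rearranged update Qnew = Q(st, at) + α·(rt + γ·max Q(st+1, A) − Q(st, at)) needs two multiplications and two additions, and if α and γ are approximated by single powers of two, each multiplication collapses into a right shift. The function name, the integer Q-values, and the shift amounts below are illustrative assumptions, not values from the paper.

```python
# Sketch of the 2-multiplier / 2-adder update with shift-based "multipliers".
def q_update_shift(q_sa, q_next_max, reward,
                   alpha_shift=4,   # alpha ~= 2^-4 = 0.0625 (assumed value)
                   gamma_shift=1):  # gamma ~= 2^-1 = 0.5    (assumed value)
    """Integer (fixed-point style) Q-update using shifts instead of multipliers."""
    discounted_max = q_next_max >> gamma_shift   # ~ gamma * max_a Q(s_{t+1}, a)
    td_error = reward + discounted_max - q_sa    # r_t + gamma*max - Q(s_t, a_t)
    return q_sa + (td_error >> alpha_shift)      # Q(s_t, a_t) + alpha * td_error
```

For example, `q_update_shift(100, 120, 16)` returns 98 (TD term 16 + 60 − 100 = −24, shifted right by 4 gives −2). Note that Python's `>>` floors toward minus infinity for negative values; a hardware implementation may truncate or round differently.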
### Implementation Experiments

> Timing constraints define the operating frequency for the clock (or clocks) in the system to be developed.
> The load capacitance is charged and stores energy during 3/16 of all input transitions; this fraction is called the activity factor α. With activity factor α, the dynamic power reduces to P = α·(1/2)·C·V²·f. The worst case is α = 0.5.

*Findings*
* No DSP slices are needed for 8-bit data; 3 DSPs and 5 DSPs are required for 16-bit and 32-bit Q-value data, respectively.
* The clock for a particular data-path bit-width is unaltered; the frequency drops when the number of actions increases.
* The number of LUTs is related to the N×Z size: from N=8 to N=32 it remains roughly the same, while N=64 requires a noticeably higher number of LUTs.
![](https://i.imgur.com/yjivBE5.png)
* Power is proportional to the number of LUTs.
![](https://i.imgur.com/pyGRoQI.png)
![](https://i.imgur.com/7n0ODdF.png)

Effects of Approximated Multipliers (compared with the traditional design, which does not use DSPs):
* They are more power efficient, use less hardware, and are faster than the full multipliers.
* ![](https://i.imgur.com/El8AINp.png)

State-of-the-Art Architecture Comparison: compared to the referenced papers, the proposed design
* is a smaller architecture
* could implement a single block for the computation of Qnew
* allows either distributed RAM or embedded block RAM

> Pipelining: partitioning an instruction into several stages, such as fetch, decode, execute, memory access, and write-back, by inserting latches in the critical path.

## Parallel Implementation of Reinforcement Learning Q-Learning Technique for FPGA

### Introduction

> ![](https://i.imgur.com/v5lsp0k.png)
> naive algorithm: the simplest algorithm; it has the chance to be the worst or the best solution

The low execution time of FPGA devices in comparison with their software counterparts is the main reason for using them as a platform for the development of the Q-learning reinforcement learning technique.

### Main Contribution

The main idea is the development of a modular and parallel architecture that enables either an increase in the algorithm's execution speed or a lower power consumption by decreasing the clock frequency.

### Implementation Description

#### Proposed Architecture Overview

![](https://i.imgur.com/H8f28Dq.png)

The system receives the FPGA clock as input, and the initial state value, s0, must be randomly initialized in the REG1 register. The architecture is composed of five main module types:
* the GA module, responsible for randomly choosing the actions of the algorithm;
* the EN modules, which determine which state-action pair should be updated;
* the RS modules, responsible for storing the reward values;
* the S modules, responsible for the calculation of the Q-value function;
* the SEL module, where the future-state selection and the storage of the Q-value function are performed.

![](https://i.imgur.com/v4kYuA2.png)
![](https://i.imgur.com/iTfQsks.png)

##### GA - Action Draw

This paper does not use an epsilon-greedy policy; instead it uses a pseudo-random number generator.

![](https://i.imgur.com/058d0Ju.png)
![](https://i.imgur.com/9M2yoXS.png)

The modulo operation is performed by wrap-around, but problems can occur when the number of possible actions is not a power of two **(?)**, so the following is proposed.

![](https://i.imgur.com/zU34WAq.png)
![](https://i.imgur.com/FKysScJ.png)

##### EN - Update Module

This module selects the state-action pair that will be updated. Among the N×Z outputs, only one will be high, because when the Q-value is updated only the current state and the chosen action are affected.

![](https://i.imgur.com/RwW15Eb.png)

The logical operation that determines ukn is:

![](https://i.imgur.com/gVUOgKL.png)

(>>) is a logical shift operator and A is a vector of Z zeroes. skn = n means that the condition holds only when the current state (skn) matches the n-th module.

**(?) Isn't a logical right shift of 1 still equal to 1? Answer: the purpose of the right shift is to feed the Qn module, which reads the value bit by bit. Isn't 1 OR 0 equal to 1? Answer: the OR is applied so that, if the bit length has not yet reached log2(Z), it is padded up to log2(Z), so that every action in the Qn module has a defined input.**

##### RS - Reward Function Module

It has N cells (one per state), each holding a vector of Z (action) elements. Actions leading to the goal have a positive value, undesired actions a negative value, and actions leading to other states receive zero.
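A minimal sketch of how such an N×Z reward table could be populated; the sizes, the specific transitions, and the reward constants below are illustrative assumptions, not values from the paper.

```python
# Hypothetical N x Z reward table for the RS modules (all values assumed).
N, Z = 6, 4                       # number of states and actions (example only)
GOAL_REWARD, PENALTY = 100, -10   # example reward values

# rewards[s][a]: positive if action a taken in state s reaches the goal,
# negative if the action is undesired, zero for neutral transitions.
rewards = [[0] * Z for _ in range(N)]
rewards[4][2] = GOAL_REWARD       # e.g. action 2 from state 4 leads to the goal
rewards[1][3] = PENALTY           # e.g. action 3 from state 1 is undesired
# every other (state, action) pair keeps a reward of zero

# In the architecture, each RS_n module would hold one row, rewards[n],
# as a Z-element vector indexed by the drawn action.
```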
##### S - Value Function Calculation Module

Each Sn is separated into two blocks:

![](https://i.imgur.com/gXKQBGP.png)

###### Qn - Value Function Calculation

This module calculates and stores the new Q-value.

![](https://i.imgur.com/WIRq2i7.png)

It computes the value for every possible action in the chosen (enabled) state and has its own register (presumably) to avoid memory reads.

**Since it compares the Q-values of all actions, isn't it possible that the highest Q-value selected does not correspond to the action that was taken? Answer: that is not a problem, because the update does not take the chosen action into account; the action is even drawn independently of the Q-values, which is why the method is called off-policy. It also does not update every possible action, because of the enable signal. Even with the enable, the selected maximum Q-value may not match the drawn action; once again, this is not a problem, since what matters is finding Qmax.**

###### SFn - Future State

Chooses the (registered) next state for a given action in the chosen (enabled) state.

![](https://i.imgur.com/Ez2Igc7.png)

##### SEL - Future State Selection Module

![](https://i.imgur.com/VCHfJ4j.png)

It has three main functions:
* choose the next state (already calculated for the optimal step) from the current state
* choose the maximum Q-value for the chosen future state and multiply it by gamma
* store the Q-value

**Where does it get Sk from? Probably from the feedback of the current state.**

### So, why parallel?

Possibly to reduce memory-access time: instead of looking up the address of a state as in the previous paper, here every state and action has its own register path. However, this is certainly very wasteful in energy, because every path computes even when its output is not accepted by a register. In addition, the parallelism principle itself is of limited use, because in Q-learning the Q-value computation is performed only once per iteration. Second, instead of parallelizing the critical path, such as the Qmax computation, the authors apply parallelism to other elements that do not need it.

![](https://i.imgur.com/9a0e9hu.png)
![](https://i.imgur.com/BbWUH8a.png)

As a comparison, for 4 actions, a similar number of states (6 versus 8), and 16-bit data, the number of LUTs used differs enormously. The hardware speed is also very different: the parallel solution reaches only 43.47 kHz, whereas the previous method reaches 152 MHz. In terms of power, the parallel architecture is more economical, since its throughput is 6.4x lower.

![](https://i.imgur.com/UTumBCE.png)

Nevertheless, the work in this paper still gives good results compared to other works, as shown in the following tables.

![](https://i.imgur.com/Hjjm6pb.png)
![](https://i.imgur.com/5iGmNuh.png)

Another interesting point is that, with a high throughput, even big-data workloads can be processed on the order of milliseconds, which makes real-world applications feasible.

###### tags: `MBKM` `Reinforcement Learning` `SoC`