# [2021-12-24] Mr. Shao-Huan Sun, University of Southern California (USC), "Program-Guided Framework for Interpreting and Acquiring Complex Skills with Learning Robots"
Robot learning lies at the intersection of machine learning and robotics: it gives robots the ability to learn. The talk mainly covers RL, meta-learning, and meta reinforcement learning, with an emphasis on robot learning. In manufacturing, robot arms are widely deployed in factories, but they lack the ability to learn or adapt to their environment. If robots are brought into the home, the environment becomes far more complex, so we want robots that can explore their surroundings, infer what new objects are for, accomplish tasks they were never explicitly designed for, and handle everyday chores well.
How would supervised learning be applied to robots? It learns an input-output mapping from given pairs of data, and the learner cannot choose its own data. In robot learning, however, the output affects the input: the robot is an active system, so supervised learning is not a good fit.
The common approach is RL: the agent produces an action from an observation, the action changes the environment state, and the environment returns a reward; the goal is to maximize the return. However, deep RL still has issues for robot learning: it often lacks interpretability (you cannot see where a failure comes from), it generalizes poorly, it is limited to short-horizon tasks, and there is no skill reuse.
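As a reference point, here is a minimal sketch of that interaction loop; the gym-style `reset`/`step` interface and the simplified `(obs, reward, done)` return value are assumptions for illustration, not code from the talk.

```python
# Minimal sketch of the RL interaction loop, assuming a gym-style environment
# whose step() returns (observation, reward, done); illustrative only.
def run_episode(env, policy):
    obs = env.reset()
    total_return, done = 0.0, False
    while not done:
        action = policy(obs)                   # action produced from the observation
        obs, reward, done = env.step(action)   # the action changes the environment state
        total_return += reward                 # RL tries to maximize this return
    return total_return
```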
Hence the proposed program-guided framework: ask the robot to write out a program describing the procedure. This yields interpretability, since the cause of a failure can be pinpointed; it generalizes by running the program it writes (e.g., the same program can handle an arbitrary number of digits); basic skills can be composed hierarchically into complex skills; and different modules handle different sub-problems, which makes diagnosing issues easier.
The pipeline: a skill specification describes what is to be learned, and the agent is asked to generate a program that solves that task; from observations a high-level plan is produced; finally, from the plan, figure out how to make the robot carry out the plan precisely.
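The stages can be outlined roughly as below; every callable is a hypothetical placeholder for a learned component described in the talk, not a real API.

```python
def program_guided_pipeline(skill_spec, observation, synthesize, plan, execute):
    """Hypothetical outline of the program-guided pipeline (placeholders only)."""
    program = synthesize(skill_spec)                 # write a program for the specified skill
    subtasks = plan(program, observation)            # turn the program into a high-level plan
    for subtask in subtasks:
        observation = execute(subtask, observation)  # low-level control realizing each step
    return observation
```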
How is the program generated (program inference)? Via imitation learning, where the robot learns from an expert, either by following demonstrations directly or by fitting them with a neural network; this project instead asks the agent to write a program and execute it so that the robot reproduces the demonstrated behavior. The demonstration is encoded into a demonstration embedding that summarizes the expert's logic, and that embedding is decoded into a program. Compared with a neural-network policy, the program-based approach performs better, because the program also captures the details of the else branches.
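A minimal PyTorch sketch of that encode-then-decode idea: an encoder summarizes the demonstration into an embedding and a decoder emits program tokens conditioned on it. The architecture, layer sizes, and teacher-forcing setup are assumptions for illustration.

```python
import torch.nn as nn

class DemoToProgram(nn.Module):
    """Sketch: summarize a demonstration into an embedding, decode it into a program."""
    def __init__(self, state_dim, vocab_size, hidden=128):
        super().__init__()
        self.demo_encoder = nn.LSTM(state_dim, hidden, batch_first=True)
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.program_decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, demo_states, program_tokens):
        # Encode the demonstration into an embedding summarizing the expert's logic.
        _, (h, c) = self.demo_encoder(demo_states)       # demo_states: (B, T, state_dim)
        # Decode program tokens conditioned on that embedding (teacher forcing).
        emb = self.token_embed(program_tokens)           # program_tokens: (B, L)
        dec_out, _ = self.program_decoder(emb, (h, c))
        return self.out(dec_out)                         # logits over program tokens
```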
But what if demonstrations are hard to obtain and only a reward function is available: how do we generate a program directly? Represent the policy as a program. A single model would have to learn the syntax, how programs relate to the environment, and the desired behavior all at once, which is too much, so a two-stage architecture is used. Stage one learns the syntax and its relation to the environment and produces a program embedding space, trained on randomly generated programs rather than the programs we ultimately want; stage two searches that space for the best program that maximizes the outcome. Stage one uses a VAE to learn the program embedding; stage two uses the Cross-Entropy Method to search for the program maximizing reward: sample candidates around the current estimate, keep the better ones, and move the sampling distribution toward them. In quantitative comparison with deep RL it generally performs better, and its generalization ability includes excellent zero-shot generalization to larger environments.
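A small sketch of the stage-two search: the Cross-Entropy Method over the learned program embedding space. `decode_and_evaluate` is a stand-in for decoding a latent with the stage-one VAE decoder and rolling out the decoded program for its return; all hyperparameters are illustrative.

```python
import numpy as np

def cem_search(decode_and_evaluate, latent_dim=64, iters=20, pop=64, elite_frac=0.1):
    """Cross-Entropy Method over a program embedding space (stage-two sketch)."""
    mu, sigma = np.zeros(latent_dim), np.ones(latent_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate program embeddings around the current estimate.
        samples = mu + sigma * np.random.randn(pop, latent_dim)
        returns = np.array([decode_and_evaluate(z) for z in samples])
        # Keep the elites and move the sampling distribution toward them.
        elites = samples[np.argsort(returns)[-n_elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu  # best latent found; decode it once more to obtain the final program
```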
On interpretability: when a human edits the generated program, the results improve significantly, meaning the programs the robot writes are readable by people and easy to fix.
Program-as-policy is interpretable and generalizable, but only if the program can actually be executed; without an executor that carries it out properly this breaks down, so a program executor must be learned that turns the high-level plan into low-level execution meeting the subtasks. Comparing natural language against programs as instructions for end-to-end agent learning, the agent is split into three modules: execution, perception, and interaction. Programs generalize better than natural language because they are not ambiguous, and having different modules handle different tasks helps further.
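A toy sketch of the modular execution idea: a perception module grounds conditions and an execution module carries out subtasks, so each module handles one kind of problem. The statement format and the two callables are hypothetical.

```python
def execute_program(program, perceive, execute, obs):
    """Toy modular executor: `program` is a list of statements such as
    ("if", "is_river", "build_bridge") or ("do", "mine_gold");
    perceive(condition, obs) -> bool is the perception module and
    execute(subtask, obs) -> obs is the execution module (both hypothetical)."""
    for stmt in program:
        if stmt[0] == "if":
            _, condition, subtask = stmt
            if perceive(condition, obs):     # perception module grounds the condition
                obs = execute(subtask, obs)  # execution module runs the subtask
        elif stmt[0] == "do":
            obs = execute(stmt[1], obs)
    return obs
```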
How is low-level execution generated from a continuous high-level plan? Given basic skills and a high-level plan, generate a series of low-level actions. It turns out the low-level skills cannot simply be chained back-to-back; between policies a buffering stage is needed, a transition policy, which brings the robot into a state suitable for executing the next policy. How is it learned? Execute the transition, then the next policy, label the outcome as success or failure, and learn a predictor from a success buffer and a failure buffer that scores states for the transition policy. A meta policy determines the next skill; the transition policy runs first, then the low-level skill. This is more efficient and reaches higher success rates than other methods, always running the low-level skill only after the transition policy.
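A sketch of the success/failure-buffer idea: a predictor is trained to output 1 on states collected from successful executions and 0 on failures, and its score can then serve as a dense signal for the transition policy. Network sizes, shapes, and the loss below are assumptions.

```python
import torch
import torch.nn as nn

class ProximityPredictor(nn.Module):
    """Scores how suitable a state is as a starting point for the next skill."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # 1 = like success states, 0 = like failures
        )

    def forward(self, state):
        return self.net(state)

def predictor_loss(predictor, success_states, failure_states):
    # success_states / failure_states: (N, state_dim) batches from the two buffers.
    bce = nn.BCELoss()
    return (bce(predictor(success_states), torch.ones(len(success_states), 1)) +
            bce(predictor(failure_states), torch.zeros(len(failure_states), 1)))
```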
How are basic skills learned efficiently? Either use meta-RL to learn a family of skills quickly, or learn from demonstrations. For meta-learning, MAML learns over a task distribution, but it only works for similar tasks and performs poorly on complex ones; the proposed solution is MMAML, which adds a modulation network that embeds the task, determines the type of task, and activates or deactivates neurons, like a two-stage procedure trained end to end. It performs well on multi-task settings and even better on multi-domain image tasks, since shared knowledge can be learned, and it also gives good results in RL. The t-SNE embedding shows the task encoding is learned well and forms clusters. What about meta-learning for long-horizon, sparse-reward tasks? Meta-learning usually focuses on short-horizon, dense-reward settings; a skill-based approach first learns skills and then learns how to concatenate them: learn quickly from offline task data, then meta-learn how to compose the skills, and apply this to real cases. The result converges fast.
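A rough sketch of the modulation idea: a task encoder embeds a few support examples and produces a gating vector that activates or deactivates hidden units of the task network, on top of which the usual gradient-based adaptation would run. Layer sizes and the gating form are illustrative assumptions, not the exact MMAML architecture.

```python
import torch
import torch.nn as nn

class ModulatedTaskNetwork(nn.Module):
    """Sketch of a modulation network gating a task network (MMAML-style idea)."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, out_dim)
        # Modulation network: embed the support set, emit one gate per hidden unit.
        self.task_encoder = nn.GRU(in_dim + out_dim, hidden, batch_first=True)
        self.gate = nn.Linear(hidden, hidden)

    def forward(self, x, support_xy):
        # support_xy: (B, K, in_dim + out_dim), a few (input, target) pairs from the task.
        _, h = self.task_encoder(support_xy)
        tau = torch.sigmoid(self.gate(h[-1]))   # in (0, 1): activate / deactivate units
        h1 = torch.relu(self.l1(x)) * tau       # task-dependent modulation of the hidden layer
        return self.l2(h1)
```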
To learn basic skills efficiently from an expert, one can learn from demonstration or learn from observation: the former needs both states and actions, the latter only states. Why learn from observation? Because expert actions are hard to collect, and when the physical structure (embodiment) differs, the expert's actions cannot be used directly. The results are good.
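One common recipe for the state-only setting (in the spirit of behavioral cloning from observation, used here purely to illustrate why states alone can suffice): learn an inverse dynamics model from the agent's own interaction, use it to infer the expert's actions from consecutive states, then clone them. Names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predicts which action connects a state to the next state."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def label_expert_states(inv_model, expert_states):
    """expert_states: (T, state_dim) state-only demonstration; returns (state, action)
    pairs whose actions are inferred, ready for ordinary behavioral cloning."""
    s, s_next = expert_states[:-1], expert_states[1:]
    with torch.no_grad():
        actions = inv_model(s, s_next)   # inferred actions for each expert transition
    return s, actions
```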
Summary: deep RL has issues, so programs are used to describe the plan, which gives interpretability; separate modules perform better; programs can be inferred from demonstrations; the program then has to be executed, including in continuous spaces; and basic skills can be acquired via meta-learning or learning from observation.
## Note
### This note is a summary of the speaker's talk, with minor opinions of my own. The citation is given below.
## Citation
### Topic: Program-Guided Framework for Interpreting and Acquiring Complex Skills with Learning Robots
### Speaker: Mr. Shao-Huan Sun