ECCV Reading
===

p-module
---

### Neural Task Programming: Learning to Generalize Across Hierarchical Tasks
The module takes the observation and the arguments as input and generates the next program embedding, the arguments, and the termination signal. The observation of a sub-program is a segment of its caller's observation. It is trained with supervision, given the correct program trace. (interface sketch below, after these notes)

### Meta Learning Shared Hierarchies

### Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning

### Learning to Compose Skills
Separately learn skills by training a skill-state embedding, which translates the observation into a skill embedding, along with a shared policy layer, which translates the embedding into actions. Then novel combination layers (e.g. logic-and, logic-while, logic-until) are trained to tackle the new task.

Idea: temporal consistency in the skills, interpretation of the network, action attribute learning

p-vdemo
---

### Deep Q-learning from Demonstrations
Combines temporal-difference updates with supervised classification to pretrain the policy so that it (i) sticks to the demonstrations via the supervised loss and (ii) obeys the Bellman equation via double DQN. (loss sketch below)

### Time-Contrastive Networks: Self-Supervised Learning from Video
They focus on learning a shared representation across the demonstration and the robot environment. The demonstration can then act as a reward in the embedding space to direct the RL of the robot. However, a certain degree of supervision is still required to align the demonstration with the environment (the robot hand in the pouring task, or the human supervision in the imitation task). The reward decreases with the distance in the embedding space between the demonstration and the robot's observation. (reward sketch below)

Problem: besides the synchronized videos from two viewpoints, it also requires videos of the robot itself, which is somewhat problematic (overlap between training and testing).

### Temporal Relational Reasoning in Videos
Similar to the relation network, with ordered frames as objects (code sketch below):
$$
T_d(V) = h^{(d)}_\phi\Big(\sum_{i_1<i_2<...<i_d} g^{(d)}_\theta(f_{i_1}, f_{i_2}, ..., f_{i_d})\Big)
$$

Idea: balancing the shuffled and ordered situations

### Third-person imitation learning. (TCN cited [13])
Two adversarial objectives are used: GAIL is applied to learn the expert policy, while domain confusion is applied to learn a domain-invariant state embedding.

### Unsupervised state representation learning with robotic priors: a robustness benchmark
Given sequences of observation-action-reward tuples, the paper proposes to learn a low-dimensional embedding with a siamese network, using five priors as constraints. Two computationally cheap evaluations are applied to assess the learnt embedding.

Problem: they claim that the learnt embedding can be used for other tasks, but this is not shown.

Similar work: Learning to See by Moving

Unsupervised Perceptual Rewards for Imitation Learning
---

Segment real-world video by the standard deviation of the visual features/frames. Design an unsupervised reward function based on the segmentation. Select the features from a deep network that are correlated with the reward. Do RL with that reward function. (a hypothetical segmentation heuristic is sketched below)

### Unsupervised perceptual rewards for imitation learning. (TCN cited [15])
This figure should be self-explanatory:

![](https://i.imgur.com/UII9sdi.png)

Drawback: tasks/controls are too simple, and the stage-discovery method is hand-crafted (it makes assumptions about the video input).
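Neural Task Programming: a minimal sketch of the core module's interface as described above (observation + current program embedding + arguments in; next program embedding, next arguments, and a termination probability out). The MLP encoder and all dimensions are placeholder assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class NTPCore(nn.Module):
    """Sketch of an NTP-style core module (interface only; internals are assumed)."""
    def __init__(self, obs_dim, prog_dim, arg_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + prog_dim + arg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.next_prog = nn.Linear(hidden, prog_dim)   # embedding used to look up the next sub-program
        self.next_args = nn.Linear(hidden, arg_dim)    # arguments passed to that sub-program
        self.end = nn.Linear(hidden, 1)                # termination signal returned to the caller

    def forward(self, obs, prog_emb, args):
        h = self.encoder(torch.cat([obs, prog_emb, args], dim=-1))
        return self.next_prog(h), self.next_args(h), torch.sigmoid(self.end(h))
```

All three outputs can be supervised directly from the ground-truth program trace, matching the training setup noted above.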
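Deep Q-learning from Demonstrations: a minimal sketch of the pretraining loss described above, combining a double-DQN TD term with a large-margin supervised term that keeps the demonstrated action's Q-value highest. The paper's n-step and L2 regularization terms are omitted, and the margin and weights here are placeholder values.

```python
import torch
import torch.nn.functional as F

def dqfd_loss(q_net, target_net, batch, is_demo, margin=0.8,
              lambda_supervised=1.0, gamma=0.99):
    """Sketch of a DQfD-style loss: TD error + large-margin classification on demo data."""
    s, a, r, s_next, done = batch                      # a: (B,) long, r/done: (B,) float
    q = q_net(s)                                       # (B, num_actions)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    # (ii) double-DQN TD target: action from the online net, value from the target net
    with torch.no_grad():
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
        td_target = r + gamma * (1.0 - done) * q_next
    td_loss = F.smooth_l1_loss(q_sa, td_target)

    # (i) large-margin supervised loss: max_a [Q(s,a) + l(a, a_E)] - Q(s, a_E)
    margins = torch.full_like(q, margin)
    margins.scatter_(1, a.unsqueeze(1), 0.0)           # zero margin for the demonstrated action
    supervised = ((q + margins).max(dim=1).values - q_sa) * is_demo  # mask for mixed batches
    return td_loss + lambda_supervised * supervised.mean()
```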
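Time-Contrastive Networks: a minimal sketch of the embedding-space reward noted above, penalizing the distance between the demonstration's and the robot's embeddings at the same time step. The squared-plus-Huber-style form follows the paper; the coefficients and the small epsilon are placeholders.

```python
import numpy as np

def tcn_reward(demo_embedding, robot_embedding, alpha=1.0, beta=1.0):
    """Reward for time-aligned frames: higher when the robot's TCN embedding
    is closer to the demonstration's embedding."""
    d2 = np.sum((demo_embedding - robot_embedding) ** 2)
    return -alpha * d2 - beta * np.sqrt(d2 + 1e-8)
```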
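Temporal Relational Reasoning: a minimal sketch of the d-frame relation $T_d(V)$ above. It enumerates all ordered frame tuples for clarity, whereas the paper samples tuples and combines multiple scales $d$; the MLP sizes are assumptions.

```python
import itertools
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """T_d(V) = h_phi( sum over i1<...<id of g_theta(f_i1, ..., f_id) )."""
    def __init__(self, feat_dim, d, num_classes, hidden=256):
        super().__init__()
        self.d = d
        # g_theta: fuses an ordered d-tuple of frame features
        self.g = nn.Sequential(nn.Linear(d * feat_dim, hidden), nn.ReLU())
        # h_phi: maps the aggregated relation to class scores
        self.h = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, num_classes))

    def forward(self, frame_feats):                    # frame_feats: (num_frames, feat_dim)
        n = frame_feats.shape[0]
        tuples = [torch.cat([frame_feats[i] for i in idx])
                  for idx in itertools.combinations(range(n), self.d)]  # i1 < ... < id
        rel = self.g(torch.stack(tuples)).sum(dim=0)   # sum over ordered tuples
        return self.h(rel)
```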
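Unsupervised Perceptual Rewards: a hypothetical heuristic in the spirit of the "segment by feature std" note above, not the paper's actual stage-discovery procedure. The window, threshold, and post-processing are all assumptions.

```python
import numpy as np

def segment_by_feature_std(features, window=15, threshold=0.5):
    """features: (T, D) array of per-frame visual features.
    Flags frames where the local feature std spikes as candidate stage boundaries
    (consecutive candidates would still need to be merged into segments)."""
    stds = np.array([features[max(0, t - window):t + window].std()
                     for t in range(len(features))])
    return np.where(stds > threshold * stds.max())[0]
```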
DART: Noise Injection for Robust Imitation Learning
---

DR

Hindsight Experience Replay
---

DR

End-to-End Differentiable Adversarial Imitation Learning
---

DR

GAIL: Generative Adversarial Imitation Learning
---

Motivation: train a discriminator that distinguishes the expert's trajectories from those of the agent being trained. The discriminator's signal is then used to train the agent, as in a GAN. Rather than training a sequence classifier, they train a state-action classifier. (discriminator sketch at the end of these notes)

TRPO: Trust Region Policy Optimization
---

DR

DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
---

Expert labels are needed to augment the dataset when the agent reaches states not covered by the demonstration set. (loop sketch at the end of these notes)

[Imitation Learning Lecture](https://www.youtube.com/watch?v=rOho-2oJFeA)
---

DR

### Online customization of teleoperation interfaces. (TCN cited [14])
DR

### Imitation from observation: Learning to imitate behaviors from raw video via context translation. (TCN cited [16])
DR
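GAIL: a minimal sketch of the state-action discriminator and its GAN-style update described above. The network sizes are placeholder assumptions; the policy itself (TRPO in the paper) is then updated with a reward derived from the discriminator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionDiscriminator(nn.Module):
    """Classifies (state, action) pairs as expert vs. agent-generated."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))     # logit: expert vs. agent

def discriminator_step(disc, opt, expert_s, expert_a, agent_s, agent_a):
    """One GAN-style update: push expert pairs toward 1, agent pairs toward 0."""
    expert_logit = disc(expert_s, expert_a)
    agent_logit = disc(agent_s, agent_a)
    loss = (F.binary_cross_entropy_with_logits(expert_logit, torch.ones_like(expert_logit))
            + F.binary_cross_entropy_with_logits(agent_logit, torch.zeros_like(agent_logit)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The policy is then trained (e.g. with TRPO) to maximize a reward such as
    # -log(1 - sigmoid(D(s, a))) computed from this discriminator.
```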
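DAgger: a minimal sketch of the loop described above: roll out the current learner, have the expert label the states the learner actually visits, aggregate, and retrain. `env`, `expert_policy`, and `train_policy` are hypothetical stand-ins for the environment and the supervised learner.

```python
def dagger(env, expert_policy, train_policy, num_iters=10, rollout_len=200):
    """Sketch of the DAgger loop (hypothetical env/policy interfaces)."""
    dataset = []                                       # aggregated (state, expert_action) pairs
    policy = expert_policy                             # iteration 0 can follow the expert directly
    for _ in range(num_iters):
        state = env.reset()
        for _ in range(rollout_len):
            dataset.append((state, expert_policy(state)))    # expert labels the visited state
            state, _, done, _ = env.step(policy(state))      # but the learner picks the action
            if done:
                break
        policy = train_policy(dataset)                 # supervised fit on the aggregated dataset
    return policy
```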