# RLHF introduction
Reinforcement learning with human feedback (RLHF) is an approach that combines the strengths of reinforcement learning with human expertise to train AI agents.
The learning process is an iterative interaction between the AI agent and a human expert. Initially, the agent explores the environment and takes actions according to its current policy. The human expert observes the agent's behavior and provides feedback in the form of evaluations and demonstrations.
To be more specific:
Reinforcement Learning with Human Feedback (RLHF) is an approach to reinforcement learning that incorporates feedback from human experts to improve the learning process. In traditional reinforcement learning, an agent learns by interacting with an environment, receiving reward signals, and adjusting its behavior to maximize cumulative rewards. RLHF extends this framework by allowing humans to provide additional feedback to guide the learning process.
The goal of RLHF is to leverage the expertise and knowledge of human trainers to accelerate and refine the learning of the agent. Human feedback can take various forms, such as explicit reward signals, demonstrations, preferences, or critiques. By incorporating this feedback, RLHF aims to address challenges such as sample inefficiency, exploration in complex environments, and safety concerns.
There are different ways to integrate human feedback into reinforcement learning:
**Reward Shaping:** Humans can provide additional reward signals to guide the agent's behavior. For example, they can assign rewards based on desired outcomes or intermediate goals, helping the agent to focus on relevant behaviors and learn more quickly.
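As a minimal sketch of this idea (the `human_bonus` lookup and the numbers are illustrative assumptions, not part of any particular library), the shaped reward is simply the environment reward plus the human-provided term:

```python
# Minimal reward-shaping sketch: the agent optimizes the environment
# reward plus an extra human-provided bonus for intermediate goals.
# `human_bonus` is a hypothetical mapping from (state, action) to a
# shaping term supplied by the human trainer.

def shaped_reward(env_reward, state, action, human_bonus):
    return env_reward + human_bonus.get((state, action), 0.0)

# Example: the trainer rewards moving toward a sub-goal from state 3.
human_bonus = {(3, "right"): 0.5}
print(shaped_reward(-1.0, 3, "right", human_bonus))  # -0.5
```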
**Demonstrations:** Human trainers can provide demonstrations of desired behavior, showing the agent how to perform certain tasks correctly. By observing and imitating these demonstrations, the agent can learn more efficiently and generalize from the provided examples.
**Preference-based Feedback:** Instead of explicit rewards or demonstrations, humans can provide comparative feedback or preferences. They can rank or compare different action sequences or provide pairwise comparisons to guide the agent's decision-making process.
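Pairwise comparisons are often turned into a training signal with a Bradley-Terry-style model: the probability that trajectory A is preferred over trajectory B is the sigmoid of their score difference. The sketch below is illustrative only; the scores and the label convention are assumptions:

```python
import numpy as np

# Bradley-Terry-style preference model: P(A preferred over B)
# is the sigmoid of the difference of their learned scores.
def preference_prob(score_a, score_b):
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))

# Negative log-likelihood of one human comparison, where
# `label` is 1.0 if the human preferred A and 0.0 otherwise.
def preference_loss(score_a, score_b, label):
    p = preference_prob(score_a, score_b)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Example: the human preferred trajectory A, which the model also scores higher.
print(preference_loss(score_a=2.0, score_b=1.0, label=1.0))  # ~0.31
```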
**Critiques and Corrections:** Humans can provide feedback to correct the agent's mistakes or suboptimal actions. By pointing out errors and suggesting improvements, the agent can learn from these corrections and refine its behavior accordingly.
Integrating human feedback into reinforcement learning algorithms requires careful consideration of how to effectively combine and balance the feedback with the existing reinforcement learning mechanisms. Techniques such as reward aggregation, inverse reinforcement learning, or apprenticeship learning are often employed to incorporate human feedback effectively.
RLHF has gained attention due to its potential to address challenges in real-world applications where human expertise is valuable, such as robotics, healthcare, or game playing. By leveraging human feedback, RLHF aims to improve the learning process, reduce exploration time, and ensure safe and reliable behavior of the learning agent.
## primary ways in which human feedback can be incorporated
### Evaluative Feedback
The human expert evaluates the agent's actions or policy and provides feedback on their quality.
form: a scalar reward signal or a ranking of different actions
The algorithm then uses this feedback to update the policy.
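One simple (assumed, not prescribed) way to turn a human ranking of candidate actions into a scalar signal that the update rule can consume:

```python
# Sketch: convert a human ranking of candidate actions into scalar feedback.
# Index 0 is the human's most preferred action; later indices get lower scores.
def ranking_to_scores(ranked_actions):
    n = len(ranked_actions)
    denom = max(n - 1, 1)  # avoid division by zero for a single action
    return {a: (n - 1 - i) / denom for i, a in enumerate(ranked_actions)}

scores = ranking_to_scores(["right", "down", "up", "left"])
print(scores["right"], scores["left"])  # 1.0 0.0
```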
### Demonstrations
The human expert directly provides the desired behaviors or actions in the environment, which serve as examples for the AI agent to learn from. The agent can mimic the demonstrated behavior or use it as a starting point for further exploration and learning.
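In a tabular setting such as the grid project below, one simple (assumed, not prescribed) way to use a demonstration is to warm-start the Q-table with a bonus along the demonstrated state-action pairs before ordinary training begins:

```python
from collections import defaultdict

# Sketch: warm-start a tabular Q-function from a human demonstration.
# `demo` is a hypothetical list of (state, action) pairs the expert showed.
def init_q_from_demo(demo, bonus=1.0):
    q = defaultdict(float)           # Q[(state, action)] defaults to 0.0
    for state, action in demo:
        q[(state, action)] += bonus  # bias the agent toward demonstrated moves
    return q

demo = [((0, 0), "right"), ((0, 1), "down")]
q_table = init_q_from_demo(demo)
print(q_table[((0, 0), "right")])  # 1.0
```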
# project outline
route planning for an agent on a grid map
## objective
train the agent to find an optimized route on the map so that the overall reward is maximized (shorter path, more beneficial transmissions, fewer obstacles)
## environment setting
use grids of different sizes
## reward setting
**for some of the grid sites** (a reward-function sketch follows this list):
* set different kinds of obstacles
* set transmission spots (transport the agent to another site instantly)
* control the moving direction
* give a penalty that grows with the length of the route
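A minimal sketch of such a per-step reward (the constants and the cell labels are illustrative assumptions, not the values used in the project):

```python
# Sketch of a per-step reward for the grid environment described above.
# Cell labels are assumptions for illustration only.
def step_reward(cell_type, step_penalty=-1.0, obstacle_penalty=-10.0, transmit_bonus=2.0):
    if cell_type == "obstacle":
        return obstacle_penalty  # hitting an obstacle is heavily penalized
    if cell_type == "transmission":
        return transmit_bonus    # reaching a transmission spot is rewarded
    return step_penalty          # every ordinary step costs a little,
                                 # so longer routes accumulate more penalty

print(step_reward("empty"))     # -1.0
print(step_reward("obstacle"))  # -10.0
```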
## evaluation
compute the adjusted reward
## core feature
* stochastic rewards and states
* instant transmission between sites
* immediate reward combined with long-term reward
# group presentation requirements
## time
within 13 min
2 min Q&A
## format
can be delivered by either one member or all members
file name: Pre_Group_1.ppt
the first page should contain the names and student IDs of all members
## supplementary requirements
need to rate all of your peer groups
number of slides: fewer than 18
Exceeding the time limit will trigger a score deduction (-0.5 points per 0.5 minutes)
# group presentation delivery
PPT: 谢金妤
## structure
### introduction or background (may include the significance of the study)
王一丹
time: 1-2 min
page number:
### data collection or/and preprocessing
叶峰源
time: 2-3 min
page number:
### methodology, numerical or experimental results
卢诣
time: 6 min
page number:
### conclusion
谢金妤
time: 1 min
page number:
# work distribution
## project design and code
卢诣,叶峰源
### environment setting
| S | 1 | 0 | 0 | 0 | 0 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | T |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
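A minimal sketch of this grid in Python, assuming 'S' marks the start cell, 'T' the target, and '1' a special site (e.g. an obstacle or transmission spot); these meanings are assumptions for illustration:

```python
# 5x6 grid from the table above, kept as strings so the special cells
# ('S' start, 'T' target, '1' special site) stay readable.
GRID = [
    ["S", "1", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "T"],
    ["0", "0", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "0"],
]

N_ROWS, N_COLS = len(GRID), len(GRID[0])
START = (0, 0)   # position of 'S'
TARGET = (2, 5)  # position of 'T'
```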
### algorithm
Q-learning
implement human feedback through evaluative feedback and demonstrations
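A minimal sketch of the tabular Q-learning update with an evaluative human signal folded into the reward (the hyperparameters and the `human_signal` term are illustrative assumptions):

```python
from collections import defaultdict
import random

ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def q_update(state, action, env_reward, next_state,
             human_signal=0.0, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; the human's evaluative feedback
    is simply added to the environment reward."""
    reward = env_reward + human_signal
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```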
## report
王一丹
name requirement: Report_Group_1.pdf
### Abstract
### Background
### Related work
### Method
### Result
### Conclusion
## presentation PPT
谢金妤
name requirement: Pre_Group_1