# RLHF introduction
Reinforcement learning with human feedback (RLHF) is an approach that combines the strengths of reinforcement learning with human expertise to train AI agents.
The learning process is an iterative interaction between the AI agent and a human expert. Initially, the agent explores the environment and takes actions according to its current policy. The human expert observes the agent's behavior and provides feedback in the form of evaluations and demonstrations.
To be more specific:
Reinforcement Learning with Human Feedback (RLHF) is an approach to reinforcement learning that incorporates feedback from human experts to improve the learning process. In traditional reinforcement learning, an agent learns by interacting with an environment, receiving reward signals, and adjusting its behavior to maximize cumulative rewards. RLHF extends this framework by allowing humans to provide additional feedback to guide the learning process.
The goal of RLHF is to leverage the expertise and knowledge of human trainers to accelerate and refine the learning of the agent. Human feedback can take various forms, such as explicit reward signals, demonstrations, preferences, or critiques. By incorporating this feedback, RLHF aims to address challenges such as sample inefficiency, exploration in complex environments, and safety concerns.
There are different ways to integrate human feedback into reinforcement learning:
**Reward Shaping:** Humans can provide additional reward signals to guide the agent's behavior. For example, they can assign rewards based on desired outcomes or intermediate goals, helping the agent to focus on relevant behaviors and learn more quickly.
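As a minimal sketch of this idea (the `human_bonus` lookup and the numbers are illustrative assumptions, not part of any particular library), the shaped reward is simply the environment reward plus the human-provided term:

```python
# Minimal reward-shaping sketch: the agent optimizes the environment
# reward plus an extra human-provided bonus for intermediate goals.
# `human_bonus` is a hypothetical mapping from (state, action) to a
# shaping term supplied by the human trainer.

def shaped_reward(env_reward, state, action, human_bonus):
    return env_reward + human_bonus.get((state, action), 0.0)

# Example: the trainer rewards moving toward a sub-goal from state 3.
human_bonus = {(3, "right"): 0.5}
print(shaped_reward(-1.0, 3, "right", human_bonus))  # -0.5
```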
**Demonstrations:** Human trainers can provide demonstrations of desired behavior, showing the agent how to perform certain tasks correctly. By observing and imitating these demonstrations, the agent can learn more efficiently and generalize from the provided examples.
**Preference-based Feedback:** Instead of explicit rewards or demonstrations, humans can provide comparative feedback or preferences. They can rank or compare different action sequences or provide pairwise comparisons to guide the agent's decision-making process.
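Pairwise comparisons are often turned into a training signal with a Bradley-Terry-style model: the probability that trajectory A is preferred over trajectory B is the sigmoid of their score difference. The sketch below is illustrative only; the scores and the label convention are assumptions:

```python
import numpy as np

# Bradley-Terry-style preference model: P(A preferred over B)
# is the sigmoid of the difference of their learned scores.
def preference_prob(score_a, score_b):
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))

# Negative log-likelihood of one human comparison, where
# `label` is 1.0 if the human preferred A and 0.0 otherwise.
def preference_loss(score_a, score_b, label):
    p = preference_prob(score_a, score_b)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Example: the human preferred trajectory A, which the model also scores higher.
print(preference_loss(score_a=2.0, score_b=1.0, label=1.0))  # ~0.31
```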
**Critiques and Corrections:** Humans can provide feedback to correct the agent's mistakes or suboptimal actions. By pointing out errors and suggesting improvements, the agent can learn from these corrections and refine its behavior accordingly.
Integrating human feedback into reinforcement learning algorithms requires careful consideration of how to effectively combine and balance the feedback with the existing reinforcement learning mechanisms. Techniques such as reward aggregation, inverse reinforcement learning, or apprenticeship learning are often employed to incorporate human feedback effectively.
RLHF has gained attention due to its potential to address challenges in real-world applications where human expertise is valuable, such as robotics, healthcare, or game playing. By leveraging human feedback, RLHF aims to improve the learning process, reduce exploration time, and ensure safe and reliable behavior of the learning agent.
## primary ways in which human feedback can be incorporated
### Evaluative Feedback
The human expert evaluates the agent's actions or policy and provides feedback on their quality.
form: a scalar reward signal or a ranking of different actions
The algorithm then uses this feedback to update the policy.
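One simple (assumed, not prescribed) way to turn a human ranking of candidate actions into a scalar signal that the update rule can consume:

```python
# Sketch: convert a human ranking of candidate actions into scalar feedback.
# Index 0 is the human's most preferred action; later indices get lower scores.
def ranking_to_scores(ranked_actions):
    n = len(ranked_actions)
    denom = max(n - 1, 1)  # avoid division by zero for a single action
    return {a: (n - 1 - i) / denom for i, a in enumerate(ranked_actions)}

scores = ranking_to_scores(["right", "down", "up", "left"])
print(scores["right"], scores["left"])  # 1.0 0.0
```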
### Demonstrations
The human expert directly provides the desired behaviors or actions in the environment, which serve as examples for the AI agent to learn from. The agent can mimic the demonstrated behavior or use it as a starting point for further exploration and learning.
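In a tabular setting such as the grid project below, one simple (assumed, not prescribed) way to use a demonstration is to warm-start the Q-table with a bonus along the demonstrated state-action pairs before ordinary training begins:

```python
from collections import defaultdict

# Sketch: warm-start a tabular Q-function from a human demonstration.
# `demo` is a hypothetical list of (state, action) pairs the expert showed.
def init_q_from_demo(demo, bonus=1.0):
    q = defaultdict(float)           # Q[(state, action)] defaults to 0.0
    for state, action in demo:
        q[(state, action)] += bonus  # bias the agent toward demonstrated moves
    return q

demo = [((0, 0), "right"), ((0, 1), "down")]
q_table = init_q_from_demo(demo)
print(q_table[((0, 0), "right")])  # 1.0
```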
# project outline
route planning for an agent on a grid map
## objective
train the agent to find an optimized route on the map so that the overall reward is maximized (shorter path, more beneficial transmissions, fewer obstacles)
## environment setting
use grids of different sizes
## reward setting
**for some of the grid sites** (a reward-function sketch follows this list):
* set different kinds of obstacles
* set transmission spots (transport the agent to another site instantly)
* control the moving direction
* give a penalty that grows with the length of the route
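A minimal sketch of such a per-step reward (the constants and the cell labels are illustrative assumptions, not the values used in the project):

```python
# Sketch of a per-step reward for the grid environment described above.
# Cell labels are assumptions for illustration only.
def step_reward(cell_type, step_penalty=-1.0, obstacle_penalty=-10.0, transmit_bonus=2.0):
    if cell_type == "obstacle":
        return obstacle_penalty  # hitting an obstacle is heavily penalized
    if cell_type == "transmission":
        return transmit_bonus    # reaching a transmission spot is rewarded
    return step_penalty          # every ordinary step costs a little,
                                 # so longer routes accumulate more penalty

print(step_reward("empty"))     # -1.0
print(step_reward("obstacle"))  # -10.0
```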
## evaluation
compute the adjusted reward
## core feature
* stochastic rewards and states
* instant transmission between sites
* immediate reward combined with long-term reward
# group presentation requirements
## time
within 13 min
2 min Q&A
## format
can be delivered by either one member or all members
file name: Pre_Group_1.ppt
the first page should contain the names and student IDs of all members
## supplementary requirements
need to rate all of your peer groups
number of slides: fewer than 18
Exceeding the time limit will trigger a score deduction (-0.5 points per 0.5 minutes)
# group presentation delivery
PPT: 谢金妤
## structure
### introduction or background (may include the significance of the study)
王一丹
time: 1-2 min
page number:
### data collection or/and preprocessing
叶峰源
time: 2-3 min
page number:
### methodology, numerical or experimental results
卢诣
time: 6 min
page number:
### conclusion
谢金妤
time: 1 min
page number:
# work distribution
## project design and code
卢诣,叶峰源
### environment setting
| S | 1 | 0 | 0 | 0 | 0 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | T |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
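A minimal sketch of this grid in Python, assuming 'S' marks the start cell, 'T' the target, and '1' a special site (e.g. an obstacle or transmission spot); these meanings are assumptions for illustration:

```python
# 5x6 grid from the table above, kept as strings so the special cells
# ('S' start, 'T' target, '1' special site) stay readable.
GRID = [
    ["S", "1", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "T"],
    ["0", "0", "0", "0", "0", "0"],
    ["0", "0", "0", "0", "0", "0"],
]

N_ROWS, N_COLS = len(GRID), len(GRID[0])
START = (0, 0)   # position of 'S'
TARGET = (2, 5)  # position of 'T'
```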
### algorithm
Q-learning
implement human feedback through evaluative feedback and demonstrations
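A minimal sketch of the tabular Q-learning update with an evaluative human signal folded into the reward (the hyperparameters and the `human_signal` term are illustrative assumptions):

```python
from collections import defaultdict
import random

ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def q_update(state, action, env_reward, next_state,
             human_signal=0.0, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; the human's evaluative feedback
    is simply added to the environment reward."""
    reward = env_reward + human_signal
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```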
## report
王一丹
name requirement: Report_Group_1.pdf
### Abstract
### Background
### Related work
### Method
### Result
### Conclusion
## presentation PPT
谢金妤
name requirement: Pre_Group_1