# Doc: RL Training with LoRA in `rLLM`

Low-Rank Adaptation (LoRA) [[Hu et al. 2021]](https://arxiv.org/abs/2106.09685) is a parameter-efficient finetuning (PEFT) method that plays a key role in modern LLM finetuning: it allows policy updates to be injected into an LLM without modifying its full weight matrices (as full finetuning, FT, does). Instead of updating the original dense parameters (which can be memory-expensive), LoRA factorizes weight updates into low-rank matrices that are trained while the base model remains frozen. We refer interested readers to this [post](https://www.ibm.com/think/topics/lora) for details.

LoRA has long been a popular approach in LLM supervised finetuning (SFT) and has recently seen increased usage in reinforcement learning (RL) following **Thinking Machines Lab's [blogpost](https://thinkingmachines.ai/blog/lora/)**, which empirically argues that [_"LoRA has equivalent performance to full fine-tuning when doing RL"_](https://tinker-docs.thinkingmachines.ai/lora-primer).

Starting from `v0.3`, `rLLM` has **added official support for LoRA-based agentic RL training** through both the `Verl` and `Tinker` backends. In the sections below, we go through the basics of LoRA-based agentic RL training using the example `solver-judge` workflow. For LoRA + SFT training, please see [placeholder link](placeholder).

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/SJ_yeCdebe.png">
<br>
<em>An illustration of inserting LoRA adapter matrices (with rank r) alongside the original weight matrix. Instead of updating the heavy weight matrix W, LoRA trains the (usually much smaller) adapter matrices A and B while keeping W frozen.</em>
</div>
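To make the low-rank update concrete, below is a minimal, illustrative PyTorch sketch of a LoRA adapter wrapped around a frozen linear layer. This is **not** the implementation used by the `rLLM` backends (which handle adapter injection internally); the class `LoRALinear` and its arguments are hypothetical and only mirror the config names used later in this doc. The update `B @ A` starts at zero, is scaled by `lora_alpha / lora_rank`, and is added on top of the frozen layer's output, so only `A` and `B` receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, lora_rank: int = 8, lora_alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weight (and bias)
            p.requires_grad_(False)
        # Low-rank factors: A is (r x in_features), B is (out_features x r).
        # A starts small and random, B starts at zero, so the initial update is zero.
        self.lora_A = nn.Parameter(torch.randn(lora_rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, lora_rank))
        self.scaling = lora_alpha / lora_rank     # the effective scaling discussed below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B are trainable; the base projection stays fixed.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt a single 4096x4096 projection with rank 8.
adapted = LoRALinear(nn.Linear(4096, 4096), lora_rank=8)
```

Because only `A` and `B` (a small fraction of the base parameters at typical ranks) carry gradients and optimizer state, the memory footprint of training drops sharply, which is the main appeal of LoRA in the RL setting discussed below.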
---

## LoRA Training in `rLLM` with `Verl` Backend

### `Verl` Configurations

Below we list the customizable LoRA configurations in `Verl` and give a quick overview of their roles:

- #### `actor_rollout_ref.model.lora_rank`

  Rank of the LoRA update matrices (the "r" in the low-rank decomposition). It controls how many additional trainable parameters are added and thus the expressive capacity of the LoRA adapter: higher rank → more capacity and more memory/compute; lower rank → more parameter-efficient but less expressive.

  **Default value:** 0 (i.e. no LoRA is applied)
  **Options:** a positive integer when LoRA is needed

  > When using `vLLM` for rollout, it is recommended to set `lora_rank <= 512` due to the `max_lora_rank` [restriction](https://github.com/vllm-project/vllm/blob/8a297115e2367d463b781adb86b55ac740594cf6/vllm/config/lora.py#L27).

- #### `actor_rollout_ref.model.lora_alpha`

  LoRA scaling factor that rescales the low-rank update before it is added to the frozen base weight. It effectively controls the magnitude ("strength") of the LoRA adaptation (since the effective scaling is usually `lora_alpha / lora_rank`) and can influence training stability and how aggressively the policy diverges from the base model.

  **Default value:** 16
  **Options:** a positive integer

- #### `actor_rollout_ref.model.target_modules`

  Specifies which linear submodules of the transformer receive LoRA adapters. This controls **where in the network** the trainable low-rank updates are inserted (e.g., attention vs. MLP blocks), trading off adaptation flexibility against parameter count and runtime overhead.

  **Default value:** "all-linear" (i.e. attention + MLP matrices, see options below)
  **Options:** a subset of `[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]` to select individual matrices, or "all-linear" to select all of them

- #### `actor_rollout_ref.model.exclude_modules`

  The names of the modules that should **not** receive the adapter. When a string is passed, it is treated as a regex and matched against module names. When a list of strings is passed, a module is excluded if its name exactly matches, or ends with, any of the entries. See the [PEFT doc](https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig) for details.

  **Default value:** null
  **Options:** a regex string or a list of strings

### Concrete Recipe with `Solver-Judge` Workflow

We can easily enable LoRA training for the `solver-judge` workflow by adding the aforementioned parameters to the script `examples/solver_judge/train_solver_judge_flow.sh` (comments are placed on their own lines so the line continuations stay valid):

```
# lora_rank / lora_alpha can be tuned; target_modules=all-linear is recommended
actor_rollout_ref.model.lora_rank=8 \
actor_rollout_ref.model.lora_alpha=8 \
actor_rollout_ref.model.target_modules=all-linear \
```

> Meanwhile, it is often suggested to make the learning rate (i.e. `actor_rollout_ref.actor.optim.lr`) **10x larger than in full finetuning** under the LoRA setting, since fewer parameters are being trained.
>
> Similarly, it has been argued empirically that LoRA training **"is less tolerant to large batch sizes"**, with a cap sometimes put at 32/64. Ultimately, just like any other hyperparameter, some ablations might be needed.

Below we display the training average reward (`critic/score/mean`), as well as the solver (`val/unknown/solver_acc`) and judge (`val/unknown/judge_acc`) accuracies: we did **two LoRA runs with ranks 8 and 32 on Qwen3-0.6B**, respectively. We set `lora_alpha = 8`, `lr = 1e-5`, `temperature = 0.6`, and `train_batch_size = 16`; other hyperparameters are the defaults.

![critic-score-mean](https://hackmd.io/_uploads/HyKO2-tg-l.png)
![val-solver-acc](https://hackmd.io/_uploads/Hk2Ip-KlZx.png)
![val-judge-acc](https://hackmd.io/_uploads/SJ-PpZtlWx.png)

We see that **both LoRA runs track the full FT run closely** over 150 steps, with the `lora_rank = 8` case almost matching its performance exactly (we also see that a higher LoRA rank does not always correspond to better performance in the RL setting). Meanwhile, LoRA training **enjoys considerably lower GPU memory utilization** by storing far fewer gradients.

![gpu-utilization](https://hackmd.io/_uploads/r1iCbMYeZe.png)

> To the surprise of many, even though LoRA RL matches the performance of full FT while saving GPU memory, it **can be slower than full FT overall due to slow rollout speed**. This has been discussed in `verl`, and the speed issue might potentially be alleviated by implementing the solution [here](https://github.com/volcengine/verl/issues/4033#issuecomment-3497300222) in the future.

---

## LoRA Training in `rLLM` with `Tinker` Backend

TBD