<aside> 📌

Paper: Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation https://arxiv.org/pdf/2601.11258

Authors: Pingzhi Tang*, Yiding Wang*, Muhan Zhang

Blog written by: Yiding Wang, incoming PhD student at Peking University.

Estimated Reading Time: 5 min.

</aside>

Current LLMs learn differently from humans. We humans can flexibly absorb new knowledge and tools and put them to use, as long as we have acquired the corresponding manipulation skills. For example, we can internalize a paper we read a few days ago and use its techniques to solve a new research problem. In contrast, even the most powerful LLMs don’t update their parameters at deployment time; they cannot access new knowledge unless it appears in their pre-training data or is provided in context. We are curious about the critical questions that lie within (which might also impede LLMs’ Continual Learning or human-like lifelong learning): Is Supervised Fine-Tuning (SFT) good enough for knowledge incorporation? Can LLMs effectively combine injected knowledge with their inherent skills and solve downstream tasks that depend on the new knowledge?

Previous works have explored how to inject new knowledge into LLMs through SFT. For example, Lampinen et al. found that training on passage implications works better than training on the raw text. SEAL (Zweiger et al.) innovatively meta-trains the LLM itself to generate better self-edits. However, there is still a large gap: on the SQuAD dataset, where LLMs answer questions with near-100% accuracy when the passage is given in context, knowledge-incorporated LLMs (via SFT) achieve less than 50% accuracy without it.

<aside> 💡

Finding 1: SFT-only Causes Reasoning-Knowledge Disconnect

</aside>

We found that LLMs fine-tuned only on new knowledge (e.g., passages with implications, or tool APIs) cannot truly reason over it and solve complex tasks. For example, on SQuAD, although the LLM has been trained on “Directives do not generally give citizens (as opposed to the member state) standing to sue other citizens”, when asked “What generally does not allow citizens to sue other citizens?”, the model cannot locate this knowledge and answer correctly. LLMs do answer more questions correctly after SFT, but the behavior looks more like pattern matching (reciting the SFT’d knowledge when the question closely echoes its wording) than human-like reasoning.

[Figure: 截屏2026-01-22 21.34.37.png]

This disconnect is even more striking in agent tasks. In tool-use agent settings, we want LLMs to call the newly learned APIs effectively to solve real-world tasks. However, as shown in the figure, the SFT-only model fails to handle environment errors and ends up hallucinating.

[Figure: 截屏2026-01-22 22.50.03.png]

From these experiments, we conclude that SFT alone cannot make LLMs internalize and flexibly utilize knowledge the way humans do. What seems to be missing is a “Skill Transfer” step that turns newly injected knowledge into usable capability. RL is central to equipping LLMs with “reasoning” capabilities—modern LLMs go through heavy post-training on high-quality data to boost their downstream task performance. However, running RL for every newly incorporated knowledge entry is impractical because of the cost. This implies that for Continual Learning settings, a mechanism that can efficiently transplant pre-learned skills onto newly acquired knowledge is very important.

<aside> 💡

Finding 2: Orthogonality of Parameter Updates during SFT and RL Stages

</aside>

We first ran preliminary experiments to explore how to adapt RL-learned skills to injected knowledge. We sequentially perform SFT and RL on LooGLE (a long-context dataset): SFT on raw passage snippets plus simple summary/recall texts, then RL on self-synthesized QA pairs. We then extract the delta weights induced by the SFT and RL stages and compute their cosine similarity. Surprisingly, the two updates are highly orthogonal, which might suggest that knowledge and skills are acquired in a decoupled parameter space.

[Figure: 截屏2026-01-23 11.58.44.png]
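
For readers who want to try this check themselves, here is a minimal sketch of the delta-weight cosine similarity, assuming the three checkpoints (base, after SFT, after RL) are available as PyTorch state dicts; the file names and helper functions are illustrative, not the exact code used in the paper.

```python
import torch

def delta(state_a: dict, state_b: dict) -> dict:
    """Per-parameter update: state_b - state_a."""
    return {k: state_b[k] - state_a[k] for k in state_a}

def flat_cosine(update_1: dict, update_2: dict) -> float:
    """Cosine similarity between two updates, flattened into single vectors."""
    v1 = torch.cat([d.flatten().float() for d in update_1.values()])
    v2 = torch.cat([d.flatten().float() for d in update_2.values()])
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

# Illustrative usage with three checkpoints from the sequential run:
# base -> SFT (knowledge injection) -> RL (skill acquisition).
base_sd = torch.load("base.pt")   # hypothetical checkpoint paths
sft_sd = torch.load("sft.pt")
rl_sd = torch.load("rl.pt")

sft_update = delta(base_sd, sft_sd)  # knowledge update
rl_update = delta(sft_sd, rl_sd)     # skill update
print(f"cos(SFT update, RL update) = {flat_cosine(sft_update, rl_update):.4f}")
```

The same comparison can also be done layer by layer (as in the figure above) by computing the cosine similarity per parameter tensor instead of on the flattened whole.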

This separability motivates our approach: instead of running expensive test-time RL to re-acquire the corresponding skills, we can extract a skill vector from a source domain and transplant it to target domains.

<aside> 💡

Our Method: Parametric Skill Transfer (PaST)

</aside>

PaST is a straightforward method. At Stage I (Source-Domain Skill Extraction), we sequentially (or iteratively) run SFT and RL on the source domain and extract the delta weights induced by the RL stage as the Skill Vector. This stage resembles cultivation or pre-practice of the necessary manipulation skills, distilling them into a weight-space vector. At Stage II (Target-Domain Skill Injection), we linearly add the Skill Vector to the target-domain SFT-ed LLM and watch the magic happen. This simulates the ideal test-time learning process: enabling LLMs to use new knowledge with “PaST” skills at low cost.

[Figure: 截屏2026-01-23 12.26.37.png]
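
To make the two stages concrete, here is a minimal sketch of PaST’s weight-space arithmetic, assuming all models share the same architecture and are available as PyTorch state dicts; the scaling coefficient `alpha` and the checkpoint names are illustrative assumptions, not the paper’s exact recipe.

```python
import torch

def extract_skill_vector(source_sft_sd: dict, source_rl_sd: dict) -> dict:
    """Stage I: the Skill Vector is the delta induced by the RL stage
    on top of the source-domain SFT model."""
    return {k: source_rl_sd[k] - source_sft_sd[k] for k in source_sft_sd}

def inject_skill_vector(target_sft_sd: dict, skill_vec: dict, alpha: float = 1.0) -> dict:
    """Stage II: linearly add the Skill Vector to the target-domain SFT-ed weights.
    alpha is an assumed scaling coefficient for the merge."""
    return {k: target_sft_sd[k] + alpha * skill_vec[k] for k in target_sft_sd}

# Illustrative usage (checkpoint paths are hypothetical):
source_sft = torch.load("source_sft.pt")  # SFT on source-domain knowledge
source_rl = torch.load("source_rl.pt")    # + RL on source-domain tasks
target_sft = torch.load("target_sft.pt")  # SFT on new target-domain knowledge only

skill = extract_skill_vector(source_sft, source_rl)
adapted = inject_skill_vector(target_sft, skill, alpha=1.0)
torch.save(adapted, "target_past.pt")     # target model with "PaST" skills, no extra RL
```

Because the injection is a single weight merge, adapting to each new knowledge domain costs essentially nothing beyond its SFT pass—no per-domain RL is needed.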

We briefly show some conclusions in our experiments (see our paper for more details):

<aside> 💡

Conclusion 1: (RL) Skills Are More Important Than Better SFT Edits

</aside>

As shown in Table 1, using only 100 passages (50 × 2) to extract PaST’s Skill Vector and applying it to the baseline “Train on Passage+Synthetic”, PaST achieves up to +17.2 absolute points over this direct baseline. Furthermore, PaST significantly outperforms SEAL, which meta-trains the LLM to generate better SFT edits. This confirms the essential role of skill transfer.