Large language models (LLMs) have achieved remarkable success across a variety of tasks. However, they often suffer from limited context window sizes due to high fine-tuning costs, the scarcity of long training texts, and the catastrophic values introduced by new token positions.
To address this issue, in the new paper LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, a Microsoft research team introduces LongRoPE, a pioneering technique that extends the context window of pre-trained LLMs to an impressive 2048k tokens while maintaining performance on the original short context window.
The team identified four major obstacles that prevent further extension of the context window:
- New, untrained position indices introduce many catastrophic values, causing out-of-distribution problems and making fine-tuning hard to converge.
- Fine-tuning usually requires texts of corresponding lengths, but long texts, especially those exceeding 1000k tokens, are scarce.
- Training on very long texts is computationally expensive, requiring prohibitive training time and GPU resources.
- Extending to an extremely long context window degrades performance on the original short context window, because attention is spread thinly across a large number of token positions.
To overcome the first challenge, the team interpolates the RoPE positional embedding, scaling new position indices down into the pre-trained range (a minimal code sketch follows this list). Their empirical study reveals two key findings:
- Effective position interpolation must account for two forms of heterogeneity: across RoPE dimensions and across token positions.
- Incorporating this heterogeneity into position interpolation effectively preserves the information of the original RoPE, particularly in the critical dimensions and token positions.
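To make the non-uniform interpolation concrete, here is a minimal sketch (in plain NumPy, not the authors' code) of how per-dimension rescale factors pull out-of-range position indices back toward the pre-trained range. In LongRoPE the factors come from an efficient search; the `scale` values below are illustrative placeholders only.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=None):
    """Rotary angles for each (position, dimension-pair).

    With scale=None this is standard RoPE; a uniform scale is plain position
    interpolation; a non-uniform (per-dimension) scale mimics the kind of
    interpolation described above. The real factors come from LongRoPE's
    search -- the values used below are placeholders.
    """
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)   # one frequency per dim pair
    if scale is None:
        scale = np.ones(dim // 2)
    # angle[m, i] = (position_m / scale_i) * theta_i: dividing by scale_i maps
    # out-of-range positions back toward the range seen during pre-training.
    return np.outer(positions, theta / scale)

# Example: stretch a model pre-trained on 4k positions to 32k (an 8x extension).
pretrained_len, target_len, dim = 4096, 32768, 128
positions = np.arange(target_len)

# Placeholder per-dimension factors: stretch low-frequency dimensions more and
# leave high-frequency ones closer to 1.0 (in LongRoPE a search decides this).
scale = np.linspace(1.0, target_len / pretrained_len, dim // 2)

angles = rope_angles(positions, dim, scale=scale)
cos, sin = np.cos(angles), np.sin(angles)   # fed into attention as usual
```

Setting a single uniform factor recovers plain position interpolation; LongRoPE's point is that different RoPE dimensions and token positions tolerate different amounts of stretching, which is why the factors are searched rather than fixed.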
Motivated by these findings, the team developed LongRoPE, which successfully extends the LLM context window beyond 2 million tokens through three key innovations (a rough sketch of the overall recipe follows this list):
- Identifying and exploiting the two forms of heterogeneity in position interpolation via an efficient search, which provides a better initialization for fine-tuning and enables an 8x extension without any fine-tuning.
- A progressive extension strategy: first fine-tune a 256k-length LLM, then perform a second position interpolation on the fine-tuned, extended LLM to reach a 2048k context window.
- Readjusting LongRoPE on 8k-length windows to recover performance on the original short context window.
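The three innovations read as a staged recipe, sketched below. Every class and function name here (`Model`, `search_rescale_factors`, `apply_interpolation`, `finetune`) is a hypothetical stand-in introduced for illustration; none of it reflects the actual LongRoPE codebase or API.

```python
from dataclasses import dataclass

@dataclass
class Model:
    context_len: int   # usable context window length, in tokens

def search_rescale_factors(model, target_len):
    """Stand-in for LongRoPE's efficient search over per-dimension
    interpolation factors for a given target window length."""
    return {"target_len": target_len}            # placeholder search result

def apply_interpolation(model, factors):
    """Apply the found factors, extending the usable context window."""
    return Model(context_len=factors["target_len"])

def finetune(model, seq_len):
    """Placeholder: real fine-tuning on texts of length `seq_len` goes here."""
    return model

# Stage 1: search + interpolate to 256k, then fine-tune at that length.
model = Model(context_len=4096)
model = finetune(apply_interpolation(model, search_rescale_factors(model, 256_000)), 256_000)

# Stage 2: a second search + interpolation on the fine-tuned model reaches
# 2048k without further fine-tuning on extremely long texts.
model = apply_interpolation(model, search_rescale_factors(model, 2_048_000))

# Stage 3: re-search the factors on an 8k window and use them when the input
# is short, recovering the original short-context performance.
short_factors = search_rescale_factors(model, 8_000)
```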
Extensive experiments across different LLMs and long-context tasks highlight the effectiveness of LongRoPE. It maintains low perplexity at evaluation lengths from 4k to 2048k tokens, achieves over 90% accuracy on passkey retrieval, and delivers comparable accuracy on standard benchmarks designed for a 4096-token context window. LongRoPE can be applied to any LLM that uses RoPE embeddings.
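As a side note on evaluation, the passkey retrieval test mentioned above can be sketched as follows, assuming only a generic `generate(prompt)` callable for whichever model is being probed; the filler sentence and helper names are illustrative rather than taken from the paper's evaluation code.

```python
import random

def make_passkey_prompt(n_filler_sentences, passkey):
    """Bury a random passkey inside filler text -- the standard passkey-retrieval
    setup used to probe whether a long-context model can recover one fact."""
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    lines = [filler] * n_filler_sentences
    lines[random.randrange(n_filler_sentences)] = (
        f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    )
    return "".join(lines) + "\nWhat is the pass key? The pass key is"

def passkey_accuracy(generate, n_trials=20, n_filler_sentences=10_000):
    """`generate` is any callable(prompt) -> completion string for the model
    under test; scale `n_filler_sentences` up to fill a 2048k-token window."""
    hits = 0
    for _ in range(n_trials):
        passkey = str(random.randint(10_000, 99_999))
        if passkey in generate(make_passkey_prompt(n_filler_sentences, passkey)):
            hits += 1
    return hits / n_trials
```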
The researchers envision that the LongRoPE model will enable numerous new long-context applications and encourage further research in this area.
The code is available at https://github.com/microsoft/LongRoPE. The paper LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens is on arXiv.
Author: Hecate He | Editor: Chain Zhang