InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

What is the Problem?

Large language models (LLMs) are typically pre-trained on sequences of limited maximum length (a few thousand tokens), which restricts their ability to process the much longer sequences required in real-world applications such as LLM-driven agents and streaming inputs. Existing solutions often require expensive continual pre-training on longer sequences, which is computationally intensive and can degrade performance on shorter contexts. The challenge is to enable LLMs to efficiently and effectively process extremely long sequences (well beyond their training context window) without any additional training or architectural changes.

Summary

The paper introduces InfLLM, a training-free, memory-based method that enables LLMs to process extremely long sequences by augmenting the standard sliding window attention mechanism with an efficient external context memory. InfLLM stores distant context information in memory units and dynamically retrieves only the most relevant units for each token during attention computation. This approach allows LLMs to capture long-distance dependencies and avoid the distraction caused by irrelevant or noisy contexts, all without any further training. The method is evaluated on challenging long-context benchmarks, demonstrating that LLMs pre-trained on short sequences can achieve performance comparable to or better than models that have undergone costly continual training on long sequences.
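
To make the mechanism concrete, below is a minimal single-head sketch of sliding-window attention augmented with a block-level context memory, in the spirit of InfLLM. The block size, number of representative keys, top-k value, and function names (`split_into_blocks`, `retrieve_units`, `memory_augmented_attention`) are illustrative assumptions rather than the paper's implementation; in particular, representative keys are chosen naively here, whereas InfLLM selects them by how much attention they receive within their unit, and details such as causal masking, multi-head attention, and CPU offloading are omitted.

```python
# Minimal single-head sketch: sliding-window attention plus a block-level
# context memory. All sizes and helper names are illustrative assumptions.
import torch
import torch.nn.functional as F


def split_into_blocks(k, v, block_size):
    """Group distant (evicted) key/value states into fixed-size memory units."""
    n_blocks = k.shape[0] // block_size
    k = k[: n_blocks * block_size].view(n_blocks, block_size, -1)
    v = v[: n_blocks * block_size].view(n_blocks, block_size, -1)
    return k, v


def select_representatives(k_blocks, n_repr):
    """Pick a few representative keys per unit (naively the first n_repr here;
    InfLLM instead scores tokens by the attention they receive within the unit)."""
    return k_blocks[:, :n_repr, :]                         # (n_blocks, n_repr, d)


def retrieve_units(q_chunk, k_blocks, v_blocks, repr_keys, topk):
    """Score each memory unit against the current query chunk and load the top-k."""
    # relevance(unit) = sum over queries and representative keys of q . k
    scores = torch.einsum("qd,brd->b", q_chunk, repr_keys)  # (n_blocks,)
    idx = torch.topk(scores, min(topk, scores.shape[0])).indices
    k_sel = k_blocks[idx].reshape(-1, k_blocks.shape[-1])   # flatten selected units
    v_sel = v_blocks[idx].reshape(-1, v_blocks.shape[-1])
    return k_sel, v_sel


def memory_augmented_attention(q_chunk, k_local, v_local, k_init, v_init,
                               k_blocks, v_blocks, repr_keys, topk):
    """Attend over: initial tokens + retrieved memory units + local window."""
    k_mem, v_mem = retrieve_units(q_chunk, k_blocks, v_blocks, repr_keys, topk)
    k_all = torch.cat([k_init, k_mem, k_local], dim=0)
    v_all = torch.cat([v_init, v_mem, v_local], dim=0)
    attn = F.softmax(q_chunk @ k_all.T / k_all.shape[-1] ** 0.5, dim=-1)
    return attn @ v_all                                      # (chunk_len, d)


if __name__ == "__main__":
    d, block_size, n_repr, topk = 64, 128, 4, 2
    past_k, past_v = torch.randn(4096, d), torch.randn(4096, d)  # distant context
    k_blocks, v_blocks = split_into_blocks(past_k, past_v, block_size)
    repr_keys = select_representatives(k_blocks, n_repr)

    q_chunk = torch.randn(16, d)                                  # current queries
    k_local, v_local = torch.randn(512, d), torch.randn(512, d)   # sliding window
    k_init, v_init = torch.randn(32, d), torch.randn(32, d)       # initial tokens
    out = memory_augmented_attention(q_chunk, k_local, v_local, k_init, v_init,
                                     k_blocks, v_blocks, repr_keys, topk)
    print(out.shape)  # torch.Size([16, 64])
```

The key design point this sketch tries to capture is that units are scored only through a handful of representative keys, so the lookup is cheap, while the full keys/values of the selected units are loaded into the attention computation; this keeps the per-token cost far below full attention over the entire context.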

Key Insights

Notable Design Details/Strengths

Limitations/Weaknesses

Summary of Key Results

Open Questions