Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons for individual tokens with an additional MLP predictor, which involves frequent changes to activation maps and frequent resource calls, limiting the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence's semantics. Remarkably, we discovered that core neurons exhibit both stability and similarity in relation to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we further design two semantics-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and kept fixed during the decoding stage, enabling zero-cost sparse inference. We evaluated the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN Xp GPU, CoreInfer achieved 10.33x and 2.72x speedups compared to the Hugging Face implementation and PowerInfer, respectively.
We introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. CoreInfer achieves an impressive speedup running OPT-7B (ReLU) on a single NVIDIA TITAN Xp (12 GB). At the same time, CoreInfer performs well on both question-answering and summarization tasks.
Speedup of CoreInfer on question-answering and summarization tasks
In our work, we define token-wise and sentence-wise core neurons. Furthermore, we experimentally show that an LLM only needs to activate one set of core neurons to process an input sentence.
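As a rough illustration, here is a minimal PyTorch sketch of how a sentence-wise core neuron set could be extracted for one FFN layer. The ranking criterion (per-neuron firing frequency across the sentence's tokens) and the sparsity ratio are our assumptions for illustration, not the paper's exact definition.

```python
import torch

def sentence_core_neurons(acts: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Return indices of the sentence-wise core neurons for one FFN layer.

    acts:  [num_tokens, num_neurons] pre-activation values over the sentence.
    ratio: fraction of neurons to keep (an assumed hyperparameter).
    """
    fire_freq = (acts > 0).float().mean(dim=0)   # how often each neuron fires
    k = max(1, int(ratio * acts.shape[1]))
    return torch.topk(fire_freq, k).indices      # top-k most frequently firing neurons

# toy usage: a 16-token sentence with a 4096-neuron FFN
core = sentence_core_neurons(torch.randn(16, 4096), ratio=0.1)
print(core.shape)  # torch.Size([409])
```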
To accurately predict core neurons, we explored the relationship between a sentence's core neurons and its semantics, and observed two insights:
1. Insight-1: The Stability of Core Neurons Is Related to Semantic Stability.
2. Insight-2: The Similarity of Core Neurons Is Related to Semantic Similarity.
The experimental results for the two insights are shown below.
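One simple way to quantify both insights is the Jaccard overlap between core-neuron index sets; the sketch below is our own hedged illustration, and the paper may use a different overlap metric.

```python
import torch

def core_overlap(a: torch.Tensor, b: torch.Tensor) -> float:
    """Jaccard overlap between two core-neuron index sets."""
    sa, sb = set(a.tolist()), set(b.tolist())
    return len(sa & sb) / len(sa | sb)

# Insight-1 (stability): overlap between core neurons of a prefix vs. the full sentence.
# Insight-2 (similarity): overlap between core neurons of two semantically similar sentences.
```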
Based on these two insights, for an input sentence, we propose two methods for core neuron prediction, sketched in code after the list:
1. Stability-guided Prediction: when the input sentence has stable semantics, we approximate the core neurons in the decoding stage by directly reusing the core neurons identified during the pre-filling stage.
2. Similarity-guided Prediction: when the core neurons of an input sentence are unstable, we cluster the training dataset by semantic similarity and identify the core neurons by selecting the neurons that appear most frequently within the matching semantic group.
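A minimal sketch of the two prediction paths, assuming the semantic clusters, their centroids, and per-cluster core-neuron sets have been precomputed offline; all names and signatures here are hypothetical.

```python
import torch
import torch.nn.functional as F

def predict_core_neurons(prefill_core, sent_emb, centroids, cluster_cores, stable):
    """Choose the core neurons used for the whole decoding stage.

    prefill_core:  core neurons found during pre-filling.
    sent_emb:      [d] sentence embedding of the input.
    centroids:     [num_clusters, d] semantic cluster centroids (built offline).
    cluster_cores: list of precomputed core-neuron sets, one per cluster.
    stable:        True -> stability-guided, False -> similarity-guided.
    """
    if stable:
        return prefill_core                                      # reuse pre-filling core neurons
    sims = F.cosine_similarity(sent_emb[None, :], centroids)     # match to nearest cluster
    return cluster_cores[int(sims.argmax())]
```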
We randomly select samples from the C4 and TruthfulQA datasets to test the LLM outputs under stability-guided and similarity-guided prediction. The outputs are shown below. To our surprise, similarity-guided prediction sometimes produces better outputs than the original model.
We test performance on different tasks across different datasets; the results are shown below. We find that CoreInfer is applicable not only to models using ReLU but also to state-of-the-art models using SiLU, showing that CoreInfer is not limited by the model's activation function.
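Because the core neurons are fixed after pre-filling, the decoding-stage FFN can simply slice its weight matrices down to the core subset; this works regardless of the activation function, since it does not rely on activations being exactly zero. A rough sketch under those assumptions:

```python
import torch
import torch.nn.functional as F

def sparse_ffn_forward(x, w_up, w_down, core):
    """FFN forward restricted to a fixed core-neuron set.

    x:      [batch, d_model] hidden states.
    w_up:   [d_model, d_ffn] up projection; w_down: [d_ffn, d_model] down projection.
    core:   core-neuron indices, fixed for the whole decoding stage.
    """
    h = F.silu(x @ w_up[:, core])   # SiLU here; a ReLU model would use F.relu
    return h @ w_down[core, :]      # only core rows of the down projection
```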
We evaluated CoreInfer against the Hugging Face Transformers implementation and PowerInfer on a single NVIDIA A100 with a series of FP16 models, and the results are shown below. CoreInfer achieves up to a 6x speedup on Llama 2 70B and up to a 7x speedup on OPT-66B.
We also evaluated CoreInfer against other advanced sparse activation inference methods on a single NVIDIA TITAN Xp (12 GB), with the results presented in the same format as above. CoreInfer achieves up to a 10.33x speedup on OPT models.
@misc{wang2024coreinferacceleratinglargelanguage,
      title={CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation},
      author={Qinsi Wang and Saeed Vahidian and Hancheng Ye and Jianyang Gu and Jianyi Zhang and Yiran Chen},
      year={2024},
      eprint={2410.18311},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.18311},
}