CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

Qinsi Wang1, Saeed Vahidian1, Hancheng Ye1, Jianyang Gu2, Jianyi Zhang1, Yiran Chen1
1 Duke University 2 Ohio State University

Figure 1: Overview of the CoreInfer framework.

Abstract

Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons for each individual token with an additional MLP, which involves frequent changes in activation maps and resource calls, limiting the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, which refers to the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence's semantics. Remarkably, we discovered that core neurons exhibit both stability and similarity in relation to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we further design two semantic-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and fixed during the decoding stage, enabling zero-cost sparse inference. We evaluated the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN Xp GPU, CoreInfer achieved 10.33x and 2.72x speedups compared to the Huggingface implementation and PowerInfer, respectively.


Demo of CoreInfer

We introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. CoreInfer runs OPT-6.7B (ReLU-based) on a single NVIDIA TITAN Xp (12GB) with an impressive speedup, while also performing well on both question-answering and summarization tasks.

Speedup of CoreInfer on question-answering and summarization tasks



Effectively Predicting Core Neurons

In our work, we define token-wise and sentence-wise core neurons, and experimentally show that an LLM only needs to activate one set of core neurons to process an input sentence (a minimal sketch of how such a set might be computed follows the figures below). To accurately predict core neurons, we explored the relationship between a sentence's core neurons and its semantics and observed two insights:
1. Insight-1: The Stability of Core Neurons Is Related to Semantic Stability.
2. Insight-2: The Similarity of Core Neurons Is Related to Semantic Similarity.

The experimental results for the two insights are shown below.

Figure 2: Stability proof: visualization of core neurons when the token length of the continuous input sentence is 10, 50, 100, 200, and 300.

Figure 3: Similarity proof: relationship between the core neurons of sentences and their topics.
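As referenced above, here is a minimal sketch of one plausible way to derive sentence-wise core neurons from FFN activations. The exact definitions and ratios used by CoreInfer are given in the paper; the thresholds `token_ratio` and `sentence_ratio` and all function names below are illustrative assumptions.

```python
# Illustrative sketch, not CoreInfer's exact definition: token-wise core
# neurons are taken as the top-`token_ratio` fraction of FFN neurons by
# activation magnitude for each token; sentence-wise core neurons are the
# `sentence_ratio` fraction of neurons that appear most often across tokens.
import torch

def token_core_neurons(acts: torch.Tensor, token_ratio: float = 0.2) -> torch.Tensor:
    """acts: (seq_len, num_neurons) FFN activations of one sentence.
    Returns (seq_len, k) indices of the top-activated neurons per token."""
    k = max(1, int(acts.shape[1] * token_ratio))
    return acts.abs().topk(k, dim=-1).indices

def sentence_core_neurons(acts: torch.Tensor,
                          token_ratio: float = 0.2,
                          sentence_ratio: float = 0.1) -> torch.Tensor:
    """Aggregate token-wise core neurons into one fixed set for the sentence."""
    idx = token_core_neurons(acts, token_ratio).flatten()
    counts = torch.zeros(acts.shape[1], dtype=torch.long)
    counts.scatter_add_(0, idx, torch.ones_like(idx))  # per-neuron frequency
    n = max(1, int(acts.shape[1] * sentence_ratio))
    return counts.topk(n).indices                      # most frequent neurons

# Example: 100 tokens, 16384 FFN neurons (the FFN width of OPT-6.7B).
core = sentence_core_neurons(torch.relu(torch.randn(100, 16384)))
```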

Based on these two insights, for an input sentence, we propose two methods for core neuron prediction (see the sketch after this list):
1. Stability-guided Prediction: when the input sentence has stable semantics, we approximate the core neurons in the decoding stage by directly reusing the core neurons identified during the pre-filling stage.
2. Similarity-guided Prediction: when the core neurons of an input sentence are unstable, we cluster the training dataset based on semantic similarity and identify the core neurons by selecting the top neurons that appear most frequently within the matching semantic group.
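A minimal sketch of both modes follows, under stated assumptions: `prefill_core` would come from a routine like the one above, and `cluster_centroids` / `cluster_core_sets` would be precomputed offline (e.g., by clustering sentence embeddings of a training corpus and counting the most frequent core neurons per cluster). All names here are hypothetical.

```python
# Illustrative sketch of the two prediction modes; names are hypothetical.
import torch
import torch.nn.functional as F

def stability_guided(prefill_core: torch.Tensor) -> torch.Tensor:
    # Stable semantics: reuse the pre-filling core neurons unchanged
    # for every decoding step.
    return prefill_core

def similarity_guided(sent_emb: torch.Tensor,           # (d,) sentence embedding
                      cluster_centroids: torch.Tensor,  # (C, d) offline centroids
                      cluster_core_sets: list) -> torch.Tensor:
    # Unstable semantics: pick the semantically nearest cluster and use the
    # neurons that fire most frequently for training sentences in it.
    sims = F.cosine_similarity(sent_emb[None, :], cluster_centroids, dim=-1)
    return cluster_core_sets[int(sims.argmax())]
```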



Task Performance of CoreInfer

We randomly select samples from the C4 and TruthfulQA datasets to test the LLM outputs under stability-guided and similarity-guided prediction. The outputs are shown below. To our surprise, similarity-guided prediction sometimes produces better outputs than the original model.

Figure 4: (Upper) Performance of stability-guided prediction on the generation task. (Lower) Performance of similarity-guided prediction on the question-answering task.

We test performance on different tasks across different datasets; the results are as follows. We find that CoreInfer is applicable not only to models using ReLU but also to state-of-the-art models using SiLU, showing that CoreInfer is not limited by the model's activation function. The sketch below illustrates why.
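The following is a minimal sketch of a decoding-time sparse FFN forward pass restricted to a fixed core-neuron set, an illustration of the idea rather than CoreInfer's actual kernels: only rows and columns of the FFN weights are sliced, so it works identically with ReLU or SiLU.

```python
# Illustrative sketch of a decoding-time FFN restricted to fixed core neurons.
# Works with any activation (ReLU, SiLU, ...) since only weights are sliced.
import torch
import torch.nn.functional as F

def sparse_ffn(x: torch.Tensor,       # (batch, d_model) hidden states
               w_up: torch.Tensor,    # (num_neurons, d_model) up-projection
               w_down: torch.Tensor,  # (d_model, num_neurons) down-projection
               core: torch.Tensor,    # fixed core-neuron indices
               act=F.silu) -> torch.Tensor:
    h = act(x @ w_up[core].T)         # compute only the core neurons
    return h @ w_down[:, core].T      # project back to the model dimension
```

Because the core set is fixed after pre-filling, the sliced weights can be gathered once per sentence, so decoding incurs no per-token prediction or activation-map changes.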

Table 1: Performance comparisons with original models across various tasks using the lm-evaluation-harness.


Hardware Speedup of CoreInfer

We evaluated CoreInfer against the Transformers implementation and PowerInfer on a single NVIDIA A100 (80GB) with a series of FP16 models; the results are shown below. CoreInfer achieves up to a 6x speedup on Llama 2 70B and up to a 7x speedup on OPT-66B.

Figure 5: Speedup of various models on an A100 80GB. The x-axis indicates the output length; the y-axis represents the speedup compared with the Transformers implementation.

We also evaluated CoreInfer and other state-of-the-art sparse-activation inference methods on a single NVIDIA TITAN Xp (12GB); the results are shown below. CoreInfer achieves up to a 10.33x speedup on the OPT model.
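For reference, here is a minimal sketch of how baseline decoding throughput (and hence speedup) can be measured with the stock Transformers implementation; the model name and prompt are illustrative, and memory permitting, the same loop would time any of the compared methods.

```python
# Illustrative throughput measurement for the Transformers baseline;
# speedup = a method's tokens/s divided by this baseline's tokens/s.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", torch_dtype=torch.float16, device_map="auto")
inputs = tok("Large language models are", return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```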

Table 2: Comparison of resources required by different methods to run OPT-6.7B on an NVIDIA TITAN Xp. ‘NA’ means that the metric is not applicable.



BibTeX

@misc{wang2024coreinferacceleratinglargelanguage,
      title={CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation}, 
      author={Qinsi Wang and Saeed Vahidian and Hancheng Ye and Jianyang Gu and Jianyi Zhang and Yiran Chen},
      year={2024},
      eprint={2410.18311},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.18311}, 
}