CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

Qinsi Wang1, Saeed Vahidian1, Hancheng Ye1, Jianyang Gu2, Jianyi Zhang1, Yiran Chen1
1 Duke University 2 Ohio State University

Figure 1: Overview of the CoreInfer framework.

Abstract

Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict activated neurons for each individual token with an additional MLP, which involves frequent changes in activation maps and resource calls, limiting the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, which refers to the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence's semantics. Remarkably, we discovered that core neurons exhibit both stability and similarity in relation to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we further design two semantic-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and fixed during the decoding stage, enabling zero-cost sparse inference. We evaluated the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN Xp GPU, CoreInfer achieved 10.33x and 2.72x speedups compared to the Huggingface implementation and PowerInfer, respectively.


Demo of CoreInfer

We introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. CoreInfer runs OPT-6.7B (ReLU-based) on a single NVIDIA TITAN Xp (12GB) with an impressive speedup, while also performing well on both question-answering and summarization tasks.

Speedup of CoreInfer on question-answering and summarization tasks



Effectively Predicting Core Neurons

In our work, we define token-wise and sentence-wise core neurons, and experimentally show that an LLM only needs to activate one set of core neurons to process an input sentence (a minimal sketch of how such a set might be computed follows the figures below). To accurately predict core neurons, we explored the relationship between a sentence's core neurons and its semantics and observed two insights:
1. Insight-1: The Stability of Core Neurons Is Related to Semantic Stability.
2. Insight-2: The Similarity of Core Neurons Is Related to Semantic Similarity.

The experimental results for the two insights are shown below.

Figure 2: Stability proof: visualization of core neurons when the token length of the continuous input sentence is 10, 50, 100, 200, and 300.

Figure 3: Similarity proof: relationship between the core neurons of sentences and their topics.
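As referenced above, here is a minimal sketch of one plausible way to derive sentence-wise core neurons from FFN activations. The exact definitions and ratios used by CoreInfer are given in the paper; the thresholds `token_ratio` and `sentence_ratio` and all function names below are illustrative assumptions.

```python
# Illustrative sketch, not CoreInfer's exact definition: token-wise core
# neurons are taken as the top-`token_ratio` fraction of FFN neurons by
# activation magnitude for each token; sentence-wise core neurons are the
# `sentence_ratio` fraction of neurons that appear most often across tokens.
import torch

def token_core_neurons(acts: torch.Tensor, token_ratio: float = 0.2) -> torch.Tensor:
    """acts: (seq_len, num_neurons) FFN activations of one sentence.
    Returns (seq_len, k) indices of the top-activated neurons per token."""
    k = max(1, int(acts.shape[1] * token_ratio))
    return acts.abs().topk(k, dim=-1).indices

def sentence_core_neurons(acts: torch.Tensor,
                          token_ratio: float = 0.2,
                          sentence_ratio: float = 0.1) -> torch.Tensor:
    """Aggregate token-wise core neurons into one fixed set for the sentence."""
    idx = token_core_neurons(acts, token_ratio).flatten()
    counts = torch.zeros(acts.shape[1], dtype=torch.long)
    counts.scatter_add_(0, idx, torch.ones_like(idx))  # per-neuron frequency
    n = max(1, int(acts.shape[1] * sentence_ratio))
    return counts.topk(n).indices                      # most frequent neurons

# Example: 100 tokens, 16384 FFN neurons (the FFN width of OPT-6.7B).
core = sentence_core_neurons(torch.relu(torch.randn(100, 16384)))
```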

Based on these two insights, for an input sentence, we propose two methods for core neuron prediction (see the sketch after this list):
1. Stability-guided Prediction: when the input sentence has stable semantics, we approximate the core neurons in the decoding stage by directly reusing the core neurons identified during the pre-filling stage.
2. Similarity-guided Prediction: when the core neurons of an input sentence are unstable, we cluster the training dataset based on semantic similarity and identify the core neurons by selecting the top neurons that appear most frequently within the matching semantic group.
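A minimal sketch of both modes follows, under stated assumptions: `prefill_core` would come from a routine like the one above, and `cluster_centroids` / `cluster_core_sets` would be precomputed offline (e.g., by clustering sentence embeddings of a training corpus and counting the most frequent core neurons per cluster). All names here are hypothetical.

```python
# Illustrative sketch of the two prediction modes; names are hypothetical.
import torch
import torch.nn.functional as F

def stability_guided(prefill_core: torch.Tensor) -> torch.Tensor:
    # Stable semantics: reuse the pre-filling core neurons unchanged
    # for every decoding step.
    return prefill_core

def similarity_guided(sent_emb: torch.Tensor,           # (d,) sentence embedding
                      cluster_centroids: torch.Tensor,  # (C, d) offline centroids
                      cluster_core_sets: list) -> torch.Tensor:
    # Unstable semantics: pick the semantically nearest cluster and use the
    # neurons that fire most frequently for training sentences in it.
    sims = F.cosine_similarity(sent_emb[None, :], cluster_centroids, dim=-1)
    return cluster_core_sets[int(sims.argmax())]
```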



Task Performance of CoreInfer

We randomly select samples from the C4 and TruthfulQA datasets to test the LLM outputs under stability-guided and similarity-guided prediction. The outputs are shown below. To our surprise, similarity-guided prediction sometimes produces better outputs than the original model.

Figure 4: (Upper) Performance of stability-guided prediction on the generation task. (Lower) Performance of similarity-guided prediction on the question-answering task.

We test performance on different tasks across different datasets; the results are as follows. We find that CoreInfer is applicable not only to models using ReLU but also to state-of-the-art models using SiLU, showing that CoreInfer is not limited by the model's activation function. The sketch below illustrates why.
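The following is a minimal sketch of a decoding-time sparse FFN forward pass restricted to a fixed core-neuron set, an illustration of the idea rather than CoreInfer's actual kernels: only rows and columns of the FFN weights are sliced, so it works identically with ReLU or SiLU.

```python
# Illustrative sketch of a decoding-time FFN restricted to fixed core neurons.
# Works with any activation (ReLU, SiLU, ...) since only weights are sliced.
import torch
import torch.nn.functional as F

def sparse_ffn(x: torch.Tensor,       # (batch, d_model) hidden states
               w_up: torch.Tensor,    # (num_neurons, d_model) up-projection
               w_down: torch.Tensor,  # (d_model, num_neurons) down-projection
               core: torch.Tensor,    # fixed core-neuron indices
               act=F.silu) -> torch.Tensor:
    h = act(x @ w_up[core].T)         # compute only the core neurons
    return h @ w_down[:, core].T      # project back to the model dimension
```

Because the core set is fixed after pre-filling, the sliced weights can be gathered once per sentence, so decoding incurs no per-token prediction or activation-map changes.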

Table 1: Performance comparisons with original models across various tasks using the lm-evaluation-harness.


Hardware Speedup of CoreInfer

We evaluated CoreInfer against the Transformers implementation and PowerInfer on a single NVIDIA A100 (80GB) with a series of FP16 models; the results are shown below. CoreInfer achieves up to a 6x speedup on Llama 2 70B and up to a 7x speedup on OPT-66B.

Figure 5: Speedup of various models on an A100 80GB. The x-axis indicates the output length; the y-axis represents the speedup compared with the Transformers implementation.

We also evaluated CoreInfer and other state-of-the-art sparse-activation inference methods on a single NVIDIA TITAN Xp (12GB); the results are shown below. CoreInfer achieves up to a 10.33x speedup on the OPT model.
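For reference, here is a minimal sketch of how baseline decoding throughput (and hence speedup) can be measured with the stock Transformers implementation; the model name and prompt are illustrative, and memory permitting, the same loop would time any of the compared methods.

```python
# Illustrative throughput measurement for the Transformers baseline;
# speedup = a method's tokens/s divided by this baseline's tokens/s.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", torch_dtype=torch.float16, device_map="auto")
inputs = tok("Large language models are", return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```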

Table 2: Comparison of resources required by different methods to run OPT-6.7B on an NVIDIA TITAN Xp. ‘NA’ means that the metric is not applicable.



BibTeX

@misc{wang2024coreinferacceleratinglargelanguage,
      title={CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation}, 
      author={Qinsi Wang and Saeed Vahidian and Hancheng Ye and Jianyang Gu and Jianyi Zhang and Yiran Chen},
      year={2024},
      eprint={2410.18311},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.18311}, 
}