Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The improvements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.
Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM
TensorRT-LLM has already delivered remarkable inference throughput for Llama 3.1 405B since the model’s release. This was achieved through various optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques accelerate inference while maintaining accuracy at lower-precision compute.
TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
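For context, the sketch below shows what serving the model through TensorRT-LLM’s high-level Python LLM API can look like. The model identifier, tensor-parallel degree, and prompt are illustrative assumptions, not the exact configuration used for the benchmarks in this post; in-flight batching and KV caching are handled by the runtime.

```python
# Minimal sketch: running Llama 3.1 405B inference with TensorRT-LLM's LLM API.
# The checkpoint name and tensor_parallel_size are placeholders for illustration.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model identifier
    tensor_parallel_size=8,                      # e.g. spread across one HGX H200 node
)

prompts = ["Explain in-flight batching in one sentence."]
sampling = SamplingParams(max_tokens=128, temperature=0.7)

# In-flight batching and paged KV caching are applied automatically by the runtime.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```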
Boosting Performance Up to 1.44x with TensorRT Model Optimizer
NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
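As a rough illustration, the sketch below applies FP8 post-training quantization with the TensorRT Model Optimizer (nvidia-modelopt) Python API. The model identifier, calibration text, and use of the library’s default FP8 configuration are assumptions; NVIDIA’s exact recipe, including its KV cache and self-attention quantization settings, may differ.

```python
# Minimal sketch: FP8 PTQ with TensorRT Model Optimizer (nvidia-modelopt).
# Model identifier and calibration data are placeholders, not NVIDIA's exact setup.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-405B-Instruct"  # placeholder identifier
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # use a representative calibration set

def forward_loop(m):
    # Run calibration batches so static scaling factors can be collected.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Apply the library's default FP8 PTQ configuration (assumed here as a stand-in
# for the custom recipe described in the post).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```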
Table 1 shows the maximum throughput performance, with significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.
Table 1. Maximum throughput performance – output tokens/second, 8 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 120,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer FP8 | 463.1 | 320.1 | 71.5 |
| Official Llama FP8 Recipe | 399.9 | 230.8 | 49.6 |
| Speedup | 1.16x | 1.39x | 1.44x |
Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.
Table 2. Batch size = 1 (minimum latency) performance – output tokens/second, 8 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 120,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer FP8 | 49.6 | 44.2 | 27.2 |
| Official Llama FP8 Recipe | 37.4 | 33.1 | 22.8 |
| Speedup | 1.33x | 1.33x | 1.19x |
These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massively Multitask Language Understanding (MMLU) and MT-Bench benchmarks.
Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ
For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method significantly reduces the required memory footprint by compressing the weights down to 4-bit integers while encoding activations in FP16.
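A minimal sketch of applying INT4 AWQ through the same Model Optimizer API is shown below; as before, the model identifier, calibration loop, and configuration choice are illustrative assumptions rather than NVIDIA’s exact workflow.

```python
# Minimal sketch: INT4 AWQ weight-only quantization with TensorRT Model Optimizer,
# targeting a smaller-footprint (e.g. 2-GPU) deployment. Names are placeholders.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-405B-Instruct"  # placeholder identifier
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Single illustrative calibration sample; use a representative dataset in practice.
    inputs = tokenizer("Calibration text goes here.", return_tensors="pt").to(m.device)
    m(**inputs)

# INT4_AWQ_CFG compresses weights to 4-bit integers while activations stay in
# higher precision, reducing the memory footprint of the checkpoint.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```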
Tables 3 and 4 present the maximum throughput and minimum latency performance measurements; the INT4 AWQ method also delivers accuracy scores comparable to Meta’s official Llama 3.1 FP8 recipe.
Table 3. Maximum throughput performance – output tokens/second, 2 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 60,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer INT4 AWQ | 75.6 | 28.7 | 16.2 |
Table 4. Batch size = 1 (minimum latency) performance – output tokens/second, 2 NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths | 2,048 / 128 | 32,768 / 2,048 | 60,000 / 2,048 |
|---|---|---|---|
| TensorRT Model Optimizer INT4 AWQ | 21.6 | 18.7 | 12.8 |
NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for better performance and efficiency when running large language models like Llama 3.1 405B. These improvements give developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.
Image source: Shutterstock