LLM inference optimization is challenging and directly impacts cost for both small and large companies, the GPU-poor and the GPU-rich alike. In PyTorch land, using the compiler is the new direction for optimizing performance: torch.compile, together with projects such as GPT-FAST, has showcased how to properly use the compiler to speed up inference. However, LLM inference optimization goes beyond kernel fusion and needs further work, such as optimized management of the KV cache in combination with new approaches that reduce the cost of auto-regressive decoding.
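To make the torch.compile angle concrete, here is a minimal sketch of compiling a decode step, in the spirit of GPT-FAST but not taken from it; the `TinyDecoder` module and its shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy decoder block standing in for a real Transformer decode step.
class TinyDecoder(nn.Module):
    def __init__(self, dim: int = 512, vocab: int = 32000):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.proj(x)))

model = TinyDecoder().eval()

# torch.compile fuses kernels and removes Python overhead;
# mode="reduce-overhead" enables CUDA graphs, which matters most for the
# tiny per-token workloads of auto-regressive decoding.
compiled_model = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    hidden = torch.randn(1, 1, 512)      # single-token decode input
    logits = compiled_model(hidden)      # first call triggers compilation
    next_token = torch.argmax(logits, dim=-1)
```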
There are emerging methods such as speculative decoding, lookahead decoding, and more recently EAGLE that can significantly speed up inference.
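As a rough illustration of the first of these, below is a minimal greedy speculative-decoding sketch (draft/verify loop only; no KV cache, sampling, or EAGLE-style feature reuse). The `draft_model` and `target_model` arguments are assumed to be callables that return logits of shape (batch, seq, vocab).

```python
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, prompt_ids,
                       num_draft: int = 4, max_new: int = 32):
    """Greedy speculative decoding: a cheap draft model proposes num_draft
    tokens, the large target model verifies them in a single forward pass,
    and tokens are kept up to the first disagreement."""
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new:
        # 1) Draft: propose tokens auto-regressively with the cheap model.
        draft_ids = ids
        for _ in range(num_draft):
            logits = draft_model(draft_ids)              # assumed (1, seq, vocab)
            nxt = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=1)

        # 2) Verify: one target forward over prompt + drafted tokens.
        tgt_logits = target_model(draft_ids)
        tgt_pred = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 3) Accept the longest matching prefix, then append the target's
        #    own token at the first mismatch (so each round yields >= 1 token).
        match = (tgt_pred == proposed).int().cumprod(dim=-1)
        n_accept = int(match.sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=1)
        if n_accept < proposed.shape[1]:
            ids = torch.cat([ids, tgt_pred[:, n_accept:n_accept + 1]], dim=1)
    return ids
```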
Many of these methods have already been proven out on Nvidia GPUs. Here I am aiming to deploy the Llama 2 family of models on AMD GPUs using a combination of newer techniques, including the above along with quantization. This would help the ecosystem further promote hardware diversity and build a success story for AMD GPUs in the developer community.