LLM inference optimization is challenging and directly impacts cost for both small and large companies, the GPU-poor and the GPU-rich alike. In PyTorch land, using the compiler is the new direction for optimizing performance: torch.compile, together with projects such as GPT-FAST, has showcased how to properly use the compiler to speed up inference. However, LLM inference optimization goes beyond kernel fusion and needs further work, such as optimized management of the KV cache in combination with new approaches that reduce the cost of auto-regressive decoding.
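To make the torch.compile angle concrete, here is a minimal sketch of compiling a decode step, in the spirit of GPT-FAST but not taken from it; the `TinyDecoder` module and its shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy decoder block standing in for a real Transformer decode step.
class TinyDecoder(nn.Module):
    def __init__(self, dim: int = 512, vocab: int = 32000):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.proj(x)))

model = TinyDecoder().eval()

# torch.compile fuses kernels and removes Python overhead;
# mode="reduce-overhead" enables CUDA graphs, which matters most for the
# tiny per-token workloads of auto-regressive decoding.
compiled_model = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    hidden = torch.randn(1, 1, 512)      # single-token decode input
    logits = compiled_model(hidden)      # first call triggers compilation
    next_token = torch.argmax(logits, dim=-1)
```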
There are emerging methods such as speculative decoding, lookahead decoding, and more recently EAGLE that can significantly speed up inference.
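As a rough illustration of the first of these, below is a minimal greedy speculative-decoding sketch (draft/verify loop only; no KV cache, sampling, or EAGLE-style feature reuse). The `draft_model` and `target_model` arguments are assumed to be callables that return logits of shape (batch, seq, vocab).

```python
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, prompt_ids,
                       num_draft: int = 4, max_new: int = 32):
    """Greedy speculative decoding: a cheap draft model proposes num_draft
    tokens, the large target model verifies them in a single forward pass,
    and tokens are kept up to the first disagreement."""
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new:
        # 1) Draft: propose tokens auto-regressively with the cheap model.
        draft_ids = ids
        for _ in range(num_draft):
            logits = draft_model(draft_ids)              # assumed (1, seq, vocab)
            nxt = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=1)

        # 2) Verify: one target forward over prompt + drafted tokens.
        tgt_logits = target_model(draft_ids)
        tgt_pred = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 3) Accept the longest matching prefix, then append the target's
        #    own token at the first mismatch (so each round yields >= 1 token).
        match = (tgt_pred == proposed).int().cumprod(dim=-1)
        n_accept = int(match.sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=1)
        if n_accept < proposed.shape[1]:
            ids = torch.cat([ids, tgt_pred[:, n_accept:n_accept + 1]], dim=1)
    return ids
```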
Many of these methods have already been proven out on Nvidia GPUs. Here I am aiming to deploy the Llama 2 family of models on AMD GPUs using a combination of newer techniques, including the above along with quantization. This would help the ecosystem further promote hardware diversity and build a success story for AMD GPUs in the developer community.