1. Project Background and Objectives
Diffusion models have garnered significant attention in the field of image generation due to their ability to generate images from text descriptions. For commercial applications, however, controllability of the generated results is a crucial requirement. This is especially true for applications that generate e-commerce images from given clothing and model conditions, where the demand for controllability is even higher. Several open-source projects implement this application, with the OOTDiffusion project being a typical example.
This application involves multiple network structures, including the openpose network for extracting human keypoints, the humanparsing network for segmenting key body regions, the CLIP text encoder introduced by the Stable Diffusion model, the U-Net and scheduler models, and the autoencoder encoder-decoder model. The main objective of this project is to adapt the complex OOTDiffusion project to a platform based on the ROCm software stack and RDNA 3 hardware.
By completing the adaptation of OOTDiffusion to ROCm and RDNA 3 hardware, we gained a thorough understanding of how well the ROCm software platform and the associated RDNA 3 hardware support the various models in the current PyTorch ecosystem. We also documented the issues encountered during the adaptation, the debugging techniques used, and the solutions to these problems.
2. Issues and Solutions
2.1. Preparations and Related Issues Encountered
1. **Install ROCm 6.1 Software Stack**: After installation, we can verify the basic software and hardware environment using `rocm-smi`.
2. **Basic Software and Hardware Platform Testing**: First, we performed basic testing by pulling the ROCm 6.1 container provided by AMD, confirming that the GPU can perform basic tasks in the software and hardware environment.
3. **Setting Up a ROCm-Enabled PyTorch Environment on the Host System**:
1. First, create a Python 3.10 environment based on conda.
2. Following AMD's official tutorial, we installed PyTorch with ROCm support.
3. In the Python environment, after importing `torch`, print the PyTorch version number to verify it is PyTorch 2.4 with ROCm 6.1 support.
    4. The first issue encountered here was that `print(torch.cuda.is_available())` returned `False`, indicating that GPU computation was still not available.
    5. The cause was that the current user lacked permission to access the GPU device nodes under `/dev` (e.g., `/dev/kfd`). Adding the current user to the relevant group (the `render` group) resolved the issue.
4. **Install the Required Libraries According to `OOTDiffusion`'s `requirements.txt`**.
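The preparation steps above can be sketched as shell commands. The environment name, wheel index URL, and exact package versions are illustrative and should be taken from AMD's official instructions for your ROCm release:

```shell
# 1. Create and activate a conda environment with Python 3.10
conda create -n ootd python=3.10 -y
conda activate ootd

# 2. Install PyTorch with ROCm support (index URL per AMD's tutorial;
#    the exact URL may differ for your ROCm release)
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.1

# 3. Verify the build and GPU visibility
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# 4. If is_available() prints False, grant the user access to the GPU
#    device nodes (/dev/kfd, /dev/dri/*), then log out and back in.
#    AMD's docs recommend both the render and video groups.
sudo usermod -aG render,video $USER

# 5. Install OOTDiffusion's Python dependencies
pip install -r requirements.txt
```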
2.2. Adapting OOTDiffusion Code and Encountered Issues
**Attempt to Execute the OOTDiffusion `run/run_ootd.py` Script**
During the initial attempt to execute the `run/run_ootd.py` script of OOTDiffusion, the primary issue encountered was:
2.2.1. Issue Description
```
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 72.00 GiB. GPU 0 has a total capacity of 44.98 GiB of which 37.17 GiB is free. Of the allocated memory 7.18 GiB is allocated by PyTorch, and 220.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
2.2.2. Analysis
This is a typical out-of-memory error. The first step was to try to alleviate the problem within the main script, `run_ootd.py`. Notably, `run_ootd.py` uniformly resizes images to 768x1024 for processing, so the priority was to reduce the image size in `run_ootd.py` to ease the memory pressure on the entire pipeline. After reducing the target image size to 384x512, the pipeline ran to completion. The generated results broadly met the model's objectives, though they showed flaws likely related to the lowered resolution. This confirmed that the main issue was insufficient memory at a particular computation step: the model's operations are generally supported by the ROCm + RDNA3 combination, but specific steps needed memory optimization.
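As a sketch of this workaround (the constant names and helper below are illustrative, not OOTDiffusion's actual code), the input images can be downscaled before entering the pipeline:

```python
from PIL import Image

# Reduced target resolution (down from OOTDiffusion's default 768x1024)
# to relieve GPU memory pressure; halving each dimension shrinks the
# attention sequence length, and hence memory use, by roughly 4x.
TARGET_W, TARGET_H = 384, 512

def load_resized(path: str) -> Image.Image:
    """Load an image and resize it to the reduced target resolution."""
    img = Image.open(path).convert("RGB")
    return img.resize((TARGET_W, TARGET_H))
```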
The next step was to pinpoint the exact location of the memory issue. By printing information at each major step in `run_ootd.py` and cross-checking with the output of `rocm-smi`, the issue was traced to the `F.scaled_dot_product_attention` call in `~/anaconda3/envs/py3xrocmtorch2/lib/python3.10/site-packages/diffusers/models/attention_processor.py`. This is a fundamental attention computation step. Smaller images completed the computation, but larger images ran out of memory, indicating that PyTorch's default attention implementation is not memory-efficient under this software and hardware combination.
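A minimal version of that step-by-step tracing might look like the following (a hypothetical helper, not the project's code; on ROCm builds of PyTorch, the HIP allocator is exposed through the `torch.cuda` namespace):

```python
import torch

def log_gpu_mem(tag: str) -> float:
    """Print the GPU memory currently allocated by PyTorch, in GiB.

    Called before/after each major pipeline stage; cross-checking these
    numbers against `rocm-smi` output narrows down the offending op.
    """
    if not torch.cuda.is_available():
        print(f"[{tag}] no GPU available")
        return 0.0
    gib = torch.cuda.memory_allocated() / 2**30
    print(f"[{tag}] allocated by PyTorch: {gib:.2f} GiB")
    return gib

# Usage inside the pipeline script, e.g.:
#   log_gpu_mem("before unet")
#   images = pipe(...)
#   log_gpu_mem("after unet")
```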
2.2.3. Solution Approach
The solution involved optimizing the attention computation for memory efficiency on the current software and hardware platform under PyTorch. In a CUDA environment, this is typically achieved with either the xFormers library or the flash-attention library; however, neither supports ROCm + RDNA3 by default. An initial attempt to obtain support from ROCm's software libraries showed that the main branch of ROCm's flash-attention only supports CDNA-architecture hardware. Further searching revealed that a branch (`howiejay/navi_support`) of ROCm's flash-attention provides RDNA3 support.
2.2.4. Solution Steps
The first step to solve the problem was to check out the `howiejay/navi_support` branch from [ROCm's flash-attention repository](https://github.com/ROCm/flash-attention) and install it.
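Concretely, that looks like the following (the repository URL and branch name come from the notes above; build prerequisites may vary with your ROCm installation):

```shell
# Fetch ROCm's flash-attention fork and switch to the RDNA3 branch
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git checkout howiejay/navi_support

# Build and install against the local ROCm toolchain; this compiles
# device code for the RDNA3 (gfx11xx) targets this branch supports
pip install .
```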
The second part of the solution involved modifying the OOTDiffusion code with minimal changes to use ROCm + RDNA3's flash-attention implementation. This was achieved with the `sdpa_hijack` approach: hijacking the attention-related calls in the processing flow and replacing them with the flash-attention implementation that supports ROCm + RDNA3. Refer to the specific code modifications in the attached submission, particularly in `run/run_ootd.py`.
This resolved the entire issue.
2.3. Performance
We compared the image generation speed of the W7900 using this solution against an NVIDIA 3090 executing the same processing pipeline. The performance was comparable: 1.9 s/it on the W7900 vs. 1.4 s/it on the 3090.
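The s/it figures above were read from the pipeline's progress bar; a rough equivalent measurement (a hypothetical helper, not part of the project) is:

```python
import time

def seconds_per_iteration(step_fn, n_iters: int = 10) -> float:
    """Average wall-clock seconds per call of step_fn, comparable to
    the s/it figure shown by diffusers/tqdm progress bars.

    For GPU work, call torch.cuda.synchronize() inside step_fn so that
    asynchronous kernels are included in the measurement.
    """
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    return (time.perf_counter() - start) / n_iters
```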
3. Conclusion
The ROCm + RDNA3 software and hardware combination provides extensive support for deep learning applications based on PyTorch. Most operators and computation processes work out of the box. Some operators require the installation of ROCm-optimized computation libraries for further optimization to achieve better performance. It is important to note that the RDNA3 and CDNA architectures may require different ROCm libraries for the optimization of the same operators. Overall, ROCm + RDNA3 has reached a high level of usability as a deep learning computation platform.