The landscape of artificial intelligence has seen a significant transformation with the advent of large language models (LLMs). These models, such as ChatGPT and Llama, and their successors, have demonstrated remarkable capabilities in understanding and generating human-like text. The increase in model size and complexity has directly correlated with their improved performance across a variety of tasks, from natural language understanding to generation. However, this enhancement comes at the cost of requiring substantial computational resources and memory, posing challenges for deployment, especially on edge devices or in scenarios where resources are constrained.
Quantization has emerged as a critical technique in the deployment of large language models, addressing both the computational and memory challenges. Before the rise of LLMs, quantization research primarily focused on leveraging efficient compute instructions. This involved converting high-precision floating-point operations into lower-precision representations, such as INT8 or FP16, to speed up computations and reduce power consumption. In the context of large models, however, the primary benefit of quantization has shifted towards reducing the memory footprint. By representing model weights and activations in lower precision, significant memory savings can be achieved. This is crucial for deploying these models on devices with limited memory capacity, enabling faster inference and reducing the overall cost of deployment.
Despite the clear advantages of quantization, implementing efficient quantization kernels for large language models remains a challenging task. The difficulty lies in the diverse hardware architectures and the need for specialized optimizations to fully exploit hardware capabilities. Standard approaches often fall short in providing the necessary performance improvements, as they may not fully leverage hardware-specific features such as Matrix Cores or advanced vector instructions. Additionally, achieving precision stability and minimizing accuracy loss are critical considerations that complicate the development of quantization techniques. As a result, there is a growing need for flexible and efficient quantization solutions that can be tailored to different hardware platforms.
To address these challenges, we introduce a novel machine learning compiler designed to facilitate the efficient deployment of quantized large language models. Our framework, BitBLAS, provides a hardware-aware tensor transformation approach that optimizes the execution of low-precision computations. A key highlight of our solution is its ability to match or outperform state-of-the-art vendor libraries such as ROCm's rocBLAS. BitBLAS automatically generates optimized code that takes full advantage of the underlying hardware, ensuring that quantization does not come at the expense of performance. This is particularly significant for high-performance computing environments and edge devices, where resource efficiency is paramount. By focusing on both compute efficiency and memory savings, our framework enables the deployment of large language models in a wide range of settings, unlocking new possibilities for AI applications.
In summary, BitBLAS represents a significant advancement in the field of low-precision deep learning computing. By addressing the challenges of quantization kernel support and offering a machine learning compiler that delivers state-of-the-art performance, our framework stands poised to facilitate the efficient deployment of large language models across diverse hardware platforms. This not only democratizes access to cutting-edge AI technology but also sets the stage for future innovations in the field. We have open-sourced BitBLAS/Ladder at https://github.com/Microsoft/BitBLAS.
Intermediate Representation and Primitive Design for Tensor Scheduling
A significant reason existing automatic tensorization approaches fail to achieve theoretical hardware performance is the inability to fully utilize bandwidth. Maximizing bandwidth utilization requires carefully controlling the placement of data at each level of storage, thereby maximizing the bandwidth utilization rate of each memory access level. Current machine learning compilation work does not consider data placement issues during automatic optimization. Based on this insight, this design proposes that machine learning compilation systems should not only consider the search space for computational scheduling but also introduce a search space for tensor scheduling. This search space should include the data layout of tensors, allowing for the existence of feasible solutions that can achieve the theoretical maximum hardware performance for the target application and hardware acceleration units.
We outline the efforts made by academia and industry to address the challenges of new computing paradigms brought about by the development of large models. From these solutions, two important characteristics of new computing paradigms and mainstream hardware architectures can be summarized:
- Memory System Compatibility: Although hardware accelerators may not support direct computational instructions for certain custom data types, their memory systems can store any data type by converting these custom data types into fixed-width opaque data blocks. This indicates that data storage and indirect processing can be achieved on hardware that does not directly support these custom data types by appropriately packaging and converting the data.
- Computational Instruction Compatibility: Most custom data types can be losslessly converted into standard data types with wider bit-widths supported by existing hardware computation units for processing. For example, converting NF4 tensors into FLOAT16 or FLOAT32 allows for computation on NF4 tensors on existing hardware. This observation suggests that through data type conversion, the computational capabilities of existing hardware for different data types can be extended, enabling efficient processing of a wide range of data types.
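To make the second point concrete, the sketch below shows one way such a conversion can be carried out in plain PyTorch: packed 4-bit codes are unpacked into indices and looked up in a 16-entry codebook to obtain FLOAT16 values. The helper name and the codebook values are illustrative only; the real NF4 codebook is fixed by the quantization scheme.

import torch

# Illustrative 16-entry codebook; the actual NF4 codebook is defined by the
# quantization scheme and should be taken from its reference implementation.
NF4_CODEBOOK = torch.linspace(-1.0, 1.0, 16, dtype=torch.float16)

def dequantize_nf4_to_fp16(packed: torch.Tensor) -> torch.Tensor:
    # Each uint8 byte stores two 4-bit codes; unpack both nibbles.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    indices = torch.stack([low, high], dim=-1).flatten(-2)
    # A table lookup turns 4-bit codes into FLOAT16 values.
    return NF4_CODEBOOK[indices.long()]

packed = torch.randint(0, 256, (4, 8), dtype=torch.uint8)  # 4 rows of 16 NF4 values
print(dequantize_nf4_to_fp16(packed).shape)  # torch.Size([4, 16])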
Therefore, this design proposes the introduction of tensor scheduling in machine learning compilation systems, leading to the design of a machine learning compilation system centered around tensor scheduling, called BitBLAS. To incorporate tensor scheduling, this chapter will first design a tensor-centric intermediate representation, a set of tensor scheduling primitives to manage tensors, and the final result of tensor scheduling—the execution sketch. This chapter will also introduce the implementation and optimization process of the system. Finally, the new challenges faced by the system after introducing tensor scheduling will be discussed.
Compiler Abstraction Centered on Tensors
Unlike traditional scalar data types (e.g., 16-bit floating-point numbers, 32-bit fixed-point numbers), new data types under large model development may be vector-based (e.g., quantized 4-bit data types with row scaling factors) or even tensor-based (e.g., data types with grouped precision adjustments). Thus, the system introduces a new data type, \textit{typedTile} (\textit{ttile}). This data type has the following characteristics:
- Tiled Storage: A tiled data structure can encompass all data types, whether scalar, vector, or tensor-based. The TileShape, nElemBits, and metadata can store all tensor representation information.
- Custom Conversion: It allows the provision of conversion methods from this data type to other data types, such as converting NF4 data type to 16-bit floating-point or INT4 data type to 32-bit fixed-point. During actual compilation optimization, the tensor scheduler selects an appropriate data type for conversion.
Specifically, algorithm designers can use common data types (e.g., FLOAT16) or define custom data types (e.g., MXFP8 or NF4) as \textit{ttype} and define neural network model computations under this data type. BitBLAS takes the neural network model as input and converts it into a data flow graph based on \textit{ttile} (i.e., \textit{ttile}-graph), where operators are defined as \textit{ttile}-operators. The figure shows that BitBLAS abstracts the target hardware accelerator as a multi-level storage structure, with each level represented as a \textit{ttile}. By taking over the tensor types in the \textit{ttile}-graph using \textit{ttile}, the system can introduce tensor scheduling, such as padding tensors or converting data types.
Furthermore, the system introduces an abstraction for tensor data layouts, determining how tensors should be placed at each storage level. This layout transformation is mutual, meaning if one \textit{ttile} changes its data layout, subsequent \textit{ttiles} consuming this tensor must adjust their data access accordingly to ensure program correctness. Therefore, this layout change needs to be separate from the definition of \textit{ttile} itself. To this end, the system introduces the IndexMap abstraction, which provides a paired data layout transformation primitive (TransformLayout) for transforming tensor layouts between \textit{ttiles}. For example, if a \textit{ttile} changes its data layout to transpose storage, the user creates an IndexMap data type with IndexMap(initial_indices=[i, j], final_indices=[j, i]) and applies the tensor scheduling primitive TransformLayout(tTile, read, 0, index_map). The \textit{ttile} producing the tensor will store the data in a transposed manner, and subsequent \textit{ttiles} consuming this tensor must access data according to the transposed layout.
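Because BitBLAS builds on TVM, the same transpose example can be illustrated with TVM's TensorIR scheduling API. The toy copy kernel below is only a minimal sketch of the mechanism, not BitBLAS's own TransformLayout implementation.

import tvm
from tvm.script import tir as T

@T.prim_func
def copy(a: T.handle, b: T.handle):
    A = T.match_buffer(a, (8, 8), "float16")
    B = T.match_buffer(b, (8, 8), "float16")
    for i, j in T.grid(8, 8):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]

sch = tvm.tir.Schedule(copy)
# IndexMap(initial_indices=[i, j], final_indices=[j, i]) in TVM's notation:
index_map = tvm.tir.IndexMap.from_func(lambda i, j: [j, i])
# Store the read buffer A transposed; its consumers now index it as A[j, i].
sch.transform_layout(sch.get_block("copy"), ("read", 0), index_map)
print(sch.mod)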
With the \textit{ttype} abstraction for tensor data types, the \textit{ttile} abstraction for tiling, and the IndexMap abstraction for tensor data layouts, this design opens up another dimension of the search space beyond traditional computational scheduling, making tensor scheduling possible.
Device Abstraction for Describing Hardware Requirements
In addition to adding tensor attributes to computation graphs and operator scheduling, this design enriches the target hardware abstraction. Traditional machine learning compilers use target hardware information only for generating corresponding compilation options based on the architecture model. This design proposes the \textit{tDevice} abstraction, which additionally introduces information about the size and bandwidth of each memory level in the target hardware architecture, the abstract representation of supported hardware instruction sets, and the expected memory access patterns at each storage level. This information includes supported data types and expected access patterns relevant to tensor scheduling, aiding further exploration of system optimization after introducing tensor scheduling.
The size and bandwidth information of each storage level in the target hardware architecture will be used to generate the search space. By leveraging this hardware information, the system can significantly reduce the search space. Most hardware parameters do not require manual input from the user. For example, cache size, warp size, and other information can be directly obtained through TVM's device driver encapsulation. Users only need to provide a few easily accessible hardware parameters, such as global memory bandwidth and shared memory bandwidth.
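For example, on a CUDA device most of these parameters can be read directly from TVM's device API; only bandwidth figures typically need to be supplied by hand. A minimal sketch:

import tvm

dev = tvm.cuda(0)
# Parameters that can be obtained automatically through TVM's device driver wrapper.
print("warp size:", dev.warp_size)
print("max shared memory per block (bytes):", dev.max_shared_memory_per_block)
print("max threads per block:", dev.max_threads_per_block)
print("multiprocessor count:", dev.multi_processor_count)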
The definition of hardware instruction sets includes tensor expressions for the target instruction set, supported data types, and the expected memory access patterns. The functions of this information are as follows:
- Tensor Expressions for the Target Instruction Set: Used to match tensor expressions in the target abstract syntax tree and replace sub-trees that can be tensorized. This expression will also be used as an input to the automatic tensorization algorithm.
- Supported Data Types for the Target Instruction Set: Different hardware instructions expect different data formats as input. For example, the DP4A instruction accepts eight INT8 values and produces an INT32 accumulated value, while the HFMA2 instruction operates on packed pairs of FLOAT16 values and returns two accumulated FLOAT16 results. The definition of these data types helps select appropriate instruction sets for tensorization. For instance, the NF4 data type is compatible with FLOAT16 instructions: a matrix multiplication between FLOAT16 and NF4 operands fits within the numerical range of both FLOAT16-by-FLOAT16 and FLOAT32-by-FLOAT32 matrix multiplication. Even once the target computation is fixed to FLOAT16-by-FLOAT16 matrix multiplication, the hardware still offers different instruction options, such as HFMA2 and HMMA. This design uses a heuristic matching method for this dual selection: first, select the minimum bit-width data type that satisfies numerical precision; then, match the hardware instruction with the highest throughput for the current computation mode.
- Expected Memory Access Patterns for Target Instructions: Different instructions may have different expected access patterns, which can be irregular. For example, in multiplying two row-major matrices, the HFMA2 instruction expects to access parallel two-element vectors from both matrices, while the DP4A instruction expects to access vertical four-element vectors. The expected access patterns for Tensor Core instructions can be even more complex depending on the configuration. This complexity in access patterns is a fundamental reason why optimizing memory bandwidth is challenging. To better address this, this design explicitly defines the target instructions' access patterns.
Finally, the \textit{tDevice} abstraction also includes modeling the expected access patterns for each memory level's storage and retrieval. These, together with the expected memory access patterns of hardware instructions, will guide the generation of the search space described, both based on the IndexMap abstraction.
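A hypothetical sketch of the kind of record a \textit{tDevice} carries is shown below; the field names are illustrative and do not correspond to BitBLAS's actual class definitions.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MemoryLevel:
    name: str                # e.g. "global", "shared", "register"
    size_bytes: int
    bandwidth_gbps: float    # typically supplied by the user
    preferred_access: str    # expected access pattern at this level

@dataclass
class InstructionDesc:
    name: str                   # e.g. "mma.m16n8k16"
    in_dtypes: Tuple[str, str]  # e.g. ("float16", "float16")
    out_dtype: str              # e.g. "float32"
    access_pattern: str         # expected fragment layout (as an IndexMap)

@dataclass
class TDevice:
    arch: str
    memory_levels: List[MemoryLevel] = field(default_factory=list)
    instructions: List[InstructionDesc] = field(default_factory=list)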
Inspirations and Design of Tensor Scheduling Primitives
The design of tensor scheduling primitives in this system draws inspiration from practices across multiple fields, including parallel computing, high-performance computing, and deep learning accelerator architecture design. These fields share the common goal of improving the execution efficiency of computational tasks while minimizing resource consumption. To achieve these objectives, this system introduces a set of scheduling primitives specifically optimized for tensor operations. These primitives aim to provide flexible computational scheduling capabilities that adapt to the characteristics of different hardware architectures while optimizing data storage and access patterns to enhance overall computational efficiency.
Borrowing Strategies from TVM Scheduling Primitives: As an innovative deep learning compiler framework, TVM provides a range of advanced computational scheduling primitives such as split, reorder, fuse, vectorize, parallel, and unroll. These primitives allow developers to fine-tune the computation graph, such as improving data locality through loop transformations or increasing computational efficiency through parallelization and vectorization. In designing tensor scheduling primitives, this system incorporates the lessons learned from TVM's scheduling primitive design, integrating these advanced optimization strategies into tensor operations in a more abstract form, thereby enabling developers to achieve hardware-specific optimizations with relatively low barriers.
Adapting to Multi-level Memory Architectures: Modern computing hardware, especially GPUs and specialized deep learning accelerators, typically employ multi-level memory architectures to optimize data access efficiency. Inspired by TVM's compute_at primitive, this system's Slice and Pad primitives help developers better manage data layout and migration across different storage levels, such as moving a large tensor from global memory to shared memory or registers. Such optimization techniques can significantly reduce memory access latency and increase cache utilization.
Flexible Data Type Conversion: As deep learning models grow increasingly complex, different stages of computation may require data representations of varying precision. The cast primitive in TVM inspired the design of the Convert primitive in this system, which supports flexible data type conversion. This capability allows models to optimize computational performance and memory usage while maintaining computational precision.
In summary, the design of tensor scheduling primitives is inspired by advanced computation frameworks and hardware architectures and informed by the high-level scheduling strategies found in TVM. Through these primitives, the system aims to provide efficient and flexible optimization paths for deploying deep learning models on various hardware.

Detailed Design of Tensor Scheduling Primitives
To achieve more efficient and flexible operations in data type scheduling and conversion, this paper proposes a set of tensor scheduling primitives. These primitives have the following functions:
- Slice: This scheduling primitive allows for extracting a smaller sub-tensor from a larger one. The design of this primitive mimics common problem decomposition strategies in hardware accelerators, such as dividing a large tensor into multiple registers or distributing a tensor in global memory into shared memory. The Slice primitive defines the mapping relationship of tensors across various storage levels, enabling efficient data access and processing. For example, the Slice primitive extracts a 4x4 portion from an 8x8 large \textit{ttile}.
- Pad: Tensor hardware acceleration typically requires operations to be designed for an entire vector or tensor. The problem size may not necessarily align with the size expected by tensor operations, potentially introducing branching logic when forced into tensorized computation, thereby reducing efficiency. The Pad primitive matches tensor dimensions to the integer multiple required by the hardware acceleration instruction, eliminating these branches and optimizing computational performance. For example, the Pad primitive pads an 8x7 \textit{ttile} to 8x8 to match the tensor acceleration instruction.
- TransformLayout: This scheduling primitive changes the data layout of a tensor at a specific storage level. Different computational stages may require different data layouts to optimize access efficiency and computational performance. By dynamically adjusting the data layout, the TransformLayout primitive helps the system better adapt to underlying hardware structures, enhancing data processing flexibility and efficiency. For example, the TransformLayout primitive scrambles four-element groups in an 8x8 \textit{ttile}.
- Convert: As deep learning models evolve with new computation paradigms, the system needs to convert data types dynamically. The Convert primitive is used for online data type conversion, ensuring that tensors can seamlessly transition between different representations during various computation stages, meeting different hardware requirements and optimization goals. For example, the Convert primitive converts an 8x8 \textit{ttile} of 4-bit fixed-point data type to a 16-bit floating-point \textit{ttile} of the same size.
The results of tensor scheduling will be reflected in a combination of these scheduling primitives, collectively achieving efficient tensor scheduling.
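The snippet below is a runnable illustration of the effect of each primitive on a concrete tile, using plain PyTorch tensors in place of \textit{ttile}s; the real primitives rewrite the program IR rather than manipulating tensors at run time.

import torch

tile = torch.arange(56, dtype=torch.float16).reshape(8, 7)

# Pad: 8x7 -> 8x8 so the shape matches the instruction's expected multiple.
padded = torch.nn.functional.pad(tile, (0, 1))

# Slice: extract a 4x4 sub-tile destined for a lower storage level.
sub = padded[:4, :4]

# Convert: change the element data type (here FLOAT16 -> FLOAT32 accumulation).
converted = sub.to(torch.float32)

# TransformLayout: store the tile under a different index mapping (transpose).
transformed = converted.t().contiguous()
print(transformed.shape)  # torch.Size([4, 4])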
Execution Sketch of Tensor Scheduling
A program execution sketch is an abstract representation in the compilation process that outlines how the compiler plans to execute a specific program or computational task. The sketch does not focus on detailed implementation but instead describes key decisions regarding execution strategies, data layouts, and computation patterns. The purpose is to provide the compiler with a clear direction and framework to guide subsequent optimization and code generation processes. The execution sketch plays several critical roles in the compilation and optimization process:
- It guides optimization decisions, clarifying computation patterns and data management strategies.
- By determining key strategies and constraints, it narrows the search space the compiler needs to explore during optimization.
- It modularizes optimization tasks, improving code reusability and maintainability.
- As a high-level plan, the sketch facilitates adjustments and iterations during the optimization process.
The result of tensor scheduling will be encapsulated as a tensor scheduling execution sketch, describing information such as tensor data layouts, data type conversions, and computation scheduling. This sketch will guide subsequent optimization and code generation processes in the compiler. The figure below illustrates the tensor scheduling sketch generation process for matrix multiplication under quantized large-model scenarios on NVIDIA's Ampere architecture GPUs. In the system, each level of memory access is represented by a TransformLoad/TransformStore function, with these functions encompassing the actual tensor scheduling primitives. Part (a) describes the tensor computation using \textit{ttype}: the left matrix A of the matrix multiplication is a tensor of type FLOAT16, and the right matrix B is a tensor of type NF4. This matrix multiplication uses the FLOAT32 data type for accumulation and eventually writes back to a FLOAT16 data type, all annotated using \textit{ttype}. This computation can be described flexibly, with the flexibility reflected in implicit data type conversions (e.g., NF4 to FLOAT16, FLOAT16 to FLOAT32), including the implicit use of FLOAT32 as the accumulation data type.
Additionally, this matrix multiplication problem is irregular, with M=32, N=32, and K=63, where K is not a power of 2. Therefore, the TransformLoad function requires the Pad primitive to match tensor dimensions to the integer multiple expected by the lower storage. (b) shows the tDevice memory hierarchy abstraction of the Ampere architecture and the expected memory access requirements at each level. Based on this tDevice abstraction and computation description, BitBLAS generates the scheduling computation pseudo-code shown in (c), progressively loading data through TransformLoad from outer storage levels to the innermost storage for computation and writing back to outer storage through TransformStore.
The data layout changes of input matrix B at each memory level are shown in (e). (d) defines the TransformLoad functions for matrix B at each storage level, fully utilizing the tensor scheduling primitives: the Pad primitive matches tensor dimensions to the expected integer multiple of the lower storage level; the Slice primitive extracts a small block of data from outer storage to the lower storage; the Convert function converts the tensor data type from NF4 to FLOAT16; and the TransformLayout primitive completes the data layout transformation. By applying this tensor scheduling strategy, the computationally intensive operators in the system have determined the types of storage they need to use, and the computational scheduling only needs to perform data access sequentially without considering tensor-related scheduling space.
The core implementation of BitBLAS consists of approximately 8,000 lines of Python and C++ code. It is based on or inspired by various open-source deep neural network compiler technologies, such as TVM, Welder, and Roller. TVM has been modified to implement computation scheduling and target code generation, while Roller's search space strategies assist in inferring efficient configurations. Welder, a state-of-the-art machine learning compiler, explores fine-grained operator fusion strategies on tiled computation graphs, providing comprehensive optimization for DNN models; this design leverages Welder to implement operator fusion.
The input to BitBLAS is a PyTorch program. For built-in PyTorch data types, BitBLAS requires no modifications to the DNN model program. Additionally, for new data types not supported by PyTorch, BitBLAS extends PyTorch with custom operators to express tensor expressions for user-defined data types. Given a PyTorch program, BitBLAS exports it to an ONNX computation graph. Simultaneously, BitBLAS extends ONNX to represent computations on new data types, with the tensor expressions for new data types stored in the attributes of ONNX graph nodes.
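The export step itself uses the standard PyTorch-to-ONNX path; a minimal sketch for a single linear layer is shown below (nothing here is BitBLAS-specific).

import torch

model = torch.nn.Linear(1024, 1024, bias=False).eval()
dummy = torch.randn(1, 1024)
# Trace the module and export it to ONNX; the resulting .onnx file is what the compiler consumes.
torch.onnx.export(model, dummy, "linear.onnx", input_names=["A"], output_names=["C"])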
Using the exported ONNX graph and a specialized format file based on tiling for the target hardware accelerator, BitBLAS automatically converts the ONNX graph into a tiled computation graph intermediate representation and performs tensor and computation scheduling. Subsequently, BitBLAS generates device code for the target hardware accelerator through TVM. The system has been implemented for NVIDIA and AMD GPUs, as they are the most popular and widely supported hardware in the neural network domain.
The system optimization process of BitBLAS begins by receiving a high-level neural network computation graph and converting it into a tiled computation graph intermediate representation. The optimization process is divided into tensor scheduling and computation scheduling. The tensor scheduler, combined with the tDevice hardware abstraction, selects an appropriate computation instruction for the target computation mode. This step uses an iterator-based automatic normalization algorithm, completing the tensor scheduling dimensions of data type conversion scheduling and tensor hardware mapping. The tensor scheduler then derives an optimal tensor data layout based on the target computation instruction, the current computation mode, and the preferred access patterns defined in tDevice for each memory level, thereby completing the data layout scheduling dimension. The results of the tensor scheduler are encapsulated in an intermediate data structure called hardware hints, which includes hardware abstraction information and the derived optimal data layout and guides the generation of the search space for the computation scheduler.
System Optimization
1. Automated Tensorization Expression Derivation Based on Iterators
Efficiently finding compatible hardware instructions and generating a search space for complex computations can be challenging. For instance, as shown in the convolution operator, there are seven nested loops, where four loops can be spatial iterators and three are reduction iterators. The target tensor computation unit performs a fixed-size matrix multiplication with three nested loops, including two spatial iterators and one reduction iterator. Mapping the seven-loop convolution to the three-loop matrix multiplication is complex. A direct convolution approach might choose three axes to map to the tensor accelerator's three axes, but this limits the scheduling space. This section proposes an automated tensorization algorithm to normalize any complex expression into a computing pattern compatible with hardware instruction sets, thereby reducing the complexity of the search space.
Automated tensorization derivation is also crucial for converting general computational expressions into hardware-optimized tensor operations. This process is essential for generalizing tensorized programs, allowing systems to convert programs that are not easily tensorizable, such as Conv1D, Conv2D in neural network computations, or Stencil in scientific computing, into fully tensorizable expressions. This mapping is vital for optimizing data access patterns in tensor operations, significantly improving computational efficiency.
The design takes convolution operators and AMD GPUs’ Matrix Cores as examples to illustrate the algorithm workflow.
Creating Classified Iterators Based on Target Instruction Expressions: Matrix Cores can complete matrix multiplication in a few cycles, simplified as C[I, J] += A[I, K] * B[J, K]. This design first traverses iterators across different caches, determining the iterator classification. For Matrix Core instructions, the classified iterators include:
- kIter_I: The I axis, appearing in both cache C and A.
- kIter_J: The J axis, appearing in both cache C and B.
- kIter_K: The K axis, appearing in both cache A and B.
For each iterator, the algorithm assigns a data structure, IterTrait, describing the iterator's type (\texttt{IterKind}) and its range (\texttt{IterRange}).
The core idea of automated tensorization derivation is to match the target hardware’s indexing mapping using an iterator classification algorithm for a given computational expression, ultimately generating a fused indexing mapping IndexMap.
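For the 2D convolution example, the classification effectively folds the seven convolution iterators onto the Matrix Core's three axes (an implicit-GEMM view). The sketch below assumes NCHW-style iterator names and only illustrates the fused mapping, not the algorithm's actual implementation.

# Spatial iterators: n, h, w (output position), f (output channel)
# Reduction iterators: c (input channel), r, s (kernel height/width)
def conv2d_to_mma_index_map(n, h, w, f, c, r, s, H, W, R, S):
    i = (n * H + h) * W + w    # kIter_I: fused spatial iterators
    j = f                      # kIter_J: output channels
    k = (c * R + r) * S + s    # kIter_K: fused reduction iterators
    return i, j, k

print(conv2d_to_mma_index_map(0, 1, 2, 5, 3, 0, 1, H=56, W=56, R=3, S=3))  # (58, 5, 28)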
Through this process, automated tensorization derivation generates fine-grained indexing mappings based on specific iterator traits and a predefined order. The 2D convolution in (a) can be matched to the Matrix Core expression in (b) through automated tensorization derivation, ultimately transforming into the data layout transformation and three-loop matrix multiplication in (d). Table 1 lists the mappings of seven 2D convolution operators in a classic convolutional neural network, ResNet-18, for a batch size of 128, as derived by this automated tensorization method on NVIDIA’s Ampere architecture. Note that the target tensor instructions in the table are abbreviations of actual hardware instructions, including the ldmatrix instruction for loading data from memory to registers and the mma instruction for computation.
The automated tensorization derivation process is crucial for achieving efficient tensor operations. It not only helps researchers determine if a computation can be compatible with target hardware instruction sets but also optimizes data access patterns, enhancing overall computational performance. More importantly, this approach simplifies the target constraints for search space generation to a single instruction set, avoiding the complexity of exploring the search space based on the original complex computational expressions. Thus, the search space for single operator optimization can be converted into a search space for single instruction optimization.
This section focuses on addressing the challenge of the large search space for tensor data layouts. There are two main difficulties in creating a search space for data layouts: the complexity of expressing layouts and the vastness of the search space. To precisely specify the location of data in storage, the IndexMap expressions used must describe a bijective transformation, meaning each data element must have a unique position in storage; this requires not only an accurate mapping but also the uniqueness and invertibility of the mapping. Constructing expressions that meet these criteria is difficult, and the search space is so large that exhaustively searching for the optimal data layout is infeasible. To address these core challenges, we propose an optimal layout derivation algorithm based on the tDevice abstraction introduced earlier, which describes hardware requirements. This algorithm reduces the size of the search space and simplifies the expression of data layouts, making the search process feasible.
Figure below illustrates an intuitive memory layout derivation scheme implemented by BitBLAS on the NVIDIA A100 GPU. On NVIDIA GPUs, data must go through two stages of transfer: from global memory to shared memory, and then from shared memory to registers. Based on this process, the optimal memory layout derivation is divided into two main parts, following a bottom-up approach:
- Deriving Data Layout from Registers to Shared Memory: This involves optimizing data layout within a thread block. Typically, there is a mismatch between the access pattern determined by computational instructions and the optimal access pattern supported by the GPU's shared memory (i.e., without bank conflicts). To address this issue, the strategy extends the derivation of access patterns outward, ensuring that data is accessed sequentially from shared memory to registers, maximizing the utilization of shared memory bandwidth.
- Deriving Data Layout from Shared Memory to Global Memory: This focuses on optimizing data layout between thread blocks. At this stage, the strategy has already optimized the access pattern from registers to shared memory to a sequential write mode, which in turn affects the global memory read process, often making it non-sequential. While NVIDIA's architecture allows for linear access within certain lengths in global memory, there are limitations in this optimization process. This is especially true when dealing with asymmetric access patterns (such as operations involving 16-bit floating-point numbers multiplied by 4-bit fixed-point numbers), where achieving optimal performance can be challenging. Therefore, the strategy further optimizes the access pattern at this level to ensure a more efficient linear correspondence between global memory reads and shared memory.
Through these two parts of memory derivation, the system ensures that the computational program always performs linear memory accesses, optimizing the memory access pattern for any given computational instructions and methods.
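A common concrete realization of such a derived shared-memory layout is an xor-based swizzle, sketched below as an IndexMap-style function. The exact mapping BitBLAS derives depends on the instruction shape and data type, so the parameters here are illustrative only.

def swizzled_index(row: int, col: int, vec: int = 8, banks: int = 8):
    # Permute 8-element vectors within a row so that the threads of a warp
    # hit different shared-memory banks when loading matrix fragments.
    return row, ((col // vec) ^ (row % banks)) * vec + (col % vec)

# Element (3, 10) of a tile is stored at column ((1 ^ 3) * 8) + 2 = 18.
print(swizzled_index(3, 10))  # (3, 18)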
Tuning with Tile Graph
In the field of deep neural networks (DNNs), data flow graphs are a common representation in which each node is a computationally intensive operation, such as matrix multiplication. The Tile Graph, introduced by Welder\cite{welder}, is a finer-grained form of computation flow graph, and BitBLAS builds on this concept with further optimizations. In BitBLAS, the neural network computation graph is first converted into a Relay computation graph containing operator information, which is then transformed into a Tile Graph containing tile information.
To represent multiple operators in a single Tile Graph, potential tile shape conflicts between two adjacent operators must first be resolved. Welder employs a method of backward propagation of output tile shapes to automatically derive aligned tile shapes. This process relies on the assumption that the computation logic within each operator can be accurately preserved at the tile granularity. Additionally, to determine which tile shape can provide the best performance, Welder aligns the computation mode with hardware characteristics to minimize data movement across all memory layers. For operators with aligned tile configurations, the data flow can be easily modeled based on the size of the input/output tiles and the shape of the input/output tensors.
Moreover, Welder discovered a key insight when considering the entire memory hierarchy: the optimization of memory traffic is essentially independent between different memory layers, a phenomenon known as inter-layer independence. Specifically, the aforementioned traffic model is only influenced by the tile configuration of the memory layer of interest. Based on these observations, Welder establishes an efficient process to optimize the entire space: starting with aligning two adjacent operators in the independent memory layer, determining the optimal tile size at the appropriate memory layer according to traffic cost, and extending this optimization to more operators.
BitBLAS incorporates Welder's core idea of the Tile Graph, a tile-level data flow graph model used to model DNN computation. In this model, each node processes a data tile of a tensor at a time, thus mapping DNN computation onto a multi-layer memory structure. The system allows control over the size of the data tile for each node and the desired memory layer for reuse between nodes. Specifically, Welder provides the SetConnect interface for setting data reuse layers and the Propagate interface for inferring tile configurations within a set of connected nodes.
To effectively optimize tile-level data flow scheduling, Welder leverages the inter-layer independence characteristic of data flow computation, decoupling the optimization space into multiple subspaces. Based on this, a two-layer scheduling strategy is designed, enumerating different memory connection options for each connection and determining an efficient tile configuration for each subspace guided by a traffic cost model. Ultimately, the optimized execution plan is mapped to executable code for specific hardware accelerators through four abstract computation interfaces defined at the hardware layer: Allocate, Load Tiles, Compute Tile, and Store Tiles.
With tile-level global data flow scheduling, the Tile Graph can unify all common operations fusion (e.g., register-based element-wise fusion, shared memory-based fusion, etc.) into a single framework. This generality not only allows the system to automatically discover various uncommon operation fusion patterns not explored by existing rule-based methods but also supports new requirements, such as handling DNN models with arbitrarily large inputs (e.g., high-resolution images), even when a single operation's scale may exceed the GPU memory limit. By extending the existing memory hierarchy to more levels (such as host memory), the system can generate an optimized execution data flow across the host and device memory hierarchy.
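A toy model of the two interfaces is sketched below; the method names follow the description above, but the implementation is a deliberately simplified stand-in rather than Welder's or BitBLAS's real code.

class TileNode:
    """A node in a tile graph; it processes one data tile of a tensor at a time."""

    def __init__(self, name, tile_shape=None):
        self.name = name
        self.tile_shape = tile_shape
        self.reuse_level = {}  # consumer name -> memory level chosen for reuse

    def set_connect(self, consumer, memory_level):
        # Choose the memory level ("register", "shared", ...) at which this
        # node's output tile is reused by the consumer.
        self.reuse_level[consumer.name] = memory_level

    def propagate(self, producer):
        # Back-propagate this node's output tile shape so adjacent tiles align.
        producer.tile_shape = self.tile_shape

matmul = TileNode("matmul")
relu = TileNode("relu", tile_shape=(128, 128))
matmul.set_connect(relu, "shared")  # fuse the two nodes through shared memory
relu.propagate(matmul)              # matmul now produces 128x128 tiles
print(matmul.tile_shape, matmul.reuse_level)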
End2End Examples
1. Matmul and GEMV Performance
Experimental Data Analysis: The figure above shows a performance comparison on AMD's MI210 data center GPU between AMD's rocBLAS compute library and similar compilation software. The rocBLAS library lacks complete Swizzle support, so its Matrix Core performance on the MI210 GPU only reaches about 43% of the theoretical hardware limit on average. In comparison, the similar compiler TensorIR achieves only about 24% of the theoretical performance. In contrast, BitBLAS not only significantly outperforms the manufacturer-provided compute library but also achieves up to 74% of the MI210's theoretical hardware performance.
Explanation of Experimental Results: NVIDIA’s compute libraries and hardware have undergone years of iterative optimization. However, other hardware manufacturers often do not have the same level of consideration in hardware design as NVIDIA. For example, while the AMD MI210 GPU provides a Matrix Core to compete with the A100’s Tensor Core, it lacks matching peripheral circuits, such as the \textit{ld.matrix} and \textit{cp.async} instructions introduced in the Ampere architecture. These instructions are essential for fully utilizing the hardware’s memory bandwidth and reducing the overhead on register files. The absence of these instructions in the MI210 may prevent its Matrix Core from achieving its full potential. From a software perspective, NVIDIA has developed comprehensive Swizzle strategies to resolve conflicts between upper-level compute memory access patterns and lower-level hardware memory expectations. However, judging by the performance of rocBLAS, this issue has not been fully addressed. Furthermore, while existing tensorization efforts support NVIDIA’s Tensor Cores, there is still a lack of support for AMD’s Matrix Cores.
BitBLAS's strategy makes it easier to explore full memory utilization, even though it does not achieve 100% hardware performance utilization in matrix multiplication. This can be reasonably explained by two factors. First, the absence of instructions that fully exploit hardware bandwidth means the Matrix Core cannot reach its full potential. Second, when processing actual random data, the high number of logic-level transitions (bit flips) leads to increased power consumption, causing the hardware to downclock to reduce heat. This typically results in a performance loss of over 10%, so a utilization rate close to 80% is likely near the hardware's practical maximum. Using NVIDIA's Nsight Compute tool, we locked the GPU's frequency and observed that BitBLAS's compute resource utilization averaged around 95%, fully utilizing the hardware's compute resources, consistent with our theoretical analysis.
2. End2End Performance
Experimental Data Analysis on AMD MI210 GPU: The experiment evaluated the end-to-end inference latency of BitBLAS on AMD GPUs, comparing it with Welder, PyTorch-Inductor, and ONNXRuntime. The figure shows the end-to-end performance for six models. In the $W_{\text{FP16}}A_{\text{FP16}}$ data type configuration, BitBLAS achieved average speedups of 2.1x on Llama, 2.35x on Bloom, 1.5x on ResNet, 13.5x on ShuffleNet, 1.6x on Conformer, and 1.5x on ViT. For large language models (LLMs) using the $W_{\text{INT4}}A_{\text{FP16}}$ data type configuration, BitBLAS delivered up to 3.8x speedup on Llama and 4.5x on Bloom compared to Welder.
3. Tuning Time for Operators
Experimental Data Analysis: The table shows a performance comparison on AMD's MI210 data center GPU between AMD's compute library rocBLAS and similar compilation software. The rocBLAS library lacks complete Swizzle support, so its Matrix Core performance on the MI210 GPU reaches only about 43% of the theoretical hardware limit on average. Similarly, the TensorIR compiler achieves only 24% of the theoretical performance. In contrast, BitBLAS not only significantly outperforms the manufacturer-provided compute library but also achieves up to 74% of the MI210's theoretical hardware performance.
Explanation of Experimental Results: NVIDIA's compute libraries and hardware have benefited from years of iterative optimization. However, other hardware manufacturers often do not have the same level of detailed design considerations. For example, the AMD MI210 GPU, despite offering a Matrix Core comparable to NVIDIA's A100 Tensor Core, lacks matching peripheral circuitry such as the \textit{ld.matrix} and \textit{cp.async} instructions introduced in the Ampere architecture. These instructions are crucial for fully utilizing the hardware's memory bandwidth and reducing register file overhead. The absence of these instructions in the MI210 likely prevents the Matrix Core from achieving its full potential.
From a software perspective, NVIDIA provides comprehensive Swizzle strategies to resolve conflicts between upper-level compute memory access patterns and the lower-level hardware memory expectations. However, based on rocBLAS's performance, it appears that these issues have not been fully resolved, and existing tensorization efforts, while supporting NVIDIA's Tensor Core, do not yet support AMD's Matrix Core.
BitBLAS's strategy facilitates more efficient exploration of memory utilization. Although BitBLAS did not achieve 100% hardware performance utilization in matrix multiplication, this can be reasonably explained. First, the lack of instructions to fully exploit hardware bandwidth limits the performance of the Matrix Core. Second, when processing real-world random data, the high number of logic-level transitions (bit flips) increases power consumption, leading the hardware to downclock to reduce heat, typically resulting in a performance loss of over 10%. Thus, a utilization rate close to 80% is likely near the practical maximum for this hardware. Using NVIDIA's Nsight Compute tool, we locked the GPU's frequency and observed that BitBLAS's compute resource utilization averaged around 95%, fully utilizing the hardware's computational resources, consistent with our theoretical analysis.
4. Tuning Time for End2End
Experimental Data Analysis: On the MI210 hardware platform, using ResNet50 as an example (with batch sizes of 1 and 128), BitBLAS achieved tuning speeds approximately 123x and 50x faster than AMOS, respectively. Even compared to TensorIR and Welder, BitBLAS showed nearly 20x speedups on some models, demonstrating a significant advantage in tuning efficiency, especially when handling large batch sizes. Although Ladder introduces an additional tensor scheduling search space, resulting in slightly slower tuning than Welder, its tuning time remains very acceptable compared to other automatic tensorization frameworks.
Explanation of Experimental Results: AMOS and TensorIR utilize Ansor-like search strategies, leading to an overly large search space for optimizing individual operators, which results in long search times across the entire network. In contrast, BitBLAS redesigned the search space for individual operators, significantly reducing the time required. Moreover, BitBLAS borrowed end-to-end search strategies from Welder to achieve good performance, while also expanding the search space to include tensor propagation and tensor conversion scheduling between operators. This results in a larger end-to-end search space than Welder’s, but the additional cost remains acceptable since the number of operators requiring tensor scheduling and propagation is limited. Overall, the experimental results for end-to-end model tuning time demonstrate BitBLAS’s advantage in tuning efficiency, which aligns with expectations.
Kernel Library
Installation Guide
Prerequisites
- Operating System: Linux (Ubuntu 20.04 or later is recommended for installation via wheel or PyPI; for other Linux distributions, see the Building from Source section)
- Python: >= 3.7
- CUDA: >= 10.0
Installing with pip
The easiest way to install BitBLAS is directly from PyPI using pip. To install the latest version, run the following command in your terminal.
Note: Currently, the BitBLAS wheel is only supported on Linux systems. We recommend Ubuntu 20.04 or a later version, as the wheel files are built on this platform. Wheels are currently provided only for CUDA >= 12.1 and Python >= 3.8. If you are using a different version of CUDA, you may need to build BitBLAS from source.
pip install bitblas
Alternatively, you may choose to install BitBLAS using prebuilt packages available on the Release Page:
pip install bitblas-0.0.0.dev0+ubuntu.20.4.cu120-py3-none-any.whl
After installing BitBLAS, you can verify the installation by running:
python -c "import bitblas; print(bitblas.__version__)"
Building from Source
We recommend using a docker container with the necessary dependencies to build BitBLAS from source. You can start such a container with the following command:
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.01-py3
To build and install BitBLAS directly from source, follow the steps below. This process requires certain prerequisites from Apache TVM, which can be installed on Ubuntu/Debian-based systems using the following commands:
sudo apt-get update
sudo apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
After installing the prerequisites, you can clone the BitBLAS repository and install it using pip:
git clone --recursive https://github.com/Microsoft/BitBLAS.git
cd BitBLAS
pip install . # Please be patient, this may take some time.
If you want to install BitBLAS in development (editable) mode, run the following command:
pip install -e .
Quick Start
BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication:
- bitblas.Matmul implements the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$, where $W_{wdtype}$ denotes the weight of type $wdtype$, $A_{adtype}$ the activation of type $adtype$, and $C_{cdtype}$ the output of type $cdtype$.
- bitblas.Linear is a PyTorch nn.Linear-like module that supports mixed-precision Linear layers.
Here is an example of a $W_{INT4}A_{FP16}$ mixed-precision matrix multiplication, $out_{FP16}[M, N] = A_{FP16}[M, K] \times W_{INT4}[N, K]$. The example includes the creation of input matrices, quantization of the weight matrix, and execution of the multiplication. The result is then compared against a reference result obtained through conventional methods to ensure accuracy.
import bitblas
import torch
# enabling debug output
bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
M=1, # M dimension
N=1024, # N dimension
K=1024, # K dimension
A_dtype="float16", # activation A dtype
W_dtype="int4", # weight W dtype
accum_dtype="float16", # accumulation dtype
out_dtype="float16", # output dtype
layout="nt", # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
with_bias=False, # bias
# configs for weight only quantization
group_size=None, # setting for grouped quantization
with_scaling=False, # setting for scaling factor
with_zeros=False, # setting for zeros
zeros_mode=None, # setting for how to calculate zeros
)
matmul = bitblas.Matmul(config=matmul_config)
# Create input matrices
input_tensor = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight_tensor = torch.randint(0, 7, (1024, 1024), dtype=torch.int8).cuda()
# Transform weight tensor to int4 data type
weight_tensor_int4 = matmul.transform_weight(weight_tensor)
# Perform mixed-precision matrix multiplication
output_tensor = matmul(input_tensor, weight_tensor_int4)
# Reference result using PyTorch matmul for comparison
ref_result = torch.matmul(input_tensor, weight_tensor.t().to(torch.float16))
# Assert that the results are close within the specified tolerances; with random int4 weights the accumulated outputs are large, so atol is set to 1.0
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
The same example can be extended to include the quantization of the weight tensor with scaling and zeros. The following code snippet demonstrates how to quantize the weight tensor with scaling and zeros and execute the mixed-precision matrix multiplication.
import bitblas
import torch
in_features = 1024
out_features = 1024
group_size = 128
matmul_config = bitblas.MatmulConfig(
M=1, # M dimension
N=out_features, # N dimension
K=in_features, # K dimension
A_dtype="float16", # activation A dtype
W_dtype="uint4", # weight W dtype
accum_dtype="float16", # accumulation dtype
out_dtype="float16", # output dtype
layout="nt", # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
with_bias=False, # bias
# configs for weight only quantization
group_size=group_size, # setting for grouped quantization
with_scaling=True, # setting for scaling factor
with_zeros=True, # setting for zeros
zeros_mode="original", # setting for how to calculating zeros
)
matmul = bitblas.Matmul(config=matmul_config)
# Define shapes for tensors
input_shape = (1, 1024)
weight_shape = (1024, 1024)
scaling_shape = (1024, 1024 // 128)
zeros_shape = (1024, 1024 // 128)
output_shape = (1, 1024)
# Create scaling and zeros tensors for quantization
scaling = torch.rand(scaling_shape, dtype=torch.float16).cuda()
zeros = torch.rand(zeros_shape, dtype=torch.float16).cuda()
# Create input tensor
input_tensor = torch.rand(input_shape, dtype=torch.float16).cuda()
# Create and transform weight tensor
weight_tensor = torch.randint(0, 7, weight_shape, dtype=torch.int8).cuda()
weight_tensor_int4 = matmul.transform_weight(weight_tensor)
# Perform mixed-precision matrix multiplication with quantization
output_tensor = matmul(input_tensor, weight_tensor_int4, scale=scaling, zeros=zeros)
rescaling_tensor = torch.zeros_like(weight_tensor, dtype=torch.float16).cuda()
# Compute reference result with manual scaling and zero-point adjustment
# rescale = (weight - zeros) * scaling
for i in range(in_features // group_size):
    for j in range(group_size):
        rescaling_tensor[:, i * group_size + j] = (
            weight_tensor[:, i * group_size + j].to(torch.float16) - zeros[:, i]
        ) * scaling[:, i]
ref_result = torch.matmul(input_tensor, rescaling_tensor.t().to(torch.float16))
# Assert that the results are close within a specified tolerance
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-2)
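The explicit loop above can equivalently be written in vectorized form (same arithmetic, shown only as an alternative way to build the reference, assuming in_features is a multiple of group_size):

# Expand each group's zero point and scale across its group_size columns.
zeros_full = zeros.repeat_interleave(group_size, dim=1)       # (1024, 1024)
scaling_full = scaling.repeat_interleave(group_size, dim=1)   # (1024, 1024)
rescaling_vec = (weight_tensor.to(torch.float16) - zeros_full) * scaling_full
ref_result_vec = torch.matmul(input_tensor, rescaling_vec.t())
torch.testing.assert_close(ref_result_vec, ref_result, rtol=1e-2, atol=1e-2)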
The initialization stage of the bitblas.Matmul class can take a few minutes, as it uses hardware information to perform a one-time kernel library initialization.
BitBLAS also implements a PyTorch nn.Linear-like module, bitblas.Linear, to support mixed-precision Linear layers (see the code implementation). Here is an example defining a bitblas.Linear of $W_{INT4}A_{FP16}$:
import bitblas
import torch
# enabling debug output
bitblas.set_log_level("Debug")
model = bitblas.Linear(
in_features=1024,
out_features=1024,
bias=False,
A_dtype="float16", # activation A dtype
W_dtype="int4", # weight W dtype
accum_dtype="float16", # accumulation dtype
out_dtype="float16", # output dtype
# configs for weight only quantization
group_size=None, # setting for grouped quantization
with_scaling=False, # setting for scaling factor
with_zeros=False, # setting for zeros
zeros_mode=None, # setting for how to calculate zeros
# Target optimization var for dynamic symbolic.
# For detailed information please checkout docs/PythonAPI.md
# By default, the optimization var is [1, 16, 32, 64, 128, 256, 512]
opt_M=[1, 16, 32, 64, 128],
)
# Create an integer weight tensor
intweight = torch.randint(-7, 7, (1024, 1024), dtype=torch.int8)
# Load and transform weights into the BitBLAS linear module
model.load_and_transform_weight(intweight)
# Save the state of the model
torch.save(model.state_dict(), "./model.pth")
# Load the model state
model.load_state_dict(torch.load("./model.pth"))
# Set the model to evaluation mode
model.eval()
# Create a dummy input tensor
dummy_input = torch.randn(1, 1024, dtype=torch.float16)
# Perform inference
output = model(dummy_input)
print("BitBLAS output:", output)
# Please checkout the correctness evaluation code in `testing/python/module/test_bitblas_linear.py`
We also provide a repack interface to convert pretrained AutoGPTQ weights into the BitBLAS format. Here is an example:
# !pip install auto-gptq
import bitblas
import torch
from auto_gptq.nn_modules.qlinear.qlinear_cuda_old import (
QuantLinear as CudaOldQuantLinear,
)
# enabling debug output
bitblas.set_log_level("Debug")
in_features = 1024
out_features = 1024
group_size = 128
original_w, linear, s, qw = bitblas.quantization.gen_quant4(
in_features, out_features, group_size
)
zeros = torch.full((in_features // group_size, out_features), 7, dtype=torch.int32)
cuda_old_linear = CudaOldQuantLinear(
bits=4,
group_size=group_size,
infeatures=in_features,
outfeatures=out_features,
bias=False,
)
cuda_old_linear.pack(linear, s.T, zeros.T, g_idx=None)
bitblas_linear = bitblas.Linear(
in_features=in_features,
out_features=out_features,
bias=False,
A_dtype="float16", # activation A dtype
W_dtype="uint4", # weight W dtype
accum_dtype="float16", # accumulation dtype
out_dtype="float16", # output dtype
# configs for weight only quantization
group_size=group_size, # setting for grouped quantization
with_scaling=True, # setting for scaling factor
with_zeros=True, # setting for zeros
zeros_mode="quantized", # setting for how to calculating zeros
)
# Repack weights from CudaOldQuantLinear to BitBLAS linear module
bitblas_linear.repack_from_gptq(cuda_old_linear)
# Prepare input data
m = 1 # Batch size
inp = torch.rand(m, in_features, dtype=torch.float16, device="cuda")
# Move models to CUDA for execution
cuda_old_linear = cuda_old_linear.to("cuda")
bitblas_linear = bitblas_linear.to("cuda")
# Perform inference without gradient calculations
with torch.no_grad():
    res_cuda_old = cuda_old_linear(inp)
    res_bitblas = bitblas_linear(inp)
print("CudaOldQuantLinear output:", res_cuda_old)
print("BitBLAS output:", res_bitblas)
# Verify the outputs are close within specified tolerances
torch.testing.assert_close(res_bitblas, res_cuda_old, rtol=1e-0, atol=1e-1)