Accelerate BiT Model Even More with Quantization Using OpenVINO and NNCF

Enhance BiT model inference using OpenVINO, the Neural Network Compression Framework (NNCF), and low-precision (INT8) inference.

Sponsored by Intel
11 months ago | Machine Learning & AI

1. Introduction

In the first part of this blog series, we discussed how to use Intel®’s OpenVINO™ toolkit to accelerate inference of the Big Transfer (BiT) model for computer vision tasks. We covered importing the BiT model into the OpenVINO environment, leveraging hardware optimizations, and benchmarking performance. Our results showed significant performance gains and reduced inference latency for BiT with OpenVINO compared to the original TensorFlow implementation. Even with this strong baseline, there is room for further optimization. In this second part, we further enhance BiT model inference using OpenVINO, the Neural Network Compression Framework (NNCF), and low-precision (INT8) inference. NNCF provides tools for neural network compression through quantization, pruning, and sparsity techniques tailored for deep learning inference. This makes BiT models viable in power- and memory-constrained environments where the original model size may be prohibitive. The techniques presented apply to many deep learning models beyond BiT.

2. Model Quantization

Model quantization is an optimization technique that reduces the precision of the weights and activations in a neural network. It converts 32-bit floating-point (FP32) representations to lower bit-widths such as 16-bit floats (FP16), 8-bit integers (INT8), or 4-bit integers (INT4). The key benefit is efficiency: a smaller model and faster inference. These gains help on server platforms, but more importantly they enable deployment on resource-constrained edge devices with limited compute or memory. Models that were once restricted to data centers can then run on low-power hardware, which extends the reach of AI to the edge.

Below are a few of the key model quantization concepts:

  • Precision reduction: Decreases the number of bits used to represent weights and activations. Common bit-widths are INT8 and FP16. Fewer bits mean smaller models.
  • Efficiency: Compressed models are smaller and faster, so they use system resources more efficiently.
  • Trade-offs: Compression, speed, and accuracy must be balanced for the target hardware. The goal is to optimize across all three.
  • Techniques: Post-training quantization and quantization-aware training. The latter builds resilience to lower precision into the model during training.
  • Schemes: Strategies such as weight-only, activation-only, or combined quantization strike a balance between compressing the model and preserving accuracy.
  • Preserving accuracy: Fine-tuning, calibration, and retraining help maintain model quality on real-world data.
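
To make precision reduction concrete, here is a minimal sketch of asymmetric (affine) INT8 quantization in plain Python. It is an illustration of the general idea only, not NNCF's implementation; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to signed 8-bit integers with a scale and a zero point."""
    qmin, qmax = -128, 127
    # The representable range must contain 0.0 so that zero maps exactly.
    lo, hi = min(values + [0.0]), max(values + [0.0])
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate float values from the 8-bit representation."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the precision given up in exchange for storing one byte per value instead of four.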

3. Neural Network Compression Framework (NNCF)

NNCF is a powerful tool for optimizing deep learning models, such as the Big Transfer (BiT) model, to achieve improved performance on various hardware, ranging from edge to data center. It provides a comprehensive set of features and capabilities for model optimization, making it easy for developers to optimize models for low-precision inference. Some of the key capabilities include:

  • Support for a variety of post-training and training-time algorithms with minimal accuracy drop.
  • Seamless combination of pruning, sparsity, and quantization algorithms.
  • Support for a variety of models: NNCF can be used to optimize models from a variety of frameworks, including TensorFlow, PyTorch, ONNX, and OpenVINO.

NNCF provides samples that demonstrate the usage of compression algorithms for different use cases and models. See compression results achievable with the NNCF-powered samples on the Model Zoo page. For more details refer to this GitHub repo.

4. BiT Classification Model Optimization with OpenVINO™

Note: Before proceeding with the following steps, ensure you have a conda environment set up. Refer to this blog post for detailed instructions on setting up the conda environment.

4.1. Download BiT_M_R50x1_1 tf classification model:

wget "https://tfhub.dev/google/bit/m-r50x1/1?tf-hub-format=compressed" \
  -O bit_m_r50x1_1.tar.gz

mkdir -p bit_m_r50x1_1 && tar -xvf bit_m_r50x1_1.tar.gz -C bit_m_r50x1_1

4.2. OpenVINO™ Model Optimization:

Execute the below command inside the conda environment to generate OpenVINO IR model files (.xml and .bin) for the bit_m_r50x1_1 model. These model files will be used for further optimization and for inference accuracy validation in subsequent sections.

ovc ./bit_m_r50x1_1 --output_model ./bit_m_r50x1_1/ov/fp32/bit_m_r50x1_1 \
  --compress_to_fp16 False
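
The same conversion can also be done from Python. The sketch below assumes OpenVINO 2023.1 or later, where `openvino.convert_model` and `openvino.save_model` are available. The helper names `ir_xml_path` and `convert_saved_model` are illustrative and not part of the original workflow.

```python
from pathlib import Path


def ir_xml_path(model_dir: str, precision: str = "fp32") -> Path:
    """Build the IR .xml path layout used in this post: <dir>/ov/<precision>/<name>.xml."""
    root = Path(model_dir)
    return root / "ov" / precision / f"{root.name}.xml"


def convert_saved_model(model_dir: str) -> Path:
    """Convert a TensorFlow SavedModel directory to an FP32 OpenVINO IR."""
    # Imported lazily so the path helper can be used without OpenVINO installed.
    import openvino as ov

    xml_path = ir_xml_path(model_dir, "fp32")
    xml_path.parent.mkdir(parents=True, exist_ok=True)
    ov_model = ov.convert_model(model_dir)
    # compress_to_fp16=False keeps FP32 weights, matching the ovc command above.
    ov.save_model(ov_model, str(xml_path), compress_to_fp16=False)
    return xml_path
```

Calling `convert_saved_model("./bit_m_r50x1_1")` should write the `.xml` and `.bin` files to the same location as the command-line example.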

5. Data Preparation

To evaluate the accuracy impact of quantization on our BiT model, we need a suitable dataset. For this, we leverage the ImageNet 2012 validation set which contains 50,000 images across 1000 classes. The ILSVRC2012 validation ground truth is used for cross-referencing model predictions during accuracy measurement.

By testing our compressed models on established data like ImageNet validation data, we can better understand the real-world utility of our optimizations. Maintaining maximal accuracy while minimizing resource utilization is crucial for edge deployment. This dataset provides the rigorous and unbiased means to effectively validate those trade-offs.
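
Cross-referencing predictions against the ground truth means pairing each image with its label. The sketch below assumes the standard validation file naming (`ILSVRC2012_val_00000001.JPEG`) and a ground-truth file with one integer label per line, where line N belongs to image N. The function names and the sample labels are hypothetical.

```python
import re
from typing import List, Sequence

# Validation images carry their 1-based index in the file name.
_VAL_NAME = re.compile(r"ILSVRC2012_val_(\d{8})\.JPEG$")


def image_index(path: str) -> int:
    """Extract the 1-based image index from a validation image path."""
    match = _VAL_NAME.search(path)
    if match is None:
        raise ValueError(f"Not an ImageNet 2012 validation file name: {path}")
    return int(match.group(1))


def labels_for_images(paths: Sequence[str], ground_truth_lines: Sequence[str]) -> List[int]:
    """Return the ground-truth label for each path, in the order of `paths`."""
    return [int(ground_truth_lines[image_index(p) - 1].strip()) for p in paths]


# Hypothetical three-line ground-truth file and two image paths in arbitrary order.
gt_lines = ["490\n", "361\n", "171\n"]
paths = ["/data/ILSVRC2012_val_00000003.JPEG", "/data/ILSVRC2012_val_00000001.JPEG"]
labels = labels_for_images(paths, gt_lines)
```

Because the label is looked up by the index in the file name, the order in which files are listed does not matter.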

Note: Accessing and downloading the ImageNet dataset requires registration.

6. Quantization Using NNCF

In this section, we will delve into the specific steps involved in quantizing the BiT model using NNCF. The quantization process involves preparing a calibration dataset and applying 8-bit quantization to the model, followed by accuracy evaluation.

6.1. Preparing Calibration Dataset:

At this step, create an instance of the nncf.Dataset class that represents the calibration dataset. The nncf.Dataset class can be a wrapper over the framework dataset object used for model training or validation. Below is a sample code snippet of nncf.Dataset() call with transformed data samples.

# TF Dataset split for nncf calibration
img2012_val_split = get_val_data_split(tf_dataset_,
                                       train_split=0.7,
                                       val_split=0.3,
                                       shuffle=True,
                                       shuffle_size=50000)

img2012_val_split = img2012_val_split.map(nncf_transform).batch(BATCH_SIZE)

calibration_dataset = nncf.Dataset(img2012_val_split)

The transformation function takes a sample from the dataset and returns data that can be passed to the model for inference. Below is the data transform used here.

# Data transform function for NNCF calibration
def nncf_transform(image, label):
    image = tf.io.decode_jpeg(tf.io.read_file(image), channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    return image
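
The calibration split above shuffles the validation set, skips the first 70%, and keeps the remaining 30% (15,000 of the 50,000 images). The plain-Python sketch below mimics that skip-then-take logic on an ordinary list. It is not the TensorFlow code, the name `calibration_split` is hypothetical, and Python's shuffle will not reproduce the same order as `tf.data`.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def calibration_split(items: Sequence[T], train_split: float = 0.7,
                      val_split: float = 0.3, seed: int = 12) -> List[T]:
    """Shuffle, skip the first `train_split` share, keep the next `val_split` share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    skip = int(train_split * len(shuffled))
    take = int(val_split * len(shuffled))
    return shuffled[skip:skip + take]


subset = calibration_split(list(range(8)), train_split=0.75, val_split=0.25)
```

Fixing the seed keeps the calibration subset reproducible between runs, which makes quantization results easier to compare.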

6.2. NNCF Quantization (FP32 to INT8):

Once the calibration dataset is prepared and the model object is instantiated, the next step is to apply 8-bit quantization to the model. This is done with the nncf.quantize() API, which takes the OpenVINO FP32 model generated in the previous steps together with the calibration dataset. Although nncf.quantize() exposes many advanced configuration options, in many cases, including this one, it works out of the box or with minor adjustments. Below is a sample nncf.quantize() call.

ov_quantized_model = nncf.quantize(ov_model,
                                   calibration_dataset,
                                   fast_bias_correction=False)

For further details, the official documentation provides a comprehensive guide on the basic quantization flow, including setting up the environment, preparing the calibration dataset, and calling the quantization API to apply 8-bit quantization to the model.
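
As an illustration of the configuration options mentioned above, the sketch below passes a few commonly used nncf.quantize() parameters and supplies the transform through nncf.Dataset instead of mapping it beforehand. It assumes `nncf` and `openvino` (2023.1 or later) are installed and that `samples` is a re-iterable collection of `(image, label)` pairs whose images already match the model input. The wrapper names `drop_label` and `quantize_ir` and the chosen values are assumptions, not the settings used for the results in this post.

```python
from typing import Any, Iterable, Tuple


def drop_label(sample: Tuple[Any, Any]) -> Any:
    """Transform for nncf.Dataset: keep only the model input from (image, label)."""
    image, _label = sample
    return image


def quantize_ir(fp32_xml: str, int8_xml: str,
                samples: Iterable[Tuple[Any, Any]], subset_size: int = 300):
    """Quantize an FP32 IR to INT8 and save it (sketch; needs nncf and openvino)."""
    # Lazy imports: both packages are heavy, optional dependencies.
    import nncf
    import openvino as ov

    model = ov.Core().read_model(fp32_xml)
    calibration = nncf.Dataset(samples, drop_label)
    quantized = nncf.quantize(
        model,
        calibration,
        preset=nncf.QuantizationPreset.MIXED,  # asymmetric activations, symmetric weights
        subset_size=subset_size,               # number of calibration samples to use
        fast_bias_correction=False,            # slower but more accurate bias correction
    )
    ov.save_model(quantized, int8_xml)
    return quantized
```

A larger `subset_size` generally gives more representative activation statistics at the cost of a longer calibration run.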

6.3. Accuracy Evaluation

The NNCF quantization process produces an OpenVINO INT8 model. To evaluate the impact of quantization on accuracy, we compare the original FP32 model with the quantized INT8 model by measuring the accuracy of the BiT model (m-r50x1/1) on the ImageNet 2012 validation dataset. The accuracy evaluation results are shown in Table 1.

When the TensorFlow (FP32) model is converted to an OpenVINO™ (FP32) model, classification accuracy remains at 0.70154, confirming that conversion to the OpenVINO™ representation does not affect accuracy. After NNCF quantization to an 8-bit integer model, accuracy drops by less than 0.03%, showing that quantization did not meaningfully compromise the model’s classification ability.
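
For reference, top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. The sketch below shows that computation and the resulting accuracy drop in plain Python. The function names and the toy predictions are hypothetical and are not results from this evaluation.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the ground-truth label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def accuracy_drop(fp32_accuracy: float, int8_accuracy: float) -> float:
    """Absolute accuracy lost by quantization (positive means INT8 is worse)."""
    return fp32_accuracy - int8_accuracy


# Toy data only: four images, one of which the INT8 model gets wrong.
labels = [1, 2, 3, 4]
fp32_acc = top1_accuracy([1, 2, 3, 4], labels)
int8_acc = top1_accuracy([1, 2, 0, 4], labels)
drop = accuracy_drop(fp32_acc, int8_acc)
```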

Refer to Appendix A for the Python script bit_ov_model_quantization.py, which includes data preparation, model optimization, NNCF quantization tasks, and accuracy evaluation.

The usage of the bit_ov_model_quantization.py script is as follows:

$ python bit_ov_model_quantization.py --help
usage: bit_ov_model_quantization.py [-h] [--inp_shape INP_SHAPE] --dataset_dir DATASET_DIR --gt_labels GT_LABELS --bit_m_tf BIT_M_TF --bit_ov_fp32 BIT_OV_FP32
                                    [--bit_ov_int8 BIT_OV_INT8]

BiT Classification model quantization and accuracy measurement

required arguments:
  --dataset_dir DATASET_DIR
                        Directory path to ImageNet2012 validation dataset
  --gt_labels GT_LABELS
                        Path to ImageNet2012 validation ds gt labels file
  --bit_m_tf BIT_M_TF   Path to BiT TF fp32 model file
  --bit_ov_fp32 BIT_OV_FP32
                        Path to BiT OpenVINO fp32 model file

optional arguments:
  -h, --help            show this help message and exit
  --inp_shape INP_SHAPE
                        N,W,H,C
  --bit_ov_int8 BIT_OV_INT8
                        Path to save BiT OpenVINO INT8 model file

7. Conclusion

The results demonstrate that OpenVINO™ and NNCF can improve model efficiency while reducing computational requirements. Compressing the BiT model to INT8 precision retained its accuracy, which shows that OpenVINO™ is practical for deployment in a range of environments, including resource-constrained ones. NNCF proves to be a valuable tool for practitioners seeking to balance model size and computational efficiency without a substantial compromise in classification accuracy, opening avenues for model deployment across diverse hardware configurations.

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details.

No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Additional Resources

Appendix A

  • Software configuration:
  • ILSVRC2012 ground truth: ground_truth_ilsvrc2012_val.txt
  • See bit_ov_model_quantization.py below for the BiT model quantization pipeline with NNCF described in this blog.
"""
Copyright (c) 2022 Intel Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

"""
This script is tested with TensorFlow v2.12.1 and OpenVINO v2023.1.0

Usage Example below (with required parameters):

python bit_ov_model_quantization.py
--gt_labels ./<path_to>/ground_truth_ilsvrc2012_val.txt
--dataset_dir ./<path-to-dataset>/ilsvrc2012_val_ds/
--bit_m_tf ./<path-to-tf>/model
--bit_ov_fp32 ./<path-to-ov>/fp32_ir_model

"""

import argparse
import logging
import os
import re
import sys

# Must be set before TensorFlow is imported to take effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import nncf
import numpy as np
import openvino.runtime as ov
import pandas as pd
import tensorflow as tf
from openvino.runtime import Core
from sklearn.metrics import accuracy_score

logging.basicConfig(level=logging.ERROR)

ie = Core()
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

# For top 1 labels.
MAX_PREDS = 1
BATCH_SIZE = 1
IMG_SIZE = (224, 224) # Default Imagenet image size
NUM_CLASSES = 1000 # Number of ImageNet 2012 classes

# Data transform function for NNCF calibration
def nncf_transform(image, label):
    image = tf.io.decode_jpeg(tf.io.read_file(image), channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    return image

# Data transform function for imagenet ds validation
def val_transform(image_path, label):
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    img_reshaped = tf.reshape(image, [IMG_SIZE[0], IMG_SIZE[1], 3])
    image = tf.image.convert_image_dtype(img_reshaped, tf.float32)
    return image, label

# Validation dataset split
def get_val_data_split(tf_dataset_, train_split=0.7, val_split=0.3,
                       shuffle=True, shuffle_size=50000):
    # Fall back to the unshuffled dataset so that `ds` is always defined
    ds = tf_dataset_.shuffle(shuffle_size, seed=12) if shuffle else tf_dataset_

    train_size = int(train_split * shuffle_size)
    val_size = int(val_split * shuffle_size)
    # Skip the first `train_split` share and keep the next `val_split` share
    val_ds = ds.skip(train_size).take(val_size)

    return val_ds

# OpenVINO IR model inference validation
def ov_infer_validate(model: ov.Model,
                      val_loader: tf.data.Dataset) -> list:

    model.reshape([1, IMG_SIZE[0], IMG_SIZE[1], 3]) # If MO ran with Dynamic batching
    compiled_model = ov.compile_model(model)
    output = compiled_model.outputs[0]

    # Collect the top-MAX_PREDS class indices for every image
    ov_predictions = []
    for img, label in val_loader:
        pred = compiled_model(img)[output]
        ov_result = tf.reshape(pred, [-1])
        top_label_idx = np.argsort(ov_result)[-MAX_PREDS::][::-1]
        ov_predictions.append(top_label_idx)

    return ov_predictions


# OpenVINO IR model NNCF Quantization
def quantize(ov_model, calibration_dataset):
    print("Started NNCF quantization process")
    ov_quantized_model = nncf.quantize(ov_model, calibration_dataset, fast_bias_correction=False)
    return ov_quantized_model

# OpenVINO FP32 IR model inference
def ov_fp32_predictions(ov_fp32_model, validation_dataset):
    # Load and compile the OV model
    ov_model = ie.read_model(ov_fp32_model)
    print("Starting OV FP32 Model Inference...!!!")
    ov_fp32_pred = ov_infer_validate(ov_model, validation_dataset)
    return ov_fp32_pred

def nncf_quantize_int8_pred_results(ov_fp32_model, calibration_dataset,
                                    validation_dataset, ov_int8_model):

    # Load and compile the OV model
    ov_model = ie.read_model(ov_fp32_model)

    # NNCF Quantization of OpenVINO IR model
    int8_ov_model = quantize(ov_model, calibration_dataset)
    ov.serialize(int8_ov_model, ov_int8_model)
    print("NNCF Quantization Process completed..!!!")

    ov_int8_model = ie.read_model(ov_int8_model)
    print("Starting OV INT8 Model Inference...!!!")
    ov_int8_pred = ov_infer_validate(ov_int8_model, validation_dataset)

    return ov_int8_pred

def tf_inference(tf_saved_model_path, val_loader: tf.data.Dataset):

    tf_model = tf.keras.models.load_model(tf_saved_model_path)
    print("Starting TF FP32 Model Inference...!!!")
    tf_predictions = []
    for img, label in val_loader:
        tf_result = tf_model.predict(img, verbose=0)
        tf_result = tf.reshape(tf_result, [-1])
        # Top-MAX_PREDS class indices (top-1 with the default settings)
        top_label_idx = np.argsort(tf_result)[-MAX_PREDS::][::-1]
        tf_predictions.append(top_label_idx)

    return tf_predictions


"""
Module: bit_classification
Description: API to run INT8 quantization of the BiT classification OpenVINO IR model using NNCF and
compute accuracy metrics for TF FP32, OV FP32 and OV INT8 on the ImageNet2012 validation dataset
"""
def bit_classification(args):

    ip_shape = args.inp_shape
    if isinstance(ip_shape, str):
        ip_shape = [int(i) for i in ip_shape.split(",")]
        if len(ip_shape) != 4:
            sys.exit("Input shape error. Set shape 'N,W,H,C'. For example: '1,224,224,3' ")

    # Imagenet2012 validation dataset used for TF and OV FP32 accuracy testing.
    #dataset_dir = ../dataset/ilsvrc2012_val/1.0/ + "*.JPEG"
    dataset_dir = os.path.join(args.dataset_dir, "*.JPEG")
    tf_dataset = tf.data.Dataset.list_files(dataset_dir)

    # One ground-truth label per line, in image-index order
    with open(args.gt_labels) as gt_labels:
        val_labels = [line.strip() for line in gt_labels]

    # Generating ImageNet 2012 validation dataset dictionary (img, label)
    val_images = []
    val_labels_in_img_order = []
    for v in tf_dataset:
        img_path = v.numpy().decode("utf-8")
        # File names end in the 1-based image index, e.g. ..._00000001.JPEG
        img_id = int(img_path.split('/')[-1].split('_')[-1].split('.')[0])
        val_images.append(img_path)
        val_labels_in_img_order.append(int(val_labels[img_id - 1]))

    val_df = pd.DataFrame(data={'images': val_images, 'label': val_labels_in_img_order})

    # Converting imagenet2012 val dictionary into tf.data.Dataset
    tf_dataset_ = tf.data.Dataset.from_tensor_slices((list(val_df['images'].values), val_df['label'].values))
    imgnet2012_val_dataset = tf_dataset_.map(val_transform).batch(BATCH_SIZE)

    # TF Dataset split for nncf calibration
    img2012_val_split_for_calib = get_val_data_split(tf_dataset_, train_split=0.7,
                                                     val_split=0.3, shuffle=True,
                                                     shuffle_size=50000)

    img2012_val_split_for_calib = img2012_val_split_for_calib.map(nncf_transform).batch(BATCH_SIZE)

    # TF Model Inference
    tf_model_path = args.bit_m_tf
    print(f"Tensorflow FP32 Model {args.bit_m_tf}")
    tf_p = tf_inference(tf_model_path, imgnet2012_val_dataset)

    acc_score = accuracy_score(tf_p, val_labels_in_img_order)
    print(f"Accuracy of FP32 TF model = {acc_score}\n")

    # OpenVINO Model Inference
    print(f"OpenVINO FP32 IR Model {args.bit_ov_fp32}")
    ov_fp32_p = ov_fp32_predictions(args.bit_ov_fp32, imgnet2012_val_dataset)

    acc_score = accuracy_score(ov_fp32_p, val_labels_in_img_order)
    print(f"Accuracy of FP32 IR model = {acc_score}\n")

    print("Starting NNCF dataset Calibration....!!!")
    calibration_dataset = nncf.Dataset(img2012_val_split_for_calib)

    # OpenVINO IR FP32 to INT8 Model Quantization with NNCF and
    # INT8 predictions results on validation dataset
    ov_int8_p = nncf_quantize_int8_pred_results(args.bit_ov_fp32, calibration_dataset,
                                                imgnet2012_val_dataset, args.bit_ov_int8)

    print(f"OpenVINO NNCF Quantized INT8 IR Model {args.bit_ov_int8}")
    acc_score = accuracy_score(ov_int8_p, val_labels_in_img_order)
    print(f"Accuracy of INT8 IR model = {acc_score}\n")

    #acc_score = accuracy_score(tf_p, ov_fp32_p)
    #print(f"TF Vs OV FP32 Accuracy Score = {acc_score}")

    #acc_score = accuracy_score(ov_fp32_p, ov_int8_p)
    #print(f"OV FP32 Vs OV INT8 Accuracy Score = {acc_score}")

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="BiT Classification model quantization and accuracy measurement")
    optional = parser._action_groups.pop()
    required = parser.add_argument_group("required arguments")
    optional.add_argument("--inp_shape", type=str, help="N,W,H,C", default="1,224,224,3", required=False)
    required.add_argument("--dataset_dir", type=str, help="Directory path to ImageNet2012 validation dataset", required=True)
    required.add_argument("--gt_labels", type=str, help="Path to ImageNet2012 validation ds gt labels file", required=True)
    required.add_argument("--bit_m_tf", type=str, help="Path to BiT TF fp32 model file", required=True)
    required.add_argument("--bit_ov_fp32", type=str, help="Path to BiT OpenVINO fp32 model file", required=True)
    optional.add_argument("--bit_ov_int8", type=str, help="Path to save BiT OpenVINO INT8 model file",
                          default="./bit_m_r50x1_1/ov/int8/saved_model.xml", required=False)
    parser._action_groups.append(optional)

    args = parser.parse_args()
    bit_classification(args)