The main goal of this project is to guide an operator through an assembly process and alert them when an error occurs. YWIL works as a Poka-yoke mechanism, preventing, avoiding, detecting and correcting human mistakes.
In the following sections, I will describe step by step how YWIL uses the power of Computer Vision to carry out its mission with the help of the Intel Distribution of OpenVINO Toolkit.
Step 1: Making a simple assembly kit
The journey begins with the simulation of an industrial manufacturing environment. In this case, I used 3D modeling software to design a kind of plastic machine. From now on, I will refer to it as the assembly kit.
The assembly kit consists of several pieces of different shapes and sizes. Each part will be placed in an exact position at a certain point in the assembly process.
After designing the whole mechanism, it is time to export the 3D models to files in .STL format. Then, these files are converted to G-code. Finally, we are ready for 3D printing.
For this demo, I have used randomly colored filaments for the different parts, but it would be a good challenge for the Deep Learning model if all the objects were the same color.
You can find a link to the ready-to-use files in the Custom parts section at the bottom of this page.
Step 2: Capturing images for training
The camera model I have used includes a depth sensor, but only its RGB camera is needed here. You can use any type of standard webcam for this project.
To make training easier, all images are captured from the same point of view. Therefore, it is important to keep the camera fixed in an exact position. I have used a mini tripod about 15 centimeters high, placed very close to the base of the assembly kit, and I have positioned the camera at an angle of approximately 60 degrees to frame the entire mechanism.
You can find an image of the camera and base setup in the Schematics section at the bottom of this page.
In the project code, which will be explained later, there is a flag to indicate that we are going to capture and save images to train our Machine Learning model:
private const bool capturingFlag = true;
This C# method is in charge of converting the camera stream into an image file and saving it on our system:
private void SaveStreamAsPngFile(Stream inputStream)
While the YWIL application is running in training mode, the captured images are saved in a folder called YWIL_YouWorkItLooks within the Pictures directory. Then, it only remains to carefully arrange the images in different folders. Each folder corresponds to a step in the process or to a label that we want to detect.
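If you prefer to script that housekeeping step, here is a minimal sketch of how the captured images could be moved into one folder per label; the helper and the file and label names are only illustrative, they are not part of the YWIL code:
import shutil
from pathlib import Path

# Hypothetical helper: move each reviewed capture into the folder of its label,
# creating that folder the first time it is used (all names are examples).
def sort_captures(pictures_dir, assignments):
    for filename, label in assignments.items():
        target = Path(pictures_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(Path(pictures_dir) / filename), str(target / filename))

sort_captures(Path.home() / 'Pictures' / 'YWIL_YouWorkItLooks',
              {'capture_001.png': 'Step0', 'capture_002.png': 'Step1'})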
Step 3: Training the Custom Vision models
Since YWIL must detect custom objects, we have to train custom Computer Vision models. For this, I have used a Cognitive Service from Microsoft Azure called Custom Vision.
This platform offers three types of image recognition or labeling models:
- Multiclass Classification: each image corresponds to only one label.
- Multilabel Classification: multiple labels can be assigned to each image.
- Object Detection: each label is delimited by a region or bounding box.
For the YWIL use case, I have chosen the General (compact) domain with the Basic platforms export capability. These options allow us to export the model to ONNX or TensorFlow format and deploy it on edge devices.
All Custom Vision projects are built the same way: first, the model is chosen; then the images are uploaded and tagged or labeled; after that, the model is trained; next, it is tested and evaluated with no-labeled images; and finally, the model is exported to a ready-to-use file format.
I used the Multilabel Classification model for steps 0 to 4. This model allows YWIL to verify whether or not the part is correctly positioned, as well as to identify if the step is correct.
For steps 5 and 6 I used an Object Detection model. This model allows YWIL to count holes and parts and identify objects in different positions, even incorrectly placed parts.
Finally, I used a Multiclass Classification model to identify different situations in step 7. The result of this model is a single label that allows YWIL to identify whether or not the last part is positioned correctly.
I captured about 1500 images of the assembly process and selected and tagged more than 1200 of them in the three Custom Vision projects. The more images used to train the models, the more accurate the results will be. Keep in mind that you can add more training images and update the models by repeating these processes at any time.
Step 4: Installing the OpenVINO toolkit
We need to download and install the Intel Distribution of OpenVINO Toolkit to infer new classifications and scores on our Intel device. In my case, I have used my laptop with an Intel CPU and Windows 10.
First, go to the official Intel website. You just have to choose the operating system and fill in and submit the form.
Then, you can download one of the versions available for your operating system and choose between the Customizable Package or the Full Package.
Also, you will receive an email with a serial number, installation instructions, links to documentation and to a Getting Started Guide, etc.
After downloading the package, I followed the installation instructions for Windows. It is very easy if you do everything carefully and step by step.
At the end of the installation guide there are instructions to run the Image Classification Verification Script and the Inference Pipeline Verification Script. These two tests are used to verify that everything works fine.
If you followed all the instructions and tried other samples and demos, you will have noticed that the system environment variables must be configured every time a new CLI session is opened:
C:\Program Files (x86)\IntelSWTools\openvino\bin\setupvars.bat
To avoid having to do that every time we want to use OpenVINO, we can permanently set OpenVINO environment variables for Windows 10 in Control Panel > System and Security > System > Advanced System Settings > Environment Variables.
Step 5: Using the Model Optimizer
To use the OpenVINO Toolkit Inference Engine with our custom models, we first need to convert them to Intermediate Representation using the Model Optimizer tool. To learn how to use this tool, you can follow the Model Optimizer Development Guide on the Intel website.
We can find the Model Optimizer tool inside the OpenVINO installation directory:
C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer
To get the Intermediate Representation of the YWIL Custom Vision models, I used the exported TensorFlow versions. Each framework (ONNX, TensorFlow, Caffe, etc.) has its own conversion script, so use the appropriate one and check on this page that all the layers of your model are supported. After running the Model Optimizer from the Windows CLI, you will get an output similar to this:
[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: YWIL_Classification_Multilabel.xml
[ SUCCESS ] BIN file: YWIL_Classification_Multilabel.bin
[ SUCCESS ] Total execution time: 19.16 seconds.
[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: YWIL_Object_Detection.xml
[ SUCCESS ] BIN file: YWIL_Object_Detection.bin
[ SUCCESS ] Total execution time: 25.20 seconds.
[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: YWIL_Classification_Multiclass.xml
[ SUCCESS ] BIN file: YWIL_Classification_Multiclass.bin
[ SUCCESS ] Total execution time: 17.48 seconds.
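For reference, the kind of command that produces this output looks roughly like the one below. The frozen graph name, input shape, and output directory are illustrative and must be adapted to your exported model, as explained in the TensorFlow section of the Development Guide:
python "C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer\mo_tf.py" --input_model model.pb --input_shape [1,224,224,3] --output_dir YWIL_IR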
It may happen that you get some errors during the optimization process. To solve most of them, you can refer to the Model Optimizer FAQ.
Step 6: Using the Inference Engine
In this section I will explain how to use the OpenVINO Inference Engine. YWIL uses the Python API, but the steps to follow are very similar to the Common Workflow instructions for C++.
- Create an Inference Engine Core object:
IECore()
- Read the Intermediate Representation:
read_network()
- Compile and load the network to the device:
load_network()
- Prepare input and output formats:
input_info
outputs
- Set the image input data:
inputs={input: image}
- Execute the inference process:
infer()
- Get the output of the model:
result[output]
from openvino.inference_engine import IECore
inference_engine = IECore()
network = inference_engine.read_network(model=model_xml, weights=model_bin)
exec_network = inference_engine.load_network(network=network, device_name='CPU')
input = next(iter(network.input_info))
output = next(iter(network.outputs))
result = exec_network.infer(inputs={input: image})
result = result[output]
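The image passed to infer() must already match the network input layout. The exact preprocessing depends on the exported model, but a minimal sketch (assuming a fixed NCHW input and an OpenCV BGR frame; the helper name is mine, not part of YWIL) could look like this:
import cv2

def prepare_input(frame, network, input_name):
    # Read the expected (N, C, H, W) shape from the network definition
    n, c, h, w = network.input_info[input_name].input_data.shape
    resized = cv2.resize(frame, (w, h))        # resize to the model's spatial size
    transposed = resized.transpose((2, 0, 1))  # HWC -> CHW
    return transposed.reshape((n, c, h, w))    # add the batch dimension

# frame is a BGR image captured from the camera
image = prepare_input(frame, network, input)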
Here you can see the JSON-formatted result of an inference on the Multilabel Classification model using the Python script above:
[
{
"label": "Step4",
"probability": "99.8"
},
{
"label": "front",
"probability": "85.2"
}
]
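The raw output of the classification networks is just a vector of probabilities. Turning it into the JSON above only requires mapping each score to the corresponding entry of the labels.txt file included in the Custom Vision export; a rough sketch, where the 0.5 threshold is my own assumption:
# Map the raw probability vector to the exported labels and keep confident ones
with open('labels.txt') as f:
    labels = [line.strip() for line in f]

probabilities = result.squeeze()
predictions = [{'label': label, 'probability': '{:.1f}'.format(float(p) * 100)}
               for label, p in zip(labels, probabilities) if p > 0.5]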
You can find a more detailed description on the Introduction to Inference Engine page.
Step 7: Using ONNX Runtime with OpenVINO
I had some compatibility issues using the Azure Custom Vision Object Detection model with the Model Optimizer and the Inference Engine. Therefore, I decided to go to plan B: using the ONNX Runtime from Microsoft with OpenVINO as the Execution Provider.
The installation process is very simple and all the instructions can be found on the ONNX Runtime GitHub page. To use OpenVINO and our Intel CPU we just have to add this argument:
--use_openvino CPU_FP32
git clone --recursive https://github.com/Microsoft/onnxruntime
cd onnxruntime
.\build.bat --config RelWithDebInfo --use_openvino CPU_FP32 --cmake_generator "Visual Studio 16 2019"
The inference process in ONNX Runtime with Python is very similar to OpenVINO's; instead of the Intermediate Representation, we use the ONNX model file directly:
import onnx
import onnxruntime

# Load the exported Custom Vision model and save a copy to the working path
model = onnx.load(model_filename)
onnx.save(model, dirpath)

# Create an inference session with default options
options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(dirpath, options)

# Get the input layer name and run the inference on the image
input = session.get_inputs()[0].name
output = session.run(None, {input: image})
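A quick way to confirm that the custom build actually exposes OpenVINO, although this check is not part of the YWIL code, is to list the available and active execution providers:
import onnxruntime

# The OpenVINO Execution Provider should appear alongside the default CPU provider
print(onnxruntime.get_available_providers())
print(session.get_providers())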
Step 8: Building a desktop application
The main part of YWIL is an application that distributes each image of the assembly process to the corresponding model to infer the Computer Vision results.
string result = await ModelInference(CommonSteps.ModelFolders[currentModel]);
switch (currentModel)
{
case 0:
List<Classification> multilabel =
JsonConvert.DeserializeObject<List<Classification>>(result);
ProcessMultilabelResult(multilabel);
break;
case 1:
List<Detection> detection =
JsonConvert.DeserializeObject<List<Detection>>(result);
ProcessDetectionResult(detection);
break;
case 2:
List<Classification> multiclass =
JsonConvert.DeserializeObject<List<Classification>>(result);
ProcessMulticlassResult(multiclass);
break;
default:
break;
}
Then, different methods process the results, verify the steps, and raise alerts when errors occur. Finally, all the information is displayed in a graphical interface:
- 1. Real time image from the camera
- 2. Assembly instructions and alerts
- 3. Visual guide and reference pictures
- 4. Detected objects and probabilities
- 5. Progress bar of the entire process
You can find a link to the YWIL GitHub page with the complete Visual Studio solution in the Code section at the bottom of this page.
***** Step 8: Python Bonus Track *****
This project was originally created for the Deep Learning Superhero Challenge. The main application was written in C# for the early submission contest but after that, I had time to fully convert it to Python.
The fundamental difference between the two applications is that the inference process is much more straightforward from Python code. In this way, the power of OpenVINO is fully exploited and better results are obtained in terms of inference times and frames per second.
The Python project has more or less the same structure as the C# application, and the well-known Tkinter library has been used to build the same graphical interface. There is a section that chooses which model the image is sent to for analysis:
predictions = []
if current_model == 0:
    predictions = multilabel.Infer(image)
elif current_model == 1:
    predictions = detection.Infer(image)
elif current_model == 2:
    predictions = multiclass.Infer(image)
Then, specific methods are in charge of processing the results and displaying the information on the screen:
print_currently(len(predictions))
print_detections(predictions)
print_inference(start, end)
if current_model == 0:
    process_multilabel(predictions)
elif current_model == 1:
    detections = process_detection(predictions)
    image = draw_detections(image, detections)
elif current_model == 2:
    process_multiclass(predictions)
The C# code has no equivalent of the following method, which calculates the median inference time over the last few frames:
from statistics import median  # required for the median() call below

def print_inference(start, end):
    # Keep a sliding window of the most recent inference times
    if len(infer_times) == waiting_frames:
        infer_times.pop(0)
    infer_times.append((end - start) / 1000)
    ms = median(infer_times)
    text = 'Last inference time: {:.3f}ms'.format(ms)
    window.inference.config(text=text)
You can find the YWIL GitHub page link with the Python application in the Code section at the end of this project.
Step 9: Testing the YWIL app in action
And finally, here comes the fun part: testing the YWIL application in real time during the assembly process of a kit.
***** Step 9: Python Bonus Track *****
Having a new Python-with-OpenVINO application, the plan is simple: first, see it working; then, analyze the results; and finally, write some conclusions. Let's go for it!
In real time we can see that the classification models take about 7 milliseconds on average to process each image with OpenVINO while the object detection model takes about 33 milliseconds on average using ONNX Runtime with OpenVINO as execution provider.
These numbers are really good when you consider that the time includes the image pre-processing, the inference itself, and the output post-processing. With those inference times we could reach processing speeds of up to about 140 frames per second for the classification models and around 30 fps for object detection. OpenVINO is really worth it, and I recommend you give it a try.
It is also very important to note that the average confidence of all the models is above 95 percent. And all of this has been accomplished with a fairly small training dataset. Imagine the possibilities of these Custom Vision models.
I hope you have enjoyed the YWIL project!