In this project I will show how I built the AI Assisted 360° Live View plugin for the Theta V camera.
The plugin provides 360° Live View functionality using WebRTC, and object detection using TensorFlow.
In a more advanced form, the plugin could be used to remotely assist personnel in different use cases, for example in rescue operations.
The Theta V camera could be either worn by a person or mounted on a remote controlled vehicle like a drone or rover.
More about the project idea can be found here.
The source code of the plugin can be found in the GitHub repository linked to the project.
A pre-compiled APK of the plugin is available here.
The plugin is in the process of being published in the Theta 360 plugin store.
WebRTC based Live View
The live view functionality is based on the WebRTC sample plugin.
shrhdk_, the author of the WebRTC sample plugin, describes how the plugin works in the following article.
(note: the article is in Japanese, but Google Translate can be used pretty well to read it)
The plugin uses three communication channels with the web interface:
- an HTTP server is used to serve the web interface and to provide control over the camera
- WebRTC is used for the video and audio live streaming
- a WebSocket connection with a JSON based protocol is used to negotiate the WebRTC connection details (a rough illustration of such a message follows below)
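As a rough illustration of how that third channel might be used (the actual message types and fields are defined by the sample plugin's signaling protocol, so the names below are only placeholders), the camera could push its SDP answer and ICE candidates to the browser over the same mWsClient WebSocket client that appears later in this article:
// Illustration only: the real message types/fields come from the WebRTC sample
// plugin's signaling protocol; "answer" and "candidate" are placeholders here.
Map<String, Object> answer = new HashMap<>();
answer.put("type", "answer");
answer.put("sdp", sessionDescription.description);    // org.webrtc.SessionDescription
mWsClient.send(new Gson().toJson(answer));

Map<String, Object> candidate = new HashMap<>();
candidate.put("type", "candidate");
candidate.put("candidate", iceCandidate.sdp);          // org.webrtc.IceCandidate
mWsClient.send(new Gson().toJson(candidate));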
TensorFlow can easily be used on the Theta V camera. Craig Oda explains how to do this in the following article.
The source code can be found in the mktshhr/tensorflow-theta repository. This is basically a fork of the official tensorflow repository with the Android TensorFlow example packed into a Theta plugin and adjusted to work on the Theta V / Z1.
Building the AI Assisted 360° Live View plugin
The plugin was built using Android Studio:
I started from the theta360developers/tensorflow-theta repository. I built the project and, using Vysor, checked that the plugin was working as expected.
Then I added the WebRTC functionality from the ricohapi/theta-plugin-webrtc-sample repository. This also worked, so at this point I had a plugin with WebRTC functionality and support for writing TensorFlow code.
The next step was to add TensorFlow object detection functionality over the WebRTC video streaming.
This was a little bit trickier, as the TensorFlow examples (ClassifierActivity, CameraActivity) used the Android Media API to capture images from the camera, while the WebRTC plugin used native video sinks, and the two are not compatible.
To fix this, I created a new class, TFClassifier, with just the object detection functionality from ClassifierActivity, but without the Media API part.
Next, I added a callback interface to the WebRTC class that can be used to get access to the camera's MediaStream when it is ready.
public class WebRTC {
    ...
    private Consumer<MediaStream> mediaStreamCallback;
    ...
    public void setMediaStreamCallback(Consumer<MediaStream> mediaStreamCallback) {
        this.mediaStreamCallback = mediaStreamCallback;
    }
    ...
    private void setupLocalStream(...) {
        mLocalStream = mFactory.createLocalMediaStream("android_local_stream");
        ...
        if (mediaStreamCallback != null) {
            mediaStreamCallback.accept(mLocalStream);
        }
        ...
    }
}
Then we can use this interface to give the TFClassifier class a VideoTrack:
mWebRTC = new WebRTC(this);
mWebRTC.setMediaStreamCallback(mediaStream -> {
    Log.d(TAG, "Got Media Stream: " + mediaStream);
    new TFClassifier(mediaStream.videoTracks.get(0), getAssets(),
        ...
    });
});
In TFClassifier we can then add a custom VideoSink to the VideoTrack:
public TFClassifier(VideoTrack videoTrack, ...) {
    ...
    videoTrack.addSink(this::onFrame);
    ...
}

private void onFrame(VideoFrame videoFrame) {
    ...
}
The VideoFrame we get here, with a little bit of image processing magic, can be converted to the Bitmap needed by TensorFlow:
private final Map<String, BySize> bySize = new HashMap<>();

private static class BySize {
    final int[] rgbBytes;
    final int width;
    final int height;
    final int rotation;
    final Matrix frameToCropTransform;
    final Bitmap rgbFrameBitmap;

    BySize(int width, int height, int rotation) {
        this.width = width;
        this.height = height;
        this.rotation = rotation;
        rgbBytes = new int[width * height];
        rgbFrameBitmap = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888);
        frameToCropTransform = ImageUtils.getTransformationMatrix(
                width, height,
                INPUT_SIZE, INPUT_SIZE,
                rotation, MAINTAIN_ASPECT);
    }
}

private void onFrame(VideoFrame videoFrame) {
    ...
    Log.d(TAG, "Got Video Frame. rot=" + videoFrame.getRotation());
    VideoFrame.I420Buffer i420Buffer = videoFrame.getBuffer().toI420();
    BySize b = bySize(i420Buffer.getWidth(), i420Buffer.getHeight(), videoFrame.getRotation());
    ImageUtils.convertYUV420ToARGB8888(
            bufferToArray(null, i420Buffer.getDataY()),
            bufferToArray(null, i420Buffer.getDataU()),
            bufferToArray(null, i420Buffer.getDataV()),
            i420Buffer.getWidth(),
            i420Buffer.getHeight(),
            i420Buffer.getStrideY(),
            i420Buffer.getStrideU(),
            1 /* UV pixel stride (not getStrideV()); it is 1 for the planar I420 buffer */,
            b.rgbBytes);
    b.rgbFrameBitmap.setPixels(b.rgbBytes, 0, b.width, 0, 0, b.width, b.height);
    final Canvas canvas = new Canvas(croppedBitmap);
    canvas.drawBitmap(b.rgbFrameBitmap, b.frameToCropTransform, null);
    ...
}
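The bufferToArray() helper used above is not listed in the article; a minimal sketch of what it could look like, assuming the first argument is an optional re-usable array, is:
// Sketch: copy a plane's java.nio.ByteBuffer into a plain byte array,
// re-using the previously allocated array when it is large enough.
private static byte[] bufferToArray(byte[] reuse, ByteBuffer buffer) {
    ByteBuffer data = buffer.slice();
    int size = data.remaining();
    byte[] array = (reuse != null && reuse.length >= size) ? reuse : new byte[size];
    data.get(array, 0, size);
    return array;
}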
Note that the VideoFrame we receive may not always be the same size, as the plugin may switch between different resolutions. The BySize class is a helper structure used to cache these per-resolution buffers.
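The bySize(...) lookup called in onFrame() is also not shown; a minimal sketch of how it could cache one BySize instance per resolution (the key format here is just an assumption) is:
// Sketch: return the cached BySize for this resolution/rotation, creating it on first use.
private BySize bySize(int width, int height, int rotation) {
    String key = width + "x" + height + "/" + rotation;   // assumed key format
    BySize b = bySize.get(key);
    if (b == null) {
        b = new BySize(width, height, rotation);
        bySize.put(key, b);
    }
    return b;
}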
Then we can run image recognition on the obtained Bitmap:
public TFClassifier(...) {
    ...
    classifier =
            TensorFlowImageClassifier.create(
                    assetManager,
                    MODEL_FILE,
                    LABEL_FILE,
                    INPUT_SIZE,
                    IMAGE_MEAN,
                    IMAGE_STD,
                    INPUT_NAME,
                    OUTPUT_NAME);
}

private void recognize() {
    final long startTime = SystemClock.uptimeMillis();
    final List<Classifier.Recognition> results = classifier.recognizeImage(croppedBitmap);
    long lastProcessingTimeMs = SystemClock.uptimeMillis() - startTime;
    Log.i(TAG, String.format("Detect: %s (time=%dms)", results, lastProcessingTimeMs));
    ...
}
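The MODEL_FILE, LABEL_FILE and related constants come from the TensorFlow Android example that the Theta fork is based on; for reference, in the stock ClassifierActivity they look roughly like this (the exact values may differ in the mktshhr/tensorflow-theta fork):
// Values from the stock TensorFlow Android example; the Theta fork may use different ones.
private static final int INPUT_SIZE = 224;
private static final int IMAGE_MEAN = 117;
private static final float IMAGE_STD = 1;
private static final String INPUT_NAME = "input";
private static final String OUTPUT_NAME = "output";
private static final boolean MAINTAIN_ASPECT = true;
private static final String MODEL_FILE = "file:///android_asset/tensorflow_inception_graph.pb";
private static final String LABEL_FILE = "file:///android_asset/imagenet_comp_graph_label_strings.txt";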
Because the image recognition can be time consuming (~100+ ms), it is run on a separate thread (Android Handler / HandlerThread). When a new frame arrives, it is pre-processed and the job is passed to a Handler, in which the image recognition is done.
private void onFrame(VideoFrame videoFrame) {
    // Drop the frame if the previous one is still being processed.
    if (!state.compareAndSet(State.IDLE, State.PRE_PROCESS)) {
        Log.d(TAG, "not idle");
        return;
    }
    ...
    handler.post(() -> {
        if (!state.compareAndSet(State.PRE_PROCESS, State.PROCESS)) {
            Log.d(TAG, "not pre-process");
            return;
        }
        recognize();
        if (!state.compareAndSet(State.PROCESS, State.IDLE)) {
            Log.d(TAG, "not process");
            return;
        }
    });
}
To avoid overloading the Handler thread, any new frame that arrives while image recognition is still running is dropped. This way the image recognition may run at a lower frame rate than the live video stream, but it stays up to date instead of falling behind.
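The state and handler fields used in the snippet above are not listed in the article; a minimal sketch of how they could be set up, assuming an AtomicReference-based state machine and a dedicated HandlerThread, is:
// Sketch: a simple IDLE -> PRE_PROCESS -> PROCESS -> IDLE state machine and a
// background thread so recognition does not block the WebRTC frame callbacks.
private enum State { IDLE, PRE_PROCESS, PROCESS }

private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);
private final Handler handler;

public TFClassifier(VideoTrack videoTrack, ...) {
    ...
    HandlerThread handlerThread = new HandlerThread("tf-classifier");
    handlerThread.start();
    handler = new Handler(handlerThread.getLooper());
    ...
}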
The result of the image recognition is passed to a callback:
private final Consumer<String> detectionCallback;
private void recognize() {
...
final String json = new Gson().toJson(results);
detectionCallback.accept(json);
}
Then, in the MainActivity class, the result is sent to the web interface over the WebSocket connection, using a message with a custom type:
new TFClassifier(mediaStream.videoTracks.get(0), getAssets(), detectionJson -> {
    mWsClient.send("{ \"type\":\"tf-detection\", \"data\" : " + detectionJson + " }");
});
On the web interface side, the result is currently just displayed over the video stream in raw form on a Canvas.
webSocket.onmessage = function(evt) {
    console.log('WebSocket onmessage() data:', evt.data);
    let message = JSON.parse(evt.data);
    switch (message.type) {
        ...
        case 'tf-detection': {
            console.log(message);
            ctx.fillStyle = "#ff00ff";
            ctx.font = "12px Arial";
            let y = 50;
            for (var i = 0; i < message.data.length; i++) {
                let text = JSON.stringify(message.data[i]);
                ctx.fillText(text, 10, y);
                // measureText() does not report a height, so advance by a fixed line height
                y += 14;
            }
            break;
        }
    }
};
Note that the plugin is still in a PoC / beta phase and can be improved a lot.
The enhancements I plan to make are:
- a much better user interface
- TensorFlow models that also provide location data for the recognized objects, so that they can be shown on the live stream
- multi-box detection - currently the camera image is scaled / cropped before the image recognition is done
- support for custom TensorFlow models provided by the user
- OpenCV based image filtering / pre-processing
Enjoy!