Using multiple Spresense devices (at least two), each equipped with a camera and an ambient microphone, surrounding a volume of space and looking in from its boundary, a scene can be segmented in 3D, both visually and acoustically. With the devices performing edge sensor fusion and smart vision on their individual local streams, and forwarding the geotagged streams for remote integration, the scene can be comprehended in multiple sensor dimensions in real time. The devices can be either static or freely mobile. The end user can dynamically choose their point of view on the scene and examine it in as much detail as the multiplicity of integrated device streams allows.
This technology, lightweight and power-efficient, can be used in many ways: casual observation by passers-by of a captivating street juggling act, crime-scene documentation with provenance for attribution of blame in court, traffic observation at critical intersections or complex transportation hubs, a compelling new genre of POV-controlled video shorts, and so on. In every application where a single camera and microphone provide a flat, single-POV perspective on an object or scene of interest, a set of cooperating multi-sensor devices can provide a much richer multi-perspective experience of the scene. Certainly, this is not novel technology (it has been employed to create rich viewer experiences in many sports), but the power-efficient yet high-quality Spresense platform can bring it out of the stadium and put it right out on the sidewalk.
The technologies for stitching streams already exist and support multi-million-dollar industries. This is a democratization of that technology into the mainstream. It is impossible to predict all the uses it may be put to if made widely accessible and affordable, just as it was with good old flat video; better to implement it, put it in the hands of creative humans, sit back, and enjoy the emergence of novel applications.
Main features

The main idea of the solution is to enrich, integrate, process, and forward multiple streams of sensor data describing the same volume of space at the same moment in time. So, the main features are:
1. Multiple sensors (at least camera, microphone, and GPS) or arrays of sensors creating individual streams at the same time. (Spresense main board, vanilla extension board, and camera board.)
2. Software- or hardware-based sensor fusion (the HW path might require augmenting the computing stack with a lightweight, power-efficient FPGA). (Design of a custom extension for offloading HW-based sensor fusion to an FPGA.)
3. For full mobility, a custom extension board running on battery power needs to be designed and fabricated. (Design of custom extension to support mobile power in a wearable form factor.)
4. Machine vision models for real-time segmentation of the "flat" local streams and embedding of metadata (a.k.a. second-stage sensor fusion); a frame-packaging sketch follows this list.
5. LTE streaming of individual streams. (LTE extension and SIM card with global IoT plan.)
6. Post-processing, stitching, and serving of POV-controlled streamed video. (Development of multimedia protocol extensions for accepting the rich data.) At this point, all the data can be tagged with proof-of-work and other blockchain provenance, and/or encrypted, depending on the application.
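To make the intended output of features 4 and 5 concrete, here is a minimal C++ sketch of how a locally segmented frame could be tagged and packaged before streaming. The FrameMeta fields, the packFrame helper, and the raw memcpy wire format are illustrative assumptions, not an existing Spresense API or an established protocol; a real implementation would define an explicit, endianness-safe wire format.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-frame metadata record; field names are assumptions for
// illustration, not an existing Spresense or streaming-protocol definition.
struct FrameMeta {
    uint64_t timestampUs;   // capture time, microseconds (e.g. GPS time)
    double   latitudeDeg;   // GNSS fix at capture time
    double   longitudeDeg;
    float    headingDeg;    // device orientation, if an IMU/compass is present
    float    audioRmsDb;    // ambient sound level around the frame
    uint32_t jpegSize;      // size of the JPEG payload that follows
};

// Prepends a fixed-size metadata header to the JPEG buffer so the remote
// integrator can align frames from several devices in time and space.
// Note: copying the struct directly assumes identical layout and endianness
// on both ends, which is only acceptable for a proof of concept.
std::vector<uint8_t> packFrame(const FrameMeta& meta,
                               const uint8_t* jpeg, uint32_t jpegSize) {
    std::vector<uint8_t> packet(sizeof(FrameMeta) + jpegSize);
    FrameMeta header = meta;
    header.jpegSize = jpegSize;
    std::memcpy(packet.data(), &header, sizeof(header));          // header
    std::memcpy(packet.data() + sizeof(header), jpeg, jpegSize);  // payload
    return packet;
}
```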
Development roadmap

There are four stages of development:
1. Building the sensor-fusion and 2D segmentation pipeline for the local sensor stream of a single Spresense device (main, extension, camera).
2. Geotagging and CORDIC processing of the stream to embed maximum metadata for downstream integration of the streams (main, extension, camera); see the CORDIC sketch after this roadmap.
3. Simultaneous streaming from at least two devices over LTE to a scene-specific cloud server (after the proof of concept, this service can be developed to serve real-time scenes and events on demand); see the upload sketch after this roadmap.
4. Stitching and integration of the multiple simultaneous streams, and real-time streaming of the rich stream to both browser viewers (no POV control) and custom mobile apps (full POV control).
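For roadmap step 2, the CORDIC processing is meant to turn raw positional offsets into bearing and magnitude metadata using only shifts and adds, which suits the Spresense CPU (and a possible FPGA offload) well. Below is a minimal fixed-point, vectoring-mode CORDIC sketch in C++; the Q16.16 format, the iteration count, and the function name are illustrative choices, not part of any existing SDK.

```cpp
#include <cstdint>

// Fixed-point CORDIC (vectoring mode): computes atan2(y, x) and |(x, y)|
// with shifts and adds only. Angles are in Q16.16 radians.
static const int kIters = 16;
static const int32_t kAtanTable[kIters] = {
    // round(atan(2^-i) * 65536) for i = 0..15
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

// Inputs are plain integers (e.g. pixel or position offsets); keep them well
// below 2^29 so the intermediate values cannot overflow int32.
void cordicVectoring(int32_t x, int32_t y, int32_t* angleQ16, int32_t* magScaled) {
    int32_t z = 0;
    // Pre-rotate into the right half-plane so the iterations converge.
    if (x < 0) {
        int32_t t = x;
        if (y >= 0) { x = y;  y = -t; z =  102944; }   // +pi/2 in Q16.16
        else        { x = -y; y =  t; z = -102944; }   // -pi/2 in Q16.16
    }
    // Drive y toward zero, accumulating the total rotation angle in z.
    for (int i = 0; i < kIters; ++i) {
        int32_t xs = x >> i, ys = y >> i;
        if (y >= 0) { x += ys; y -= xs; z += kAtanTable[i]; }
        else        { x -= ys; y += xs; z -= kAtanTable[i]; }
    }
    *angleQ16  = z;   // atan2 of the original (x, y), Q16.16 radians
    *magScaled = x;   // |(x, y)| times the CORDIC gain (~1.6468)
}
```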
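For roadmap step 3, here is a minimal sketch of forwarding packed frames to the scene-specific server, assuming a POSIX-like socket API is available once the LTE extension has attached to the network. The length-prefixed TCP framing and the one-connection-per-packet flow are simplifications for illustration; a real implementation would keep a persistent connection, add TLS, and likely use an established streaming protocol.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

// Pushes one packed frame to the scene server over plain TCP. The server
// address and port are placeholders supplied by the caller.
bool pushPacket(const std::vector<uint8_t>& packet,
                const char* serverIp, uint16_t port) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, serverIp, &addr.sin_addr);

    bool ok = false;
    if (connect(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        // A 4-byte length prefix lets the server frame each packet on the
        // otherwise unstructured byte stream.
        uint32_t len = htonl(static_cast<uint32_t>(packet.size()));
        ok = send(sock, &len, sizeof(len), 0) == sizeof(len) &&
             send(sock, packet.data(), packet.size(), 0) ==
                 static_cast<ssize_t>(packet.size());
    }
    close(sock);
    return ok;
}
```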
Outro

This project challenge came during a tough time for me, and after the development roadmap slipped several times because high-priority personal matters preempted the project, it ultimately wasn't realized. It will be taken up again shortly. This submission is meant only to meet my commitment to Hackster.io and the project sponsor Sony.