Li Mr
Published under the MIT license

Video Chat with Multimodal RAG on Jetson

Input a video, then ask any question about it; the system answers based on the content of the video.

Intermediate · Full instructions provided · 3 hours

Things used in this project

Hardware components

Seeed Studio reComputer J4012
×1

Software apps and online services

jetson-containers: Ollama
jetson-containers: PyTorch


Schematics

Multimodal-RAG-on-Jetson

This is the flowchart for Multimodal-RAG-on-Jetson. First, the audio track is extracted from the video and fed to Whisper, which transcribes it and saves the transcript as a text file. Then, frames are extracted from the video at regular intervals, one image every two seconds. These frames are passed to LLaVA, which describes each one and saves the descriptions as text files. Finally, an embedding model converts these texts into vectors and stores them in a vector database.
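The ingestion step can be sketched in a few lines of Python. This is a minimal sketch rather than the exact project code: it assumes ffmpeg is installed, the openai-whisper and ollama Python packages are available, and a local Ollama server with the llava model pulled is running; the file names, the "base" Whisper model, and the rag_docs output folder are illustrative choices.

```python
# Sketch of the ingestion step: extract audio and frames, transcribe the audio
# with Whisper, and describe each frame with LLaVA served by Ollama.
import subprocess
from pathlib import Path

import whisper
import ollama

VIDEO = "input.mp4"          # illustrative input file
OUT = Path("rag_docs")       # folder of text files for the vector database
OUT.mkdir(exist_ok=True)

# 1. Extract the audio track and transcribe it with Whisper.
subprocess.run(["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "audio.wav"], check=True)
asr = whisper.load_model("base")
transcript = asr.transcribe("audio.wav")["text"]
(OUT / "transcript.txt").write_text(transcript)

# 2. Extract one frame every two seconds (fps=0.5).
frames_dir = Path("frames")
frames_dir.mkdir(exist_ok=True)
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vf", "fps=0.5", str(frames_dir / "frame_%04d.jpg")],
    check=True,
)

# 3. Ask LLaVA (via the local Ollama server) to describe each frame
#    and save the description next to the transcript.
for frame in sorted(frames_dir.glob("*.jpg")):
    resp = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "Describe this video frame in detail.",
            "images": [str(frame)],
        }],
    )
    (OUT / f"{frame.stem}.txt").write_text(resp["message"]["content"])
```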

When you ask a question, the system retrieves the stored texts from the vector database that are most similar to your question and passes them to LLaVA as context for its answer. This lets you chat interactively with the content of the video. Give it a try, and feel free to leave any feedback or comments!
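Once the transcript and frame descriptions are on disk, LlamaIndex can build the vector index and answer questions with LLaVA through Ollama. Again, a minimal sketch under stated assumptions: it uses the llama-index-llms-ollama and llama-index-embeddings-huggingface integration packages, and the embedding model name, the rag_docs folder, and the similarity_top_k value are illustrative.

```python
# Sketch of the retrieval step: embed the saved text files, store them in a
# vector index, and answer questions with LLaVA (served by Ollama) using the
# retrieved passages as context.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use LLaVA through the local Ollama server as the answering LLM,
# and a small HuggingFace model (illustrative choice) for embeddings.
Settings.llm = Ollama(model="llava", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the transcript and frame descriptions produced in the previous step
# and build an in-memory vector index over them.
documents = SimpleDirectoryReader("rag_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the most similar chunks for a question and let LLaVA answer
# using them as context.
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What is this video about?"))
```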

Code

Multimodal-RAG-on-Jetson

Multimodal RAG with Ollama (LLaVA) and LlamaIndex on Jetson

Credits

Li Mr
