- Video Summarization: Generates a concise summary of long videos, making it easier to understand and retain the important information quickly.
- Multi-language Translation: Provides text and audio translations in multiple languages, making content accessible to a global audience.
- Upscaling: Enhances video quality so users get high-quality playback whenever they need it.
In today's fast-paced life, people often struggle to find time to consume lengthy video content, and the language barrier further limits access to many kinds of valuable content. Recognizing these challenges, we created Streamline to:
- Summarize videos into key points for easier and more convenient consumption.
- Break down the language barrier by offering translations in various languages.
Our main goal is to make video content more accessible, understandable, and enjoyable for everyone, regardless of time constraints or language barriers.
Hardware & Software Requirements
For this project, we used the GPU provided by AMD, as we were one of the Hardware Winners. After investing some money to make the GPU usable for us, we built the following setup:
- AMD Radeon W7900 Pro (48GB) GPU
- Ryzen 5 7600 processor
- Crucial DDR5 RAM, 6000 MHz (16GB x 2)
- Crucial NVMe M.2 1TB SSD
- MSI motherboard
- 800W bronze-rated power supply
- Ubuntu 22.04 (operating system)
- Windows 10 (as a dual boot operating system)
This build turned out to be a successful investment, as we were able to harness the full power of the GPU, which draws up to 240W.
The ROCm drivers and AMD's library support for the GPU were easy to install and helped us set it up for TensorFlow and PyTorch, making all the model deployment and setup straightforward.
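As a quick sanity check after the ROCm install, a small snippet like the one below (assuming the ROCm build of PyTorch, which reuses the torch.cuda API) confirms that the W7900 is visible:

# Sanity check: verify the ROCm build of PyTorch can see the Radeon Pro W7900.
import torch

if torch.cuda.is_available():                      # on ROCm, torch reuses the CUDA API
    print("GPU:", torch.cuda.get_device_name(0))
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"VRAM: {total / 1024**3:.1f} GB")
else:
    print("No GPU visible - check the ROCm driver installation.")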
Additionally, we set up remote desktop access on the machine, since my teammates live far away; this way the hardware is accessible to them at all times.
This was the complete setup :).
Approach
Our approach to developing Streamline was very simple, with one goal: make a tool that makes consuming video much simpler for the user. For that, we decided to work on three main goals: summarizing the content, providing multi-language support, and upscaling the video quality. The following steps outline our approach:
Research and Selection of A.I. Models
For summarization, we had two choices: go with an Extractive Summarization model or an Abstractive Summarization model. After multiple tests with several models, we decided to use the Abstractive Summarization model because of its clear advantages over the Extractive models, for the following reasons:
- Extractive models are limited in content generation, as they are restricted to key sentences already present in the original text, whereas abstractive models can generate new sentences that may or may not appear in the original text but convey the same meaning, much like how humans summarize text.
- Extractive models work by scoring sentences based on importance and clustering similar sentences, whereas an abstractive summarization model uses advanced neural networks, such as transformer-based models, that understand the context and generate a summary.
- Although extractive models take less time, since they do not generate a summary from scratch, we preferred the abstractive model because the summaries it produced were usually more coherent and concise, and read more naturally.
Therefore, we decided to go with "mistralai/Mistral-7B-Instruct-v0.2", 4-bit quantized to save VRAM. This model performed well and met all our expectations.
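For illustration, a 4-bit quantized load of this model with Hugging Face transformers looks roughly like the sketch below; our exact loading code and quantization settings may differ (4-bit support on ROCm depends on the bitsandbytes build):

# A minimal sketch of a 4-bit quantized load with transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit to save VRAM
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on the GPU automatically
)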
- To extract all the content from a video, we needed to generate a transcript, for which we used “Vosk-model-en-us-0.22-lgraph”.
- This model receives the video's audio as input in MP3 format after preprocessing and converts it into text.
- We tried multiple models, like “facebook/s2t-small-librispeech-asr”, but their results were unsatisfactory, so we decided to stick with Vosk, as it provided the most accurate output for the speech we were feeding it.
- Unlike the summarization model, there was no need to rewrite any sentences or turn the text into a more humanized format, so using a small model like Vosk was preferred.
- Why not use another transformer model like "facebook/hf-seamless-m4t-medium"? It can do the same job, right? The problem was the input and generation context length: unlike our Vosk model, which is built on Kaldi and can process a very large context in one go, the transformer model struggled, and when videos were too long it usually failed or started hallucinating midway.
- For translation, we used "facebook/hf-seamless-m4t-medium". Despite its massive failure at speech-to-text conversion, it surprisingly did a very good job at text-to-text and text-to-speech translation and was able to provide very good output in various languages, although, due to our own language barrier, we couldn't verify whether the translations were correct XD. (A short sketch of the translation call follows this list.)
- "facebook/hf-seamless-m4t-medium" was the perfect choice for our project: the model is not on the heavier side (1.2B parameters), consuming only 2-3 GB of VRAM.
- Using “facebook/seamless-m4t-v2-large” was also on our list; it did a really good job and outperformed the medium model, but its larger size (2.3B parameters), consuming 4-7 GB of VRAM, was a bit of an issue for us. At the time we were planning to integrate other models too, so saving VRAM was a top priority.
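For reference, a text-to-text translation call with this model looks roughly like the sketch below, assuming a recent transformers release; the exact decoding call in our code may differ slightly.

# Minimal sketch: translate English text to Hindi with SeamlessM4T (medium).
from transformers import AutoProcessor, SeamlessM4TForTextToText

model_id = "facebook/hf-seamless-m4t-medium"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4TForTextToText.from_pretrained(model_id)

inputs = processor(text="This is the summary of the video.", src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="hin")   # 3-letter target language code
translated = processor.decode(output_tokens[0], skip_special_tokens=True)
print(translated)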
Our architecture for Streamline was simple, easy to use, and easy to deploy. We used Streamlit for the interface and accessibility, and all the models and functions are integrated there.
Directory:
project_directory/
│
├── app.py
├── utils/
│ ├── __init__.py
│ ├── youtube.py
│ ├── audio.py
│ ├── transcription.py
│ ├── summarization.py
│ ├── upscaling.py
│ ├── video.py
└── requirements.txt
Audio Module
- Extracts the audio (.mp3) from the video file, resamples it to 16 kHz, and preprocesses it before sending it to the Vosk model for transcription.
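A rough sketch of this step, assuming ffmpeg is available on the system (function name and file paths are illustrative):

# Pull the audio track from the video and resample it to 16 kHz mono PCM,
# the format Vosk expects.
import subprocess

def extract_audio(video_path: str, wav_path: str = "audio_16k.wav") -> str:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,   # input video
            "-vn",              # drop the video stream
            "-ac", "1",         # mono
            "-ar", "16000",     # resample to 16 kHz
            "-f", "wav", wav_path,
        ],
        check=True,
    )
    return wav_path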
Transcription Module
- The Vosk model is implemented here to process the audio we receive and convert it into a transcript for our transformer model to work with.
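A minimal sketch of how this transcription step can look, assuming a 16 kHz mono WAV from the audio module (the helper name is illustrative):

# Transcribe a 16 kHz mono WAV with Vosk and return the full transcript.
import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(wav_path: str, model_dir: str = "vosk-model-en-us-0.22-lgraph") -> str:
    wf = wave.open(wav_path, "rb")
    model = Model(model_dir)
    rec = KaldiRecognizer(model, wf.getframerate())
    pieces = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)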
Summarization Module
- The summarization module wraps our transformer model, "mistralai/Mistral-7B-Instruct-v0.2". It did a good job, although to work around the fact that the model's context length can't contain the whole transcript, we used a splitter function to feed the model only as much text as it can handle (see the sketch after the prompt below).
prompt = (
f"Summarize the following video transcript chunk in a coherent and detailed manner. "
f"Highlight key points and maintain the flow of the narrative. Include information about the video titled '{video_info['title']}' by '{video_info['author']}' with the following description: '{video_info['description']}':\n\n"
f"{text}\n\nSummary:"
)
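A simplified sketch of the splitter idea: the transcript is cut into chunks, each chunk gets its own prompt, and the partial summaries are joined. `generate_summary` is a placeholder for the actual Mistral generation call, and the plain character split used here is only an approximation of our implementation.

# Split the transcript into ~1500-character chunks and summarize each chunk.
def split_transcript(transcript: str, chunk_size: int = 1500) -> list[str]:
    return [transcript[i:i + chunk_size] for i in range(0, len(transcript), chunk_size)]

def summarize_transcript(transcript, video_info, generate_summary):
    partial_summaries = []
    for chunk in split_transcript(transcript):
        prompt = (
            f"Summarize the following video transcript chunk in a coherent and detailed manner. "
            f"Highlight key points and maintain the flow of the narrative. Include information about "
            f"the video titled '{video_info['title']}' by '{video_info['author']}' with the following "
            f"description: '{video_info['description']}':\n\n{chunk}\n\nSummary:"
        )
        partial_summaries.append(generate_summary(prompt))   # one summary per chunk
    return "\n".join(partial_summaries)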
Video Module
- This module has the small job of feeding the Upscaling Module with frames so that it can process and upscale the video frame by frame.
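A rough sketch of this frame-feeding step using OpenCV (names are illustrative):

# Read the video and yield frames one by one for the upscaler.
import cv2

def iter_frames(video_path: str):
    cap = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame   # BGR numpy array handed to the upscaling module
    finally:
        cap.release()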
YouTube Module
- The purpose of this module is to extract all of the information about the video from the YouTube link (author, description, etc.) so we can give the model more context about the content.
Upscaling Module
- This module contains our ESRGAN, which enhances the video frame by frame and later combines the frames into one upscaled video. Currently this is limited to upscaling only the first 15 seconds, as it turned out to be much slower than we expected and takes a long time.
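A sketch of the frame-by-frame loop with the 15-second cap, where `upscale_frame` stands in for the actual ESRGAN inference call (not our exact code):

# Upscale only the first `seconds` of the clip and write the result to disk.
import cv2

def upscale_clip(video_path, out_path, upscale_frame, seconds=15, scale=4):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(
        out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width * scale, height * scale)
    )
    for _ in range(int(fps * seconds)):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(upscale_frame(frame))   # enhanced frame at the same scale factor
    cap.release()
    writer.release()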
Streamlit
- Streamlit is a web framework that is simple to implement and modify, and it meets all our criteria of displaying video, text, and audio in one place. We also had other frameworks in mind, like Dash, Panel, and Gradio, but they were all inferior to Streamlit in one way or another.
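A minimal sketch of what the Streamlit interface in app.py looks like; `process_video` is a stand-in for the pipeline that wires the modules together:

# One page that takes a YouTube link or an upload and shows video, summary, and audio.
import streamlit as st

def process_video(source):
    # Placeholder: in app.py this calls the audio, transcription, summarization,
    # translation, and upscaling modules and returns their outputs.
    return "output.mp4", "…summary text…", "translated.mp3"

st.title("Streamline")
youtube_url = st.text_input("Paste a YouTube link")
uploaded = st.file_uploader("...or upload a video", type=["mp4"])

if st.button("Summarize") and (youtube_url or uploaded):
    video_path, summary, audio_path = process_video(youtube_url or uploaded)
    st.video(video_path)
    st.subheader("Summary")
    st.write(summary)
    st.audio(audio_path)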
Pytube
- We use the Pytube library to add support for YouTube links in the framework, making it more accessible and easy to use: you can just paste the link of the YouTube video you want and generate the summary directly, or upload your own video as needed.
from pytube import YouTube

def get_video_info(youtube_url):
    # Collect the metadata Pytube exposes for the given link
    yt = YouTube(youtube_url)
    video_info = {
        "video_id": yt.video_id,
        "title": yt.title,
        "author": yt.author,
        "channel_id": yt.channel_id,
        "description": yt.description,
        "publish_date": yt.publish_date.strftime('%Y-%m-%d') if yt.publish_date else 'Unknown',
        "views": yt.views,
        "length": yt.length,
        "rating": yt.rating,
        "keywords": yt.keywords,
        "thumbnail_url": yt.thumbnail_url,
        "video_url": yt.watch_url,
        "captions": {lang: str(yt.captions[lang]) for lang in yt.captions},
        "streams": [str(stream) for stream in yt.streams],
    }
    return video_info
- Since our interface supports audio, text, and video formats, we can show output in all of these formats within the same framework without any big changes. And since we have access to Pytube here, we can print the video's info and description in as much detail as YouTube provides, which helps gather additional details about the video content as needed.
Our A.I.-powered tool Streamline is of course not perfect; it has some flaws. We tried to fix most of the issues we encountered, but some remain. Here's the list of the problems we faced, the fixes we made, and the unfixed issues:
- The first and biggest issue was context-length generation with "mistralai/Mistral-7B-Instruct-v0.2": this model has roughly a 70% chance of hallucinating if we try to generate anything above 2048 characters. After all the testing, we found a solution: if we feed the model a basic intro of the video, tell it that it is receiving only part of the video, and have it summarize that part, it does well. So splitting the whole transcript into small chunks of 1500 characters and having the model summarize each one worked well; the resulting summary captures the context of the whole video, and the key-point generation was great.
- Then we faced a problem with "facebook/hf-seamless-m4t-medium": this model had the same issue of occasionally hallucinating, so we applied the same fix, splitting the text into chunks and feeding them in, which made it workable and produced a smoothly translated summary. [LIMITATION] Unfortunately, this model had a problem with audio generation on split audio: either it was hallucinating there too or something else was going on, but it kept repeating the same words after the first audio generation. So we limited it to generating only 20 seconds of translated audio, which fixed the issue, but it means we cannot deliver the complete summary as translated audio.
- [LIMITATION] Another limitation we faced was with video upscaling: the model took a huge amount of time, around 7 minutes for a 15-second video, which was troublesome, so we limited it to generating an upscaled version of only the first 15 seconds of the video.
Our project Streamline has a lot of potential. Since we got our hardware at a very late stage and had only 1.5 months left to work on it, we tried our best to add as many functionalities as possible, but adding all of them was not feasible. Here is the future scope of what else can be added to make the project better in terms of efficiency, modularity, and reliability. [In the Approach section we talked about saving VRAM even though we had plenty of it; that was only because we were planning to add more models, which would have pushed us past our 48 GB of VRAM. Since we haven't implemented them yet, here's a discussion of what we were actually planning to add.]
Frame Interpolation
- This was one of the models I wanted to implement in this project. I had been working on one for a long time, and for this project especially I wanted to build one from scratch. I collected around 33 GB of data and trained a model with around 7 conv2d layers, but the results were not what I expected: the images came out blurry, so in the end we decided not to add it. In the future, we might improve this model and add frame interpolation for sure.
- Our current upscaler is lacking in most areas: it is slow, not very reliable at higher resolutions, and it doesn't even make full use of the GPU, which I believe is part of why it falls short. We need to do proper research here and build a better model from scratch; since there are very few open-source video upscaling models out there, building from scratch is the most viable option.
- Since our current audio translation is capped at 20 seconds per section, we need to fix the model's hallucination issue, as we did for the Mistral model, so that it can generate audio for all the topics completely.
- The current pipeline works by splitting each video into a sequence of images, resulting in many redundant frames with very few variations; this could be optimized further to reduce the run time considerably (a rough sketch of one possible approach follows this list).
- The current summary generation is based on the transcript acquired from the video's audio. Our future goal is to implement a system that extracts data from each distinct frame of the video and combines it with the transcript to form a more meaningful and detailed summary.
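A rough sketch of one possible approach to the redundant-frame problem mentioned above (an idea, not something we have implemented): skip frames that barely differ from the last kept frame before sending them to the upscaler.

# Possible optimization (not implemented): keep only frames that differ noticeably
# from the previously kept frame, so the upscaler processes fewer, distinct frames.
import cv2
import numpy as np

def keep_distinct_frames(frames, threshold=2.0):
    last = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last is None or np.mean(cv2.absdiff(gray, last)) > threshold:
            last = gray
            yield frame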
Based on what we have seen and discussed so far, that was the whole project overview: how it works, which components we used, why we decided to build it, and how it worked out. Following is some output showing the model in action. [Small update: the demo shown here is not actually running on the original hardware we were working with; as our college opened, I had to leave my PC at home and forgot to record a demo video on that device, really sorry for that :(]
[Demo video: Untitled video - Made with Clipchamp.mp4]