In this project we will build a camera that automatically describes what it observes. The main aim is to build the AI part of a system that can be used for automated surveillance using edge devices such as Jetson Nano.
Back in 2016, in a Google paper named "Show and Tell", researchers showed how to couple a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network to provide automatic captioning (textual descriptions) of images. In this project we will extend this idea to real-time video. An AI network running on an edge device such as the Jetson Nano will be deployed so that it continuously provides textual descriptions of acquired frames. These textual descriptions will be used to trigger actions based on the described objects.
All of this is done with no requirement for network connectivity, so such a system can be installed in remote areas that require AI supervision of their surroundings.
Introduction
The complete design implements an automatic image captioning neural network applied to real-time video, using the latest TensorFlow version on a Jetson Nano edge computing device. To keep the implementation simple, advanced features such as attention are not implemented, although they can be added later since the main script is fairly modular.
A hybrid deep neural network will be implemented to provide captioning of each frame in real time using a simple USB cam and the Jetson Nano.
Project Design Phases
The project will be set up in four phases. During the first phase we will set up and train the network on a host computer equipped with a discrete graphics card. The second phase consists of setting up the Jetson Nano and implementing a simple image pipeline from the camera to an HDMI monitor. The third phase integrates the image captioning deep neural network with the image pipeline from phase two. And finally, during the last phase, we will test the network in real-world settings.
The hardware setup requires supplying the Jetson Nano from the barrel-jack power input (a 5V, 4A supply is recommended), since powering it over micro-USB is insufficient to run the neural model in high-performance mode. To do this, make sure to install the power-select jumper on the right side of the Jetson Nano. Then plug in a USB camera, an SD card flashed with the latest image, and the Ethernet cable. Once the hardware is set up, the next step is setting up the prerequisite frameworks.
Host Setup
First we will define and train the network on a host laptop. The project makes use of TensorFlow 2.0, Keras 2.1 and OpenCV 4.1. A prerequisite is to install CUDA 10.0 and Visual Studio Express 2017 to leverage GPU speed gains in case the laptop comes with an NVIDIA GPU.
The dataset we will use for training is the Flickr8K image dataset. This is a relatively small dataset that allows one to train a complete AI pipeline on a laptop-class GPU. One can also use larger datasets, which allow for better performance at the expense of much longer training time. The dataset can be downloaded from the University of Illinois via the request form.
The next dataset is GloVe, a set of word embeddings built from a large text corpus. This dataset essentially serves as the dictionary from which the AI picks up its vocabulary. After the caption text clean-up is done, the next step is to load the GloVe embeddings. Embeddings are encodings of words used by neural networks: each word is projected to a vector in a high-dimensional space so that words with similar meanings end up close together. Download the dataset from here:
https://nlp.stanford.edu/projects/glove/
Then create a top-level directory named /Captioning and extract both zipped files in there.
In addition, create a folder named /data for saving the files generated during the training phase. The GloVe vectors are loaded into an embedding matrix that will later initialize the embedding layer of the network.
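A rough sketch of this loading step follows; the file name, the embedding dimension, and the wordtoidx vocabulary mapping (built later from the cleaned captions) are assumptions, so adjust them to the GloVe variant you downloaded.
import numpy as np

# Hypothetical path and dimension - match these to the extracted GloVe file under /Captioning
GLOVE_PATH = 'glove.6B.200d.txt'
embedding_dim = 200

# Each line holds a word followed by the components of its embedding vector
embeddings_index = {}
with open(GLOVE_PATH, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build the matrix used to initialize the embedding layer of the caption network;
# wordtoidx maps each vocabulary word to its integer index
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordtoidx.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
Next we will define the network and train it.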
Neural Net Training
From a high-level perspective, the image captioning deep learning network consists of a deep CNN (InceptionV3) and an LSTM recurrent neural network daisy-chained together. The output of the CNN is a fixed-length feature vector (2048 dimensions for InceptionV3 with its classification layer removed) that encodes the content of the image. This vector is fed to the LSTM, which generates a textual description of the objects in the image: for each frame the LSTM receives one such feature vector and strings together a description of the scene in real time.
The IPython notebook that trains the network can be found on GitHub. The design of the main network is based on work by Jeff Heaton and consists of an InceptionV3 CNN coupled with an LSTM recurrent neural network.
The next step is to build a dataset from the Flickr captions and clean all the descriptions by tokenizing and pre-processing the text. The Flickr8K dataset is then split into train and test image sets, the training descriptions are loaded, and the network is trained.
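A minimal sketch of the caption clean-up step is shown below; the helper name and the exact filtering rules are assumptions, with the full version living in the notebook on GitHub.
import string

def clean_descriptions(descriptions):
    # descriptions: dict mapping image id -> list of raw caption strings
    table = str.maketrans('', '', string.punctuation)
    for img_id, caps in descriptions.items():
        for i, cap in enumerate(caps):
            words = cap.lower().translate(table).split()
            # drop single characters and tokens that are not purely alphabetic
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            # wrap every caption with the start/end tokens the LSTM relies on
            caps[i] = 'startseq ' + ' '.join(words) + ' endseq'
    return descriptions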
As mentioned, the Inception network is used as the first stage of the pipeline. Its last fully connected layer is removed, so the output of the first-stage CNN is a one-dimensional feature vector. InceptionV3 only accepts images with a resolution of 299x299 pixels, so the camera images have to be resized accordingly.
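A minimal sketch of this encoder stage, assuming the standard Keras InceptionV3 application with the classification head removed (the names OUTPUT_DIM, WIDTH and HEIGHT are the ones reused in the later snippets):
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet and drop the final classification layer,
# keeping the 2048-dimensional pooled features as the image encoding
base_model = InceptionV3(weights='imagenet')
encode_model = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)
OUTPUT_DIM = 2048          # length of the feature vector fed to the decoder
WIDTH, HEIGHT = 299, 299   # input resolution required by InceptionV3
The decoder that consumes these feature vectors is defined as follows: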
from tensorflow.keras.layers import Input, Dropout, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image branch: the 2048-dimensional InceptionV3 encoding of the frame
inputs1 = Input(shape=(OUTPUT_DIM,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# Text branch: the partial caption fed through the GloVe-initialized embedding and the LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
# Merge the two branches and predict the next word over the vocabulary
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
The code snippet above defines the decoder: it merges the InceptionV3 image features with the LSTM text branch and predicts the next word of the caption. Together with the CNN encoder this implements an encoder-decoder architecture.
Once this is done, we have to loop through the training and test image folders and pre-process each image.
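A hedged sketch of this pre-processing loop is shown below; the encodeImage helper, the train_images list of image paths and the output file name are assumptions based on the notebook's structure.
import pickle
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.inception_v3 import preprocess_input

def encodeImage(path):
    # Resize to the 299x299 input expected by InceptionV3, pre-process and encode
    img = load_img(path, target_size=(HEIGHT, WIDTH))
    x = np.expand_dims(img_to_array(img), axis=0)
    x = preprocess_input(x)
    return encode_model.predict(x).reshape(OUTPUT_DIM)

# Encode every training image once and cache the encodings under /data
encoding_train = {img: encodeImage(img) for img in train_images}
with open('data/train_encodings.pkl', 'wb') as f:
    pickle.dump(encoding_train, f)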
The last part of the network is a recurrent Long Short-Term Memory (LSTM) network. This network takes sequences and tries to predict the next word in a sequence. Work on these types of networks was done by A. Karpathy at Stanford, who pointed out how well suited they are to such tasks.
The last step is to train the network. For this project 6 epochs were used initially, with the loss starting at around 2.6. To get acceptable results, however, the loss has to drop well below 1, so one has to train for at least 10-15 epochs.
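A minimal training sketch, assuming the training data has already been expanded into image features X1, partial caption sequences X2 and one-hot next words y as in the notebook (the notebook itself uses a generator, which is preferable on limited memory):
from tensorflow.keras.layers import Embedding

# Inject the GloVe matrix into the embedding layer and keep it frozen during training
for layer in caption_model.layers:
    if isinstance(layer, Embedding):
        layer.set_weights([embedding_matrix])
        layer.trainable = False

caption_model.compile(loss='categorical_crossentropy', optimizer='adam')
caption_model.fit([X1, X2], y, epochs=6, batch_size=64)
caption_model.save_weights('data/caption_model.hdf5')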
After the network is trained, we load the trained weights and test the network on test images from the dataset, as well as on images that are not part of the original dataset.
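Captions are produced by greedy decoding: starting from the start token, the model repeatedly predicts the most likely next word until the end token appears. A sketch of such a generateCaption helper (the wordtoidx/idxtoword lookups and max_length come from the notebook and are assumptions here):
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generateCaption(photo):
    # photo: a (1, OUTPUT_DIM) InceptionV3 encoding of a single image
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = [wordtoidx[w] for w in in_text.split() if w in wordtoidx]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = caption_model.predict([photo, sequence], verbose=0)
        word = idxtoword[np.argmax(yhat)]
        in_text += ' ' + word
        if word == 'endseq':
            break
    # strip the start/end tokens before returning the caption
    return in_text.replace('startseq', '').replace('endseq', '').strip()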
If the images are similar in style and content to the images from the Flickr8K dataset, the descriptions are relatively accurate. If the content differs in style, the network will likely produce descriptions that are plain nonsense.
Training concludes the first phase of the project. At this point you have an exported Keras model with the weights, as well as the pickle files for the test and training encodings. All the data under the /Captioning folder can be uploaded to the Jetson Nano using WinSCP.
Jetson Nano Camera Setup
The second phase consists of setting up the Jetson Nano with the camera. A USB camera with VGA resolution was used for this project.
The same versions of TensorFlow 2.0, Python and Keras need to be installed on the Jetson Nano in order to avoid compatibility issues.
sudo pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v42 tensorflow-gpu==2.0.0+nv19.11
OpenCV is used to capture frames from the camera in a continuous loop. To demonstrate live image captioning on video, we have to overlay text on top of the live video feed, which can also be done using the OpenCV API. First we need to install the correct version.
Installing OpenCV
OpenCV 4.1 was compiled from source, which can take a while. To install version 4.1, I used the script below:
curl -L https://github.com/opencv/opencv/archive/4.1.1.zip -o opencv-4.1.1.zip
curl -L https://github.com/opencv/opencv_contrib/archive/4.1.1.zip -o opencv_contrib-4.1.1.zip
unzip opencv-4.1.1.zip
unzip opencv_contrib-4.1.1.zip
cd opencv-4.1.1/
echo "** Building..."
mkdir release
cd release/
cmake -D WITH_CUDA=ON -D ENABLE_PRECOMPILED_HEADERS=OFF -D CUDA_ARCH_BIN="5.3" -D CUDA_ARCH_PTX="" -D WITH_GTK=OFF -D WITH_QT=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.1.1/modules -D WITH_GSTREAMER=ON -D WITH_LIBV4L=ON -D BUILD_opencv_python2=ON -D BUILD_opencv_python3=ON -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_EXAMPLES=OFF -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local ..
make -j3
sudo make install
Notice that GTK was turned off to avoid a library conflict that occurs when compiling with the default settings.
Once OpenCV was installed, the camera was tested using the file test_openCV.py attached below. The USB camera shows up as /dev/video0.
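A minimal capture test in the same spirit (a sketch, not necessarily identical to the attached script) looks like this:
import cv2

# Open the USB camera that shows up as /dev/video0
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('Camera test', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()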
After the frame is captured, text can be overlaid on top of each frame using the following function:
def __draw_label(img, text, pos, bg_color):
    font_face = cv2.FONT_HERSHEY_TRIPLEX
    scale = 1
    color = (255, 255, 255)
    thickness = cv2.FILLED
    margin = 5
    # Measure the text so the background rectangle can be sized to fit it
    txt_size = cv2.getTextSize(text, font_face, scale, thickness)
    end_x = pos[0] + txt_size[0][0] + margin
    end_y = pos[1] - txt_size[0][1] - margin
    # Draw a filled background rectangle, then render the caption text on top of it
    cv2.rectangle(img, pos, (end_x, end_y), bg_color, thickness)
    cv2.putText(img, text, pos, font_face, scale, color, 2, cv2.LINE_AA)
The image below shows a frame captured from the camera with the date overlaid on top of the frame.
All the images taken from the camera via the OpenCV API are NumPy arrays, so each frame has to be converted to an image, resized to match the InceptionV3 input requirements, converted back to an array, and pre-processed further. Part of this could be avoided with cameras that offer multiple programmable resolutions.
The Jetson Nano does not have a particularly powerful GPU compared to the latest RTX-class GPUs; hence, training the network should definitely be done on a host machine.
Now that we have the basic image pipeline working on the Nano, we copy the encoded pickle files and the GloVe embeddings to the Jetson Nano and load the trained weights of the image captioning network.
The basic image pipeline will be augmented with the image captioning network. Once a frame is captured, it is converted from a NumPy array to an image, resized, and converted back to a NumPy array. The image is then pre-processed and passed through the Inception network to get the encoding vector, which is finally reshaped into the format expected by the LSTM network.
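A sketch of such an encodeImageArray helper, assuming the encoder model and constants from the training notebook have been re-created on the Nano:
import cv2
import numpy as np
from tensorflow.keras.applications.inception_v3 import preprocess_input

def encodeImageArray(frame):
    # OpenCV delivers BGR frames; convert to RGB and resize to the 299x299 InceptionV3 input
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (WIDTH, HEIGHT))
    x = preprocess_input(np.expand_dims(img.astype(np.float32), axis=0))
    # Run the CNN encoder and flatten the result to a 1-D feature vector
    return encode_model.predict(x).reshape(OUTPUT_DIM)
The main loop then ties capture, encoding, captioning and display together: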
# Open the USB camera and run the captioning loop frame by frame
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Encode the frame with InceptionV3 and reshape it for the LSTM decoder
    framenc = encodeImageArray(frame).reshape((1, OUTPUT_DIM))
    capstr = generateCaption(framenc)
    print("Caption:", capstr)
    print("_____________________________________")
    # Overlay the generated caption on the live video feed
    __draw_label(frame, capstr, (0, 150), (50, 125, 50))
    cv2.imshow('Frame', frame)
    if cv2.waitKey(25) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
Each acquired video frame is then passed through the captioning network. The textual description is then overlaid on top of the video feed in real time for demonstration.
The network requires 2-3 minutes to load since it reads and parses all the encodings. Then it reads an image frame and passes that through the network. The inference happens really fast.
Initially the network will print a couple of low-memory warnings. Bear in mind that it is not optimized with TensorRT, so further speed gains can be obtained by converting it and by substituting InceptionV3 with a more efficient CNN such as Xception.
The main applications for such a system would be coastal monitoring, park security surveillance and similar scenarios where automated surveillance can have a positive impact, helping to save lives and keep environments secure.
Further Improvements
The next step is to convert the TensorFlow model with NVIDIA's TensorRT in order to gain an additional speed-up.
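A rough sketch of such a conversion using the TF-TRT converter bundled with TensorFlow 2.0 (the SavedModel paths are assumptions and the model must first be exported in SavedModel format):
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert an exported SavedModel of the captioning network into a TF-TRT optimized model
converter = trt.TrtGraphConverterV2(input_saved_model_dir='data/caption_model_savedmodel')
converter.convert()
converter.save('data/caption_model_trt')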
Since this is a modular system, the output of the network can be passed to a notifier that sends an email every time a word of interest appears in the image description.
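For example, a minimal notifier sketch (the SMTP server, addresses and keyword list are placeholders) could watch the generated caption for words of interest:
import smtplib
from email.message import EmailMessage

KEYWORDS = {'person', 'boat', 'dog'}  # hypothetical words of interest

def notify_if_interesting(caption):
    # Send an email whenever a keyword appears in the generated caption
    if not KEYWORDS.intersection(caption.lower().split()):
        return
    msg = EmailMessage()
    msg['Subject'] = 'Surveillance alert'
    msg['From'] = 'camera@example.com'
    msg['To'] = 'operator@example.com'
    msg.set_content('Detected: ' + caption)
    with smtplib.SMTP('localhost') as server:
        server.send_message(msg)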
A further development is to couple this with a conversational AI system in order to build an "Ask and describe" system.
Conclusion
As can be seen, the network performs acceptably only in those instances where the images are similar in content to the training images.
To improve the descriptions one needs a much larger text corpus as well as a much larger annotated dataset. While Flickr30K is almost 4x the size of the current dataset, one can get much better results using the MS COCO dataset. The catch is that you then need a powerful GPU or have to make use of the cloud.