Many people are born with low vision, and they struggle to read properly. They often use a magnifying lens to read, but that is painful to use for long. Another way to help is a text-to-speech device, but the devices on the market are expensive, and recently developed ones are too slow. The Xilinx Kria KV260 can solve this problem by delivering fast text-to-speech conversion.
HARDWARE
The hardware used in this project consists of a Kria KV260, an NYK Nemesis USB camera, an HK5002 USB speaker, some push buttons, and a camera holder made of acrylic. The assembly process is shown in these pictures:
The webcam, speaker, and WiFi dongle are attached to the USB ports, and the push buttons are driven from the PMOD connector. Some holes are drilled in the top of the box for the speaker, so the sound is not muffled. The camera holder is made of acrylic; it stands about 30 cm high with a wingspan of about 15 cm, enough to give the camera a full view of one page.
The push buttons are wired in pull-down mode, so when an I/O pin reads a high pulse, the program executes the corresponding function. The PMOD connector pin numbers are shown in the figure below.
This device runs Ubuntu for the Kria KV260, which can be downloaded and installed through this LINK. I also tried PetaLinux for the Kria KV260, but I prefer Ubuntu. The first step is to log into Ubuntu, which requires an LCD monitor, an HDMI cable, a mouse and keyboard, and a PSU. Since the device uses a USB WiFi dongle, the dongle's driver must be installed via the Ubuntu software updater.
Once connected to the internet, the board can download the software required for this project:
- OpenCV
- Google Tesseract
- Google Text-to-Speech (gTTS)
- Playsound or IPython
- PYNQ PMOD GPIO
- Remote Desktop (Optional)
OpenCV is used to capture an image from the USB camera and convert it to grayscale. Google Tesseract then converts the image into text, and gTTS converts the text into an MP3 audio file. There are several text-to-speech converters, such as pyttsx3, but gTTS offers more flexibility in choosing the speech language. Download and install all the programs in the list.
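The overall capture → OCR → speech flow can be sketched as three chained stages. The stubs below are hypothetical placeholders standing in for the real cv2, pytesseract, and gTTS calls (which need a camera, the Tesseract binary, and network access); the actual implementations appear in the main program:

```python
# Hypothetical stand-ins for the real library calls, to show the data flow only.

def capture_image():
    # real program: cv2.VideoCapture(0).read() + cv2.imwrite(...)
    return "test_cam.jpg"

def image_to_text(image_file):
    # real program: pytesseract.image_to_string(Image.open(image_file))
    return "sample page text"

def text_to_mp3(text):
    # real program: gTTS(text).save('text_gtts.mp3')
    return "text_gtts.mp3"

def run_pipeline():
    image_file = capture_image()
    text = image_to_text(image_file)
    return text_to_mp3(text)
```

Each stage only passes a filename or a string to the next, which is why the real functions can later be swapped in independently.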
Before deployment into the main program, each step needs to be tested: capturing an image, converting the image, and translating the image to audio. For the image conversion test, I downloaded some images from Google; the test results are shown in the picture below, and the text-to-audio conversion result here:
The test shows excellent image conversion results. The image is first converted to grayscale, then converted to text. The detailed steps are shown in the main program.
The PYNQ PMOD GPIO driver is used to drive the Kria KV260's PMOD GPIO push buttons. Thanks to the Xilinx PYNQ developer team for delivering this driver, I did not need to develop it from scratch. The library can be downloaded and installed here.
MAIN PROGRAM
The flow of the main program is shown in the flowchart below. The conversion starts when a button is pressed: button 1 acts as the main trigger to run the image conversion, while the other buttons (2, 3, 4) are used as thread flags.
The program starts with the library imports. As mentioned before, the mandatory modules are cv2, gtts, pytesseract, and playsound. To use the GPIO, we import from the pynq module.
import pyttsx3
import os
import time
import pytesseract
import cv2
from threading import Thread
from multiprocessing import Process
from PIL import Image
from gtts import gTTS
from playsound import playsound
from kv260 import BaseOverlay
from pynq.lib.pmod.pmod_io import Pmod_IO
The main program is written in simple Python syntax, so it is easy to understand and follow. The first function captures an image from the USB camera. You may need to find a compatible camera, since some cameras do not work with OpenCV. The code is shown below:
######## get image from camera #########
def get_image():
    print("get image")
    cam = cv2.VideoCapture(0)
    cam.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 1024)
    #cam.open(0, cv2.CAP_ANY)
    ret, img = cam.read()
    # camera is mounted sideways on the holder, so rotate the frame
    image = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
    cv2.imwrite('test_cam.jpg', image)
    cam.release()
    print("get image done")
The sample text image is taken using cv2.VideoCapture(0); cv2 detects the default camera by searching for the device node, usually located at /dev/video. We set the camera to a higher resolution using cam.set; if it is not set, the camera captures at its default settings, which may result in poor resolution. The image may need to be rotated clockwise or counterclockwise, and it is then saved to test_cam.jpg.
########## Convert image to text ##############
def get_text():
    global text
    print("convert image to text")
    img = cv2.imread(image_path)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if threshold == True:
        gray_img = cv2.threshold(gray_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    elif blur == True:
        gray_img = cv2.medianBlur(gray_img, 5)
    cv2.imwrite('grey_img.jpg', gray_img)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray_img)
    text = pytesseract.image_to_string(Image.open(filename))
    os.remove(filename)
    print(text)
    print("convert image to text done")
The get_text function converts the captured image into text. It starts by reading the image, converts it to grayscale, and applies a filter such as a binary (Otsu) threshold or a median blur. The filtered image is saved under a different file name and passed to Tesseract, which translates the image into a string.
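The Otsu option passed to cv2.threshold picks the cutoff that best separates dark text pixels from the light background automatically. OpenCV computes this internally; a minimal pure-Python sketch of the idea (not the code used by the device) looks like this:

```python
def otsu_threshold(pixels):
    """Pick the gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    w0 = 0     # pixel count of the "dark" class so far
    sum0 = 0   # intensity sum of the "dark" class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy bimodal "page": dark text pixels (20) on a light background (220).
pixels = [20] * 40 + [220] * 60
t = otsu_threshold(pixels)  # lands between the two modes
```

This is why no manual threshold value is needed for pages with different lighting, which matters for a handheld reading device.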
######### TEXT to Speech ###############
def text_to_sound():
    global text
    print("convert text to MP3")
    tts = gTTS(text, lang='id')
    tts.save('text_gtts.mp3')
    print("convert text to MP3 done")
The next function is text_to_sound, which translates the generated text to voice using gTTS. Using the gTTS object we imported earlier, the text is easily converted into an audio file (MP3).
######### PLAYING Sound ###############
def play_sound(filename):
    global play
    # play sound whenever the play flag is raised
    while True:
        if play == True:
            play = False
            print("now playing sound", filename)
            playsound(filename)
This function is called inside our process; it plays the file whenever the play flag is True.
pl_sound = Process(target=play_sound, args=('text_gtts.mp3',))
The play_sound function is wrapped in a Process so that playback can be terminated in the middle of the process.
######### Flow Program ###############
def flow_program():
    global start, play
    while True:
        if start == True:
            start = False
            get_image()
            get_text()
            text_to_sound()
            play = True  # start playing sound
            if pl_sound.is_alive() == False:
                pl_sound.start()
The main flow runs in the flow_program function: when the start flag is True, it runs get_image, get_text, and text_to_sound in sequence, then raises the play flag.
############### SCAN GPIO ############
def scan_gpio1():
    global play, start
    while True:
        time.sleep(0.5)
        val = SW1.read()
        if val == 1:
            start = True
            print("button 1 pressed")
            print("start:", start)

def scan_gpio2():
    global play, start, pl_sound
    while True:
        time.sleep(0.5)
        val = SW2.read()
        if val == 1:
            print("button 2 pressed")
            start = False
            play = False
            # kill the playback process
            pl_sound.terminate()

def scan_gpio3():
    while True:
        time.sleep(0.5)
        val = SW3.read()
        if val == 1:
            print("button 3 pressed")

def scan_gpio4():
    while True:
        time.sleep(0.5)
        val = SW4.read()
        if val == 1:
            print("button 4 pressed")
As in the designed flowchart, the program is driven by the GPIO. When GPIO 1 reads 1, the start flag is raised and the program starts to work. Button 2 stops the playsound process.
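One thing to watch in these polling loops: while a button is held down, val stays 1 and the flag is raised again on every 0.5 s poll. Detecting the rising edge instead fires only once per press. A small sketch with a simulated switch, where a list of samples stands in for successive SW1.read() values:

```python
def rising_edges(samples):
    """Return the indices where the level goes 0 -> 1 (one event per press)."""
    events = []
    prev = 0
    for i, val in enumerate(samples):
        if val == 1 and prev == 0:
            events.append(i)
        prev = val
    return events

# Button held for three polls, released, then pressed again briefly.
samples = [0, 1, 1, 1, 0, 0, 1, 0]
edges = rising_edges(samples)  # -> [1, 6]: two presses, not four events
```

Tracking the previous reading this way inside each scan_gpio loop would keep a long press from restarting the conversion repeatedly.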
if __name__ == "__main__":
    # multithreading
    thread1 = Thread(target=flow_program, name='main thread')
    # scan GPIO
    th_gpio1 = Thread(target=scan_gpio1, name='scan gpio 1')
    th_gpio2 = Thread(target=scan_gpio2, name='scan gpio 2')
    th_gpio3 = Thread(target=scan_gpio3, name='scan gpio 3')
    th_gpio4 = Thread(target=scan_gpio4, name='scan gpio 4')
    # Start threads
    thread1.start()
    th_gpio1.start()
    th_gpio2.start()
    th_gpio3.start()
    th_gpio4.start()
    print("device ready")
Finally, all of the functions above are started in threads. The complete program can be downloaded from my GitHub.
RESULT
The results are shown in the picture above; the device performed well at transforming the image into speech. The first image is the image from the camera, the second is the grayscale conversion of the camera image, and the third is the image-to-text conversion using Tesseract. The audio result can be found here.
Demo video:
ISSUE
1. The major issue is that playsound suddenly stops working after heavy processing such as image conversion.
2. The GPIO module/library cannot be imported from a Python file, which is why the GPIO code only runs in a Jupyter notebook.
SOURCE
1. image 1
2. image 2