I started getting the TX2 all set up by updating this and that and making sure OpenCV was wrapped for Python 3. I chose reading a book as the first task to take a crack at and made some good progress, or at least I know where I need to go from here. I also played a bit with saving images and depth maps from the ZED camera.
Improvements that I need to work towards-
Switch to an offline text-to-speech converter that sounds a bit more natural. Google is cool, but I have a nice processor onboard and would like Sophie to work even when the internet is down. You cannot comfort a child in a storm if, when the internet goes out, you magically lose the ability to read.
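A rough sketch of what an offline fallback could look like, calling a local synthesizer such as espeak-ng when it is installed. This is just an idea, not part of the current pipeline; the speak_offline helper name and the espeak-ng dependency are assumptions on my part:

```python
import shutil
import subprocess

def speak_offline(text, wav_path="page.wav"):
    """Synthesize speech locally with espeak-ng (no network needed).

    Returns the output .wav path, or None if espeak-ng is not installed,
    so the caller can fall back to the online gTTS path.
    """
    if shutil.which("espeak-ng") is None:
        return None
    # -w writes the synthesized audio to a file instead of playing it
    subprocess.run(["espeak-ng", "-w", wav_path, text], check=True)
    return wav_path
```

The nice part of returning None instead of raising is that the existing gTTS code can stay as the fallback branch until the offline voice sounds good enough.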
Speed up the translation time. It seems that if I crop the image to just the text area, the time drops significantly, to around 2 seconds. I will keep working on this, though I have a feeling it will be a whole new ball game when the child is holding the book. I think the robot will first need to spend a lot of processing power just stabilizing the incoming image, then divide the image into what is a book page and what is not, and then box those sections. I just need to keep optimizing as I go; no kid (at least not mine) would wait 12 seconds to hear a robot read the first page, I'm afraid, lol.
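The cropping idea can be sketched with plain numpy: threshold the grayscale page and crop to the bounding box of the dark (ink) pixels, so pytesseract only sees the text region. A rough cut that assumes dark text on a light page; the crop_to_text helper is hypothetical, not in the code yet:

```python
import numpy as np

def crop_to_text(gray, threshold=128, margin=5):
    """Crop a grayscale page image to the bounding box of dark pixels.

    Assumes dark text on a light background. Returns the full image
    unchanged if nothing darker than `threshold` is found.
    """
    ink = gray < threshold
    rows = np.flatnonzero(ink.any(axis=1))  # row indices containing ink
    cols = np.flatnonzero(ink.any(axis=0))  # column indices containing ink
    if rows.size == 0 or cols.size == 0:
        return gray
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin + 1, gray.shape[0])
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin + 1, gray.shape[1])
    return gray[top:bottom, left:right]
```

For a handheld, moving book this simple global bounding box will not be enough, but it is a cheap first step before heavier page-segmentation work.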
Pros-
- It works: it can take in a page of a book and convert it to speech, and it even does well at ignoring images (think kids' picture books)
- Did I mention that it works?
Cons-
- It is slow (about 10-12 seconds on average to convert a page of text)
- It runs the OCR twice in this code, though that is more for debugging purposes
- It uses an online converter for text to speech
- It sounds a little too unnatural
Here is the "get it working" code (there is much cleanup to go and the debugging of the pipeline is still in place):
# By Apollo Timbers for Sophie robot project
# Experimental code, needs work
# This reads an image into OpenCV, runs some preprocessing steps, and feeds the result into pytesseract. Pytesseract then extracts the words from the image and prints them as a string.
# It then runs pytesseract again to pipe the text to Google's text-to-speech generator; an .mp3 is sent back and is played after the extracted text is printed.
# import opencv
import cv2
#import pytesseract module
import pytesseract
# import Google Text to Speech
import gtts
from playsound import playsound
# Load the input image
image = cv2.imread('bookpage4.jpg')
#cv2.imshow('Original', image)
#cv2.waitKey(0)
# Use the cvtColor() function to grayscale the image
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Show grayscale
# cv2.imshow('Grayscale', gray_image)
# cv2.waitKey(0)
# Median Blur
blur_image = cv2.medianBlur(gray_image, 3)
# Show Median Blur
# cv2.imshow('Median Blur', blur_image)
# cv2.waitKey(0)
# The blurred image is single-channel grayscale at this point; pytesseract
# expects RGB, so convert from grayscale to RGB format
img_rgb = cv2.cvtColor(blur_image, cv2.COLOR_GRAY2RGB)
print(pytesseract.image_to_string(img_rgb))
# make request to google to get synthesis
tts = gtts.gTTS(pytesseract.image_to_string(img_rgb))
# save the audio file
tts.save("page.mp3")
# play the audio file
playsound("page.mp3")
# Close any OpenCV debug windows that were opened above
cv2.destroyAllWindows()
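As noted in the cons, the script currently calls pytesseract twice. A minimal sketch of the cleanup, with the OCR and speech steps passed in as functions so the expensive engine runs exactly once (read_page is a hypothetical helper, not in the pipeline yet):

```python
def read_page(img_rgb, ocr, speak):
    """Run OCR exactly once, print the text for debugging, then speak it.

    `ocr` would be pytesseract.image_to_string and `speak` the TTS step
    in the real pipeline; both are injected here so the OCR engine is
    invoked a single time and the result is reused.
    """
    text = ocr(img_rgb)
    print(text)          # keep the debug printout from the current script
    if text.strip():     # skip TTS entirely on a blank page
        speak(text)
    return text
```

In the real script this would replace the two back-to-back image_to_string calls with one.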