In Part I, Smart Eye for Your Pi, I put a PiCamera under voice control using Mycroft (PiCroft). The resulting images were sent to the clarifai image recognition service, and PiCroft then spoke the tags/concepts returned for each image.
To be quite honest, I almost didn't post the project because I thought it was a bit too simplistic. I had visions of HAL 9000 and such in my head and didn't get close. However, it was a start, and the response was terrific. This motivated me to look into ways I could turn that static list of concepts into something more interesting. After some searching and reading on natural language processing, deep learning, and chatbots, I finally landed on the concept of Natural Language Generation (NLG). This is how Part II of the Smart Eye came to life.
The Smart Eye now handles 2 new intents:
- "Hey Mycroft, do you see anything?" The PiCamera will capture an image and send it to clarifai. After receiving the concepts, it then separates out the nouns and adjectives, randomly samples words from these lists and creates a simple sentence to speak.
- "Hey Mycroft, can you see a ___?" The PiCamera will capture an image and send it to clarifai. After receiving the concepts, it then separates out the nouns and adjectives. Mycroft will then let you know if it "recognizes" the object by comparing the list of returned concepts to the object you asked about.
Before we get into the core of the project, a few important technical details:
Corrected Acknowledgements

During development of my original project, I was not aware that there was already a public skill in the Mycroft skills GitHub repository connecting Mycroft and clarifai, clarifai-image-recognition-skill. I apologize for this oversight in my original write-up.
The essential difference here is that I am sending images from the PiCamera to clarifai, with both steps under voice control. So I guess my original skill could be considered a "mashup" of the PiCamera skill and the clarifai skill.
Using settings.json

After posting the skill, I received some feedback from Mn0491 at Mycroft:
Awesome walkthrough! I was looking through your code, and a good way to store API keys is to leverage the skill settings feature of mycroft. Soon, users should be able to log into the home.mycroft.ai to input their api keys without having to do it through code. Also you can check out using the skill settings feature to store skill specific settings instead instead of using mycroft.conf (saw a TODO comment about it in your code). Good example can be found in the pandora skill on how to leverage the skill settings.
https://github.com/ethanaward/pianobar-skill
This seemed like a great idea, so I went forward with implementing it for this project. I created settings.json with the nano editor. The file lives in the skill's root directory and is accessed from the __init__.py file:
{
    "api_key": "your api_key for clarifai here",
    "img_location": "/home/mycroft/smarteye.jpg"
}
I used this for the clarifai API key and for telling Mycroft where to write the Smart Eye image that will be sent to clarifai. Reading settings.json requires access to the underlying OS file system and directory structure, so it should be done in the initialize function, as below; otherwise the skill will not be able to access it.
def initialize(self):
    self.load_data_files(dirname(__file__))
    self.clarifai_app = ClarifaiApp(api_key=self.settings["api_key"])
    LOGGER.info("api_key:" + self.settings["api_key"])
    ...
Natural Language Generation (NLG)

NLG systems take variable, sometimes unpredictable inputs, and translate these into natural language outputs, like sentences in a human-comprehensible form.
NLG systems have 3 main components:
- content selection - what to say
- surface realization - how to say it
- production - presentation/performance of the output
Sounds pretty simple, right? Well . . . actually I think there are few computing tasks more complicated than NLG. While this project is a very basic NLG system, perhaps only slightly more complicated than a vocal form letter, it does contain these 3 elements and often provides surprisingly good outputs.
The domain of an NLG system is quite important. In our case, the domain happens to be anything placed in front of the PiCamera. This emphasizes a key point when working with NLG, and perhaps voice systems in general: the inputs are highly variable and unpredictable. I could put anything from a doll to a tomato in front of the PiCamera and expect an intelligent response.
The goal of this project is to have PiCroft make a simple English language sentence out of the concepts returned or to have PiCroft confirm it is looking at a specific object by checking against the returned concepts.
So I have reduced the problem domain to PiCroft only having to say, "I see . . ." This is a single-turn skill whose response indicates a single identified object placed in front of it, making the NLG task here a straightforward template-filling problem. This is the content selection component of the NLG system.
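In concrete terms, the surface realization step then boils down to filling a fixed slot template with the selected words. A toy illustration of that idea (mine, not code from the skill; the real implementation shown later uses pattern.en to pick the right article):

# Toy illustration only: the fixed "I see ..." template the skill fills in.
# The adjective and noun would come from the concepts clarifai returns.
adjective, noun = "red", "tomato"
print("I see a " + adjective + " " + noun + " in front of me.")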
Python and the NLG System

In order to create this type of sentence structure, we need to process the concepts returned to us by clarifai's general model. After reviewing the results returned on various objects from Part I of this project, I saw that the concepts were either nouns or adjectives. So in order to create a simple sentence for PiCroft in the form "I see a(n) adjective noun in front of me", we will need to sort out those parts of speech from the list of returned concepts and make a sentence from them. This is the surface realization component.
The main activity of a Mycroft skill takes place in the skill's __init__.py, and that .py file extension means that if you can do it in Python, you can do it with Mycroft!
In my research on NLG, I found an excellent blog by Liza Daly, and this particular post got me thinking about Python-based solutions for NLG: Chatbot Fundamentals: An interactive guide to writing bots in Python. In this series of posts, she describes her adventures in NLP and writing a chatbot she calls Brobot.
The lines of code that got me interested:
# start:example-respond.py
def respond(sentence):
    """Parse the user's inbound sentence and find candidate terms that make up a best-fit response"""
    cleaned = preprocess_text(sentence)
    parsed = TextBlob(cleaned)
    # Loop through all the sentences, if more than one. This will help extract the most relevant
    # response text even across multiple sentences (for example if there was no obvious direct noun
    # in one sentence)
    pronoun, noun, adjective, verb = find_candidate_parts_of_speech(parsed)
    ...
Hey, she parsed out the exact POS that I needed! After some more reading and research, I came upon the Pattern library:
Pattern is a web mining module for Python. It has tools for:
- Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
- Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
- Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
- Network Analysis: graph centrality and visualization.
It is well documented and bundled with 50+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.
Within the family of Pattern modules, I found pattern.en, which provides the necessary functionality not only for parsing the POS but also for determining the correct indefinite article to use for a given noun or adjective!
In order to use it, we'll add the following import statements:
# NLP/NLG related libraries
from random import *
from pattern.en import parse, Sentence, article
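To get a feel for what these functions give us, here is a quick standalone illustration (mine, not from the skill; the exact POS tags depend on pattern's tagger):

# parse() tags each word with a part of speech; article() picks "a" or "an".
from pattern.en import parse, article

print(parse("furry cat", chunks=False))   # something like: furry/JJ cat/NN
print(article("apple"))                   # an
print(article("cat"))                     # a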
Next, we'll create 2 new intents:
def initialize(self):
    ...
    describe_intent = IntentBuilder("DescribeIntent").require("DescribeKeyword").build()
    self.register_intent(describe_intent, self.handle_describe_intent)

    recognize_intent = IntentBuilder("RecognizeIntent").require("RecognizeKeyword").require("ObjName").build()
    self.register_intent(recognize_intent, self.handle_recognize_intent)
The describe_intent will deal with the case where we wish to have Mycroft tell us what he sees in front of him. The recognize_intent will deal with the cases where we ask Mycroft if a specific object is in front of him.
Let's step out of the __init__.py file for a minute and create the appropriate .voc (vocab) files that Mycroft will use to parse keywords and determine which intent handler to call.
The first is DescribeKeyword.voc:
what do you see
what do you see here
tell me about what you see
tell me about what you are looking at
describe what you see
can you tell me what you are looking at
do you see anything
and the second is RecognizeKeyword.voc:
can you see
The Recognize intent must be able to deal with any object you might think to ask Mycroft about. The use of regular expressions in the recognize.rx file helps us to do this:
(can you see|can you see a|can you see the|can you see this|can you see that)(?P<ObjName>.*)
The variable ObjName will be passed on the message bus and into the __init__.py file for us to use:
...
recognize_intent = IntentBuilder("RecognizeIntent").require("RecognizeKeyword").require("ObjName").build()
self.register_intent(recognize_intent, self.handle_recognize_intent)
...

def handle_recognize_intent(self, message):
    self.speak_dialog("general.eye")
    object_name = message.data.get("ObjName")
    ...
This helps handle the unpredictable nature of user inputs. I also handle the case of varying indefinite and definite articles with regex. I wasn't sure if this was the best way to handle them, but it seems to work for now.
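That article-handling code isn't reproduced here, but one way it might look is a small substitution that strips a leading article or demonstrative from the captured utterance (a sketch under that assumption, not necessarily the skill's exact regex):

import re

def clean_object_name(utterance):
    """Strip a leading 'a', 'an', 'the', 'this' or 'that' from the captured object name."""
    return re.sub(r"^\s*(a|an|the|this|that)\s+", "", utterance.strip())

# clean_object_name("a tomato") -> "tomato"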
Now that I have the name of the object we hope to recognize, the PiCamera is activated and an image is taken to be sent to clarifai. I've improved the clarity of the code somewhat by making this process a pair of method calls.
self.take_picture()
results = self.general_model_results()
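The bodies of these two helpers aren't shown in this write-up. Roughly, they might look like the sketch below, assuming the picamera module and the clarifai 2.x Python client (the model name and response fields come from that client, not from the original skill):

from picamera import PiCamera

def take_picture(self):
    """Capture a still image to the path configured in settings.json."""
    camera = PiCamera()
    try:
        camera.capture(self.settings["img_location"])
    finally:
        camera.close()

def general_model_results(self):
    """Send the saved image to clarifai's general model and join the
    returned concept names into one string for the NLG step."""
    model = self.clarifai_app.models.get("general-v1.3")
    response = model.predict_by_filename(self.settings["img_location"])
    concepts = response["outputs"][0]["data"]["concepts"]
    return " ".join(concept["name"] for concept in concepts)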
Future refactoring should take these functions out of __init__.py and into distinct, importable modules.
Now comes some of the NLG fun. This should look familiar, as I modeled it on Liza Daly's code above.
nouns, adjectives = self.nouns_and_adjectives(results)
This parsing of the parts of speech uses the pattern.en library, which returns a POS tag for every word parsed.
The word's part-of-speech tag is NN, which means that it is a noun. . .Common part-of-speech tags are NN (noun), VB (verb), JJ (adjective), RB (adverb) and IN (preposition).
def nouns_and_adjectives(self, results):
    nouns = []
    adjectives = []
    results_tree = parse(results, chunks=False)
    sentence = Sentence(results_tree)
    for word in sentence:
        if word.type == 'NN':
            nouns.append(word.string)
        elif word.type == 'JJ':
            adjectives.append(word.string)
    return nouns, adjectives
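As a standalone illustration of what this does (my made-up example string, not an actual clarifai result; the exact split depends on pattern's tagger):

from pattern.en import parse, Sentence

results = "pet cute fur cat animal"   # made-up concept string
sentence = Sentence(parse(results, chunks=False))
nouns = [w.string for w in sentence if w.type == 'NN']
adjectives = [w.string for w in sentence if w.type == 'JJ']
# nouns might come back as ['pet', 'fur', 'cat', 'animal'] and adjectives as ['cute']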
Now that I have the nouns and adjectives separated, they can be used to make a simple sentence of the form, "I see article adjective noun" or "I don't see article adjective noun, but I do see article adjective noun."
if object_name in nouns:
    yes_str = "Yes I see " + article(object_name) + " " + object_name
    self.speak(yes_str)
else:
    speak_str = "no I don't see " + article(object_name) + " " + object_name
    noun_to_use = ''.join(sample(nouns, 1))
    speak_str = speak_str + " but I do see " + article(noun_to_use) + " " + noun_to_use
    self.speak(speak_str)
The describe_intent and recognize_intent handle these cases somewhat differently. For instance, the describe_intent randomly samples the list of nouns, without regard to the confidence levels, to create its response. The recognize_intent does this as well, but only when the object is not in the list.
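The describe handler itself isn't reproduced in full here; based on the description above, its sentence building might look roughly like this (a sketch consistent with that description, not the skill's exact code):

from random import sample
from pattern.en import article

def build_describe_sentence(nouns, adjectives):
    """Pick a random noun (and adjective, if any were found) and fill
    the 'I see ...' template. Assumes at least one noun was returned."""
    phrase = ''.join(sample(nouns, 1))
    if adjectives:
        phrase = ''.join(sample(adjectives, 1)) + " " + phrase
    # choose "a" or "an" based on the first word of the noun phrase
    return "I see " + article(phrase.split()[0]) + " " + phrase

# inside handle_describe_intent: self.speak(build_describe_sentence(nouns, adjectives))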
Examples

This is really what everyone wants to see. I hope what you see here inspires you to look back over the write-up above! With the following examples, I am demonstrating the production component of my NLG system, courtesy of PiCroft.
"Hey Mycroft, do you see anything?"
"Hey Mycroft, can you see a ___?"
A Short Note on Debugging

As I work more with the Mycroft system, I am getting more comfortable with debugging directly on the platform. I still do some preliminary testing and exploration of functionality in Python on a desktop or laptop, but I am doing a bit more directly on Mycroft/PiCroft. One thing that has helped is my use of the LOGGER object.
from mycroft.util.log import getLogger
...
LOGGER = getLogger(__name__)
...
LOGGER.info("api_key:" + self.settings["api_key"])
...
LOGGER.info('results: '+ results)
LOGGER.info('nouns: ' + ', '.join(n))
LOGGER.info('adjectives: ' + ', '.join(a))
The output of these statements can be found in /var/log/mycroft-skills.log. Use it; it will help you understand not only where the errors in your __init__.py are, but also how all the files in the skill folder work together.
Conclusions

I'm under no illusions. My RPi3 running PiCroft and using clarifai doesn't really understand what it is seeing or saying. The Smart Eye is not going to pass any Turing Test. It is easy to trip up, and it says some really goofy, illogical, and nonsensical things. In fact, it once called my cat a canine. Nothing against dogs, but seriously folks, canine!
However, it is an interesting tool for exploring voice, vision, and AI. A few weeks ago, I didn't know what NLG was; now I have a basic, OK very basic, NLG system up and running. How do I improve it?
The first way, especially in terms of accuracy, will be to select only the nouns and adjectives returned at the highest confidence levels. I will also need to handle articles more reliably and deal with plurals.
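As a sketch of that first idea, the concept names could be filtered on their confidence values before being joined into the results string (this assumes the clarifai 2.x response format used above, where each concept carries a "name" and a "value"):

def high_confidence_names(concepts, threshold=0.90):
    """Keep only the concept names clarifai scored above the threshold."""
    return [c["name"] for c in concepts if c["value"] > threshold]

# e.g. inside general_model_results():
# concepts = response["outputs"][0]["data"]["concepts"]
# return " ".join(high_confidence_names(concepts))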
There must be a cleaner way to implement surface realization than long, never-ending if/elif/else chains, right?
To improve this further, I need to be able to systematically collect and store images, their associated concepts, and the generated verbal responses, and start exposing the Smart Eye to more objects and perhaps larger, more interesting scenes. Is there a way to connect returned concepts, memory of prior objects, and maybe wiki searches to expand the intelligence and human affectation of the responses?
Enjoy, let me know what you think and "Let me see" what you do with your own . . .
References

Designing Voice User Interfaces: Principles of Conversational Experiences, 1st Edition, by Cathy Pearl
Liza Daly's Blog Site