The goal of this project is to navigate through a virtual reality environment by voice control. The user shall be able to express a desired target by voice and then receive a visual highlighting of that target within the VR environment.
Since in practice an internet connection is not always available for setting up a voice control - or is even undesirable for data protection reasons - I will implement an embedded voice control that works without any internet connection.
This results in three main work steps:
- loading the VR environment on the target platform
- creating the speech dialogue for the voice control
- and linking the VR environment with the voice control to implement the navigation functionality
Voice control systems are more present than ever. Nearly every smartphone owner, smart home user and even car driver already has a digital voice assistant in their immediate vicinity. Typical voice control systems already work very well and reliably - especially compared with the first attempts in cars about ten years ago. These voice control systems usually have one thing in common: they require an internet connection. There are several reasons for this, among them the outsourcing of the processing power that is needed for natural language understanding beyond a fixed vocabulary the user must know. But this internet connection also comes with risks and inevitably raises questions of data security. What happens with the recordings of my voice? Does the voice control always listen? Who has access to this data and is it really only used for voice control purposes?
Friends of mine have already confirmed the phenomenon of coincidentally receiving targeted and personalized advertisement on social media right after talking about a specific product with their mobile nearby - a quick Google search confirms this suspicion.
For those reasons I decided to use a voice control solution for this project which delivers equally high-quality recognition results while respecting the privacy of the user. Such a solution is offered by the German company voice INTER connect GmbH with its product vicCONTROL industrial. The recognition results are processed offline and do not leave the accompanying hardware. Additionally, vicCONTROL industrial supports 30 native languages. In this project I use an ARM-based development kit which is offered by voice INTER connect in cooperation with the likewise German company Phytec. In cooperation with the company Spectra, x86-compatible platforms are also covered.
VR environment

Preparations

For the VR environment I am using an already existing 3D model of a train station. Inside this model various food shops can be found, of which the respective navigation destination is to be visually highlighted later.
The two most important libraries I am using for this are OpenGL and assimp. OpenGL is used for creating and displaying vector graphics in C++. With assimp, 3D models can be loaded and modified in such a way that they can later be displayed via OpenGL.
OpenGL: creating and loading graphics

For OpenGL to be able to display vector graphics, the graphics pipeline has to be run through. The following figure shows the steps of this pipeline that are relevant for this project.
- At first corner points (vertices) are passed to OpenGL. In general, corner points can be considered the smallest components of any 3D object. The connection between two corner points is called an "edge", while a surface spanned by more than two corner points is called a "face". In OpenGL various attributes can be assigned to these corner points. In this project I assign position coordinates, which represent the position inside the three-dimensional space, and texture coordinates that are necessary for the use of textures.
- In the vertex shader the position of each corner point is calculated with the help of the position coordinates.
- As a next step, graphic primitives are created. These are simple geometric shapes formed by connecting individual corner points. For this project only triangles are used as graphic primitives. This means that each 3D object in the VR environment actually consists of two-dimensional triangles.
- Within the process of rasterization the individual graphic primitives are assigned to the pixels of the image memory. The pixels created by this process are labeled as "fragments" within OpenGL.
- Inside the fragment shader the color of each individual fragment is calculated. Each fragment can be manually assigned either color values or textures. For the calculation of the colors I use the already existing textures of the present 3D model in this project.
- During the testing and blending phase the depth test (depth buffering) is performed. It is used to determine whether 3D objects overlap each other. Covered parts of objects are not visible to the user and do not need to be displayed.
After passing through the entire graphics pipeline the 3D model can be displayed on a screen. This process is always run through before displaying such a model. After a single run a static model is available. If the model is to be animated, this process must be run through so many times per second that the motion appears fluid to the human eye (typically about 25 times per second).
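As a minimal sketch (not the project's actual code) of how corner points with position and texture coordinates can be handed to OpenGL as vertex attributes, the following snippet assumes an OpenGL loader such as GLAD is already initialized and that a std::vector<Vertex> has been filled; the struct and function names are illustrative.

```cpp
#include <vector>
#include <cstddef>      // offsetof
#include <glad/glad.h>  // assumed OpenGL loader

// a corner point with the two attributes used in this project
struct Vertex {
    float position[3];   // position inside the three-dimensional space
    float texCoords[2];  // texture coordinates
};

void uploadVertices(const std::vector<Vertex>& vertices, GLuint& VAO, GLuint& VBO)
{
    glGenVertexArrays(1, &VAO);
    glGenBuffers(1, &VBO);

    glBindVertexArray(VAO);
    glBindBuffer(GL_ARRAY_BUFFER, VBO);
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(Vertex), vertices.data(), GL_STATIC_DRAW);

    // attribute 0: position coordinates (used by the vertex shader)
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, position));
    // attribute 1: texture coordinates (used by the fragment shader)
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, texCoords));

    glBindVertexArray(0);
}
```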
To understand the way assimp loads a 3D model, one first needs to understand how assimp structures loaded files internally. To better explain the essential points of this structure I will refer to the following figure.
The entire 3D model is loaded by assimp into a scene object which is further divided into nodes. Each node refers to certain mesh objects (collections of corner points, edges and faces, i.e. 3D objects) inside the scene object. For loading models, assimp offers an importer function that takes care of the loading process. It can be passed certain parameters that adapt the loading process (e.g. complex 3D objects are divided into less complex ones). The following script shows the importer function of this project.
```cpp
//! loads a 3d model with assimp; stores loaded mesh
/*!
 * \param path path of the model
 */
void ModelLoader::loadModel(string const &path)
{
    // imports/loads the model as a scene
    Assimp::Importer importer;
    // post processing options, so the loading process is faster
    const aiScene* scene = importer.ReadFile(path, aiProcess_CalcTangentSpace |
                                                   aiProcess_GenSmoothNormals |
                                                   aiProcess_JoinIdenticalVertices |
                                                   aiProcess_ImproveCacheLocality |
                                                   aiProcess_RemoveRedundantMaterials |
                                                   aiProcess_SplitLargeMeshes |
                                                   aiProcess_Triangulate |
                                                   aiProcess_GenUVCoords |
                                                   aiProcess_SortByPType |
                                                   aiProcess_FindDegenerates |
                                                   aiProcess_FindInvalidData |
                                                   aiProcess_OptimizeMeshes |
                                                   0);
    // checks if the scene was loaded correctly; if not, an error message is displayed
    if (!scene || scene->mFlags & AI_SCENE_FLAGS_INCOMPLETE || !scene->mRootNode) {
        cout << "loading assimp model failed: " << importer.GetErrorString() << endl;
        return;
    }
    // sets the directory; without this the textures can not be loaded
    directory = path.substr(0, path.find_last_of('/'));
    // process the nodes of the scene
    processNode(scene->mRootNode, scene);
}
```
In case this code section raises some questions, I can recommend the assimp documentation of the importer function.
For processing the just loaded 3D model, two important steps have to be performed for each node:
- The information of each individual 3D object needs to be extracted from the assimp mesh objects and converted for OpenGL. This means that the corner points with their related position and texture coordinates must be extracted from the mesh objects.
- All textures inside the loaded model must be loaded with OpenGL and attached to the correct corner points.
This process is also described in this tutorial, which I followed; a rough sketch of the two steps is shown below.
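To make these two steps more concrete, here is a sketch along the lines of that tutorial. processNode() is the function called in the importer above; the Vertex struct, the meshes container and the helper processMesh() are illustrative assumptions rather than the project's actual code, and the texture loading is only indicated.

```cpp
// walks the node hierarchy of the assimp scene (illustrative sketch)
void ModelLoader::processNode(aiNode *node, const aiScene *scene)
{
    // perform both steps for every mesh referenced by this node
    for (unsigned int i = 0; i < node->mNumMeshes; i++) {
        aiMesh *mesh = scene->mMeshes[node->mMeshes[i]];
        meshes.push_back(processMesh(mesh, scene));
    }
    // repeat for all child nodes
    for (unsigned int i = 0; i < node->mNumChildren; i++) {
        processNode(node->mChildren[i], scene);
    }
}

// step 1: extracts the corner points of one assimp mesh and converts them for OpenGL
Mesh ModelLoader::processMesh(aiMesh *mesh, const aiScene *scene)
{
    std::vector<Vertex> vertices;
    for (unsigned int i = 0; i < mesh->mNumVertices; i++) {
        Vertex vertex;
        // position coordinates
        vertex.position[0] = mesh->mVertices[i].x;
        vertex.position[1] = mesh->mVertices[i].y;
        vertex.position[2] = mesh->mVertices[i].z;
        // texture coordinates (a mesh does not necessarily contain any)
        if (mesh->mTextureCoords[0]) {
            vertex.texCoords[0] = mesh->mTextureCoords[0][i].x;
            vertex.texCoords[1] = mesh->mTextureCoords[0][i].y;
        } else {
            vertex.texCoords[0] = 0.0f;
            vertex.texCoords[1] = 0.0f;
        }
        vertices.push_back(vertex);
    }
    // step 2: load the textures of the material referenced by mesh->mMaterialIndex
    // with OpenGL and attach them to the extracted corner points (omitted here)
    return Mesh(vertices);
}
```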
Camera and movement control

You can imagine moving inside the VR environment from the perspective of an actual human being. With the help of the movement control one can move around, while the camera control allows one to look around. The camera is moved with the mouse cursor and its rotation is calculated with the help of the Euler angles pitch and yaw.
At this point I recommend restricting the pitch angle of the camera to the range of -89° to 89°. This prevents two unfavourable situations: the rollover of the camera and the gimbal lock.
For the movement control the arrow keys of the keyboard are used. The particular movement directions are calculated with a pre-defined speed.
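The article does not show the camera code itself, so the following is only a rough sketch of both controls, assuming the GLM math library is used; yaw, pitch, cameraPos, cameraFront, cameraUp and the remaining variables are illustrative members of a camera class.

```cpp
#include <cmath>
#include <glm/glm.hpp>

// mouse movement: update yaw/pitch and restrict the pitch angle
void Camera::processMouse(float xoffset, float yoffset)
{
    yaw   += xoffset * mouseSensitivity;
    pitch += yoffset * mouseSensitivity;

    // restrict pitch to -89..89 degrees to prevent a rollover and the gimbal lock
    if (pitch >  89.0f) pitch =  89.0f;
    if (pitch < -89.0f) pitch = -89.0f;

    // derive the viewing direction from the Euler angles yaw and pitch
    glm::vec3 front;
    front.x = std::cos(glm::radians(yaw)) * std::cos(glm::radians(pitch));
    front.y = std::sin(glm::radians(pitch));
    front.z = std::sin(glm::radians(yaw)) * std::cos(glm::radians(pitch));
    cameraFront = glm::normalize(front);
}

// arrow keys: move with a pre-defined speed, scaled by the frame time
void Camera::processKeyboard(bool up, bool down, bool left, bool right, float deltaTime)
{
    float velocity = movementSpeed * deltaTime;
    if (up)    cameraPos += cameraFront * velocity;
    if (down)  cameraPos -= cameraFront * velocity;
    if (left)  cameraPos -= glm::normalize(glm::cross(cameraFront, cameraUp)) * velocity;
    if (right) cameraPos += glm::normalize(glm::cross(cameraFront, cameraUp)) * velocity;
}
```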
Voice control

vicCONTROL industrial

For a better understanding of the following sections I will first describe the general functionality of vicCONTROL industrial.
When creating and deploying a speech application to a target platform with vicCONTROL industrial, two main layers are involved: the online and the offline layer. At first a speech application is created with the help of the web-based software tool vicSDC and then downloaded as a .zip archive (online layer; displayed in the following figure as a dashed cloud). This file is then stored on the target platform - in my case a miraBoard manufactured by the company Phytec - inside the directory /home/root/speech/dialog_application/. From this point on an internet connection is not required anymore (offline layer; displayed in the following figure as a dashed rectangle). To be able to put the speech application into operation, it is required to generate a license file inside vicSDC and to transfer it to the miraBoard. I recommend studying the vicCONTROL industrial documentation for this process.
As already mentioned, the speech dialogue is created with the web tool vicSDC (a registration code is provided when purchasing the speech kit). When creating the speech dialogue the terms
- Intent
- Slot
- Value
are used. Their meaning is important for the further understanding. Generally speaking, these terms can be understood hierarchically, where an Intent forms the top layer and can consist of several Slots. A Value in turn is a subdivision of a Slot.
Contrary to this hierarchy I recommend setting up the necessary Slots and their Values first. Slots are the elements the user wants to control with their voice. For this project my Slots are "Essen" (Eating) and "Trinken" (Drinking). I use them to distinguish between the different food stores.
Subsequently several Values are assigned to the Slots. These can be seen as the arguments of each Slot, and synonyms can in turn be assigned to them to allow different formulations. For example, inside my speech dialogue I assigned the Value "Softdrinks" to the Slot "Trinken" (Drinking). Since every individual user has different preferences, I assigned different synonyms such as "Cola", "Limo" etc. to the Value "Softdrinks". A little advice: you may want to ask other people which synonyms they would use in order to increase the flexibility of your voice control.
Once the Slots and Values have been entered, the Intents, i.e. the specific control intentions, are created. They can also be described as goals to be achieved by using the voice control. Inside this project I want to implement the navigation inside a virtual train station. Thus I need two Intents:
- for starting the navigation: navigation_bahnhof ("navigation train station")
- for quitting the navigation: navigation_beenden ("stop navigation")
Now comes the part which is critical for the quality of the voice control: example sentences a typical user would say to execute a specific control intention are entered inside vicSDC. One should always enter as many example sentences as possible to maximize the robustness of the voice control. An example sentence for the Intent navigation_bahnhof ("navigation train station") may be "Show me where I can get water ice". After entering the example sentences, one highlights the word(s) that represent the element to control (Slot). It is also possible that one example sentence contains multiple Slots, which is not the case in my project though.
To activate the voice control for receiving control commands, vicCONTROL industrial offers three different methods:
- Push To Talk: the voice control is activated by a single keystroke. It is deactivated after a pre-defined time (timeout).
- Push While Talking: the voice control is activated by keystroke. It remains active as long as the key is pressed.
- Wake Up Phrase: the voice control is activated by a pre-defined phrase and is deactivated after a pre-defined time.
In the course of this project I have chosen a wake up phrase ("Start navigation"). This means that the voice control only reacts to commands if this sentence has been said beforehand. In my case the voice control disables itself in two scenarios:
- the command was recognized, evaluated and a corresponding control action was executed
- no command was recognized during a pre-defined time interval (in this project I set it to 10 seconds)
The communication between the VR environment and the voice control is realized via the network protocol MQTT, over which vicCONTROL industrial publishes its recognition results as JSON strings. In this scenario both the VR environment and the voice control act as MQTT clients. The broker - comparable with a network router - is running on the hardware platform of vicCONTROL industrial.
By default vicCONTROL industrial logs on to the local network with the IP address 192.168.3.11 and publishes its recognition results on the MQTT topic "speech/output". This topic in turn is subscribed to by the VR environment for further processing of the recognition data.
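The article does not state which MQTT client library the VR environment uses; as an illustration only, this is how the subscription could look with the Eclipse Mosquitto C library (the client id "vr-environment" and the standard MQTT port 1883 are assumptions).

```cpp
#include <mosquitto.h>
#include <iostream>
#include <string>

// called for every message published on a subscribed topic
static void on_message(struct mosquitto *, void *, const struct mosquitto_message *msg)
{
    // the payload is the JSON string published by vicCONTROL industrial
    std::string dialogmessage(static_cast<char *>(msg->payload), msg->payloadlen);
    std::cout << dialogmessage << std::endl;   // hand the string over to the JSON evaluation here
}

int main()
{
    mosquitto_lib_init();
    struct mosquitto *mosq = mosquitto_new("vr-environment", true, nullptr);
    mosquitto_message_callback_set(mosq, on_message);

    // the broker runs on the vicCONTROL industrial hardware
    mosquitto_connect(mosq, "192.168.3.11", 1883, 60);
    mosquitto_subscribe(mosq, nullptr, "speech/output", 0);

    mosquitto_loop_forever(mosq, -1, 1);
    mosquitto_destroy(mosq);
    mosquitto_lib_cleanup();
    return 0;
}
```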
Published JSON string

As an illustration, the following JSON string was produced by vicCONTROL industrial for the command "Ich möchte Waffeln und Wasser bitte" ("I would like waffles and water please").
```json
{
    "header" :
    {
        "source" : "navigation_bahnhof_German_Germany",
        "type" : "intent",
        "version" : "1.1.0"
    },
    "content" :
    {
        "orthography" : "ich moechte Waffeln und Wasser bitte",
        "intent" : "navigation_bahnhof",
        "slots" :
        [
            {
                "name" : "essen",
                "value" : "Backwaren",
                "orthography" : "Waffeln",
                "ids" :
                [
                    "0x0000000000000001"
                ]
            },
            {
                "name" : "trinken",
                "value" : "Softdrinks",
                "orthography" : "Wasser",
                "ids" :
                [
                    "0x0000000000000001"
                ]
            }
        ]
    }
}
```
With this example one can see that within one single voice command two different *Slots* ("Essen", "Trinken") were recognized based on the synonyms for the actual *Values* ("Backwaren", "Softdrinks").
Exemplary evaluation of the JSON string

The advantage of JSON strings is that every element of the string can be accessed individually. Therefore only the part that is actually relevant needs to be evaluated. For the evaluation of the JSON strings I use the library RapidJSON. The following script shows an example of the evaluation of an incoming JSON string provided by vicCONTROL industrial.
```cpp
rapidjson::Document response;
// check if the string was parsed correctly; abort the evaluation if not
if (response.Parse<0>(dialogmessage.c_str()).HasParseError()) {
    std::cout << "Error parsing json-string" << std::endl;
    return;
}
// check if the json-string is a valid speech-recognition output by checking if an intent is present
if (response.HasMember("content") && response["content"].HasMember("intent")) {
    // get the intent
    const rapidjson::Value& dialogmessage = response["content"];
    if (dialogmessage["intent"].IsString() && !dialogmessage["intent"].IsNull()) {
        std::string name = dialogmessage["intent"].GetString();
        if (name.compare("navigation_beenden") == 0) {
            // code to be executed when the user finished the navigation
            // e.g. turning off all status lights
        }
        if (name.compare("navigation_bahnhof") == 0) {
            // access the slots
            const rapidjson::Value& dialogmessage = response["content"]["slots"];
            // iterate through all slots inside the json-string
            for (rapidjson::SizeType i = 0; i < dialogmessage.Size(); i++) {
                // access the values
                if (dialogmessage[i]["value"].IsString() && !dialogmessage[i]["value"].IsNull()) {
                    // code to be executed if the value was found inside the recognition result
                    // e.g. highlighting the navigation destination
                }
            }
        }
    }
}
```
The first step is to check whether the incoming JSON string is a complete voice recognition result of vicCONTROL industrial.
Then the system checks whether the navigation was quit (Intent "navigation_beenden"). However, if the Intent "navigation_bahnhof" was parsed, the contained Slots are addressed. The corresponding Values can then be used, for example, to execute the code for the visual highlighting of the navigation destination inside the VR environment.
Controlling the VR environment

In advance I created a cube-shaped object above every food store inside the VR environment with OpenGL, which serves as visual feedback. These cubes are controlled via the variables cubeone_draw, cubetwo_draw, cubethree_draw and cubefour_draw.
A cube is only activated when certain Values have been found while evaluating the JSON string for the Intent "navigation_bahnhof". The cube is deactivated again when the Intent "navigation_beenden" is received.
Thus the script of the previous section can be expanded as follows.
```cpp
// for "navigation_beenden" set the bool variables for the cubes to false
std::string name = dialogmessage["intent"].GetString();
if (name.compare("navigation_beenden") == 0) {
    cubeone_draw = false;
    cubetwo_draw = false;
    cubethree_draw = false;
    cubefour_draw = false;
}
if (name.compare("navigation_bahnhof") == 0) {
    // access the slot values of the json-string
    const rapidjson::Value& dialogmessage = response["content"]["slots"];
    // iterate through as many slot entries as the string contains
    for (rapidjson::SizeType i = 0; i < dialogmessage.Size(); i++) {
        // compare the value-string with the known Values; set the bool variable to true
        // if the specific word is detected and therefore start the cube drawing
        if (dialogmessage[i]["value"].IsString() && !dialogmessage[i]["value"].IsNull()) {
            std::string name = dialogmessage[i]["value"].GetString();
            if (name.compare("Backwaren") == 0) {
                cubeone_draw = true;
            }
            if (name.compare("Süßigkeiten") == 0) {
                cubetwo_draw = true;
            }
            if (name.compare("Asiatisches Essen") == 0) {
                cubethree_draw = true;
            }
            if (name.compare("Softdrinks") == 0) {
                cubefour_draw = true;
            }
        }
    }
}
```
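Inside the render loop the four variables then only need to be checked before the respective draw call; the following minimal sketch illustrates this, where drawCube() and the position variables are placeholders for the project's actual drawing code.

```cpp
// draw a highlighting cube only if its bool variable was set by the voice control
if (cubeone_draw)   drawCube(cubeOnePosition);    // "Backwaren"
if (cubetwo_draw)   drawCube(cubeTwoPosition);    // "Süßigkeiten"
if (cubethree_draw) drawCube(cubeThreePosition);  // "Asiatisches Essen"
if (cubefour_draw)  drawCube(cubeFourPosition);   // "Softdrinks"
```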
Conclusion

The aim of this project - to realize an internet-independent voice control inside a VR environment - was achieved. The user can use several synonyms for the same control intention, which makes the control by voice more robust and flexible.
This article offers an approach for integrating the recognition results of the speech recognizer into the graphical interface.