Ever wanted to build a voice-activated ChatGPT assistant with the DFRobot ESP32-S3 AI Camera? This blog guides you through the entire process: collecting voice commands via the built-in microphone, saving the audio as a WAV file on an SD card, converting the audio to text with the Deepgram API, querying OpenAI for a response, and printing the answer to the serial terminal.
Components Required:
- DFRobot ESP32-S3 AI Camera
- MicroSD Card
- USB Cable
- Computer with Arduino IDE installed
The ESP32-S3 AI CAM is an advanced camera module built around the ESP32-S3 chip, designed for video image processing and voice interaction. It excels in AI projects such as video surveillance, edge image recognition, and voice dialogue. The module features a wide-angle infrared camera for all-weather monitoring, ensuring clear images even in low-light conditions.
With a built-in microphone and speaker, it supports voice recognition and dialogue, making it ideal for smart home and IoT applications. Additionally, it can connect to the internet via Wi-Fi, enabling advanced tasks like image classification and natural language dialogue through cloud AI platforms.
Unlock the Future of Manufacturing with JUSTWAY's 3D Printing Service!
Experience the pinnacle of innovation and efficiency with JUSTWAY's top-tier 3D Printing Service. Whether you're looking to create detailed prototypes, functional parts, or custom designs, we have you covered.
JUSTWAY's cutting-edge technologies, including SLA, SLS, DLP, MJF, FDM, and SLM, ensure precision and quality in every print. Choose from a wide range of materials such as resins, nylons, metals, and more to bring your vision to life.
Benefit from instant quotes, seamless online order tracking, and professional post-processing services. With rapid production and reliable delivery, JUSTWAY is your go-to partner for all your 3D printing needs.
But wait, there's more! JUSTWAY also offers an array of other top-notch manufacturing services to complement your requirements:
- CNC Machining Service: High-precision milling, turning, and electrical discharge machining (EDM) for intricate parts.
- Sheet Metal Fabrication Service: Custom sheet metal parts tailored to your specifications.
- Injection Molding Service: High-quality, mass-production parts made from various materials.
- Surface Finishing Service: Enhance the appearance and durability of your parts with professional finishing options.
Elevate your manufacturing game with JUSTWAY—where innovation meets perfection!
How to Place an Order on JUSTWAY for Your 3D Model
Ordering your 3D model on JUSTWAY is simple. Start by preparing your 3D CAD file in an accepted format. Visit the JUSTWAY website, sign in or create an account, and upload your design in the "Get Instant Quote" section.
Select your manufacturing process, customize your order with the desired materials and finishes, and receive an instant quote. Confirm the details, make a payment, and track your order online.
Once your order is placed, JUSTWAY will ensure high-quality production with strict quality control checks. Expect your 3D model to be delivered within the specified lead time. Enjoy the convenience and efficiency of bringing your 3D model to life with JUSTWAY!
Project Flow:
My plan for this voice assistant, with OpenAI or DeepSeek support, is simple: ask the ESP32-S3 AI Camera a question. The camera captures the request through its built-in I2S PDM microphone and saves the recording to the SD card.
Then we use Deepgram to convert the audio to text, and OpenRouter to get the answer.
Step 1: Setting Up the Hardware
Insert the MicroSD card into the ESP32-S3 AI Camera.
Step 2: Setting Up the Software
Install Arduino IDE: Download and install the latest version of the Arduino IDE from the official website.
Install Required Libraries: Open the Arduino IDE and install the following libraries:
- SD
- HTTPClient
- WiFiClientSecure
- ArduinoJson
To install these libraries, go to Sketch > Include Library > Manage Libraries, search for each library, and click Install.
Step 3: Recording Audio
Initializing the Microphone: Use the following sketch to initialize the built-in microphone and record an audio command.
The sketch below records 5 seconds of audio, saves it as a .wav file on the SD card, and then plays it back through the built-in speaker.
#include <Arduino.h>
#include <SPI.h>
#include <SD.h>               // required for SD.begin() / SD.open() below
#include "ESP_I2S.h"

#define SAMPLE_RATE (16000)
#define DATA_PIN    (GPIO_NUM_39)
#define CLOCK_PIN   (GPIO_NUM_38)
#define REC_TIME    5              // recording time in seconds
#define AUDIO_FILE  "/audio.wav"   // output file on the SD card
int sck = 12;
int miso = 13;
int mosi = 11;
int cs = 10;
void setup() {
  uint8_t *wav_buffer;
  size_t wav_size;
  I2SClass i2s;
  I2SClass i2s1;

  Serial.begin(115200);
  pinMode(3, OUTPUT);    // recording-indicator LED
  pinMode(41, OUTPUT);

  // PDM microphone (RX)
  i2s.setPinsPdmRx(CLOCK_PIN, DATA_PIN);
  if (!i2s.begin(I2S_MODE_PDM_RX, SAMPLE_RATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("Failed to initialize I2S PDM RX");
  }

  // MAX98357 speaker (TX)
  i2s1.setPins(45, 46, 42);
  if (!i2s1.begin(I2S_MODE_STD, SAMPLE_RATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("MAX98357 initialization failed!");
  }

  // SD card over SPI
  SPI.begin(sck, miso, mosi, cs);
  if (!SD.begin(cs)) {
    Serial.println("SD Card initialization failed!");
    return;
  }

  Serial.println("start REC");
  digitalWrite(3, HIGH);
  wav_buffer = i2s.recordWAV(REC_TIME, &wav_size);

  File file = SD.open(AUDIO_FILE, FILE_WRITE);
  if (file) {
    file.write(wav_buffer, wav_size);
    file.close();
    Serial.println("Audio Saved successfully");
  }
  digitalWrite(3, LOW);

  // Play the recording back through the speaker
  i2s1.playWAV(wav_buffer, wav_size);
}

void loop() {
}
Step 4: Converting Audio to Text Using the Deepgram API
Sign Up for Deepgram: Go to the Deepgram website and sign up for an API key.
Send Audio File to Deepgram: Use the following function to send the WAV file to the Deepgram API and receive the transcription.
Note: Add your Deepgram API key to the code. This function relies on the globals and #defines (client, deepgramApiKey, STT_LANGUAGE, STT_KEYWORDS, TIMEOUT_DEEPGRAM, and the DebugPrint macros) declared in the final sketch in Step 6.
String SpeechToText_Deepgram( String audio_filename )
{
uint32_t t_start = millis();
// ---------- Connect to Deepgram Server (only if needed, e.g. on INIT and after lost connection)
if ( !client.connected() )
{ DebugPrintln("> Initialize Deepgram Server connection ... ");
client.setInsecure();
/* no effect: client.setConnectionTimeout(4000); */
if (!client.connect("api.deepgram.com", 443))
{ Serial.println("\nERROR - WifiClientSecure connection to Deepgram Server failed!");
client.stop(); /* might not have any effect, similar with client.clear() */
return (""); // in rare cases: WiFiClientSecure freezes (library issue?)
}
DebugPrintln("Done. Connected to Deepgram Server.");
}
uint32_t t_connected = millis();
File audioFile = SD.open( audio_filename );
if (!audioFile) {
Serial.println("ERROR - Failed to open file for reading");
return ("");
}
size_t audio_size = audioFile.size();
audioFile.close();
DebugPrintln("> Audio File [" + audio_filename + "] found, size: " + (String) audio_size );
String socketcontent = "";
while (client.available())
{ char c = client.read(); socketcontent += String(c);
} int RX_flush_len = socketcontent.length();
// ---------- Sending HTTPS request header to Deepgram Server
String optional_param; // see: https://developers.deepgram.com/docs/stt-streaming-feature-overview
optional_param = "?model=nova-2-general"; // Deepgram recommended model (high readability, lowest word error rates)
optional_param += (strlen(STT_LANGUAGE) > 0) ? ("&language=" + String(STT_LANGUAGE)) : ("&detect_language=true"); // see #defines
optional_param += "&smart_format=true"; // applies formatting (Punctuation, Paragraphs, upper/lower etc ..)
optional_param += "&numerals=true"; // converts numbers from written to numerical format (works with 'en' only)
optional_param += STT_KEYWORDS; // optionally too: keyword boosting on multiple custom vocabulary words
client.println("POST /v1/listen" + optional_param + " HTTP/1.1");
client.println("Host: api.deepgram.com");
client.println("Authorization: Token " + String(deepgramApiKey));
client.println("Content-Type: audio/wav");
client.println("Content-Length: " + String(audio_size));
client.println(); // header complete, now sending binary body (wav bytes) ..
DebugPrintln("> POST Request to Deepgram Server started, sending WAV data now ..." );
uint32_t t_headersent = millis();
File file = SD.open( audio_filename, FILE_READ );
const size_t bufferSize = 1024; // best values seem anywhere between 1024 and 2048;
uint8_t buffer[bufferSize];
size_t bytesRead;
while (file.available())
{ bytesRead = file.read(buffer, sizeof(buffer));
if (bytesRead > 0) {
client.write(buffer, bytesRead); // sending WAV AUDIO data
}
}
file.close();
DebugPrintln("> All bytes sent, waiting Deepgram transcription");
uint32_t t_wavbodysent = millis();
// ---------- Waiting (!) to Deepgram Server response (stop waiting latest after TIMEOUT_DEEPGRAM [secs])
String response = ""; // waiting until available() true and all data completely received
while ( response == "" && millis() < (t_wavbodysent + TIMEOUT_DEEPGRAM * 1000) )
{ while (client.available())
{ char c = client.read();
response += String(c);
}
// printing dots '.' each 100ms while waiting response
DebugPrint("."); delay(100);
}
DebugPrintln();
if (millis() >= (t_wavbodysent + TIMEOUT_DEEPGRAM * 1000))
{ Serial.print("\n*** TIMEOUT ERROR - forced TIMEOUT after " + (String) TIMEOUT_DEEPGRAM + " seconds");
Serial.println(" (is your Deepgram API Key valid ?) ***\n");
}
uint32_t t_response = millis();
// ---------- closing connection to Deepgram
client.stop(); // observation: unfortunately needed, otherwise the 'audio_play.openai_speech() in AUDIO.H not working !
int response_len = response.length();
String transcription = json_object( response, "\"transcript\":" );
String language = json_object( response, "\"detected_language\":" );
String wavduration = json_object( response, "\"duration\":" );
return transcription;
}
String json_object( String input, String element )
{ String content = "";
int pos_start = input.indexOf(element);
if (pos_start > 0) // if element found:
{ pos_start += element.length(); // pos_start points now to begin of element content
int pos_end = input.indexOf( ",\"", pos_start); // pos_end points to ," (start of next element)
if (pos_end > pos_start) // memo: "garden".substring(from3,to4) is 1 char "d" ..
{ content = input.substring(pos_start, pos_end); // .. thats why we use for 'to' the pos_end (on ," ):
} content.trim(); // remove optional spaces between the json objects
if (content.startsWith("\"")) // String objects typically start & end with quotation marks "
{ content = content.substring(1, content.length() - 1); // remove both existing quotation marks (if exist)
}
}
return (content);
}
This function takes the name of the audio file as input and returns the transcript that Deepgram produced from the WAV file.
Step 5: Querying OpenAI for Responses
Sign Up for an API Key: Go to the OpenRouter website and sign up for an API key. (OpenRouter provides a single API for both OpenAI and DeepSeek models.)
Send Text to the Model: Use the following code to send the transcribed text to the chat-completions endpoint and print the response.
If you wish to use DeepSeek instead of GPT-4o mini, change the model name inside the code (deepseek/deepseek-r1-distill-llama-70b or openai/gpt-4o-mini-2024-07-18).
void deepseek(String userQuestion) {
  if (WiFi.status() == WL_CONNECTED) {
    HTTPClient http;
    http.begin("https://openrouter.ai/api/v1/chat/completions");
    http.addHeader("Content-Type", "application/json");
    http.addHeader("Authorization", String("Bearer ") + apiKey);

    StaticJsonDocument<512> jsonDoc;
    jsonDoc["model"] = "openai/gpt-4o-mini-2024-07-18";             // GPT-4o mini
    // jsonDoc["model"] = "deepseek/deepseek-r1-distill-llama-70b"; // DeepSeek R1
    JsonArray messages = jsonDoc.createNestedArray("messages");
    JsonObject systemMessage = messages.createNestedObject();
    systemMessage["role"] = "system";
    systemMessage["content"] = "Answer";
    JsonObject userMessage = messages.createNestedObject();
    userMessage["role"] = "user";
    userMessage["content"] = userQuestion;

    String requestBody;
    serializeJson(jsonDoc, requestBody);

    int httpResponseCode = http.POST(requestBody);
    String response = http.getString();

    StaticJsonDocument<1024> responseDoc;
    DeserializationError error = deserializeJson(responseDoc, response);
    if (!error) {
      String assistantResponse = responseDoc["choices"][0]["message"]["content"].as<String>();
      Serial.println("DeepSeek: ");
      Serial.println(assistantResponse);
    } else {
      Serial.println("JSON parse error!");
    }
    http.end();
  }
}
Step 6: Integrating Everything
Final Code: Combine the previous steps into a single sketch.
#include <WiFiClientSecure.h> // only here needed
#include <WiFi.h>
#include <HTTPClient.h>
#include <SD.h>
#include <SPI.h>
#include <ArduinoJson.h>
#include "ESP_I2S.h"
// I2S configuration
#define SAMPLE_RATE (16000)
#define DATA_PIN (GPIO_NUM_39)
#define CLOCK_PIN (GPIO_NUM_38)
#define REC_TIME 5 //Recording time 5 seconds
int sck = 12;
int miso = 13;
int mosi = 11;
int cs = 10;
#ifndef DEBUG // user can define favorite behaviour ('true' displays addition info)
# define DEBUG true // <- define your preference here
# define DebugPrint(x); if(DEBUG){Serial.print(x);} /* do not touch */
# define DebugPrintln(x); if(DEBUG){Serial.println(x);} /* do not touch */
#endif
// WiFi credentials
const char* ssid = "";              // ## INSERT your WiFi SSID
const char* password = "";          // ## INSERT your WiFi password
const char* apiKey = "";            // ## INSERT your OpenRouter API key
const char* deepgramApiKey = "";    // ## INSERT your Deepgram credentials !
#define STT_LANGUAGE "en-IN" // forcing single language: e.g. "de" (German), reason: improving recognition quality
#define TIMEOUT_DEEPGRAM 12 // define your preferred max. waiting time [sec] for Deepgram transcription response
#define STT_KEYWORDS "&keywords=KALO&keywords=Janthip&keywords=Google" // optional, forcing STT to listen exactly
// --- global vars -------------
WiFiClientSecure client;
#define AUDIO_FILE "/audio.wav"
I2SClass i2s;
I2SClass i2s1;
void deepseek(String userQuestion) {
  if (WiFi.status() == WL_CONNECTED) {
    HTTPClient http;
    http.begin("https://openrouter.ai/api/v1/chat/completions");
    http.addHeader("Content-Type", "application/json");
    http.addHeader("Authorization", String("Bearer ") + apiKey);

    StaticJsonDocument<512> jsonDoc;
    jsonDoc["model"] = "openai/gpt-4o-mini-2024-07-18";             // GPT-4o mini
    // jsonDoc["model"] = "deepseek/deepseek-r1-distill-llama-70b"; // DeepSeek R1
    JsonArray messages = jsonDoc.createNestedArray("messages");
    JsonObject systemMessage = messages.createNestedObject();
    systemMessage["role"] = "system";
    systemMessage["content"] = "Answer";
    JsonObject userMessage = messages.createNestedObject();
    userMessage["role"] = "user";
    userMessage["content"] = userQuestion;

    String requestBody;
    serializeJson(jsonDoc, requestBody);

    int httpResponseCode = http.POST(requestBody);
    String response = http.getString();

    StaticJsonDocument<1024> responseDoc;
    DeserializationError error = deserializeJson(responseDoc, response);
    if (!error) {
      String assistantResponse = responseDoc["choices"][0]["message"]["content"].as<String>();
      Serial.println("DeepSeek: ");
      Serial.println(assistantResponse);
    } else {
      Serial.println("JSON parse error!");
    }
    http.end();
  }
}
void setup() {
  Serial.begin(115200);

  // Connect to Wi-Fi
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(1000);
    Serial.print(".");
  }
  Serial.println("Connected to WiFi");

  // SD card over SPI
  SPI.begin(sck, miso, mosi, cs);
  if (!SD.begin(cs)) {
    Serial.println("SD Card initialization failed!");
    return;
  }

  pinMode(3, OUTPUT);    // recording-indicator LED
  pinMode(41, OUTPUT);

  // PDM microphone (RX)
  i2s.setPinsPdmRx(CLOCK_PIN, DATA_PIN);
  if (!i2s.begin(I2S_MODE_PDM_RX, SAMPLE_RATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("Failed to initialize I2S PDM RX");
  }

  // MAX98357 speaker (TX)
  i2s1.setPins(45, 46, 42);
  if (!i2s1.begin(I2S_MODE_STD, SAMPLE_RATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("MAX98357 initialization failed!");
  }

  // Record REC_TIME seconds from the microphone
  uint8_t *wav_buffer;
  size_t wav_size;
  Serial.println("start REC");
  digitalWrite(3, HIGH);
  wav_buffer = i2s.recordWAV(REC_TIME, &wav_size);
  digitalWrite(3, LOW);
  Serial.println("Recording finished.");

  // Save the recording to SD card
  File file = SD.open(AUDIO_FILE, FILE_WRITE);
  if (file) {
    file.write(wav_buffer, wav_size);
    file.close();
    Serial.println("Audio Saved successfully");
  }

  // Transcribe, then query the language model
  String transcription = SpeechToText_Deepgram(AUDIO_FILE);
  Serial.println(transcription);
  deepseek(transcription);
}
String json_object( String input, String element )
{ String content = "";
int pos_start = input.indexOf(element);
if (pos_start > 0) // if element found:
{ pos_start += element.length(); // pos_start points now to begin of element content
int pos_end = input.indexOf( ",\"", pos_start); // pos_end points to ," (start of next element)
if (pos_end > pos_start) // memo: "garden".substring(from3,to4) is 1 char "d" ..
{ content = input.substring(pos_start, pos_end); // .. thats why we use for 'to' the pos_end (on ," ):
} content.trim(); // remove optional spaces between the json objects
if (content.startsWith("\"")) // String objects typically start & end with quotation marks "
{ content = content.substring(1, content.length() - 1); // remove both existing quotation marks (if exist)
}
}
return (content);
}
String SpeechToText_Deepgram( String audio_filename )
{
uint32_t t_start = millis();
// ---------- Connect to Deepgram Server (only if needed, e.g. on INIT and after lost connection)
if ( !client.connected() )
{ DebugPrintln("> Initialize Deepgram Server connection ... ");
client.setInsecure();
/* no effect: client.setConnectionTimeout(4000); */
if (!client.connect("api.deepgram.com", 443))
{ Serial.println("\nERROR - WifiClientSecure connection to Deepgram Server failed!");
client.stop(); /* might not have any effect, similar with client.clear() */
return (""); // in rare cases: WiFiClientSecure freezes (library issue?)
}
DebugPrintln("Done. Connected to Deepgram Server.");
}
uint32_t t_connected = millis();
File audioFile = SD.open( audio_filename );
if (!audioFile) {
Serial.println("ERROR - Failed to open file for reading");
return ("");
}
size_t audio_size = audioFile.size();
audioFile.close();
DebugPrintln("> Audio File [" + audio_filename + "] found, size: " + (String) audio_size );
String socketcontent = "";
while (client.available())
{ char c = client.read(); socketcontent += String(c);
} int RX_flush_len = socketcontent.length();
// ---------- Sending HTTPS request header to Deepgram Server
String optional_param; // see: https://developers.deepgram.com/docs/stt-streaming-feature-overview
optional_param = "?model=nova-2-general"; // Deepgram recommended model (high readability, lowest word error rates)
optional_param += (strlen(STT_LANGUAGE) > 0) ? ("&language=" + String(STT_LANGUAGE)) : ("&detect_language=true"); // see #defines
optional_param += "&smart_format=true"; // applies formatting (Punctuation, Paragraphs, upper/lower etc ..)
optional_param += "&numerals=true"; // converts numbers from written to numerical format (works with 'en' only)
optional_param += STT_KEYWORDS; // optionally too: keyword boosting on multiple custom vocabulary words
client.println("POST /v1/listen" + optional_param + " HTTP/1.1");
client.println("Host: api.deepgram.com");
client.println("Authorization: Token " + String(deepgramApiKey));
client.println("Content-Type: audio/wav");
client.println("Content-Length: " + String(audio_size));
client.println(); // header complete, now sending binary body (wav bytes) ..
DebugPrintln("> POST Request to Deepgram Server started, sending WAV data now ..." );
uint32_t t_headersent = millis();
File file = SD.open( audio_filename, FILE_READ );
const size_t bufferSize = 1024; // best values seem anywhere between 1024 and 2048;
uint8_t buffer[bufferSize];
size_t bytesRead;
while (file.available())
{ bytesRead = file.read(buffer, sizeof(buffer));
if (bytesRead > 0) {
client.write(buffer, bytesRead); // sending WAV AUDIO data
}
}
file.close();
DebugPrintln("> All bytes sent, waiting Deepgram transcription");
uint32_t t_wavbodysent = millis();
// ---------- Waiting (!) to Deepgram Server response (stop waiting latest after TIMEOUT_DEEPGRAM [secs])
String response = ""; // waiting until available() true and all data completely received
while ( response == "" && millis() < (t_wavbodysent + TIMEOUT_DEEPGRAM * 1000) )
{ while (client.available())
{ char c = client.read();
response += String(c);
}
// printing dots '.' each 100ms while waiting response
DebugPrint("."); delay(100);
}
DebugPrintln();
if (millis() >= (t_wavbodysent + TIMEOUT_DEEPGRAM * 1000))
{ Serial.print("\n*** TIMEOUT ERROR - forced TIMEOUT after " + (String) TIMEOUT_DEEPGRAM + " seconds");
Serial.println(" (is your Deepgram API Key valid ?) ***\n");
}
uint32_t t_response = millis();
// ---------- closing connection to Deepgram
client.stop(); // observation: unfortunately needed, otherwise the 'audio_play.openai_speech() in AUDIO.H not working !
int response_len = response.length();
String transcription = json_object( response, "\"transcript\":" );
String language = json_object( response, "\"detected_language\":" );
String wavduration = json_object( response, "\"duration\":" );
return transcription;
}
void loop() {
}
Final Output:
Once everything is set up, open the serial terminal and watch for the response.
First, the board connects to the network and records 5 seconds of audio. It then uses Deepgram to convert the audio to text, and queries OpenAI or DeepSeek with the transcript.
In my testing, it returns results pretty fast. My next plan is to play the response back as audio through the ESP32-S3's speaker.
Use Cases:
- Accessibility Solutions: Develop advanced AI assistants for hands-free interaction for individuals with physical disabilities.
- Smart Home Automation: Control home appliances with voice commands for improved convenience.
Congratulations! You've successfully built a voice-command ChatGPT assistant using OpenAI, the DFRobot ESP32-S3 AI Camera, and the Deepgram API. This guide provides a comprehensive walkthrough to help you create and customize your own voice-activated AI assistant. Happy coding!
Feel free to experiment and add more features, and don't hesitate to share your projects and experiences. If you have any questions or need further assistance, feel free to ask!