How to set up an environment to build your own JARVIS (chatbot agent)

Sinchan Bhattacharya
4 min read · May 2, 2021

Watching the Iron Man movies, I always wished to have my very own Jarvis. I am sure all the Iron Man fans out there feel the same. Although Jarvis became famous after the Iron Man movies, films about Artificial Intelligence date way back. I remember watching a great Artificial Intelligence-based German movie, Metropolis, which was released in 1927.

J.A.R.V.I.S

In all these movies depicting A.I., one thing is common: it is able to understand what we humans are saying and to hold up a conversation. Although it sounds like a simple activity that we perform every moment, when broken down to the most granular level we see how intricately the different components of the human body (the ears, brain, mouth, neurons, nervous system, calcium channels in neurons, hair cells in the cochlea, larynx, and so on) come together to work as one unit and perform the whole act of having a conversation.

To have an A.I. bot do the same, we need to provide it with ears, a brain, and a mouth (not a loud one :P) at the very least. For now, let's keep the hardware aside (I will talk about it in another story) and focus on the software side of the A.I. bot.

Here we will learn how to set up an end-to-end Python environment so that the bot can:

  1. Listen
  2. Understand
  3. Speak

Let’s start with the LISTENING part:

For humans, listening is where the audio signal is converted into signals in the auditory and nervous system. For an A.I. agent, listening means capturing audio signals and converting them into something that can be fed to the agent's understanding unit, and that something is text: READABLE TEXT. Hence, this component is called a Speech-To-Text converter, or STT.

Speech-to-Text

Now we are going to install the required libraries in Python to perform STT tasks.

Installing the SpeechRecognition library:

Open a command prompt or conda prompt and run the following command:

pip install SpeechRecognition

Once the installation is done, check the installation using the following command:

import speech_recognition as sr
print(sr.__version__)  # prints the installed version if the setup worked

Once the SpeechRecognition library is installed, let's try out a speech recognition function:

Here we are testing Google's speech recognizer (it calls Google's web API, so an internet connection is required):

import speech_recognition as sr

filename = 'c:/audio.wav'  # the speech audio file to be converted
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)  # read the entire audio file
text = r.recognize_google(audio_data)  # send the audio to Google's web API
print(text)
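
The example above transcribes a pre-recorded file, but Jarvis should also be able to listen live. SpeechRecognition can capture audio straight from the microphone as well; here is a minimal sketch, assuming you have a working microphone and the PyAudio package installed (pip install pyaudio):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)  # calibrate against background noise
    print("Say something...")
    audio_data = r.listen(source)       # record until a pause is detected
try:
    print(r.recognize_google(audio_data))
except sr.UnknownValueError:
    print("Sorry, could not understand the audio")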

The SpeechRecognition library supports multiple speech recognition engines, such as Google's API, IBM's API, CMU Sphinx, etc. The following papers benchmark different speech recognition engines:

  1. https://link.springer.com/chapter/10.1007/978-3-030-49161-1_7
  2. https://arxiv.org/ftp/arxiv/papers/1904/1904.12403.pdf

For building a stand-alone bot, i.e. a bot that can work without an internet connection, we need to use a speech-to-text model that can be executed locally. The Sphinx model developed at CMU works for this purpose.

The CMU Sphinx model needs to be installed before you can use it, and this is how you can do it:

You can do a pip install:

pip install pocketsphinx

You may encounter several errors while installing pocketsphinx, like:

  1. Installing pocketsphinx python module: command ‘swig.exe’ failed
  2. Visual C++ missing
  3. Missing pocketSphinx module

The most reliable path to getting CMU Sphinx installed is as follows:

  1. Install Visual C++: https://visualstudio.microsoft.com/downloads/
  2. Then open your conda command prompt and run the following:

conda install swig
python -m pip install --upgrade pip setuptools wheel
pip install pocketsphinx

After the installation is successful, you can test it with the following code:

import speech_recognition as sr

filename = 'c:/audio.wav'  # the speech audio file to be converted
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)
text = r.recognize_sphinx(audio_data)  # runs locally, no internet needed
print(text)
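
Because every engine in SpeechRecognition shares the same recognize_* interface, you can also combine them: try Google's online engine first and fall back to the offline Sphinx engine when there is no internet connection. A minimal sketch:

import speech_recognition as sr

filename = 'c:/audio.wav'
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)
try:
    text = r.recognize_google(audio_data)   # online engine
except sr.RequestError:                     # raised when the API is unreachable
    text = r.recognize_sphinx(audio_data)   # offline fallback
print(text)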

The next step is installing the Speaking Module


The speaking models are known as text-to-speech (TTS) models.

There are several text-to-speech engines available; here I will show pyttsx3 and Google's Text-to-Speech (gTTS).

To use pyttsx3:

A pip install of the older pyttsx package might give you a "pyttsx: No module named 'engine'" error, so the solution is:

pip install pyttsx3
pip install python-engineio

Then test pyttsx3 using the following code:

import pyttsx3

engine = pyttsx3.init()  # initialize the offline TTS engine
text = "Hi I am Jarvis"
engine.say(text)         # queue the phrase to be spoken
engine.runAndWait()      # speak and block until finished
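
pyttsx3 also lets you tune how Jarvis sounds through its property API. A short sketch (the voices available depend on your operating system):

import pyttsx3

engine = pyttsx3.init()
engine.setProperty('rate', 150)             # speaking speed in words per minute
engine.setProperty('volume', 0.9)           # volume between 0.0 and 1.0
voices = engine.getProperty('voices')       # the voices installed on this machine
engine.setProperty('voice', voices[0].id)   # pick the first one
engine.say("Hi I am Jarvis")
engine.runAndWait()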

Now, to install Google Text-to-Speech (gTTS), run the following:

pip install gTTS

And run gTTS (the example below also uses the playsound package, which you can install with pip install playsound):

import gtts
from playsound import playsound

tts = gtts.gTTS("Hi I am Jarvis")  # synthesize speech via Google's TTS API (needs internet)
tts.save("D:/hello.mp3")           # save the generated audio to an mp3 file
playsound("D:/hello.mp3")          # play it back
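
With both the listening and the speaking modules working, here is a minimal sketch that wires the ears to the mouth: an echo bot that repeats back whatever it hears. It assumes SpeechRecognition, PyAudio, and pyttsx3 are installed as described above; there is no understanding yet, since that is the brain's job.

import speech_recognition as sr
import pyttsx3

r = sr.Recognizer()
engine = pyttsx3.init()

with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    print("Listening...")
    audio_data = r.listen(source)

try:
    text = r.recognize_google(audio_data)  # ears: speech to text
    engine.say("You said " + text)         # mouth: text to speech
except sr.UnknownValueError:
    engine.say("Sorry, I did not catch that")
engine.runAndWait()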

Now that you have the hearing and speaking capabilities of the A.I. bot set up, the next step is to set up the brain, which I will discuss in a different post.

I hope this post helped you get one step closer to giving life to your personal A.I. bot.

For developing your own Speech-To-Text module, you can take a look at the following links:

  1. https://github.com/jim-schwoebel/voice_datasets
  2. https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

Reference:

  1. https://predictivehacks.com/simple-example-of-speech-to-text/
