How to set up an environment to build your own JARVIS (chatbot agent)
Watching the Iron Man movies, I always wished to have my very own Jarvis, and I am sure all the Iron Man fans out there feel the same. Although Jarvis became famous after the Iron Man movies, films about Artificial Intelligence date back much further. I remember watching a great Artificial Intelligence-based German movie, Metropolis, which was released in 1927.
In all these movies depicting A.I., one thing is common: the machine can understand what we humans are saying and hold up a conversation. Although it sounds like a simple activity, one we perform every moment, when broken down to the most granular level we see how intricately the different components of the human body (the ears, the brain, the mouth, neurons, the nervous system, calcium channels in neurons, hair cells in the cochlea, the larynx...) come together to work as one unit and perform the whole act of having a conversation.
To have an A.I. bot do the same, we need to provide it with ears, a brain, and a mouth (not a loud one :P) at the very least. Now let's keep the hardware side aside (I will talk about it in another story) and focus on the software side of the A.I. bot.
Here we will learn how to set up an end-to-end Python environment so that it can:
- Listen
- Understand
- Speak
Let’s start with the LISTENING part:
For humans, listening is where sound waves are converted into signals in the auditory and nervous system. For an A.I. agent, listening means capturing audio signals and converting them into something that can be fed to the Understanding unit of the agent, and this something is text: READABLE TEXT. Hence this component is called a Speech-To-Text converter, or STT.
Now we are going to install the required libraries in Python to perform STT tasks.
Installing the SpeechRecognition library:
Open a command prompt or conda prompt and run the following command:
pip install SpeechRecognition
Once the installation is done, verify it with the following commands:
import speech_recognition as sr
sr.__version__
Once the SpeechRecognition library is installed, let’s try out a speech recognition function. Here we are testing Google’s speech recognizer:
import speech_recognition as sr

filename = 'c:/audio.wav'  # the speech audio file to be converted
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)  # read the entire audio file
text = r.recognize_google(audio_data)
print(text)
The SpeechRecognition library supports multiple speech recognition engines, such as the Google API, IBM’s API, and CMU Sphinx. The following papers benchmark different speech recognition engines:
- https://link.springer.com/chapter/10.1007/978-3-030-49161-1_7
- https://arxiv.org/ftp/arxiv/papers/1904/1904.12403.pdf
For building a stand-alone bot, i.e. a bot that can work without an internet connection, we need a speech-to-text model that can be executed locally. The Sphinx model developed at CMU works for this purpose.
The CMU Sphinx model needs to be installed before you can use it. You can do a pip install:
pip install pocketsphinx
You may encounter several errors while installing pocketsphinx, like:
- Installing pocketsphinx python module: command ‘swig.exe’ failed
- Visual C++ missing
- Missing pocketSphinx module
The most reliable path to getting CMU Sphinx installed is as follows:
- Install Visual C++ : https://visualstudio.microsoft.com/downloads/
- Then open your conda command prompt and do the following
conda install swig
python -m pip install --upgrade pip setuptools wheel
pip install pocketsphinx
After the installation is successful, you can test it with the following code:
import speech_recognition as sr

filename = 'c:/audio.wav'  # the speech audio file to be converted
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio_data = r.record(source)
text = r.recognize_sphinx(audio_data)  # runs locally, no internet connection required
print(text)
The next step is installing the Speaking Module
The speaking models are known as text-to-speech (TTS) models.
There are several text-to-speech engines available; here I will show pyttsx and Google’s Text-to-Speech (gTTS).
To use pyttsx:
Doing a pip install of pyttsx might get you a pyttsx: No module named ‘engine’ error, since pyttsx targets Python 2. The solution is:
pip install pyttsx3
pip install python-engineio
Then test pyttsx3 using the following code:
import pyttsx3
engine = pyttsx3.init()
text = "Hi I am Jarvis"
engine.say(text)
engine.runAndWait()
Now, for installing Google Text-to-Speech, follow the steps below:
pip install gTTS
And to run gTTS:
import gtts
from playsound import playsound

tts = gtts.gTTS("Hi I am Jarvis")
tts.save("D:/hello.mp3")  # save the synthesized speech to an mp3 file
playsound("D:/hello.mp3")  # play it back
Now that the hearing and speaking capabilities of your A.I. bot are set up, the next step is to set up the brain, which I will discuss in a different post.
I hope this post helped you get one step closer to giving life to your personal A.I. bot.
For developing your own Speech-To-Text module, you can take a look at the following links:
- https://github.com/jim-schwoebel/voice_datasets
- https://deepmind.com/blog/article/wavenet-generative-model-raw-audio