Building an Alexa- or Siri-Equivalent Bot in Python | by Andrew Hershy | Sep, 2022

A step-by-step guide to building an AI assistant

Table of Contents

  • Introduction
  • APIs and Design
  • Code Breakdown

It’s not as difficult as you think to build an AI program that listens to speech and answers questions. We can make the magic happen in an afternoon by leveraging a few Python packages and APIs.

Here are some back-and-forth questions and answers between me and the bot:

Me: What is the population of Brazil?

Bot: The population of Brazil is estimated to be over 209 million people

Me: What does ice cream taste like?

Bot: Most ice cream tastes sweet and creamy

Me: How do I use chopsticks?

Bot: To use chopsticks, you must hold one in each hand. Place the chopstick in your dominant hand between your thumb and index finger, and hold it in place with your middle finger. Place the chopstick in your non-dominant hand between your thumb and index finger, and hold it in place with your ring finger and pinky finger. To pick up food, use the chopstick in your dominant hand to hold the food in place, and then use the chopstick

Granted, these aren’t the most enlightening answers. And that chopstick one at the end is a bit strange, lol. However, the fact that this application can interpret speech and answer questions, however limited it may seem, is amazing in my opinion. And unlike the mainstream AI assistant bots, we can see what’s under the hood here and play around with it.

  • Run the file from the command prompt when you are ready to ask a question
  • PyAudio enables the computer mic to pick up speech data
  • Audio data is stored in a variable called ‘stream,’ then encoded and transformed into JSON data
  • JSON data is sent to the AssemblyAI API to be converted to text, and the text is then sent back
  • Text data is sent to the OpenAI API to be channeled into the text-davinci-002 engine for processing
  • The answer to the question is retrieved and shown on the console below your question
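The steps above can be sketched as one small function. This is only an outline under assumptions: the name answer_spoken_question and its two callable parameters are hypothetical stand-ins for the AssemblyAI and OpenAI calls, not the article's actual code.

```python
from typing import Callable, Iterable


def answer_spoken_question(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[Iterable[bytes]], str],
    ask: Callable[[str], str],
) -> str:
    """Turn recorded audio into a question, then into an answer."""
    # Speech -> text (the article uses AssemblyAI for this step).
    question = transcribe(audio_chunks)
    # Text -> answer (the article uses OpenAI's text-davinci-002 for this step).
    answer = ask(question)
    # Show the answer on the console below the question.
    print(f"Me: {question}")
    print(f"Bot: {answer}")
    return answer
```

Passing the two services in as plain functions keeps the flow easy to try out with fake stand-ins before any API keys are involved.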

This tutorial utilizes two core APIs:

  • AssemblyAI to transcribe the audio into text
  • OpenAI to interpret the question and return an answer

Design (high level)

This project is broken up into two files: main and openai_helper.

The ‘main’ script mainly handles the voice-to-text API connection. It opens a WebSocket connection, fills in the parameters required for PyAudio, and defines the asynchronous functions that send and receive speech data concurrently between our application and AssemblyAI’s server.

The openai_helper file is short and is used solely to connect to the OpenAI “text-davinci-002” engine. This connection is used to receive answers to our questions.

First, we import all the libraries our application will use. Some of these may need to be installed with pip if you don’t already have them. See the comments for context below:

Then we set up our PyAudio parameters. These inputs are default settings found in various places on the web. Feel free to experiment as needed, but the defaults worked fine for me. We set the stream variable as our initial container for the audio data, and then we print the default input device parameters as a dictionary. The keys of the dictionary mirror the data fields of PortAudio’s structure. Here’s the code:
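A minimal sketch of this kind of setup follows. The numeric values are common choices for speech audio and are assumptions, not necessarily the ones used in this project. The helper names are hypothetical too. PyAudio is imported inside the function so the constants can be inspected without a microphone.

```python
# Illustrative microphone settings (assumed values, not taken from the article).
FRAMES_PER_BUFFER = 3200  # samples read from the mic per chunk
CHANNELS = 1              # mono input
SAMPLE_RATE = 16000       # samples per second (Hz)
SAMPLE_WIDTH_BYTES = 2    # 16-bit audio, matching pyaudio.paInt16


def chunk_seconds() -> float:
    """Length of one audio chunk in seconds."""
    return FRAMES_PER_BUFFER / SAMPLE_RATE


def bytes_per_chunk() -> int:
    """Size of one raw audio chunk in bytes."""
    return FRAMES_PER_BUFFER * CHANNELS * SAMPLE_WIDTH_BYTES


def open_mic_stream():
    """Open the default microphone and return (audio, stream)."""
    import pyaudio  # third-party; imported lazily (pip install pyaudio)

    audio = pyaudio.PyAudio()
    # Dictionary describing the default input device.
    print(audio.get_default_input_device_info())
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )
    return audio, stream
```

With these assumed values, each call to stream.read(FRAMES_PER_BUFFER) would return 6,400 bytes covering 0.2 seconds of audio.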

Next, we create several asynchronous functions for the sending and receiving required to transform our spoken questions into text. These functions run concurrently, which lets the speech data be encoded as base64, wrapped in JSON, sent to the server via the API, and then received back in a readable format. The WebSocket connection is also a vital piece of the script, as that’s what makes the direct stream as seamless as it is.
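To illustrate the send/receive pattern without a network connection, here is a rough sketch in which an in-memory asyncio.Queue stands in for the WebSocket. The function names are hypothetical, and the "audio_data" field name is an assumption about the message format the transcription service expects.

```python
import asyncio
import base64
import json


def encode_chunk(raw: bytes) -> str:
    """Base64-encode raw audio bytes and wrap them in a JSON message."""
    # "audio_data" is an assumed field name, used here for illustration.
    return json.dumps({"audio_data": base64.b64encode(raw).decode("utf-8")})


async def send(chunks, channel: asyncio.Queue) -> None:
    """Encode each audio chunk and push it onto the channel."""
    for raw in chunks:
        await channel.put(encode_chunk(raw))
        await asyncio.sleep(0)  # yield control so receive() can run
    await channel.put(None)  # sentinel: no more audio


async def receive(channel: asyncio.Queue, results: list) -> None:
    """Read messages from the channel until the sentinel arrives."""
    while True:
        message = await channel.get()
        if message is None:
            break
        results.append(json.loads(message))


async def send_receive(chunks) -> list:
    """Run the sender and receiver concurrently."""
    channel = asyncio.Queue()  # stand-in for the WebSocket connection
    results = []
    await asyncio.gather(send(chunks, channel), receive(channel, results))
    return results
```

In the real script, the queue would be replaced by the WebSocket connection, with the sender reading chunks from the microphone stream and the receiver handling transcripts coming back from the server.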

Lastly, we have our simple API connection to OpenAI. If you look at line 44 of the gist above, you can see we are pulling the function ask_computer from this other file and using its output as the answers to our questions.
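A helper along these lines might look like the sketch below. It assumes the legacy (pre-1.0) openai package, where openai.Completion.create is available. Several details are assumptions for illustration rather than the article's settings: the OPENAI_API_KEY environment variable, the max_tokens value, and the optional create parameter, which exists only so the function can be tried without a network call.

```python
import os


def ask_computer(prompt, create=None):
    """Send a question to the completion engine and return the answer text."""
    if create is None:
        import openai  # third-party, legacy pre-1.0 interface assumed

        # Assumed: the API key is stored in an environment variable.
        openai.api_key = os.environ["OPENAI_API_KEY"]
        create = openai.Completion.create
    response = create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=100,  # assumed limit; long answers are cut off at this length
    )
    # The completion text usually starts with blank lines, so strip them.
    return response["choices"][0]["text"].strip()
```

Calling ask_computer("What is the population of Brazil?") with a valid key would return the engine's answer as a plain string, ready to print to the console.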

This was a neat project for anyone interested in playing around with the same technology that makes Siri or Alexa function. Not much coding experience is required because we leverage APIs to do our processing. I would highly recommend forking the repo of this project and playing around first-hand if any reader wants to learn more about these technologies. Cheers!
