Build an Application That Turns Voice Into Text Into Image | by Andrew Hershy | Sep, 2022

Speaking images into existence using DALL-E mini and AssemblyAI

DALL-E 2 image. Source prompt: “steampunk Iphone 12”

Speech, text, and images are the three ways humanity has transmitted information throughout history. In this project, we are going to build an application that listens to speech, turns that speech into text, and then turns that text into images. All of this can be done in an afternoon. We live in a remarkable time!

speech to text to image

This project was inspired by a YouTube tutorial. Please check it out; I found it very helpful, and its authors deserve credit.

Background knowledge needed:

  • DALL-E was created by OpenAI. It introduced the world to AI-generated images and took off in popularity about a year ago. OpenAI also has a free API that handles all sorts of other fun AI-related tasks.
  • DALL-E mini is an open-source alternative to DALL-E that tinkerers like you and me can play around with for free. This is the engine we’ll be leveraging in this tutorial.
  • DALL-E Playground is an open-source application that does two things. First, it uses Google Colab to create and run a backend DALL-E mini server, which provides the GPU processing needed to generate images. Second, it provides a front-end web interface, written in JavaScript, where users can enter prompts and view their images.

This project does three things:

  1. Reimplements DALL-E Playground’s front-end interface in Python with Streamlit instead of JavaScript, because (a) the UI looks better, (b) it works more seamlessly with the speech-to-text API, and (c) Python is cooler.
  2. Leverages AssemblyAI’s transcription models to transcribe speech into the text input that the DALL-E mini engine can work with.
  3. Listens to speech and displays creative and interesting images.

This project is broken up into two primary files: a main file and a dalle file.

If the summaries of the files below sound like gibberish to you, hang in there! The code itself contains many comments that break these concepts down more thoroughly.

The main file runs both the Streamlit web application and the voice-to-text API connection. It configures the Streamlit session state, creates visual features such as buttons and sliders on the web app interface, opens a WebSocket connection to AssemblyAI’s server, fills in the parameters required by pyaudio, and defines asynchronous functions that send and receive speech data concurrently between our application and AssemblyAI’s server.
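To make the concurrent send/receive idea more concrete, here is a minimal sketch of the pattern. It is not the code from the repository. I am assuming a message format where audio goes out as JSON with a base64 `audio_data` field and transcripts come back as JSON with `message_type` and `text` fields; check AssemblyAI’s current documentation before relying on that. `FakeConnection` is a hypothetical in-memory stand-in, so the sketch runs without a microphone or an API key. In the real application, the connection would come from the websockets library and `read_chunk` would read from a pyaudio stream.

```python
import asyncio
import base64
import json
from collections import deque


def encode_audio_chunk(chunk: bytes) -> str:
    """Wrap raw audio bytes in the JSON envelope assumed for the transcription socket."""
    return json.dumps({"audio_data": base64.b64encode(chunk).decode("utf-8")})


def extract_final_text(message: str):
    """Return the text of a final transcript message, or None for anything else."""
    payload = json.loads(message)
    if payload.get("message_type") == "FinalTranscript" and payload.get("text"):
        return payload["text"]
    return None


async def send_audio(connection, read_chunk, n_chunks):
    """Read audio chunks and push them to the server one by one."""
    for _ in range(n_chunks):
        await connection.send(encode_audio_chunk(read_chunk()))
        await asyncio.sleep(0)  # yield so the receiver gets a turn


async def receive_text(connection, n_messages):
    """Collect final transcripts while audio is still being sent."""
    transcripts = []
    for _ in range(n_messages):
        text = extract_final_text(await connection.recv())
        if text:
            transcripts.append(text)
    return transcripts


async def stream(connection, read_chunk, n_chunks):
    """Run the sender and the receiver concurrently on one connection."""
    _, transcripts = await asyncio.gather(
        send_audio(connection, read_chunk, n_chunks),
        receive_text(connection, n_chunks),
    )
    return transcripts


class FakeConnection:
    """Hypothetical stand-in for a WebSocket: each sent chunk yields one final transcript."""

    def __init__(self):
        self.sent = []
        self._replies = deque()

    async def send(self, message):
        self.sent.append(message)
        reply = {"message_type": "FinalTranscript", "text": f"chunk {len(self.sent)}"}
        self._replies.append(json.dumps(reply))

    async def recv(self):
        while not self._replies:
            await asyncio.sleep(0)  # wait until the sender has produced something
        return self._replies.popleft()


if __name__ == "__main__":
    fake = FakeConnection()
    print(asyncio.run(stream(fake, lambda: b"\x00\x01" * 4, 3)))
```

The key design choice is `asyncio.gather`: neither coroutine blocks the other, so transcripts can arrive while we are still speaking.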

The dalle file connects the Streamlit web application to the Google Colab server running the DALL-E mini engine. Its functions serve the following purposes:

  1. Establish a connection to the backend server and verify that it’s valid
  2. Initiate a call to the server by sending the text input for processing
  3. Retrieve the image JSON data and decode it using base64.b64decode()
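The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the repository’s code. I am assuming the backend exposes a `/dalle` route that accepts a JSON body with `text` and `num_images` and returns a JSON list of base64-encoded images; your version of the backend may differ. The function names are my own, and I use the standard library’s `urllib` here so the sketch has no extra dependencies.

```python
import base64
import json
import urllib.request


def is_backend_valid(backend_url, timeout=10.0):
    """Step 1: return True if the backend answers a plain GET with HTTP 200."""
    try:
        with urllib.request.urlopen(backend_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # OSError covers network failures; ValueError covers malformed URLs.
        return False


def build_generate_request(backend_url, text, num_images):
    """Step 2 (preparation): build the POST request carrying the text prompt.

    The "/dalle" route and the field names are assumptions about the backend.
    """
    body = json.dumps({"text": text, "num_images": num_images}).encode("utf-8")
    return urllib.request.Request(
        backend_url.rstrip("/") + "/dalle",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_images(response_body):
    """Step 3: turn a JSON list of base64 strings into raw image bytes."""
    encoded_images = json.loads(response_body)
    return [base64.b64decode(item) for item in encoded_images]


def generate_images(backend_url, text, num_images=1, timeout=300.0):
    """Steps 2 and 3 together: send the prompt, then decode the returned images."""
    request = build_generate_request(backend_url, text, num_images)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_images(response.read())
```

Each element returned by `generate_images` is the raw bytes of one image, which can be written to a file or handed to an image widget in the web app. The long default timeout is a guess; image generation on a shared GPU can take a while.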

Please reference my GitHub here to see the full application. I tried to include comments and a breakdown of what each chunk of code is doing as I went along, so hopefully, it’s fairly intuitive. And please reference the original project’s repository here for additional context.

main file:

dalle file:

This project is a proof of concept for something I’d like to have in my house one day. I’d like to have a screen on my wall in the middle of a decorative frame. Let’s call it a smart picture frame. This screen will have a built-in microphone that listens to all conversations spoken in proximity. Using speech-to-text transcription and natural language processing, the frame will filter and choose the most interesting assortment of words spoken every 30 seconds or so. From there, the text will be continually visualized to dynamically add more depth to the atmosphere.
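As a rough idea of what the "choose the most interesting words" step could look like, here is a deliberately naive sketch. It is a stand-in for real natural language processing, not something this project implements: it drops a small hand-picked list of common words and keeps the most frequent remaining ones from a window of transcripts. The stop-word list, the function name, and the ranking rule are all my own assumptions.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; a real system would use a proper NLP library.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "at", "but", "for", "in", "is", "it", "of",
    "on", "or", "so", "that", "the", "this", "to", "was", "we", "were", "with",
})


def pick_prompt_words(transcripts, max_words=5):
    """Build an image prompt from the most frequent non-trivial words in a window."""
    words = []
    for line in transcripts:
        words.extend(re.findall(r"[a-z']+", line.lower()))
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Rank by frequency, then by length, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], -len(w), w))
    return " ".join(ranked[:max_words])


if __name__ == "__main__":
    window = [
        "We saw a dragon at the castle",
        "The dragon was huge",
        "a castle on a mountain",
    ]
    print(pick_prompt_words(window, max_words=3))
```

The frame would call something like this on the transcripts collected over the last 30 seconds or so, then send the result to the image engine.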

Imagine visual representations and themes of conversation being displayed on the wall in real time during hangouts and family gatherings. How many creative ideas could emerge from something like this? How could the mood of the house change and morph with the mood of the participants? The house would feel less like an inorganic structure and more like a participant itself. Very interesting to think about.

All in all, this project was a fun way to get our hands dirty and play around with these concepts. It’s a little disappointing that DALL-E mini doesn’t produce images of the same extremely high quality as engines like OpenAI’s DALL-E 2. Nevertheless, I enjoyed learning the process and principles behind the technology on this project. Most likely, in a few years, APIs for these high-resolution image-generating services will be easier to access and play around with anyway. Thanks to anyone who made it all the way through, and good luck on your journey towards learning every day.
