Speaking images into existence using DALL-E mini and AssemblyAI
Speech, text, and images are three of the ways humanity has transmitted information throughout history. In this project, we will build an application that listens to speech, turns that speech into text, and then turns that text into images. All of this can be done in an afternoon. We live in a remarkable time!
This project was influenced by a YouTube tutorial, so please check that out — I found it very helpful, and its creators deserve credit.
Background knowledge needed:
- DALL-E was created by OpenAI. It introduced the world to AI-generated images and took off in popularity about a year ago. OpenAI also offers a free API that powers all sorts of other fun AI-related functions.
- DALL-E mini is an open-source alternative to DALL-E that tinkerers, like you and me, can play around with for free. This is the engine we’ll be leveraging in this tutorial.
- This project leverages AssemblyAI’s transcription models to transcribe speech into text input that the DALL-E mini engine can work with.
- The application listens to speech and displays creative and interesting images.
This project is broken up into two primary files: main.py and dalle.py.
If the summaries of the files below sound like gibberish to you, hang in there! Within the code itself, there are many comments that break down these concepts more thoroughly.
The main.py script handles both the Streamlit web application and the voice-to-text API connection. It involves configuring the Streamlit session state, creating visual features such as buttons and sliders on the web app interface, opening a WebSocket connection, filling in all the parameters required for PyAudio, and creating asynchronous functions for sending and receiving the speech data concurrently between our application and AssemblyAI’s server.
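To make the "sending and receiving concurrently" idea concrete, here is a minimal, self-contained sketch of that pattern. It uses an `asyncio.Queue` as a stand-in for the real WebSocket to AssemblyAI, and a fake "transcription" step, so none of the names or message shapes below are taken from the actual project:

```python
import asyncio
import json

# Illustrative audio parameters of the kind PyAudio needs
# (values here are a common choice, not the project's exact config).
FRAMES_PER_BUFFER = 3200   # ~200 ms of audio at 16 kHz
SAMPLE_RATE = 16000

async def send_audio(ws_queue, chunks):
    # In the real app this reads from the PyAudio stream and sends
    # encoded audio over the WebSocket; here we push fake chunks.
    for chunk in chunks:
        await ws_queue.put(json.dumps({"audio_data": chunk}))
        await asyncio.sleep(0)   # yield control, as the real loop would
    await ws_queue.put(None)     # signal end of stream

async def receive_text(ws_queue, transcripts):
    # In the real app this awaits transcription messages from AssemblyAI.
    while True:
        msg = await ws_queue.get()
        if msg is None:
            break
        data = json.loads(msg)
        transcripts.append(data["audio_data"].upper())  # mock "transcription"

async def run_session():
    queue = asyncio.Queue()
    transcripts = []
    # Run sender and receiver concurrently, as main.py does.
    await asyncio.gather(
        send_audio(queue, ["hello", "world"]),
        receive_text(queue, transcripts),
    )
    return transcripts

results = asyncio.run(run_session())
```

The key design point is `asyncio.gather`: the microphone keeps streaming audio out while transcripts stream back in, without either task blocking the other.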
The dalle.py file is used to connect the Streamlit web application to the Google Colab server running the DALL-E mini engine. This file has a few functions which serve the following purposes:
- Establishes a connection to the backend server and verifies it’s valid
- Initiates call to the server by sending text input for processing
- Retrieves the image JSON data and decodes it using base64.b64decode()
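The decoding step above can be sketched in a few lines. The JSON shape (`{"images": [...]}`) and the idea of POSTing the prompt with `requests` are assumptions for illustration, not the exact DALL-E mini payload:

```python
import base64
import json

def decode_images(response_json):
    # Assumed response shape: {"images": ["<base64 string>", ...]}.
    # Each entry is decoded back into raw image bytes.
    return [base64.b64decode(b64) for b64 in response_json["images"]]

# In dalle.py the JSON would come from an HTTP call to the Colab
# backend, e.g. requests.post(backend_url, json={"prompt": text}).json();
# here we fake a one-image response to exercise the decoding step.
fake_payload = json.loads(json.dumps({
    "images": [base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")]
}))
images = decode_images(fake_payload)
```

Base64 is used because JSON cannot carry raw bytes; the server encodes each generated image to text, and the client reverses that before handing the bytes to Streamlit for display.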
Please reference my GitHub here to see the full application. I tried to include comments and a breakdown of what each chunk of code is doing as I went along, so hopefully it’s fairly intuitive. And please reference the original project’s repository here for additional context.
This project is a proof of concept for something I’d like to have in my house one day. I’d like to have a screen on my wall in the middle of a decorative frame. Let’s call it a smart picture frame. This screen will have a built-in microphone that listens to all conversations spoken in proximity. Using speech-to-text transcription and natural language processing, the frame will filter and choose the most interesting assortment of words spoken every 30 seconds or so. From there, the text will be continually visualized to dynamically add more depth to the atmosphere.
Imagine visual representations and themes of conversation being displayed on the wall in real time during hangouts and family gatherings. How many creative ideas could emerge from something like this? How might the mood of the house change and morph with the mood of its occupants? The house would feel less like an inorganic structure and more like a participant itself. Very interesting to think about.
Alas, this project was a fun way to get our hands dirty and play around with these concepts. It’s somewhat disappointing that DALL-E mini doesn’t produce the same extremely high-quality images that engines like OpenAI’s DALL-E 2 do. Nevertheless, I still enjoyed learning the process and principles behind the technology on this project. Most likely, in a few years, APIs for these high-resolution image-generation services will be easier to access and play around with anyway. Thanks to anyone who made it all the way through. And good luck on your journey toward learning every day.