Voice-Assisted Image Generation With Stable Diffusion
A voice-assisted app to generate images from speech
Ever since text-to-image models such as DALL-E, DALL-E 2, and Google Imagen demonstrated breakthroughs by generating astonishingly realistic images from nothing more than a textual prompt, there has been increasing interest among users in testing these models themselves.
A few months back, Stable Diffusion fulfilled this desire: a handful of model checkpoints were open-sourced, and they can be accessed through either a GUI or an API.
Now, while most of the prior applications built by the community are driven by text prompts, the integration of speech to generate images is still relatively unexplored.
Therefore, in this blog, we will build a Streamlit application that takes speech-based inputs from the user and generates an image.
More specifically, we shall first record the voice input. Next, we will leverage a speech-to-text model to transcribe the audio. Lastly, we shall pass the transcribed text to the Stable Diffusion model using its API.
The highlights of the article are as follows:
· App Workflow
· Prerequisites
· Building The Streamlit App
· Executing the Application
· Conclusion
You can find the code for this article here.
App Workflow
As discussed above, the Stable Diffusion model expects a text prompt as an input. However, if we start with speech, we first need to convert speech to text and then feed the transcribed text as input to the Stable Diffusion model.
To generate audio transcription, I will use AssemblyAI’s speech-to-text transcription API.
The high-level workflow of the application is demonstrated in the image below:
First, the user will provide voice input, which will be recorded. Next, we will send the audio file to AssemblyAI for transcription. Once the transcribed text is ready and retrieved from AssemblyAI’s servers, we will provide it as input to the Stable Diffusion model using the Replicate API.
Prerequisites
The requirements for creating a voice-based app that can interact with Stable Diffusion are specified below:
#1 Install Streamlit
First, as we are creating this application using Streamlit, we should install the streamlit library using the following command:
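```
pip install streamlit
```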
#2 Install Replicate
Next, to use the Stable Diffusion model, we should also install the Replicate library:
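```
pip install replicate
```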
#3 Import Dependencies
Next, we import the Python libraries we will utilize in this project.
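A minimal set for the app as sketched in this post (the recording helper shown later also needs its own audio packages):

```python
import time  # used to poll AssemblyAI for transcription results

import replicate
import requests
import streamlit as st
```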
#4 Get the AssemblyAI API Token
To leverage the transcription services of AssemblyAI, you should get an API access token from the AssemblyAI website. Let's name it assembly_auth_key for our Streamlit app.
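```python
# Replace the placeholder with your own AssemblyAI API token
assembly_auth_key = "YOUR_ASSEMBLYAI_API_TOKEN"
```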
#5 Get the Stable Diffusion API Key
Lastly, you should obtain an API Key to invoke the image generation model. You can get your API key here.
Once you get the key, run the following command in the terminal:
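```
export REPLICATE_API_TOKEN=<paste-your-token-here>
```

The Replicate client reads this environment variable when authenticating API calls.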
Building The Streamlit App
Once we have fulfilled all the prerequisites for our application, we can proceed with building the app.
For this, we shall define five different functions. These are:
1. record_audio(file_name): As the name suggests, this will allow the user to provide verbal input to the application. The function will record the audio and store it locally in an audio file named file_name. I have referred to this code for integrating this method into the app; a minimal sketch is shown right after this list.
2. upload_to_assemblyai(file_name): This function will take the audio file, upload it to AssemblyAI's server, and return the URL of the file as upload_url.
3. transcribe(upload_url): Once the upload_url is available, we shall create a POST request to transcribe the audio file. This will return the transcription_id, which will be used to fetch the transcription results from AssemblyAI.
4. get_transcription_result(transcription_id): To retrieve the transcribed text, we shall execute a GET request with the transcription_id obtained from the transcribe() method. The function will return the transcribed text, which we will store in a prompt variable.
5. call_stable_diffusion(prompt): Lastly, this function will pass the prompt received from the user to the Stable Diffusion model and retrieve the generated output.
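Method 1: Recording the Audio
As noted in the list above, the recording logic is adapted from referenced code. A rough sketch, assuming the sounddevice and scipy packages and a fixed recording duration, could look like this:

```python
import sounddevice as sd
from scipy.io.wavfile import write


def record_audio(file_name, duration=5, sample_rate=44100):
    # Record a fixed-length mono clip from the default microphone
    recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
    sd.wait()  # block until the recording is finished

    # Save the recording locally as a WAV file
    write(file_name, sample_rate, recording)
```

Running this locally records a short clip from the default microphone and saves it to file_name as a WAV file.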
Method 2: Uploading Audio File to AssemblyAI
Once the audio file is ready and saved locally, we shall upload this file to AssemblyAI and obtain its URL.
However, before uploading the file, we should declare the request headers and the AssemblyAI API endpoints.
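```python
# AssemblyAI endpoints and request headers
upload_endpoint = "https://api.assemblyai.com/v2/upload"
transcription_endpoint = "https://api.assemblyai.com/v2/transcript"

headers = {"authorization": assembly_auth_key}
```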
In the code block above:
- The upload_endpoint specifies AssemblyAI's upload service.
- After uploading the file, we will use the transcription_endpoint to transcribe the audio file.
The upload_to_assemblyai() method is implemented below (a minimal sketch using the requests library):
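```python
def upload_to_assemblyai(file_name):
    # Upload the local audio file to AssemblyAI's upload service
    with open(file_name, "rb") as f:
        response = requests.post(upload_endpoint, headers=headers, data=f)

    # The JSON response contains the URL of the uploaded file
    upload_url = response.json()["upload_url"]
    return upload_url
```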
We make a POST request to AssemblyAI with the upload_endpoint, the headers, and the path to the audio file (file_name). We collect and return the upload_url from the JSON response received.
Method 3: Transcribing the Audio File
Next, we shall define the transcribe() method.
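A minimal sketch, submitting the audio_url to the transcription endpoint:

```python
def transcribe(upload_url):
    # Request a transcription job for the uploaded audio file
    json_payload = {"audio_url": upload_url}
    response = requests.post(transcription_endpoint, json=json_payload, headers=headers)

    # AssemblyAI returns a unique id for the transcription job
    transcription_id = response.json()["id"]
    return transcription_id
```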
In contrast to the POST request made in the upload_to_assemblyai() method, here we invoke the transcription_endpoint instead, as the objective is to transcribe the file. The method returns the transcription_id for our POST request, which we can use to fetch the transcription results.
Method 4: Fetching the Transcription Results
The fourth step in this list is to fetch the transcription results from AssemblyAI using a GET request.
To fetch the results corresponding to our specific request, we should provide the unique identifier (transcription_id) received from AssemblyAI in our GET request. The get_transcription_result() method is implemented below (again as a minimal sketch):
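```python
def get_transcription_result(transcription_id):
    # Poll AssemblyAI until the transcription either completes or fails
    polling_endpoint = f"{transcription_endpoint}/{transcription_id}"

    while True:
        result = requests.get(polling_endpoint, headers=headers).json()

        if result["status"] == "completed":
            prompt = result["text"]  # the transcribed text
            return prompt
        if result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {result['error']}")

        time.sleep(3)  # wait a few seconds before polling again
```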
The transcription run-time will vary depending on the input audio's duration. Therefore, we should make repeated GET requests to check the status of our request and fetch the results once the status changes to completed or indicates an error. Here, we return the transcription text (prompt).
Method 5: Sending the Prompt to Stable Diffusion
The final method will send the prompt as input to the Stable Diffusion model using the Replicate API.
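A minimal sketch, assuming the replicate Python client's run() helper and the stability-ai/stable-diffusion model hosted on Replicate (you may need to pin an explicit model version depending on your client):

```python
def call_stable_diffusion(prompt):
    # Send the transcribed prompt to Stable Diffusion via the Replicate API
    output = replicate.run(
        "stability-ai/stable-diffusion",  # model reference; append ":<version>" to pin a version
        input={"prompt": prompt},
    )
    return output  # typically a list of generated image URLs
```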
Integrating the Functions in the Main Method
As the final step in our Streamlit application, we integrate the functions defined above in the main() method.
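A minimal sketch of this wiring, using a button to trigger recording and a hypothetical local file name:

```python
def main():
    st.title("Voice-Assisted Image Generation with Stable Diffusion")

    if st.button("Record your prompt"):
        file_name = "prompt_recording.wav"  # hypothetical local file name

        record_audio(file_name)
        upload_url = upload_to_assemblyai(file_name)
        transcription_id = transcribe(upload_url)
        prompt = get_transcription_result(transcription_id)

        st.write(f"Prompt: {prompt}")
        output = call_stable_diffusion(prompt)
        st.image(output[0])  # display the first generated image


if __name__ == "__main__":
    main()
```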
Executing the Application
Now that we have built the entire application, it’s time to run it.
Open a new terminal session and navigate to the working directory. Here, execute the following command:
streamlit run file-name.py
Replace file-name.py with the name of your app file.
Demo Walkthrough
Next, let's do a quick walkthrough of our voice-enabled Streamlit Stable Diffusion application.
As we saw above, the app asks the user to speak the prompt. In the walkthrough below, I have presented the following prompt to Stable Diffusion: "A mouse eating a burger from McDonald's."
The application records the audio and saves it to a file locally. Next, it sends the file to AssemblyAI for transcription. Finally, the transcribed text is sent to Stable Diffusion, whose response is displayed on the application.
Conclusion
To conclude, in this post, we built a voice-based interaction tool that prompts Stable Diffusion using the AssemblyAI API and Streamlit.
Specifically, I demonstrated how to take voice input, convert it to text using AssemblyAI and then send that as a prompt to Stable Diffusion.
You can find the code for this article here.
Thanks for reading!
I like to explore, experiment, and write about data science concepts and tools. You could connect with me on LinkedIn.