Speech-to-Text Converter in Python

As a Python developer, I often found myself working with audio recordings and voice notes that contained valuable information but were difficult to reuse. The problem became even more noticeable when the speech was in different languages, and I needed the content in text form for documentation.

Replaying audio repeatedly or typing it out by hand was inefficient, especially when all I wanted was clean, editable text that I could copy, paste, or store for later use. That’s when I decided to build a Speech-to-Text Converter in Python.

This tool not only converts spoken audio into readable text but also supports multilingual speech, making it easier to work with content across languages. Users can instantly copy the transcribed text or download it as a TXT file, turning audio into a flexible and reusable format.

In this tutorial, I’ll walk through the key concepts behind building this Speech-to-Text Converter using Python.

Speech-to-Text Converter in Python

I developed the Speech-to-Text Converter in Python and used the Streamlit library to design the user interface. Streamlit is well suited to building simple web apps entirely in Python.

I have also used the Whisper library to convert speech to text.

pip install streamlit openai-whisper torch pydub

The command above installs all the necessary Python libraries. For audio handling, I used pydub, which in turn requires FFmpeg to be installed on the system.

FFmpeg is needed because Python and Whisper cannot read most audio formats by themselves. FFmpeg is the translator that converts those audio files into raw sound data that Whisper can understand.

Windows: Download FFmpeg from https://ffmpeg.org (on Debian or Ubuntu Linux, run sudo apt install ffmpeg instead).

Hover over the Windows icon and choose a build provider such as BtbN or gyan.dev. The BtbN link redirects to a GitHub page.
Download the latest GPL shared build or the release full archive (e.g., ffmpeg-master-latest-win64-gpl.zip or ffmpeg-release-full.7z).

Extract

Open the downloaded archive and extract all its contents.
Rename the extracted folder to ffmpeg.
Move this ffmpeg folder to the root of your C: drive, so that the executables sit in C:\ffmpeg\bin.

Add to PATH Environment Variable

Open the Windows Search bar, type environ, and select “Edit the system environment variables”.
In the System Properties window, click “Environment Variables”.
Under “System variables,” find and select the Path variable, then click “Edit”.
Click “New” and paste the path to your FFmpeg bin folder: C:\ffmpeg\bin.
Click OK on all open windows to save.

Verify the installation by opening Command Prompt and typing ffmpeg -version. If the PATH entry is correct, the command prints the FFmpeg version information.
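The same check can be done from Python before the app starts. Here is a minimal sketch that uses only the standard library; the helper names ffmpeg_available and ffmpeg_version_line are hypothetical and are not part of the original tool.

```python
import shutil
import subprocess


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the executable can be found on the PATH."""
    return shutil.which(executable) is not None


def ffmpeg_version_line(executable="ffmpeg"):
    """Return the first line printed by `ffmpeg -version`, or None if unavailable."""
    if not ffmpeg_available(executable):
        return None
    completed = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None
```

Calling ffmpeg_available() at startup lets the app show a clear message instead of failing later during decoding.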

Internally, Whisper hands the file to FFmpeg for decoding before the model sees it: whisper → ffmpeg → audio decoding → model
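To make that pipeline more concrete, the sketch below builds an FFmpeg command that decodes any supported file into 16 kHz mono 16-bit PCM, which is roughly what Whisper's audio loader asks FFmpeg for. The function names are hypothetical, and the exact flags Whisper uses may differ between versions.

```python
import subprocess

SAMPLE_RATE = 16000  # Whisper models work on 16 kHz audio


def build_decode_command(path, sample_rate=SAMPLE_RATE):
    """Build an ffmpeg command that writes raw mono PCM to stdout."""
    return [
        "ffmpeg",
        "-nostdin",            # never wait for keyboard input
        "-i", path,            # input file in any format ffmpeg can read
        "-f", "s16le",         # raw signed 16-bit little-endian output
        "-ac", "1",            # downmix to one channel
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),
        "-",                   # send the result to stdout
    ]


def decode_to_pcm(path):
    """Run ffmpeg and return the raw PCM bytes (raises CalledProcessError on failure)."""
    completed = subprocess.run(
        build_decode_command(path), capture_output=True, check=True
    )
    return completed.stdout
```

This is why a missing FFmpeg installation breaks transcription even though the Python packages installed correctly.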

Supported Audio File Formats

MP3
WAV
M4A
FLAC
OGG

The converter accepts uploads in these five formats.

User Controls in Python Speech-to-Text Converter

The interface starts with a Browse files button for uploading an audio file. Once a file is uploaded, a Transcribe audio button appears. After the audio has been transcribed, the text is displayed in a text area, along with a Download as TXT button that saves the transcribed text as a text file.
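The flow of these controls can be sketched as follows. This is only an outline under assumptions: render_app, txt_filename, and the transcribe callable are hypothetical names, and the labels are illustrative rather than the exact ones used in the tool.

```python
from pathlib import Path


def txt_filename(audio_name):
    """Derive the download name from the uploaded file name."""
    return Path(audio_name).with_suffix(".txt").name


def render_app(transcribe):
    """Draw the upload, transcribe, and download controls.

    `transcribe` is a hypothetical callable that receives the uploaded
    file object and returns the transcribed text.
    """
    # Imported here so txt_filename can be used without Streamlit installed.
    import streamlit as st

    audio_file = st.file_uploader(
        "Upload audio file", type=["mp3", "wav", "m4a", "flac", "ogg"]
    )
    if audio_file is None:
        return  # nothing uploaded yet, so no further controls are shown

    if st.button("Transcribe audio"):
        # Session state keeps the transcript across Streamlit reruns.
        st.session_state["transcript"] = transcribe(audio_file)

    transcript = st.session_state.get("transcript")
    if transcript:
        st.text_area("Transcribed text", transcript, height=300)
        st.download_button(
            "Download as TXT",
            data=transcript,
            file_name=txt_filename(audio_file.name),
            mime="text/plain",
        )
```

In app.py, render_app would be called once at the bottom of the script with a real transcription function.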

Streamlit’s default upload limit is 200 MB per file. The limit is changed in Streamlit’s configuration file, not in your app code:

Locate your project folder
Create a .streamlit folder inside it
Create a config.toml file inside the .streamlit folder

[server]
maxUploadSize = 50

Inside the config.toml file, add the lines above to set the upload limit to 50 MB; choose whatever value suits your requirements. Save config.toml, then restart the app with streamlit run app.py so that the new limit takes effect.
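If you prefer to script this setup, the folder and file can be created from Python as well. A minimal sketch, assuming a hypothetical helper named write_upload_limit; note that it overwrites any existing config.toml in that folder.

```python
from pathlib import Path


def write_upload_limit(project_dir, max_upload_mb):
    """Create .streamlit/config.toml with the given upload limit in MB."""
    config_dir = Path(project_dir) / ".streamlit"
    config_dir.mkdir(parents=True, exist_ok=True)

    config_path = config_dir / "config.toml"
    # Overwrites the file, so merge by hand if you already have other settings.
    config_path.write_text(
        f"[server]\nmaxUploadSize = {int(max_upload_mb)}\n",
        encoding="utf-8",
    )
    return config_path
```

For example, write_upload_limit(".", 50) run from the project folder produces the same file as the manual steps.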

Libraries Used in the Speech-to-Text Converter in Python

Here is the list of libraries that I used in the Python Speech-to-Text Converter tool.

import streamlit as st
import whisper
import tempfile
import os
import warnings
from pydub import AudioSegment

streamlit

Python’s streamlit is used to build the user interface of the application. It handles audio uploads, buttons, text display, error messages, and session state management, making the tool interactive and user-friendly.

whisper

Used for converting speech into text. Python’s Whisper model processes the uploaded audio file and generates accurate transcriptions, including support for multiple languages.
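A basic Whisper call looks roughly like the sketch below. The model name "base" is an assumption, since the tutorial does not say which model size the tool loads, and the helper names are hypothetical.

```python
def extract_text(result):
    """Pull the transcript string out of a Whisper result dictionary."""
    return result.get("text", "").strip()


def transcribe_file(path, model_name="base", language=None):
    """Transcribe an audio file; language=None lets Whisper detect the language."""
    # Imported here so extract_text can be used without Whisper installed.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    return extract_text(result)
```

Loading the model is slow, so in a Streamlit app it is worth loading it once and reusing it rather than on every button click.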

tempfile

Python’s tempfile module is used to create temporary files for storing the uploaded audio and the generated text files. This avoids permanent storage and helps keep the system clean.

os

Used for file system operations such as checking file existence and deleting temporary files after processing is completed.
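Together, tempfile and os cover the life cycle of an uploaded file: write it to disk so FFmpeg can read it by path, then delete it once processing is done. A minimal sketch, with hypothetical helper names:

```python
import os
import tempfile


def save_to_temp(data, suffix):
    """Write bytes to a temporary file and return its path."""
    # delete=False keeps the file after it is closed, so other tools
    # (such as ffmpeg) can open it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def remove_if_exists(path):
    """Delete the file if it is still there; do nothing otherwise."""
    if os.path.exists(path):
        os.remove(path)
```

In the app, remove_if_exists is best called in a finally block so the temporary file is cleaned up even when transcription fails.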

warnings

Python’s warnings module is used to suppress unnecessary runtime warnings (such as the FP16 warning on CPU), keeping the application logs clean and readable.
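One way to silence only that warning, rather than all warnings, is to filter on its message. This sketch assumes the Whisper warning text starts with "FP16 is not supported on CPU"; the wording may vary between Whisper versions, and the function name is hypothetical.

```python
import warnings


def suppress_fp16_warning():
    """Ignore Whisper's FP16-on-CPU warning while leaving other warnings visible."""
    # `message` is a regular expression matched against the start of the text.
    warnings.filterwarnings("ignore", message="FP16 is not supported on CPU")
```

Filtering by message is narrower than warnings.filterwarnings("ignore"), which would also hide warnings you may want to see.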

pydub

Python’s pydub library is used to calculate the audio duration and validate the minimum and maximum time limits before transcription.

ffmpeg (dependency, not a Python import)

Used internally for audio decoding and format handling when working with different audio file types.

Validations Used in Python Speech-to-Text Converter

Here are some validations that I have used in the Speech-to-Text Converter.

Audio Duration Validation (Minimum and Maximum)

The tool validates the length of the uploaded audio file before transcription. Audio files shorter than the minimum required duration (approximately 2 seconds) are rejected to avoid empty or meaningless transcriptions.

Similarly, audio files longer than the maximum allowed duration (20 minutes) are restricted to ensure performance and stability.

# audio is a pydub AudioSegment; len(audio) is its duration in milliseconds
duration_seconds = len(audio) / 1000

if duration_seconds < 2:
    st.error("Audio duration must be at least 2 seconds.")
    st.stop()

if duration_seconds > 20 * 60:
    st.error("Audio duration exceeds 20 minutes. Please upload a shorter audio file.")
    st.stop()
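The same limits can be kept in a small function that is independent of Streamlit, which makes the boundaries easy to test. A minimal sketch; the names duration_error, MIN_SECONDS, and MAX_SECONDS are hypothetical.

```python
MIN_SECONDS = 2
MAX_SECONDS = 20 * 60


def duration_error(duration_ms, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Return an error message for out-of-range audio, or None if it is acceptable."""
    seconds = duration_ms / 1000  # pydub reports len(audio) in milliseconds
    if seconds < min_seconds:
        return f"Audio duration must be at least {min_seconds} seconds."
    if seconds > max_seconds:
        return f"Audio duration exceeds {max_seconds // 60} minutes."
    return None
```

In the app, the result of duration_error(len(audio)) can be passed to st.error followed by st.stop whenever it is not None.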

Supported Audio File Format Validation

Only supported audio formats are accepted by the Python Speech-to-Text Converter. This validation keeps files of other types from being uploaded, so the speech recognition engine only receives formats it can decode.

audio_file = st.file_uploader(
    "🎧 Upload audio file",
    type=["mp3", "wav", "m4a", "flac", "ogg"]
)
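The type argument filters uploads by file extension, so it does not inspect the file contents; a corrupted file with a valid extension is only caught later, when decoding fails. If file names reach your code by another route, the same extension check can be sketched as below (the names are hypothetical):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def has_supported_extension(filename):
    """Return True if the file name ends with one of the supported extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```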

Audio File Size Validation

The tool checks the size of the uploaded audio file and restricts files that exceed the allowed limit. This helps prevent memory issues and improves overall processing speed.

MAX_SIZE_MB = 25  # maximum allowed upload size in megabytes

# audio_file.size is the size of the uploaded file in bytes
if audio_file.size > MAX_SIZE_MB * 1024 * 1024:
    st.error(f"File too large. Please upload audio under {MAX_SIZE_MB} MB.")
    st.stop()

Silence Detection Validation

Whisper applies its speech detection thresholds while transcribing, and the tool then checks whether any speech segments were returned. If the audio contains little or no speech, an error is shown instead of unreliable output.

# result is the dictionary returned by Whisper's transcribe() call
segments = result.get("segments", [])

if not segments:
    st.error("No speech detected in the audio.")
    st.stop()
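A stricter variant is to look at each segment's no_speech_prob value, which Whisper includes in its segment dictionaries, and keep only segments that are likely to contain speech. This is a sketch rather than what the tool does: the function name is hypothetical and the 0.6 cutoff is an assumption that you may need to tune.

```python
def speech_segments(segments, max_no_speech_prob=0.6):
    """Keep segments that have text and are unlikely to be silence."""
    return [
        segment
        for segment in segments
        if segment.get("no_speech_prob", 0.0) <= max_no_speech_prob
        and segment.get("text", "").strip()
    ]
```

If speech_segments(result.get("segments", [])) comes back empty, the audio can be treated as silent.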

Empty or Invalid Audio Input Validation

If the uploaded file is empty, corrupted, or cannot be read properly, the application stops execution and displays a clear error message. This protects the system from unexpected crashes.

Transcription Failure Handling

During the transcription process, any runtime errors are caught using Python exception handling. If transcription fails, the user is notified with an appropriate message, and the process is safely stopped.
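The pattern can be sketched as a small wrapper around whatever function performs the transcription. Both safe_transcribe and the transcribe callable are hypothetical names used for illustration.

```python
def safe_transcribe(transcribe, path):
    """Return (text, error_message); exactly one of the two is None."""
    try:
        return transcribe(path), None
    except Exception as exc:
        # Broad on purpose: decoding, model, and I/O errors should all end
        # in a user-facing message instead of a crash.
        return None, f"Transcription failed: {exc}"
```

In the app, a non-empty error message would be passed to st.error and followed by st.stop.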

Speech Supported by Speech-to-Text Converter in Python

Here is the list of languages that the Speech-to-Text Converter tool supports:

Arabic
Russian
Spanish
French
English
Chinese
Japanese

You can upload an audio file in any of the languages above, and the transcribed text will be displayed below the upload. You can also download the transcribed text as a text file.
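Whisper can detect the language on its own, but if you want to let users pick one, the names above map to the language codes Whisper accepts. A minimal sketch; the mapping and function name are illustrative and not taken from the original tool.

```python
LANGUAGE_CODES = {
    "Arabic": "ar",
    "Russian": "ru",
    "Spanish": "es",
    "French": "fr",
    "English": "en",
    "Chinese": "zh",
    "Japanese": "ja",
}


def language_code(name):
    """Return the code for a supported language name, or None for auto-detection."""
    return LANGUAGE_CODES.get(name.strip().title())
```

The returned value can be passed as the language argument of Whisper's transcribe call; None leaves detection to Whisper.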

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.
