Whisper-based video text extraction tool

Mar 14, 2025·
Dirk Kemper
· 3 min read

Background

Whisper is a speech-to-text model built by OpenAI, trained on 680,000 hours of multilingual audio, which can easily be run on local hardware, even at the largest model sizes. In my experience it is very accurate while still being fast enough to finish within acceptable time limits. The model is truly multilingual and supports 20+ languages, albeit with varying levels of reliability and quality. Naturally the most widely spoken languages are represented most heavily in the training set, so extraction quality for those is the best.

If you want to extract a specific speech snippet from a video file, doing this in a sound editor like Audacity by picking out the right portion of the waveform can be quite time-consuming. By using Whisper to convert the full video file into written form in a single pass, you can pinpoint the text sequences to extract simply by reading through them.
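To illustrate the idea, once the transcript is available as timestamped segments (here a hypothetical list of (start_ms, end_ms, text) tuples, mirroring the shape of Whisper's segment output), locating a snippet becomes a simple text search rather than waveform scrubbing. A minimal sketch with invented sample data:

```python
# Sketch: locate a spoken phrase in a transcript instead of scrubbing a waveform.
# The segment list mimics Whisper's output shape: (start_ms, end_ms, text).
segments = [
    (0, 3200, "Welcome everyone to the meeting."),
    (3200, 7100, "Today we discuss the quarterly results."),
    (7100, 9800, "Let's start with the revenue numbers."),
]

def find_phrase(segments, phrase):
    """Return (start_ms, end_ms) of the first segment containing the phrase."""
    for start, end, text in segments:
        if phrase.lower() in text.lower():
            return start, end
    return None

print(find_phrase(segments, "quarterly results"))  # (3200, 7100)
```

The returned timestamps are then all you need to cut the matching audio out of the original file.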

I will present a lightweight Streamlit-based interface that extracts audio snippets (as MP3) from a given video file, allowing you to select text fragments and store them as separate audio files.

Installation

The code is in the following GitHub repository: https://github.com/kemperd/whisper-audio-extractor

Clone the repo:

git clone https://github.com/kemperd/whisper-audio-extractor

Create virtual environment:

conda create -n whisper-audio-extractor
conda activate whisper-audio-extractor

Install dependencies:

pip install -r requirements.txt

Run the Streamlit server:

./run.sh

Workflow

The workflow of the tool is as follows:

  1. Upload the video file from which you want to extract texts
  2. Convert the video to audio format, i.e. save the audio stream as a separate file
  3. Run Whisper on the audio file to create a TSV-file identifying all text segments
  4. Extract audio samples from the original file using the TSV file

Each of these steps is performed on its own tab in the interface.
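As a sketch of step 3's output: when invoked with TSV output, Whisper writes a tab-separated file with start, end and text columns, with timestamps in milliseconds. Parsing it into segments for step 4 might look like this (the sample rows are invented for illustration):

```python
import csv
import io

# A minimal sample in the shape of Whisper's TSV output: tab-separated,
# header row, timestamps in milliseconds. The rows here are invented.
tsv_data = (
    "start\tend\ttext\n"
    "0\t2500\tHello and welcome.\n"
    "2500\t6000\tLet's get started.\n"
)

def parse_whisper_tsv(text):
    """Parse Whisper-style TSV output into (start_ms, end_ms, text) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(int(row["start"]), int(row["end"]), row["text"]) for row in reader]

segments = parse_whisper_tsv(tsv_data)
print(segments[0])  # (0, 2500, 'Hello and welcome.')
```

With the segments in hand, each selected fragment's timestamps can be used to slice the corresponding piece out of the audio file.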

The following video shows the above workflow in practice:

Paddings

The sidebar shows two sliders to configure the start and end padding of each extracted audio sample. Using these sliders you can include a bit more audio at either the beginning or the end of the sample. This is needed because Whisper sometimes makes slight errors when determining start and end timestamps, requiring some manual correction.
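The padding logic itself amounts to widening each segment's window and clamping it to the bounds of the audio. A minimal sketch (function and parameter names are my own, not necessarily the tool's):

```python
def pad_segment(start_ms, end_ms, begin_pad_ms, end_pad_ms, audio_len_ms):
    """Widen a segment by the configured paddings, clamped to the audio bounds."""
    padded_start = max(0, start_ms - begin_pad_ms)
    padded_end = min(audio_len_ms, end_ms + end_pad_ms)
    return padded_start, padded_end

# A segment at 1.0-2.5 s with 200 ms of padding on both sides:
print(pad_segment(1000, 2500, 200, 200, 60000))  # (800, 2700)

# Padding never extends before 0 or past the end of the audio:
print(pad_segment(100, 59900, 200, 200, 60000))  # (0, 60000)
```

The clamping matters at the edges: without it, a padded segment near the start of the file would produce a negative timestamp.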

Extraction issues

It sometimes happens that Whisper keeps extracting the same text for a prolonged period, sometimes lasting as long as 30 minutes. If this occurs it may help to switch from the turbo model to a larger model size on the text extraction tab, e.g. large. Note that this will significantly increase extraction time and may in turn introduce other issues not present with the turbo model: in my experience the large model sometimes detects text in fragments that contain no speech at all.
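One way to spot such a repetition loop without listening through the output is to scan the transcript for long runs of identical segments. A rough heuristic (the threshold is arbitrary and the function name is my own):

```python
def has_repetition_loop(texts, threshold=5):
    """Return True if the same text occurs `threshold` or more times in a row."""
    run = 1
    for prev, cur in zip(texts, texts[1:]):
        run = run + 1 if cur.strip() == prev.strip() else 1
        if run >= threshold:
            return True
    return False

# Invented example: Whisper stuck repeating one line of the transcript.
texts = ["Hello.", "Thanks for watching."] + ["Thanks for watching."] * 5
print(has_repetition_loop(texts))  # True
```

If the check fires, rerunning the extraction with a larger model is worth a try before inspecting the audio manually.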