Voice Bot

Giuseppe Attanasio

Last updated on Oct 4, 2023

We all receive hundreds of voice notes every year. However, depending on where you are and when, listening to them is not always socially appropriate.

The main goal of Voice Bot is to transcribe Telegram voice notes using AI. The logic is simple: when you receive a note you would rather read as plain text, you forward it to the bot and get the transcript back.

The bot is currently running at https://telegram.me/the_whisper_bot – feel free to try it!

Things I’ve learned along the way:

Implementing and deploying the full stack takes half a day if you are familiar with Python and REST APIs, which is remarkably fast.
OpenAI’s Whisper is better than Meta’s SeamlessM4T (ehm, no scientific evaluation here; Whisper’s transcripts just seem more reasonable on average).
Both Whisper and SeamlessM4T are fantastic pieces of tech.
Both Whisper and SeamlessM4T generate noisy outputs, especially when senders talk fast, clip their words, or are in a noisy environment.

Components

A Beam App that serves a serverless inference REST endpoint for speech models. Roughly, it receives a voice note’s bytes and returns the transcript as text.
A Python bot that lets people send or forward voice notes and relays them to the Beam App. The bot is implemented with python-telegram-bot.

The app does not log, save, preprocess, or post-process any user data, except for each user’s preferred language and model.
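
To make the exchange between the two components more concrete, here is a minimal sketch of how a voice note could travel from the bot to the endpoint and back. It is not the actual code: the JSON field names ("audio", "language", "model", "transcript") and the base64 encoding are assumptions made for illustration, and the real Beam App may use a different schema.

```python
import base64
import json


def build_request(voice_bytes: bytes, language: str, model: str) -> str:
    """Bot side: pack a voice note and the user's preferences into a JSON body.

    The field names are hypothetical; they are not taken from the real app.
    """
    return json.dumps(
        {
            # Raw bytes are base64-encoded so they can travel inside JSON.
            "audio": base64.b64encode(voice_bytes).decode("ascii"),
            "language": language,
            "model": model,
        }
    )


def read_audio(request_body: str) -> bytes:
    """Endpoint side: recover the raw voice-note bytes from the JSON body."""
    return base64.b64decode(json.loads(request_body)["audio"])


def parse_response(response_body: str) -> str:
    """Bot side: pull the transcript text out of the endpoint's JSON reply."""
    return json.loads(response_body)["transcript"].strip()
```

In a setup like this, the bot would send the output of build_request to the endpoint in an HTTP POST and pass the reply through parse_response before answering the user.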

Models

SeamlessM4T is a multilingual and multi-task model that translates and transcribes across speech and text.

Whisper is a general-purpose speech recognition model. In this project, it is used for multilingual speech recognition and speech translation.

Useful Links

SeamlessM4T Demo on HF: https://huggingface.co/spaces/facebook/seamless_m4t
OpenAI’s Whisper Demo on Beam: https://github.com/slai-labs/get-beam/blob/main/examples/whisper-tutorial/app.py

Limitations

We trim voice notes to a maximum of 60 seconds for SeamlessM4T and 240 seconds for Whisper.
The app gets suspended if it is not invoked for over 120 seconds. If that happens, your next request triggers a cold start and you’ll wait ~60 seconds to get your transcript.
This is a side project, so:
- code is not nice and tidy
- I can’t guarantee 24/7 assistance
- I can’t guarantee it’ll be up forever
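
The per-model trimming mentioned in the first limitation can be pictured with a small sketch. It assumes the audio has already been decoded into a flat sequence of mono samples; the function name and the model keys are made up for illustration, and only the 60 and 240 second caps come from the limits stated above.

```python
# Per-model caps in seconds, as stated in the limitations above.
MAX_SECONDS = {"seamless_m4t": 60, "whisper": 240}


def trim_samples(samples, sample_rate, model):
    """Keep only the first MAX_SECONDS[model] seconds of a mono sample sequence.

    `samples` is any sliceable sequence (list, array) of audio samples and
    `sample_rate` is the number of samples per second.
    """
    max_samples = MAX_SECONDS[model] * sample_rate
    # Slicing past the end is safe, so shorter notes come back unchanged.
    return samples[:max_samples]
```

With a 16 kHz note, for instance, this keeps at most 960,000 samples for SeamlessM4T and 3,840,000 for Whisper.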
