Python Scripts

Transcribe Audio with multiple speakers

Code

🧠 WhisperX Offline Transcription Setup with GUI

📝 Summary

This guide shows how to set up and patch WhisperX so that a GUI-based Python app can transcribe long audio files (MP3, MP4, etc.) offline, bypassing the voice activity detection (VAD) model download and other network dependencies.


📦 Project Overview


✅ Key Features


🛠️ Setup Instructions

1. 🐍 Python Environment

bash
python -m venv .venv
.venv\Scripts\activate
pip install whisperx torchaudio pydub speechbrain

tkinter ships with Python itself, so it is not installed with pip.
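Before launching the GUI, it can help to confirm that every dependency is importable from the activated environment, including `tkinter`, which comes with the standard Python installer on Windows rather than from pip. The following is a minimal sketch; the helper name `find_missing` and the module list are illustrative assumptions, not part of the original setup.

```python
import importlib.util

# Top-level module names the guide relies on (assumed from the install command).
REQUIRED_MODULES = ["whisperx", "torchaudio", "pydub", "tkinter", "speechbrain"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this inside the virtual environment lists anything that still needs to be installed without importing the heavy packages themselves.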

2. 📁 Folder Structure

bash
/transcriber/
├── transcribe.py                  # GUI application
├── models/vad/pytorch_model.bin   # Downloaded manually
├── .venv/...
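Since the VAD model file is placed by hand, a quick layout check can catch a misplaced file before the app starts. This is a sketch under the assumption that the two files shown in the tree are the ones that matter; `missing_project_files` and `EXPECTED_FILES` are hypothetical names.

```python
from pathlib import Path

# Relative paths the guide expects inside the project folder.
EXPECTED_FILES = ["transcribe.py", "models/vad/pytorch_model.bin"]


def missing_project_files(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_project_files(".")` from the project folder returns an empty list when both files are in place.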

3. 🔧 Environment Variable (Set in transcribe.py)

python
import os

os.environ["WHISPERX_VAD_MODEL_PATH"] = r"D:\PY\models\vad\pytorch_model.bin"
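A hard-coded drive path ties the script to one machine. As an alternative sketch, the variable could be derived from a base folder and validated at the same time. The function name `set_vad_model_path` is hypothetical, and the `models/vad/pytorch_model.bin` layout is assumed from the folder structure in this guide.

```python
import os
from pathlib import Path


def set_vad_model_path(base_dir):
    """Point WHISPERX_VAD_MODEL_PATH at models/vad/pytorch_model.bin under base_dir."""
    model_path = Path(base_dir) / "models" / "vad" / "pytorch_model.bin"
    if not model_path.is_file():
        # Fail early with the full path so a misplaced file is easy to spot.
        raise FileNotFoundError(f"VAD model not found: {model_path}")
    os.environ["WHISPERX_VAD_MODEL_PATH"] = str(model_path)
    return model_path
```

Inside transcribe.py this might be called as `set_vad_model_path(Path(__file__).resolve().parent)`, so the project folder can be moved without editing the script.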

4. 🎯 GUI Usage

Run:

bash
python transcribe.py

Then:


🔧 WhisperX Modifications

vad.py Patch

python
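import os

import torchaudio


def load_vad_model(...):  # parameters elided in these notes; the body uses `device`
    # Require a local model file instead of downloading one.
    model_fp = os.environ.get("WHISPERX_VAD_MODEL_PATH")
    if not model_fp or not os.path.exists(model_fp):
        raise FileNotFoundError("Local VAD model path invalid.")
    print(f"Using local VAD model at: {model_fp}")
    # Placeholder model: the file at model_fp is only checked for existence, not loaded.
    bundle = torchaudio.pipelines.HUBERT_BASE
    return bundle.get_model().to(device).eval()


def merge_chunks(chunks, *args, **kwargs):
    # Pass-through: return the chunks unchanged instead of merging them.
    return chunks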

asr.py Patch

In transcribe() inside FasterWhisperPipeline, the VAD-based segmentation is replaced with fixed 30-second chunks:

python
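# Total audio length in seconds.
duration = audio.shape[0] / SAMPLE_RATE
chunk_duration = 30.0

# Build consecutive fixed-length segments in place of VAD output.
vad_segments = []
start = 0.0
while start < duration:
    end = min(start + chunk_duration, duration)  # last chunk may be shorter
    vad_segments.append({"start": start, "end": end})
    start = end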
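The chunking loop above can be exercised on its own, without WhisperX, to see what segments it produces. The sketch below wraps the same logic in a function; the name `fixed_segments` is an illustrative choice, not something defined by WhisperX.

```python
def fixed_segments(duration, chunk_duration=30.0):
    """Split [0, duration) into consecutive segments of at most chunk_duration seconds."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_duration, duration)
        segments.append({"start": start, "end": end})
        start = end
    return segments


segments = fixed_segments(75.0)
print(segments)
# [{'start': 0.0, 'end': 30.0}, {'start': 30.0, 'end': 60.0}, {'start': 60.0, 'end': 75.0}]
```

One consequence of this design is that segment boundaries fall at fixed times rather than at pauses, so a word that straddles a boundary may be split between two chunks.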

🐛 Issues Resolved

| Issue                                       | Resolution                                                     |
|---------------------------------------------|----------------------------------------------------------------|
| TranscriptionOptions.__new__() missing args | Manually passed asr_options with required fields               |
| HTTP 301 for VAD model                      | Replaced remote load with offline .bin path                    |
| 'dict' has no attribute 'ndim'              | Dummy VAD model returned incompatible type → fully bypassed    |
| vad_segments unexpected argument            | Removed invalid param from transcribe() call                   |
| input shape (1, 80, 782456) too large       | Manual chunking into 30s segments                              |
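The shape in the last entry above can be read as (batch, mel bins, frames). Assuming Whisper's standard front end (16 kHz audio with a hop length of 160 samples, so 100 frames per second and 3000 frames per 30-second window), a short calculation shows how far that input exceeded one window and roughly how many 30-second chunks it needs. The constant names are illustrative, and the exact frame count can differ slightly with padding.

```python
import math

SAMPLE_RATE = 16_000       # Hz, Whisper's expected input rate
HOP_LENGTH = 160           # samples per mel frame
WINDOW_SECONDS = 30.0      # length of one Whisper input window

frames_per_second = SAMPLE_RATE // HOP_LENGTH                  # 100
frames_per_window = int(WINDOW_SECONDS * frames_per_second)    # 3000

n_frames = 782_456                                             # from the error message
audio_seconds = n_frames / frames_per_second                   # about 2 h 10 min
n_chunks = math.ceil(audio_seconds / WINDOW_SECONDS)

print(frames_per_window, audio_seconds, n_chunks)
# 3000 7824.56 261
```

In other words, the whole file was being passed as a single input of 782456 frames where the model expects at most 3000, which is why fixed 30-second chunking resolves the error.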

📁 Final Notes


💾 Files to Backup for Future Use


🧩 Future Improvements