Production-Ready Uzbek Speech Data

Uzbek Voice Data, Engineered for AI

From native Uzbek podcasts to verified, segmented, multi-speaker datasets. Plug-and-play with HuggingFace, Common Voice, and LJSpeech.

The Challenge

Building Uzbek Speech Models Is Hard

Training Uzbek speech models (ASR, TTS, voice cloning) requires thousands of hours of clean, labeled audio that is segmented by speaker, tagged by dialect, and scored for quality. Collecting it manually takes months. We automate the entire pipeline and deliver datasets in formats that load directly into HuggingFace, Common Voice, and LJSpeech workflows.

1,000+ hrs

Clean audio required for production ASR models

10+

Uzbek dialects with distinct acoustic signatures

100%

Human-verified quality for every segment

Our Pipeline

How It Works

From raw audio to production-ready datasets in three stages

1. Source & Extract

Audio extracted from Talabam.com podcasts — Uzbekistan's premier content platform. Native speakers, real conversations, 10+ dialects.

  • Native Uzbek speech
  • Multi-speaker content
  • 10+ dialect regions

2. Process & Clean

A 7-step pipeline that includes quality analysis, noise removal (Demucs + DeepFilterNet), Whisper transcription, and speaker diarization (Pyannote 3.1).

  • VAD filtering (≥50% speech)
  • Background music removed
  • Word-level timestamps
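
As a rough illustration of the "≥50% speech" rule, the sketch below computes the fraction of a clip covered by speech and keeps the clip only if that fraction reaches the threshold. It assumes a VAD stage has already produced speech intervals as (start, end) pairs in seconds; the function names and the interval format are illustrative assumptions, not the pipeline's actual interface.

```python
from typing import List, Tuple

# Assumed VAD output format: (start_seconds, end_seconds) per speech region.
Interval = Tuple[float, float]


def speech_ratio(speech: List[Interval], total_duration: float) -> float:
    """Return the fraction of the clip covered by speech.

    Overlapping intervals are counted once, and intervals are clamped
    to the clip length.
    """
    if total_duration <= 0:
        return 0.0
    covered = 0.0
    current_end = 0.0
    for start, end in sorted(speech):
        start = max(start, current_end)   # skip any part already counted
        end = min(end, total_duration)    # clamp to the clip length
        if end > start:
            covered += end - start
            current_end = end
    return covered / total_duration


def passes_vad_filter(speech: List[Interval], total_duration: float,
                      min_ratio: float = 0.5) -> bool:
    """Keep a clip only if at least `min_ratio` of it is speech."""
    return speech_ratio(speech, total_duration) >= min_ratio


if __name__ == "__main__":
    intervals = [(0.0, 2.0), (1.0, 3.0), (6.0, 8.0)]
    print(passes_vad_filter(intervals, total_duration=10.0))
```

Merging overlaps before summing matters because some VAD tools emit overlapping regions, and double counting them would inflate the ratio.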

3. Verify & Export

Human-in-the-loop review for flagged content. Metadata enrichment: emotion, dialect (10 regions), DNSMOS quality scores.

  • Human verification
  • 5–30s segments
  • HuggingFace compatible
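
To make the export step more concrete, here is a minimal sketch that keeps only segments between 5 and 30 seconds and writes a `metadata.csv` with a `file_name` column, which is the layout the HuggingFace `datasets` audio folder loader reads. The `Segment` class and the `dialect` and `dnsmos` column names are assumptions chosen for illustration; the real export schema may differ.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical record for one exported audio segment."""
    file_name: str
    start: float          # seconds
    end: float            # seconds
    transcription: str
    dialect: str
    dnsmos: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def keep_exportable(segments: Iterable[Segment],
                    min_s: float = 5.0, max_s: float = 30.0) -> List[Segment]:
    """Keep segments whose duration falls in the 5-30 s window."""
    return [s for s in segments if min_s <= s.duration <= max_s]


def to_metadata_csv(segments: Iterable[Segment]) -> str:
    """Serialize segments as CSV text with a leading `file_name` column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["file_name", "transcription", "dialect", "dnsmos"])
    for s in segments:
        writer.writerow([s.file_name, s.transcription, s.dialect,
                         f"{s.dnsmos:.2f}"])
    return buf.getvalue()


if __name__ == "__main__":
    segments = [
        Segment("ep1_0001.wav", 0.0, 3.0, "Salom", "Toshkent", 3.9),
        Segment("ep1_0002.wav", 3.0, 15.0, "Assalomu alaykum", "Xorazm", 4.1),
        Segment("ep1_0003.wav", 15.0, 60.0, "Rahmat", "Buxoro", 3.5),
    ]
    print(to_metadata_csv(keep_exportable(segments)))
```

Writing the CSV to a string keeps the sketch free of file-system side effects; in practice the text would be saved next to the audio files.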

Data Source

Powered by Talabam.com

All audio is sourced from Talabam.com, Uzbekistan's premier podcast and video content platform. This gives us authentic, legally accessible content with proper consent flows.

Native Speech

Real conversations, not scripted readings

Multi-Speaker

2–6 speakers per episode with natural turn-taking

Regional Diversity

Hosts and guests from all 10+ dialect regions

Fresh Content

Hundreds of hours published weekly

Source Platform: Talabam.com
Content Type: Podcasts & Video
Audio Format: WAV, 16 kHz, 16-bit
Legal Status: CC BY 4.0 licensed
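
One way to confirm that a delivered file matches the stated audio format (WAV, 16 kHz, 16-bit) is to read its header with Python's standard `wave` module. The sketch below is only an illustration: the function names are hypothetical, and the channel count is not checked because the specification above does not state it.

```python
import io
import wave


def check_wav_format(wav_file, rate: int = 16000,
                     sample_width_bytes: int = 2) -> bool:
    """Return True if the WAV header reports the expected sample rate
    and sample width (2 bytes = 16-bit).

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wf:
        return (wf.getframerate() == rate
                and wf.getsampwidth() == sample_width_bytes)


def make_silent_wav(rate: int, sample_width_bytes: int,
                    seconds: float = 1.0) -> io.BytesIO:
    """Build an in-memory mono WAV of silence, used here only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)  # mono is an assumption for the demo file
        wf.setsampwidth(sample_width_bytes)
        wf.setframerate(rate)
        wf.writeframes(b"\x00" * (int(rate * seconds) * sample_width_bytes))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(16000, 2)))   # matches the spec
    print(check_wav_format(make_silent_wav(44100, 2)))   # wrong sample rate
```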

Dialect Coverage

10 Uzbek Dialects

Every segment is tagged with dialect metadata — build models that understand regional variations from all corners of Uzbekistan.

Northern (Urban Standard)

Toshkent

Urban standard, widely used in media

Eastern

Farg'ona

Distinct intonation patterns

Eastern

Namangan

Mountain region variety

Southern

Samarqand

Historical center dialect

Western

Xorazm

Unique vocabulary influences

South-Central

Qashqadaryo

Transitional features

Central-Western

Buxoro

Classical influences

Eastern

Andijon

Agricultural region speech

Central

Navoiy

Mining industry variety

Southern

Surxondaryo

Border region dialect

Detection via linguistic markers, acoustic fingerprinting, and human verification
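
Because every segment carries a dialect tag, coverage can be checked before training. The sketch below totals audio hours per dialect from a list of segment records; the `dialect` and `duration_s` keys are assumed field names for illustration, not a documented schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def hours_per_dialect(segments: Iterable[Mapping]) -> Dict[str, float]:
    """Sum segment durations (seconds) per dialect tag and return hours."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["dialect"]] += seg["duration_s"]
    return {dialect: secs / 3600 for dialect, secs in totals.items()}


if __name__ == "__main__":
    records = [
        {"dialect": "Toshkent", "duration_s": 1800},
        {"dialect": "Toshkent", "duration_s": 1800},
        {"dialect": "Xorazm", "duration_s": 3600},
    ]
    for dialect, hours in sorted(hours_per_dialect(records).items()):
        print(f"{dialect}: {hours:.1f} h")
```

A summary like this makes it easy to spot under-represented dialects and rebalance a training split.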

Ready to Power Your Voice AI?

Browse our collection of 200+ voice datasets or request custom data tailored to your specific needs.