# Biruniy Research — Complete Site Content

## About Biruniy

**Biruniy is the dataset company.** The biruniy-v1 model is proof that our data produces world-class results.

The website converts clients who need Uzbek speech data. The benchmark + model exist to prove: "Buy our data → your model will be this good."

---

## Hero: Uzbek Speech Data That Actually Works

50+ hours of human-verified conversational Uzbek audio — the only dataset where every word is reviewed by 2+ independent native speakers.

Models trained on our data achieve 7–8% WER. Models trained on 1,600 hrs of unverified data? 16.7%.

### Key Stats

- **50+** Verified Hours
- **99.2%** Transcript Accuracy
- **50+** Unique Speakers
- **10** Uzbek Dialects
- **✓** Emotion Labels

### Call to Action

- Primary: Request Dataset Access → contact form
- Secondary: See What Models Achieve → benchmark section

---

## The Uzbek Data Problem: Three Reasons Why Uzbek ASR & TTS Still Fail in Production

### 1. Crowd-Sourced = Low Quality

Common Voice has Uzbek audio — but it's read-aloud sentences by random volunteers, not how real people talk. Models trained on it: 14–24% WER.

### 2. Scraped = Unverified

Some providers scrape 1,000+ hours of Uzbek audio. But nobody checks the transcripts. Garbage in = garbage out. 16.7% WER.

### 3. Nothing For Voice AI

Zero verified Uzbek speech data suitable for both ASR and TTS is publicly available. Until now.

We built the solution: a dataset where every word is verified by humans.

---

## What You Get: Production-Ready Uzbek Speech Data

Download, train, deploy. No cleaning needed.

### Natural Conversation

Real speech from podcasts and talk shows — not people reading scripts into a microphone.

### IAA Verified

Every low-confidence word is reviewed by 2+ independent native Uzbek speakers. Conflicts are resolved by a senior adjudicator.

### Audio Quality Scored

DNSMOS quality scoring on every segment. Only segments with MOS ≥ 3.5 are included. Background noise removed with DeepFilterNet.
### Speaker Diarized

50+ unique speakers labeled. Train/val/test split by speaker — zero data leakage.

### Word-Level Timestamps

Millisecond-accurate start/end times for every word. Confidence scores included.

### 10 Dialects Tagged

Toshkent, Farg'ona, Samarqand, Xorazm, Buxoro, Namangan, Andijon, Qashqadaryo, Navoiy, Surxondaryo.

### Code-Switching

Natural Uzbek ↔ Russian switching included. Critical for call center applications.

### Ready to Train

HuggingFace-compatible JSONL + WAV. Works with PyTorch, Transformers, faster-whisper, VITS, and Coqui TTS out of the box.

---

## Benchmarks: Proof of What Models Trained on Our Data Achieve

We trained etamin/biruniy-v1 on our dataset to prove the quality. Here's how it compares to every other Uzbek ASR model — including ones trained on 30× more data.

### Full Benchmark Table

| Benchmark | Academic (USC/ISSAI) | Uzinfocom (Nutq.uz) | jmshd/whisper-uz | Kotib AI | islomov/rubaistt | BlueRaccoon | SyncAll AI | Biruniy Gold v1.0 |
|---|---|---|---|---|---|---|---|---|
| **Lab WER (clean)** | — | ~12.5% | 14.0% | 6–11% | — | 23.7% | Claims 2%* | **5–6%** |
| **Real-World WER** | 11.6–17.4% | ~18–22% | ~25–30% | 16.7% | ~17% | ~30–35% | Unverified | **7–8%** |
| **Dialect Tagging** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | **✅ 10 regions** |
| **Speaker Diarization** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | Unknown | **✅ Included** |
| **Human Verified** | Partial | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | **✅ IAA (2+)** |
| **Data Volume** | ~105 hrs | Proprietary | ~200 hrs | 1,600 hrs | 475 hrs | ~15 hrs | Unknown | **50 hrs** |

### Important Notes

- \* SyncAll AI claims 2% WER — this is unverified and likely refers to intent-recognition accuracy, not word-level WER.
- Lab WER = Common Voice / FLEURS test sets (clean read speech).
- Real-World WER = natural conversational speech with noise, dialects, and code-switching.
- Biruniy results are measured on the held-out Biruniy Gold Test Set (speaker-disjoint, conversational).
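All WER figures above are word-level edit distances: substitutions, deletions, and insertions divided by the number of reference words. For teams that want to sanity-check reported numbers, here is a minimal WER sketch in plain Python. The sample Uzbek sentence is illustrative only (not from the dataset), and production evaluations typically normalize casing and punctuation before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative example: one wrong word out of ten → 10% WER.
ref = "bu yil biz yangi loyiha ustida birga ishlashni boshladik do'stlar"
hyp = "bu yil biz yangi loyiha ustida birga ishlashni boshladik dostlar"
print(f"WER: {wer(ref, hyp):.0%}")  # → WER: 10%
```

In practice, libraries such as jiwer implement the same computation with configurable text normalization; the point here is only to show what the 7–8% vs 16.7% gap is actually measuring.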
### Why Dialects & Diarization Matter

Most Uzbek datasets treat the language as a monolith. Biruniy data is tagged by 10 distinct regions (Toshkent, Xorazm, Samarqand, etc.) and includes millisecond-accurate speaker labels. Build models that know WHO is speaking and WHERE they are from.

### Quality > Quantity: The 50 hr vs 1,600 hr Comparison

Kotib AI trained on 1,600 hours of scraped audio. Their Real-World WER: 16.7%. Biruniy trained on 50 hours of IAA-verified data. Our Real-World WER: 7–8%. Every hour of Biruniy data is worth 32 hours of noisy, scraped data.

---

## Not All Uzbek Data Is Equal: Comparison Table

| Feature | Common Voice / FLEURS | Scraped (Kotib AI) | Biruniy Gold |
|---|---|---|---|
| **Speech type** | Read aloud | Unknown mix | Natural conversation |
| **Transcript verification** | None | None | IAA (2+ reviewers) |
| **Audio quality guarantee** | ❌ | ❌ | ✅ MOS ≥ 3.5 |
| **Speaker labels** | ❌ | ❌ | ✅ 50+ speakers |
| **Dialect tags** | ❌ | ❌ | ✅ 10 regions |
| **Code-switching** | ❌ | ❌ | ✅ Uzbek ↔ Russian |
| **Word timestamps** | ❌ | ❌ | ✅ Millisecond-level |
| **Data leakage prevention** | ❌ | ❌ | ✅ Speaker-based splits |
| **WER achievable** | 14–24% | 16.7% | 7–8% |
| **Price** | Free | Free | Licensed |

---

## Pipeline: How We Build the Dataset

6 stages. 3 AI models. 2+ human reviewers per segment.

### Step 1 — Source

Audio from Talabam.com — Uzbekistan's largest podcast and video platform. 100% native speakers. Real conversations.

### Step 2 — Clean

DeepFilterNet v3 noise removal. DNSMOS quality scoring. Only MOS ≥ 3.5 passes.

### Step 3 — Segment

Pyannote 3.1 speaker diarization. VAD filtering (≥50% speech). 5–30 second segments with speaker labels.

### Step 4 — Transcribe

Whisper ASR with word-level timestamps. Confidence scoring on every word. Low-confidence words auto-flagged for review.

### Step 5 — Human Verify (The Critical Difference)

Every flagged word is reviewed by 2+ independent native Uzbek speakers (Inter-Annotator Agreement).
Conflicts are resolved by a senior adjudicator. This is what makes Biruniy data gold-standard.

### Step 6 — Export

HuggingFace-compatible JSONL + WAV. Speaker-based train/val/test splits. Full metadata: dialect, MOS, speaker ID, timestamps.

---

## Use Cases: Who Uses Biruniy Data

### Banking & Finance

Build voice agents for Uzbek banking customers. Understand real conversational Uzbek — not scripted prompts. Code-switching support for Uzbek-Russian bilingual users.

### Call Centers

Automate call transcription and quality monitoring. 10 dialect regions means your model works across all of Uzbekistan, not just Tashkent.

### AI Labs & Researchers

Train or fine-tune Uzbek ASR models. Production-ready format. Zero preprocessing needed. Speaker-based splits prevent data leakage.

---

## Pricing: Dataset Access

### Research License — Contact for Pricing

- ✅ 50+ hours verified conversational Uzbek
- ✅ HuggingFace format (JSONL + WAV)
- ✅ Speaker-based train/val/test splits
- ✅ Full metadata (dialect, MOS, timestamps)
- ✅ Commercial use allowed
- ✅ Email support

### Enterprise — Custom Contract

- ✅ Everything in Research License
- ✅ Custom data collection (your domain vocabulary)
- ✅ Scale to 200+ hours on demand
- ✅ Domain-specific: banking, medical, legal, telecom
- ✅ On-premise delivery option
- ✅ Dedicated account manager
- ✅ SLA guarantee

---

## Open Source: etamin/biruniy-v1 — Coming Soon

We're open-sourcing the model we built to prove our dataset quality. Free for everyone. Apache 2.0 license.

### Model Card

- **Architecture:** Whisper Large-v3 + LoRA
- **Parameters:** 1.55B + 27M adapters
- **WER (conversational):** 7–8%
- **WER (clean):** 5–6%
- **Training data:** 50 hrs Biruniy Gold Dataset
- **License:** Apache 2.0

The model is free. The data that makes it possible is what we sell. Want results like this for your own model? Get the dataset.

---

## FAQ

### Why is crowd-sourced Uzbek speech data low quality?
Common Voice has Uzbek audio, but it's read-aloud sentences by random volunteers — not how real people talk. Models trained on it achieve 14–24% WER.

### What's wrong with scraped Uzbek audio datasets?

Some providers scrape 1,000+ hours of Uzbek audio, but nobody checks the transcripts. Garbage in equals garbage out — these models achieve 16.7% WER.

### Is there any verified conversational Uzbek speech data available?

Until Biruniy, there was zero verified conversational Uzbek speech data publicly available. Biruniy Gold is the first dataset where every word is verified by 2+ independent native speakers.

### What WER can models achieve on Biruniy data?

Models trained on Biruniy Gold achieve 7–8% WER on real-world conversational Uzbek speech. On clean read speech, the WER is 5–6%.

### How many Uzbek dialects does Biruniy cover?

Biruniy data is tagged across 10 distinct Uzbek dialect regions: Toshkent, Farg'ona, Samarqand, Xorazm, Buxoro, Namangan, Andijon, Qashqadaryo, Navoiy, and Surxondaryo.

### What is the Biruniy Gold Dataset?

Biruniy Gold is a production-ready conversational Uzbek speech dataset with 50+ hours of human-verified audio, IAA quality assurance, speaker diarization, 10 dialect tags, word-level timestamps, and code-switching annotations.

### Can I use Biruniy data for commercial applications?

Yes. Biruniy offers a Research License that permits commercial use, as well as an Enterprise tier for custom data collection tailored to your domain vocabulary and use case.

### What format is the Biruniy dataset delivered in?

The dataset is delivered in HuggingFace-compatible JSONL + WAV format. It includes speaker-based train/val/test splits, full metadata (dialect, MOS, speaker ID, timestamps), and works out of the box with PyTorch, Transformers, and faster-whisper.

---

## Contact: Get the Data That Powers 7–8% WER

Tell us what you're building. We'll respond within 24 hours.
- **Email:** cameron@etamin.uz
- **Company:** etamin.uz
- **Form fields:** Name, Email, Company, Need (Dataset License / Enterprise Custom / Research Partnership / Other), Message

---

## Company

**Biruniy Research** — by Etamin (etamin.uz)

Verified Uzbek speech data for AI teams.

- Founded: 2024
- Location: Tashkent, Uzbekistan
- Email: cameron@etamin.uz
- Twitter: @biruniy
- GitHub: github.com/biruniy

© 2026 Etamin. All rights reserved.