# Biruniy Research: Verified Uzbek Speech Data for AI

> Gold-standard conversational Uzbek speech dataset. 50+ hours, IAA-verified by 2+ native speakers per word, 10 dialects, speaker diarized. The only Uzbek dataset where every word is human-verified. ASR models trained on Biruniy achieve 7–8% WER vs 16.7% on 1,600hr scraped data.

## Quick Facts

- **Dataset:** Biruniy Gold v1.0, 50+ hours of human-verified conversational Uzbek audio
- **Verification:** IAA (Inter-Annotator Agreement): 2+ independent native speakers review every word
- **WER:** 7–8% conversational, 5–6% clean read speech
- **Dialects:** 10 Uzbek regions: Toshkent, Farg'ona, Samarqand, Xorazm, Buxoro, Namangan, Andijon, Qashqadaryo, Navoiy, Surxondaryo
- **Speakers:** 50+ unique speakers, diarized, with speaker-disjoint train/val/test splits
- **Audio Quality:** DNSMOS scored, only MOS ≥ 3.5 included, DeepFilterNet noise removal
- **Transcript Accuracy:** 99.2%
- **Code-Switching:** Uzbek ↔ Russian annotated
- **Timestamps:** Millisecond-accurate word-level timestamps with confidence scores
- **Format:** HuggingFace-compatible JSONL + WAV (works with PyTorch, Transformers, faster-whisper, VITS, Coqui TTS)
- **Model (open source):** etamin/biruniy-v1, Whisper Large-v3 + LoRA (1.55B + 27M adapters), Apache 2.0
- **Competitors:** Kotib AI 1,600hr scraped → 16.7% WER | Common Voice → 14–24% WER | USC/ISSAI 105hr → 11.6–17.4% WER
- **Pricing:** Research License (contact for pricing) | Enterprise Custom (custom contract)
- **Company:** Etamin (etamin.uz)
- **Email:** cameron@etamin.uz

## Pages

- [Homepage](https://biruniy.uz/): Full product page with benchmarks, dataset comparison, pipeline, use cases, pricing, open-source model, contact form
- [Contact](https://biruniy.uz/contact): Dataset licensing, enterprise inquiries, research partnerships
- [Full Content](https://biruniy.uz/llms-full.txt): Complete markdown version of all site content for LLM ingestion

## Benchmarks at a Glance

| Model | Data Volume | Real-World WER | Dialects | Human Verified |
|---|---|---|---|---|
| **Biruniy Gold v1.0** | 50 hrs | 7–8% | 10 regions | IAA (2+) |
| Academic (USC/ISSAI) | 105 hrs | 11.6–17.4% | None | Partial |
| Uzinfocom (Nutq.uz) | Proprietary | 18–22% | None | Yes |
| jmshd/whisper-uz | 200 hrs | 25–30% | None | No |
| Kotib AI | 1,600 hrs | 16.7% | None | No |
| islomov/rubaistt | 475 hrs | ~17% | None | No |
| BlueRaccoon | 15 hrs | 30–35% | None | No |
| SyncAll AI | Unknown | Unverified | None | Unknown |

## Pipeline

1. **Source:** Audio from Talabam.com (Uzbekistan's largest podcast/video platform)
2. **Clean:** DeepFilterNet v3 + DNSMOS scoring (MOS ≥ 3.5)
3. **Segment:** Pyannote 3.1 diarization + VAD filtering
4. **Transcribe:** Whisper ASR with word-level timestamps + confidence scoring
5. **Human Verify:** 2+ native Uzbek speakers review every flagged word (IAA)
6. **Export:** JSONL + WAV, speaker-based splits, full metadata
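
The MOS threshold in the Clean step and the speaker-based splits in the Export step can be illustrated with a short Python sketch. This is not Biruniy's actual export code: the JSONL field names (`speaker_id`, `mos`, `text`), the split fractions, and the synthetic sample records are all assumptions made for illustration. Only the MOS ≥ 3.5 cutoff and the requirement that no speaker appears in more than one split come from the text above.

```python
import json
import random
from collections import defaultdict

MOS_THRESHOLD = 3.5  # cutoff stated in the pipeline; field names below are hypothetical


def load_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_mos(records, threshold=MOS_THRESHOLD):
    """Keep only records whose (assumed) `mos` field meets the threshold."""
    return [r for r in records if r["mos"] >= threshold]


def speaker_disjoint_split(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole speakers to train/val/test so no speaker spans two splits."""
    by_speaker = defaultdict(list)
    for r in records:
        by_speaker[r["speaker_id"]].append(r)

    # Sort before shuffling so the result depends only on the seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_frac))
    n_val = max(1, round(len(speakers) * val_frac))
    groups = {
        "test": speakers[:n_test],
        "val": speakers[n_test:n_test + n_val],
        "train": speakers[n_test + n_val:],
    }
    return {
        name: [r for spk in group for r in by_speaker[spk]]
        for name, group in groups.items()
    }


# Synthetic stand-in for a metadata file: 40 segments from 10 made-up speakers.
sample_lines = [
    json.dumps({"speaker_id": f"spk{i % 10}", "mos": 3.0 + (i % 4) * 0.5, "text": "..."})
    for i in range(40)
]

kept = filter_by_mos(load_jsonl(sample_lines))
splits = speaker_disjoint_split(kept)
for name, rows in splits.items():
    print(name, len(rows), sorted({r["speaker_id"] for r in rows}))
```

Splitting by speaker rather than by segment keeps evaluation honest: a model cannot score well on the test set simply by having heard the same voice during training. With very few speakers the `train` group can end up empty, so a real split would want a guard for that case.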