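Below is the updated list of Chinese datasets, with a new column giving the language category (Mandarin, Chinese dialect, Tibetan, and so on):

| SLR ID | Name | Type | Language | Description |
|--------|------|------|----------|-------------|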
| SLR18 | THCHS-30 | Speech | Mandarin | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University |
| SLR33 | Aishell | Speech | Mandarin | Mandarin data, provided by Beijing Shell Shell Technology Co.,Ltd |
| SLR38 | Free ST Chinese Mandarin Corpus | Speech | Mandarin | A free Chinese Mandarin corpus by Surfingtech, containing utterances from 855 speakers, 102600 utterances |
| SLR47 | Primewords Chinese Corpus Set 1 | Speech | Mandarin | Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd., containing 100 hours of speech data |
| SLR50 | MADCAT Chinese data splits | Other | Mandarin | Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus |
| SLR55 | CLMAD | Text | Mandarin | A Chinese Language Model Adaptation Dataset (CLMAD) |
| SLR62 | aidatatang_200zh | Speech | Mandarin | A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers |
| SLR68 | MAGICDATA Mandarin Chinese Read Speech Corpus | Speech | Mandarin | The corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China |
| SLR82 | CN-Celeb | Speech | Mandarin | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University |
| SLR85 | HI-MIA | Speech | Mandarin | A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019 |
| SLR87 | MobvoiHotwords | Speech | Mandarin | Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD |
| SLR93 | AISHELL-3 | Speech | Mandarin | Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd. |
| SLR111 | AISHELL-4 | Speech | Mandarin | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co.,Ltd |
| SLR119 | AliMeeting | Speech | Mandarin | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group |
| SLR120 | HI-MIA-CW | Speech | Mandarin | A Free Mandarin Supplemental Speech Corpus to HI-MIA Database, whose contents are negative samples for wake-up words "Hi, Mia" |
| SLR121 | WenetSpeech | Speech | Mandarin | A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition |
| SLR123 | MAGICDATA Mandarin Chinese Conversational Speech Corpus | Speech | Mandarin | The corpus by Magic Data Technology Co., Ltd., containing 180 hours of rich annotated Mandarin spontaneous conversational speech data |
| SLR124 | TIBMD@MUC speech data set | Speech | Tibetan (multi-dialect) | A Tibetan multi-dialect speech data (84.33 hours) |
| SLR133 | XBMU-AMDO31 | Speech | Tibetan (Amdo dialect) | Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University |
| SLR138 | SHALCAS22A | Speech | Mandarin | A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. |
| SLR149 | Tibetan Greetings | Speech | Tibetan (multi-dialect) | Selected Tibetan greetings speech data categorized according to the dialectal region |
| SLR158 | NICT-Tib1 | Speech | Tibetan (Lhasa dialect) | 33.5-hour Lhasa-Tibetan read-speech corpus with Kaldi-style transcripts |
| SLR159 | AISHELL-5 | Speech | Mandarin | The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition, provided by Beijing AISHELL Technology Co.,Ltd. |
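
**Additional notes**:
- **Mandarin**: 19 datasets, mostly Standard Chinese (Putonghua) resources for speech recognition, speaker recognition, and meeting speech; two of them (SLR50 and SLR55) are not speech corpora.
- **Tibetan**: 4 datasets, covering multi-dialect data (TIBMD@MUC, Tibetan Greetings), the Amdo dialect (XBMU-AMDO31), and the Lhasa dialect (NICT-Tib1).
- The list contains no standalone datasets for other Chinese dialects such as Cantonese, Hokkien (Minnan), or Hakka. If needed, the descriptions can be examined further to identify resources that may include dialect content.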
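
The per-language tallies and the description scan mentioned in the notes can be reproduced with a short script. The following is a minimal sketch, not a definitive tool: it assumes the table is available as Markdown text with five pipe-separated columns and English language labels such as `Mandarin` or `Tibetan (Amdo dialect)`. The helper names (`parse_rows`, `count_by_language`, `find_keyword`) are hypothetical, and only three rows are included here for brevity.

```python
import re
from collections import Counter

# Three rows abbreviated from the table; paste the remaining rows to tally the full list.
TABLE = """\
| SLR18 | THCHS-30 | Speech | Mandarin | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University |
| SLR124 | TIBMD@MUC speech data set | Speech | Tibetan (multi-dialect) | A Tibetan multi-dialect speech data (84.33 hours) |
| SLR133 | XBMU-AMDO31 | Speech | Tibetan (Amdo dialect) | Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University |
"""


def parse_rows(markdown: str) -> list[dict[str, str]]:
    """Split each Markdown table row into its five cells; skip anything else."""
    rows = []
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 5 or set(cells[0]) <= {"-"}:
            continue  # not a data row (wrong column count or a separator line)
        slr_id, name, kind, language, description = cells
        rows.append({"id": slr_id, "name": name, "type": kind,
                     "language": language, "description": description})
    return rows


def count_by_language(rows: list[dict[str, str]]) -> Counter:
    """Tally rows by the label before any parenthesis (ASCII or full-width)."""
    return Counter(re.split(r"[((]", row["language"])[0].strip() for row in rows)


def find_keyword(rows: list[dict[str, str]], keywords: list[str]) -> list[str]:
    """Return IDs of rows whose description mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [row["id"] for row in rows
            if any(k in row["description"].lower() for k in lowered)]


rows = parse_rows(TABLE)
print(count_by_language(rows))
print(find_keyword(rows, ["dialect", "Cantonese", "Hakka"]))
```

Grouping on the text before the parenthesis means every Tibetan variant is counted under a single `Tibetan` key, which matches how the notes summarize the table. The keyword scan is only a first-pass filter: a description that does not mention a dialect says nothing about whether the audio contains one.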