# Speech/NLP Datasets
###### tags: `mirlab`,`Corpus`,`Speech`
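**A record of various speech/text corpora, focusing on Chinese and English data. It covers corpora the lab already holds and known corpora that are publicly free or available under license; paid corpora are not recorded for now.**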
---
# ASR Corpus
Corpora suitable for ASR.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Data Prepare: data preparation code used in Kaldi, if any
- Link: corpus home or download page, plus the machine name if a copy has already been downloaded
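For corpora whose Data Prepare entry is None, a Kaldi-style data directory has to be written by hand. The sketch below shows the three core files (`wav.scp`, `text`, `utt2spk`); the function name, utterance IDs, and paths are hypothetical and are not taken from the linked repositories.

```python
import os
import tempfile


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text, and utt2spk for a Kaldi-style data directory.

    `utterances` is a list of (utt_id, spk_id, wav_path, transcript) tuples.
    This tuple layout is an assumption made for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects entries sorted by utterance ID (C-locale order);
    # Python's default string sort is by code point, which matches it.
    rows = sorted(utterances, key=lambda u: u[0])
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav_scp, \
            open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as text, \
            open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as utt2spk:
        for utt_id, spk_id, wav_path, transcript in rows:
            wav_scp.write(f"{utt_id} {wav_path}\n")
            text.write(f"{utt_id} {transcript}\n")
            utt2spk.write(f"{utt_id} {spk_id}\n")
    return out_dir


if __name__ == "__main__":
    # Hypothetical utterances; prefixing the speaker ID keeps the sort
    # order of utterances consistent with the sort order of speakers.
    demo = [
        ("spkB-0001", "spkB", "/corpus/spkB/0001.wav", "hello world"),
        ("spkA-0001", "spkA", "/corpus/spkA/0001.wav", "good morning"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        out = write_kaldi_data_dir(demo, os.path.join(tmp, "train"))
        print(sorted(os.listdir(out)))
```

In a real recipe, Kaldi's `utils/utt2spk_to_spk2utt.pl` and `utils/validate_data_dir.sh` are normally run on the resulting directory afterwards.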
## Chinese Corpora (Taiwanese Accent)
| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| tanPoem | ~40 hrs | Tang poetry; MIRLab corpus | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/) | 29.5 / 51 / 63 |
| MATBN | ~150 hrs | Mandarin broadcast news corpus; total file duration is 200 hrs | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#matbn) |
| NER (formosa) | ~504 hrs | National Education Radio broadcast program speech; from Prof. Yuan-Fu Liao (廖元甫) at Taipei Tech | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#eat) |
| TCC300 | ~30 hrs | Read speech recorded with a microphone | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#tcc300edu) |
| Mozilla Common Voice (TW) | ~78 hrs | mp3; the official site keeps updating, so the number of audio files grows over time; version recorded here: v6.1.zh-TW_78h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/zh-TW/datasets) |
## Chinese Corpora (Mainland China Accent)
| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| aidatatang | ~140 hrs | Mostly telephone recordings; released by Data Tang | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/62/) |
| aishell | ~151 hrs | Transcripts cover 11 domains, including smart home, autonomous driving, and industrial production; released by Aishell | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/33/) |
| magicdata | ~712 hrs | Released by MAGIC DATA | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/68/) |
| primewords | ~99 hrs | Telephone recordings; released by Shanghai Primewords | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/47/) |
| ST-CMDS | ~110 hrs | Recorded with mobile phones in quiet indoor environments; released by surfing ai | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/38/) |
| THCHS-30 | ~26 hrs | Released by CSLT, Tsinghua University | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/18/) |
| SUD12 | - | Short-utterance recordings; released by CSLT, Tsinghua University | None | [CSLT](http://166.111.134.19:7777/data/susr/SUB12/index.html) |
| CN-Celeb | 274 hrs | Outdoor recordings covering 11 scenarios; released by CSLT, Tsinghua University | [kaldi-cnceleb](https://github.com/kaldi-asr/kaldi/tree/master/egs/cnceleb) | [openslr](http://www.openslr.org/82/) |
| CSLT-DISGUISE | - | Recordings of both normal speech and disguised speech; released by CSLT, Tsinghua University | None | [weiyun](https://share.weiyun.com/a7355eb4321dafd2887460daa915191d) |
| AISHELL-2019C-EVAL | 31 hrs | Recordings of text collected from translation websites (?) | None | [aishell](http://www.aishelltech.com/aishell_2019C_eval) |
| mir-emo | ~500 hrs | Telephone recordings, 8 kHz | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/) | 29.5 / 51 |
| Mozilla Common Voice (CH) | ~78 hrs | mp3; the official site keeps updating, so the number of audio files grows over time; version recorded here: v6.1.zh-CN_78h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/zh-CN/datasets) |
## Mixed Chinese-English Corpora
| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| EAT | ~61 hrs (16 kHz) | Taiwanese-accented English corpus. Audio comes in 16 kHz and 8 kHz versions; only the 16 kHz version is on 29.5, and the data preparation handles only 16 kHz. The 8 kHz audio is available only on disc from Roger | [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#eat) |
## English Corpora
| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| MIR-English | ~7 hrs | Taiwanese-accented English; MIRLab corpus | [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/blob/master/s0) | 29.5 |
| Librispeech | ~960 hrs | Audiobook recordings; accents from many countries | [kaldi-libri](https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/) | [openslr](https://www.openslr.org/12/) |
| Wall Street Journal (WSJ) | ~80 hrs | News text read aloud by journalists | [kaldi-wsj](https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/) | 29.5 / 51, [WSJ0](https://catalog.ldc.upenn.edu/LDC93S6A), [WSJ1](https://catalog.ldc.upenn.edu/LDC94S13A) |
| tedlium-r1 | 118 hrs | English TED talks | None | [LIUM](https://lium.univ-lemans.fr/en/ted-lium1/), [openslr](https://www.openslr.org/7/) |
| tedlium-r2 | 207 hrs | English TED talks; does not state whether r1 data is included | None | [LIUM](https://lium.univ-lemans.fr/en/ted-lium2/), [openslr](https://www.openslr.org/19/) |
| tedlium-r3 | 452 hrs | English TED talks; includes r2 data | [kaldi-tedlium](https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium) | [LIUM](https://lium.univ-lemans.fr/en/ted-lium3/), [openslr](https://www.openslr.org/51/) |
| ST-AEC | - | ST American English Corpus | None | [openslr](http://www.openslr.org/45/) |
| CHiME-5 | 50 hrs | Requires application; recorded at 20 home gatherings | [kaldi-chime5](https://github.com/kaldi-asr/kaldi/tree/master/egs/chime5) | [sheffield.U](https://licensing.sheffield.ac.uk/product/chime5) |
| LJSpeech | 24 hrs | Passages read from 7 non-fiction books | None | [LJSD](https://keithito.com/LJ-Speech-Dataset/) |
| FSDD | - | 8 kHz, English spoken digits | None | [github](https://github.com/Jakobovski/free-spoken-digit-dataset) |
| VoxForge | - | Open speech dataset; other languages are also available | None | [voxforge](http://www.voxforge.org/home) |
| VCTK | - | English in a variety of accents | None | [datashare](https://datashare.ed.ac.uk/handle/10283/3443) |
| SellCorpus | ~32 hrs | Chinese-accented English | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/local) | [SellCorpus](http://www.roseducation.org/sell-corpus/corpus.html) |
| NSC | ~3000 hrs | Singapore-accented English; requires application; 1 TB; 2000 hrs of prompted recordings and 1000 hrs of conversational recordings; the egs recipe uses only the conversational recordings; the maintainers state that the data will grow over time | [kaldi-nsc](https://github.com/kaldi-asr/kaldi/tree/master/egs/nsc) | [IMDA](https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus) |
| Vystadial | ~45 hrs | Open data from a Czech university; presumably Czech-accented; a Czech-language version also exists | [kaldi-vyst_en](https://github.com/kaldi-asr/kaldi/blob/master/egs/vystadial_en/) | [openslr](https://www.openslr.org/6/) |
| UK-Ireland Eng. | - | English corpus of dialects from the UK and Ireland | None | [openslr](http://www.openslr.org/83/) |
| accentdb | 9 hrs | Indian-accented English | None | [accentdb](https://accentdb.org/) |
| SWC | 395 hrs | Spoken Wikipedia articles; German and Dutch versions are also available | None | [NATS](https://nats.gitlab.io/swc/) |
| TIMIT | - | Continuous English speech | [kaldi-timit](https://github.com/kaldi-asr/kaldi/tree/master/egs/timit/) | [NCHC](https://scidm.nchc.org.tw/dataset/darpa-timit) |
| Mozilla Common Voice (EN) | 2,181 hrs | mp3; the official site keeps updating, so the number of audio files grows over time; version recorded here: v6.1.en_2181h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/en/datasets) |
| VoxCeleb | 2,000+ hrs | Audio from YouTube videos; requires application | [kaldi-voxceleb](https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/) | [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) |
| LRW/LRS | 800+ hrs | Audio from BBC, TED, and TEDx programs; requires application; overlap with tedlium-r3 may need to be checked | None | [LipReading](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/) |
| VoxConverse | 50+ hrs | Audio from multi-speaker YouTube videos; requires application | None | [VoxConverse](https://www.robots.ox.ac.uk/~vgg/data/voxconverse/) |
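Where the Duration column shows '-', the figure can be measured on a local copy. The following is a minimal sketch using only Python's standard `wave` module, so it assumes the audio is uncompressed PCM WAV; mp3 or FLAC audio (for example Common Voice or Librispeech) would need to be converted first. The function names are hypothetical.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_hours(root):
    """Sum the duration of every .wav file found under `root`, in hours."""
    total_seconds = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                total_seconds += wav_duration_seconds(os.path.join(dirpath, name))
    return total_seconds / 3600.0


if __name__ == "__main__":
    # Build a tiny throwaway corpus: one file holding 2 seconds of silence.
    with tempfile.TemporaryDirectory() as tmp:
        with wave.open(os.path.join(tmp, "silence.wav"), "wb") as wav:
            wav.setnchannels(1)       # mono
            wav.setsampwidth(2)       # 16-bit samples
            wav.setframerate(16000)   # 16 kHz
            wav.writeframes(b"\x00\x00" * 32000)
        print(f"{total_duration_hours(tmp) * 3600:.1f} seconds")
```

Reading only the header keeps this fast even for large corpora, since no sample data is decoded.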
---
# Scoring Corpus
Corpora suitable for speech scoring.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: corpus home or download page, plus the machine name if a copy has already been downloaded
| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| MIR-SD | 1.5 hrs | Taiwanese-accented English; MIRLab corpus; scored by experts (?) | 29.5 / 63 |
| speechocean762 | - | Chinese-accented English; open-sourced jointly by Xiaomi and 海天瑞聲 (SpeechOcean); half children, half adults; scored by experts | [openslr](https://openslr.org/101/) |
---
# Emotion Corpus
Audio corpora suitable for emotion recognition (for text corpora, see the NLP(Text) Corpus section).
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: corpus home or download page, plus the machine name if a copy has already been downloaded
| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| NNIME | ~11 hrs | Taiwanese Mandarin; requires application; 6 emotions; acted scenarios modeled on home life | [NNIME](https://nnime.ee.nthu.edu.tw/) |
| IEMOCAP | ~12 hrs | English (?); requires application; probably 5 emotions | 29.5, [USC](https://sail.usc.edu/iemocap/iemocap_release.htm) |
| CREMA-D | - | English; 6 emotions; 4 intensity levels | [github](https://github.com/CheyneyComputerScience/CREMA-D) |
| SAVEE | - | English; requires application; 7 emotions; includes video | [Surrey.U](http://kahlan.eps.surrey.ac.uk/savee/) |
---
# Keyword Corpus
Audio corpora for research on wake-up words / keyword spotting.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: corpus home or download page, plus the machine name if a copy has already been downloaded
| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| MobvoiHotwords | ~225 hrs | Mainland Mandarin; speakers aged 3-65; varying distances from the microphone and varying ambient noise | [openslr](https://www.openslr.org/87/) |
| speech_commands | ~15 hrs | English; 10 keywords | [tensorflow](https://www.tensorflow.org/datasets/catalog/speech_commands) |
| QUESST2014 | 23 hrs | Multilingual; for query-by-example keyword spotting; includes non-native English | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/quesst-2014-multilingual-database-query-by-example-keyword-spotting) |
| SWS2013 | 20 hrs | Multilingual; for query-by-example keyword spotting; includes non-native English | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/sws-2013-multilingual-database-query-by-example-keyword-spotting) |
| HI-MIA | - | Mainland Chinese and English; subset of AISHELL-2019B-EVAL | [aishell](http://aishelltech.com/wakeup_data), [openslr](https://openslr.org/85/) |
| AISHELL-2019A-EVAL | 24 hrs | Mainland Mandarin; 11 categories such as smart-home control and commands | [aishell](http://www.aishelltech.com/aishell_2019A_eval) |
| AISHELL-2019B-EVAL | 437 hrs | Mainland Mandarin; wake words “你好,米雅” and “hi, mia” | [aishell](http://www.aishelltech.com/aishell_2019B_eval) |
---
# NLP(Text) Corpus
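Plain-text corpora, suitable for NLP or language modeling.
- Name: corpus name
- Info.: information about the corpus
- Link: corpus home or download page, plus the machine name if a copy has already been downloaded
## Chinese Corpora
| Name | Info. | Link |
| -------- | -------- | -------- |
| treebank | Taiwanese Mandarin; released by CKIP Lab; requires a license | [CKIP](https://ckip.iis.sinica.edu.tw/project/treebank) |
| GCC | Taiwanese Mandarin; Q&A corpus from the PTT Gossiping board | [github](https://github.com/zake7749/Gossiping-Chinese-Corpus) |
| PTT-text | Taiwanese Mandarin; crawled comments (推文) from the PTT boards C_Chat, e-shopping, Gossiping, HatePolitics, and WRC | 62:30 / 29.5 |
| CLMAD | Probably Mainland Simplified Chinese; dataset for language modeling; mainly covers four domains: fashion, finance, sports, and stocks | [openslr](https://openslr.org/55/) |
| Combine_1 | Mainland Simplified Chinese; collection of various datasets; some must be downloaded from Baidu | [github](https://github.com/InsaneLife/ChineseNLPCorpus) |
| Combine_2 | Mainland Simplified Chinese; collection of various datasets; some must be downloaded from Baidu | [github](https://github.com/brightmart/nlp_chinese_corpus) |
| Combine_3 | Mainland Simplified Chinese; collection of various datasets; some must be downloaded from Baidu | [github](https://github.com/CLUEbenchmark/CLUE) |
| chinese-poetry | Mainland Simplified Chinese; collection of classical Chinese poetry | [github](https://github.com/chinese-poetry/chinese-poetry) |
| insuranceqa | Mainland Simplified Chinese; insurance-industry corpus | [github](https://github.com/chatopera/insuranceqa-corpus-zh) |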
## English Corpora
| Name | Info. | Link |
| -------- | -------- | -------- |
| LMRD | For sentiment analysis | [SAIL](http://ai.stanford.edu/~amaas/data/sentiment/) |
| MDSD | For sentiment analysis | [JHU](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) |
| wiki-links | English Wikipedia | [Google Code](https://code.google.com/archive/p/wiki-links/downloads) |
| PICAE | Academic English; sources include conferences, textbooks, journal articles, university administrative documents, university magazines, and TV/radio programs | [sketchengine](https://www.sketchengine.eu/picae-corpus/) |
| English-Corpora | English texts from a variety of domains | [English-Corpora](https://www.english-corpora.org/) |
---
# Data Augmentation Corpus
Noise- and reverb-related corpora, likely to be useful for data augmentation.
- Name: corpus name
- Info.: information about the corpus
- Link: corpus home or download page
| Name | Info. | Link |
| -------- | -------- | -------- |
| MUSAN | music, speech, and noise recordings | [openslr](https://openslr.org/17/) |
| FSD50K | human-labeled sound events, including human sounds, sounds of things, animals, natural sounds, musical instruments and more | [zenodo](https://zenodo.org/record/4060432#.YDiWs2gzaM8) |
| RWCP | non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses for a microphone array, speech data recorded with the same array, and recordings of background noises | [openslr](https://openslr.org/13/), [NII-SRC](http://research.nii.ac.jp/src/en/RWCP-SSD.html) |
| RIRS | Room Impulse Response and Noise Database | [openslr](https://openslr.org/28/) |
| BUT Speech@FIT Reverb Database | various room impulse responses and room environmental noises | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/but-speech-fit-reverb-database) |
| AIRDB | impulse responses measured in a wide variety of rooms | [IKS](https://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/) |
| C4DM RIR | room impulse responses measured in the Great Hall, the Octagon, and a classroom | [isophonics](http://isophonics.net/content/room-impulse-response-data-set) |
| MARDY DB | impulse responses measured from three loudspeakers | [commsp](https://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/) |
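As an illustration of how the noise corpora above are typically used, the sketch below mixes a noise signal into speech at a target signal-to-noise ratio. It is a pure-Python sketch on lists of float samples; the function names are hypothetical, and a real pipeline would more likely operate on arrays loaded from audio files.

```python
import math


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the result has the requested SNR in dB.

    The noise is repeated (tiled) or truncated to match the speech length.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = mean_power(tiled)
    if noise_power == 0.0:
        return list(speech)  # silent noise: nothing to add
    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, tiled)]


if __name__ == "__main__":
    # Synthetic signals stand in for real recordings in this demo.
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
    noise = [((t * 7919) % 200) / 100.0 - 1.0 for t in range(500)]
    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    print(len(noisy))
```

Reverberation from the impulse-response databases would instead be applied by convolving the speech with a room impulse response, which is not shown here.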
---
# Miscellaneous Corpora
In short, corpora that I have not bothered to organize or that are unlikely to be used.
- Name: corpus name
- Type: data type of the corpus
- Info.: information about the corpus
- Link: corpus home or download page
| Name | Type | Info. | Link |
| -------- | -------- | -------- | ------- |
| Common Crawl | Text | Very large (TB scale); languages are mixed, so Chinese/English sentences must be extracted yourself | [commoncrawl](https://commoncrawl.org/the-data/get-started/) |
| WikiMedia | Text | Packaged and released by Wikimedia itself; many languages; somewhat complicated; the Chinese dump is probably [this one](https://dumps.wikimedia.org/zhwiki/) | [WikiDumps](https://dumps.wikimedia.org/) |
| tencent | Text | Various Q&A datasets, in both Chinese and English | [Tencent AI Lab](https://ai.tencent.com/ailab/nlp/dialogue/#datasets) |
| Chinese-Cloze-RC | Text | Mainland Chinese reading-comprehension corpus; each article comes with its own labeled category | [github](https://github.com/ymcui/Chinese-Cloze-RC) |
| TACE | Text | Mainland Chinese embedding corpus; the original text does not seem to be included | [Tencent AI Lab](https://ai.tencent.com/ailab/nlp/en/embedding.html) |
| CWV | Text | Mainland Chinese word-vector corpus; the original text does not seem to be included | [github](https://github.com/Embedding/Chinese-Word-Vectors) |
| NLPIR | Text | Various Chinese and English texts released by Beijing Institute of Technology; some require a license | [NLPIR](http://www.nlpir.org/wordpress/category/corpus%e8%af%ad%e6%96%99%e5%ba%93/) |
| chaizi | Text | Chinese character decomposition dictionary; for fun | [github](https://github.com/kfcd/chaizi) |
| English-Corpora | Text | English texts from a variety of domains | [English-Corpora](https://www.english-corpora.org/) |
| Mozilla Common Voice | Audio | mp3; many languages; the official site keeps updating, so the number of audio files grows over time | [commonvoice](https://commonvoice.mozilla.org/zh-HK/datasets) |
| ACLCLP | Audio | Taiwanese Mandarin; corpora provided by the Association for Computational Linguistics and Chinese Language Processing; Roger should be able to obtain all of them for free | [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php) |
| CSLT-Trivial | Audio | Data released by Tsinghua University (China) with recordings of 7 kinds of particles/interjections (?) | [weiyun](https://share.weiyun.com/389a55251c59fc4f9740d5c28be380f7) |
| LRS3-Lang | Audio | Taken from 1,300+ hrs of TEDx video; covers 13 languages; coming soon | [Lip Reading](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3-lang.html) |
| Multilingual TEDx | Audio | Multilingual TEDx talks; covers 8 languages; unclear whether it is related to LRS3-Lang | [openslr](https://openslr.org/100/) |
| VGG-Sound | Audio | Sounds collected from short YouTube clips (not human speech), labeled and categorized | [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/) |
| CSLT | Audio | Various Chinese corpora released by Tsinghua University (China); most should already be listed above, and new ones may appear later | [CSLT](http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20data) |
| AIShell | Audio | Various Chinese corpora released by AIShell; requires application | [aishell](http://www.aishelltech.com/kysjcp) |
| tensorflow | Audio | Various English corpora curated by tensorflow; not listed individually here | [tensorflow](https://www.tensorflow.org/datasets/catalog/accentdb) |
| BABEL | Audio | multi-language database comprising five of the most widely differing Eastern European languages | [Speechlab](http://www.reading.ac.uk/AcaDepts/ll/speechlab/babel/) |
| Sheik | Lexicon | Cantonese lexicon | [CSLT](http://166.111.134.19:7777/data/cantonese/sheik/index.html) |
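For mixed-language text such as Common Crawl, Chinese and English lines have to be separated before use. One rough way to do this is a character-class heuristic, sketched below. The function names and the 0.5 threshold are assumptions for illustration. The heuristic is crude: the CJK Unified Ideographs block is shared with Japanese kanji, and an ASCII-only line is not necessarily English.

```python
def cjk_ratio(text):
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def split_by_language(lines, threshold=0.5):
    """Split lines into (chinese, english) lists; other lines are dropped.

    A line counts as Chinese when its CJK ratio reaches `threshold`
    (an assumed value), and as English when it is pure ASCII.
    """
    chinese, english = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if cjk_ratio(line) >= threshold:
            chinese.append(line)
        elif line.isascii():
            english.append(line)
    return chinese, english


if __name__ == "__main__":
    sample = [
        "今天天氣很好",
        "The weather is nice today.",
        "Привет, мир",
    ]
    zh, en = split_by_language(sample)
    print(len(zh), len(en))
```

A dedicated language-identification model would be more reliable for real crawls; this sketch only shows the shape of the filtering step.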
---
# Reference
http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20data
https://www.mdeditor.tw/pl/2DO9/zh-tw
https://www.itread01.com/content/1542207183.html
https://www.robots.ox.ac.uk/~vgg/data/
http://www.dreams-itn.eu/index.php/dissemination/science-blogs/24-rir-databases
https://www.tensorflow.org/datasets
https://www.english-corpora.org/
https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/?fbclid=IwAR1dikdjECrTU5ZKhR5jhSk58DL2PnhDYMaFviMW9PPaN1iT8BoiLspk2gw
https://tatoeba.org/eng/downloads