# Speech/NLP Datasets

###### tags: `mirlab`,`Corpus`,`Speech`

**A record of speech and text corpora, focused on Chinese and English data. It covers corpora the lab already has as well as known corpora that are publicly available for free or under license; paid corpora are not recorded for now.**

---

# ASR Corpus

Corpora suitable for ASR.

- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Data Prepare: data preparation code used in Kaldi, if any
- Link: link to the corpus main or download page, plus the machine name if already downloaded

## Chinese Corpora (Taiwanese accent)

| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| tanPoem | ~40 hrs | Tang poetry; MIRLab corpus | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/) | 29.5 / 51 / 63 |
| MATBN | ~150 hrs | Mandarin broadcast news corpus; total file duration is 200 hrs | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#matbn) |
| NER (formosa) | ~504 hrs | National Education Radio broadcast program speech; from Prof. Yuan-Fu Liao (Taipei Tech) | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#eat) |
| TCC300 | ~30 hrs | Read speech recorded with a microphone | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/), [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5 / 51, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#tcc300edu) |
| Mozilla Common Voice (TW) | ~78 hrs | mp3; the official site is updated continuously and the number of recordings grows over time; version recorded here: v6.1.zh-TW_78h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/zh-TW/datasets) |

## Chinese Corpora (Mainland China accent)

| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| aidatatang | ~140 hrs | Mainly phone recordings; released by Data Tang | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/62/) |
| aishell | ~151 hrs | Transcripts cover 11 domains, including smart home, autonomous driving, and industrial production; released by Aishell | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/33/) |
| magicdata | ~712 hrs | Released by MAGIC DATA | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/68/) |
| primewords | ~99 hrs | Phone recordings; released by Shanghai Primewords | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/47/) |
| ST-CMDS | ~110 hrs | Recorded on mobile phones in quiet indoor environments; released by surfing ai | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/38/) |
| THCHS-30 | ~26 hrs | Released by Tsinghua University CSLT | [kaldi-multi_cn](https://github.com/kaldi-asr/kaldi/tree/master/egs/multi_cn/s5/) | [openslr](https://www.openslr.org/18/) |
| SUD12 | - | Short-utterance recordings; released by Tsinghua University CSLT | None | [CSLT](http://166.111.134.19:7777/data/susr/SUB12/index.html) |
| CN-Celeb | 274 hrs | Recordings collected in the wild, covering 11 scenarios; released by Tsinghua University CSLT | [kaldi-cnceleb](https://github.com/kaldi-asr/kaldi/tree/master/egs/cnceleb) | [openslr](http://www.openslr.org/82/) |
| CSLT-DISGUISE | - | Recordings of both normal and disguised speech; released by Tsinghua University CSLT | None | [weiyun](https://share.weiyun.com/a7355eb4321dafd2887460daa915191d) |
| AISHELL-2019C-EVAL | 31 hrs | Recordings of text collected from translation websites (unverified) | None | [aishell](http://www.aishelltech.com/aishell_2019C_eval) |
| mir-emo | ~500 hrs | Telephone recordings, 8 kHz | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/) | 29.5 / 51 |
| Mozilla Common Voice (CH) | ~78 hrs | mp3; the official site is updated continuously and the number of recordings grows over time; version recorded here: v6.1.zh-CN_78h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/zh-CN/datasets) |

## Chinese-English Mixed Corpora

| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| EAT | ~61 hrs (16 kHz) | Taiwanese-accented English corpus. The audio comes in 16 kHz and 8 kHz versions; only the 16 kHz version is on 29.5, and the data preparation scripts handle only 16 kHz. The 8 kHz audio is available only on disc from Roger | [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/tree/master/s0/) | 29.5, [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php#eat) |

## English Corpora

| Name | Duration | Info. | Data Prepare | Link |
| -------- | -------- | -------- | ------- | ------- |
| MIR-English | ~7 hrs | Taiwanese-accented English; MIRLab corpus | [cet-mixed-asr](https://github.com/ntumirlab/chinese-english-tawinese-mixedspeech-asr/blob/master/s0) | 29.5 |
| Librispeech | ~960 hrs | Audiobook recordings in a variety of accents | [kaldi-libri](https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/) | [openslr](https://www.openslr.org/12/) |
| Wall Street Journal (WSJ) | ~80 hrs | News text read aloud by journalists | [kaldi-wsj](https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/) | 29.5 / 51, [WSJ0](https://catalog.ldc.upenn.edu/LDC93S6A), [WSJ1](https://catalog.ldc.upenn.edu/LDC94S13A) |
| tedlium-r1 | 118 hrs | English TED talks | None | [LIUM](https://lium.univ-lemans.fr/en/ted-lium1/), [openslr](https://www.openslr.org/7/) |
| tedlium-r2 | 207 hrs | English TED talks; not stated whether r1 data is included | None | [LIUM](https://lium.univ-lemans.fr/en/ted-lium2/), [openslr](https://www.openslr.org/19/) |
| tedlium-r3 | 452 hrs | English TED talks; includes r2 data | [kaldi-tedlium](https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium) | [LIUM](https://lium.univ-lemans.fr/en/ted-lium3/), [openslr](https://www.openslr.org/51/) |
| ST-AEC | - | ST American English Corpus | None | [openslr](http://www.openslr.org/45/) |
| CHiME-5 | 50 hrs | Application required; recorded at 20 home gatherings | [kaldi-chime5](https://github.com/kaldi-asr/kaldi/tree/master/egs/chime5) | [sheffield.U](https://licensing.sheffield.ac.uk/product/chime5) |
| LJSpeech | 24 hrs | Reading passages from 7 non-fiction books | None | [LJSD](https://keithito.com/LJ-Speech-Dataset/) |
| FSDD | - | 8 kHz, English spoken digits | None | [github](https://github.com/Jakobovski/free-spoken-digit-dataset) |
| VoxForge | - | Open speech dataset; other languages are also available | None | [voxforge](http://www.voxforge.org/home) |
| VCTK | - | English in a variety of accents | None | [datashare](https://datashare.ed.ac.uk/handle/10283/3443) |
| SellCorpus | ~32 hrs | Chinese-accented English | [mandarin-asr](https://github.com/ntumirlab/mandarin-asr/tree/develop/local) | [SellCorpus](http://www.roseducation.org/sell-corpus/corpus.html) |
| NSC | ~3000 hrs | Singaporean-accented English; application required; 1 TB; 2000 hrs of prompted recordings and 1000 hrs of conversational recordings; the Kaldi egs recipe uses only the conversational recordings; the provider states that the data will grow over time | [kaldi-nsc](https://github.com/kaldi-asr/kaldi/tree/master/egs/nsc) | [IMDA](https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus) |
| Vystadial | ~45 hrs | Open data from a Czech university; presumably Czech-accented; a Czech-language version also exists | [kaldi-vyst_en](https://github.com/kaldi-asr/kaldi/blob/master/egs/vystadial_en/) | [openslr](https://www.openslr.org/6/) |
| UK-Ireland Eng. | - | English speech in dialects of the UK and Ireland | None | [openslr](http://www.openslr.org/83/) |
| accentdb | 9 hrs | Indian-accented English | None | [accentdb](https://accentdb.org/) |
| SWC | 395 hrs | Spoken Wikipedia articles; German and Dutch versions also exist | None | [NATS](https://nats.gitlab.io/swc/) |
| TIMIT | - | English continuous speech | [kaldi-timit](https://github.com/kaldi-asr/kaldi/tree/master/egs/timit/) | [NCHC](https://scidm.nchc.org.tw/dataset/darpa-timit) |
| Mozilla Common Voice (EN) | 2,181 hrs | mp3; the official site is updated continuously and the number of recordings grows over time; version recorded here: v6.1.en_2181h_2020-12-11 | None | [CommonVoice](https://commonvoice.mozilla.org/en/datasets) |
| VoxCeleb | 2,000+ hrs | Audio from YouTube videos; application required | [kaldi-voxceleb](https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/) | [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) |
| LRW/LRS | 800+ hrs | Audio from BBC, TED, and TEDx programs; application required; overlap with tedlium-r3 should be checked | None | [LipReading](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/) |
| VoxConverse | 50+ hrs | Audio from multi-speaker YouTube videos; application required | None | [VoxConverse](https://www.robots.ox.ac.uk/~vgg/data/voxconverse/) |

---

# Scoring Corpus

Corpora suitable for speech scoring.

- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, plus the machine name if already downloaded

| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| MIR-SD | 1.5 hrs | Taiwanese-accented English; MIRLab corpus; scored by experts (unverified) | 29.5 / 63 |
| speechocean762 | - | Chinese-accented English; open-sourced jointly by Xiaomi and SpeechOcean; half children, half adults; scored by experts | [openslr](https://openslr.org/101/) |

---

# Emotion Corpus

Audio corpora suitable for emotion recognition (for text corpora, see the NLP Corpus section).

- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, plus the machine name if already downloaded

| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| NNIME | ~11 hrs | Taiwanese Mandarin; application required; 6 emotions; imitates everyday home life | [NNIME](https://nnime.ee.nthu.edu.tw/) |
| IEMOCAP | ~12 hrs | English (unverified); application required; probably 5 emotions | 29.5, [USC](https://sail.usc.edu/iemocap/iemocap_release.htm) |
| CREMA-D | - | English; 6 emotions; 4 intensity levels | [github](https://github.com/CheyneyComputerScience/CREMA-D) |
| SAVEE | - | English; application required; 7 emotions; includes video | [Surrey.U](http://kahlan.eps.surrey.ac.uk/savee/) |

---

# Keyword Corpus

Audio corpora for wake-up word / keyword spotting research.

- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, plus the machine name if already downloaded

| Name | Duration | Info. | Link |
| -------- | -------- | -------- | ------- |
| MobvoiHotwords | ~225 hrs | Mandarin (China); speakers aged 3-65; varying distances from the microphone and varying ambient noise | [openslr](https://www.openslr.org/87/) |
| speech_commands | ~15 hrs | English; 10 keywords | [tensorflow](https://www.tensorflow.org/datasets/catalog/speech_commands) |
| QUESST2014 | 23 hrs | Multilingual; for Query-by-Example Keyword Spotting; includes non-native English | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/quesst-2014-multilingual-database-query-by-example-keyword-spotting) |
| SWS2013 | 20 hrs | Multilingual; for Query-by-Example Keyword Spotting; includes non-native English | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/sws-2013-multilingual-database-query-by-example-keyword-spotting) |
| HI-MIA | - | Mandarin (China) and English; subset of AISHELL-2019B-EVAL | [aishell](http://aishelltech.com/wakeup_data), [openslr](https://openslr.org/85/) |
| AISHELL-2019A-EVAL | 24 hrs | Mandarin (China); 11 categories such as smart-home control and commands | [aishell](http://www.aishelltech.com/aishell_2019A_eval) |
| AISHELL-2019B-EVAL | 437 hrs | Mandarin (China); wake-up words "你好，米雅" and "hi, mia" | [aishell](http://www.aishelltech.com/aishell_2019B_eval) |

---

# NLP (Text) Corpus

Text-only corpora, suitable for NLP or language modeling.

- Name: corpus name
- Info.: information about the corpus
- Link: link to the corpus main or download page, plus the machine name if already downloaded

## Chinese Corpora

| Name | Info. | Link |
| -------- | -------- | -------- |
| treebank | Traditional Chinese (Taiwan); released by CKIP Lab; license required | [CKIP](https://ckip.iis.sinica.edu.tw/project/treebank) |
| GCC | Traditional Chinese (Taiwan); question-answer corpus from the PTT Gossiping board | [github](https://github.com/zake7749/Gossiping-Chinese-Corpus) |
| PTT-text | Traditional Chinese (Taiwan); crawled comments from the PTT C_Chat, e-shopping, Gossiping, HatePolitics, and WRC boards | 62:30 / 29.5 |
| CLMAD | Probably Simplified Chinese (China); dataset for language modeling; mainly covers four domains: fashion, finance, sports, and stocks | [openslr](https://openslr.org/55/) |
| Combine_1 | Simplified Chinese (China); collection of various datasets; some must be downloaded via Baidu | [github](https://github.com/InsaneLife/ChineseNLPCorpus) |
| Combine_2 | Simplified Chinese (China); collection of various datasets; some must be downloaded via Baidu | [github](https://github.com/brightmart/nlp_chinese_corpus) |
| Combine_3 | Simplified Chinese (China); collection of various datasets; some must be downloaded via Baidu | [github](https://github.com/CLUEbenchmark/CLUE) |
| chinese-poetry | Simplified Chinese (China); collection of classical Chinese poetry | [github](https://github.com/chinese-poetry/chinese-poetry) |
| insuranceqa | Simplified Chinese (China); insurance-industry corpus | [github](https://github.com/chatopera/insuranceqa-corpus-zh) |

## English Corpora

| Name | Info. | Link |
| -------- | -------- | -------- |
| LMRD | For sentiment analysis | [SAIL](http://ai.stanford.edu/~amaas/data/sentiment/) |
| MDSD | For sentiment analysis | [JHU](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) |
| wiki-links | English Wikipedia | [Google Code](https://code.google.com/archive/p/wiki-links/downloads) |
| PICAE | Academic English; sources include conferences, textbooks, journal articles, university administrative documents, university magazines, and TV/radio programs | [sketchengine](https://www.sketchengine.eu/picae-corpus/) |
| English-Corpora | English texts from a variety of domains | [English-Corpora](https://www.english-corpora.org/) |

---

# Data Augmentation Corpus

Noise and reverb corpora, likely to be useful for data augmentation.

- Name: corpus name
- Info.: information about the corpus
- Link: link to the corpus main or download page

| Name | Info. | Link |
| -------- | -------- | -------- |
| MUSAN | music, speech, and noise recordings | [openslr](https://openslr.org/17/) |
| FSD50K | human-labeled sound events, including human sounds, sounds of things, animals, natural sounds, musical instruments and more | [zenodo](https://zenodo.org/record/4060432#.YDiWs2gzaM8) |
| RWCP | non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses for a microphone array, speech data recorded with the same array, and recordings of background noises | [openslr](https://openslr.org/13/), [NII-SRC](http://research.nii.ac.jp/src/en/RWCP-SSD.html) |
| RIRS | Room Impulse Response and Noise Database | [openslr](https://openslr.org/28/) |
| BUT Speech@FIT Reverb Database | various room impulse responses and room environmental noises | [BUT Speech@FIT](https://speech.fit.vutbr.cz/software/but-speech-fit-reverb-database) |
| AIRDB | impulse responses that were measured in a wide variety of rooms | [IKS](https://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/) |
| C4DM RIR | room impulse responses measured in the Great Hall, the Octagon, and a classroom | [isophonics](http://isophonics.net/content/room-impulse-response-data-set) |
| MARDY DB | impulse responses measured from three loudspeakers | [commsp](https://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/) |

---

# Miscellaneous Corpora

In short, corpora that I have not gotten around to organizing or that are unlikely to be used.

- Name: corpus name
- Type: data type of the corpus
- Info.: information about the corpus
- Link: link to the corpus main or download page

| Name | Type | Info. | Link |
| -------- | -------- | -------- | ------- |
| Common Crawl | Text | Very large (terabyte scale); mixed languages; Chinese/English sentences must be extracted yourself | [commoncrawl](https://commoncrawl.org/the-data/get-started/) |
| WikiMedia | Text | Dumps packaged and released by Wikimedia; many languages; somewhat complicated; the Chinese dump is probably [this one](https://dumps.wikimedia.org/zhwiki/) | [WikiDumps](https://dumps.wikimedia.org/) |
| tencent | Text | Various question-answering datasets in both Chinese and English | [Tencent AI Lab](https://ai.tencent.com/ailab/nlp/dialogue/#datasets) |
| Chinese-Cloze-RC | Text | Chinese (China) reading-comprehension corpus; each article has its own labeled category | [github](https://github.com/ymcui/Chinese-Cloze-RC) |
| TACE | Text | Chinese (China) embedding corpus; apparently without the source text | [Tencent AI Lab](https://ai.tencent.com/ailab/nlp/en/embedding.html) |
| CWV | Text | Chinese (China) word-vector corpus; apparently without the source text | [github](https://github.com/Embedding/Chinese-Word-Vectors) |
| NLPIR | Text | Various Chinese and English texts released by Beijing Institute of Technology; some require a license | [NLPIR](http://www.nlpir.org/wordpress/category/corpus%e8%af%ad%e6%96%99%e5%ba%93/) |
| chaizi | Text | Chinese character decomposition dictionary; for fun | [github](https://github.com/kfcd/chaizi) |
| English-Corpora | Text | English texts from a variety of domains | [English-Corpora](https://www.english-corpora.org/) |
| Mozilla Common Voice | Audio | mp3; many languages; the official site is updated continuously and the number of recordings grows over time | [commonvoice](https://commonvoice.mozilla.org/zh-HK/datasets) |
| ACLCLP | Audio | Taiwanese Mandarin; corpora provided by the Association for Computational Linguistics and Chinese Language Processing; Roger can probably obtain all of them for free | [ACLCLP](http://www.aclclp.org.tw/use_mat_c.php) |
| CSLT-Trivial | Audio | Recordings of 7 kinds of interjections (unverified); released by Tsinghua University | [weiyun](https://share.weiyun.com/389a55251c59fc4f9740d5c28be380f7) |
| LRS3-Lang | Audio | Taken from 1,300+ hrs of TEDx video; 13 languages; coming soon | [Lip Reading](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3-lang.html) |
| Multilingual TEDx | Audio | Multilingual TEDx talks; 8 languages; relation to LRS3-Lang unknown | [openslr](https://openslr.org/100/) |
| VGG-Sound | Audio | Sound effects (not speech) collected from various short YouTube clips, labeled and categorized | [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/) |
| CSLT | Audio | Various Chinese corpora released by Tsinghua University; most should already be listed above, and new ones may appear later | [CSLT](http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20data) |
| AIShell | Audio | Various Chinese corpora released by AIShell; application required | [aishell](http://www.aishelltech.com/kysjcp) |
| tensorflow | Audio | Various English corpora cataloged by tensorflow; not itemized here | [tensorflow](https://www.tensorflow.org/datasets/catalog/accentdb) |
| BABEL | Audio | multi-language database comprising five of the most widely differing Eastern European languages | [Speechlab](http://www.reading.ac.uk/AcaDepts/ll/speechlab/babel/) |
| Sheik | Lexicon | Cantonese lexicon | [CSLT](http://166.111.134.19:7777/data/cantonese/sheik/index.html) |

---

# Reference

- http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20data
- https://www.mdeditor.tw/pl/2DO9/zh-tw
- https://www.itread01.com/content/1542207183.html
- https://www.robots.ox.ac.uk/~vgg/data/
- http://www.dreams-itn.eu/index.php/dissemination/science-blogs/24-rir-databases
- https://www.tensorflow.org/datasets
- https://www.english-corpora.org/
- https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/?fbclid=IwAR1dikdjECrTU5ZKhR5jhSk58DL2PnhDYMaFviMW9PPaN1iT8BoiLspk2gw
- https://tatoeba.org/eng/downloads