Speech/NLP Datasets
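A record of information on various speech/text corpora, mainly Chinese and English data. It covers corpora the lab already has and known corpora that are publicly available for free or under license. Paid corpora are not recorded for now.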
ASR Corpus
Corpora suitable for ASR.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Data Prepare: data preparation code used in Kaldi, if any
- Link: link to the corpus main or download page, and the machine name if already downloaded
Chinese corpora (Taiwanese accent)
Chinese corpora (Mainland Chinese accent)
Chinese-English mixed corpora
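
| Name | Duration | Info. | Data Prepare | Link |
| --- | --- | --- | --- | --- |
| EAT | ~61 hrs (16 kHz) | Taiwanese-accented English corpus. The audio comes in 16 kHz and 8 kHz versions; only the 16 kHz version is on 29.5, and the data preparation code handles only 16 kHz. The 8 kHz audio is only available on a disc from Roger. | cet-mixed-asr | 29.5, ACLCLP |
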
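English corpora

| Name | Duration | Info. | Data Prepare | Link |
| --- | --- | --- | --- | --- |
| MIR-English | ~7 hrs | Taiwanese-accented English; MIRLab corpus | cet-mixed-asr | 29.5 |
| Librispeech | ~960 hrs | Audiobook recordings; speakers with various accents | kaldi-libri | openslr |
| Wall Street Journal (WSJ) | ~80 hrs | News text read aloud by journalists | kaldi-wsj | 29.5 / 51, WSJ0, WSJ1 |
| tedlium-r1 | 118 hrs | English TED talks | None | LIUM, openslr |
| tedlium-r2 | 207 hrs | English TED talks; not stated whether the r1 data is included | None | LIUM, openslr |
| tedlium-r3 | 452 hrs | English TED talks; includes the r2 data | kaldi-tedlium | LIUM, openslr |
| ST-AEC | - | ST American English Corpus | None | openslr |
| CHiME-5 | 50 hrs | Requires application; recorded at 20 home gatherings | kaldi-chime5 | sheffield.U |
| LJSpeech | 24 hrs | Reading passages from 7 non-fiction books | None | LJSD |
| FSDD | - | 8 kHz; English spoken digits | None | github |
| VoxForge | - | Open speech dataset; other languages also available | None | voxforge |
| VCTK | - | English in a variety of accents | None | datashare |
| SellCorpus | ~32 hrs | Mainland Chinese-accented English | mandarin-asr | SellCorpus |
| NSC | ~3000 hrs | Singaporean-accented English; requires application; 1 TB; 2000 hrs of prompted recordings and 1000 hrs of conversational recordings; the egs recipe uses only the conversational recordings; the publisher says the data will grow over time | kaldi-nsc | IMDA |
| Vystadial | ~45 hrs | Data released by a Czech university, presumably Czech-accented; a Czech-language version is also available | kaldi-vyst_en | openslr |
| UK-Ireland Eng. | - | English speech in dialects of the UK and Ireland | None | openslr |
| accentdb | 9 hrs | Indian-accented English | None | accentdb |
| SWC | 395 hrs | Spoken Wikipedia articles; German and Dutch versions also available | None | NATS |
| TIMIT | - | Continuous English speech | kaldi-timit | NCHU |
| Mozilla Common Voice (EN) | 2,181 hrs | mp3; the official site keeps updating and the amount of audio grows over time; version recorded in this table: v6.1.en_2181h_2020-12-11 | None | CommonVoice |
| VoxCeleb | 2,000+ hrs | Audio from YouTube videos; requires application | kaldi-voxceleb | VoxCeleb |
| LRW/LRS | 800+ hrs | Audio from BBC, TED, and TEDx programs; requires application; may need to check whether any audio overlaps with tedlium-r3 | None | LipReading |
| VoxConverse | 50+ hrs | Audio from multi-speaker YouTube videos; requires application | None | VoxConverse |
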
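
The Duration column mixes several notations ('~' for approximate, a trailing '+', thousands separators, notes such as the sampling rate, and '-' for unknown). As a minimal sketch, assuming the cells look like the ones in this table, the hours could be parsed as follows when totalling how much audio is available. The function name `parse_hours` and the sample list are illustrative only and are not part of any corpus tooling.

```python
import re
from typing import Optional


def parse_hours(cell: str) -> Optional[float]:
    """Return the hours in a Duration cell, or None when it is '-' (unknown)."""
    cell = cell.strip()
    if cell == "-":
        return None
    # Take the first number; this ignores '~', a trailing '+',
    # and trailing notes such as '(16K)'.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", cell)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


# A few Duration cells copied from the table above.
durations = ["~7 hrs", "~960 hrs", "2,181 hrs", "800+ hrs", "-"]
known = [h for h in map(parse_hours, durations) if h is not None]
total_hours = sum(known)
```

Because approximate and lower-bound values are read as plain numbers, any total computed this way is only a rough estimate.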
Scoring Corpus
Corpora suitable for speech scoring.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, and the machine name if already downloaded
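
| Name | Duration | Info. | Link |
| --- | --- | --- | --- |
| MIR-SD | 1.5 hrs | Taiwanese-accented English; MIRLab corpus; scored by experts (?) | 29.5 / 63 |
| speechocean762 | - | Mainland Chinese-accented English; open-sourced jointly by Xiaomi and SpeechOcean; half children and half adults; scored by experts | openslr |
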
Emotion Corpus
Audio corpora suitable for emotion recognition (for text corpora, see the NLP(Text) Corpus section).
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, and the machine name if already downloaded
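
| Name | Duration | Info. | Link |
| --- | --- | --- | --- |
| NNIME | ~11 hrs | Taiwanese Mandarin; requires application; 6 emotions; simulates everyday home life | NNIME |
| IEMOCAP | ~12 hrs | English (?); requires application; probably 5 emotions | 29.5, USC |
| CREMA-D | - | English; 6 emotions; 4 levels of expression intensity | github |
| SAVEE | - | English; requires application; 7 emotions; video also included | Surrey.U |
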
Keyword Corpus
Audio corpora for research on wake-up words / keyword spotting.
- Name: corpus name
- Duration: total duration of the corpus; '-' means unknown
- Info.: information about the corpus
- Link: link to the corpus main or download page, and the machine name if already downloaded
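
| Name | Duration | Info. | Link |
| --- | --- | --- | --- |
| MobvoiHotwords | ~225 hrs | Mainland Chinese Mandarin; speakers aged 3-65; recorded at different distances from the microphone and under different environmental noise conditions | openslr |
| speech_commands | ~15 hrs | English; 10 keywords | tensorflow |
| QUESST2014 | 23 hrs | Multilingual; for Query-by-Example keyword spotting; includes non-native English | BUT Speech@FIT |
| SWS2013 | 20 hrs | Multilingual; for Query-by-Example keyword spotting; includes non-native English | BUT Speech@FIT |
| HI-MIA | - | Mainland Chinese speakers, Chinese and English; subset of 2019B-EVAL | aishell, openslr |
| AISHELL-2019A-EVAL | 24 hrs | Mainland Chinese Mandarin; 11 categories such as smart-home control and commands | aishell |
| AISHELL-2019B-EVAL | 437 hrs | Mainland Chinese Mandarin; wake-up words “你好,米雅” and “hi, mia” | aishell |
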
NLP(Text) Corpus
Plain-text corpora, suitable for NLP or language modeling.
- Name: corpus name
- Info.: information about the corpus
- Link: link to the corpus main or download page, and the machine name if already downloaded
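Chinese corpora

| Name | Info. | Link |
| --- | --- | --- |
| treebank | Taiwanese Mandarin; released by CKIP Lab; requires a license | CKIP |
| GCC | Taiwanese Mandarin; Q&A corpus from the PTT Gossiping board | github |
| PTT-text | Taiwanese Mandarin; crawled comments from the PTT C_Chat, e-shopping, Gossiping, HatePolitics, and WRC boards | 62:30 / 29.5 |
| CLMAD | Probably Simplified Chinese (Mainland China); dataset for language modeling; mainly covers four domains: fashion, finance, sports, and stocks | openslr |
| Combine_1 | Simplified Chinese (Mainland China); a collection of various datasets, some of which must be downloaded from Baidu | github |
| Combine_2 | Simplified Chinese (Mainland China); a collection of various datasets, some of which must be downloaded from Baidu | github |
| Combine_3 | Simplified Chinese (Mainland China); a collection of various datasets, some of which must be downloaded from Baidu | github |
| chinese-poetry | Simplified Chinese (Mainland China); collection of classical Chinese poetry and literature | github |
| insuranceqa | Simplified Chinese (Mainland China); insurance-industry corpus | github |
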
English corpora
Data Augmentation Corpus
Noise / reverb corpora, which will probably be used for data augmentation.
- Name: corpus name
- Info.: information of corpus
- Link: link to corpus main or download page

| Name | Info. | Link |
| --- | --- | --- |
| MUSAN | Music, speech, and noise recordings | openslr |
| FSD50K | Human-labeled sound events, including human sounds, sounds of things, animals, natural sounds, musical instruments, and more | zenodo |
| RWCP | Non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses for a microphone array, speech data recorded with the same array, and recordings of background noises | openslr, NII-SRC |
| RIRS | Room Impulse Response and Noise Database | openslr |
| BUT Speech@FIT Reverb Database | Various room impulse responses and room environmental noises | BUT Speech@FIT |
| AIRDB | Impulse responses measured in a wide variety of rooms | IKS |
| C4DM RIR | Room impulse responses measured in the Great Hall, the Octagon, and a classroom | isophonics |
| MARDY DB | Impulse responses measured from three loudspeakers | commsp |

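
As a rough illustration of how the corpora above are typically used for augmentation, the sketch below convolves clean speech with a room impulse response (reverb) and then adds noise at a chosen signal-to-noise ratio. It is a minimal pure-Python sketch under stated assumptions, not the recipe used in the lab: the function names `convolve` and `mix_at_snr` are hypothetical, all signals are assumed to be mono float samples at the same sampling rate, and the noise is assumed to be non-silent. Real pipelines would normally use an array library or the Kaldi augmentation scripts, since plain Python lists are far too slow for full recordings.

```python
import math
from typing import List


def convolve(signal: List[float], rir: List[float]) -> List[float]:
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def mean_power(x: List[float]) -> float:
    """Average power (mean of squared samples)."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add noise to speech so that the speech-to-noise power ratio is snr_db."""
    # Repeat the noise until it covers the speech, then cut it to length.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[: len(speech)]
    # Solve 10 * log10(P_speech / (gain^2 * P_noise)) = snr_db for gain.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for real audio samples.
speech = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
rir = [1.0, 0.5]
noise = [0.3, -0.1, 0.2]

reverberant = convolve(speech, rir)[: len(speech)]  # keep the original length
augmented = mix_at_snr(reverberant, noise, snr_db=10.0)
```

Applying reverb before adding noise is one common ordering; whether the noise should also be reverberated depends on the scenario being simulated.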
Miscellaneous corpora
In short, corpora that I have not bothered to organize or that are unlikely to be used.
- Name: corpus name
- Type: data type of corpus
- Info.: information of corpus
- Link: link to corpus main or download page
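
| Name | Type | Info. | Link |
| --- | --- | --- | --- |
| Common Crawl | Text | Very large (on the order of TB); many languages mixed together; Chinese/English sentences have to be extracted yourself | commoncrawl |
| WikiMedia | Text | Dumps packaged and released by Wikimedia itself; multilingual; somewhat complicated; the Chinese dump should be this one | WikiDumps |
| tencent | Text | Various Q&A datasets, in both Chinese and English | Tencent AI Lab |
| Chinese-Cloze-RC | Text | Mainland Chinese reading-comprehension corpus; mainly, each article has its own labeled category | github |
| TACE | Text | Mainland Chinese embedding corpus; the original text does not seem to be included | Tencent AI Lab |
| CWV | Text | Mainland Chinese word-vectors corpus; the original text does not seem to be included | github |
| NLPIR | Text | Various Chinese and English texts released by Beijing Institute of Technology; some require a license | NLPIR |
| chaizi | Text | Chinese character decomposition dictionary; for fun | github |
| English-Corpora | Text | English texts from various domains | English-Corpora |
| Mozilla Common Voice | Audio | mp3; multilingual; the official site keeps updating and the amount of audio grows over time | commonvoice |
| ACLCLP | Audio | Taiwanese Mandarin; corpora provided by the Association for Computational Linguistics and Chinese Language Processing; Roger should be able to get all of them for free | ACLCLP |
| CSLT-Trivial | Audio | Recordings of 7 kinds of modal particles (?) released by Tsinghua University (Mainland China) | weiyun |
| LRS3-Lang | Audio | Taken from 1,300+ hrs of TEDx video; covers 13 languages; coming soon | Lip Reading |
| Multilingual TEDx | Audio | Multilingual TEDx talks; covers 8 languages; unclear whether it is related to LRS3-Lang | openslr |
| VGG-Sound | Audio | Sound effects collected from short YouTube videos (not human speech), labeled and categorized | VGGSound |
| CSLT | Audio | Various Chinese corpora released by Tsinghua University (Mainland China); most should already be listed above; new ones may appear later | CSLT |
| AIShell | Audio | Various Chinese corpora released by AIShell; requires application | aishell |
| tensorflow | Audio | Various English corpora collected by tensorflow; not listed individually here | tensorflow |
| BABEL | Audio | Multi-language database comprising five of the most widely differing Eastern European languages | Speechlab |
| Sheik | Lexicon | Cantonese lexicon | CSLT |
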
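
For multilingual text dumps such as Common Crawl, where Chinese and English sentences have to be pulled out by hand, one simple starting point is a script-based filter. The sketch below is an assumption-laden illustration, not an established pipeline: the function name `classify`, the 0.8 threshold, and the label names are all hypothetical. It only counts Han and Latin letters, so it cannot tell Chinese from Japanese text written in kanji, nor English from other Latin-script languages; a proper language-identification model would be needed for that.

```python
import re

# CJK Unified Ideographs (basic block) and ASCII Latin letters.
HAN = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def classify(sentence: str, threshold: float = 0.8) -> str:
    """Label a sentence 'zh', 'en', 'mixed', or 'other' by its letter scripts."""
    han = len(HAN.findall(sentence))
    latin = len(LATIN.findall(sentence))
    total = han + latin
    if total == 0:
        return "other"  # digits, punctuation, or other scripts only
    if han / total >= threshold:
        return "zh"
    if latin / total >= threshold:
        return "en"
    return "mixed"


sentences = [
    "今天天氣很好",
    "The weather is nice today.",
    "我想去 Taipei 101 看看",
    "12345 !!!",
]
labels = [classify(s) for s in sentences]
```

Sentences labeled 'mixed' could be kept separately, since code-mixed text is itself useful for the Chinese-English ASR corpora listed earlier.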
Reference
http://cslt.riit.tsinghua.edu.cn/resources.php?Public data
https://www.mdeditor.tw/pl/2DO9/zh-tw
https://www.itread01.com/content/1542207183.html
https://www.robots.ox.ac.uk/~vgg/data/
http://www.dreams-itn.eu/index.php/dissemination/science-blogs/24-rir-databases
https://www.tensorflow.org/datasets
https://www.english-corpora.org/
https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/?fbclid=IwAR1dikdjECrTU5ZKhR5jhSk58DL2PnhDYMaFviMW9PPaN1iT8BoiLspk2gw
https://tatoeba.org/eng/downloads