
# Chapman-Shaoxing ECG Dataset: Data Summary & Usage Guide

> **Path:** `\\TRUENAS\wtmh_dataset\chapman-shaoxing`
>
> **Integrity Status:**
> - All 42,583 CSV-listed records have matching `.hea`/`.mat` files
> - 2,569 valid records exist in `WFDBRecords` but are absent from the CSV
> - 83 records in the pre-built npy array contain NaN values
>
> **Last Verified:** 2026-03-19

---

## Table of Contents

1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System](#5-label-system)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)

---

## 1. Dataset Overview

The Chapman-Shaoxing dataset was collected at Shaoxing People's Hospital (China) in collaboration with Chapman University. It is one of the largest publicly available 12-lead ECG datasets, covering five major rhythm classes with SNOMED-CT standard diagnostic codes.

| Item | Details |
|------|---------|
| CSV Records | **42,583** |
| Actual ECG Files | **45,152** (2,569 additional files not indexed in CSV) |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **500 Hz** (WFDB original) / **100 Hz** (pre-built npy) |
| Record Length | 10 seconds (500 Hz → 5,000 pts; 100 Hz → 1,000 pts) |
| Diagnostic Superclasses | **5** (SB, SR, AFIB, ST, GSVT) |
| Diagnostic Codes | **90** raw codes (63 with a SNOMED-CT definition) |
| Label System | SNOMED-CT + institutional abbreviations (multi-label) |
| File Format | WFDB + MATLAB `.mat` (raw) / NumPy `.npy` (pre-built) |
| Source | [PhysioNet Chapman-Shaoxing v1.0.0](https://physionet.org/content/chapman-shaoxing/1.0.0/) |
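
The gap between the 45,152 files on disk and the 42,583 CSV rows can be re-checked with a short script. The following is a minimal sketch, not the procedure used to produce the numbers above. The helper names `list_record_ids` and `unindexed_records` are hypothetical, and the sketch assumes that every record has exactly one `.hea` file whose stem equals the CSV `ecg_id` (for example `JS00001`).

```python
from pathlib import Path


def list_record_ids(wfdb_root):
    """Collect record IDs from the .hea file stems under wfdb_root (assumed layout)."""
    return {p.stem for p in Path(wfdb_root).rglob("*.hea")}


def unindexed_records(disk_ids, csv_ids):
    """Return the sorted record IDs found on disk but missing from the CSV."""
    return sorted(set(disk_ids) - set(csv_ids))


# Tiny in-memory example; real use would pass list_record_ids(...) and df['ecg_id'].
print(unindexed_records(["JS00001", "JS00002", "JS00003"], ["JS00001", "JS00003"]))
```

On the full dataset, the length of the returned list is expected to match the 2,569 un-indexed records reported above, provided the assumptions about file naming hold.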

---

## 2. Directory Structure

![fig6_structure](https://hackmd.io/_uploads/ryrLG_tq-l.png)

```
chapman-shaoxing/
├── explore_dataset.py
└── a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/
    ├── cs_database.csv                 # Metadata and labels (42,583 rows)
    ├── ConditionNames_SNOMED-CT.csv    # 63 condition code definitions
    ├── X_cs_100hz.npy                  # Pre-built signal array (42583, 1000, 12) float32, 1.9 GB
    ├── RECORDS                         # 452 subdirectory path entries
    ├── SHA256SUMS.txt
    └── WFDBRecords/
        ├── 01/
        │   ├── 010/
        │   │   ├── JS00001.hea         # Header (Age, Sex, SNOMED-CT Dx)
        │   │   ├── JS00001.mat         # MATLAB signal
        │   │   └── ...
        │   └── ...
        ├── 02/ ~ 46/
        └── ...
```

> **Naming convention:** `WFDBRecords/{GG}/{SSS}/JS{XXXXX}`, where GG = 2-digit group, SSS = 3-digit subgroup, XXXXX = 5-digit ECG ID.

---

## 3. File Format

### 3.1 Key Differences from PTB-XL

| Item | PTB-XL | Chapman-Shaoxing |
|------|--------|-----------------|
| Signal format | `.dat` (PhysioNet binary) | `.mat` (MATLAB v5) |
| Header format | `.hea` (WFDB) | `.hea` (WFDB) |
| Metadata | Separate CSV (includes age/sex) | Age/sex only in `.hea` header |
| Sampling rates | 100 Hz / 500 Hz | 500 Hz (+ pre-built 100 Hz npy) |
| Label confidence | 0–100 numeric | Binary (present/absent) |

### 3.2 WFDB Header Format (.hea)

```
JS00001 12 500 5000
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
...
JS00001.mat 16+24 1000/mV 16 0 527 32579 0 V6
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002   ← SNOMED-CT codes
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown
```

| Field | Meaning |
|-------|---------|
| `12` | 12 leads |
| `500` | Sampling rate (Hz) |
| `5000` | Samples per lead (10 s × 500 Hz) |
| `16+24` | Storage format 16 (16-bit samples), with the signal data starting at a 24-byte offset into the `.mat` file |
| `1000/mV` | Gain = 1000 ADU/mV |
| `#Dx:` | SNOMED-CT codes (comma-separated, multi-label) |

### 3.3 Signal Dimensions

| Method | Shape | Type |
|--------|-------|------|
| wfdb.rdsamp (500 Hz) | `(5000, 12)` | float64 (mV) |
| X_cs_100hz.npy (100 Hz) | `(1000, 12)` per record | float32 (mV) |
| Full npy array | `(42583, 1000, 12)` | float32 (mV) |

### 3.4 12-Lead ECG Sample

![fig1_ecg_sample](https://hackmd.io/_uploads/HJXPzdK5We.png)

*Figure: JS00001, 85-year-old male. Diagnosis: AFIB (Atrial Fibrillation) + RBBB (Right Bundle Branch Block) + TWC (T-wave Change). 500 Hz, 10 seconds.*

---

## 4. Metadata Field Reference

### 4.1 cs_database.csv (7 columns)

| Field | Type | Description |
|-------|------|-------------|
| `ecg_id` | str | Unique identifier (format: `JS{5 digits}`) |
| `filename` | str | WFDB record path (no extension) |
| `dx_raw` | list | Raw diagnostic code list (abbreviations, multi-label) |
| `diagnostic_superclass` | list | Superclass label (one of 5) |
| `split_group_id` | str | Split group identifier |
| `split_group_source` | str | Split source (record_id) |
| `strat_fold` | int | Stratified CV fold number (1–10) |

> **Note:** Age, sex, and raw SNOMED-CT codes are **not in the CSV**; they must be parsed from the `.hea` headers.

### 4.2 Extracting Age/Sex from Headers

```python
def parse_hea_meta(hea_path):
    """Return the raw age, sex, and SNOMED-CT Dx strings from a .hea header."""
    meta = {}
    with open(hea_path) as f:
        for line in f:
            # Comment lines look like "#Age: 85"; split once on the first colon
            if line.startswith('#Age:'):
                meta['age'] = line.split(':', 1)[1].strip()
            elif line.startswith('#Sex:'):
                meta['sex'] = line.split(':', 1)[1].strip()
            elif line.startswith('#Dx:'):
                meta['snomed'] = line.split(':', 1)[1].strip()
    return meta
```
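
`parse_hea_meta` returns every value as a raw string, so a small normalization step is usually needed before analysis. The sketch below is one possible approach rather than part of the dataset tooling. The helper name `normalize_meta` is hypothetical, and it assumes that a missing age appears as a non-numeric string (such as `NaN`) and that an absent sex field can be treated as `Unknown`.

```python
def normalize_meta(meta):
    """Convert the raw strings from parse_hea_meta into typed values."""
    age_text = meta.get('age', '')
    # Assumption: a missing age is a non-numeric string (e.g. "NaN"); map it to None
    age = int(age_text) if age_text.isdigit() else None
    # Split the comma-separated Dx field into a list of SNOMED-CT code strings
    codes = [c.strip() for c in meta.get('snomed', '').split(',') if c.strip()]
    # Assumption: records without a #Sex line are labeled "Unknown"
    return {'age': age, 'sex': meta.get('sex', 'Unknown'), 'snomed': codes}


raw = {'age': '85', 'sex': 'Male', 'snomed': '164889003,59118001,164934002'}
print(normalize_meta(raw))
```

Keeping the SNOMED-CT codes as strings matches the string keys used when mapping codes to abbreviations via `ConditionNames_SNOMED-CT.csv`.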

---

## 5. Label System

### 5.1 Two-tier Label Architecture

```
SNOMED-CT raw codes (in .hea)
        ↕  mapped via ConditionNames_SNOMED-CT.csv
Abbreviation codes (dx_raw in cs_database.csv)
        ↓  aggregated
5 diagnostic superclasses (diagnostic_superclass)
```

### 5.2 Five Diagnostic Superclasses

| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **SB** | Sinus Bradycardia | HR < 60 bpm |
| **SR** | Sinus Rhythm | Normal sinus rhythm |
| **AFIB** | Atrial Fibrillation | Irregular atrial activity |
| **ST** | Sinus Tachycardia | HR > 100 bpm |
| **GSVT** | Grouped Supraventricular Tachycardia | SVT group |

> Unlike PTB-XL, superclasses are **single-label** (mutually exclusive), whereas `dx_raw` is multi-label.

### 5.3 Superclass & Top Dx Code Distribution

![fig2_superclass](https://hackmd.io/_uploads/r1Rwf_K5bx.png)

![fig3_top_dx](https://hackmd.io/_uploads/H1b_MdY5Ze.png)

---

## 6. Statistical Analysis

### 6.1 Demographics

![fig5_demographics](https://hackmd.io/_uploads/Hy_2zuY5Ze.png)

*Left: Age concentrated between 40–80 years (mean 59.4, median 62, range 4–89). Right: Male 56.4% (24,016), Female 43.6% (18,545), Unknown 22.*

### 6.2 Fold Distribution & NaN Analysis

![fig4_fold_nan](https://hackmd.io/_uploads/By0FzOY5-e.png)

*Left: 10 balanced folds (~4,257–4,259 records each). Right: 83 records (0.19%) in X_cs_100hz.npy contain NaN values.*

### 6.3 Key Statistics

| Metric | Value |
|--------|-------|
| CSV total records | 42,583 |
| WFDBRecords actual files | 45,152 |
| Records in files but not CSV | 2,569 |
| NaN records in npy | 83 (0.19%) |
| Unique diagnostic codes | 90 |
| Codes with SNOMED-CT definition | 63 |
| Age range (valid) | 4–89 years (mean=59.4, median=62, n=42,315) |
| Sex distribution | Male 56.4% (24,016) / Female 43.6% (18,545) / Unknown 22 |
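
The NaN count and fold sizes in the table above can be recomputed from the array and the `strat_fold` column. The following is a minimal sketch under stated assumptions: the helper name `summarize` is hypothetical, the array is assumed to have shape `(records, samples, leads)`, and a tiny synthetic array stands in for `X_cs_100hz.npy` so the example runs without the dataset.

```python
import numpy as np


def summarize(X, folds):
    """Return (number of records containing any NaN, {fold: record count})."""
    # A record counts as NaN if any sample in any lead is NaN
    nan_records = int(np.isnan(X).any(axis=(1, 2)).sum())
    values, counts = np.unique(np.asarray(folds), return_counts=True)
    fold_sizes = {int(v): int(c) for v, c in zip(values, counts)}
    return nan_records, fold_sizes


# Synthetic stand-in: 4 records, one of which contains a NaN sample
X_demo = np.zeros((4, 1000, 12), dtype=np.float32)
X_demo[2, 0, 0] = np.nan
n_nan, fold_sizes = summarize(X_demo, [1, 1, 2, 10])
print(n_nan, fold_sizes)
```

With the real data, `summarize(X, df['strat_fold'])` would be expected to report 83 NaN records and ten folds of roughly 4,258 records each if the figures above still hold.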

---

## 7. Loading the Data

### 7.1 Requirements

```bash
pip install wfdb pandas numpy scipy
```

### 7.2 Method 1: Load Pre-built npy Array (Fastest)

```python
import ast

import numpy as np
import pandas as pd

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

# Load pre-built 100 Hz array
X = np.load(base + "X_cs_100hz.npy")   # shape: (42583, 1000, 12)
df = pd.read_csv(base + "cs_database.csv")

# List-valued columns are stored as strings; parse them back into Python lists
df['dx_raw'] = df['dx_raw'].apply(ast.literal_eval)
df['diagnostic_superclass'] = df['diagnostic_superclass'].apply(ast.literal_eval)

# Remove records that contain any NaN sample
valid_mask = ~np.isnan(X).any(axis=(1, 2))
X_clean = X[valid_mask]                 # shape: (42500, 1000, 12)
df_clean = df[valid_mask].reset_index(drop=True)
```

> **Recommended** for training: already downsampled to 100 Hz and converted to float32.

### 7.3 Method 2: Load Raw WFDB/MATLAB Files (500 Hz)

```python
import wfdb

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

# Pass the record path without a file extension
signal, meta = wfdb.rdsamp(base + "WFDBRecords/01/010/JS00001")
print(signal.shape)    # (5000, 12)
print(meta['fs'])      # 500
```

### 7.4 SNOMED-CT ↔ Abbreviation Mapping

```python
import pandas as pd

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

cond = pd.read_csv(base + "ConditionNames_SNOMED-CT.csv", encoding='utf-8-sig')
snomed_to_abbr = dict(zip(cond['Snomed_CT'].astype(str), cond['Acronym Name']))
abbr_to_full = dict(zip(cond['Acronym Name'], cond['Full Name']))

# Convert SNOMED codes from .hea to abbreviations; unmapped codes are kept as-is
snomed_codes = "164889003,59118001,164934002".split(',')
abbr_list = [snomed_to_abbr.get(c, c) for c in snomed_codes]
# → ['AFIB', 'RBBB', 'TWC']
```

### 7.5 Train/Test Split

```python
# X and df come from Method 1 (7.2). Substitute X_clean / df_clean to exclude
# the NaN records; the counts printed below then change accordingly.
test_fold = 10
test_mask = (df['strat_fold'] == test_fold).to_numpy()

X_train = X[~test_mask]
y_train = df[~test_mask]['diagnostic_superclass']
X_test = X[test_mask]
y_test = df[test_mask]['diagnostic_superclass']

print(f"Train: {len(X_train):,}  Test: {len(X_test):,}")
# Train: 38,324  Test: 4,259
```

---

## 8. Train/Test Split

| Strategy | Configuration |
|----------|--------------|
| **Standard** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Quick prototyping** | Fold 9 = Validation, Fold 10 = Test, rest = Train |

> Fold assignment respects `split_group_id`, which prevents records from the same source from appearing in multiple folds.

---

## 9. Known Issues

| Issue | Severity | Description |
|-------|----------|-------------|
| **2,569 records not in CSV** | Medium | Valid ECGs in WFDBRecords with Age/Sex/Dx headers but no `strat_fold` assignment |
| **83 NaN records in npy** | Low | 0.19% of records in X_cs_100hz.npy contain NaN; filter with `np.isnan` before training |
| **Some dx_raw codes are raw SNOMED numbers** | Low | A few records use bare SNOMED integers instead of abbreviations; map via ConditionNames CSV |
| **Age/Sex requires .hea parsing** | Medium | Demographic data is not in the CSV; reading it from the headers in batch is slower |
| **Superclass is single-label but dx_raw is multi-label** | Note | `diagnostic_superclass` has exactly 1 entry; `dx_raw` may have multiple (e.g., AFIB + RBBB + TWC) |

---

## 10. Citation

```bibtex
@article{zheng2020large,
  title     = {A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients},
  author    = {Zheng, Jianwei and Zhang, Jianming and Danioko, Sidy and Yao, Hai and Guo, Hangyuan and Rakovski, Cyril},
  journal   = {Scientific Data},
  volume    = {7},
  number    = {1},
  pages     = {48},
  year      = {2020},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41597-020-0386-x}
}
```

**PhysioNet:**

> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.

---

*Last updated: 2026-03-19*
