# Chapman-Shaoxing ECG Dataset — Data Summary & Usage Guide
> **Path:** `\\TRUENAS\wtmh_dataset\chapman-shaoxing`
>
> **Integrity Status:**
> - All 42,583 CSV-listed records have matching `.hea`/`.mat` files.
> - 2,569 valid records exist in `WFDBRecords` but are absent from the CSV.
> - 83 records in the pre-built npy array contain NaN values.
>
> **Last Verified:** 2026-03-19
---
## Table of Contents
1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System](#5-label-system)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)
---
## 1. Dataset Overview
The Chapman-Shaoxing dataset was collected at Shaoxing People's Hospital (China) in collaboration with Chapman University. It is one of the largest publicly available 12-lead ECG datasets, covering five major rhythm classes with SNOMED-CT standard diagnostic codes.
| Item | Details |
|------|---------|
| CSV Records | **42,583** |
| Actual ECG Files | **45,152** (2,569 additional files not indexed in CSV) |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **500 Hz** (WFDB original) / **100 Hz** (pre-built npy) |
| Record Length | 10 seconds (500 Hz → 5,000 pts; 100 Hz → 1,000 pts) |
| Diagnostic Superclasses | **5** (SB, SR, AFIB, ST, GSVT) |
| Diagnostic Codes | **90** raw codes (63 of them defined in `ConditionNames_SNOMED-CT.csv`) |
| Label System | SNOMED-CT + institutional abbreviations (multi-label) |
| File Format | WFDB + MATLAB `.mat` (raw) / NumPy `.npy` (pre-built) |
| Source | [PhysioNet Chapman-Shaoxing v1.0.0](https://physionet.org/content/chapman-shaoxing/1.0.0/) |
---
## 2. Directory Structure

```
chapman-shaoxing/
├── explore_dataset.py
└── a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/
    ├── cs_database.csv                 # Metadata and labels (42,583 rows)
    ├── ConditionNames_SNOMED-CT.csv    # 63 condition code definitions
    ├── X_cs_100hz.npy                  # Pre-built signal array (42583, 1000, 12) float32, 1.9 GB
    ├── RECORDS                         # 452 subdirectory path entries
    ├── SHA256SUMS.txt
    └── WFDBRecords/
        ├── 01/
        │   ├── 010/
        │   │   ├── JS00001.hea         # Header (Age, Sex, SNOMED-CT Dx)
        │   │   ├── JS00001.mat         # MATLAB signal
        │   │   └── ...
        │   └── ...
        ├── 02/ ~ 46/
        └── ...
```
> **Naming convention:** `WFDBRecords/{GG}/{SSS}/JS{XXXXX}` — GG = 2-digit group, SSS = 3-digit subgroup, XXXXX = 5-digit ECG ID.
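
The naming convention can be checked programmatically. The sketch below splits a record path into group, subgroup, and ECG ID with a regular expression. The helper name `parse_record_path` and the pattern are illustrative assumptions derived from the convention above, not part of the dataset tooling.

```python
import re

# Pattern derived from the naming convention above:
# WFDBRecords/{GG}/{SSS}/JS{XXXXX}
RECORD_RE = re.compile(r"WFDBRecords/(\d{2})/(\d{3})/(JS\d{5})$")

def parse_record_path(path):
    """Split a WFDB record path (no extension) into its components.

    Hypothetical helper; returns None if the path does not match.
    """
    m = RECORD_RE.search(path.replace("\\", "/"))
    if m is None:
        return None
    group, subgroup, ecg_id = m.groups()
    return {"group": group, "subgroup": subgroup, "ecg_id": ecg_id}

print(parse_record_path("WFDBRecords/01/010/JS00001"))
# {'group': '01', 'subgroup': '010', 'ecg_id': 'JS00001'}
```
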
---
## 3. File Format
### 3.1 Key Differences from PTB-XL
| Item | PTB-XL | Chapman-Shaoxing |
|------|--------|-----------------|
| Signal format | `.dat` (PhysioNet binary) | `.mat` (MATLAB v5) |
| Header format | `.hea` (WFDB) | `.hea` (WFDB) |
| Metadata | Separate CSV (includes age/sex) | CSV holds labels and folds; age/sex only in `.hea` header |
| Sampling rates | 100 Hz / 500 Hz | 500 Hz (+ pre-built 100 Hz npy) |
| Label confidence | 0–100 numeric | Binary (present/absent) |
### 3.2 WFDB Header Format (.hea)
```
JS00001 12 500 5000
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
...
JS00001.mat 16+24 1000/mV 16 0 527 32579 0 V6
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002 ← SNOMED-CT codes
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown
```
| Field | Meaning |
|-------|---------|
| `12` | 12 leads |
| `500` | Sampling rate (Hz) |
| `5000` | Samples per lead (10 s × 500 Hz) |
| `16+24` | Storage format 16 (16-bit samples) with a 24-byte offset into the `.mat` file |
| `1000/mV` | Gain = 1000 ADU/mV |
| `#Dx:` | SNOMED-CT codes (comma-separated, multi-label) |
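
To make the field table concrete, here is a minimal sketch that parses the record line and one signal line from the sample header and derives the record duration and the first sample in mV. The helper names are hypothetical, and the parser only handles the exact layout shown above (no explicit baseline in the gain field, so the ADC zero is used as baseline). For real work, `wfdb.rdheader` is the more robust route.

```python
def parse_record_line(line):
    """Parse the first line of a .hea file: name, n_leads, fs, n_samples."""
    name, n_sig, fs, n_samp = line.split()[:4]
    return {"name": name, "n_leads": int(n_sig), "fs": int(fs), "n_samples": int(n_samp)}

def parse_signal_line(line):
    """Parse one signal-specification line laid out like the sample header."""
    fields = line.split()
    fmt, _, offset = fields[1].partition("+")   # "16+24" -> format 16, 24-byte offset
    gain = float(fields[2].split("/")[0])        # "1000/mV" -> 1000 ADU per mV
    return {
        "file": fields[0],
        "format": int(fmt),
        "byte_offset": int(offset) if offset else 0,
        "gain": gain,
        "adc_zero": int(fields[4]),
        "initial_value": int(fields[5]),
        "lead": fields[8],
    }

rec = parse_record_line("JS00001 12 500 5000")
sig = parse_signal_line("JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I")

duration_s = rec["n_samples"] / rec["fs"]                          # 5000 / 500
first_mv = (sig["initial_value"] - sig["adc_zero"]) / sig["gain"]  # -254 / 1000
print(duration_s, first_mv)
# 10.0 -0.254
```
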
### 3.3 Signal Dimensions
| Method | Shape | Type |
|--------|-------|------|
| wfdb.rdsamp (500 Hz) | `(5000, 12)` | float64 (mV) |
| X_cs_100hz.npy (100 Hz) | `(1000, 12)` per record | float32 (mV) |
| Full npy array | `(42583, 1000, 12)` | float32 (mV) |
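
The size quoted for `X_cs_100hz.npy` follows directly from the shape and dtype in the table. The short calculation below reproduces it and, as a rough comparison, estimates what the same records would occupy at 500 Hz in float64. The second figure is only an estimate for planning memory, not the size of any file in the dataset.

```python
BYTES_FLOAT32 = 4
BYTES_FLOAT64 = 8

n_records, n_leads = 42_583, 12

# Pre-built array: 1,000 samples per lead, float32
npy_bytes = n_records * 1_000 * n_leads * BYTES_FLOAT32
# Hypothetical in-memory stack of all CSV records at 500 Hz, float64
raw_bytes = n_records * 5_000 * n_leads * BYTES_FLOAT64

print(f"{npy_bytes:,} bytes = {npy_bytes / 2**30:.2f} GiB")
# 2,043,984,000 bytes = 1.90 GiB
print(f"{raw_bytes / 2**30:.1f} GiB")
# 19.0 GiB
```
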
### 3.4 12-Lead ECG Sample

*Figure: JS00001, 85-year-old male. Diagnosis: AFIB (Atrial Fibrillation) + RBBB (Right Bundle Branch Block) + TWC (T-wave Change). 500 Hz, 10 seconds.*
---
## 4. Metadata Field Reference
### 4.1 cs_database.csv (7 columns)
| Field | Type | Description |
|-------|------|-------------|
| `ecg_id` | str | Unique identifier (format: `JS{5 digits}`) |
| `filename` | str | WFDB record path (no extension) |
| `dx_raw` | list | Raw diagnostic code list (abbreviations, multi-label) |
| `diagnostic_superclass` | list | Single-element list holding one of the 5 superclass labels |
| `split_group_id` | str | Split group identifier |
| `split_group_source` | str | Split source (record_id) |
| `strat_fold` | int | Stratified CV fold number (1–10) |
> **Note:** Age, sex, and raw SNOMED-CT codes are **not in the CSV** — they must be parsed from `.hea` headers.
### 4.2 Extracting Age/Sex from Headers
```python
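def parse_hea_meta(hea_path):
    """Read age, sex, and SNOMED-CT codes from the comment lines of a .hea file."""
    meta = {}
    with open(hea_path) as f:
        for line in f:
            # Comment lines look like "#Age: 85"; split once on the first colon
            if line.startswith('#Age:'):
                meta['age'] = line.split(':', 1)[1].strip()
            elif line.startswith('#Sex:'):
                meta['sex'] = line.split(':', 1)[1].strip()
            elif line.startswith('#Dx:'):
                meta['snomed'] = line.split(':', 1)[1].strip()
    return meta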
```
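
Header values arrive as strings, and not every record has a usable age (only 42,315 valid ages are reported in section 6.3). The sketch below shows one way to turn the comment lines into a typed row. `read_hea_comments` and `to_age` are hypothetical helpers, and the temporary file only mimics the header from section 3.2.

```python
import tempfile
from pathlib import Path

def read_hea_comments(hea_path):
    """Collect '#Key: value' comment lines of a .hea file into a dict."""
    out = {}
    with open(hea_path) as f:
        for line in f:
            if line.startswith("#") and ":" in line:
                key, value = line[1:].split(":", 1)
                out[key.strip().lower()] = value.strip()
    return out

def to_age(value):
    """Convert the age string to int; missing or non-numeric values become None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# Demo on a temporary file that mimics the header shown in section 3.2
with tempfile.TemporaryDirectory() as tmp:
    hea = Path(tmp) / "JS00001.hea"
    hea.write_text(
        "JS00001 12 500 5000\n#Age: 85\n#Sex: Male\n"
        "#Dx: 164889003,59118001,164934002\n"
    )
    meta = read_hea_comments(hea)

row = {
    "ecg_id": "JS00001",
    "age": to_age(meta.get("age")),
    "sex": meta.get("sex"),
    "snomed": [c for c in meta.get("dx", "").split(",") if c],
}
print(row)
# {'ecg_id': 'JS00001', 'age': 85, 'sex': 'Male', 'snomed': ['164889003', '59118001', '164934002']}
```

Rows built this way can be collected in a list and passed to `pd.DataFrame` for a join with `cs_database.csv` on `ecg_id`.
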
---
## 5. Label System
### 5.1 Two-tier Label Architecture
```
SNOMED-CT raw codes (in .hea)
↕ mapped via ConditionNames_SNOMED-CT.csv
Abbreviation codes (dx_raw in cs_database.csv)
↓ aggregated
5 diagnostic superclasses (diagnostic_superclass)
```
### 5.2 Five Diagnostic Superclasses
| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **SB** | Sinus Bradycardia | HR < 60 bpm |
| **SR** | Sinus Rhythm | Normal sinus rhythm |
| **AFIB** | Atrial Fibrillation | Irregular atrial activity |
| **ST** | Sinus Tachycardia | HR > 100 bpm |
| **GSVT** | Grouped Supraventricular Tachycardia | SVT group |
> Unlike in PTB-XL, each record has exactly one superclass (the classes are mutually exclusive), whereas `dx_raw` remains multi-label.
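
Because the superclass is single-label and `dx_raw` is multi-label, the two columns call for different target encodings. The following is a minimal sketch assuming an integer class index for the superclass and a multi-hot vector for `dx_raw`. The class order and the toy vocabulary are arbitrary choices for illustration.

```python
import numpy as np

SUPERCLASSES = ["SB", "SR", "AFIB", "ST", "GSVT"]   # order chosen arbitrarily
CLASS_TO_IDX = {name: i for i, name in enumerate(SUPERCLASSES)}

def encode_superclass(label_list):
    """Map a one-element list such as ['AFIB'] to its integer class index."""
    if len(label_list) != 1:
        raise ValueError(f"expected exactly one superclass, got {label_list}")
    return CLASS_TO_IDX[label_list[0]]

def encode_dx(dx_list, vocab):
    """Multi-hot encode a dx_raw list against a fixed code vocabulary."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for code in dx_list:
        if code in vocab:          # codes outside the vocabulary are ignored
            vec[vocab[code]] = 1.0
    return vec

# Toy vocabulary for illustration; a real one would cover all 90 codes.
vocab = {"AFIB": 0, "RBBB": 1, "TWC": 2, "SB": 3}

print(encode_superclass(["AFIB"]))                 # 2
print(encode_dx(["AFIB", "RBBB", "TWC"], vocab))   # [1. 1. 1. 0.]
```
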
### 5.3 Superclass & Top Dx Code Distribution


---
## 6. Statistical Analysis
### 6.1 Demographics

*Left: Age concentrated between 40–80 years (mean 59.4, median 62, range 4–89). Right: Male 56.4% (24,016), Female 43.6% (18,545), Unknown 22.*
### 6.2 Fold Distribution & NaN Analysis

*Left: 10 balanced folds (~4,257–4,259 records each). Right: 83 records (0.19%) in X_cs_100hz.npy contain NaN values.*
### 6.3 Key Statistics
| Metric | Value |
|--------|-------|
| CSV total records | 42,583 |
| WFDBRecords actual files | 45,152 |
| Records in files but not CSV | 2,569 |
| NaN records in npy | 83 (0.19%) |
| Unique diagnostic codes | 90 |
| Codes with SNOMED-CT definition | 63 |
| Age range (valid) | 4–89 years (mean=59.4, median=62, n=42,315) |
| Sex distribution | Male 56.4% (24,016) / Female 43.6% (18,545) / Unknown 22 |
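
The derived figures in this table can be recomputed from the raw counts as a quick consistency check. The snippet uses only the counts listed above and takes all 42,583 CSV records as the denominator.

```python
csv_records = 42_583
wfdb_records = 45_152
nan_records = 83
male, female, unknown = 24_016, 18_545, 22

print(wfdb_records - csv_records)                # 2569 records on disk but not in the CSV
print(f"{nan_records / csv_records:.2%}")        # 0.19%
print(male + female + unknown == csv_records)    # True
print(f"{male / csv_records:.1%} / {female / csv_records:.1%}")
# 56.4% / 43.6%
```
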
---
## 7. Loading the Data
### 7.1 Requirements
```bash
pip install wfdb pandas numpy scipy
```
### 7.2 Method 1: Load Pre-built npy Array (Fastest)
```python
import numpy as np
import pandas as pd
import ast
base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"
# Load pre-built 100 Hz array
X = np.load(base + "X_cs_100hz.npy") # shape: (42583, 1000, 12)
df = pd.read_csv(base + "cs_database.csv")
df['dx_raw'] = df['dx_raw'].apply(ast.literal_eval)
df['diagnostic_superclass'] = df['diagnostic_superclass'].apply(ast.literal_eval)
# Remove NaN records
valid_mask = ~np.isnan(X).any(axis=(1, 2))
X_clean = X[valid_mask] # shape: (42500, 1000, 12)
df_clean = df[valid_mask].reset_index(drop=True)
```
> **Recommended** for training: already downsampled to 100 Hz and converted to float32.
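
If loading the full array at once is inconvenient, for example on a machine with little RAM or over a slow network share, `np.load` accepts `mmap_mode="r"` so that data is read only when sliced. The sketch below scans for NaN records chunk by chunk. The function name and chunk size are illustrative, and the demo runs on a small synthetic array rather than the real file.

```python
import os
import tempfile
import numpy as np

def nan_record_mask(path, chunk=4096):
    """Return a boolean mask of records containing NaN, scanning chunk by chunk."""
    X = np.load(path, mmap_mode="r")                 # data stays on disk until sliced
    mask = np.zeros(X.shape[0], dtype=bool)
    for start in range(0, X.shape[0], chunk):
        block = np.asarray(X[start:start + chunk])   # only this chunk is read into RAM
        mask[start:start + chunk] = np.isnan(block).any(axis=(1, 2))
    return mask

# Demo on a small synthetic array standing in for X_cs_100hz.npy
with tempfile.TemporaryDirectory() as tmp:
    demo = np.zeros((10, 1000, 12), dtype=np.float32)
    demo[3, 5, 0] = np.nan
    path = os.path.join(tmp, "demo.npy")
    np.save(path, demo)
    mask = nan_record_mask(path, chunk=4)

print(np.flatnonzero(mask))   # [3]
```
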
### 7.3 Method 2: Load Raw WFDB/MATLAB Files (500 Hz)
```python
import wfdb
base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"
signal, meta = wfdb.rdsamp(base + "WFDBRecords/01/010/JS00001")
print(signal.shape) # (5000, 12)
print(meta['fs']) # 500
```
### 7.4 SNOMED-CT ↔ Abbreviation Mapping
```python
import pandas as pd
base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"
cond = pd.read_csv(base + "ConditionNames_SNOMED-CT.csv", encoding='utf-8-sig')
snomed_to_abbr = dict(zip(cond['Snomed_CT'].astype(str), cond['Acronym Name']))
abbr_to_full = dict(zip(cond['Acronym Name'], cond['Full Name']))
# Convert SNOMED codes from .hea to abbreviations
snomed_codes = "164889003,59118001,164934002".split(',')
abbr_list = [snomed_to_abbr.get(c, c) for c in snomed_codes]
# → ['AFIB', 'RBBB', 'TWC']
```
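
The same lookup helps with `dx_raw` lists that mix abbreviations and bare SNOMED integers (see Known Issues). Below is a minimal sketch of such a normalization step. `normalize_dx` is a hypothetical helper, and the small dictionary stands in for the full `snomed_to_abbr` mapping built above.

```python
def normalize_dx(dx_list, snomed_to_abbr):
    """Replace bare SNOMED integers in a dx_raw list with abbreviations.

    Codes without a known mapping are kept unchanged (as strings).
    """
    out = []
    for code in dx_list:
        key = str(code).strip()
        out.append(snomed_to_abbr.get(key, key) if key.isdigit() else key)
    return out

# Tiny mapping for illustration, taken from the example above.
snomed_to_abbr = {"164889003": "AFIB", "59118001": "RBBB", "164934002": "TWC"}

print(normalize_dx(["AFIB", "59118001", 164934002, "999"], snomed_to_abbr))
# ['AFIB', 'RBBB', 'TWC', '999']
```
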
### 7.5 Train/Test Split
```python
# Assumes X and df from section 7.2 (unfiltered, so NaN records are still included)
test_fold = 10
test_mask = (df['strat_fold'] == test_fold).to_numpy()
X_train, X_test = X[~test_mask], X[test_mask]
y_train = df.loc[~test_mask, 'diagnostic_superclass']
y_test = df.loc[test_mask, 'diagnostic_superclass']
print(f"Train: {len(X_train):,} Test: {len(X_test):,}")
# Train: 38,324 Test: 4,259
```
---
## 8. Train/Test Split
| Strategy | Configuration |
|----------|--------------|
| **Standard** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Quick prototyping** | Fold 9 = Validation, Fold 10 = Test, Folds 1–8 = Train |
> Fold assignment respects `split_group_id`: all records in one group land in the same fold, which prevents records from the same source from appearing in multiple folds.
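
The strategies in the table reduce to boolean masks over the `strat_fold` column. A minimal sketch follows, using a synthetic fold array in place of `df['strat_fold']`. The function names are illustrative.

```python
import numpy as np

def three_way_split(folds, val_fold=9, test_fold=10):
    """Return boolean masks (train, val, test) from a strat_fold array."""
    folds = np.asarray(folds)
    val = folds == val_fold
    test = folds == test_fold
    return ~(val | test), val, test

def cv_splits(folds):
    """Yield (held_out_fold, train_mask, test_mask) for each fold in turn."""
    folds = np.asarray(folds)
    for k in np.unique(folds):
        yield int(k), folds != k, folds == k

# Synthetic fold column: 3 records per fold, folds 1..10
folds = np.repeat(np.arange(1, 11), 3)

train, val, test = three_way_split(folds)
print(train.sum(), val.sum(), test.sum())   # 24 3 3
print(len(list(cv_splits(folds))))          # 10
```

With the real data, the masks index `X` and `df` directly, for example `X[train]` and `df[train]`.
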
---
## 9. Known Issues
| Issue | Severity | Description |
|-------|----------|-------------|
| **2,569 records not in CSV** | Medium | Valid ECGs in WFDBRecords with Age/Sex/Dx headers but no strat_fold assignment |
| **83 NaN records in npy** | Low | 0.19% of records in X_cs_100hz.npy contain NaN; filter with `np.isnan` before training |
| **Some dx_raw codes are raw SNOMED numbers** | Low | A few records use bare SNOMED integers instead of abbreviations; map via ConditionNames CSV |
| **Age/Sex requires .hea parsing** | Medium | Demographic data not in CSV; batch reading is slower |
| **Superclass is single-label but dx_raw is multi-label** | Note | `diagnostic_superclass` has exactly 1 entry; `dx_raw` may have multiple (e.g., AFIB + RBBB + TWC) |
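
To list the records that exist on disk but have no CSV row, a set difference between the `.hea` file stems and the `ecg_id` column is sufficient. The sketch below demonstrates the idea on a throwaway directory. `find_unindexed` is a hypothetical helper, and on the real dataset `csv_ids` would come from `df['ecg_id']`.

```python
import tempfile
from pathlib import Path

def find_unindexed(wfdb_root, csv_ids):
    """Return sorted ECG IDs that have a .hea file on disk but no CSV row."""
    on_disk = {p.stem for p in Path(wfdb_root).rglob("*.hea")}
    return sorted(on_disk - set(csv_ids))

# Demo on a throwaway directory tree that mimics WFDBRecords/{GG}/{SSS}/
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "WFDBRecords" / "01" / "010"
    sub.mkdir(parents=True)
    for name in ("JS00001", "JS00002", "JS00003"):
        (sub / f"{name}.hea").touch()
    extra = find_unindexed(Path(tmp) / "WFDBRecords", csv_ids=["JS00001", "JS00003"])

print(extra)   # ['JS00002']
```
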
---
## 10. Citation
```bibtex
@article{zheng2020large,
  title     = {A 12-lead electrocardiogram database for arrhythmia research
               covering more than 10,000 patients},
  author    = {Zheng, Jianwei and Zhang, Jianming and Danioko, Sidy
               and Yao, Hai and Guo, Hangyuan and Rakovski, Cyril},
  journal   = {Scientific Data},
  volume    = {7},
  number    = {1},
  pages     = {48},
  year      = {2020},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41597-020-0386-x}
}
```
**PhysioNet:**
> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.
---
*Last updated: 2026-03-19*