# Chapman-Shaoxing ECG Dataset — Data Summary & Usage Guide

> **Path:** `\\TRUENAS\wtmh_dataset\chapman-shaoxing`
>
> **Integrity Status:**
> - All 42,583 CSV-listed records have matching `.hea`/`.mat` files
> - 2,569 valid records exist in `WFDBRecords` but are absent from the CSV
> - 83 records in the pre-built npy array contain NaN values
>
> **Last Verified:** 2026-03-19

---

## Table of Contents

1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System](#5-label-system)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)

---

## 1. Dataset Overview

The Chapman-Shaoxing dataset was collected at Shaoxing People's Hospital (China) in collaboration with Chapman University. It is one of the largest publicly available 12-lead ECG datasets, covering five major rhythm classes with SNOMED-CT standard diagnostic codes.

| Item | Details |
|------|---------|
| CSV Records | **42,583** |
| Actual ECG Files | **45,152** (2,569 additional files not indexed in CSV) |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **500 Hz** (WFDB original) / **100 Hz** (pre-built npy) |
| Record Length | 10 seconds (500 Hz → 5,000 pts; 100 Hz → 1,000 pts) |
| Diagnostic Superclasses | **5** (SB, SR, AFIB, ST, GSVT) |
| Diagnostic Codes | **90** raw codes (with SNOMED-CT mapping) |
| Label System | SNOMED-CT + institutional abbreviations (multi-label) |
| File Format | WFDB + MATLAB `.mat` (raw) / NumPy `.npy` (pre-built) |
| Source | [PhysioNet Chapman-Shaoxing v1.0.0](https://physionet.org/content/chapman-shaoxing/1.0.0/) |

---

## 2. Directory Structure

![fig6_structure](https://hackmd.io/_uploads/ryrLG_tq-l.png)

```
chapman-shaoxing/
├── explore_dataset.py
└── a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/
    ├── cs_database.csv           # Metadata and labels (42,583 rows)
    ├── ConditionNames_SNOMED-CT.csv  # 63 condition code definitions
    ├── X_cs_100hz.npy            # Pre-built signal array (42583, 1000, 12) float32, 1.9 GB
    ├── RECORDS                   # 452 subdirectory path entries
    ├── SHA256SUMS.txt
    └── WFDBRecords/
        ├── 01/
        │   ├── 010/
        │   │   ├── JS00001.hea   # Header (Age, Sex, SNOMED-CT Dx)
        │   │   ├── JS00001.mat   # MATLAB signal
        │   │   └── ...
        │   └── ...
        ├── 02/ ~ 46/
        └── ...
```

> **Naming convention:** `WFDBRecords/{GG}/{SSS}/JS{XXXXX}` — GG = 2-digit group, SSS = 3-digit subgroup, XXXXX = 5-digit ECG ID.
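
To make the convention concrete, here is a minimal sketch that splits a record path into its parts. The helper name `parse_record_path` and the regular expression are illustrative assumptions, not part of the dataset's tooling.

```python
import re

# Pattern for the convention above: WFDBRecords/{GG}/{SSS}/JS{XXXXX}
RECORD_RE = re.compile(
    r"WFDBRecords/(?P<group>\d{2})/(?P<subgroup>\d{3})/(?P<ecg_id>JS\d{5})$"
)

def parse_record_path(path):
    """Split a record path (no extension) into group, subgroup, and ECG ID."""
    # Normalise Windows separators so UNC-style paths also match.
    match = RECORD_RE.search(path.replace("\\", "/"))
    if match is None:
        raise ValueError(f"not a Chapman-Shaoxing record path: {path!r}")
    return match.groupdict()

print(parse_record_path("WFDBRecords/01/010/JS00001"))
# {'group': '01', 'subgroup': '010', 'ecg_id': 'JS00001'}
```

The function expects the path without a file extension, which matches the `filename` column described in Section 4.1.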

---

## 3. File Format

### 3.1 Key Differences from PTB-XL

| Item | PTB-XL | Chapman-Shaoxing |
|------|--------|-----------------|
| Signal format | `.dat` (PhysioNet binary) | `.mat` (MATLAB v5) |
| Header format | `.hea` (WFDB) | `.hea` (WFDB) |
| Age/sex metadata | In the metadata CSV | Only in `.hea` headers |
| Sampling rates | 100 Hz / 500 Hz | 500 Hz (+ pre-built 100 Hz npy) |
| Label confidence | 0–100 numeric | Binary (present/absent) |

### 3.2 WFDB Header Format (.hea)

```
JS00001 12 500 5000
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
...
JS00001.mat 16+24 1000/mV 16 0  527 32579 0 V6
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002      ← SNOMED-CT codes
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown
```

| Field | Meaning |
|-------|---------|
| `12` | 12 leads |
| `500` | Sampling rate (Hz) |
| `5000` | Samples per lead (10 s × 500 Hz) |
| `16+24` | WFDB storage format 16 (16-bit samples); sample data starts 24 bytes into the `.mat` file |
| `1000/mV` | Gain = 1000 ADU/mV |
| `#Dx:` | SNOMED-CT codes (comma-separated, multi-label) |
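
As a rough illustration of how these fields fit together, the sketch below parses one signal line from the example header and converts its initial value from ADC units to millivolts. The function name `parse_signal_line` is hypothetical, and the parser only handles the simple layout shown above; general WFDB headers allow extra optional fields (samples per frame, skew, explicit baseline) that it ignores. In practice `wfdb.rdsamp` does this work for you.

```python
def parse_signal_line(line):
    """Parse one signal-specification line of a .hea file into a dict."""
    parts = line.split()
    fmt, _, offset = parts[1].partition("+")   # "16+24" -> format 16, offset 24
    gain, _, units = parts[2].partition("/")   # "1000/mV" -> gain 1000, units mV
    return {
        "file": parts[0],
        "format": int(fmt),
        "byte_offset": int(offset or 0),
        "gain": float(gain),
        "units": units,
        "adc_resolution": int(parts[3]),
        "adc_zero": int(parts[4]),
        "initial_value": int(parts[5]),
        "checksum": int(parts[6]),
        "block_size": int(parts[7]),
        "lead": parts[8],
    }

spec = parse_signal_line("JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I")
# No explicit baseline is given, so the ADC zero serves as the baseline.
first_sample_mv = (spec["initial_value"] - spec["adc_zero"]) / spec["gain"]
print(spec["lead"], first_sample_mv)   # I -0.254
```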

### 3.3 Signal Dimensions

| Method | Shape | Type |
|--------|-------|------|
| wfdb.rdsamp (500 Hz) | `(5000, 12)` | float64 (mV) |
| X_cs_100hz.npy (100 Hz) | `(1000, 12)` per record | float32 (mV) |
| Full npy array | `(42583, 1000, 12)` | float32 (mV) |
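
The shapes and the file size follow directly from the sampling rate, record length, and data type. A short arithmetic check, using only figures quoted in this guide:

```python
FS_RAW, FS_NPY = 500, 100        # sampling rates in Hz
DURATION_S = 10                  # record length in seconds
N_RECORDS, N_LEADS = 42_583, 12
BYTES_PER_FLOAT32 = 4

samples_raw = FS_RAW * DURATION_S    # samples per lead at 500 Hz
samples_npy = FS_NPY * DURATION_S    # samples per lead at 100 Hz
array_bytes = N_RECORDS * samples_npy * N_LEADS * BYTES_PER_FLOAT32

print(samples_raw, samples_npy)            # 5000 1000
print(f"{array_bytes / 2**30:.1f} GiB")    # 1.9 GiB
```

Loading the full array therefore needs roughly 2 GB of free RAM unless it is memory-mapped.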

### 3.4 12-Lead ECG Sample

![fig1_ecg_sample](https://hackmd.io/_uploads/HJXPzdK5We.png)

*Figure: JS00001, 85-year-old male. Diagnosis: AFIB (Atrial Fibrillation) + RBBB (Right Bundle Branch Block) + TWC (T-wave Change). 500 Hz, 10 seconds.*

---

## 4. Metadata Field Reference

### 4.1 cs_database.csv (7 columns)

| Field | Type | Description |
|-------|------|-------------|
| `ecg_id` | str | Unique identifier (format: `JS{5 digits}`) |
| `filename` | str | WFDB record path (no extension) |
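| `dx_raw` | list | Raw diagnostic code list (abbreviations, multi-label); stored as a string, parse with `ast.literal_eval` |
| `diagnostic_superclass` | list | Single-element list holding one of the 5 superclass labels; stored as a string, parse with `ast.literal_eval` |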
| `split_group_id` | str | Split group identifier |
| `split_group_source` | str | Split source (record_id) |
| `strat_fold` | int | Stratified CV fold number (1–10) |

> **Note:** Age, sex, and raw SNOMED-CT codes are **not in the CSV** — they must be parsed from `.hea` headers.

### 4.2 Extracting Age/Sex from Headers

```python
def parse_hea_meta(hea_path):
    """Return age, sex, and SNOMED-CT codes from the comment lines of a .hea file."""
    keys = {'#Age': 'age', '#Sex': 'sex', '#Dx': 'snomed'}
    meta = {}
    with open(hea_path) as f:
        for line in f:
            # Comment lines look like "#Age: 85"; split only on the first colon.
            tag, sep, value = line.partition(':')
            if sep and tag in keys:
                meta[keys[tag]] = value.strip()
    # All values stay strings; 'snomed' is a comma-separated code list.
    return meta
```
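
Header values are plain text, and Section 6.3 reports only 42,315 valid ages, so some age fields cannot be read as numbers. The sketch below walks a directory tree and collects age and sex for every record, converting age to `int` where possible. How missing ages are encoded is not stated in this guide, so the hypothetical helper `to_age` simply treats anything that is not a whole number as missing.

```python
from pathlib import Path

def read_header_fields(hea_path):
    """Return the '#Key: value' comment lines of one .hea file as a dict."""
    fields = {}
    for line in Path(hea_path).read_text().splitlines():
        if line.startswith("#") and ":" in line:
            key, value = line[1:].split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def to_age(value):
    """Convert an age string to int; return None when it is not a whole number."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None

def collect_demographics(root):
    """Map each record ID found under `root` to its parsed age and sex."""
    rows = {}
    for hea in sorted(Path(root).rglob("*.hea")):
        fields = read_header_fields(hea)
        rows[hea.stem] = {"age": to_age(fields.get("Age")), "sex": fields.get("Sex")}
    return rows
```

Calling `collect_demographics` on the `WFDBRecords` directory reads tens of thousands of small files over the network share, so it may be worth caching the result (for example as a CSV) after the first run.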

---

## 5. Label System

### 5.1 Two-tier Label Architecture

```
SNOMED-CT raw codes (in .hea)
    ↕ mapped via ConditionNames_SNOMED-CT.csv
Abbreviation codes (dx_raw in cs_database.csv)
    ↓ aggregated
5 diagnostic superclasses (diagnostic_superclass)
```

### 5.2 Five Diagnostic Superclasses

| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **SB** | Sinus Bradycardia | HR < 60 bpm |
| **SR** | Sinus Rhythm | Normal sinus rhythm |
| **AFIB** | Atrial Fibrillation | Irregular atrial activity |
| **ST** | Sinus Tachycardia | HR > 100 bpm |
| **GSVT** | Grouped Supraventricular Tachycardia | SVT group |

> Unlike PTB-XL, superclasses are **single-label** (mutually exclusive), whereas `dx_raw` is multi-label.
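
Because each record carries exactly one superclass, the label can be reduced to a single integer for classification. A minimal sketch, assuming the CSV cell looks like `"['AFIB']"`; the class order in `CLASSES` is an arbitrary choice made for this example.

```python
import ast

CLASSES = ["SB", "SR", "AFIB", "ST", "GSVT"]   # order is an arbitrary choice
CLASS_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}

def superclass_index(cell):
    """Turn a CSV cell such as "['AFIB']" (or an already parsed list) into a class index."""
    labels = ast.literal_eval(cell) if isinstance(cell, str) else list(cell)
    if len(labels) != 1:
        raise ValueError(f"expected exactly one superclass, got {labels!r}")
    return CLASS_TO_INDEX[labels[0]]

print(superclass_index("['AFIB']"))   # 2
```

With pandas, the same function can be applied column-wide, for example `df['diagnostic_superclass'].apply(superclass_index)`.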

### 5.3 Superclass & Top Dx Code Distribution

![fig2_superclass](https://hackmd.io/_uploads/r1Rwf_K5bx.png)

![fig3_top_dx](https://hackmd.io/_uploads/H1b_MdY5Ze.png)

---

## 6. Statistical Analysis

### 6.1 Demographics

![fig5_demographics](https://hackmd.io/_uploads/Hy_2zuY5Ze.png)

*Left: Age concentrated between 40–80 years (mean 59.4, median 62, range 4–89). Right: Male 56.4% (24,016), Female 43.6% (18,545), Unknown 22.*

### 6.2 Fold Distribution & NaN Analysis

![fig4_fold_nan](https://hackmd.io/_uploads/By0FzOY5-e.png)

*Left: 10 balanced folds (~4,257–4,259 records each). Right: 83 records (0.19%) in X_cs_100hz.npy contain NaN values.*

### 6.3 Key Statistics

| Metric | Value |
|--------|-------|
| CSV total records | 42,583 |
| WFDBRecords actual files | 45,152 |
| Records in files but not CSV | 2,569 |
| NaN records in npy | 83 (0.19%) |
| Unique diagnostic codes | 90 |
| Codes with SNOMED-CT definition | 63 |
| Age range (valid) | 4–89 years (mean=59.4, median=62, n=42,315) |
| Sex distribution | Male 56.4% (24,016) / Female 43.6% (18,545) / Unknown 22 |

---

## 7. Loading the Data

### 7.1 Requirements

```bash
pip install wfdb pandas numpy scipy
```

### 7.2 Method 1: Load Pre-built npy Array (Fastest)

```python
import numpy as np
import pandas as pd
import ast

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

# Load pre-built 100 Hz array
X = np.load(base + "X_cs_100hz.npy")   # shape: (42583, 1000, 12)
df = pd.read_csv(base + "cs_database.csv")
df['dx_raw'] = df['dx_raw'].apply(ast.literal_eval)
df['diagnostic_superclass'] = df['diagnostic_superclass'].apply(ast.literal_eval)

# Remove NaN records
valid_mask = ~np.isnan(X).any(axis=(1, 2))
X_clean = X[valid_mask]          # shape: (42500, 1000, 12)
df_clean = df[valid_mask].reset_index(drop=True)
```

> **Recommended** for training: already downsampled to 100 Hz and converted to float32.
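
If RAM is tight, the array can be memory-mapped instead of loaded in full, and the NaN check can run in chunks. This is a sketch rather than part of the dataset's tooling; the function name `nan_record_mask` and the chunk size are assumptions.

```python
import numpy as np

def nan_record_mask(X, chunk=4096):
    """Return a boolean mask marking records that contain at least one NaN."""
    mask = np.zeros(X.shape[0], dtype=bool)
    for start in range(0, X.shape[0], chunk):
        # Only this slice is read into memory when X is memory-mapped.
        block = np.asarray(X[start:start + chunk])
        mask[start:start + chunk] = np.isnan(block).any(axis=(1, 2))
    return mask

# Small synthetic demo: record 3 contains a NaN.
demo = np.zeros((5, 1000, 12), dtype=np.float32)
demo[3, 10, 0] = np.nan
print(np.flatnonzero(nan_record_mask(demo, chunk=2)).tolist())   # [3]
```

On the real file, open the array with `np.load(base + "X_cs_100hz.npy", mmap_mode="r")` and use `~nan_record_mask(X)` in place of `valid_mask`.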

### 7.3 Method 2: Load Raw WFDB/MATLAB Files (500 Hz)

```python
import wfdb

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

signal, meta = wfdb.rdsamp(base + "WFDBRecords/01/010/JS00001")
print(signal.shape)    # (5000, 12)
print(meta['fs'])      # 500
```

### 7.4 SNOMED-CT ↔ Abbreviation Mapping

```python
import pandas as pd

base = "//TRUENAS/wtmh_dataset/chapman-shaoxing/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/"

cond = pd.read_csv(base + "ConditionNames_SNOMED-CT.csv", encoding='utf-8-sig')
snomed_to_abbr = dict(zip(cond['Snomed_CT'].astype(str), cond['Acronym Name']))
abbr_to_full   = dict(zip(cond['Acronym Name'], cond['Full Name']))

# Convert SNOMED codes from .hea to abbreviations
snomed_codes = "164889003,59118001,164934002".split(',')
abbr_list = [snomed_to_abbr.get(c, c) for c in snomed_codes]
# → ['AFIB', 'RBBB', 'TWC']
```
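
Section 9 notes that a few `dx_raw` entries hold bare SNOMED numbers instead of abbreviations. The same lookup can normalise such mixed lists. The helper below is a hypothetical sketch; `demo_map` stands in for the `snomed_to_abbr` dictionary built above and contains only the three codes from the example header.

```python
def normalize_codes(codes, snomed_to_abbr):
    """Replace bare SNOMED numbers with abbreviations; keep everything else."""
    normalized = []
    for code in codes:
        key = str(code).strip()          # accept ints as well as strings
        normalized.append(snomed_to_abbr.get(key, key))
    return normalized

# Tiny stand-in for the mapping built from ConditionNames_SNOMED-CT.csv
demo_map = {"164889003": "AFIB", "59118001": "RBBB", "164934002": "TWC"}
print(normalize_codes(["AFIB", 59118001, "164934002"], demo_map))
# ['AFIB', 'RBBB', 'TWC']
```

Codes without an entry in the mapping are passed through unchanged, so unmapped values remain visible for later inspection.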

### 7.5 Train/Test Split

```python
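test_fold = 10

# X and df come from Method 1 (Section 7.2); NaN records are not filtered here.
# A NumPy boolean mask indexes both the array and the DataFrame consistently.
test_mask = (df['strat_fold'] == test_fold).to_numpy()

X_train = X[~test_mask]
y_train = df.loc[~test_mask, 'diagnostic_superclass']

X_test = X[test_mask]
y_test = df.loc[test_mask, 'diagnostic_superclass']

print(f"Train: {len(X_train):,}  Test: {len(X_test):,}")
# Train: 38,324  Test: 4,259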
```

---

## 8. Train/Test Split

| Strategy | Configuration |
|----------|--------------|
| **Standard** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Quick prototyping** | Fold 9 = Validation, Fold 10 = Test, rest = Train |

> Fold assignment is grouped by `split_group_id`, which prevents records from the same source from appearing in more than one fold.
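
The three strategies differ only in which folds are held out. A small sketch in plain Python, with hypothetical helper names, that turns a `strat_fold` column into index lists:

```python
def split_indices(folds, test_fold=10, val_fold=None):
    """Return (train, val, test) index lists for the given fold assignment."""
    train, val, test = [], [], []
    for i, fold in enumerate(folds):
        if fold == test_fold:
            test.append(i)
        elif fold == val_fold:
            val.append(i)
        else:
            train.append(i)
    return train, val, test

def cross_validation_splits(folds, n_folds=10):
    """Yield (test_fold, train, test), rotating each fold as the test set."""
    for k in range(1, n_folds + 1):
        train, _, test = split_indices(folds, test_fold=k)
        yield k, train, test

folds = [1, 2, 9, 10, 9, 10, 3]    # toy strat_fold column
train, val, test = split_indices(folds, test_fold=10, val_fold=9)
print(train, val, test)   # [0, 1, 6] [2, 4] [3, 5]
```

With the real metadata, pass `df['strat_fold'].tolist()` as `folds` and use the returned indices on both `X` and `df`.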

---

## 9. Known Issues

| Issue | Severity | Description |
|-------|----------|-------------|
| **2,569 records not in CSV** | Medium | Valid ECGs in WFDBRecords with Age/Sex/Dx headers but no strat_fold assignment |
| **83 NaN records in npy** | Low | 0.19% of records in X_cs_100hz.npy contain NaN; filter with `np.isnan` before training |
| **Some dx_raw codes are raw SNOMED numbers** | Low | A few records use bare SNOMED integers instead of abbreviations; map via ConditionNames CSV |
| **Age/Sex requires .hea parsing** | Medium | Demographic data not in CSV; batch reading is slower |
| **Superclass is single-label but dx_raw is multi-label** | Note | `diagnostic_superclass` has exactly 1 entry; `dx_raw` may have multiple (e.g., AFIB + RBBB + TWC) |
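
To see which records fall under the first issue, compare the record IDs on disk with the IDs in the CSV. A minimal sketch, assuming every record has a `.hea` file whose stem is the ECG ID; the function name `unindexed_records` is hypothetical.

```python
from pathlib import Path

def unindexed_records(wfdb_root, csv_ids):
    """Return sorted record IDs that have a .hea file on disk but no CSV row."""
    on_disk = {p.stem for p in Path(wfdb_root).rglob("*.hea")}
    return sorted(on_disk - set(csv_ids))
```

For example, `unindexed_records(base + "WFDBRecords", df["ecg_id"])` should return the 2,569 records listed above, provided the counts in this guide still hold for your copy of the data.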

---

## 10. Citation

```bibtex
@article{zheng2020large,
  title     = {A 12-lead electrocardiogram database for arrhythmia research
               covering more than 10,000 patients},
  author    = {Zheng, Jianwei and Zhang, Jianming and Danioko, Sidy
               and Yao, Hai and Guo, Hangyuan and Rakovski, Cyril},
  journal   = {Scientific Data},
  volume    = {7},
  number    = {1},
  pages     = {48},
  year      = {2020},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41597-020-0386-x}
}
```

**PhysioNet:**

> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.

---

*Last updated: 2026-03-19*
