# PTB-XL ECG Dataset — Data Summary & Usage Guide
> **Path:** `\\TRUENAS\wtmh_dataset\ptb-xl`
> **Integrity Status:** All 21,799 records (87,196 files) present; no missing or zero-byte files
> **Last Verified:** 2026-03-19
---
## Table of Contents
1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System (SCP Codes)](#5-label-system-scp-codes)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)
---
## 1. Dataset Overview
PTB-XL is the largest publicly available 12-lead clinical ECG dataset, collected at the Physikalisch-Technische Bundesanstalt (PTB) in Berlin, Germany between 1989 and 1996.
| Item | Details |
|------|---------|
| Total Records | **21,799** 10-second ECGs |
| Unique Patients | **18,869** |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **100 Hz** (LR) and **500 Hz** (HR) |
| Collection Period | 1989–1996 |
| Label Types | 71 SCP-ECG codes (multi-label with confidence scores) |
| Diagnostic Superclasses | 5 (NORM, MI, STTC, CD, HYP) |
| File Format | WFDB (PhysioNet standard: `.dat` + `.hea`) |
| Source | [PhysioNet PTB-XL v1.0.3](https://physionet.org/content/ptb-xl/1.0.3/) |
---
## 2. Directory Structure

```
ptb-xl/
├── ptbxl_database.csv      # Metadata and labels (21,799 rows)
├── scp_statements.csv      # SCP code definitions and categories
├── RECORDS                 # WFDB record index (*)
├── SHA256SUMS.txt          # File integrity checksums
├── example_physionet.py    # Official loading example
├── records100/             # 100 Hz (low-resolution)
│   ├── 00000/              # ~1,000 records per subfolder
│   │   ├── 00001_lr.dat    # Binary signal data
│   │   ├── 00001_lr.hea    # Header (sampling rate, lead names, etc.)
│   │   └── ...
│   └── 21000/
└── records500/             # 500 Hz (high-resolution)
    ├── 00000/
    │   ├── 00001_hr.dat
    │   ├── 00001_hr.hea
    │   └── ...
    └── 21000/
```
> **(*) Note:** Line 21,799 of the RECORDS file is corrupted — two record paths are merged onto a single line. The actual signal files are intact and unaffected.
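
A minimal sketch of one way to read `RECORDS` defensively, assuming every entry follows the `recordsNNN/DDDDD/DDDDD_lr|hr` layout shown in the tree above. The helper name `split_record_paths` is hypothetical. It pulls every path out of each line with a regular expression, so a line holding two merged paths yields two entries.

```python
import re

# Pattern for one record path, e.g. "records100/00000/00001_lr".
# Assumption: all entries in RECORDS follow this layout.
RECORD_PATH = re.compile(r"records(?:100|500)/\d{5}/\d{5}_(?:lr|hr)")


def split_record_paths(lines):
    """Return every record path found in `lines`, splitting merged entries."""
    paths = []
    for line in lines:
        paths.extend(RECORD_PATH.findall(line))
    return paths


# Illustrative input: the second line mimics two paths merged together.
sample = [
    "records100/00000/00001_lr\n",
    "records100/21000/21837_lrrecords500/00000/00001_hr\n",
]
print(split_record_paths(sample))
```

Iterating over the metadata CSV (`filename_lr` / `filename_hr`) avoids the `RECORDS` file altogether and is usually the simpler route.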
---
## 3. File Format
### 3.1 WFDB Format (.dat + .hea)
Each ECG record consists of a pair of files:
| File | Description |
|------|-------------|
| `.hea` | Plain-text header: sampling rate, number of leads, units, gain |
| `.dat` | Binary raw signal: 16-bit integers, decoded using the `.hea` |
**Sample header (00001_lr.hea):**
```
00001_lr 12 100 1000
00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I
00001_lr.dat 16 1000.0(0)/mV 16 0 -55 723 0 II
...
00001_lr.dat 16 1000.0(0)/mV 16 0 -79 832 0 V6
```
| Field | Meaning |
|-------|---------|
| `12` | 12 leads |
| `100` | Sampling rate (Hz) |
| `1000` | Samples per lead (10 s × 100 Hz) |
| `1000.0(0)/mV` | Gain = 1000 ADU/mV, baseline = 0 |
| `16` | 16-bit integer format |
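
To make the field layout concrete, here is a small sketch that parses the record line and one signal line of the sample header with plain string handling. The function names are hypothetical, and the parser covers only the fields that appear in this sample; in practice `wfdb` reads headers for you.

```python
import re


def parse_record_line(line):
    """Parse the first header line: name, lead count, sampling rate, samples."""
    name, n_sig, fs, n_samples = line.split()[:4]
    return {"name": name, "n_sig": int(n_sig), "fs": int(fs), "n_samples": int(n_samples)}


def parse_signal_line(line):
    """Parse one signal line of the form shown in the sample header."""
    fields = line.split()
    # The gain field looks like "1000.0(0)/mV": gain, (baseline), /units.
    match = re.fullmatch(r"([\d.]+)\((-?\d+)\)/(\w+)", fields[2])
    return {
        "file": fields[0],
        "fmt": int(fields[1]),
        "gain": float(match.group(1)),
        "baseline": int(match.group(2)),
        "units": match.group(3),
        "lead": fields[-1],
    }


print(parse_record_line("00001_lr 12 100 1000"))
print(parse_signal_line("00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I"))
```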
### 3.2 Signal Dimensions
| Version | Shape | Description |
|---------|-------|-------------|
| LR (100 Hz) | `(1000, 12)` | 1,000 time steps × 12 leads |
| HR (500 Hz) | `(5000, 12)` | 5,000 time steps × 12 leads |
Values are returned in **mV** after decoding.
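
The conversion to mV follows from the header values: `physical = (stored_integer - baseline) / gain`. The sketch below applies this to a few made-up 16-bit samples using only the standard library. It assumes WFDB format 16 (little-endian integers, with the leads interleaved within each time step) and is illustrative only, since `wfdb.rdsamp` performs this step internally. `decode_format16` is a hypothetical name.

```python
import struct


def decode_format16(raw, n_leads, gain=1000.0, baseline=0):
    """Decode interleaved little-endian int16 samples into rows of mV values."""
    n_values = len(raw) // 2
    values = struct.unpack(f"<{n_values}h", raw[: n_values * 2])
    rows = []
    for start in range(0, n_values - n_leads + 1, n_leads):
        frame = values[start:start + n_leads]
        rows.append([(v - baseline) / gain for v in frame])
    return rows


# Two time steps of a 2-lead toy signal (made-up values, not real PTB-XL data).
raw = struct.pack("<4h", -119, -55, 500, 1000)
print(decode_format16(raw, n_leads=2))
# [[-0.119, -0.055], [0.5, 1.0]]
```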
### 3.3 12-Lead ECG Sample

*Figure: ecg_id=1, 500 Hz version, 10-second recording. Each lead views the heart's electrical activity from a different angle.*
---
## 4. Metadata Field Reference
`ptbxl_database.csv` contains one row per ECG record and **28 columns**; the tables below describe 23 of them:
### Basic Information
| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `ecg_id` | int | Unique identifier (1–21,837) | 0% |
| `patient_id` | float | Patient ID (one patient may have multiple ECGs) | 0% |
| `recording_date` | datetime | Recording timestamp (1984–1996) | 0% |
| `filename_lr` | str | Path to 100 Hz file (no extension) | 0% |
| `filename_hr` | str | Path to 500 Hz file (no extension) | 0% |
### Patient Demographics
| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `age` | float | Age in years | 0% |
| `sex` | int | Sex (0=Male, 1=Female, per the PTB-XL documentation) | 0% |
| `height` | float | Height (cm) | **68.0%** |
| `weight` | float | Weight (kg) | **56.8%** |
### Clinical Labels
| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `scp_codes` | str | SCP diagnosis codes + confidence (0–100), stored as a Python dict literal | 0% |
| `heart_axis` | str | Cardiac axis direction | 38.8% |
| `infarction_stadium1` | str | Infarction stage (primary) | ~80% |
| `infarction_stadium2` | str | Infarction stage (secondary) | ~95% |
| `report` | str | Original clinical report (German) | 0% |
### Signal Quality Flags
| Field | Description |
|-------|-------------|
| `baseline_drift` | Baseline drift (affected leads) |
| `static_noise` | Static noise |
| `burst_noise` | Burst noise |
| `electrodes_problems` | Electrode issues |
| `extra_beats` | Extra beats |
| `pacemaker` | Pacemaker presence |
### Validation
| Field | Description |
|-------|-------------|
| `validated_by_human` | Verified by a cardiologist |
| `second_opinion` | Reviewed by a second physician |
| `strat_fold` | Stratified cross-validation fold (1–10) |
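
The missing-value percentages in the tables above can be recomputed directly from the CSV. The following is a sketch using only the standard library. `missing_rates` is a hypothetical helper, and it assumes that a missing value appears as an empty cell. The inline sample stands in for `ptbxl_database.csv`.

```python
import csv
import io


def missing_rates(csv_file, columns):
    """Return the percentage of empty cells for each requested column."""
    reader = csv.DictReader(csv_file)
    counts = {col: 0 for col in columns}
    total = 0
    for row in reader:
        total += 1
        for col in columns:
            if (row.get(col) or "").strip() == "":
                counts[col] += 1
    return {col: (100.0 * n / total if total else 0.0) for col, n in counts.items()}


# Tiny made-up sample; for real use, pass open(<csv path>, newline="") instead.
sample = io.StringIO(
    "ecg_id,age,height,weight\n"
    "1,56,,63\n"
    "2,19,,70\n"
    "3,37,170,\n"
    "4,24,,82\n"
)
print(missing_rates(sample, ["height", "weight"]))
# {'height': 75.0, 'weight': 25.0}
```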
---
## 5. Label System (SCP Codes)
### 5.1 Multi-label Structure
The `scp_codes` field is a Python dict with `{code: confidence}`:
```python
{"NORM": 100.0} # Normal ECG
{"SR": 0.0, "AFIB": 100.0, "LVH": 80.0} # Multi-label
```
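Confidence is the likelihood (0–100) assigned to each statement. `100.0` = confirmed; `0.0` = likelihood unknown (per the PTB-XL documentation), which is not evidence that the finding is absent.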
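
Because the field arrives from the CSV as text, it has to be parsed before use. The sketch below parses one cell and optionally keeps only codes at or above a confidence threshold. `parse_scp_codes` and the threshold of 50 are illustrative assumptions, not part of the dataset; whether to keep codes with confidence `0.0` depends on how you choose to interpret that value.

```python
import ast


def parse_scp_codes(raw, min_confidence=0.0):
    """Parse a scp_codes cell and keep codes at or above `min_confidence`."""
    codes = ast.literal_eval(raw)  # safe evaluation of the dict literal
    return {code: conf for code, conf in codes.items() if conf >= min_confidence}


cell = "{'SR': 0.0, 'AFIB': 100.0, 'LVH': 80.0}"
print(parse_scp_codes(cell))                       # keeps all three codes
print(parse_scp_codes(cell, min_confidence=50.0))  # {'AFIB': 100.0, 'LVH': 80.0}
```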
### 5.2 Five Diagnostic Superclasses
| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **NORM** | Normal ECG | Normal electrocardiogram |
| **MI** | Myocardial Infarction | Myocardial infarction (14 sub-types) |
| **STTC** | ST/T-Change | ST-segment and T-wave abnormalities (13 sub-types) |
| **CD** | Conduction Disturbance | Conduction disorders (11 sub-types) |
| **HYP** | Hypertrophy | Cardiac hypertrophy (5 sub-types) |
### 5.3 SCP Code Distribution (Top 20)

*Figure: Top 20 SCP diagnostic codes by record count. Colors indicate superclass.*
---
## 6. Statistical Analysis
### 6.1 Demographics

*Left: Sex distribution, male 52.1% vs female 47.9%. Right: Diagnostic superclass distribution; NORM is the most common, followed by MI.*
### 6.2 Age Distribution

*Age is concentrated between 50–80 years (mean 62.8, median 62), reflecting a cardiology clinic population.*
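> **Note:** A small number of records have `age` = 300. According to the PTB-XL documentation, this is an anonymization placeholder for patients older than 89, not a measurement. Clip to a valid range (e.g., 0–110) or flag these records before use.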
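
One way to apply the note above is sketched here. `clean_ages` is a hypothetical helper that clips values to a chosen range and also reports which entries were out of range, so they can be excluded or handled separately. The 0–110 bounds are the example range from the note, not a dataset constant.

```python
def clean_ages(ages, low=0.0, high=110.0):
    """Clip ages to [low, high] and report which entries were out of range."""
    clipped = [min(max(a, low), high) for a in ages]
    flagged = [not (low <= a <= high) for a in ages]
    return clipped, flagged


ages = [56.0, 19.0, 300.0, 74.0]  # made-up values
clipped, flagged = clean_ages(ages)
print(clipped)  # [56.0, 19.0, 110.0, 74.0]
print(flagged)  # [False, False, True, False]
```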
### 6.3 Stratified Fold Distribution

*The 10 folds are well balanced (~2,174–2,198 records each). Folds are assigned per patient, so no patient appears in more than one fold.*
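
The no-overlap property is easy to check yourself. The sketch below uses a hypothetical helper, `patients_in_multiple_folds`, that takes parallel sequences of patient IDs and fold numbers (for example the `patient_id` and `strat_fold` columns) and returns any patient found in more than one fold.

```python
from collections import defaultdict


def patients_in_multiple_folds(patient_ids, folds):
    """Return {patient_id: sorted folds} for patients found in more than one fold."""
    seen = defaultdict(set)
    for pid, fold in zip(patient_ids, folds):
        seen[pid].add(fold)
    return {pid: sorted(f) for pid, f in seen.items() if len(f) > 1}


# Made-up example: patient 7 is deliberately placed in two folds.
patient_ids = [1, 1, 2, 7, 7]
folds = [3, 3, 5, 9, 10]
print(patients_in_multiple_folds(patient_ids, folds))  # {7: [9, 10]}
```

An empty result means every patient is confined to a single fold.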
---
## 7. Loading the Data
### 7.1 Requirements
```bash
pip install wfdb pandas numpy
```
### 7.2 Load a Single Record
```python
import wfdb
import numpy as np
path = "//TRUENAS/wtmh_dataset/ptb-xl/"
# Load 100 Hz version — shape: (1000, 12)
signal, meta = wfdb.rdsamp(path + "records100/00000/00001_lr")
print(signal.shape) # (1000, 12)
print(meta['sig_name']) # ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
print(meta['fs']) # 100
```
### 7.3 Batch Load the Entire Dataset
```python
import pandas as pd
import numpy as np
import wfdb
import ast
path = "//TRUENAS/wtmh_dataset/ptb-xl/"
sampling_rate = 100 # or 500
# Load metadata
Y = pd.read_csv(path + "ptbxl_database.csv", index_col='ecg_id')
Y.scp_codes = Y.scp_codes.apply(ast.literal_eval)
# Load raw signals
def load_raw_data(df, sampling_rate, path):
    col = 'filename_lr' if sampling_rate == 100 else 'filename_hr'
    data = [wfdb.rdsamp(path + f) for f in df[col]]
    return np.array([signal for signal, _ in data])
X = load_raw_data(Y, sampling_rate, path)
# X.shape = (21799, 1000, 12) for 100 Hz
# X.shape = (21799, 5000, 12) for 500 Hz
```
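
Loading everything into one array can take a lot of memory, so it helps to estimate the footprint first. The arithmetic below is a rough sketch that assumes 8 bytes per value (float64, which is understood to be the default precision returned by `wfdb.rdsamp`). `array_size_gib` is a hypothetical helper.

```python
def array_size_gib(n_records, n_samples, n_leads=12, bytes_per_value=8):
    """Size in GiB of a dense (n_records, n_samples, n_leads) array."""
    return n_records * n_samples * n_leads * bytes_per_value / 2**30


print(f"100 Hz: {array_size_gib(21799, 1000):.2f} GiB")  # 100 Hz: 1.95 GiB
print(f"500 Hz: {array_size_gib(21799, 5000):.2f} GiB")  # 500 Hz: 9.74 GiB
```

Casting to float32 would halve these figures, at the cost of some precision.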
### 7.4 Generate Diagnostic Superclass Labels
```python
agg_df = pd.read_csv(path + "scp_statements.csv", index_col=0)
agg_df = agg_df[agg_df.diagnostic == 1]
def aggregate_diagnostic(y_dic):
    return list({agg_df.loc[k].diagnostic_class
                 for k in y_dic if k in agg_df.index})
Y['superclass'] = Y.scp_codes.apply(aggregate_diagnostic)
# Each row's superclass is a list, e.g. ['NORM'] or ['MI', 'CD']
```
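
Most multi-label classifiers expect a fixed-length 0/1 target rather than a list of names. A minimal sketch of that encoding follows. The column order in `CLASSES` and the helper name `to_multi_hot` are arbitrary choices made for illustration.

```python
CLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]  # fixed column order (an assumption)


def to_multi_hot(labels, classes=CLASSES):
    """Encode a list of superclass names as a 0/1 vector in `classes` order."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


print(to_multi_hot(["NORM"]))      # [1, 0, 0, 0, 0]
print(to_multi_hot(["MI", "CD"]))  # [0, 1, 0, 1, 0]
print(to_multi_hot([]))            # [0, 0, 0, 0, 0]
```

A record with no diagnostic statement yields an empty list and therefore an all-zero vector; decide explicitly whether to keep or drop such rows.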
### 7.5 Train/Test Split
```python
test_fold = 10
X_train = X[Y.strat_fold != test_fold]
y_train = Y[Y.strat_fold != test_fold].superclass
X_test = X[Y.strat_fold == test_fold]
y_test = Y[Y.strat_fold == test_fold].superclass
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 19601, Test: 2198
```
---
## 8. Train/Test Split
PTB-XL provides a pre-designed **stratified 10-fold cross-validation** split:
| Strategy | Configuration |
|----------|--------------|
| **Standard evaluation** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Quick prototyping** | Fold 9 = Validation, Fold 10 = Test, Folds 1–8 = Train |
> **Important:** Fold assignment is patient-wise: **all ECGs from the same patient are in the same fold**, preventing data leakage.
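
The strategies in the table reduce to partitioning record indices by fold number. The sketch below shows the train/validation/test variant. `split_indices` is a hypothetical helper, and passing `val_fold=None` gives the standard evaluation split, where everything outside the test fold is used for training.

```python
def split_indices(folds, val_fold=9, test_fold=10):
    """Return (train, val, test) index lists from per-record fold numbers."""
    train, val, test = [], [], []
    for i, fold in enumerate(folds):
        if fold == test_fold:
            test.append(i)
        elif fold == val_fold:
            val.append(i)
        else:
            train.append(i)
    return train, val, test


# Made-up fold assignments for six records.
folds = [1, 9, 10, 4, 10, 8]
train, val, test = split_indices(folds)
print(train, val, test)  # [0, 3, 5] [1] [2, 4]
```

For cross-validation, call the helper once per fold with `test_fold` set to each value from 1 to 10.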
---
## 9. Known Issues
| Issue | Severity | Description |
|-------|----------|-------------|
| RECORDS file corruption at line 21,799 | Low | Two record paths merged on a single line; actual files are intact |
| `height` missing rate 68% | Medium | Not suitable as a training feature |
| `weight` missing rate 56.8% | Medium | Incomplete weight information |
| `heart_axis` missing rate 38.8% | Medium | Handle missing values if used |
| Age values of 300 | Low | Anonymization placeholder for patients older than 89 (per the PTB-XL documentation); clip to [0, 110] or flag before use |
---
## 10. Citation
```bibtex
@article{wagner2020ptb,
title = {PTB-XL, a large publicly available electrocardiography dataset},
author = {Wagner, Patrick and Strodthoff, Nils and Bousseljot, Ralf-Dieter
and Kreiseler, Dieter and Lunze, Fatima I. and Samek, Wojciech
and Schaeffter, Tobias},
journal = {Scientific Data},
volume = {7},
number = {1},
pages = {154},
year = {2020},
publisher = {Nature Publishing Group},
doi = {10.1038/s41597-020-0495-6}
}
```
**PhysioNet:**
> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.
---
*Last updated: 2026-03-19*