# PTB-XL ECG Dataset — Data Summary & Usage Guide

> **Path:** `\\TRUENAS\wtmh_dataset\ptb-xl`
> **Integrity Status:**
> All 21,799 records (87,196 files) present; no missing or zero-byte files
> **Last Verified:** 2026-03-19

---

## Table of Contents

1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System (SCP Codes)](#5-label-system-scp-codes)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)

---

## 1. Dataset Overview

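PTB-XL is a large publicly available 12-lead clinical ECG dataset; when it was published in 2020 it was the largest freely accessible dataset of its kind. The recordings were collected at the Physikalisch-Technische Bundesanstalt (PTB) in Berlin, Germany, between 1989 and 1996.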

| Item | Details |
|------|---------|
| Total Records | **21,799** 10-second ECGs |
| Unique Patients | **18,869** |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **100 Hz** (LR) and **500 Hz** (HR) |
| Collection Period | 1989–1996 |
| Label Types | 71 SCP-ECG codes (multi-label with confidence scores) |
| Diagnostic Superclasses | 5 (NORM, MI, STTC, CD, HYP) |
| File Format | WFDB (PhysioNet standard: `.dat` + `.hea`) |
| Source | [PhysioNet PTB-XL v1.0.3](https://physionet.org/content/ptb-xl/1.0.3/) |

---

## 2. Directory Structure
![fig6_structure](https://hackmd.io/_uploads/SknpVdK9Zx.png)


```
ptb-xl/
├── ptbxl_database.csv        # Metadata and labels (21,799 rows)
├── scp_statements.csv        # SCP code definitions and categories
├── RECORDS                   # WFDB record index (*)
├── SHA256SUMS.txt            # File integrity checksums
├── example_physionet.py      # Official loading example
├── records100/               # 100 Hz (low-resolution)
│   ├── 00000/                # ~1,000 records per subfolder
│   │   ├── 00001_lr.dat      # Binary signal data
│   │   ├── 00001_lr.hea      # Header (sampling rate, lead names, etc.)
│   │   └── ...
│   └── 21000/
└── records500/               # 500 Hz (high-resolution)
    ├── 00000/
    │   ├── 00001_hr.dat
    │   ├── 00001_hr.hea
    │   └── ...
    └── 21000/
```

> **(*) Note:** Line 21,799 of the RECORDS file is corrupted — two record paths are merged onto a single line. The actual signal files are intact and unaffected.
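
The `SHA256SUMS.txt` file listed in the tree above can be used to repeat the integrity check. The following is a minimal sketch using only the standard library. It assumes each line has the form `<hex digest>  <relative path>` (the usual `sha256sum` output); the helper names `sha256_of` and `verify_checksums` are illustrative and not part of any PTB-XL tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(file_path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root, sums_name="SHA256SUMS.txt"):
    """Return relative paths that are missing, empty, or fail their checksum.

    Assumption: one '<hex digest>  <relative path>' entry per line.
    """
    root = Path(root)
    problems = []
    for line in (root / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip().lstrip("*")  # '*' marks binary mode
        target = root / rel_path
        if not target.is_file() or target.stat().st_size == 0:
            problems.append(rel_path)
        elif sha256_of(target) != expected.lower():
            problems.append(rel_path)
    return problems
```

Calling `verify_checksums("//TRUENAS/wtmh_dataset/ptb-xl")` should return an empty list when every listed file is present, non-empty, and unmodified.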

---

## 3. File Format

### 3.1 WFDB Format (.dat + .hea)

Each ECG record consists of a pair of files:

| File | Description |
|------|-------------|
| `.hea` | Plain-text header: sampling rate, number of leads, units, gain |
| `.dat` | Binary raw signal: 16-bit integers, decoded using the `.hea` |

**Sample header (00001_lr.hea):**

```
00001_lr 12 100 1000
00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I
00001_lr.dat 16 1000.0(0)/mV 16 0 -55  723  0 II
...
00001_lr.dat 16 1000.0(0)/mV 16 0 -79  832  0 V6
```

| Field | Meaning |
|-------|---------|
| `12` | 12 leads |
| `100` | Sampling rate (Hz) |
| `1000` | Samples per lead (10 s × 100 Hz) |
| `1000.0(0)/mV` | Gain = 1000 ADU/mV, baseline = 0 |
| `16` | 16-bit integer format |
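
To illustrate how the gain and baseline fields are used, the sketch below parses one signal line from the sample header and converts its initial value from ADC units to mV with `(adu - baseline) / gain`. The field order follows the WFDB header layout (file, format, gain(baseline)/units, ADC resolution, ADC zero, initial value, checksum, block size, description); the function names are hypothetical, and the parser only handles the simple form shown above.

```python
def parse_signal_line(line):
    """Parse one signal line of a WFDB header in the simple form shown above."""
    parts = line.split()
    gain_part, units = parts[2].split("/")
    gain, baseline = gain_part.rstrip(")").split("(")
    return {
        "file": parts[0],
        "format": int(parts[1]),
        "gain": float(gain),          # ADC units per physical unit
        "baseline": int(baseline),    # ADC value that corresponds to 0 mV
        "units": units,
        "adc_resolution": int(parts[3]),
        "adc_zero": int(parts[4]),
        "initial_value": int(parts[5]),
        "checksum": int(parts[6]),
        "block_size": int(parts[7]),
        "lead": parts[8],
    }


def to_physical(adu, gain, baseline):
    """Convert a raw ADC value to physical units."""
    return (adu - baseline) / gain


info = parse_signal_line("00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I")
first_mv = to_physical(info["initial_value"], info["gain"], info["baseline"])
print(info["lead"], first_mv, info["units"])  # I -0.119 mV
```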

### 3.2 Signal Dimensions

| Version | Shape | Description |
|---------|-------|-------------|
| LR (100 Hz) | `(1000, 12)` | 1,000 time steps × 12 leads |
| HR (500 Hz) | `(5000, 12)` | 5,000 time steps × 12 leads |

`wfdb.rdsamp` applies the gain and baseline from the header, so the returned values are already in **mV**.
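
For readers curious about what the decoding involves, here is a rough sketch of reading format-16 data by hand. It assumes little-endian signed 16-bit samples interleaved lead by lead, with the gain (1000) and baseline (0) taken from the sample header. `decode_format16` is a hypothetical helper; in practice `wfdb.rdsamp` does this work.

```python
import struct


def decode_format16(raw, n_leads=12, gain=1000.0, baseline=0):
    """Decode interleaved little-endian int16 samples into rows of mV values."""
    frame_bytes = 2 * n_leads
    if len(raw) % frame_bytes != 0:
        raise ValueError("byte count is not a whole number of frames")
    n_values = len(raw) // 2
    values = struct.unpack(f"<{n_values}h", raw)
    return [
        [(v - baseline) / gain for v in values[start:start + n_leads]]
        for start in range(0, n_values, n_leads)
    ]


# Two synthetic frames of a 2-lead signal (not real PTB-XL data)
demo = struct.pack("<4h", -119, -55, 1000, 0)
print(decode_format16(demo, n_leads=2))  # [[-0.119, -0.055], [1.0, 0.0]]
```

Under these assumptions a 100 Hz record should occupy 1,000 × 12 × 2 = 24,000 bytes and a 500 Hz record 120,000 bytes, which gives a quick sanity check on `.dat` file sizes.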

### 3.3 12-Lead ECG Sample

![fig1_ecg_sample](https://hackmd.io/_uploads/ByU1ruK5Ze.png)

*Figure: ecg_id=1, 500 Hz version, 10-second recording. Each lead views the heart's electrical activity from a different angle.*

---

## 4. Metadata Field Reference

`ptbxl_database.csv` contains one row per ECG record and **28 columns**. The 23 most relevant columns are grouped below:

### Basic Information

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `ecg_id` | int | Unique identifier (1–21,837; IDs are not contiguous, so only 21,799 are in use) | 0% |
| `patient_id` | float | Patient ID (one patient may have multiple ECGs) | 0% |
| `recording_date` | datetime | Recording timestamp (1984–1996) | 0% |
| `filename_lr` | str | Path to 100 Hz file (no extension) | 0% |
| `filename_hr` | str | Path to 500 Hz file (no extension) | 0% |

### Patient Demographics

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `age` | float | Age in years | 0% |
| `sex` | int | Sex (0=Male, 1=Female) | 0% |
| `height` | float | Height (cm) | **68.0%** |
| `weight` | float | Weight (kg) | **56.8%** |
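
The missing rates in these tables can be recomputed directly from the CSV. The sketch below uses a tiny made-up sample in the same column layout (the values are illustrative, not a claim about specific records) and a hypothetical helper `missing_rate`.

```python
import csv
import io


def missing_rate(rows, column):
    """Fraction of rows whose value in `column` is empty or whitespace."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to inspect")
    empty = sum(1 for row in rows if not (row.get(column) or "").strip())
    return empty / len(rows)


# Made-up sample rows in the same layout as the demographic columns
sample_csv = """ecg_id,age,sex,height,weight
1,56.0,1,,63.0
2,19.0,0,,70.0
3,37.0,1,,69.0
4,24.0,0,170.0,
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(missing_rate(rows, "height"))  # 0.75
print(missing_rate(rows, "weight"))  # 0.25
```

With the metadata already loaded into a pandas DataFrame, `df["height"].isna().mean()` gives the same figure in one line.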

### Clinical Labels

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `scp_codes` | str | SCP diagnosis codes + confidence (0–100), stored as a dict literal; parse with `ast.literal_eval` | 0% |
| `heart_axis` | str | Cardiac axis direction | 38.8% |
| `infarction_stadium1` | str | Infarction stage (primary) | ~80% |
| `infarction_stadium2` | str | Infarction stage (secondary) | ~95% |
| `report` | str | Original clinical report (mostly German) | 0% |

### Signal Quality Flags

| Field | Description |
|-------|-------------|
| `baseline_drift` | Baseline drift (affected leads) |
| `static_noise` | Static noise |
| `burst_noise` | Burst noise |
| `electrodes_problems` | Electrode issues |
| `extra_beats` | Extra beats |
| `pacemaker` | Pacemaker presence |

### Validation

| Field | Description |
|-------|-------------|
| `validated_by_human` | Verified by a cardiologist |
| `second_opinion` | Reviewed by a second physician |
| `strat_fold` | Stratified cross-validation fold (1–10) |

---

## 5. Label System (SCP Codes)

### 5.1 Multi-label Structure

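The `scp_codes` field holds a Python dict of the form `{code: confidence}`. In the CSV it is stored as a string, so it must be parsed before use:

```python
{"NORM": 100.0}                              # Normal ECG
{"NORM": 100.0, "LVOLT": 0.0, "SR": 0.0}     # Multi-label: diagnostic, form, and rhythm statements
```

Confidence `100.0` = confirmed. Confidence `0.0` = no likelihood was recorded (typical for form and rhythm statements); it does not mean the statement was ruled out.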
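
As a small illustration of working with these dicts, the sketch below parses a cell as stored in the CSV and selects codes at or above a threshold. The example cell and the helper names `parse_scp_codes` and `codes_at_least` are made up for illustration; the right threshold, and whether to keep `0.0` entries, depends on the task.

```python
import ast


def parse_scp_codes(cell):
    """Turn the CSV string into a {code: value} dict with float values."""
    codes = ast.literal_eval(cell)
    return {str(code): float(value) for code, value in codes.items()}


def codes_at_least(codes, minimum):
    """Return the codes whose value is >= minimum, in alphabetical order."""
    return sorted(code for code, value in codes.items() if value >= minimum)


cell = "{'IMI': 80.0, 'LVH': 100.0, 'SR': 0.0}"  # illustrative, not a real record
codes = parse_scp_codes(cell)
print(codes_at_least(codes, 100))  # ['LVH']
print(codes_at_least(codes, 50))   # ['IMI', 'LVH']
print(codes_at_least(codes, 0))    # ['IMI', 'LVH', 'SR']
```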

### 5.2 Five Diagnostic Superclasses

| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **NORM** | Normal ECG | Normal electrocardiogram |
| **MI** | Myocardial Infarction | Myocardial infarction patterns (14 sub-types) |
| **STTC** | ST/T-Change | ST-segment and T-wave abnormalities (13 sub-types) |
| **CD** | Conduction Disturbance | Conduction disorders (11 sub-types) |
| **HYP** | Hypertrophy | Cardiac hypertrophy (5 sub-types) |

### 5.3 SCP Code Distribution (Top 20)

![fig4_top_scp_codes](https://hackmd.io/_uploads/By_lB_Yq-g.png)

*Figure: Top 20 SCP diagnostic codes by record count. Colors indicate superclass.*

---

## 6. Statistical Analysis

### 6.1 Demographics

![fig3_demographics](https://hackmd.io/_uploads/BJXfHdYqWe.png)

*Left: Sex distribution: Male 52.1% vs Female 47.9% (`sex` is coded 0 = male, 1 = female). Right: Diagnostic superclass distribution; NORM is the most common, followed by MI.*

### 6.2 Age Distribution

![fig2_age_distribution](https://hackmd.io/_uploads/SkzmS_YcZe.png)

*Age is concentrated between 50–80 years (mean 62.8, median 62), reflecting a cardiology clinic population.*

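> **Note:** A small number of records have `age=300`. This is a de-identification placeholder: ages above 89 are stored as 300 to comply with HIPAA rules, so the value is not a data-entry error. Treat these records as "older than 89" (mask them or map them to a capped value) rather than clipping blindly, and exclude them when computing age statistics.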
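
The PhysioNet description of PTB-XL states that ages above 89 are stored as 300 for de-identification. One way to handle this, sketched below, is to separate the numeric age from an "older than 89" flag instead of clipping. `AGE_PLACEHOLDER` and `clean_age` are hypothetical names, and returning `None` for the placeholder is a design choice rather than a requirement.

```python
AGE_PLACEHOLDER = 300  # value used for patients older than 89


def clean_age(age):
    """Return (age_or_None, is_over_89) for one raw age value."""
    if age is None or age != age:   # None or NaN
        return None, False
    if age >= AGE_PLACEHOLDER:
        return None, True           # exact age withheld
    return float(age), False


print(clean_age(56.0))          # (56.0, False)
print(clean_age(300.0))         # (None, True)
print(clean_age(float("nan")))  # (None, False)
```

Keeping the flag as its own feature preserves the information that the patient is elderly without inventing a numeric age.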

### 6.3 Stratified Fold Distribution

![fig5_strat_fold](https://hackmd.io/_uploads/rJdXrOY9-l.png)

*The 10 folds are well balanced (~2,174–2,198 records each). Folds are assigned per patient, so no patient appears in more than one fold.*

---

## 7. Loading the Data

### 7.1 Requirements

```bash
pip install wfdb pandas numpy
```

### 7.2 Load a Single Record

```python
import wfdb
import numpy as np

path = "//TRUENAS/wtmh_dataset/ptb-xl/"

# Load 100 Hz version — shape: (1000, 12)
signal, meta = wfdb.rdsamp(path + "records100/00000/00001_lr")
print(signal.shape)       # (1000, 12)
print(meta['sig_name'])   # ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
print(meta['fs'])         # 100
```

### 7.3 Batch Load the Entire Dataset

```python
import pandas as pd
import numpy as np
import wfdb
import ast

path = "//TRUENAS/wtmh_dataset/ptb-xl/"
sampling_rate = 100  # or 500

# Load metadata
Y = pd.read_csv(path + "ptbxl_database.csv", index_col='ecg_id')
Y.scp_codes = Y.scp_codes.apply(ast.literal_eval)

# Load raw signals
def load_raw_data(df, sampling_rate, path):
    col = 'filename_lr' if sampling_rate == 100 else 'filename_hr'
    data = [wfdb.rdsamp(path + f) for f in df[col]]
    return np.array([signal for signal, _ in data])

X = load_raw_data(Y, sampling_rate, path)
# X.shape = (21799, 1000, 12)  for 100 Hz
# X.shape = (21799, 5000, 12)  for 500 Hz
```
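
Before loading everything into one array, it is worth estimating the memory footprint. The arithmetic below assumes 10-second records, 12 leads, and 8 bytes per value (`wfdb.rdsamp` returns 64-bit floats by default); `array_bytes` is an illustrative helper.

```python
def array_bytes(n_records, seconds, fs, n_leads=12, bytes_per_value=8):
    """Size in bytes of a dense (n_records, seconds * fs, n_leads) array."""
    return n_records * seconds * fs * n_leads * bytes_per_value


for fs in (100, 500):
    size = array_bytes(21_799, 10, fs)
    print(f"{fs} Hz: {size / 1e9:.2f} GB as float64, "
          f"{size / 2 / 1e9:.2f} GB as float32")
# 100 Hz: 2.09 GB as float64, 1.05 GB as float32
# 500 Hz: 10.46 GB as float64, 5.23 GB as float32
```

If the 500 Hz array does not fit in memory, casting each record to `float32` while loading or processing the data fold by fold are two common workarounds.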

### 7.4 Generate Diagnostic Superclass Labels

```python
# Reuses `pd`, `path`, and `Y` from section 7.3
agg_df = pd.read_csv(path + "scp_statements.csv", index_col=0)
agg_df = agg_df[agg_df.diagnostic == 1]  # keep diagnostic statements only

def aggregate_diagnostic(y_dic):
    # Map each diagnostic SCP code to its superclass; sort for a stable order
    return sorted({agg_df.loc[k].diagnostic_class
                   for k in y_dic if k in agg_df.index})

Y['superclass'] = Y.scp_codes.apply(aggregate_diagnostic)
# Each row's superclass is a list, e.g. ['NORM'] or ['CD', 'MI']
```
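
Most multi-label classifiers expect a fixed-length binary vector rather than a list of names. A minimal sketch of that encoding follows; the class order in `SUPERCLASSES` is an arbitrary choice made here, and `multi_hot` is a hypothetical helper.

```python
SUPERCLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]  # order chosen for this sketch


def multi_hot(labels, classes=SUPERCLASSES):
    """Encode a list of superclass names as a 0/1 vector in `classes` order."""
    present = set(labels)
    return [1 if name in present else 0 for name in classes]


print(multi_hot(["NORM"]))      # [1, 0, 0, 0, 0]
print(multi_hot(["MI", "CD"]))  # [0, 1, 0, 1, 0]
print(multi_hot([]))            # [0, 0, 0, 0, 0]
```

Records without any diagnostic statement produce an all-zero vector, so decide explicitly whether to keep or drop them before training.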

### 7.5 Train/Test Split

```python
test_fold = 10

X_train = X[Y.strat_fold != test_fold]
y_train  = Y[Y.strat_fold != test_fold].superclass

X_test   = X[Y.strat_fold == test_fold]
y_test   = Y[Y.strat_fold == test_fold].superclass

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 19601, Test: 2198
```
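
When a validation set is needed as well, the same fold column can drive a three-way split. The sketch below returns index lists, holding out fold 9 for validation and fold 10 for testing; `split_by_fold` is a hypothetical helper and the fold values in the demo are made up.

```python
def split_by_fold(folds, val_fold=9, test_fold=10):
    """Return (train, val, test) index lists from per-record fold numbers."""
    train, val, test = [], [], []
    for index, fold in enumerate(folds):
        if fold == test_fold:
            test.append(index)
        elif fold == val_fold:
            val.append(index)
        else:
            train.append(index)
    return train, val, test


demo_folds = [1, 9, 10, 3, 10, 8]  # made-up fold assignments
train_idx, val_idx, test_idx = split_by_fold(demo_folds)
print(train_idx, val_idx, test_idx)  # [0, 3, 5] [1] [2, 4]
```

With the objects from section 7.3, `split_by_fold(Y.strat_fold.tolist())` yields positions that can index `X` directly, for example `X[train_idx]`.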

---

## 8. Train/Test Split

PTB-XL provides a pre-designed **stratified 10-fold cross-validation** split:

| Strategy | Configuration |
|----------|--------------|
| **Standard evaluation** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Train/validation/test** (split suggested by the dataset authors) | Folds 1–8 = Train, Fold 9 = Validation, Fold 10 = Test |

> **Important:** Folds are assigned per patient, so **all ECGs from the same patient are in the same fold**, which prevents patient-level data leakage between train and test sets.
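
This property is easy to verify after any filtering or re-splitting. The sketch below takes `(patient_id, fold)` pairs and reports patients that appear in more than one fold; the helper name and the demo values are illustrative.

```python
from collections import defaultdict


def patients_in_multiple_folds(records):
    """Map each patient seen in more than one fold to its sorted fold list."""
    folds_by_patient = defaultdict(set)
    for patient_id, fold in records:
        folds_by_patient[patient_id].add(fold)
    return {
        patient: sorted(folds)
        for patient, folds in folds_by_patient.items()
        if len(folds) > 1
    }


clean = [(101, 3), (101, 3), (102, 10)]  # made-up pairs
leaky = [(101, 3), (101, 10), (102, 10)]
print(patients_in_multiple_folds(clean))  # {}
print(patients_in_multiple_folds(leaky))  # {101: [3, 10]}
```

With the metadata in a DataFrame, the pairs can be built with `zip(Y.patient_id, Y.strat_fold)`; an empty result means no patient crosses a fold boundary.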

---

## 9. Known Issues

| Issue | Severity | Description |
|-------|----------|-------------|
| RECORDS file corruption at line 21,799 | Low | Two record paths merged on a single line; actual files are intact |
| `height` missing rate 68% | Medium | Not suitable as a training feature |
| `weight` missing rate 56.8% | Medium | Incomplete weight information |
| `heart_axis` missing rate 38.8% | Medium | Handle missing values if used |
| Age placeholder `300` for patients older than 89 | Low | De-identification value, not a measurement; mask or cap before use |
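
If the `RECORDS` index is needed despite the merged line, one workaround is to extract record paths by pattern instead of splitting on newlines. The sketch below assumes every path looks like `records100/00000/00001_lr` or `records500/00000/00001_hr`, as in the directory tree; the exact content of the damaged line is not reproduced here, so the demo input is made up.

```python
import re

# Assumed path shape: records<rate>/<5-digit folder>/<5-digit id>_<lr|hr>
RECORD_PATH = re.compile(r"records(?:100|500)/\d{5}/\d{5}_(?:lr|hr)")


def read_record_paths(lines):
    """Collect every record path found in the given lines, in order."""
    paths = []
    for line in lines:
        paths.extend(RECORD_PATH.findall(line))
    return paths


demo_lines = [
    "records100/00000/00001_lr\n",
    "records500/21000/21836_hrrecords500/21000/21837_hr\n",  # merged line
]
print(read_record_paths(demo_lines))
# ['records100/00000/00001_lr', 'records500/21000/21836_hr', 'records500/21000/21837_hr']
```

Passing an open file handle for `RECORDS` as `lines` works as well, since iterating over a text file yields one line at a time.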

---

## 10. Citation

```bibtex
@article{wagner2020ptb,
  title     = {PTB-XL, a large publicly available electrocardiography dataset},
  author    = {Wagner, Patrick and Strodthoff, Nils and Bousseljot, Ralf-Dieter
               and Kreiseler, Dieter and Lunze, Fatima I. and Samek, Wojciech
               and Schaeffter, Tobias},
  journal   = {Scientific Data},
  volume    = {7},
  number    = {1},
  pages     = {154},
  year      = {2020},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41597-020-0495-6}
}
```

**PhysioNet:**

> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.

---

*Last updated: 2026-03-19*
