
# PTB-XL ECG Dataset: Data Summary & Usage Guide

> **Path:** `\\TRUENAS\wtmh_dataset\ptb-xl`
> **Integrity Status:** All 21,799 records (87,196 files) present; no missing or zero-byte files
> **Last Verified:** 2026-03-19

---

## Table of Contents

1. [Dataset Overview](#1-dataset-overview)
2. [Directory Structure](#2-directory-structure)
3. [File Format](#3-file-format)
4. [Metadata Field Reference](#4-metadata-field-reference)
5. [Label System (SCP Codes)](#5-label-system-scp-codes)
6. [Statistical Analysis](#6-statistical-analysis)
7. [Loading the Data](#7-loading-the-data)
8. [Train/Test Split](#8-traintest-split)
9. [Known Issues](#9-known-issues)
10. [Citation](#10-citation)

---

## 1. Dataset Overview

PTB-XL is one of the largest publicly available 12-lead clinical ECG datasets. The recordings were collected at the Physikalisch-Technische Bundesanstalt (PTB) in Berlin, Germany, between 1989 and 1996.

| Item | Details |
|------|---------|
| Total Records | **21,799** 10-second ECGs |
| Unique Patients | **18,869** |
| Leads | **12** (I, II, III, aVR, aVL, aVF, V1–V6) |
| Sampling Rate | **100 Hz** (LR) and **500 Hz** (HR) |
| Collection Period | 1989–1996 |
| Label Types | 71 SCP-ECG codes (multi-label with confidence scores) |
| Diagnostic Superclasses | 5 (NORM, MI, STTC, CD, HYP) |
| File Format | WFDB (PhysioNet standard: `.dat` + `.hea`) |
| Source | [PhysioNet PTB-XL v1.0.3](https://physionet.org/content/ptb-xl/1.0.3/) |

---

## 2. Directory Structure

![fig6_structure](https://hackmd.io/_uploads/SknpVdK9Zx.png)

```
ptb-xl/
├── ptbxl_database.csv        # Metadata and labels (21,799 rows)
├── scp_statements.csv        # SCP code definitions and categories
├── RECORDS                   # WFDB record index (*)
├── SHA256SUMS.txt            # File integrity checksums
├── example_physionet.py      # Official loading example
├── records100/               # 100 Hz (low-resolution)
│   ├── 00000/                # ~1,000 records per subfolder
│   │   ├── 00001_lr.dat      # Binary signal data
│   │   ├── 00001_lr.hea      # Header (sampling rate, lead names, etc.)
│   │   └── ...
│   └── 21000/
└── records500/               # 500 Hz (high-resolution)
    ├── 00000/
    │   ├── 00001_hr.dat
    │   ├── 00001_hr.hea
    │   └── ...
    └── 21000/
```

> **(*) Note:** Line 21,799 of the RECORDS file is corrupted: two record paths are merged onto a single line. The actual signal files are intact and unaffected.

---

## 3. File Format

### 3.1 WFDB Format (.dat + .hea)

Each ECG record consists of a pair of files:

| File | Description |
|------|-------------|
| `.hea` | Plain-text header: sampling rate, number of leads, units, gain |
| `.dat` | Binary raw signal: 16-bit integers, decoded using the `.hea` |

**Sample header (00001_lr.hea):**

```
00001_lr 12 100 1000
00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I
00001_lr.dat 16 1000.0(0)/mV 16 0 -55 723 0 II
...
00001_lr.dat 16 1000.0(0)/mV 16 0 -79 832 0 V6
```

| Field | Meaning |
|-------|---------|
| `12` | Number of leads (signals) |
| `100` | Sampling rate (Hz) |
| `1000` | Samples per lead (10 s × 100 Hz) |
| `1000.0(0)/mV` | Gain = 1000 ADU/mV, baseline = 0 |
| `16` | Storage format: 16-bit integers (the second `16` on each signal line is the ADC resolution in bits) |

### 3.2 Signal Dimensions

| Version | Shape | Description |
|---------|-------|-------------|
| LR (100 Hz) | `(1000, 12)` | 1,000 time steps × 12 leads |
| HR (500 Hz) | `(5000, 12)` | 5,000 time steps × 12 leads |

After decoding, values are in **mV**.

### 3.3 12-Lead ECG Sample

![fig1_ecg_sample](https://hackmd.io/_uploads/ByU1ruK5Ze.png)

*Figure: ecg_id=1, 500 Hz version, 10-second recording. Each lead views the heart's electrical activity from a different angle.*

---
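To make the header fields from Section 3.1 concrete, the sketch below parses one signal line of the sample header and converts a raw sample to millivolts using `(raw - baseline) / gain`. It only handles the line layout shown in the sample header; `parse_signal_line` and `adc_to_physical` are hypothetical helper names, not part of the `wfdb` package, which performs this conversion internally when you call `wfdb.rdsamp`.

```python
import re

# Matches the gain/baseline/units token of a signal line, e.g. "1000.0(0)/mV".
_GAIN_RE = re.compile(r"^(?P<gain>[\d.]+)\((?P<baseline>-?\d+)\)/(?P<units>\S+)$")


def parse_signal_line(line):
    """Parse one signal line of a .hea file (layout as in the sample header)."""
    parts = line.split()
    match = _GAIN_RE.match(parts[2])
    if match is None:
        raise ValueError(f"unexpected gain field: {parts[2]!r}")
    return {
        "file": parts[0],
        "format": int(parts[1]),           # 16 = 16-bit integer storage
        "gain": float(match.group("gain")),
        "baseline": int(match.group("baseline")),
        "units": match.group("units"),
        "initial_value": int(parts[5]),    # first raw sample of this lead
        "lead": parts[8],
    }


def adc_to_physical(raw_samples, gain, baseline):
    """Convert raw ADC units to physical units: (raw - baseline) / gain."""
    return [(sample - baseline) / gain for sample in raw_samples]


spec = parse_signal_line("00001_lr.dat 16 1000.0(0)/mV 16 0 -119 1508 0 I")
first_mv = adc_to_physical([spec["initial_value"]], spec["gain"], spec["baseline"])[0]
print(spec["lead"], first_mv, spec["units"])  # I -0.119 mV
```

With a gain of 1000 ADU/mV and a baseline of 0, the first raw sample of lead I (`-119`) corresponds to -0.119 mV.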
## 4. Metadata Field Reference

`ptbxl_database.csv` contains one row per ECG record with **28 columns**. The most commonly used ones are grouped below.

### Basic Information

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `ecg_id` | int | Unique identifier (1–21,837, with gaps) | 0% |
| `patient_id` | float | Patient ID (one patient may have multiple ECGs) | 0% |
| `recording_date` | datetime | Recording timestamp (1984–1996) | 0% |
| `filename_lr` | str | Path to 100 Hz file (no extension) | 0% |
| `filename_hr` | str | Path to 500 Hz file (no extension) | 0% |

### Patient Demographics

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `age` | float | Age in years | 0% |
| `sex` | int | Sex (0=Male, 1=Female) | 0% |
| `height` | float | Height (cm) | **68.0%** |
| `weight` | float | Weight (kg) | **56.8%** |

### Clinical Labels

| Field | Type | Description | Missing |
|-------|------|-------------|---------|
| `scp_codes` | str (dict literal) | SCP diagnosis codes + confidence (0–100) | 0% |
| `heart_axis` | str | Cardiac axis direction | 38.8% |
| `infarction_stadium1` | str | Infarction stage (primary) | ~80% |
| `infarction_stadium2` | str | Infarction stage (secondary) | ~95% |
| `report` | str | Original clinical report (German) | 0% |

### Signal Quality Flags

| Field | Description |
|-------|-------------|
| `baseline_drift` | Baseline drift (affected leads) |
| `static_noise` | Static noise |
| `burst_noise` | Burst noise |
| `electrodes_problems` | Electrode issues |
| `extra_beats` | Extra beats |
| `pacemaker` | Pacemaker presence |

### Validation

| Field | Description |
|-------|-------------|
| `validated_by_human` | Verified by a cardiologist |
| `second_opinion` | Reviewed by a second physician |
| `strat_fold` | Stratified cross-validation fold (1–10) |

---
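The "Missing" percentages in the tables above are the fraction of rows whose cell is empty. As a minimal sketch of that calculation, the block below uses only the standard library on a tiny synthetic CSV; the rows are made up for illustration and are not real PTB-XL records, and `missing_rates` is a hypothetical helper name.

```python
import csv
import io

# Synthetic toy rows, NOT real PTB-XL records; only the column names match.
TOY_CSV = """ecg_id,age,sex,height,weight
1,56,1,,63
2,19,0,,70
3,37,1,170,
4,24,0,182,81
"""


def missing_rates(csv_text):
    """Return {column: fraction of rows whose cell is empty}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        col: sum(1 for row in rows if row[col].strip() == "") / len(rows)
        for col in rows[0]
    }


rates = missing_rates(TOY_CSV)
for col, rate in rates.items():
    print(f"{col:8s} {rate:.1%}")
```

On the real file loaded with pandas, `Y.isna().mean()` gives the same per-column fractions in one call.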
## 5. Label System (SCP Codes)

### 5.1 Multi-label Structure

The `scp_codes` field holds a Python dict of the form `{code: likelihood}`, stored in the CSV as a string:

```python
{"NORM": 100.0}                           # Normal ECG
{"SR": 0.0, "AFIB": 100.0, "LVH": 80.0}   # Multi-label
```

A likelihood of `0.0` means the likelihood is unknown (the statement was recorded without one); `100.0` means the statement is confirmed.

### 5.2 Five Diagnostic Superclasses

| Superclass | Full Name | Description |
|------------|-----------|-------------|
| **NORM** | Normal ECG | Normal electrocardiogram |
| **MI** | Myocardial Infarction | 14 sub-types |
| **STTC** | ST/T-Change | ST-segment and T-wave abnormalities (13 sub-types) |
| **CD** | Conduction Disturbance | Conduction disorders (11 sub-types) |
| **HYP** | Hypertrophy | Cardiac hypertrophy (5 sub-types) |

### 5.3 SCP Code Distribution (Top 20)

![fig4_top_scp_codes](https://hackmd.io/_uploads/By_lB_Yq-g.png)

*Figure: Top 20 SCP diagnostic codes by record count. Colors indicate superclass.*

---

## 6. Statistical Analysis

### 6.1 Demographics

![fig3_demographics](https://hackmd.io/_uploads/BJXfHdYqWe.png)

*Left: Sex distribution (52.1% male, 47.9% female). Right: Diagnostic superclass distribution; NORM is the most common, followed by MI.*

### 6.2 Age Distribution

![fig2_age_distribution](https://hackmd.io/_uploads/SkzmS_YcZe.png)

*Age is concentrated between 50–80 years (mean 62.8, median 62), reflecting a cardiology clinic population.*

> **Note:** A small number of records have `age=300`. This is a de-identification placeholder for patients older than 89, not a data-entry error. Replace it with a capped value (e.g., 90) or treat it as missing before use.

### 6.3 Stratified Fold Distribution

![fig5_strat_fold](https://hackmd.io/_uploads/rJdXrOY9-l.png)

*The 10 folds are well balanced (~2,174–2,198 records each). Folds are assigned patient-wise, so no patient appears in more than one fold.*

---
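Two cleaning steps follow from Sections 5.1 and 6.2: turning a `scp_codes` cell into a dict and filtering it by likelihood, and handling the `age=300` placeholder. The sketch below shows one way to do both with the standard library. It relies on the PTB-XL conventions that a likelihood of 0 means "unknown" and that `300` marks patients older than 89. The helper names, the `keep_unknown` switch, and the cap of 90 are illustrative assumptions, not part of the dataset.

```python
import ast

AGE_PLACEHOLDER = 300  # value PTB-XL uses for patients older than 89


def parse_scp_codes(raw, min_likelihood=0.0, keep_unknown=True):
    """Parse a scp_codes cell and keep the statements that pass the filter.

    A likelihood of 0 means "unknown"; keep_unknown decides whether such
    statements are kept. All other statements need likelihood >= min_likelihood.
    """
    codes = ast.literal_eval(raw)  # safe parsing of the dict literal
    kept = {}
    for code, likelihood in codes.items():
        if likelihood == 0:
            if keep_unknown:
                kept[code] = likelihood
        elif likelihood >= min_likelihood:
            kept[code] = likelihood
    return kept


def clean_age(age, cap=90.0):
    """Replace the de-identification placeholder with a capped age (assumed cap)."""
    return cap if age == AGE_PLACEHOLDER else age


raw = "{'SR': 0.0, 'AFIB': 100.0, 'LVH': 80.0}"
print(parse_scp_codes(raw, min_likelihood=100.0))
print(parse_scp_codes(raw, min_likelihood=100.0, keep_unknown=False))
print(clean_age(300.0), clean_age(56.0))
```

Whether to keep unknown-likelihood statements depends on the task: dropping them also removes statements that were simply recorded without a likelihood.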
## 7. Loading the Data

### 7.1 Requirements

```bash
pip install wfdb pandas numpy
```

### 7.2 Load a Single Record

```python
import wfdb

path = "//TRUENAS/wtmh_dataset/ptb-xl/"

# Load the 100 Hz version; rdsamp returns (signal array, metadata dict)
signal, meta = wfdb.rdsamp(path + "records100/00000/00001_lr")

print(signal.shape)       # (1000, 12)
print(meta['sig_name'])   # ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
print(meta['fs'])         # 100
```

### 7.3 Batch Load the Entire Dataset

```python
import ast

import numpy as np
import pandas as pd
import wfdb

path = "//TRUENAS/wtmh_dataset/ptb-xl/"
sampling_rate = 100  # or 500

# Load metadata; scp_codes is stored as a dict literal string, so parse it
Y = pd.read_csv(path + "ptbxl_database.csv", index_col='ecg_id')
Y.scp_codes = Y.scp_codes.apply(ast.literal_eval)

# Load raw signals
def load_raw_data(df, sampling_rate, path):
    """Read every record listed in df and stack the signals into one array."""
    col = 'filename_lr' if sampling_rate == 100 else 'filename_hr'
    data = [wfdb.rdsamp(path + f) for f in df[col]]
    return np.array([signal for signal, _ in data])

X = load_raw_data(Y, sampling_rate, path)
# X.shape == (21799, 1000, 12) for 100 Hz
# X.shape == (21799, 5000, 12) for 500 Hz
```

### 7.4 Generate Diagnostic Superclass Labels

```python
# Reuses pd, path, and Y from Section 7.3
agg_df = pd.read_csv(path + "scp_statements.csv", index_col=0)
agg_df = agg_df[agg_df.diagnostic == 1]  # keep diagnostic statements only

def aggregate_diagnostic(y_dic):
    """Map a record's SCP codes to its sorted list of diagnostic superclasses."""
    return sorted({agg_df.loc[k].diagnostic_class for k in y_dic if k in agg_df.index})

Y['superclass'] = Y.scp_codes.apply(aggregate_diagnostic)
# Each row's superclass is a list, e.g. ['NORM'] or ['CD', 'MI']
```

### 7.5 Train/Test Split

```python
test_fold = 10
is_test = (Y.strat_fold == test_fold).to_numpy()  # boolean mask aligned with X

X_train = X[~is_test]
y_train = Y.superclass[~is_test]
X_test = X[is_test]
y_test = Y.superclass[is_test]

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train: 19601, Test: 2198
```

---
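The `superclass` column from Section 7.4 holds a list of class names per record, while most multi-label classifiers expect a fixed-length 0/1 target vector. The sketch below shows one possible encoding in plain Python. `multi_hot` and the class order in `SUPERCLASSES` are illustrative choices, not something the dataset prescribes.

```python
# The five diagnostic superclasses, in an arbitrary but fixed order (assumed here).
SUPERCLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]


def multi_hot(labels, classes=SUPERCLASSES):
    """Encode a list of class names as a 0/1 vector ordered like `classes`."""
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in classes]


print(multi_hot(["NORM"]))       # [1, 0, 0, 0, 0]
print(multi_hot(["MI", "CD"]))   # [0, 1, 0, 1, 0]
print(multi_hot([]))             # [0, 0, 0, 0, 0]
```

Applied to the DataFrame from Section 7.4, something like `np.array(Y.superclass.apply(multi_hot).tolist())` would yield a `(21799, 5)` target matrix. A record whose list is empty maps to an all-zero row, so decide explicitly whether to keep or drop such rows.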
## 8. Train/Test Split

PTB-XL provides a pre-designed **stratified 10-fold cross-validation** split:

| Strategy | Configuration |
|----------|--------------|
| **Standard evaluation** (recommended) | Fold 10 = Test, Folds 1–9 = Train |
| **Cross-validation** | Rotate each fold as the test set |
| **Quick prototyping** | Fold 9 = Validation, Fold 10 = Test, rest = Train |

> **Important:** Folds are assigned patient-wise: **all ECGs from the same patient are in the same fold**, which prevents patient-level data leakage.

---

## 9. Known Issues

| Issue | Severity | Description |
|-------|----------|-------------|
| RECORDS file corruption at line 21,799 | Low | Two record paths merged on a single line; actual files are intact |
| `height` missing rate 68% | Medium | Not suitable as a training feature |
| `weight` missing rate 56.8% | Medium | Incomplete weight information |
| `heart_axis` missing rate 38.8% | Medium | Handle missing values if used |
| Placeholder age values (`age=300`) | Low | Marks patients older than 89 (de-identification); cap (e.g., at 90) or mask before use |

---

## 10. Citation

```bibtex
@article{wagner2020ptb,
  title     = {PTB-XL, a large publicly available electrocardiography dataset},
  author    = {Wagner, Patrick and Strodthoff, Nils and Bousseljot, Ralf-Dieter and Kreiseler, Dieter and Lunze, Fatima I. and Samek, Wojciech and Schaeffter, Tobias},
  journal   = {Scientific Data},
  volume    = {7},
  number    = {1},
  pages     = {154},
  year      = {2020},
  publisher = {Nature Publishing Group},
  doi       = {10.1038/s41597-020-0495-6}
}
```

**PhysioNet:**

> Goldberger AL, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23):e215–e220, 2000.

---

*Last updated: 2026-03-19*