# Table-GPT: Table-tuned GPT for Diverse Table Tasks
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
This repository contains training and test datasets for the SIGMOD'24 paper [Table-GPT: Table-tuned GPT for Diverse Table Tasks](https://arxiv.org/abs/2310.09263). The source code for data generation and task evaluation is available here: [Table-GPT](https://github.com/microsoft/Table-GPT), which can be used to generate more training data for table-related tasks.
**Task Descriptions**
We collect (or synthesize) 18 diverse table-related tasks, summarized in the table below. There are 14 training tasks (T-5 to T-18) and 9 test tasks (T-1 to T-9). Tasks T-1 to T-4 are unseen hold-out tasks, used to evaluate Table-GPT's ability to generalize to completely new and unseen tasks, while tasks T-10 to T-18 are used for training only.
|Task Name| Task Description| Task Category| Train/Test|
|---|---|---|---|
|T-1: Missing-Value Identification (MV)| Identify the row and column position of the only missing cell in a given table| Table understanding |Test only|
|T-2: Column Finding (CF) |Identify the column-name of a specific value that appears only once in a given table| Table understanding |Test only|
|T-3: Table-QA (TQA) |Answer a natural-language question based on the content of a table |Table QA |Test only|
|T-4: Column Type Annotation (CTA) |Find the semantic type of a column from a given list of choices| Table understanding| Test only|
|T-5: Row-to-row Transformation (R2R)| Transform table data based on input/output examples |Data transformation |Train/Test|
|T-6: Entity Matching (EM)| Match rows from two tables that refer to the same real-world entity |Table matching| Train/Test|
|T-7: Schema Matching (SM) |Match columns from two tables that refer to the same meaning |Table matching |Train/Test|
|T-8: Data Imputation (DI) |Predict the missing values in a cell based on the table context |Data cleaning |Train/Test|
|T-9: Error Detection (ED)| Detect data values in a table that are likely errors due to misspelling |Data cleaning |Train/Test|
|T-10: List Extraction (LE) |Extract a structured table from a list that lacks explicit column delimiters |Data transformation |Train only|
|T-11: Header Value Matching (HVM)| Match column-headers with their data values drawn from the same table |Table matching |Train only|
|T-12: Natural-Language to SQL (NS) |Translate a natural-language question on a table into a SQL query| NL-to-SQL |Train only|
|T-13: Table Summarization (TS) |Produce a natural-language summary for the content in a table |Data augmentation |Train only|
|T-14: Column Augmentation (CA)| Augment a table with additional columns compatible with a given table |Data augmentation |Train only|
|T-15: Row Augmentation (RA) |Augment a table with additional rows compatible with a given table |Data augmentation |Train only|
|T-16: Row/Column Swapping (RCSW)| Manipulate a given table by swapping the position of two rows or columns |Table manipulation |Train only|
|T-17: Row/Column Filtering (RCF)| Manipulate a given table by filtering on given rows or columns |Table manipulation |Train only|
|T-18: Row/Column Sorting (RCS)| Manipulate a given table by performing sorting on given rows or columns |Table manipulation |Train only|
## Dataset Structure
### Data Instances
The structure of this repository is as follows.
```
Table-GPT
├── train
│ ├── train_All.jsonl # the merged training data of all training tasks
│ ├── train_{task_name}.jsonl # the training data for a specific training task
│ └── ...
│
├── test
│ ├── test_All.jsonl # the merged test data of all test tasks
│ ├── test_{task_name}.jsonl # the test data for a specific test task
│ └── ...
│
└── train_large
├── train_large_All.jsonl # a larger training set with additional data
├── train_large_{task_name}.jsonl # the additional training data for a specific training task
└── ...
```
### Data Fields
Each line in a .jsonl file represents a single example, containing the following fields:
* `task`: The name of the task associated with the example.
* `dataset`: The name of the dataset from which the example originates.
* `prompt`: The input prompt provided to the model for generating a response.
* `completion`: The generated output response corresponding to the given prompt.
* `messages`: A list of messages that combine the prompt and completion, typically used in chat-oriented models.
* `metadata`: A dictionary of additional information about the example.
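Each record can be parsed with any standard JSON library. Below is a minimal Python sketch of loading and inspecting an example; the inline sample record is fabricated purely for illustration (real `task` and `dataset` values will differ), and the file path in the final comment assumes the repository layout shown above.

```python
import json

def load_jsonl(path):
    """Read a Table-GPT .jsonl file into a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A fabricated single-line sample in the same shape as the real files,
# so this snippet runs on its own; the field values are illustrative only.
sample = json.dumps({
    "task": "EntityMatching",
    "dataset": "example",
    "prompt": "Do the two rows refer to the same entity? ...",
    "completion": "Yes",
    "messages": [
        {"role": "user", "content": "Do the two rows refer to the same entity? ..."},
        {"role": "assistant", "content": "Yes"},
    ],
    "metadata": {},
})

example = json.loads(sample)
print(example["task"], "->", example["completion"])
# For the real data: examples = load_jsonl("train/train_All.jsonl")
```

The `prompt`/`completion` pair suits completion-style fine-tuning, while `messages` carries the same content in the chat format expected by chat-oriented models.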
## Reference
We would like to acknowledge Peng Li et al. for creating and maintaining the Table-GPT dataset as a valuable resource for the data management and machine learning research communities. For more information about the Table-GPT dataset and its creators, please visit [the Table-GPT website](https://huggingface.co/datasets/LipengCS/Table-GPT).
## License
The dataset has been released under the MIT License.
## Citation
```
@article{li2023tablegpt,
  title={Table-GPT: Table-tuned GPT for Diverse Table Tasks},
  author={Li, Peng and He, Yeye and Yashar, Dror and Cui, Weiwei and Ge, Song and Zhang, Haidong and Fainman, Danielle Rifinski and Zhang, Dongmei and Chaudhuri, Surajit},
  journal={arXiv preprint arXiv:2310.09263},
  year={2023}
}
```