# Table-GPT: Table-tuned GPT for Diverse Table Tasks
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
This repository contains training and test datasets for the SIGMOD'24 paper [Table-GPT: Table-tuned GPT for Diverse Table Tasks](https://arxiv.org/abs/2310.09263). The source code for data generation and task evaluation is available here: [Table-GPT](https://github.com/microsoft/Table-GPT), which can be used to generate more training data for table-related tasks.
**Task Descriptions**
We collect (or synthesize) 18 diverse table-related tasks, summarized in the table below. There are 14 training tasks (T-5 to T-18) and 9 test tasks (T-1 to T-9). Tasks T-1 to T-4 are unseen hold-out tasks, used to evaluate Table-GPT's ability to generalize to completely new and unseen tasks, while tasks T-10 to T-18 are used for training only.
|Task Name| Task Description| Task Category| Train/Test|
|---|---|---|---|
|T-1: Missing-Value Identification (MV)| Identify the row and column position of the only missing cell in a given table| Table understanding |Test only|
|T-2: Column Finding (CF) |Identify the column-name of a specific value that appears only once in a given table| Table understanding |Test only|
|T-3: Table-QA (TQA) |Answer a natural-language question based on the content of a table |Table QA |Test only|
|T-4: Column Type Annotation (CTA) |Find the semantic type of a column from a given list of choices| Table understanding| Test only|
|T-5: Row-to-row Transformation (R2R)| Transform table data based on input/output examples |Data transformation |Train/Test|
|T-6: Entity Matching (EM)| Match rows from two tables that refer to the same real-world entity |Table matching| Train/Test|
|T-7: Schema Matching (SM) |Match columns from two tables that refer to the same meaning |Table matching |Train/Test|
|T-8: Data Imputation (DI) |Predict the missing values in a cell based on the table context |Data cleaning |Train/Test|
|T-9: Error Detection (ED)| Detect data values in a table that are likely errors due to misspelling |Data cleaning |Train/Test|
|T-10: List Extraction (LE) |Extract a structured table from a list that lacks explicit column delimiters |Data transformation |Train only|
|T-11: Header Value Matching (HVM)| Match column-headers with their data values drawn from the same table |Table matching |Train only|
|T-12: Natural-Language to SQL (NS) |Translate a natural-language question on a table into a SQL query| NL-to-SQL |Train only|
|T-13: Table Summarization (TS) |Produce a natural-language summary for the content in a table |Data augmentation |Train only|
|T-14: Column Augmentation (CA)| Augment a table with additional columns compatible with a given table |Data augmentation |Train only|
|T-15: Row Augmentation (RA) |Augment a table with additional rows compatible with a given table |Data augmentation |Train only|
|T-16: Row/Column Swapping (RCSW)| Manipulate a given table by swapping the position of two rows or columns |Table manipulation |Train only|
|T-17: Row/Column Filtering (RCF)| Manipulate a given table by filtering on given rows or columns |Table manipulation |Train only|
|T-18: Row/Column Sorting (RCS)| Manipulate a given table by performing sorting on given rows or columns |Table manipulation |Train only|
## Dataset Structure
### Data Instances
The structure of this repository is as follows.
```
Table-GPT
├── train
│ ├── train_All.jsonl # the merged training data of all training tasks
│ ├── train_{task_name}.jsonl # the training data for a specific training task
│ └── ...
│
├── test
│ ├── test_All.jsonl # the merged test data of all test tasks
│ ├── test_{task_name}.jsonl # the test data for a specific test task
│ └── ...
│
└── train_large
├── train_large_All.jsonl # a larger training set with additional data
├── train_large_{task_name}.jsonl # the additional training data for a specific training task
└── ...
```
### Data Fields
Each line in a .jsonl file represents a single example, containing the following fields:
* `task`: The name of the task associated with the example.
* `dataset`: The name of the dataset from which the example originates.
* `prompt`: The input prompt provided to the model for generating a response.
* `completion`: The generated output response corresponding to the given prompt.
* `messages`: A list of messages that combine the prompt and completion, typically used in chat-oriented models.
* `metadata`: A dictionary of additional information about the example.
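Each record can be parsed with any standard JSON library. Below is a minimal Python sketch of loading and inspecting an example; the inline sample record is fabricated purely for illustration (real `task` and `dataset` values will differ), and the file path in the final comment assumes the repository layout shown above.

```python
import json

def load_jsonl(path):
    """Read a Table-GPT .jsonl file into a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A fabricated single-line sample in the same shape as the real files,
# so this snippet runs on its own; the field values are illustrative only.
sample = json.dumps({
    "task": "EntityMatching",
    "dataset": "example",
    "prompt": "Do the two rows refer to the same entity? ...",
    "completion": "Yes",
    "messages": [
        {"role": "user", "content": "Do the two rows refer to the same entity? ..."},
        {"role": "assistant", "content": "Yes"},
    ],
    "metadata": {},
})

example = json.loads(sample)
print(example["task"], "->", example["completion"])
# For the real data: examples = load_jsonl("train/train_All.jsonl")
```

The `prompt`/`completion` pair suits completion-style fine-tuning, while `messages` carries the same content in the chat format expected by chat-oriented models.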
## Reference
We would like to acknowledge Peng Li et al. for creating and maintaining the Table-GPT dataset as a valuable resource for the data management and machine learning research communities. For more information about the Table-GPT dataset and its creators, please visit [the Table-GPT website](https://huggingface.co/datasets/LipengCS/Table-GPT).
## License
The dataset has been released under the MIT License.
## Citation
```
@article{li2023tablegpt,
  title={Table-GPT: Table-tuned GPT for Diverse Table Tasks},
  author={Li, Peng and He, Yeye and Yashar, Dror and Cui, Weiwei and Ge, Song and Zhang, Haidong and Fainman, Danielle Rifinski and Zhang, Dongmei and Chaudhuri, Surajit},
  journal={arXiv preprint arXiv:2310.09263},
  year={2023}
}
```