# Research Note - Shaowen Wang

## May 5

T5: x -> y, **SET prefix at y**

GPT2:
- Train: Table_representation, the feature is Add ( F1, F2 ) / Sub ( F3, F4 )
- Test: Table_representation, the feature is Add ___

Every table -> 400 features -> (table, feature, improvement); 83 * 400 in total, balanced across 5 levels: > 0.2, 0.1-0.2, 0-0.1, < 0

T5:
- x: Table_representation, (good/medium/poor) feature is
- y: feature_representation

e.g. x: Table_representation, good feature is

![](https://i.imgur.com/ks4ANF5.png)
![](https://i.imgur.com/pMgVm2N.png)
![](https://i.imgur.com/wGVh8iB.png)
![](https://i.imgur.com/KP7p4MO.png)

## April 27

- [ ] Better representation of the table
- [ ] Feature text => feature, test performance improvement (we can refer to the same experiment setting as OpenFE)
- [ ] Increase the number of features in the training set
- [ ] Can we rank our features by PPL?

### Scaling num_return_seq for better recall

#### GPU loads

Great news! With batch_size = 1 and num_return_seq = 100, beam search consumes less than 20 GB of memory. I can set num_return_seq = 400 if I use an A100 80GB.

## April 26

### Good features are rare

We counted the features of each dataset in the ground truth and found that almost one third of the datasets have only one feature.

![](https://i.imgur.com/qvfajDc.png)

### Efficient Inference Strategy for One-to-Many Neural Text Generation

**Definition**: The one-to-many problem involves having n inputs (x) and generating n * num_return_seq outputs (y), with each input having num_return_seq outputs.

**Batch**: The n inputs (x) are processed batch_size inputs at a time.

**Output_chunk**: To generate num_return_seq outputs (y) for each input x, we might face GPU memory usage proportional to batch_size * num_return_seq. To resolve this, we generate output_chunk_size outputs at a time, resulting in batch_size * output_chunk_size memory usage. We then combine all generated output chunks to produce a total of num_return_seq outputs per input.

However, this approach raises another issue: we must wait for every x in a batch to accumulate num_return_seq unique y. This becomes harder with larger batch_size values. For example, when batch_size = 1, generating 100 y requires around 136 iterations. When batch_size = 4, it becomes 176. When batch_size = 16, it can be 5.76 in some cases (setting top-p = 0.9 and temperature = 1.2 to make it easier to generate different features).

![](https://i.imgur.com/zu3NZja.png)
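A minimal sketch of the chunked loop described above, for my own reference (assuming a HuggingFace seq2seq checkpoint; the model name, chunk size, and iteration cap are placeholders, not the exact values used):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; the actual fine-tuned model lives elsewhere.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").eval()

def generate_unique(texts, num_return_seq=100, output_chunk_size=20, max_iters=50):
    """Sample output_chunk_size candidates per input per iteration, collecting
    results until every input has num_return_seq unique outputs (or max_iters
    is hit, cf. the 'maximum number of iterations' task below)."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    unique = [set() for _ in texts]
    for _ in range(max_iters):
        if all(len(u) >= num_return_seq for u in unique):
            break
        with torch.no_grad():
            out = model.generate(
                **inputs,
                do_sample=True,
                top_p=0.9,
                temperature=1.2,
                num_return_sequences=output_chunk_size,
                max_new_tokens=64,
            )
        decoded = tokenizer.batch_decode(out, skip_special_tokens=True)
        # generate() returns batch_size * output_chunk_size sequences, grouped by input.
        for i, u in enumerate(unique):
            u.update(decoded[i * output_chunk_size:(i + 1) * output_chunk_size])
    return [sorted(u)[:num_return_seq] for u in unique]
```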
Naturally, some x (datasets) tend to have more diverse feature representations, while others are more centered. To address this, consider the following tasks:

- [ ] Set a maximum number of iterations
- [ ] Count a feature repetition score; a higher repetition count indicates a higher probability. Then choose the most repeated features

Random thought about the second task: if we were to perform the experiment an infinite number of times, would it be equivalent to performing a beam search with an infinite beam width (using frequency to represent probability)?

Update later that day: if I set the temperature to 1.0 and the batch size to 4, it sometimes takes over 60 iterations to complete.

![](https://i.imgur.com/iw7K8oq.png)

## April 24

### Sampling Methods

Explore various sampling methods to determine the one that yields the best results (a rough mapping to `generate()` arguments is sketched after the lists below).

#### Deterministic

- Greedy Search (yields a single output per input, so it does not suit the one-to-many scheme)
- Beam Search (reduce the batch size to address memory concerns if generating too many sequences)
- Diverse Beam Search: set num_beam_groups > 1
- Contrastive Search (yields a single output per input, so it does not suit the one-to-many scheme)

#### Stochastic

Note: According to the [source code](https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/generation/utils.py#L771), when temperature, top-p, and top-k are set simultaneously, they are applied in the order temperature, top-p, top-k.

- Top-k Sampling: experiment with different values of k
- Top-p Sampling: experiment with different values of p
- Beam Sampling: experiment with different num_beams values; it is not necessary to set num_beams equal to num_return_seq
- Adjust temperature for the methods above
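For reference, a rough mapping from the options above to HuggingFace `generate()` keyword arguments. This is only a sketch: the numeric values are illustrative rather than the settings actually used, and `model`/`inputs` are assumed to be the objects from the April 26 sketch.

```python
# Illustrative decoding configurations; each dict is passed to model.generate().
decoding_configs = {
    # Deterministic
    "greedy": dict(do_sample=False, num_beams=1),
    "beam": dict(do_sample=False, num_beams=20, num_return_sequences=20),
    "diverse_beam": dict(do_sample=False, num_beams=20, num_beam_groups=4,
                         diversity_penalty=1.0, num_return_sequences=20),
    "contrastive": dict(do_sample=False, penalty_alpha=0.6, top_k=4),
    # Stochastic
    "top_k": dict(do_sample=True, top_k=50, num_return_sequences=20),
    "top_p": dict(do_sample=True, top_p=0.9, temperature=1.2, num_return_sequences=20),
    "beam_sample": dict(do_sample=True, num_beams=5, num_return_sequences=5),
}

outputs = model.generate(**inputs, max_new_tokens=64, **decoding_configs["top_p"])
```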
## April 22

### Model Training

I found that 3 epochs are actually sufficient. After the 3rd epoch, the evaluation loss starts to increase.

![](https://i.imgur.com/Nn0Te3Y.png)

The training loss, on the other hand, exhibits some fluctuations but gradually decreases.

![](https://i.imgur.com/XiNC3rb.png)

The model performance:

| Model | Dataset | Sample Method | Ground Truth 20 Precision | Ground Truth 50 Precision | Ground Truth 100 Precision |
|---|---|---|---|---|---|
| flan-t5-base w instr | infer test (top 20) | beam | 0.0380 | 0.0723 | 0.1018 |
| flan-t5-base w instr | infer train (top 20) | beam | Out of Mem | - | - |
| flan-t5-base w/o instr | infer test (top 20) | beam | 0.0398 | 0.0687 | 0.1014 |
| flan-t5-base w/o instr | infer train (top 20) | beam | 0.0958 | 0.1441 | 0.1852 |
| t5-base | infer test (top 20) | beam | 0.0325 | 0.0590 | 0.0937 |
| t5-base | infer train (top 20) | beam | 0.1133 | 0.1623 | 0.2031 |

To address the out-of-memory issue, I used half precision (fp16). The performance appears to drop marginally with fp16.

| Model | Dataset | Sample Method | Ground Truth 20 Precision | Ground Truth 50 Precision | Ground Truth 100 Precision |
|---|---|---|---|---|---|
| flan-t5-base w instr | infer test (top 20) | beam | 0.0380 | 0.0723 | 0.1018 |
| flan-t5-base w instr fp16 | infer test (top 20) | beam | 0.0349 | 0.0675 | 0.1018 |
| flan-t5-base w instr fp16 | infer train (top 20) | beam | 0.1108 | 0.1629 | 0.2052 |

Overall, using flan-t5 can improve performance, but adding instructions does not seem to have a significant impact. The improvement might instead come from the enhancements introduced in T5 version 1.1; the training on arithmetic reasoning could also potentially help the model understand the statistics of the table.

![](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)

> T5 Version 1.1 includes the following improvements compared to the original T5 model:
> - GEGLU activation in feed-forward hidden layer, rather than ReLU - see here.
> - Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
> - Pre-trained on C4 only without mixing in the downstream tasks.
> - No parameter sharing between embedding and classifier layer.
> - "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger d_model and smaller num_heads and d_ff.

### Sample Methods

## April 21 (late night)

I discovered that the model behind the performance report shown below was only trained for 3 epochs, due to my oversight in setting the epoch parameter to 3. The loss was still decreasing. I have now adjusted the number of epochs to 10 and set the early stopping patience to 3 epochs.

![](https://i.imgur.com/YaXU1g7.png)

- [ ] Save the top-n checkpoints and average the model parameters

(earlier that night)

Experimenting with the FLAN-T5 model by applying a T5-style prompt:

```python
def apply_t5_style(text: str) -> str:
    # Wrap the table representation in an instruction-style prefix.
    return f"Generate features based on table representation: {text}"
```

| Model | Dataset | Sample Method | Ground Truth 20 Precision | Ground Truth 50 Precision | Ground Truth 100 Precision |
|---|---|---|---|---|---|
| flan-t5-base w/o instr | infer test (top 20) | beam | 0.0398 | 0.0687 | 0.1014 |
| flan-t5-base w/o instr | infer train (top 20) | beam | 0.0958 | 0.1441 | 0.1852 |
| t5-base | infer test (top 20) | beam | 0.0325 | 0.0590 | 0.0937 |
| t5-base | infer train (top 20) | beam | 0.1133 | 0.1623 | 0.2031 |

## April 20

### Tasks Completed

#### Build One-to-One Dataset

- Use prefix operators, e.g. Add (a, b) instead of a + b
- Add a space before and after the brackets, e.g. Div ( DMFT.Begin , DMFT.End )
- One dataset => top_n_features (dataset, feature) pairs

The "neurofe_one_to_one_dataset" consists of two subsets: train and test. The train set contains 4,647 samples, while the test set contains 1,116 samples. Each sample has four fields: 'table_name', 'y_col', 'x', and 'y'.

An example of a data point in the train set:

```
table_name: [openml]analcatdata_dmft
y_col: Prevention
x: Table [openml]analcatdata_dmft has 797 rows and 5 columns. First row: DMFT.Begin is 6, DMFT.End is 3, Gender is Male, Ethnic is Black, Prevention is Health_education. Numeric column DMFT.Begin has mean 3.32, max 8, min 0, std 2.58. Numeric column DMFT.End has mean 1.85, max 6, min 0, std 1.70. Categorical column Gender has 2 unique values. Male has 408 rows, Female has 389 rows. Categorical column Ethnic has 3 unique values. White has 383 rows, Dark has 302 rows, Black has 112 rows. Categorical column Prevention has 6 unique values. Top 5 values are: Mouthwash has 155 rows, None has 136 rows, Diet_enrichment has 132 rows, All_methods has 127 rows, Health_education has 124 rows. Target column Prevention is Prevention.
y: Feature is Div ( DMFT.Begin , DMFT.End ).
```
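For reference, a minimal sketch of how the `x` table-representation string above could be assembled from a pandas DataFrame. This is a reconstruction under my own assumptions (column-type handling, rounding, and the top-5 cutoff are illustrative), not the exact script used to build the dataset:

```python
import pandas as pd


def table_representation(df: pd.DataFrame, table_name: str, target_col: str) -> str:
    parts = [f"Table {table_name} has {len(df)} rows and {df.shape[1]} columns."]
    first_row = ", ".join(f"{c} is {df.iloc[0][c]}" for c in df.columns)
    parts.append(f"First row: {first_row}.")
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            s = df[col]
            parts.append(
                f"Numeric column {col} has mean {s.mean():.2f}, max {s.max()}, "
                f"min {s.min()}, std {s.std():.2f}."
            )
        else:
            counts = df[col].value_counts()
            top = ", ".join(f"{v} has {n} rows" for v, n in counts.head(5).items())
            prefix = "Top 5 values are: " if counts.size > 5 else ""
            parts.append(
                f"Categorical column {col} has {counts.size} unique values. {prefix}{top}."
            )
    parts.append(f"Target column {target_col} is {target_col}.")
    return " ".join(parts)
```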
#### Result

Ground Truth 20 Precision compares the model's predictions against the top 20 ground-truth features, measuring the proportion of predictions that are correct with respect to that top-20 list. The same definition applies for 50 and 100.

| Dataset | Sample Method | Ground Truth 20 Precision | Ground Truth 50 Precision | Ground Truth 100 Precision |
|---|---|---|---|---|
| infer test (top 20) | beam | 0.0325 | 0.0590 | 0.0937 |
| infer train (top 20) | beam | 0.1133 | 0.1623 | 0.2031 |
| infer test (top 20) | topk | 0.0253 | 0.0488 | 0.0709 |
| infer train (top 20) | topk | 0.0679 | 0.1055 | 0.1366 |

### Tasks TODO

- [ ] Use recall instead of precision for evaluation. For instance, if we sample 500 features and achieve over 80% recall on the top 50 features, it would significantly reduce the cost of the search (see the sketch after this list).
- [ ] Analyze the parameters of the operators. Determine whether some parameters are easier to predict than others, i.e., whether they have higher accuracy.
- [ ] Evaluate the effectiveness of the predicted features directly on the test set: append each predicted feature as a column and measure the model performance improvement. Additionally, we can append all features to obtain an overall performance-improvement measurement.
- [ ] Ensure correct column names and operator names during sampling.
- [ ] Explore methods for creating synthesized datasets using web tables.
- [ ] Debug and implement label smoothing.
- [ ] Investigate the use of top-k/top-p search. Use temperature != 0 to sample m times until we have k unique features.
- [ ] Experiment with different model versions, starting from Flan-T5 and T0, and provide clear instructions. If necessary, Qian can provide a subset of T5 training samples in the general domain for mixed training to prevent catastrophic forgetting.
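To make the first TODO item concrete, here is a small sketch of precision and recall against a top-k ground-truth list. The precision definition reflects my reading of the metric above; the actual evaluation script may differ, and the function name is just illustrative:

```python
def precision_recall_at_k(predicted: list[str], ground_truth: list[str], k: int):
    """predicted: generated feature strings; ground_truth: features ranked by improvement.

    Precision: fraction of generated features that appear in the top-k ground truth.
    Recall:    fraction of the top-k ground truth covered by the generated features.
    """
    top_k = set(ground_truth[:k])
    hits = set(predicted) & top_k
    precision = sum(p in top_k for p in predicted) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(top_k) if top_k else 0.0
    return precision, recall


# Example: sample 500 features and check recall on the top-50 ground truth.
# precision, recall = precision_recall_at_k(sampled_features, gt_features, k=50)
```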