# G2G Meeting 02/11
## Progress
- While refactoring the G2G code, I reduced the data-conversion time in cycle training (converting data between the t2g/g2t and g2t/t2g directions) by about 80%. The overhead comes from string processing and tokenization, since the two models use different tokenizers. To accelerate this part, I refactored the data-processing code and introduced multiprocessing (a sketch appears at the end of this section).
- Some early results of one-sided training, where a trained t2g model teaches the g2t model:
|Iters|0|1k|2k|3k|4k|
|-|-|-|-|-|-|
|50% gold|10.87 (3.59)|12.95 (3.58)|11.58 (3.57)|11.93 (3.56)|10.31 (3.56)|
|50% syn|10.87 (3.59)|7.56 (3.74) at 200 steps| | | |
|50% gold + 50% syn|10.87 (3.59)|12.67 (3.55)|11.94 (3.55)|12.94 (3.55)|11.59 (3.54)|
We have a pre-trained t2g model that produces the synthetic data, $\text{t2g} \rightarrow \text{syn data}$. We then train g2t under three data settings:
- gold $\rightarrow$ g2t
- syn $\rightarrow$ g2t
- gold + syn $\rightarrow$ g2t
Where "gold" means the human-annotated data, "syn" means the synthetic data generated from the t2g model. The performance is in the format BLEU(NLL).
## Next steps
- A few issues remain in the refactoring, such as the data format, the configuration format, and the reuse of the data-processing code.
- Continue the one-sided training with extra data (crawled from Wikidata).
## Discussion
### The difference between G2G and UMT
- From the perspective of data augmentation, a necessary condition for an improvement is that the augmented synthetic data helps the model optimize its loss function. <!--And this improvement may come from whether to bring the benefit of optimizing the training loss or reducing the bias between training and testing. --> In general, we want the synthetic data to be drawn from the correct underlying distribution without already being in the training set. Although we don't know the underlying data distribution, we do know some of its constraints/features. For example, word-by-word translation in UMT (shared vocabulary in related language pairs): each word in the source language has only a few corresponding words in the target language. Likewise, the denoising auto-encoder forces the model to cover the whole sentence without dropping any meaningful token. UMT has several such artificial constraints, whereas our G2G framework has only one: the supervised pre-training.
- From the perspective of model-based RL, if we view UMT trained from scratch as a model-free problem, then UMT with a shared encoder-decoder and a denoising auto-encoder is implicitly model-based: its constraints limit the action space (picking the next word) to a relatively small region, so optimization becomes easier and more robust.
- As a result, we can try introducing artificial constraints ourselves, such as measuring entity coverage, relation coverage, meta-info correlation, and alias matching. These scorers can be learned from the annotated data with simple statistics, and we can use them in two ways: as a reward in RL, or to filter/re-sample the synthetic data (a filtering sketch follows this list).
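As a concrete illustration of the filtering option, here is a hedged sketch of one such scorer (entity coverage) used to filter synthetic pairs. The data format (graphs as triplet lists), the substring-matching heuristic, and the threshold are all assumptions for illustration, not the actual pipeline.

```python
# Sketch: filter synthetic (graph, text) pairs with an entity-coverage
# scorer. Data format and threshold are illustrative assumptions.

def entity_coverage(graph, text):
    """Fraction of the graph's entities whose surface form appears in the text."""
    entities = {e for head, _rel, tail in graph for e in (head, tail)}
    if not entities:
        return 0.0
    text_lower = text.lower()
    return sum(e.lower() in text_lower for e in entities) / len(entities)

def filter_synthetic(pairs, threshold=0.8):
    # The same score could instead serve as a reward signal in RL.
    return [(g, t) for g, t in pairs if entity_coverage(g, t) >= threshold]

graph = [("Alan Turing", "birthPlace", "London")]
good = "Alan Turing was born in London."
bad = "He was born somewhere."
assert filter_synthetic([(graph, good), (graph, bad)]) == [(graph, good)]
```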
## Family of scorers
### T2G
- Entity coverage: count the graph's entities and check how many appear in the text.
- Relation coverage: count the graph's edges and check for their aliases in the text.
- Meta-info: given two node types, the set of possible relations between them is limited, so we can estimate the probability of each triplet's meta-info (a counting sketch follows this list).
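A minimal sketch of how the meta-info scorer could be fit by simple counting over annotated triplets, assuming node-type labels are available (e.g. from NER); the type names and smoothing constant are illustrative.

```python
# Sketch: estimate P(relation | head_type, tail_type) by counting
# typed triplets in the annotated data. Labels are illustrative.
from collections import Counter, defaultdict

def fit_meta_info(typed_triplets):
    # typed_triplets: iterable of (head_type, relation, tail_type)
    counts = defaultdict(Counter)
    for head_type, relation, tail_type in typed_triplets:
        counts[(head_type, tail_type)][relation] += 1
    return counts

def meta_info_prob(counts, head_type, relation, tail_type, alpha=1e-6):
    # Smoothed probability of this relation between the two node types.
    pair = counts[(head_type, tail_type)]
    total = sum(pair.values())
    return (pair[relation] + alpha) / (total + alpha) if total else alpha

annotated = [("PERSON", "birthPlace", "CITY"),
             ("PERSON", "birthPlace", "CITY"),
             ("PERSON", "employer", "ORG")]
counts = fit_meta_info(annotated)
print(meta_info_prob(counts, "PERSON", "birthPlace", "CITY"))  # 1.0
print(meta_info_prob(counts, "CITY", "birthPlace", "PERSON"))  # ~0 (unseen)
```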
### G2T
- Entity coverage: count the graph's entities and check how many appear in the text.
- Relation coverage: count the graph's edges and check for their aliases in the text.
- Borrow some scorers from the "step by step" paper (we follow this paper for text generation with planning).
### Misc
- Finding relation aliases: we can use POS tagging to find the predicate in the text; a predicate appearing between two entities should match the relation in the corresponding triplet (see the sketch after this list).
- Meta-info across domains: since the entity-linking model's performance doesn't meet our requirements, I will use the NER model first (waiting on Tianxiang's progress). A unified set of node types can be obtained by applying the NER model over all data sources, including the annotated and crawled data. I don't yet have a plan for unifying relation types across sources.
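A hedged sketch of the POS-tagging idea for finding relation aliases, using spaCy as one possible tagger; the take-the-verbs-between-the-entities heuristic and the `en_core_web_sm` model choice are assumptions, not a settled design.

```python
# Sketch: candidate relation aliases = verbs appearing between the two
# entity mentions. spaCy is one possible POS tagger for this.
import spacy

nlp = spacy.load("en_core_web_sm")

def predicates_between(text, head_entity, tail_entity):
    """Return lemmas of verbs found between the two entity mentions."""
    lower = text.lower()
    start = lower.find(head_entity.lower())
    end = lower.find(tail_entity.lower())
    if start < 0 or end < 0 or start >= end:
        return []
    span = text[start + len(head_entity):end]
    return [tok.lemma_ for tok in nlp(span) if tok.pos_ == "VERB"]

# The predicate found between the entities should match an alias of the
# relation in the corresponding triplet, e.g. "bear" ~ birthPlace.
print(predicates_between("Alan Turing was born in London.",
                         "Alan Turing", "London"))  # ['bear']
```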