# EACL 2024 Rebuttal - 67

## Reviewer nAPA

We appreciate your insightful feedback and suggestions regarding our manuscript, as well as your time and attention to our work. Below, we address the reviewer's comments and suggestions.

### Response to Weakness 1: Comparison with Paraphrasing-based Approach

> The authors do not compare GPT4 with SOTA paraphrasing model + machine translation model. The combination should be more cost-efficient.

We acknowledge the importance of a comparison against a paraphrasing + machine translation baseline and conducted additional experiments in response to your feedback. We adopted HRQ-VAE [1], a state-of-the-art paraphrase generation model according to paperswithcode.com, and used it to construct paraphrased English captions for three datasets: Flickr8k, Flickr30k, and MSCOCO 2014. We compared the downstream task performance of this approach against the GPT Annotator w/ GPT-3.5 results shown in Table 1. We additionally incorporated Synonym Replacement [2] and Back-Translation [3] for a more comprehensive assessment. The experimental results are below.

| Flickr8k (English) | BLEU | ROUGE | METEOR |
|--------------------------------------|----------|-----------|------------|
| Human Annotator<br>w/ Limited Budget | 32.64 | 40.92 | 19.02 |
| Synonym Replacement | 33.48 | 41.33 | 19.35 |
| Back-Translation | 30.30 | 36.38 | 15.04 |
| HRQ-VAE | 25.61 | 33.34 | 17.99 |
| GPT Annotator w/ GPT-3.5 | 35.20 | 42.62 | 20.64 |
| **Flickr30k (English)** | **BLEU** | **ROUGE** | **METEOR** |
| Human Annotator<br>w/ Limited Budget | 28.13 | 37.89 | 16.21 |
| Synonym Replacement | 32.21 | 40.46 | 20.18 |
| Back-Translation | 30.19 | 38.88 | 19.49 |
| HRQ-VAE | 27.37 | 32.38 | 16.86 |
| GPT Annotator w/ GPT-3.5 | 32.28 | 40.51 | 20.59 |
| **COCO 2014 (English)** | **BLEU** | **ROUGE** | **METEOR** |
| Human Annotator<br>w/ Limited Budget | 40.40 | 46.60 | 18.90 |
| Synonym Replacement | 45.10 | 50.40 | 23.90 |
| Back-Translation | 41.35 | 46.70 | 21.80 |
| HRQ-VAE | 45.59 | 50.10 | 24.20 |
| GPT Annotator w/ GPT-3.5 | 46.38 | 50.40 | 24.50 |

In addition to this English experiment, which supplements Table 1, we translated the paraphrased data into several languages using NLLB (Vietnamese, Polish, Latvian, Estonian, and Finnish) and trained models on the translated data. The experimental results are below.

| AIDe (Polish) | BLEU | ROUGE | METEOR |
|------------------------------------|----------|-----------|------------|
| Original | 5.88 | 16.66 | 8.54 |
| NLLB | 2.51 | 14.29 | 6.37 |
| HRQ-VAE + NLLB | 3.25 | 14.48 | 6.85 |
| GPT Annotator w/ GPT-4 | 4.57 | 14.68 | 7.11 |
| **UiT-ViIC (Vietnamese)** | **BLEU** | **ROUGE** | **METEOR** |
| Original | 50.17 | 54.53 | 33.28 |
| NLLB | 26.61 | 35.16 | 24.66 |
| HRQ-VAE + NLLB | 29.63 | 36.57 | 26.35 |
| GPT Annotator w/ GPT-4 | 39.08 | 44.83 | 29.66 |
| **Constructed Dataset (Latvian)** | **BLEU** | **ROUGE** | **METEOR** |
| NLLB | 6.12 | 17.31 | 10.37 |
| HRQ-VAE + NLLB | 5.82 | 17.77 | 10.84 |
| GPT Annotator w/ GPT-4 | 10.34 | 18.33 | 11.08 |
| **Constructed Dataset (Estonian)** | **BLEU** | **ROUGE** | **METEOR** |
| NLLB | 4.61 | 11.47 | 8.80 |
| HRQ-VAE + NLLB | 4.93 | 11.18 | 8.58 |
| GPT Annotator w/ GPT-4 | 5.94 | 12.30 | 9.20 |
| **Constructed Dataset (Finnish)** | **BLEU** | **ROUGE** | **METEOR** |
| NLLB | 3.86 | 10.49 | 7.80 |
| HRQ-VAE + NLLB | 4.42 | 10.24 | 8.06 |
| GPT Annotator w/ GPT-4 | 5.93 | 13.92 | 10.04 |

These additional experiments show that models trained with our GPT annotator outperform those trained with the paraphrasing + machine translation approach, indicating that our proposed method offers higher-quality annotations. We thank you once again for pointing out the paraphrasing-based baseline.

### Response to Weakness 2: Limitations of GPT-4 on Low-resource Languages

> The authors do not include comprehensive study to show the limitations of GPT4 on low-resource languages.

We sincerely appreciate your feedback highlighting the importance of discussing how our approach behaves on low-resource languages. In response, we conducted an error analysis focusing on two extremely low-resource languages, Basque and Māori. Basque has a small number of speakers and is a language isolate with no established relationship to neighboring languages such as Spanish and French, which makes it harder to process. Māori has very few speakers, posing a challenge as an extremely low-resource language.

**Limitations of GPT-4 on Extreme Low-resource Languages - Basque**

- Case 1
  - Source Text: A black dog and a *spotted* dog are fighting.
  - Google Translator: Txakur beltz bat eta txakur *orban* bat borrokan ari dira
  - GPT Annotator w/ GPT-4: Kolore beltzeko txakur bat eta beste bat *orbainekin* borrokan ari dira. (A black dog and another with *scars* are fighting.)
  - Comment: The word “spot” has changed to “scar.”
- Case 2
  - Source Text: Several women *wait* outside in a city.
  - Google Translator: Hainbat emakume kanpoan *itxaroten* hiri batean.
  - GPT Annotator w/ GPT-4: Emakume talde bat hiri bateko inguru batean *zutik* daude. (A group of women are *standing* in an area of a city.)
  - Comment: The word “wait” has changed to “standing”, which does not fully preserve the original meaning.
- Case 3
  - Source Text: A man clad in a *safety vest* and helmet is standing in the street with a flag.
  - Google Translator: *Segurtasun txalekoa* eta kaskoa jantzita dagoen gizon bat kalean dago bandera batekin.
  - GPT Annotator w/ GPT-4: *Segurtasun gerrikoa* eta kaskoa jantzita, gizon bat bandera batekin dago kalean. (A man wearing a *seat belt* and a helmet stands on the street with a flag.)
  - Comment: The word “safety vest” has changed to “seat belt.”

**Limitations of GPT-4 on Extreme Low-resource Languages - Māori**

- Case 1
  - Source Text: A man in *green* holds a guitar while the other man observes his shirt.
  - Google Translator: Ko tetahi tangata he *kakariki* e mau ana ki te kita, ko tetahi atu tangata e titiro ana ki tana koti.
  - GPT Annotator w/ GPT-4: E pupuri ana te tangata i te kita, ko te tangata ke atu e matakitaki ana i tana hāte. (One is holding a guitar while the other is looking at his shirt.)
  - Comment: The sentence has lost the description of the color of the clothes the man is wearing.
- Case 2
  - Source Text: Boys *perform dances* on poles during the nighttime.
  - Google Translator: Ka *kanikani* nga tama ki runga pou i te po.
  - GPT Annotator w/ GPT-4: Tamariki tāne e *mahi* ake ana i ngā pou i te po tuturu. (Boys who work up posts in the real night.)
  - Comment: The phrase “perform dances” has changed to “work up.”
- Case 3
  - Source Text: A *ballet* class of five girls jumping in sequence.
  - Google Translator: He karaehe paoro o nga kotiro tokorima e peke ana i te raupapa.
  - GPT Annotator w/ GPT-4: He akomanga *parekareka* o rima kōtiro e peke tahi ana i roto i te raupapa. (It's a fun class of five girls jumping together in a series.)
  - Comment: The generated sentence has lost the word “ballet” and changed it to “fun”, indicating that the model has a weaker grasp of proper nouns.

We are committed to including this analysis in the revised manuscript to provide a more comprehensive understanding of the limitations. Once again, thank you for pointing out the importance of error analysis and a discussion of limitations.

### Response to Weakness 3: Overclaimed Contributions

> Some of the claimed contributions are too strong. For example, there're some prior work that employ LLM for tasks similar to image captioning and style transfer.

Thank you for highlighting the concern regarding the tone of our claimed contributions. We will moderate the tone in the revised manuscript and provide a more nuanced discussion that acknowledges prior work and differentiates our approach from it, explicitly outlining the distinctions between their methodologies and ours. For instance, we have found the following studies that are close to our approach; however, they differ from our proposed approach in several respects:

**Whitehouse et al. LLM-powered Data Augmentation for Enhanced Crosslingual Performance. EMNLP 2023.**
- This work deals with natural language understanding tasks, including causal commonsense reasoning, whereas our method centers on natural language generation tasks such as image captioning and text style transfer.
- This work mainly proposes a strategy for data augmentation, while we suggest employing LLMs for dataset construction.

**Bianco et al. Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion. arXiv preprint 2023.**
- This work generates multiple candidate captions with several image captioning models, ranks them, and produces the final caption by combining the two best captions with an LLM. It can be understood as a variant of ensemble learning for image captioning.
- While this work also utilizes LLMs for image captioning, there are distinct differences from our study. We focus on constructing datasets for various languages, whereas this work uses an LLM to combine captions produced by numerous SOTA image captioning models.
This approach could be costly, as it requires deploying multiple downstream task models as well as the LLM that combines their outputs.
- Moreover, this work does not explore the multilingual ability of LLMs.

### Response to Weakness 4: Organization and Writing of the Paper

> The organization and writing of the paper could be improved.

We understand your perspective on the organization and writing of the manuscript. We will revise both to enhance clarity and readability, and we will incorporate the additional comparison with the paraphrase generation + machine translation method. Specific feedback on areas that need improvement would be valuable for refining the manuscript.

### Response to Comments: Missing References

> There're some missing citations, such as AiHub, AdamW, and CosineAnnealingLR.

Thank you for noting the missing citations. We will carefully check the manuscript and ensure that proper references to AiHub, AdamW, and CosineAnnealingLR are included in the revised version.

We appreciate the thorough review and constructive feedback from the reviewer. These insights will be instrumental in refining and strengthening our paper.

### References

```
[1] Hosking et al. Hierarchical Sketch Induction for Paraphrase Generation. ACL 2022.
[2] Zhang et al. Character-level Convolutional Networks for Text Classification. NeurIPS 2015.
[3] Sennrich et al. Improving Neural Machine Translation Models with Monolingual Data. ACL 2016.
```

## Reviewer T4eM

Thank you for your review and thoughtful feedback on our manuscript. We appreciate your time and insights into our work, and we are glad that you recognized the main contribution of our paper: exploring the possibility of LLMs as multilingual annotators that are autonomous, cost-efficient, and applicable to low-resource languages. Below, we address the reviewer's comments and suggestions.

### Response to Weakness 1: Necessity of Human Evaluation

> The evaluation is mainly automated metrics. Human evaluation on a sample would give a better sense of output quality.

We appreciate your feedback on the necessity of human evaluation. Our evaluation design, based on the performance of downstream task models, follows a previous study suggesting that downstream task performance can serve as an indirect measure of the quality of the generation algorithm [1]. We will clarify the connection between this prior work and our evaluation design in the updated version. Nonetheless, we fully acknowledge the significance of human evaluation for validating the quality of the generated data. Although the limited rebuttal period makes it difficult to present human evaluation results here, we plan to incorporate them in the camera-ready version. Once again, we sincerely thank you for emphasizing the importance of human judgment.

### Response to Weakness 2: More Comparison Examples

> More examples comparing system outputs versus human and machine translated outputs would be insightful.

Thank you for pointing out the value of additional examples comparing system outputs with human and machine-translated outputs. In response to your feedback, we conducted a more extensive comparison of captions from various datasets in languages other than Korean.

**Polish Comparison**

- Context of Generated Sentence
  - Flickr File Name: 1153704539_542f7aa3a5
    - English Reference: A girl playing trumpet in a marching band.
    - Polish Human-annotated Reference: Dziewczyna w sportowym stroju i czapce z daszkiem stoi na trawniku i gra na trąbce w towarzystwie innych muzyków. (A girl in sports clothes and a baseball cap stands on the lawn and plays the trumpet in the company of other musicians.)
    - Machine-translated: Dziewczyna grająca na trąbce w zespole. (A girl playing the trumpet in a band.)
    - GPT Annotator w/ GPT-4: Dziewczyna grająca na trąbce w orkiestrze marszowej. (A girl playing the trumpet in the march orchestra.)
- Quality of Generated Sentence
  - Flickr File Name: 1386251841_5f384a0fea
    - English Reference: A woman is looking at dressed, headless mannequins in a store display.
    - Polish Human-annotated Reference: Kobieta ogląda wystawę z ubranymi w damskie stroje manekinami. (A woman looks at an exhibition with mannequins dressed in women's clothes.)
    - Machine-translated: Kobieta patrzy na ubrane, bezgłowe manieki w sklepach. (A woman looks at clothed, headless maniacs in stores.)
    - GPT Annotator w/ GPT-4: Kobieta patrzy na ubrane, bezgłowe manekiny w wystawie sklepowej. (A woman looks at clothed, headless mannequins in a store window.)
  - Flickr File Name: 1387785218_cee67735f5
    - English Reference: A child pushes a doll in a baby carriage.
    - Polish Human-annotated Reference: Dziecko idzie drogą, prowadząc przed sobą mały wózek z lalką. (A child walks along the road, pushing a small stroller with a doll in front of him.)
    - Machine-translated: Mała dziewczynka wpychająca koło lalki (A little girl pushing a doll wheel)
    - GPT Annotator w/ GPT-4: Dziecko popycha lalkę w wózku dla dzieci. (A child pushes a doll in a baby stroller.)
  - Flickr File Name: 1499495021_d295ce577c
    - English Reference: A dark haired woman wearing a brown jacket and fatigue bottoms and a balding man wearing a green sweater and blue jeans with a fishing pole, stand at the foot of the surf.
    - Polish Human-annotated Reference: Mężczyzna stoi z wędką nad brzegiem wody, a obok niego stoi kobieta. (A man stands with a fishing rod at the water's edge and a woman stands next to him.)
    - Machine-translated: Czarnowłosa kobieta w brązowej kurtce i zmęczonym dnie i łysy mężczyzna w zielonej swetrze i niebieskich dżinsów z palcem rybaczem, stoją u stóp pływu. (A black-haired woman in a brown jacket and a tired bottom and a bald man in a green sweater and blue jeans with a fisherman's finger, stand at the foot of the tide.)
    - GPT Annotator w/ GPT-4: Kobieta o ciemnych włosach, ubrana w brązową kurtkę i spodnie w kamuflaż, oraz łysiejący mężczyzna w zielonym swetrze i niebieskich dżinsach z wędką, stoją u podnóża fali. (A woman with dark hair, wearing a brown jacket and camouflage pants, and a balding man in a green sweater and blue jeans with a fishing rod, stand at the foot of a wave.)
  - Flickr File Name: 146098876_0d99d7fb98
    - English Reference: A boy and three girls in blue school uniforms walk down a dirt-covered road.
    - Polish Human-annotated Reference: Chłopiec i trzy dziewczynki w mundurkach idą, niosąc zeszyty. (A boy and three girls in uniforms are walking, carrying notebooks.)
    - Machine-translated: Chłopak i trzy dziewczyny w niebieskich mundurach szli po błędnej drodze. (A boy and three girls in blue uniforms were walking on the wrong path.)
    - GPT Annotator w/ GPT-4: Chłopiec i trzy dziewczyny w niebieskich mundurkach szkolnych idą po drodze pokrytej brudem. (A boy and three girls in blue school uniforms are walking on a road covered with dirt.)

**Vietnamese Comparison**

- Quality of Generated Sentence
  - MSCOCO Image ID: 213669
    - English Reference: A young man holding a tennis racquet on a tennis court.
    - Vietnamese Human-annotated Reference: Người đàn ông đang cầm vợt tennis chạy tới đánh bóng. (A man holding a tennis racket runs to hit the ball.)
    - Machine-translated: một người đàn ông đứng trên một thức ăn với một tên lửa (a man standing on a food with a rocket)
    - GPT Annotator w/ GPT-4: Một người trẻ tuổi đang ở trên sân tennis với cây vợt trong tay. (A young person is on the tennis court with a racket in his hand.)

**Latvian Comparison**

- Quality of Generated Sentence
  - MSCOCO Image ID: 46544
    - English Reference: A woman playing tennis on a tennis court.
    - Machine-translated: Sieva tenisā tenisā. (Tennis wife in tennis.)
    - GPT Annotator w/ GPT-4: Sieviete spēlē tenisu tenisa kortā. (A woman plays tennis on a tennis court.)
  - MSCOCO Image ID: 43960
    - English Reference: A boy catching a ball while another boy holds a bat.
    - Machine-translated: Puikas, kas ieņem lopu, kamēr cits puikas, kas drīkst pieņemt lopu. (Boys who take livestock, while other boys who are allowed to accept livestock.)
    - GPT Annotator w/ GPT-4: Zēns noķer balls, kamēr cits zēns tur nūju. (A boy catches the ball while another boy holds the stick.)
  - MSCOCO Image ID: 47813
    - English Reference: There are four people playing tennis in doubles.
    - Machine-translated: Divās grupās spēlē četri cilvēki. (Four people play in two groups.)
    - GPT Annotator w/ GPT-4: Četri cilvēki spēlē tenisu dubultspēlēs. (Four people play tennis in doubles.)

We are committed to incorporating these analysis results into the updated version. We deeply appreciate your valuable suggestion to add more examples.

### Response to Weakness 3: Deeper Discussion about Ethical Bias

> Ethical considerations around bias in LLM outputs should be discussed in more depth.

We acknowledge the significance of a deeper discussion of the ethical bias of the proposed method, which utilizes LLMs. In response to your feedback, we conducted an error analysis focusing on the ethical bias of the GPT annotator in two extremely low-resource languages, Basque and Māori. Basque has a small number of speakers and is a language isolate with no established relationship to neighboring languages such as Spanish and French, which makes it harder to process. Māori has very few speakers, posing a challenge as an extremely low-resource language.

**Error Analysis on Ethical Bias**

- Basque
  - Source Text: Two young, White males are outside near many bushes.
  - Google Translator: Bi ar zuri kanpoan daude zuhaixka askoren ondoan.
  - GPT Annotator w/ GPT-4: Gazte *kaukasiko* gizon pare bat kanpoan daude, zuhaixka ugariz inguratuta. (A couple of young *Caucasian* men are outside, surrounded by many bushes.)
  - Comment: The output from the GPT annotator contains the word “Caucasian”, a term based on an obsolete racial classification.
- Māori
  - Source Text: Two young, White males are outside near many bushes.
  - Google Translator: Tokorua nga tama tane ma, kei waho tata ki nga ngahere maha.
  - GPT Annotator w/ GPT-4: E rua ngā tāne rangatahi, *Pākehā*, kei waho e tū tata ana ki ngā pūrei. (Two young *European* men are standing outside near the playground.)
  - Comment: The output from the GPT annotator contains the word “European” instead of “White”, which can be interpreted as a geographical bias.

These errors occurred despite a prompt instructing the model to generate debiased sentences. We hypothesize that such phenomena may occur in extremely low-resource languages because GPT-4 has less knowledge of these languages than of higher-resource ones. We are committed to updating the limitations section of the manuscript with these analyses, offering a balanced view of the potential ethical bias of the proposed method.

### Response to Comments: Relevant Recent Work

> Solid paper overall, and demonstrates practical results. The results are not very new but still quite novel. Relevant recent work : [https://arxiv.org/pdf/2305.14288.pdf](https://arxiv.org/pdf/2305.14288.pdf)

We thank the reviewer for describing our work as a solid paper and for suggesting a relevant recent study. We will discuss this work in the updated version of the manuscript. We also conducted a further literature review and found another related work. Below, we briefly discuss the differences between these studies and our manuscript.

**Whitehouse et al. LLM-powered Data Augmentation for Enhanced Crosslingual Performance. EMNLP 2023.**
- This work deals with natural language understanding tasks, including causal commonsense reasoning, whereas our method centers on natural language generation tasks such as image captioning and text style transfer.
- This work mainly proposes a strategy for data augmentation, while we suggest employing LLMs for dataset construction.

**Bianco et al. Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion. arXiv preprint 2023.**
- This work generates multiple candidate captions with several image captioning models, ranks them, and produces the final caption by combining the two best captions with an LLM. It can be understood as a variant of ensemble learning for image captioning.
- While this work also utilizes LLMs for image captioning, there are distinct differences from our study. We focus on constructing datasets for various languages, whereas this work uses an LLM to combine captions produced by numerous SOTA image captioning models. This approach could be costly, as it requires deploying multiple downstream task models as well as the LLM that combines their outputs.
- Moreover, this work does not explore the multilingual ability of LLMs.

We appreciate your detailed review and attention to detail, which help us enhance the quality and clarity of our work.

### References

```
[1] Ye et al. ZeroGen: Efficient Zero-shot Learning via Dataset Generation. EMNLP 2022.
```

## Reviewer B2A7

Thank you for dedicating your time to a comprehensive review of our manuscript. We deeply appreciate your recognition of the significance of multilingual and low-resource language annotation, the challenge we propose to address. We have carefully considered your comments and concerns and would like to address them as follows:

### Response to Weakness 1: Importance of “Gold” Annotation

> First and foremost, the methodology relies on the availability of initial "gold" annotations provided by human annotators. This dependence means that the quality and accuracy of the GPT-generated "silver" annotations are contingent on the initial human-provided data. Using models as annotators often suffers from the main issues of lack of diversity and bias. These two topics are not addressed in the paper.
> Relying on only one human annotator to generate the seed of a caption in the experiment of caption generation, for example, may not extract rich information from an image or reflect bias in certain aspects reflecting the cohorts of the assigned human annotator. It might be helpful to ask human annotators to enumerate as much detailed and diverse information as possible, but how effective the approach is in terms of increasing diversity and mitigating bias is unknown. Increasing costs can be a problem too.

We thank you for highlighting the significance of the initial "gold" annotations in our methodology. We are aware of the consequences this reliance can have on the diversity and potential partiality of the "silver" annotations generated by GPT models. To mitigate the potential loss of diversity and introduction of bias when using models as annotators, we undertook methodical prompt engineering, aiming to encourage a diverse range of sentence structures and information content as well as debiased annotations, so that high-quality and neutral "silver" annotations can be generated from individual gold captions (an illustrative sketch is given at the end of this response). We also agree that increasing the detail and diversity of the information provided by human annotators could improve the model's annotations. We will therefore update the manuscript to outline the importance of rigorous guidelines for human annotators, ensuring that comprehensive and varied information is captured when drafting the "gold" annotations.

Nonetheless, we acknowledge the potential bias of our proposed method. To offer a balanced view and deeper insights, we conducted an error analysis focusing on the bias of sentences generated by the GPT annotator in two extremely low-resource languages, Basque and Māori. Basque has a small number of speakers and is a language isolate with no established relationship to neighboring languages such as Spanish and French, which makes it harder to process. Māori has very few speakers, posing a challenge as an extremely low-resource language. The errors below occurred despite a prompt instructing the model to generate debiased sentences. We hypothesize that such phenomena may occur in extremely low-resource languages because GPT-4 has less knowledge of these languages than of higher-resource ones.

**Error Analysis on Ethical Bias**

- Basque
  - Source Text: Two young, White males are outside near many bushes.
  - Google Translator: Bi ar zuri kanpoan daude zuhaixka askoren ondoan.
  - GPT Annotator w/ GPT-4: Gazte *kaukasiko* gizon pare bat kanpoan daude, zuhaixka ugariz inguratuta. (A couple of young *Caucasian* men are outside, surrounded by many bushes.)
  - Comment: The output from the GPT annotator contains the word “Caucasian”, a term based on an obsolete racial classification.
- Māori
  - Source Text: Two young, White males are outside near many bushes.
  - Google Translator: Tokorua nga tama tane ma, kei waho tata ki nga ngahere maha.
  - GPT Annotator w/ GPT-4: E rua ngā tāne rangatahi, *Pākehā*, kei waho e tū tata ana ki ngā pūrei. (Two young *European* men are standing outside near the playground.)
  - Comment: The output from the GPT annotator contains the word “European” instead of “White”, which can be interpreted as a geographical bias.

Moreover, we intend to conduct human evaluation studies to assess the diversity and quality of the generated annotations. This additional evaluation step will enable us to provide further empirical evidence regarding these critical matters.

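For illustration, below is a minimal sketch of how such a prompt-engineered annotation request might look, assuming the OpenAI Python client; the prompt text, function name, and parameters here are simplified stand-ins and not the exact configuration used in our experiments.

```python
# Minimal sketch of a prompt-engineered "silver" annotation request.
# Assumptions: the OpenAI Python client (>= 1.0) is installed and an API key is
# configured; the prompt below is a simplified stand-in for the one we used.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an image caption annotator. Rewrite the given gold caption in the "
    "target language. Vary sentence structure and wording across rewrites, keep "
    "all factual content, and avoid biased or stereotyped descriptions of people."
)

def generate_silver_caption(gold_caption: str, target_language: str) -> str:
    """Ask the GPT annotator for one diverse, debiased rewrite of a gold caption."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.9,  # higher temperature to encourage varied phrasing
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Target language: {target_language}\nGold caption: {gold_caption}",
            },
        ],
    )
    return response.choices[0].message.content.strip()

# Example usage:
# generate_silver_caption("Two young men are outside near many bushes.", "Basque")
```
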
Thank you once again for raising these essential points regarding our approach.

### Response to Weakness 2: Exploration in Broader Range of Low-resource Languages

> Second, although the paper addresses multilingual capabilities, there appears to be a stronger focus on high-resource languages, with less emphasis on truly low-resource or underrepresented languages. While it does mention the creation of a dataset in Latvian, the paper could benefit from more extensive exploration and validation in a broader range of low-resource languages.

We appreciate your suggestion to explore a broader range of low-resource languages. In response to your feedback, we conducted additional experiments on Estonian and Finnish to diversify the set of languages in our study. We followed the same experimental design as in Section 4.5 and added another baseline that uses HRQ-VAE [1], a paraphrasing model, to generate paraphrases of a given gold caption and then translates them into the target language with the NLLB model (a minimal sketch of this pipeline is given after the references below). The experimental results are as follows:

| **Constructed Dataset (Estonian)** | **BLEU** | **ROUGE** | **METEOR** |
|------------------------------------|----------|-----------|------------|
| NLLB | 4.61 | 11.47 | 8.80 |
| HRQ-VAE + NLLB | 4.93 | 11.18 | 8.58 |
| GPT Annotator w/ GPT-4 | 5.94 | 12.30 | 9.20 |
| **Constructed Dataset (Finnish)** | **BLEU** | **ROUGE** | **METEOR** |
| NLLB | 3.86 | 10.49 | 7.80 |
| HRQ-VAE + NLLB | 4.42 | 10.24 | 8.06 |
| GPT Annotator w/ GPT-4 | 5.93 | 13.92 | 10.04 |

These results indicate the potential benefits of our proposed GPT annotator in other low-resource languages. We will incorporate these findings into the revised manuscript and release the newly constructed datasets for future research in Estonian and Finnish. Thank you for prompting these additional experiments; we look forward to improving the inclusivity of our study.

### Response to Weakness 3: Presentation of Table 2

> Lastly, it's confusing to present Table 2. "Quantitative experimental results of the machine-translated dataset," since it's not a fair comparison. To my understanding, it's just an intermediate trial. Showing table 3. only serve the purpose of the Korean experiment.

We apologize for any confusion surrounding Table 2. As you rightly point out, the absence of an official Korean test dataset made a fair comparison challenging, and without a clear exposition this intention can be obscured. We will revise the manuscript to clarify the role of Table 2 within the scope of the study and to explain more clearly how it complements the human evaluation results detailed in Table 3. Your critical insights have greatly assisted us in improving the clarity of our experimental results.

Your constructive feedback has significantly contributed to the refinement of our work. We are committed to implementing these revisions to ensure our manuscript meets the high standards expected by the community. Thank you once again for your insightful comments.

### References

```
[1] Hosking et al. Hierarchical Sketch Induction for Paraphrase Generation. ACL 2022.
```
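As referenced above, the following is a minimal sketch of the paraphrase + machine translation baseline pipeline, assuming the Hugging Face `transformers` NLLB checkpoint `facebook/nllb-200-distilled-600M`; the exact NLLB variant used in our experiments may differ, and the paraphrase step is shown only as a hypothetical placeholder since HRQ-VAE [1] has its own interface.

```python
# Sketch of the HRQ-VAE + NLLB baseline: paraphrase an English gold caption,
# then translate it into a target language with NLLB.
# Assumption: the "facebook/nllb-200-distilled-600M" checkpoint from Hugging Face.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_with_hrqvae(caption: str) -> str:
    """Placeholder for the HRQ-VAE paraphraser [1]; returns the caption unchanged here."""
    return caption

def translate(text: str, tgt_lang: str) -> str:
    """Translate an English caption into the target language with NLLB."""
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # NLLB selects the output language via the forced BOS token,
        # e.g. "est_Latn" for Estonian or "fin_Latn" for Finnish.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def build_baseline_caption(gold_caption: str, tgt_lang: str) -> str:
    """Paraphrase a gold English caption, then translate it into the target language."""
    return translate(paraphrase_with_hrqvae(gold_caption), tgt_lang)

# Example usage:
# build_baseline_caption("A woman playing tennis on a tennis court.", "est_Latn")
```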
