# TS280-machine-translation-workshop
### What is the landscape this year?
* Here are the currently [recommended Harvard course policies](https://oue.fas.harvard.edu/faculty-resources/generative-ai-guidance/) from the Office of Undergraduate Education
* Here is [the advice the Bok Center is providing your Faculty](https://bokcenter.harvard.edu/artificial-intelligence)
* There are two AI tools that HUIT is supporting. Let's get you connected to them before we move on with the workshop!
* Here is your link to [Google Gemini](https://gemini.google.com/app)
* And here is your link to [the HUIT AI Sandbox](https://sandbox.ai.huit.harvard.edu/)
* **Important privacy note:** These HUIT-supported tools have built-in privacy safeguards. Harvard has contracts with these providers ensuring that anything you share won't be used to train their models or be shared with third parties. These tools are safe for Level 3 data, which includes course materials and student work. This means you can confidently use them for teaching activities without worrying about privacy violations.
---
### Activity 1: Google Translate vs. LLMs
Use [Google Translate](https://translate.google.com/) to translate the following sentence, changing out the word "rabbit" for different animals or objects (e.g. pigeon, crow, dog, man, knife, etc.)
```
I saw the rabbit by the bank.
```

* When does Google Translate translate "the bank" to mean "financial institution" vs. "river bank"?
* What does this show us about how Google Translate works?
Next, translate Bar-Hillel's (1960) famous example using Google Translate and Gemini (or your LLM of choice).
```
The box is in the pen.
```
The key detail to note, of course, is the translation of "pen" (as "writing tool" vs. "cage").
* How do the results from Google Translate and Gemini compare?
#### Takeaway
LLMs are trained on vast amounts of text, allowing them to recognize patterns and relationships that help them better grasp context and resolve ambiguities based on subtle cues. This represents a major advance in the machine translation of natural language.
---
### Activity 2: Tokenization
Paste the text below into [tiktokenizer](https://tiktokenizer.vercel.app/).
```
Unsurprisingly, they had to cancel the show. The crowd went home unhappily.
```
* Notice how the model breaks words into tokens.
* Try putting in a sentence or a word with complex morphology in your language of choice.
* Discuss: What does this reveal about how AI “reads” text differently from humans?
#### Takeaway
AI doesn’t “read” words like humans do. It breaks text into tokens—numbers representing pieces of words. This shows that LLMs process language as math, predicting the next number in a sequence rather than reasoning about meaning.
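To make the idea concrete, here is a toy greedy subword tokenizer in Python. The vocabulary below is invented purely for illustration; real tokenizers (like the BPE tokenizers behind tiktokenizer) learn their merge rules from large corpora, so their actual splits will differ.

```python
# Toy greedy subword tokenizer: a simplified stand-in for BPE-style
# tokenization. Real tokenizers learn their vocabulary from data;
# this one is hand-written for demonstration only.

VOCAB = {
    "un", "happi", "ly", "surprising", "cancel", "the",
    "crowd", "went", "home", "show", "had", "to", "they",
}

def tokenize(word, vocab):
    """Greedily split a word into the longest known subword pieces."""
    tokens = []
    i = 0
    while i < len(word):
        # Find the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it as its own token
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unhappily", VOCAB))       # ['un', 'happi', 'ly']
print(tokenize("unsurprisingly", VOCAB))  # ['un', 'surprising', 'ly']
```

Notice that "unhappily" comes out as pieces, not as one unit: the model never sees the word the way a human reader does, only a sequence of vocabulary IDs.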
---
### Activity 3: Language Representation in AI
Large language models (LLMs) learn from vast amounts of text collected from the internet. But not all languages are represented equally online. Some languages with millions of speakers have relatively little digital content, while others with fewer speakers dominate the web. This imbalance shapes how well AI can generate useful materials for teachers and learners of different languages.
**Click through the visualization to compare:**
- The proportion of content available in each language
- The number of people worldwide who speak it
<iframe src="https://claude.site/public/artifacts/2b9d7a13-c0de-4433-b13a-479f0aa81006/embed" title="Claude Artifact" width="100%" height="600" frameborder="0" allow="clipboard-write" allowfullscreen></iframe>
**Discuss**:
* Which languages appear “overrepresented” (lots of internet data relative to speakers)? Which are “underrepresented”?
* How might these imbalances affect translators working in less digitally-represented languages?
#### Takeaway:
AI outputs tend to be stronger for languages with abundant digital data and weaker for those with less online presence. This gap reflects broader inequities in the accessibility of digital tools and highlights the importance of human review, adaptation, and creativity—especially when working with underrepresented languages.
---
### Activity 4: Prompt Engineering
The quality of an AI’s output depends heavily on **how you ask** and **what context it has**. To help you get better results, here are some prompt engineering tips you can use.
Based on these tips, try creating a custom Gem prompt. Here's an example of the interface, filled in with the elements you will input:

For this activity, you can work on translating an excerpt from [this sample text](https://docs.google.com/document/d/1CfwQnIDhW0QjHnmLIkM37AOjpEtkyRx67BGIdT6AWkQ/edit?usp=sharing) into your language of choice. Alternatively, translate a short poem in a different language into English.
- **Adopt a translator persona:** Prompting with a "translator" persona yields better results for machine translation ([He, 2024](https://arxiv.org/pdf/2403.00127)).
```
[PERSONA]
You are a professional literary translator with training in early modern English poetry, heroic couplets, and Restoration politics.
Your priorities:
(1) preserve core semantics and named entities,
(2) respect meter/rhyme when feasible,
(3) surface cultural/biblical allusions,
(4) produce clean, publishable {TARGET_LANG} with concise footnotes for allusions.
```
- **Specify purpose and audience:** Explicitly stating the translation's purpose and target audience enhances quality ([Yamada, 2023](https://aclanthology.org/2023.mtsummit-users.19.pdf)).
```
[PURPOSE]
Produce a literary translation suitable for {AUDIENCE} (e.g., undergraduates in comparative literature / general readers).
Tone: {TONE} (e.g., “elevated but transparent” / “scholarly with light glosses” / “performative and metrical”).
Register: {REGISTER} (e.g., neutral-formal).
Output format:
1) Final translation in {TARGET_LANG}
2) Minimal footnotes/endnotes for proper nouns, allegorical identities, metaphors, or meter choices
3) A 3–7 bullet “source-aware gloss” highlighting: tense/aspect, modality, idioms/fixed expressions, morphology that mattered, rhyme/meter decisions
4) 1–2 alternative phrasings for any tricky lines
Do NOT include hidden chain-of-thought; give only concise justifications.
```
- **Include example translations:** Adding several sample source-target pairs in the prompt improves output quality ([Gao et al., 2025](https://dl.acm.org/doi/abs/10.1145/3700410.3702123)).
```
[EXEMPLARS]
(Use these as style guides, not hard constraints.)
"Of these the false Achitophel was first:
A name to all succeeding ages curst."→{TARGET LANGUAGE TRANSLATION}
"For close designs, and crooked counsels fit;
Sagacious, bold and turbulent of wit:"→{TARGET LANGUAGE TRANSLATION}
"Restless, unfixt in principles and place;
In pow'r unpleas'd, impatient of disgrace."→{TARGET LANGUAGE TRANSLATION}
```
- **Integrate retrieval-augmented generation (RAG):** Retrieve 1–3 concise, relevant snippets (such as glossary entries, definitions, or prior context) and include them as background context ([Stasimioti, 2025](https://slator.com/how-large-language-models-improve-document-level-ai-translation/); [Yamada, 2023](https://aclanthology.org/2023.mtsummit-users.19.pdf)).
* You can use this [context](https://docs.google.com/document/d/1qz6Mtdmnxlgj6xcn15vtOkyWJycig6KBVZ9UBToVBXY/edit?usp=sharing) document to give your Gem additional information on the sample text.
```
[GLOSSARY / TERM BASE]
Maintain consistent rendering for key entities. If target language has established exonyms, prefer those. Otherwise transliterate and gloss once.
- Absalom → {ABSALOM_EQUIV} [biblical allusion; in Dryden: Duke of Monmouth]
- Achitophel → {ACHITOPHEL_EQUIV} [Earl of Shaftesbury]
- David → {DAVID_EQUIV} [King Charles II]
- Israel/Zion → {ISRAEL_EQUIV} [England]
- Sanhedrin → {SANHEDRIN_EQUIV} [Parliament]
- Sion → {SION_EQUIV}
- Amalekites/Philistines → {AMAL_PHI_EQUIV}
Add new entries you infer; list them before translating. If a new name appears, ask for confirmation unless obvious.
```
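As a minimal illustration of the retrieval step, here is a sketch that ranks candidate context snippets by word overlap with a query. The snippets and the function are hypothetical examples, not part of any library; a real setup would pull from the context document linked above, or use an embedding-based retriever.

```python
# Sketch of a keyword-overlap retriever for the RAG tip above.
# The snippets are illustrative stand-ins for glossary/context entries.

SNIPPETS = [
    "Achitophel represents the Earl of Shaftesbury.",
    "David represents King Charles II.",
    "Heroic couplets are rhymed pairs of iambic pentameter lines.",
]

def retrieve(query, snippets, k=2):
    """Rank snippets by how many lowercase words they share with the query."""
    q = set(query.lower().split())
    scored = sorted(
        snippets,
        key=lambda s: len(q & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

print(retrieve("Who is Achitophel in Dryden?", SNIPPETS, k=1))
```

The top-ranked snippets would then be pasted into the prompt as background context before the source text.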
#### Takeaway:
Great machine translations don’t just “happen”; they’re engineered. If you give the model a clear translator persona, specify purpose and audience, include a few style exemplars, add a small glossary or context document, and require a structured output, you’ll get translations that are faithful, readable, and consistent.
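The elements listed above can be assembled programmatically. Below is a sketch of one way to join the four prompt sections into a single prompt string; the function name and the abbreviated placeholder contents are ours, not part of any API.

```python
# Sketch: combining the [PERSONA], [PURPOSE], [EXEMPLARS], and
# [GLOSSARY] sections into one structured prompt. Substitute the
# full section texts from this page for the abbreviated strings.

def build_prompt(persona, purpose, exemplars, glossary, source_text):
    """Join the prompt-engineering sections into one structured prompt."""
    sections = [
        "[PERSONA]\n" + persona,
        "[PURPOSE]\n" + purpose,
        "[EXEMPLARS]\n" + exemplars,
        "[GLOSSARY / TERM BASE]\n" + glossary,
        "[SOURCE TEXT]\n" + source_text,
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    persona="You are a professional literary translator...",
    purpose="Produce a literary translation suitable for undergraduates...",
    exemplars='"Of these the false Achitophel was first: ..." -> ...',
    glossary="Achitophel -> ... [Earl of Shaftesbury]",
    source_text="Of these the false Achitophel was first: ...",
)
print(prompt.splitlines()[0])  # [PERSONA]
```

In a Gem, the same sections would simply be pasted into the instructions field, but writing them as code makes the structure explicit and reusable.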
---
### Activity 5: Building a Chatbot
For this activity, you'll work in groups to build a custom Gem to accomplish the following tasks. Here are your groups, as sent by Spencer:
* **East Asian 1**
* Elli Ahn
* Crystal Cheng
* Yoojung Chun
* Zari Smith
* **East Asian 2**
* Phoenix Ali
* Edward Kim
* Levi Lee
* **European**
* Jonah Lubin
* Adam Mahler
* Ilya Nemirovsky
* Alexia Wang
For our pod of students working mainly on European languages:
> As part of your job as a postdoctoral fellow, you’re teaching and lecturing in a “core curriculum”-style world literature undergraduate program. Each of you is responsible for teaching two works of your choice in translation to undergrads who don’t know the language. You want to create a bot that will help both students taking the class and your colleagues leading sections to close-read passages of text with some appreciation of what’s going on in the source language on the level of grammar and vocabulary. The goal is to get them to critically reflect on, and aesthetically appreciate, the translation you’re reading (your choice!) rather than just treating it as a neutral windowpane for viewing the text.
For our pods of students working mainly on East Asian languages:
> You’re teaching as part of a team in an introductory East Asian Studies class of roughly 100 undergraduate students. The students all come from different language backgrounds, and some have no East Asian language background at all (but they’re excited about the material). Your curriculum includes some classics from cultures of the region: for example, the Analects, Sima Qian’s Records of the Grand Historian, major Tang dynasty poets, and writings by Lu Xun; the Tale of Hong Kildong, poetry by Han Yong-un, and Han Kang’s Human Acts; the Pillow Book, poetry of Bashō, and Natsume Sōseki. Your team wants to create a custom tool that will allow students to explore the linguistic texture of the source-language texts of passages from these works even without knowing the languages, as an incentive to take classes in those languages down the line.
---
### BONUS ACTIVITY: Prompt Chain on Cards
Now that you’ve seen how translation can be shaped by the structure of a prompt, let’s design our own "[prompt chains](https://www.ibm.com/think/topics/prompt-chaining)." In this activity, you’ll prototype an LLM translation pipeline on paper, using color-coded cards to represent different parts of a machine translation process.
You'll build a paper version of this [example python notebook](https://colab.research.google.com/drive/1ASESbrSBORd7WFu20TzbaA8Gd5GQToJu?usp=sharing).
1. In small groups, with one card color, write out some *pre-processing* steps:
* like text cleaning, context retrieval, etc.
2. With another card color, write out some *translation* steps
* like different translation approaches, target audience specification, model selection, etc.
* These can also be recursive steps: for example, translating a passage, having a "judge LLM" provide feedback, then passing those feedback notes and the first translation to a different LLM call. You can repeat this loop *n* times.
3. Finally, write out *post-processing* steps
* back-translation checks, stylistic revisions, etc.
On each card, write a short prompt or operation that your LLM agent would perform.
(For example: “Normalize punctuation,” “Retrieve relevant term base,” “Translate using neutral-formal tone.”) Arrange your cards into a logical sequence or “flow.” When you’re done, photograph your chain (from above!)—these images will be used to turn your paper prototypes into Python notebooks during the October 29 workshop.
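To preview where your paper prototype is headed, here is a minimal sketch of such a chain in Python. The `call_llm` function is a stub standing in for a real model API (the Colab notebook linked above would wrap an actual call); all function names here are illustrative, not a fixed interface.

```python
# Paper prototype as code: pre-processing, a translate/judge feedback
# loop, and post-processing, with a stubbed LLM call.

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"<response to: {prompt[:40]}...>"

def preprocess(text: str) -> str:
    # e.g. normalize whitespace; a real step might also clean
    # punctuation or retrieve context
    return " ".join(text.split())

def translate_with_feedback(text: str, target_lang: str, n_rounds: int = 2) -> str:
    """Draft a translation, then loop: judge LLM critiques, another call revises."""
    draft = call_llm(f"Translate into {target_lang}: {text}")
    for _ in range(n_rounds):
        feedback = call_llm(f"Critique this translation: {draft}")
        draft = call_llm(f"Revise using this feedback: {feedback}\nDraft: {draft}")
    return draft

def postprocess(translation: str, original: str) -> str:
    back = call_llm(f"Back-translate: {translation}")
    # A real pipeline might compare `back` to `original` and flag drift
    return translation

result = postprocess(
    translate_with_feedback(preprocess("The  box is in   the pen."), "French"),
    "The box is in the pen.",
)
print(result)
```

Each card in your paper chain corresponds to one prompt or function call like these; the arrows between cards become the data passed from one call to the next.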
---