## UA Data Lab Project: Knowledge Transfer for Computational Linguistics

Collaboration between Prof Mike Hammond (Department of Linguistics) and the UA DataLab.

The [Masters in Human Language Technology](https://linguistics.arizona.edu/ms-hlt) offers an industry-oriented degree in computational linguistics, which focuses on offering linguistics training to engineers and programmers, and programming skills to linguists with little or no software background. This creates an overlap with the objectives of the UA DataLab, particularly for NLP and AI applications.

### Objectives:

1. Provide small knowledge modules for onboarding students beginning their journey into NLP
2. Provide front-end support for building working models for NLP projects
3. Offer dedicated office hours to assist students with developing and debugging their codebase

### Knowledge modules:

`Note-1: These suggestions can be further workshopped, based on Mike's specific needs.`

`Note-2: MS-HLT is also offered asynchronously online, so all modules must have recordings and materials, and need to be self-paced.`

- Introduction to HPC
    - What is Singularity?
    - How to set up HPC and use a sponsor's information to access hours
    - Writing and running your first HPC Slurm request
    - How to set up your IDE for the HPC pipelines
    - How to run Jupyter Notebooks and use the OOD interface (web interface)
    - Debugging and reading error messages
- Version control
    - Advanced git commands and best practices (the degree includes classes on git basics)
    - GitHub debugging: how to fix commits, delete accidentally uploaded data
- Documentation
    - How to create Wikis
    - Introduction to metadata
    - Self-organizing tools, data and resources with a focus on data sovereignty
- Dataset management for NLP
    - How to find and access datasets
    - Data engineering for NLP and Speech Technology
    - Working with metadata
    - How to use and fine-tune pre-trained speech models
    - Setting up a speech model from scratch
- [OPTIONAL] Managing Data Pipelines and Workflows
    - Why do we need a data pipeline?
    - Popular tools
        - Airflow

### From project ideas to working demos

- Coordinate with the instructor at an appropriate juncture to workshop students' research project ideas
- Create a repository of reproducible, implementable and well-documented previous projects that other students can advance, restructure and improve on as their course submission
- Help students choose the right front-end tool for demonstrating their projects
- Programming support
    - Reproducibility basics
    - Helping students build a working showcase for their project
    - Making Jupyter notebooks searchable by metadata

### Office Hours

- Weekly 1 hour: Since the classes are asynchronous, this can be by appointment
- Encourage students to come to DSI events (Coffee and Code, etc.)
- Offer the option of scheduling short meetings with facilitators to accommodate students in other timezones

### Potential Timeline

- Develop 2-3 hours of teaching materials (before beginning of classes and during Weeks 1-3 of the course)
- Meet with students for workshopping research ideas (mid-point of the course)
- Office hours and software carpentry appointments (final weeks of the course)

### Comments

- Meeting common university-wide expectations for what students should know when it comes to software carpentry
- Linguistics students work with coursework across multiple departments (iSchool)
- Onboarding students and offering beginner-friendly modules on things such as neural networks creates a least common denominator issue which slows coursework progress
- 3 HLT tracks: online HLT, in-person professional HLT (regular masters + accelerated masters for undergrad seniors), PhD students
- "Multi-functional programming": differences in how students learn their programming skills (functional vs optimal vs organized)

### Course Specifics:

- Name: LING696G
- 7.5 weeks, asynchronous online
- Materials needed for January (Jan 10-Jan 31)
    - Jupyter notebook (high familiarity), HPC access (Singularity-based ASR work), Docker (Mike spends his office hours debugging installation failures)
    - Jan 10-20: office hours for HPC access and general software installation issues
- Mid-Feb: a mini project, a speech system built for some natural language
    - Mozilla Common Voice dataset for data, plus Mike's codebase
    - Same metadata, same format, same everything
- Last 2 weeks of classes: final project
- Other Courses
    - HLT Professionalism

### Meeting with Gus (02/20) Agenda

- HLT Professionalism
    - Find alignment
    - DataLab's existing workshop offers
    - Cracking
        - Graph search (BFS, DFS): talking through the approach
        - Live coding and scenario
        - Dynamic programming: edit distance
        - The "homework" problem: ML problem, and what elements to include
        - Coding template
    - Portfolio, Networking
    - Any requests from HLT faculty
    - Assess if there are overlapping materials and domain experts
- Software learning badges
    - Experiment tracking
        - Weights and Biases
        - Comparing different runs of an ML model
        - TensorBoard
        - Ablation and comparing architectures
        - End-to-end
- HLT Bootcamp
    - Summer 2025 potentially
    - Example from 2023 software carpentry
    - Demonstration of completion adds favour to applicants
    - Building on Software carpentry: lean version, feels more HLT
    - Expectations
        - Labor
        - Funding (?)
    - Notebooks
        - How to add, delete, edit cells
        - Markdown vs code
- Onboarding students
    - Software carpentry + HLT bent
    - Making something accessible to potential students
    - Hugging Face
        - QLoRA (decoder-only LLM)
    - Guest lectures from faculty
- Project management
    - Agile lite
    - GitHub Projects
    - Test for 508: test for project management
    - Give this a think
- Portfolio creation
    - Test for Streamlit and portfolio: 582 for next fall
    - This is a group project
    - Teaching it with GitHub Projects
    - https://uazhlt-ms-program.github.io/ling-582-course-blog/assignments/course-project
    - Can add videos on Hugging Face
- Competency tests for self-evaluation
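
For the "Dynamic programming: edit distance" agenda item, a worked example along the following lines might suit a live-coding session. This is a minimal sketch, assuming unit costs for insertion, deletion and substitution (Levenshtein distance); the function name `edit_distance` is illustrative and does not come from any existing course material.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs (an assumption for this sketch)."""
    # previous[j] holds the distance between the first i-1 characters of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # prints 3
```

Keeping only two rows of the table at a time uses memory proportional to the length of `target`, which is a natural follow-up discussion point after students have written the full-table version.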