## UA Data Lab Project: Knowledge Transfer for Computational Linguistics
Collaboration between Prof Mike Hammond (Department of Linguistics) and the UA DataLab.
The [Masters in Human Language Technology](https://linguistics.arizona.edu/ms-hlt) is an industry-oriented degree in computational linguistics that offers linguistics training to engineers and programmers, and programming skills to linguists with little or no software background. These goals overlap with the objectives of the UA DataLab, particularly for NLP and AI applications.
### Objectives:
1. Provide small knowledge modules for onboarding students beginning their journey into NLP
2. Provide front-end support for building working models for NLP projects
3. Offer dedicated office hours in order to assist students with developing and debugging their codebase
### Knowledge modules:
`Note-1: These suggestions can be further workshopped, based on Mike's specific needs.`
`Note-2: MS-HLT is also offered asynchronously online, so all modules must have recordings and materials, and need to be self-paced.`
- Introduction to HPC
- What is Singularity?
- How to set up HPC access and use a sponsor's information to access compute hours
- Writing and running your first HPC Slurm request
- How to set up your IDE for the HPC pipelines
- How to run Jupyter Notebooks and use the Open OnDemand (OOD) web interface
- Debugging and reading error messages
- Version control
- Advanced git commands and best practices (the degree includes classes on git basics)
- GitHub debugging: how to fix commits and delete accidentally uploaded data
- Documentation
- How to create Wikis
- Introduction to metadata
- Self-organizing tools, data and resources with a focus on data sovereignty
- Dataset management for NLP
- How to find and access datasets
- Data engineering for NLP and Speech Technology
- Working with metadata
- How to use and fine-tune pre-trained speech models
- Setting up a speech model from scratch
- [OPTIONAL] Managing Data Pipelines and Workflows
- Why do we need a data pipeline?
- Popular tools
- Airflow
### From project ideas to working demos
- Coordinate with the instructor at an appropriate juncture to workshop students' research project ideas
- Create a repository of reproducible, implementable and well-documented previous projects that other students can advance, restructure and improve on as their course submission
- Help students choose the right front-end tool for demonstrating their projects
- Programming support
- Reproducibility basics
- Helping students build a working showcase for their project
- Making Jupyter notebooks searchable by metadata
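
As a minimal sketch of what "searchable by metadata" in the last item could look like: a `.ipynb` file is JSON with a top-level `metadata` object, so a small script can build a keyword index over a folder of notebooks. The `keywords` field and the function names `load_notebook_keywords`, `build_index` and `search` are assumptions made for illustration, not an existing convention in the program's repositories.

```python
import json
from pathlib import Path


def load_notebook_keywords(path):
    """Return the lowercased keywords from a notebook's top-level metadata.

    Assumes a hypothetical "keywords" list stored under the "metadata" key.
    Notebooks without that field simply yield an empty set.
    """
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    keywords = notebook.get("metadata", {}).get("keywords", [])
    return {str(keyword).lower() for keyword in keywords}


def build_index(root):
    """Map each keyword to the list of notebook paths under root that declare it."""
    index = {}
    for path in sorted(Path(root).rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in path.parts:
            continue
        for keyword in load_notebook_keywords(path):
            index.setdefault(keyword, []).append(str(path))
    return index


def search(index, keyword):
    """Case-insensitive lookup of notebooks tagged with keyword."""
    return index.get(keyword.lower(), [])
```

With an index built once per repository (for example `index = build_index("projects")`, where `projects` is a placeholder folder name), `search(index, "asr")` would list every notebook tagged with that keyword.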
### Office Hours
- Weekly, 1 hour: since the classes are asynchronous, this can be by appointment
- Encourage students to come to DSI events (Coffee and Code, etc.)
- Offer the option of scheduling short meetings with facilitators to accommodate students in other timezones
### Potential Timeline
- Develop 2-3 hours of teaching materials (before the beginning of classes and during Weeks 1-3 of the course)
- Meet with students to workshop research ideas (mid-point of the course)
- Office hours and software carpentry appointments (final weeks of the course)
### Comments
- Meeting common university-wide expectations for what students should know when it comes to software carpentry
- Linguistics students take coursework across multiple departments (e.g., iSchool)
- Onboarding students and offering beginner-friendly modules on topics such as neural networks creates a least-common-denominator issue, which slows coursework progress
- 3 HLT tracks: online HLT, in-person professional HLT (regular master's + accelerated master's for undergrad seniors), PhD students
- "Multi-functional programming": differences in how students learn their programming skills (functional vs. optimal vs. organized)
### Course Specifics:
- Name: LING696G
- 7.5 weeks, asynchronous online
- Materials needed for January (Jan 10-Jan 31)
- Jupyter notebook (high familiarity), HPC access (Singularity-based ASR work), Docker (Mike spends his office hours debugging installation failures)
- Jan 10-20: office hours for HPC access, and general software installation issues
- Mid-Feb: a mini project, a speech system built for some natural language
- Mozilla Common Voice dataset for data, plus Mike's codebase
- same metadata, same format, same everything
- Last 2 weeks of classes: final project
- Other Courses
- HLT Professionalism
### Meeting with Gus (02/20)
Agenda
- HLT Professionalism
- Find alignment
- DataLab's existing workshop offers
- Cracking
- Graph search (BFS, DFS): talking through the approach
- Live coding and scenario
- Dynamic programming: edit distance
- The "homework" problem: an ML problem, and what elements to include
- Coding template
- Portfolio, Networking
- Any requests from HLT faculty
- Assess if there are overlapping materials and domain experts
- Software learning badges
- Experiment tracking
- Weights & Biases
- Comparing different runs of an ML model
- TensorBoard
- Ablation and architecture comparison
- End-to-end
- HLT Bootcamp
- Summer 2025 potentially
- Example from 2023 Software Carpentry
- Demonstration of completion adds favour to applicants
- Building on Software Carpentry: lean version, feels more HLT
- Expectations
- Labor
- Funding (?)
- Notebooks
- How to add, delete, edit cells
- Markdown vs code
- Onboarding students
- Software Carpentry + HLT bent
- Making something accessible to potential students
- Hugging Face
- QLoRA (decoder-only LLM)
- Guest lectures from faculty
- Project management
- Agile lite
- GitHub Projects
- Test for 508: test for project management
- Give this a think
- Portfolio creation
- Test for streamlit and portfolio: 582 for next fall
- This is a group project
- Teaching it with GitHub Projects
- https://uazhlt-ms-program.github.io/ling-582-course-blog/assignments/course-project
- Can add videos on Hugging Face
- Competency tests for self-evaluation