# Study Group Matcher Algorithm Overview
## Full Process Outline
- Create survey(s)
- Collect and collate information into a single `.csv` file (input data_)
- Create a config file
- Using config file and a models file, convert input data into compatible data structures
- handle pregrouping and existing groups
- **On remaining students, execute main group formation (considered the 'Core Matcher')**
- multi-partioning
- MCTS best-effort
- **execute post-processing**
- currently contains ALL singleton handling for race/gender
- Place final matches into csv, labeling students (in existing and new groups) by their unique group ID
- Using YAMM or similar templated emailer, email students
- if applicable, includes scheduling help (meeting day/time that everyone in group checked off)
- LATER: run reassignment round(s)
## Individual Steps
### Create a Survey
- emphasis on checkbox or radio-button selections
- current survey logic
- Consent details
- Do you want any group?
- if so: proceed
- else: immediate END
- Do you have a group? *{Note: want to change to make all students fill out info-collection survey questions, even if they have existing groups.}*
- if so:
- name their emails (error prone but SIDs are privileged.)
- select if your group is open to additions or not. {Ana's feature in pregrouping.}
- END
- if not: proceed
- Fill out required matching questions {Note: want to ensure relevant questions are required.}
- Fill out optional matching questions
- Other student information
### Collate information
- simple `pandas` csv merge on students' SID (considered the single unique identifier where possible. Else use email.)
- Used for demographic/preferences info combining.
### Create a config file
- Options:
- Specify ordering of 'hard' (partition) criteria.
- Specify ordering of 'best-effort' criteria.
- Specify mix/max group size.
- Implement bucketization for variables as needed (ex: treat junior/senior transfers as same for matching despite having separate options in survey.)
- Create map of survey questions -> Column objects (questions with associated types or text/email/SID/etc, along with associated bucketizing function pointers.)
- Purpose:
- specify the way the group matcher will work in terms of sizes, criteria considered, amount of flexiblity, etc.
- Also includes pre-processing (pair up 'open' existing groups, etc.) and post-processing code (singleton handling).
### Core Matching Algorithm (Summary of `partitioning.py`)
Note that 'student' below can mean multiple students if they were pre-paired or combined in some way. These are 'Nodes'.
#### Multi-Partitioning (MP) (Function `dfs`)
Suppose for simplicity that we have 1000 students, and 3 partitioning criteria. For each criteria, split up current students into partitions. Suppose:
- C1 (checkbox): remoteness (remote only, in-person only, or both).
- C2 (radio): year (year 1, 2, 3, 4, or transfer).
- C3 (checkbox): meeting day (any of the 7 days they can meet)
Suppose 100 students say remote, 700 say in-person only, 200 say both. Form 2 groups; remote group size 300, in-person group size 900. For 'both'-selecting students, create copies, since they can in-theory be placed into either type. Bookkeeping comes later to make sure no duplicate students exist.
Repeat this for other 2 criteria, creating copies as needed. If a student is very flexible, they may have many copies by the end of MP!
Groups formed will look like:
(40, (remote only, year 1, monday))
(20, (remote only, year 2, tuesday))
etc. etc. for the various combinations.
##### Stopping partitioning
Small partitions may need to stop early; otherwise, it violates minimum group size. For a given partition level, if a subgroup cannot be split without each one being >= min_partition_size (set in config), then stop. These groups may look like (3, (remote only, year 4, sunday)).
#### Winnowing *{Note: maybe have a ranking for days? Such as: Must select all days, assign numbers 1-7 to days, with 1 being top choice and 7 being veto. Can assign multiple days the same number.}*
Notice by the end of MP, we may have many duplicate/copied students from checkbox questions. We need to restrict what they actually belong to. Doing so requires eliminating copies intelligently. Considerations:
- handle smallest subgroups first: too heavily prioritizes uncommon preferences, penalizes accomodating students.
- Ex: suppose top 2 partitioning criteria match for some students, subgroup size is 60.
- student A wants M, but says they can make M, T, W and Sunday in theory.
- in subgroup, only students B-D say they can only make Sunday. Remaining students don't say Sunday at all, and min group size is 4.
- RESULT: Student A may have preferred M, T, W but now gets Sunday simply because students D onwards didn't say Sunday (for the sake of argument, suppose they each selected a single day to "ensure" they get their top choice.)
- handle largest subgroups first: uncommon/'fringe' preferences cannot be matched well. Doesn't allow 'pull-in' of students' copies from large to smaller groups to even sizes out. Empirically observed.
- handle most flexible students first: unable to do this algorithmically; coalescing is done at subgroup level.
We strike a balance by starting with a window size (say, group size [MIN, MAX] from config). Iterate through subgroups while any unassigned students remain:
- delete already assigned students.
- out of all subgroups, select the smallest one where size fits in window.
- if such group exists:
- set each student in there to assigned
- remove those students from matching consideration
- append students to final output list of groups.
- if such group doesn't exist:
- this means window was too small to find groups; make it wider on both sides (make min smaller, make max bigger). This now will include more subgroups, especially larger ones. {Note: this is where we enforce consideration of larger subgroups later, only once smaller subgroups are matched.} *{Note: Can tune window sizes with hyperparameters.}*
- rerun with larger window
- widen window in 3 total phases, until all students in groups.
#### Bottom-Up Merging [del] (Size Constraint Resolution)
Solidification of groups causes copies to be removed; this means some subgroups may now violate size constraints (not have enough students to form a group within [MIN, MAX]).
The solution is to find closest neighbors in tree-structure and form larger groups (maintain maximum similarity).
### Post-Processing *{Note: Defines how singleton criteria get enforced. This is the current algo state, not desired goal.}*
- For each final group in output list, form map of counts between races and genders.
- If existing group, we don't intervene (skip)
- If for some reason, match size is only 2, then combine to form groups of 4.
- Handle special case (existing solos) {Note: can be fixed by having ALL students fill out full survey}:
- in some situations, students miscommunicated in existing groups, and so ended up alone (ex: everyone they named didn't fill out the form.)
- Here, we only have race/gender, so we use this alone to form a match. These solos end up with other solos.
- Pair up hispanic students
- in each final group, if there's only 1 hispanic student, add that group to a list.
- from this list of singleton-Hispanic groups, move 1 Hispanic student from one group to another, converting 1,1 into 0,2. Repeat until all Hispanic students are paired. Odd case not handled.
- Pair up black students (same as above)
- Pair up female students (same as above) **{Note: the criteria that comes last is *most strongly* enforced. Therefore in this case, it is guaranteed that at MAX 1 group has a solo female.}**
- Compute match depths for adjusted groups to aid in emailing.
- incredibly non-extensible, manual/hardcoded segments of code here :(
New approach:
- pair up all demographics at some tunable depth of partition.
- Do in some reasonable order of reverse based on size.
- don't want specific ways of allowing/disallowing gender/race pairings
- can compute various freqs within subgroup; more priority to lower freq categories, which doesn't make this connection so explicit
- use the data to determine what a minority "actually" is
- issue is whether we can assume that forming mixed-race pairings gives same comfort as same race, etc.
- approach: won't do cross-race matching. do hard-hard (HH) partitions first (maybe not scheduling, but something else like bucketed year). after HH partitions, in every subgroup partition, do the pairings, and do so by looking at size of each demographic group. get frequency count statistics and go in reverse order. make pairs. if no pair exists in a group, can't handle this and they're a singleton by circumstance. if 3, then make a triplet. aka, handle all odds except 1 by allowing a triplet.
- can extend by considering (race, gender) tuple equality (both features) on first pass, then relax constraint to being one of the same group.
- but, start with strict equality; asian male == asian male, and nobody else. then once done for all tuples in a set of tuples, relax constraint to allow asian male == asian female, black male == hispanic male, etc. or something
- complexity worth? might need to try.
- have paired people up, one person's preferences will be overwritten by another's. decide the scheme ahead of time, to make sure the characteristics of a pair are well-selected. can do this for non-schedule based criteria.
- suppose partitions for year and timezone.
- make hard hard partitions on year.
- then add question to intake survey asking 'do you consider yourself organizer or leader, would you like to initiate or be reached out to.'
- try to make sure each group has at least one leader
- pair people up on demographics, to try and make sure each pair has leader and organizer. -> this would be best-effort, *try* to make sure every group has someone who self-identifies as leader.
- can adjust questions also to be 'softer'.
- give a list of office hours the class is hosting, and they select their preferred hw party and OH. you all go to HW party at some time, etc.
- can do similar thing for discussion
- differentiate; going to do optimization, but new: HH partitions, and also do the pairings as hard constraints (with iterative relaxation on tuple components).
- writing algorithm where we can choose what category each constraint is [done as current implementation]
- action item: write up LP optimization-based algorithm plan [start with CATME]
- linear program with configurable weights for everything (hard hard = massive weight, still fulfulls goal)
## Paradigm-shift ideas
- ask 2 questions. first: what is your committment level to this group? scale of 1 to 4 (1 = casual maybe, 4 = serious attendance-based approach) second: between 2 and 5 global % in the class, what penalty are you personally willing to accept if you personally do not meet your group's desired attendance target? Both can be used for matching in a weighted way.
- We generate a spreadsheet of attendances, a template thing that each group can fill out. Each group must coordinate and within 1 week of getting their first group, share an agreed upon 'structure' (structure includes a meeting frequency, an agreed-upon penalty for not following through like 2 % in the class under attendance/participation, etc., and any other things like 'slip days', permissible count of missed meetings before penalty, whatever. students have flexibility for their own group, and must mutually discuss.
- structure also includes a designated leader who each meeting will do the following: 1) maintain the spreadsheet, 2) take attendance, and 3) generally coordinate meeting places and times. This is also a good way to coordinate roles/contributions.
- Spreadsheet may include a simple midterms/exams timeline, where each person notes their weeks where they may be busier than usual due to exams. High-impact weeks can be skipped, based on group preference.
- The leader may be the person from the group who answered leader in the survey question for 'leader vs member'. That question should be clear that being a leader may require an additional 0.5-1 hours of work each week to communicate and make things happen.
- Immediately after this initial meeting, we could(?) have a "reassignment" round. Why? suppose students had this spreadsheet-meeting and realized the people wouldn't work for them because they can't agree on some things. They would then fill out a smaller version of the initial survey specifically naming what didn't work for them, and we try to match them again. This may not be worth the effort, publicizing this may encourage 'group-shopping'.
- Ideally, each class has a study-groups manager who may or may not be the same person that runs groups for the class (aka, software vs PM roles may be same or split.) They are the staff member contact (SC) for issues groups have, NOT piazza. This means when someone complains about communication, etc. the SC has their spreadsheet and survey stuff available to check. Aka, is the group casual or serious? What was the agreed upon penalty for missing meetings? How long has it been?
- This is also the main contact for other issues; feeling left out, wanting reassignment, needing to discuss changes to group penalty, etc. Balance between internal discussion and SC role as mediator (lower hours = SC does only the most critical stuff.)
- Perhaps 3 weeks after initial assignment, and 1 week before planned reassignment, we can help describe how to organize a 'groups meetup'. It's like each group shows up with their new friends, and meets other groups. This maybe(?) helps any students who are planning on asking for reassignment actually be concrete and meet others who they specifically might enjoy working with. Could also help existing solid groups merge and form something larger. Might be hard to make happen logistically, reliably. Might not fit with above idea.
- Have a dedicated room booking that each class should make, where students can use that room/space at given times for the explicit purpose of a class' study groups. Might not fit every group, but just as an option in case location is an issue.
- Have the class form specific study plans for each week that groups can follow. Can dual as a general study plan also for non-study-group students.
- may require varying amounts of work to prep for
- I'm thinking of something like the 16B online practice problems, or like a CSM worksheet / guerilla section problems. So that the meeting agenda has some pre-set content and it's not purely for conceptual questions.
- TAs can rotate 1-1 15/30-min meetings each week with study groups to hold basically a small-group session with different groups, on some sign-up basis. Helps give the process a more official feel.
- con: leaves out students not in groups, not fair probably so is there a way this could be adapted?
- We may want to leverage Piazza to be even more clear about who should be using these groups (almost everyone for intro classes would benefit from such groups and so we should be encouraging students to sign up. BUT we should be clear that signing up is only encouraged if you have discpline to set standards and then meet them.)
Survey Changes
- We could adapt the existing question which asks for what classes they're taking, and be more precise in saying which classes they'd be open to studying for with this new group. Then that can be used as a best-effort metric to try and maximize overlap, if it comes down to that level of precision.
- We need to be sure the survey collects preferences info from ALL students, including existing groups (open or closed). This is good for data collection but also helps when combining pairs/triplets of existing groups into larger ones.
- Shorten survey by removing questions that we can't powerfully/effectively use at the moment. ex: discussion section times.
- personality-based questions
- Divija mentioned she wanted to think about including some of these on the intake survey; the difficulty is in algorithmically using the results, because we don't know off-hand (for example) if we expect students with similar or different responses to certain questions to match better. This makes programming harder. We could encode a simple metric of 1) try to match similar vs 2) try to match different for a given question, but there's nuance to implement.
### Emailing Groups
From the output csv, use YAMM to auto-template email-sending. [done, tested and functional. no complaints]