
# COURSE SCRAPER

- Requirements: https://hackmd.io/ZwKIk2c8SkOFmVTuYF68ww
- Meeting record: https://hackmd.io/@FiO-Internal/Hy5qXh1Ou
- Trello: https://trello.com/b/QDl5vQR3/ntpu-course-recommendation

## Current Instruction

First build a minimal prototype of the "Tagging System" and look for the overlap between two sides:

1. (Already done by you) Tagging of NTPU courses
2. (To do) Tagging of students

Next: recommend courses/departments to students.
Only after that: scrape data from other schools.

I discussed this issue with TW Jack earlier. Please proceed as follows:

1. Ivan, please provide: the fio app ids of the "Learning Curve" related apps that have data @IvanTsou
2. Eason, please provide: the fio app ids of the 靜心高中 Learning Curve related apps that have data @EasonC13

Please provide these so that Tony can carry out the follow-up work based on this data. Eason, please also show Tony how to fetch the on-chain content of a given fio app id through the FiO API, for use in the tagging system.

- access highschool data curl command:
    - [prototype api docs for getting data](https://dev.fio.one/api-docs/bif-api/#/)
    - `curl -X GET "https://dev.fio.one/my-apps/772/data" -H "accept: application/json" -H "X-API-KEY: kCXRqaJxiNyuj5zbVgYLidGtBEMtp8ns"`

## Abstract

According to past discussions, the process is divided into:

- Collection of information: curriculum information of designated schools (one or several), FiO Learning Curve
- Discussing the relevance of different Tags/Hashes
- Tagging: tags created by students themselves, and tags generated by our system

## Purpose

Create reusable data collection, extraction, and preprocessing for NTPU, so that the preprocessing step is abstracted away and the data format stays standardized.

**Tasks**

1. Establish correlations between different tags or hashes
2. Build the database of the default (main) tag classification and the database of students' own tags (for reference) separately
3. Set different weights for different tags
4. In the initial stage, collect keywords from each school's curriculum syllabus, department establishment, class names, etc. (both Chinese and English)
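
Tasks 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `match_score`, the two source weights, and the example tags are all hypothetical and not part of the existing system. The sketch keeps system-generated tags and student-entered tags in separate mappings and scales each source by its own weight.

```python
from typing import Dict

# Hypothetical source weights: system tags count more than self-entered tags.
SYSTEM_WEIGHT = 1.0
STUDENT_WEIGHT = 0.5


def match_score(
    course_tags: Dict[str, float],
    system_tags: Dict[str, float],
    student_tags: Dict[str, float],
) -> float:
    """Score how well a student's tags overlap with a course's tags.

    Each mapping is tag -> weight. System-generated and student-entered
    tags are stored separately and scaled by their own source weight.
    """
    score = 0.0
    for tag, course_weight in course_tags.items():
        score += SYSTEM_WEIGHT * system_tags.get(tag, 0.0) * course_weight
        score += STUDENT_WEIGHT * student_tags.get(tag, 0.0) * course_weight
    return score


# Example data, made up for illustration.
course = {"programming": 1.0, "statistics": 0.5}
system = {"programming": 0.8}
student = {"statistics": 1.0, "music": 1.0}

print(match_score(course, system, student))
```

Tags that appear on only one side (such as `music` here) contribute nothing, so the score reflects only the overlap between the two sides.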
## Technologies

- Trello
- HackMD
- Excel
- SqlDBM app
- Spyder
- Jupyter Notebook
- D3.js
- Anaconda
- FastAPI
- SQL
- Sqlite3
- MySQL
- Docker compose
- valentina studio

## Bare Bones course recommendation

![](https://i.imgur.com/iX330ti.png)

- /index (GET request):
    - returns the "Adhoc course recommendation" message
- /docs (GET request): <- should be pretty intuitive to use as a GUI
    - shows the FastAPI docs interface for trying out the APIs
- /manual_match/{user_email} (GET request)
    - input the user email as the query
    - if no email matches, returns `{"error":"no data"}`
    - simple greedy algorithm that matches the student's top TF-IDF words against those of the university departments
- /highschool/{user_email} (GET request)
    - input {all} to match all highschool data, as clusters, against the university clusters (K-means is unsupervised, so you need to determine what each cluster means in terms of its feature vectors)
    - when queried by user email, performs TF-IDF weighted clustering of the student's words and matches the result against the university's TF-IDF weighted clusters
    - if no email matches, returns `{"error":"no data"}`
- Python libraries:
    - requests, urllib, time, re, os, bs4, chardet, lxml
    - sqlite3, pymysql, pandas, numpy, sqlalchemy
    - sklearn.preprocessing
    - custom built functions
    - [jieba for chinese word parsing](https://investigate.ai/text-analysis/using-tf-idf-with-chinese/)
    - [Online marketing productivity and analysis tools](https://advertools.readthedocs.io/en/master/advertools.stopwords.html)

## Entity Relational Model

**DB Schema**

- design a standardized database schema

**Data Base Table Description** <- currently in production DB

- chinese_course_description_bulletins_tb:
    - description bulletins for each individual course
- chinese_course_prerequisites_tb:
    - prerequisites for taking each course, written in Chinese
- chinese_query_guide_tb:
    - superficial information from individual course descriptions
- chinese_tec_tb:
    - aggregation of all instructors' weekly course/office hours schedules
- c_general_courses_tb:
    - aggregation of all course catalogs from all departments, in Chinese
- english_course_prerequisites_tb:
    - prerequisites of courses, written in English
- course_supposed_for_elective_required_tb:
    - course serial number mapped to its target major group and whether it is elective or not
- e_general_courses_tb:
    - aggregation of all department catalogs, in English
- e_general_remarks_tb:
    - remarks only available in the English department catalogs
- identity_course_limitations_tb:
    - shows which kind of person each course serial No. is open to
- major_course_limitations_tb:
    - shows which majors each course serial No. is open to

---

[adhoc ERM excel](https://drive.google.com/file/d/1WYtk1c6joR-sdxkx2dEcZBW6DumBhCY0/view?usp=sharing)
[sample normalization](https://docs.google.com/spreadsheets/d/1eaE2zPFBxH_4kCqw9tOmS09IOcJ4yUXFB_i2IOoCk68/edit?usp=sharing)
[drawio diagram](https://drive.google.com/file/d/1XYvdfVLdyxfbzNHI4w4yPxPWk71x7d8k/view?usp=sharing)
[SqlDBM env](https://app.sqldbm.com/MySQL/Edit/p179701/#)
[Sample pdf that can be parsed with webservice to excel](https://drive.google.com/file/d/1nY_5KglaFIlY8-txGKmghgQYgCJo6PPG/view?usp=sharing)

![](https://i.imgur.com/DtyzWEd.png)

- Tables that have more general data fields

![](https://i.imgur.com/IOkQllH.png)

- possible normalization plans

**Basic recommendation**

- matching different clusters of tags to each other
    - the GDSC club may be in cluster 1, "programming" <- from NTPU tag extraction
    - a student tagged with programming clubs is also in cluster 1, "programming"
- High school student tag extraction
- Tag Matching
    - match the top 100 words by weighted TF-IDF of the queried student and the university department
    - or match highschool student clusters (manually labeled) to university information clusters (also manually labeled)
- Provide Data
    - High School Learning Curve (text).
- Process the data
    - TFIDF
    - Keyword

**Deployment and Packing of NTPU recommendation**

- use FastAPI, then provide the API documentation to prototype and test the recommendation.
- only a prototype at this stage
    - will need repeated analysis, improvement, and validation against real data before production use
    - FastAPI is written in Python, so further project design and requirements are needed on how to integrate it with the TMS Learning Curve

**Where to scrape first?**

NTPU:
[chinese undergraduate courses](https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_all.queryByReOp?qCollege=&qDept=&qDept2=GU15&qkind=%A5%B2%BF%EF%AD%D7&qYear=&qTerm=&qGrade=&qClass=&week=&seq1=A&seq2=M)
[english undergraduate courses](https://sea.cc.ntpu.edu.tw/pls/dev_stud/COURSE_QUERY_ENG.queryByReOp?qCollege=&qDept=&qDept2=GU15&qkind=%A5%B2%BF%EF%AD%D7&qYear=&qTerm=1&qGrade=&qClass=&week=&seq1=A&seq2=M)

- Faculty diversity:
    - NTPU, National Taiwan University, National Chengchi University, National Tsing Hua University, Jiaotong University, Normal University
    - Taike, Yunke, Pingke

## 2021/07/27 Course Recommendation (preprocessing) note

- Characteristics of the student <=> Characteristics of the course
    - (Characteristics of the student) A student record showing CP1, APAC implies good logic => (Characteristics of the course) probability, algorithms
- Recommended student departments
    - Method: TF-IDF
    - Simple way: find the representative words of the department through its courses
    - Difficult way: find the representative words of the department through the department's introduction
    - The easiest way currently available:
        - step 1. cut terms / Chinese word tokenization / segmentation
        - step 2. word frequency
        - step 3. remove auxiliary words

NOTICE

- Do not mix in the courses of the graduate school and doctoral programs
    - Serial numbers that start with N must be removed first; those that start with U are kept
- First understand the rules of the serial number or course number
    - M: master
    - U: undergraduate
    - N: Bachelor of Advanced Studies
    - P: Master's in-service special class
    - qYear=109&qTerm=2 -> year & semester
- Curriculum
    - PDF: courses that the department can offer
    - Checked on the webpage: courses actually offered in the semester
    - Mainly use what is "found on the website" (courses opened in that year)
- Do we need to recommend at the "school" level, or only at the "department" level?
    - Example: CS vs CSIE
    - "School" and "Department" should be kept separately
- Conclusion: "School" and "Department" should both be saved **(Important)**
    - Department: keep the fields "English name of department", "English abbreviation of department", "Chinese name of department", and "Chinese abbreviation of department"
    - Purpose: using "English abbreviations" in cross-school data merges is less prone to errors
- no masters, U stands for undergraduate
- separate tables for U and for M, N
- scrape only courses that are active?
- if in English, attach the department to the course name
- make every table specific to its subject
- alternative words for course and department
- TF-IDF scoring for word feature importance
- course description / history, e.g. the algorithms course may have had different course numbers over time?
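
The three steps in these notes (segment, count, drop auxiliary words), combined with the rule of keeping only serial numbers that start with U, could be sketched as below. This is a minimal sketch under several assumptions, not the project's actual pipeline: it uses English course names with a regex tokenizer (Chinese text would need a segmenter such as jieba instead), a tiny hypothetical stop-word list, and plain TF-IDF with `idf = log(N / df)`. The names `department_keywords`, `is_undergraduate`, and the sample rows are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stop-word list; a real run would use a fuller list.
STOP_WORDS = {"to", "of", "and", "the", "in", "for"}


def is_undergraduate(serial: str) -> bool:
    """Keep only serial numbers starting with U (undergraduate)."""
    return serial.upper().startswith("U")


def tokenize(text: str) -> List[str]:
    """Steps 1 and 3: split into lowercase words, drop auxiliary words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def department_keywords(
    courses: List[Tuple[str, str, str]], top_n: int = 3
) -> Dict[str, List[str]]:
    """Rank each department's words by TF-IDF.

    `courses` holds (serial, department, course name) rows. All
    undergraduate course names of one department form one document.
    """
    docs: Dict[str, Counter] = {}
    for serial, dept, name in courses:
        if is_undergraduate(serial):
            # Step 2: word frequency per department.
            docs.setdefault(dept, Counter()).update(tokenize(name))

    # Document frequency: in how many departments each word appears.
    doc_freq = Counter(word for counts in docs.values() for word in counts)
    n_docs = len(docs)

    result: Dict[str, List[str]] = {}
    for dept, counts in docs.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[dept] = ranked[:top_n]
    return result


# Made-up rows for illustration only.
rows = [
    ("U1001", "CSIE", "Introduction to Algorithms"),
    ("U1002", "CSIE", "Introduction to Probability"),
    ("U2001", "STAT", "Introduction to Probability"),
    ("U2002", "STAT", "Regression Analysis"),
    ("M3001", "CSIE", "Advanced Algorithms"),  # master's course, filtered out
    ("N4001", "STAT", "Statistics for Continuing Education"),  # filtered out
]
keywords = department_keywords(rows, top_n=1)
print(keywords)
```

Words shared by every department (here `introduction` and `probability`) get an IDF of zero, so only words that distinguish a department rise to the top.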
- http://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_eng.query_frame?flag=6

[Chinese course sample](https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_all.queryByAllConditions?seq1=A&qCollege=%AAk%AB%DF%BE%C7%B0%7C&qYear=109&qTerm=2)
[English course catalogs](https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_eng.query_frame?flag=1)
[TF IDF feature importance preprocess](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)
[natural language optimized DB](https://towardsdatascience.com/return-clauses-in-natural-language-queries-74a4a2fd53e6)
[hacker news recommendation example](http://www.righto.com/2013/11/how-hacker-news-ranking-really-works.html?m=1)

## 2021/03/12 Tagging System & AI

- The relevance of different Tags or Hashes
    - What the system needs (collected secretly)
        - Collectible data
        - Uncollectible data
    - The student's own tags
        - Save separately, calculate separately
        - Use as reference data, and use weights to separate them
    - Set different weights for different tags
        - Each entity has its own tags -> database 1
        - Tags entered by each user -> database 2
        - The AI's tags should be constantly updated
- School vs. student keywords
    - Develop a basic guideline for schools and teachers on assigning tags
    - Course name, students, suggest keywords, let the other party choose
        - Method 1: Find a school teacher
        - Method 2: Parse the course outline or department establishment, course name (this method is more feasible)
    - Behavior cannot be faked: succession, activity service, participation time and place
- Target
    - Departments that can be recommended to students
    - Course selection system of each school -> recommended courses, course of study
    - School, department, outline
- Tag classification
    - System
    - Custom
    - First collect both Chinese and English -> then translate into Chinese
- Allow schools to upload "finished orders"; FiO provides the format
    - Name of the event party, introduction, experience
- Check entity, how to get their tag
- Make a small prototype first

## 2021/03/03 About Tagging System

About Tagging System (Learning Curve + AI, discussed w/ Hana):

- Suggestions before the meeting
    1. List "specific" (sub)goals. "Specific" means: what do you want to see? It cannot be too general; a quantitative definition is best
    2. For each target, list all (possibly) required data fields and attributes
- Meeting minutes
    - The purposes that must be understood:
        - Student learning pattern
        - Student's future learning direction
        - Student learning behavior
        - Student's learning preferences
        - The strengths and weaknesses of students
    - After analysis, give students suggestions for learning
        - Achieved through the Tagging system
    - Points to consider:
        - The analysis subjects need to be divided into batches (samples from different periods and different batches will differ)
        - Questions to add during form design:
            - Preference for activities
            - Will you participate again
            - How to obtain the background of the participants? -> Related to the logic of horizontal analysis
    - First develop a preliminary tagging model
        - Various types of descriptions (confirm the categories, that is, the results of personality analysis)
        - Keywords used
        - How strong the student's intention is
        - What each department would like to see
        - Appeals and cross-references of various departments to students (LMS+LRS)
    - Preliminary conception (there are three roles: activity party, student, school)
        - Collect data
            - Activity Party (Department)
                - Tag classification (department, course name orientation)
                - Theme of the event
                - Keyword
                - Narrative
            - Student
                - Degree of preference
                - Experience
                - Like/dislike points
                - Rating (1-5 stars)
                - Recommended level
                - Will you participate again
                - Cross-validation option (to avoid data errors)
                    - satisfaction level
            - School
                - Upload learning journey
                - Upload of student aptitude test (certified)

## FiO NTPU Course Scraper

**Abstract**

We hope to build a system that looks at high school students' learning history and gives advice on choosing a department. For example, many people may not know that departments such as a "Drama Therapy Department" exist abroad. We hope that once this system has read a student's learning history (activities joined, books read, camps attended, etc.), it can suggest that the student may be suited to studying drama therapy. To do this, the first thing we need is the undergraduate course data of National Taipei University (NTPU).
**Background Information**

- Very early experiment
- Relations: Ivan, Joe, Karl
- Why use NTPU:
    - Because the collaborating professor is an NTPU professor
    - Because NTPU's department-level courses are easier to scrape

**Caution**

- Because it is still a very early experiment, data storage needs to stay flexible

**References**

[NTPU course query system](https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_all.query_frame?flag=1)

- [Query example](https://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query_all.queryByAllConditions?seq1=A&qCollege=%AAk%AB%DF%BE%C7%B0%7C&qYear=109&qTerm=2)
    - key list
        - qEdu:
        - qCollege:
        - qdept:
        - qYear: 109
        - qTerm: 2
        - qGrade:
        - qClass: 應修系級 (the department and class required to take the course)
        - cour:
        - teach:
        - qMemo:
        - week:
        - seq1: A
        - seq2: M

**Index Definition**

- Course serial number
    - In principle it is stable across years. A few courses have the same serial number but different Chinese course names (the English name is the same)
    - May have to pull out the data and observe
- Course department
    - Currently the abbreviation; it will later be changed to the full name (to make a connection).
        - Law Section, Faculty of Law -> keep
        - Language -> Language Center
        - to be confirmed
    - By visual inspection, consistent with the terms used by the Ministry of Education
    - Typed in by hand, but still follows regular patterns
    - Are there any strange course names? (confirm with group by)
- Course Requirements
    - Goes together with the course's required/elective status; this may be more troublesome, since it depends on which row
    - Check for special conditions
        - Business Management Department 1A 2B
            - 1: Grade
            - A: Group, split by number of students
        - If there is a "department" at the end, it should be the course taught by the department (guess)
        - Others: General Education Center, Language Center, etc.
        - Special situation: Master of enterprise, master of state-owned enterprise, summer school, business school 1, Taipei University of Science and Technology??? U2228????? Taipei University of Science and Technology? [link](http://sea.cc.ntpu.edu.tw/pls/dev_stud/course_query.queryGuide?g_serial=U2228&g_year=109&g_term=2&show_info=part)

![](https://i.imgur.com/GnBCV7w.jpg)

- Enrollment limit, number enrolled: need to scrape
- Course Name
    - Scrape everything, but mainly use the English name: convert to lowercase and remove blanks and symbols
    - Different departments offer separate calculus courses whose content is actually the same (the English course name is the same); conversely, the course name may be the same while the content differs (programming language: Python / C)
    - same name, different content?

## OTHER

- If the enrolling department differs from the department offering the course, the course may be a non-core subject and get a low weight
- Combine different introductions of the same English course name?
- Prerequisite (blocking) restrictions: not handled

![](https://i.imgur.com/6qVJXdC.png)

### Githubs

- [pdf to api](https://github.com/pdftables/python-pdftables-api)
- [sqlite-web](https://github.com/coleifer/sqlite-web)
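
Going back to the Course Name rule under Index Definition (use the English name, lowercased, with blanks and symbols removed), a minimal sketch could look like the following. The function names and sample rows are hypothetical, and keeping only ASCII letters and digits is an assumption, since the notes do not say exactly which characters count as symbols.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_course_name(name: str) -> str:
    """Lowercase the name and strip blanks and symbols.

    Only ASCII letters and digits are kept (an assumption; the notes
    do not define which characters count as symbols).
    """
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_by_name(courses: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each normalized English name to the serial numbers sharing it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for serial, english_name in courses:
        groups[normalize_course_name(english_name)].append(serial)
    return dict(groups)


# Made-up serial numbers and names for illustration.
sample = [
    ("U0001", "Calculus (I)"),
    ("U0002", "CALCULUS  (I)"),
    ("U0003", "Programming Language - Python"),
]
groups = group_by_name(sample)
print(groups)
```

Grouping by the normalized name makes it easy to list courses that share an English name across departments, which is where the "same name, different content?" question would have to be checked by hand.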
