# Database Demo **Recommended Leanring Resources** Materials provided by Prof Chiu **Outline** 1. Database Background 2. Raw Data Overview 3. Normalize Database Using Python 4. Normalization Check 5. Connect to SQLite 6. Check SQLite Connection **Color Notation** :::warning πŸ€” **Think** ::: :::danger ‼️ **Important Concept** ::: :::info πŸ“ **Case Study, Scenarios** ::: :::success πŸ€– **AI Integration** ::: --- ## 01 Database Background **The board** asked the executive leaders to propose methods to improve hospital profit by increasing revenue or reducing the cost. **The executive leaders** of the hospital follow a data-driven approach and decide to find out the proposals through the hospital management system. **As a data analyst**, you receive the following tables and you notice that the tables are not normalized. Thus, you need to first normaliz the database and provide a dashboard that could support the decision-making. **Repo:** https://github.com/PoYaSharonLin/PDDS/tree/Dash-Demo-Dev ## 02 Raw Data Overview **Data:** https://github.com/PoYaSharonLin/PDDS/blob/Dash-Demo-Dev/src/system_data.xlsx **Patient Table** ``` DataFrame from sheet: patient PatientID PatientName Age Gender BloodType PrimaryDiagnosis AssignedDoctors TotalExaminations LastVisitDate HospitalStayDays ContactNumber 0 P001 John Smith 45 Male O+ Hypertension D001, D003 5 2025-06-15 3 555-123-4567 1 P002 Mary Johnson 32 Female A- Diabetes D002 3 2025-07-01 1 555-987-6543 2 P003 Alex Brown 60 Male B+ Heart Disease D001, D004 8 2025-07-08 5 555-456-7890 3 P004 Sarah Davis 28 Female AB+ Asthma D003 2 2025-05-20 0 555-321-0987 4 P005 Emma Wilson 50 Female O- Arthritis D002, D004 4 2025-06-30 2 555-654-3210 ``` **Doctor Table** ``` DataFrame from sheet: doctor DoctorID DoctorName Specialty YearsExperience AssignedPatients ExaminationsOrdered AvgPatientAge Department ContactEmail 0 D001 Dr. Alice Lee Cardiology 12 P001, P003 10 52.5 Heart alice.lee@hospital.com 1 D002 Dr. Bob Carter Endocrynology 8 P002, P005 6 41.0 General bob.carter@hospital.com 2 D003 Dr. Clara Wong Pulmonology 15 P001, P004 4 36.5 Lung clara.wong@hospital.com 3 D004 Dr. David Kim Orthopedics 10 P003, P005 5 55.0 Bones david.kim@hospital.com ``` **Examination Table** ``` DataFrame from sheet: examination ExamID ExamType PatientIDs PatientNames DoctorID DoctorName ExamDate ExamCost ResultStatus DurationMinutes 0 E001 Blood Test P001, P002 John Smith, Mary Johnson D001 Dr. Alice Lee 2025-06-10 100 Normal, Normal 30 1 E002 ECG P003 Alex Brown D001 Dr. Alice Lee 2025-07-05 150 Abnormal 45 2 E003 X-Ray P004, P001 Sarah Davis, John Smith D003 Dr. Clara Wong 2025-05-18 200 Normal, Normal 20 3 E004 Blood Test P002, P005 Mary Johnson, Emma Wilson D002 Dr. Bob Carter 2025-06-28 100 Abnormal, Pending 30 4 E005 MRI P005, P003 Emma Wilson, Alex Brown D004 Dr. David Kim 2025-06-29 600 Pending, Pending 60 5 E006 Ultrasound P001 John Smith D003 Dr. Clara Wong 2025-06-12 120 Normal 40 6 E007 Blood Test P003, P004, P005 Alex Brown, Sarah Davis, Emma Wilson D001 Dr. Alice Lee 2025-07-07 150 Normal, Normal, Normal 30 ``` **Table Relationship** - 1 patient can have many doctors, and 1 doctor can have many patients (many-to-many) - 1 patient can have many examinations, and 1 examination can involve many patients (many-to-many) - 1 doctor can order many examinations, and 1 examination can be ordered by many doctors (many-to-many). :::warning πŸ€” **What are some issues of the current tables?** :::: spoiler **πŸ₯Έ Ans** 1. Many-to-many relation needs a separate table 2. Should be in attoic value e.g. PatientName, Date etc. ::: ## 03 Normalize Tables Using Python **Scipt:** ```bash python3 preprocessing/normalization.py ``` **Code:** https://github.com/PoYaSharonLin/PDDS/blob/Dash-Demo-Dev/preprocessing/normalization.py ### 1. Many-to-many relation needs a separate table | Table Name | Primary Key (PK) | | -------- | -------- | | patient | patientID | | patient_doctor | patientID & doctorID | **Code** ```python= patient = sheets["patient"].drop(columns=['AssignedDoctors']).copy() patient_doctor = sheets["patient"][["PatientID", "AssignedDoctors"]].copy() patient_doctor["AssignedDoctors"] = patient_doctor["AssignedDoctors"].str.split(",").apply(lambda id_list: [id.strip() for id in id_list]) # noqa: E501 # .str.split(",") # input: 'P001, P003' # output: ['P001', ' P003'] # apply(lambda id_list: [id.strip() for id in id_list]) # apply a function to id_list, what the function does is defined in [] # input: ['P001', ' P003'] # output: ['P001', 'P003'] patient_doctor = patient_doctor.explode("AssignedDoctors") # input: D001 with ['P001', 'P003'] # output: D001, P001 and D001, P003 patient_doctor = patient_doctor.reset_index(drop=True) ``` **Output** ``` print(patient_doctor) PatientID AssignedDoctors 0 P001 D001 1 P001 D003 2 P002 D002 3 P003 D001 4 P003 D004 5 P004 D003 6 P005 D002 7 P005 D004 ``` **ER Model** ![Untitled](https://hackmd.io/_uploads/S11oTM9Ile.png) ::::spoiler ER Model Code ```mermaid= erDiagram Patient ||--o{ Patient_Doctor : has Doctor ||--o{ Patient_Doctor : assigned_to Patient ||--o{ Patient_Examination : undergoes Examination ||--o{ Patient_Examination : includes Doctor ||--o{ Doctor_Examination : conducts Examination ||--o{ Doctor_Examination : performed_by Patient { string PatientID PK string PatientName int Age string Gender string BloodType string PrimaryDiagnosis int TotalExaminations date LastVisitDate int HospitalStayDays string ContactNumber } Doctor { string DoctorID PK string DoctorName string Specialty int YearsExperience float AvgPatientAge string Department string ContactEmail } Examination { string ExamID PK string ExamType date ExamDate int ExamCost int DurationMinutes } Patient_Doctor { string PatientID FK string AssignedDoctors FK } Patient_Examination { string ExamID FK string PatientIDs FK string ResultStatus } Doctor_Examination { string ExamID FK string DoctorID FK } ``` :::: ### 2. Should be in attoic value e.g. PatientName, Date etc. **Code** ```python= patient[['FirstName', 'LastName']] = patient['PatientName'].str.split(' ', expand=True) # noqa: E501 patient['LastVisitDate'] = pd.to_datetime(patient['LastVisitDate']) patient['Year'] = patient['LastVisitDate'].dt.year patient['Month'] = patient['LastVisitDate'].dt.month patient['Day'] = patient['LastVisitDate'].dt.day patient = patient.drop(['PatientName', 'LastVisitDate'], axis=1) print(patient) ``` **Output** ```python= PatientID Age Gender BloodType PrimaryDiagnosis TotalExaminations HospitalStayDays ContactNumber FirstName LastName Year Month Day 0 P001 45 Male O+ Hypertension 5 3 555-123-4567 John Smith 2025 6 15 1 P002 32 Female A- Diabetes 3 1 555-987-6543 Mary Johnson 2025 7 1 2 P003 60 Male B+ Heart Disease 8 5 555-456-7890 Alex Brown 2025 7 8 3 P004 28 Female AB+ Asthma 2 0 555-321-0987 Sarah Davis 2025 5 20 4 P005 50 Female O- Arthritis 4 2 555-654-3210 Emma Wilson 2025 6 30 ``` **ER Model** ![image](https://hackmd.io/_uploads/BkKqkK0Txx.png) ::::spoiler ER Model Code ``` erDiagram Patient ||--o{ Patient_Doctor : has Doctor ||--o{ Patient_Doctor : assigned_to Patient ||--o{ Patient_Examination : undergoes Examination ||--o{ Patient_Examination : includes Doctor ||--o{ Doctor_Examination : conducts Examination ||--o{ Doctor_Examination : performed_by Patient { string PatientID PK string FirstName string LastName int Age string Gender string BloodType string PrimaryDiagnosis int TotalExaminations int HospitalStayDays string ContactNumber int Year int Month int Day } Doctor { string DoctorID PK string Title string FirstName string LastName string Specialty int YearsExperience float AvgPatientAge string Department string ContactEmail } Examination { string ExamID PK string ExamType int ExamCost int DurationMinutes int Year int Month int Day } Patient_Doctor { string PatientID FK string AssignedDoctors FK } Patient_Examination { string ExamID FK string PatientID FK string ResultStatus } Doctor_Examination { string ExamID FK string DoctorID FK } ``` :::: ## 04 Normalization Check :::success **πŸ€– Prompt** does this ER model pass 1NF, 2NF, 3NF, and BCNF? If so, please provide the checklist ::: **Tables Check** | Table | 1NF | 2NF | 3NF | BCNF | Notes | |----------------------|-----|-----|-----|------|----------------------------------------------------------------------| | Patient | βœ… | βœ… | βœ… | βœ… | Atomic values, single PK (PatientID), no partial or transitive dependencies. | | Doctor | βœ… | βœ… | βœ… | βœ… | Atomic values, single PK (DoctorID), no partial or transitive dependencies. | | Examination | βœ… | βœ… | βœ… | βœ… | Atomic values, single PK (ExamID), no partial or transitive dependencies. | | Patient_Doctor | βœ… | βœ… | βœ… | βœ… | Junction table, composite PK, no non-key attributes. | | Patient_Examination | βœ… | βœ… | βœ… | βœ… | Junction table, composite PK, ResultStatus depends on full key. | | Doctor_Examination | βœ… | βœ… | βœ… | βœ… | Junction table, composite PK, no non-key attributes. | **Example** Patient Table Attributes: PatientID (PK), FirstName, LastName, Age, Gender, BloodType, PrimaryDiagnosis, TotalExaminations, HospitalStayDays, ContactNumber, Year, Month, Day - **1NF** - Atomic values: All attributes (e.g., FirstName, Age, BloodType) are atomic. The original PatientName and LastVisitDate were split into atomic components (FirstName, LastName, Year, Month, Day) in the provided code. - No repeating groups: No multi-valued attributes or lists (e.g., AssignedDoctors was moved to the Patient_Doctor table). - Primary key: PatientID is the primary key. - **Result: Satisfies 1NF.** - **2NF** - The primary key is a single attribute (PatientID), so there are no composite keys, and thus no partial dependencies are possible. - All non-key attributes (FirstName, LastName, Age, etc.) depend fully on PatientID. - **Result: Satisfies 2NF.** - **3NF** - Check for transitive dependencies: No non-key attribute depends on another non-key attribute. - For example, FirstName, LastName, Age, etc., depend only on PatientID. Year, Month, and Day (from LastVisitDate) describe the last visit date for the patient and depend only on PatientID, not on other non-key attributes. TotalExaminations might seem like it could depend on examinations, but it’s a summary attribute specific to the patient (e.g., a count of exams), so it depends on PatientID. - **Result: Satisfies 3NF, assuming no hidden dependencies (e.g., ContactNumber doesn’t depend on Age).** - **BCNF** - All functional dependencies must have a superkey on the left-hand side. Dependencies: PatientID β†’ {FirstName, LastName, Age, Gender, BloodType, PrimaryDiagnosis, TotalExaminations, HospitalStayDays, ContactNumber, Year, Month, Day}. - Since PatientID is the primary key (and thus a superkey), all dependencies satisfy BCNF. - **Result: Satisfies BCNF.** ## 05 Connect to SQLite **Scipt:** ``` python3 preprocessing/DBconnection.py ``` **Code:** https://github.com/PoYaSharonLin/PDDS/blob/Dash-Demo-Dev/preprocessing/DBconnection.py ```python conn = sqlite3.connect('hospital_database.db') cursor = conn.cursor() # Create tables in SQLite cursor.execute(''' CREATE TABLE IF NOT EXISTS patient ( PatientID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, Year INTEGER, Month INTEGER, Day INTEGER ) ''') cursor.execute(''' CREATE TABLE IF NOT EXISTS doctor ( DoctorID INTEGER PRIMARY KEY, Title TEXT, FirstName TEXT, LastName TEXT ) ''') cursor.execute(''' CREATE TABLE IF NOT EXISTS examination ( ExamID INTEGER PRIMARY KEY, ExamType TEXT, Year INTEGER, Month INTEGER, Day INTEGER ) ''') cursor.execute(''' CREATE TABLE IF NOT EXISTS patient_doctor ( PatientID INTEGER, DoctorID INTEGER, PRIMARY KEY (PatientID, DoctorID), FOREIGN KEY (PatientID) REFERENCES patient(PatientID), FOREIGN KEY (DoctorID) REFERENCES doctor(DoctorID) ) ''') cursor.execute(''' CREATE TABLE IF NOT EXISTS patient_examination ( ExamID INTEGER, PatientID INTEGER, ResultStatus TEXT, PRIMARY KEY (ExamID, PatientID), FOREIGN KEY (ExamID) REFERENCES examination(ExamID), FOREIGN KEY (PatientID) REFERENCES patient(PatientID) ) ''') cursor.execute(''' CREATE TABLE IF NOT EXISTS doctor_examination ( ExamID INTEGER, DoctorID INTEGER, PRIMARY KEY (ExamID, DoctorID), FOREIGN KEY (ExamID) REFERENCES examination(ExamID), FOREIGN KEY (DoctorID) REFERENCES doctor(DoctorID) ) ''') # Export DataFrames to SQLite patient.to_sql('patient', conn, if_exists='replace', index=False) doctor.to_sql('doctor', conn, if_exists='replace', index=False) examination.to_sql('examination', conn, if_exists='replace', index=False) patient_doctor.rename(columns={'AssignedDoctors': 'DoctorID'}).to_sql('patient_doctor', conn, if_exists='replace', index=False) # noqa: E501 patient_examination.rename(columns={'PatientIDs': 'PatientID'}).to_sql('patient_examination', conn, if_exists='replace', index=False) # noqa: E501 doctor_examination.to_sql('doctor_examination', conn, if_exists='replace', index=False) # noqa: E501 # Commit changes and close connection conn.commit() conn.close() ``` ## 06 Check SQLite Connection using command line - Go to ./src - Run `sqlite3 hospital_database.db` ![image](https://hackmd.io/_uploads/BJpLcR72gl.png) - Check SQLite3 connection - Run`.tables` should return all tables - Run `SELECT * FROM patient LIMIT 5;` ![image](https://hackmd.io/_uploads/rkl35C7nll.png) - Run other example queries ![image](https://hackmd.io/_uploads/Syy0qAX2lg.png) **SQL** https://github.com/PoYaSharonLin/PDDS/blob/Dash-Demo-Dev/query.sql ```sql= -- Insert Patient data BEGIN; INSERT INTO patient (FirstName, LastName, Year, Month, Day) VALUES ('John', 'Doe', 1985, 6, 15); COMMIT; SELECT * FROM patient; -- Retrieve all patients assigned to a specific doctor SELECT p.FirstName, p.LastName FROM patient p JOIN patient_doctor pd ON p.PatientID = pd.PatientID WHERE pd.DoctorID = "D001"; -- Count the number of examinations per patient SELECT p.FirstName, p.LastName, COUNT(pe.ExamID) AS ExamCount FROM patient p LEFT JOIN patient_examination pe ON p.PatientID = pe.PatientID GROUP BY p.PatientID, p.FirstName, p.LastName; -- Find all doctors who performed 'Blood Test' SELECT d.Title, d.FirstName, d.LastName FROM doctor d JOIN doctor_examination de ON d.DoctorID = de.DoctorID JOIN examination e ON de.ExamID = e.ExamID WHERE e.ExamType = 'Blood Test'; ``` ---- ## 06 Create DB and Check SQLite Connection using DB Browser **SQL:** https://github.com/PoYaSharonLin/PDDS/blob/database-demo/query.sql **1. Create new database** ![image](https://hackmd.io/_uploads/By1ozI1Ygg.png) **2. Import CSV as data table** ![image](https://hackmd.io/_uploads/ryq6GL1Yll.png) **3. Set table name & import CSV** ![image](https://hackmd.io/_uploads/rkAQQLyYxx.png) **4. Browse Data & Write Changes** ![image](https://hackmd.io/_uploads/ByEuQLJtlg.png) **5. Try out SQL Queries** Click "Execute SQL" and paste the code - INSERT ```SQL= INSERT INTO patient (FirstName, LastName, Year, Month, Day) VALUES ('John', 'Doe', 1985, 6, 15); SELECT * FROM patient; ``` Click the "Execute" button the execute the SQL. After executing the SQL, you should see that the new row is inserted ![image](https://hackmd.io/_uploads/S1a0XUytll.png) ![image](https://hackmd.io/_uploads/Hknn78yYel.png) - SELECT To select the patient who are under the same doctor, we can use the SQL below. ```SQL= SELECT p.FirstName, p.LastName FROM patient p JOIN patient_doctor pd ON p.PatientID = pd.PatientID WHERE pd.AssignedDoctors = 'D001'; ``` ![image](https://hackmd.io/_uploads/rytBELyYeg.png) :::warning **Verification** To verify whether the SQL is correct, let's first take a look at the `patient_doctor` table. **STEP 1** The patients whose doctor is D001 are P001 and P003 ![image](https://hackmd.io/_uploads/rJVvBI1Yge.png) **STEP 2** To see whether P001 and P003 are John and Alex, let's check the patient table. ![image](https://hackmd.io/_uploads/SJesY4L1tgl.png) Indeed, both Alex and John are under the same doctor D001, Alice Lee. :::