DB final - HackMD

DB final ===  ## Implementation 1. [PostgreSQL README by @xinhuang](/4YqotwHcTn6rbHq2lgEQzw) 2. [Getting started with Postgres Functions in plpython](https://www.crunchydata.com/blog/getting-started-with-postgres-functions-in-plpython) 3. [PlPython3u documentation](https://www.postgresql.org/docs/current/plpython.html) 4. [Enable plpython3u on MacOS](https://stackoverflow.com/questions/70650619/how-to-enable-plpython3u-on-macos) 5. [Return result set with plpython3 in postgres](https://stackoverflow.com/questions/16376732/how-to-return-resultset-with-plpython3-in-postgres) 6. [How to install a Python package that can work on PostgreSQL and plpython](https://stackoverflow.com/questions/69013111/how-to-install-a-python-package-on-linux-so-that-it-is-found-by-the-already-work) 7. [Install the Python package in the target directory using pip](https://stackoverflow.com/questions/2915471/install-a-python-package-into-a-different-directory-using-pip) 8. [Python, pip, site-packages的關係](https://medium.com/@will.wang/%E6%92%A5%E9%96%8B-python-pip-site-packages-%E7%9A%84%E8%97%8D%E8%89%B2%E8%9C%98%E8%9B%9B%E7%B6%B2-90e398bb3785) - [**DB final code submission**](/2x9Z4UWLTWC46AAcdMt-DA) ## Slides - [Google slides](https://docs.google.com/presentation/d/1zUaGSJpqksamsesEuuau27rMA1s1l5bVYprr43UgrFc/edit?usp=sharing) - [報告說明投影片by助教](https://docs.google.com/presentation/d/1QVDax3Hc0swllkKPpRAEaItQuwAxKooZwrEDxpdipKs/edit#slide=id.g23ddfc456e2_0_358) - [書面報告共編](https://docs.google.com/document/d/1Spe2ide7kNH3E9UbNnF4wDvlm8uZsSA3/edit?usp=drive_link&ouid=103984669321720374461&rtpof=true&sd=true) ## 想法 - **老師**：優化目前仍需 load 資料到 database 外的效能缺點，能否以內建 operation 去讀資料或 Join 資料 - **老師**：擴大 Fuzzy 的定義， string 的語義 match 程度等 - 應用在什麼類型的 ML tasks 或 data science tasks ，例如在 reinforcement learning 中用 auto-fuzzy join 可以改善資料前處理的效率？ - 幫助 user 快速了解 dataset 的內容與彼此的 joinability 之類的，可視覺化最好 - 有沒有 benchmark 可以評估？ - 在 ML 或 DS 的案例中，通常資料比較雜亂沒有結構，沒有預先設計 ER Model 的情況下，比較難直接使用，需要先做 Data Integration 或 Feature Engineering ；而在這兩個階段較常用到 `JOIN` 的 operation，將沒意義的 raw data 轉換成有意義的 knowledge。 - **雲行**：Fuzzy Join 更多 Tables ## 待解決 ### 三個以上的 sets 的分數、距離怎麼計算 - [Snowflake -- MinHash](https://docs.snowflake.com/en/user-guide/querying-approximate-similarity) - [Find Jaccard similarity between multiple lists](https://stackoverflow.com/questions/71374123/find-jaccard-similarity-between-multiple-list) - [Computing degree of similarity among a group of sets](https://stackoverflow.com/questions/2035326/computing-degree-of-similarity-among-a-group-of-sets) - [Jaccard Index between set and multiset](https://stats.stackexchange.com/questions/162508/jaccard-index-between-set-and-multiset) - [LCS](https://codeforces.com/blog/entry/92908) + Percentage Difference ### implement FJ in EXTENSION ```SQL CREATE OR REPLACE FUNCTION autofj() RETURNS TEXT AS $$ from autofj.datasets import load_data from autofj import AutoFJ left_table, right_table, gt_table = load_data("TennisTournament") fj = AutoFJ(precision_target=0.9) result = fj.join(left_table, right_table, "id") return result $$ LANGUAGE plpython3u; SELECT autofj(); ``` ## 測試 - 視覺化：`plpython3u` 內沒有包 `matplotlib` ，視覺化目前查到可用 [pgcharts](https://github.com/dimitri/pgcharts/)。 - (0529更): 可用`matplotlib`做視覺化 - 資料庫改寫: - [StellarSQL](https://tigercosmos.xyz/tags/stellarsql/) - [PostgreSQL Internals](https://edu.postgrespro.com/postgresql_internals-14_parts1-4_en.pdf) - [SQLite for Code Reading](https://github.com/davideuler/SQLite-2.5.0-for-code-reading) - 資料庫原 Fuzzy-like operation: - [fuzzystrmatch](https://www.postgresql.org/docs/current/fuzzystrmatch.html) - MindsDB: - [Introduction](https://medium.com/mindsdb/how-to-bring-your-own-machine-learning-model-to-databases-47a188d6db00) - [BYOM](https://docs.mindsdb.com/custom-model/byom) - [With NLP](https://docs.mindsdb.com/nlp/what-is-nlp) - [PostgresML](https://github.com/postgresml/postgresml#installation) - [Official Website](https://postgresml.org/) ## Paper - [Auto-FuzzyJoin](https://dl.acm.org/doi/pdf/10.1145/3448016.3452824) ## Github - [Auto-FuzzyJoin](https://github.com/chu-data-lab/AutomaticFuzzyJoin) #### Multi-Cols FJ Dataset - [Database Group Leipzig](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution) - [The Magellan Data Repository](https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository) ( using IMDB+Rotten Tomatoes input tables as example) - [RIDDLE](https://www.cs.utexas.edu/users/ml/riddle/data.html) #### Multi-Cols FJ Example ```python # import package and dataset import pandas as pd from autofj import AutoFJ imdb = pd.read_csv("./Rotten Tomatoes-IMDB/imdb.csv") rotten_tomatoes = pd.read_csv("./Rotten Tomatoes-IMDB/rotten_tomatoes.csv") # fuzzyjoin(Multi-columns) fj = AutoFJ(precision_target = 0.9) fj_column = ["Title", "Director", "Creators", "Cast"] result = fj.join(rotten_tomatoes, imdb, 'ID', fj_column + ["autofj_id"]) # 大概要跑1x分鐘 ``` #### Multi-Cols Parameters 1. *join_function_space:* Join複雜程度，可輸入list或dict 2. *distance_threshold_space:* 可輸入list 3. *column_weight_space:* 可輸入list ## Docs - [DB Final Proposal](https://docs.google.com/document/d/1OsS4Y0Te3aNL46Gdha7IKQP3Zy1314H4sXPXodOGOFs/edit) ## Testing Dataset :link: [dataset link](https://drive.google.com/drive/folders/1KBSJAvxCqLhI67yrNxJyNuWa_iXZXMvM?usp=drive_link) (6/7) 1. Rotten Tomatoes - IMDB > 來源: [The Magellan Data Repository](https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository) > 資料比數: 100 > 資料欄位: Id, Name, Director, Genre, Description, Release Date 2. Douban - Yahoo > 來源: [豆瓣](https://movie.douban.com/) & [Yahoo電影](https://movies.yahoo.com.tw/index.html) > 資料比數: 100 > 資料欄位: Id, Title, Director, Release, Description

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.