DB final
===
<!-- [TOC] -->
## Implementation
1. [PostgreSQL README by @xinhuang](/4YqotwHcTn6rbHq2lgEQzw)
2. [Getting started with Postgres Functions in plpython](https://www.crunchydata.com/blog/getting-started-with-postgres-functions-in-plpython)
3. [PlPython3u documentation](https://www.postgresql.org/docs/current/plpython.html)
4. [Enable plpython3u on MacOS](https://stackoverflow.com/questions/70650619/how-to-enable-plpython3u-on-macos)
5. [Return result set with plpython3 in postgres](https://stackoverflow.com/questions/16376732/how-to-return-resultset-with-plpython3-in-postgres)
6. [How to install a Python package that can work on PostgreSQL and plpython](https://stackoverflow.com/questions/69013111/how-to-install-a-python-package-on-linux-so-that-it-is-found-by-the-already-work)
7. [Install the Python package in the target directory using pip](https://stackoverflow.com/questions/2915471/install-a-python-package-into-a-different-directory-using-pip)
8. [Python, pip, site-packages的關係](https://medium.com/@will.wang/%E6%92%A5%E9%96%8B-python-pip-site-packages-%E7%9A%84%E8%97%8D%E8%89%B2%E8%9C%98%E8%9B%9B%E7%B6%B2-90e398bb3785)
- [**DB final code submission**](/2x9Z4UWLTWC46AAcdMt-DA)
## Slides
- [Google slides](https://docs.google.com/presentation/d/1zUaGSJpqksamsesEuuau27rMA1s1l5bVYprr43UgrFc/edit?usp=sharing)
- [報告說明投影片by助教](https://docs.google.com/presentation/d/1QVDax3Hc0swllkKPpRAEaItQuwAxKooZwrEDxpdipKs/edit#slide=id.g23ddfc456e2_0_358)
- [書面報告共編](https://docs.google.com/document/d/1Spe2ide7kNH3E9UbNnF4wDvlm8uZsSA3/edit?usp=drive_link&ouid=103984669321720374461&rtpof=true&sd=true)
## 想法
- **老師**:優化目前仍需 load 資料到 database 外的效能缺點,能否以內建 operation 去讀資料或 Join 資料
- **老師**:擴大 Fuzzy 的定義, string 的語義 match 程度等
- 應用在什麼類型的 ML tasks 或 data science tasks ,例如在 reinforcement learning 中用 auto-fuzzy join 可以改善資料前處理的效率?
- 幫助 user 快速了解 dataset 的內容與彼此的 joinability 之類的,可視覺化最好
- 有沒有 benchmark 可以評估?
- 在 ML 或 DS 的案例中,通常資料比較雜亂沒有結構,沒有預先設計 ER Model 的情況下,比較難直接使用,需要先做 Data Integration 或 Feature Engineering ;而在這兩個階段較常用到 `JOIN` 的 operation,將沒意義的 raw data 轉換成有意義的 knowledge。
- **雲行**:Fuzzy Join 更多 Tables
## 待解決
### 三個以上的 sets 的分數、距離怎麼計算
- [Snowflake -- MinHash](https://docs.snowflake.com/en/user-guide/querying-approximate-similarity)
- [Find Jaccard similarity between multiple lists](https://stackoverflow.com/questions/71374123/find-jaccard-similarity-between-multiple-list)
- [Computing degree of similarity among a group of sets](https://stackoverflow.com/questions/2035326/computing-degree-of-similarity-among-a-group-of-sets)
- [Jaccard Index between set and multiset](https://stats.stackexchange.com/questions/162508/jaccard-index-between-set-and-multiset)
- [LCS](https://codeforces.com/blog/entry/92908) + Percentage Difference
### implement FJ in EXTENSION
```SQL
CREATE OR REPLACE FUNCTION autofj()
RETURNS TEXT
AS $$
from autofj.datasets import load_data
from autofj import AutoFJ
left_table, right_table, gt_table = load_data("TennisTournament")
fj = AutoFJ(precision_target=0.9)
result = fj.join(left_table, right_table, "id")
return result
$$ LANGUAGE plpython3u;
SELECT autofj();
```
## 測試
- 視覺化:`plpython3u` 內沒有包 `matplotlib` ,視覺化目前查到可用 [pgcharts](https://github.com/dimitri/pgcharts/)。
- (0529更): 可用`matplotlib`做視覺化
- 資料庫改寫:
- [StellarSQL](https://tigercosmos.xyz/tags/stellarsql/)
- [PostgreSQL Internals](https://edu.postgrespro.com/postgresql_internals-14_parts1-4_en.pdf)
- [SQLite for Code Reading](https://github.com/davideuler/SQLite-2.5.0-for-code-reading)
- 資料庫原 Fuzzy-like operation:
- [fuzzystrmatch](https://www.postgresql.org/docs/current/fuzzystrmatch.html)
- MindsDB:
- [Introduction](https://medium.com/mindsdb/how-to-bring-your-own-machine-learning-model-to-databases-47a188d6db00)
- [BYOM](https://docs.mindsdb.com/custom-model/byom)
- [With NLP](https://docs.mindsdb.com/nlp/what-is-nlp)
- [PostgresML](https://github.com/postgresml/postgresml#installation)
- [Official Website](https://postgresml.org/)
## Paper
- [Auto-FuzzyJoin](https://dl.acm.org/doi/pdf/10.1145/3448016.3452824)
## Github
- [Auto-FuzzyJoin](https://github.com/chu-data-lab/AutomaticFuzzyJoin)
#### Multi-Cols FJ Dataset
- [Database Group Leipzig](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution)
- [The Magellan Data Repository](https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository) ( using IMDB+Rotten Tomatoes input tables as example)
- [RIDDLE](https://www.cs.utexas.edu/users/ml/riddle/data.html)
#### Multi-Cols FJ Example
```python
# import package and dataset
import pandas as pd
from autofj import AutoFJ
imdb = pd.read_csv("./Rotten Tomatoes-IMDB/imdb.csv")
rotten_tomatoes = pd.read_csv("./Rotten Tomatoes-IMDB/rotten_tomatoes.csv")
# fuzzyjoin(Multi-columns)
fj = AutoFJ(precision_target = 0.9)
fj_column = ["Title", "Director", "Creators", "Cast"]
result = fj.join(rotten_tomatoes, imdb, 'ID', fj_column + ["autofj_id"])
# 大概要跑1x分鐘
```
#### Multi-Cols Parameters
1. *join_function_space:* Join複雜程度,可輸入list或dict
2. *distance_threshold_space:* 可輸入list
3. *column_weight_space:* 可輸入list
## Docs
- [DB Final Proposal](https://docs.google.com/document/d/1OsS4Y0Te3aNL46Gdha7IKQP3Zy1314H4sXPXodOGOFs/edit)
## Testing Dataset
:link: [dataset link](https://drive.google.com/drive/folders/1KBSJAvxCqLhI67yrNxJyNuWa_iXZXMvM?usp=drive_link) (6/7)
1. Rotten Tomatoes - IMDB
> 來源: [The Magellan Data Repository](https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository)
> 資料比數: 100
> 資料欄位: Id, Name, Director, Genre, Description, Release Date
2. Douban - Yahoo
> 來源: [豆瓣](https://movie.douban.com/) & [Yahoo電影](https://movies.yahoo.com.tw/index.html)
> 資料比數: 100
> 資料欄位: Id, Title, Director, Release, Description