HackMD Error: 404 error

Python 讀寫 Google Sheets 教學

tags: `Python` `Google Sheets` `Tutorials`

Python 讀寫 Google Sheets 教學

前言

結合 Google Sheets 本身能直接分享、檢視、編輯的優點，拿來當作簡易的資料庫是不錯的選擇，或者是用來將持續變動的/重複輸出的資料直接呈現給客戶，他們不用總是要下載後開啟才能讀寫資料。

取得權限

在開始使用前要先取得讀寫 Google Sheets 的權限，進入下面連結輸入一些資訊後即可取得憑證 json 檔。
https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com

值得注意的是，同一個憑證可以開啟多種權限（如同時取得 Firebase 的權限）。

取得憑證 json 檔後你擁有的是「讀取」你的 Google Drive 中 Google Sheets 的權限，如果想要「編輯」 Google Sheets，甚至是讀寫其他人共用給你的 Google Sheets，最快的方式就是：

打開憑證的 json 檔，複製 client_email 中的那串 email。

將你要讀寫的 Google Sheets 共用給上面那串 email，權限選擇「可以編輯」，就完成啦！

套件選擇

Google 有自己的 API 來讀寫 Google Sheets，但看起來有點麻煩，非原生的有gspread或pygsheets套件，我剛開始是用gspread，後來發現pygsheets更簡單一點，主要原因是它可以直接輸入或輸出 dataframe，方便很多。

pygsheets 套件

使用憑證開啟 Google Sheets

取得憑證後只需要用 Google Sheets 的 ID 就可以連接，不過用 url 可以直接點過去看一下也是滿方便的。






import pygsheets

gc = pygsheets.authorize(service_account_file='path/to/key.json')

survey_url = 'https://docs.google.com/spreadsheets/d/1cNqBIzF_T8yZm5S1hj7E1MNVXQegVDVuJsUWVoDKIDw/'
sh = gc.open_by_url(survey_url)

可以跟下面的gspread套件比較，pygsheets不需要額外安裝認證套件，也不需特別指定 scope。

寫入方式





ws = sh.worksheet_by_title('demo')
ws.update_value('A1', 'test')

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
ws.set_dataframe(df1, 'A2', copy_index=True, nan='')

讀取方式


val = ws.get_value('A1')
df2 = ws.get_as_df(start='A2', index_colum=1, empty_value='', include_tailing_empty=False) # index 從 1 開始算

讀取注意事項

要特別注意這個套件在get_as_df預設會自動轉換可能是數字的資料，所以即使資料內容是文字的 '0001' 也是會被轉成數字 1 ，只要改參數numerize=False就能避免這個問題。

如果想避免用get_as_df設定上的麻煩，最快的方法就是先輸出 csv 再讀進來。


ws.export(filename='df')
df3 = pd.read_csv('df.csv')

寫成 functions







def ws_to_df(self, **kwargs):
    worksheet_title = self.title
    self.export(filename=worksheet_title + '_df')
    df = pd.read_csv(worksheet_title + '_df.csv', **kwargs)
    return df

pygsheets.worksheet.Worksheet.ws_to_df = ws_to_df

使用範例


df4 = ws.ws_to_df()

文檔

其他更進階的就去看文檔吧：
https://pygsheets.readthedocs.io/en/stable/index.html

gspread 套件

其實直接用pygsheets就好了，不過既然之前都做了些研究，就順便寫一下。

使用憑證開啟 Google Sheets












import gspread
from oauth2client.service_account import ServiceAccountCredentials as SAC

# Google Sheets credentials
## 定義使用的範圍
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = SAC.from_json_keyfile_name('path/to/key.json', scope)
gc = gspread.authorize(credentials)

survey_url = 'https://docs.google.com/spreadsheets/d/1cNqBIzF_T8yZm5S1hj7E1MNVXQegVDVuJsUWVoDKIDw/'
sh = gc.open_by_url(survey_url)

讀取注意事項

這個套件同樣也會自動轉換可能是數字的資料，但沒辦法改參數解決，所以為了讓輸出的資料維持原樣，必須要小改套件中的幾個 functions，如下：








































def numericise_m(value, empty2zero=False, default_blank="", allow_underscores_in_numeric_literals=False):
    if value is not None:
        if "_" in value and not allow_underscores_in_numeric_literals:
            return value

        if value == "":
            if empty2zero:
                value = 0
            else:
                value = default_blank

    return value


def numericise_all_m(input, empty2zero=False, default_blank="", allow_underscores_in_numeric_literals=False):
    return [numericise_m(s, empty2zero, default_blank, allow_underscores_in_numeric_literals) for s in input]


def get_all_records_m(
        self,
        empty2zero=False,
        head=1,
        default_blank=None,
        allow_underscores_in_numeric_literals=False,
):
    idx = head - 1

    data = self.get_all_values()
    keys = data[idx]
    values = [
        numericise_all_m(
            row,
            empty2zero,
            default_blank,
            allow_underscores_in_numeric_literals,
        )
        for row in data[idx + 1:]
    ]

    return [dict(zip(keys, row)) for row in values]

或許有比較好的改法，但當時想說能用就好，所以就沒有仔細去研究了。

這個套件也沒有可以直接轉成 dataframe 或 dictionary 的 function，所以一樣要自己寫：























def worksheet_to_df(worksheet, index):
    df = pd.DataFrame(get_all_records_m(worksheet))
    df = df.set_index(index, drop=False)

    # check if index duplicate
    duplicate_index = df.groupby(level=0).filter(lambda x: len(x) > 1).__len__()
    if duplicate_index > 0:
        raise IndexError('duplicate index!!')
    else:
        return df


def worksheet_to_dict(worksheet, index):
    df = pd.DataFrame(get_all_records_m(worksheet))
    df = df.set_index(index, drop=False)

    # check if index duplicate
    duplicate_index = df.groupby(level=0).filter(lambda x: len(x) > 1).__len__()
    if duplicate_index > 0:
        raise IndexError('duplicate index!!')
    else:
        df_dict = df.to_dict(orient='index')
        return df_dict

讀取方式


main_df = worksheet_to_df(sh.worksheet('main'), 'numbering')
main_dict = worksheet_to_dict(sh.worksheet('main'), 'numbering')

結論

gspread雖然老牌，但它沒有支援 dataframe 變得非常難用，還要自己寫加改 functions，幸好後來又找到pygsheets，真的方便許多，而且看了一下它套件裡的 functions，感覺應該是從gspread做改良。

補充說明

過去使用gspread，每次憑證連接的時間有限，所以過一段時間就要重連，對於拿來當作長時間連接的資料庫會有一些影響，雖然可以用定期重連的方式解決，但也還是比較麻煩。pygsheets可能也有這個問題。

Python 讀寫 Google Sheets 教學

tags: Python Google Sheets Tutorials

前言

取得權限

套件選擇

pygsheets 套件

使用憑證開啟 Google Sheets

寫入方式

讀取方式

讀取注意事項

文檔

gspread 套件

使用憑證開啟 Google Sheets

讀取注意事項

讀取方式

結論

補充說明

Read more

Python BeautifulSoup Notes

Selenium 爬蟲使用 Tor 匿蹤教學

結合 PyCharm 與 Anaconda

Python Selenium Notes

tags: `Python` `Google Sheets` `Tutorials`