[Python] How to Count Lines, with an Efficiency Analysis

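Recently, while training a model, I needed to know the size of the training set in advance in order to decay the learning rate. However, my files are both large and numerous, so I wanted to cut the counting time as much as possible.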

Counting Lines

Counting (image credit: Magda Ehlers | pexels)

readlines

The simplest and most intuitive way to count lines is, of course, readlines():


# Use a with block so the file handle is closed after counting
with open(file_name, 'rb') as f:
    count = len(f.readlines())

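However, this approach does not suit large files, because readlines() loads every line into memory at once. My test set has 187 files of 5 GB each, so reading them this way would probably hang until the machine gives up the ghost.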

So, to ease the load, I put a limit on the number of bytes read per call:

import time

start = time.time()

count = 0
with open(file_name, 'rb') as f:
    while True:
        # Read whole lines totalling roughly 8 MiB per call
        lines = f.readlines(1024 * 8192)
        if not lines:
            break
        count += len(lines)

endtime = time.time()
print(count, endtime - start)

Reading one file takes about 4.602344036102295 seconds.

readline

Besides readlines(), you can also use readline():

import time

start = time.time()

count = 0
with open(file_name, 'rb') as f:
    while True:
        # readline() returns an empty bytes object at end of file
        line = f.readline()
        if not line:
            break
        count += 1
endtime = time.time()
print(count, endtime - start)

Reading one file takes about 4.239390134811401 seconds, which looks slightly faster?

read

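Python's three staples for reading files are readline(), readlines(), and read(), so next let's try read(). By default read() also pulls in the entire file, so I again set a chunk size to keep memory from blowing up.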

import time

start = time.time()
count = 0
with open(file_name, 'rb') as reader:
    while True:
        # Read 8 MiB at a time and count the newline bytes in each chunk
        data = reader.read(1024 * 8192)
        if not data:
            break
        count += data.count(b'\n')
endtime = time.time()
print(endtime - start)

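However, what it reads in is raw bytes with no notion of lines, so I count newline characters to get the line count. Reading one file this way takes about 4.244061231613159 seconds.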

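My team lead asked how much the speed actually differs between reading with rb and with r. It was only a small job, so I ran it while I was out for lunch. One file took 16.770071983337402 seconds, about 4 times the read time in rb mode.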

Iterator

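Incidentally, BufferedReader is itself an iterator object, so it can be used to count lines too.

import time

start = time.time()

count = 0
with open(file_name, 'rb') as f:
    # Iterating over the file object yields one line at a time
    for line in f:
        count += 1
endtime = time.time()
print(endtime - start)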

Reading one file takes about 4.3013999462127686 seconds.

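I have seen people combine this with enumerate to skip the manual count increment. That ran in 4.196686029434204 seconds, the lowest so far.

import time

start = time.time()
count = 0  # stays 0 if the file is empty
with open(file_name, 'rb') as f:
    # Numbering from 1 leaves count equal to the number of lines read
    for count, line in enumerate(f, 1):
        pass

endtime = time.time()
print(endtime - start)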

Multiprocess

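It looks like about 4 seconds is the limit for a 5 GB file. So I turned my attention to multiprocessing, although it may not help: after all, letting the disk head read sequentially should be faster?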

I found a ready-made version online, so I did not write my own:

import multiprocessing as mp
import time
from itertools import repeat, takewhile


def count_lines(file_name):
    # Count newline bytes in 1 MiB chunks read from the unbuffered raw stream
    with open(file_name, 'rb') as f:
        bufgen = takewhile(lambda x: x,
                           (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)


if __name__ == '__main__':
    # file_names is the list of paths to count, defined elsewhere
    start = time.time()
    with mp.Pool(processes=4) as pool:
        async_result = pool.map_async(count_lines, file_names)
        count = sum(async_result.get())
    endtime = time.time()
    print(count, endtime - start)

At about 4 seconds per file, 187 files should take roughly 12.5 minutes. In practice, the run took 10.3 minutes.
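The estimate above, and the speedup it implies, can be checked with a few lines of arithmetic. The variable names are only for illustration, and 4 seconds per file is the rounded figure used in the text.

```python
n_files = 187
seconds_per_file = 4      # rounded single-file read time from the tests above
measured_minutes = 10.3   # measured wall time with a pool of 4 processes

# Sequential estimate: 187 * 4 s = 748 s
estimated_minutes = n_files * seconds_per_file / 60

# How much faster the 4-process run was than the sequential estimate
speedup = estimated_minutes / measured_minutes

print(round(estimated_minutes, 1), round(speedup, 2))
```

A speedup of only about 1.2x from 4 processes suggests the job is limited by disk reads rather than by the CPU.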
P.S. Only when I got this far did it occur to me that this job is not CPU-heavy, so multithreading should be enough and multiprocessing is unnecessary.

References

vmele (2017-12-06). optimization - Optimize file and number line count in Pytho. Retrieved from Stack Overflow (2020-06-18).

Update Log

Last updated: 2020-08-10

2020-08-10 Published
2020-06-18 Draft completed

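Author: 辛西亞．Cynthia
Link to this post: 辛西亞的技能樹 / hackmd version
Copyright notice: All posts on this blog are licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). Please credit the author, link, and source when reposting!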
