# Parallel data processing in Python
This is a collaborative document for solving the exercises during class.
## Roll call
| Name | Course | Level of Python |
| -------- | -------- | -------- |
| Eszter | guest lecturer | expert |
| Norbert | bioinformatics MSc student | advanced |
| Máté | physics MSc student | intermediate |
| Bence | physics MSc | expert |
| Bendegúz | physics MSc student | intermediate |
| Nikolett | physics PhD | advanced |
| Bogdán | physics PhD | advanced |
| Mirkó | physics PhD | advanced |
| Bence | physics MSc | intermediate |
| Andi | physics MSc | intermediate |
| Adri | physics MSc | intermediate |
| Martin | physics MSc | intermediate |
| János | lecturer | expert |
| Erik | physics MSc | advanced |
| Dani | physics MSc | advanced |
## Exercise 1
Could you create a numpy matrix?
Eszter - *very fast memory error :)*
Sample solutions for creating an adjacency matrix for one layer:
Hint:
```
# hint
import scipy.sparse

# i -> array for source nodes
# j -> array for target nodes
# data -> array of ones as long as i and j
# only select edges for one layer! (filter the `edges` dataframe)
A = scipy.sparse.csr_matrix((data, (i, j)))
```
Eszter
```
import scipy.sparse

def create_layerwise_matrices():
    layerwise_adj_matrices = {}
    for layer in layers.layerID:
        print(f"Creating layer {layer}.")
        i, j, data = edges[edges['layerID'] == layer][['nodeID_source', 'nodeID_target', 'weight']].values.T
        layerwise_adj_matrices[layer] = scipy.sparse.csr_matrix(
            (data, (i, j)),
            shape=(max(nodes.nodeID) + 1, max(nodes.nodeID) + 1)
        )
    return layerwise_adj_matrices

time = %timeit -o create_layerwise_matrices()
```
Student solutions:
```
# Here comes your sample solution
def create_layerwise_matrices():
    layerwise_adj_matrices = {}
    # here comes your code
```
```
MemoryError: Unable to allocate 31.9 TiB for an array with shape (5920304, 5920304) and data type uint8
```
This happened for several of us (with both np and scipy dense arrays). Trying the sample solution above instead: the scipy sparse version works.
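A quick back-of-the-envelope check (just arithmetic on the numbers in the error message) of why the dense matrix cannot fit in memory:
```
# a dense N x N uint8 matrix needs one byte per entry
n = 5_920_304                             # number of nodes from the MemoryError above
dense_bytes = n * n
print(f"{dense_bytes / 2**40:.1f} TiB")   # ~31.9 TiB, matching the error
```
A sparse CSR matrix only stores the nonzero entries (the edges), which is why it fits comfortably.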
Time it takes to create a matrix for all three layers:
| Group | Time |
| -------- | -------- |
| Eszter | 10.4 s |
| Andi | 2.35 s |
| Bence | 3.6 s |
| Bendegúz | 3.53 s ± 63.7 ms |
| Norbi | 3.5 s |
| Máté | 12.7 s ± 384 ms |
| Mirkó | |
| Martin | 10.9 s |
## Exercise 2
Sample solutions for one combined adjacency matrix:
```
# Here comes your sample solution
import scipy.sparse

#%%timeit
# empty sparse matrix with one row/column per node
A = scipy.sparse.csr_matrix(
    (max(nodes.nodeID) + 1, max(nodes.nodeID) + 1),
    dtype='int32'
)
# each layer gets its own bit: layer index 0 -> 1, 1 -> 2, 2 -> 4
layers['binary'] = 2**layers.index
for l, b in zip(layers['layerID'], layers['binary']):
    A += scipy.sparse.csr_matrix(layerwise_adj_matrices[l] > 0, dtype='int32') * b
A
```
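A small sketch of how the bit-encoded entries can be read back; the node pair `i, j` below is hypothetical, and `A` and `layers` are as defined above. Each layer contributes `2**layer_index`, so an entry acts as a bitmask of the layers in which that edge exists:
```
# hypothetical node pair (i, j)
value = A[i, j]                          # e.g. 5 == 0b101 -> present in layers 0 and 2
present_in = [l for l, b in zip(layers['layerID'], layers['binary']) if value & b]
print(present_in)                        # layerIDs in which the edge (i, j) exists
```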
Q: What are we storing in `layerwise_adj_matrices`?
For me, that's a dictionary that stores the three individual adjacency matrices as `scipy.sparse.csr_matrix` objects, keyed by layer, e.g. `layerwise_adj_matrices[0]` holds the matrix of the first layer, `layerwise_adj_matrices[1]` that of the second layer, etc.
## Exercise 3
Size comparison:
```
# Linux / macOS shell command (some of us are using Windows, where this is not available):
!du -ha ...

# cross-platform alternative in plain Python:
import os
os.listdir()                              # Python equivalent of `ls`
os.path.getsize("/path/to/file.mp3")      # returns the file size in bytes, e.g. 2071611
```
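A cross-platform sketch of the same size check in plain Python; the paths are the ones used in the memory-measurement code further down, so adjust them to your own file names:
```
import os

for f in ['output/boston.npz',
          'BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges']:
    print(f, round(os.path.getsize(f) / 2**20, 1), 'MB')   # bytes -> MiB
```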
Speed:
```
%timeit -o -r1   # -o returns the timing result object, -r1 uses a single repeat
```
Memory:
```
from memory_profiler import memory_usage

memory_usage(functionname, interval=0.01)   # sample memory every 10 ms while `functionname` runs
```
Size comparison of npz and original files:
| Group | npz | compressed csv | original |
| -------- | -------- | -------- | -------- |
| Eszter | 58M | 58M | 162M |
| Bence | 28M | 53M | 162M |
| Máté | 29M | 56M | 162M |
| Bendegúz | 28M | 54M | 162M |
| Nikolett | 28M | 56M | 162M |
| Andi | 29M | 56M | 162M |
| Martin | 59M | 57M | 162M |
| Mirkó | 28M | 53M | 162M |
| Norbert | 29M | 56M | 162M |
Speed comparison of loading of npz and original files:
| Group | npz | compressed csv | original |
| -------- | -------- | -------- | -------- |
| Eszter | 1.2s | | 7.67s |
| Máté | 1.41 s ± 95.5 ms | 23.4 s ± 1.17 s | 44.7 s |
| Bence | 0.28s | 5.15s | 1.48s |
| Martin | 1.1 s | 19.6 s | 5.72 s |
| Bendegúz | 0.36 s | 2.85 s | 2.11 s |
| Mirkó | 1.1 s ± 38.1 ms | 10.3 s ± 184 ms | 7.14 s ± 146 ms |
| Norbert | 0.36 s | 2.02 s | 2.11 s |
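A minimal sketch of how such loading times can be measured; the `.npz` and edge-list paths are the ones from the memory example below, while `output/boston.csv.gz` is only an assumed name for the compressed csv:
```
import scipy.sparse
import pandas as pd

# npz with the sparse matrix
%timeit -r1 scipy.sparse.load_npz('output/boston.npz')
# compressed csv (assumed file name)
%timeit -r1 pd.read_csv('output/boston.csv.gz')
# original space-separated edge list
%timeit -r1 pd.read_csv('BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges', sep=' ', header=None)
```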
Some images on the comparison of memory use:
```
import scipy.sparse
import pandas as pd
import matplotlib.pyplot as plt
from memory_profiler import memory_usage

def load_npz():
    scipy.sparse.load_npz('output/boston.npz')

def load_pandas():
    edges = pd.read_csv('BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges', sep=' ', header=None).rename({0:'layerID', 1:'nodeID_source', 2:'nodeID_target', 3:'weight'}, axis=1)

# memory measurement: sample memory use every 10 ms while each loader runs
memory_npz = memory_usage(load_npz, interval=0.01)
memory_pandas = memory_usage(load_pandas, interval=0.01)

# plot the results
fig, ax = plt.subplots()
ax.plot(memory_npz, label="npz")
ax.plot(memory_pandas, label="pandas")
ax.set_xlabel('time')
ax.set_ylabel('memory (MB)')
ax.legend()
```


Conclusions:
* Eszter: even though the file sizes are the same for the npz and gz cases, the speed and memory gains from using the scipy sparse matrix are considerable.
## Exercise 4
Note down some of the file sizes and reading times that you got:
| Group | Method | File size | Reading time |
| -------- | -------- | -------- | -------- |
| Eszter | csv.gz | 58M | 2.21 s |
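A minimal sketch of the `csv.gz` round trip, assuming the `edges` DataFrame from Exercise 1 and a hypothetical output file name (pandas infers gzip compression from the `.gz` suffix):
```
# write the edge list compressed, then time reading it back
edges.to_csv('output/boston_edges.csv.gz', index=False)   # hypothetical file name
%timeit -r1 pd.read_csv('output/boston_edges.csv.gz')
```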
## Exercise 5
```
import dask.array as da
```
```
np.max() -> da.max()
```
Try to exchange the `max()` function in your code with `da.max()`, do the line profiling again, and note down what you got! Compare `max()`, `np.max()`, `pd.DataFrame.max()`, and `da.max().compute()`.
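A minimal sketch of the comparison, assuming the `nodes` DataFrame from Exercise 1 (the chunk size is an arbitrary choice):
```
import numpy as np
import dask.array as da

ids = nodes.nodeID.values                        # plain NumPy array of node IDs
dask_ids = da.from_array(ids, chunks=1_000_000)  # dask array split into chunks

%timeit max(ids)                    # Python builtin, element-by-element loop
%timeit np.max(ids)                 # vectorized NumPy reduction
%timeit nodes.nodeID.max()          # pandas reduction on the Series
%timeit da.max(dask_ids).compute()  # dask: builds a task graph, then evaluates it in parallel
```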
* Máté
    * pandas total 7.16198 s, 0.7 % of the time
    * base total 12.6247 s, 50.6 % of the time
    * numpy total 5.84312 s, 1.5 % of the time
    * dask total 7.51426 s, 15.4 % of the time
* Norbert
    * Pandas: 1.31 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Numpy: 1.32 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Dask: 1.54 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Builtin: 3.79 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * I think it would be different on a larger dataset.
* Martin
    * base 9.76667 s, 70.3% of the time
    * numpy
    * pandas
    * dask
* Erik
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`: 1243.2843 ms / run
    * exchanged to np.max gives 411.028 ms / run
    * exchanged to pd.df.max gives 303.485 ms / run
    * exchanged to da.max gives 514.781 ms / run
    * exchanged to dask.array.max().compute() gives 476.352 ms / run
    * exchanged to dask.dataframe.max() gives 374.426 ms / run
* Bendegúz: for me `da.max()` throws the following error (a possible explanation follows below):
    * `ValueError: the 'keepdims' parameter is not supported in the pandas implementation of max()`
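A possible explanation of the error above (my assumption, not verified on this dataset): when `da.max()` is handed a pandas object directly, the reduction ends up calling pandas' own `max()`, which does not accept the `keepdims` keyword that dask passes along. Converting to a dask array first avoids that path:
```
# minimal sketch of the workaround, assuming `nodes` from Exercise 1
import dask.array as da

ids = da.from_array(nodes.nodeID.values, chunks=1_000_000)  # wrap the underlying NumPy array
n_nodes = da.max(ids).compute() + 1                         # dask reduction, evaluated in parallel
```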
Use `htop` or `top` in a terminal and watch as multiple cores get switched on.
## Exercise 6
```
# Hints
%load_ext line_profiler
%lprun -f create_layerwise_matrices create_layerwise_matrices()
%prun -s cumtime create_layerwise_matrices()
```
Put here your observations / questions!!!
...
Copy your slowest line and the function in which your code spends the most time here!
* Eszter
    * line `#asdf`, function `#asdf`
* Bendegúz
    * line `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, pandas: 778 ms
* Erik
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`: 1243.2843 ms / run
    * exchanged to np.max gives 411.028 ms / run
    * exchanged to pd.df.max gives 303.485 ms / run
    * exchanged to da.max gives 514.781 ms / run
    * exchanged to da.max().compute() gives 476.352 ms / run
    * exchanged to dd.max() gives 364.267 ms / run
* Bence
    * line `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, function `create_layerwise_matrices`
* Martin
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
* Mirkó
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
    * exchanged to da.max().compute() gives 391.004 ms / run
    * exchanged to pd.DataFrame.max() gives 29.161 ms / run
* Máté
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
    * with pandas `.max()`, the line `layerwise_adj_matrices[layer] = scipy.sparse.csr_matrix` is the slowest (53.7% of the time)
* Norbert
    * line 14, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, function `create_layerwise_matrices`
* Bence
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
* Adri: my kernel keeps restarting, so I always have to start from the beginning
    * Your solutions are ok for me.
## Exercise 7
Can you name one measurement that is easy in SQL on graphs? And one that is hard?
* Eszter: easy / ??? / hard / ???
## Feedback
Have you learned anything new?
* numba looks useful
* dask, numba and line profiler will be very useful for the future at my work.
* I've never heard of numba before, and it may be a huge help for my work. Also, the memory usage tip is fascinating. For optimizing and debugging, the line profiler is amazing.
* line profiler is 10/10 knowledge
* yes
* Knowledge acquired here will prove useful when working with big data.
* Yes.
* useful
Was it too easy / just right / too hard?
* sample code was needed for me
* difficulty is just about right, need more time for the exercises though
* just right
Do you still have questions? If yes, either write here, or you can find me at `e.bokanyi@uva.nl`.
Thanks for your attention!