# Parallel data processing in Python
This is a collaborative document for solving the exercises during class.
## Roll call
| Name | Course | Level of Python |
| -------- | -------- | -------- |
| Eszter | guest lecturer | expert |
| Norbert | bioinformatics MSc student | advanced |
| Máté | physics MSc student | intermediate |
| Bence | physics MSc | expert |
| Bendegúz | physics MSc student | intermediate |
| Nikolett | physics PhD | advanced |
| Bogdán | physics PhD | advanced |
| Mirkó | physics PhD | advanced |
| Bence | physics MSc | intermediate |
| Andi | physics MSc | intermediate |
| Adri | physics MSc | intermediate |
| Martin | physics MSc | intermediate |
| János | lecturer | expert |
| Erik | physics MSc | advanced |
| Dani | physics MSc | advanced |
## Exercise 1
Could you create a numpy matrix?
Eszter - *very fast memory error :)*
Sample solutions for creating an adjacency matrix for one layer:
Hint:
```
# hint
import scipy.sparse

# i -> array for source nodes
# j -> array for target nodes
# data -> array of ones as long as i and j
# only select edges for one layer! (filter the `edges` dataframe)
A = scipy.sparse.csr_matrix((data, (i, j)))
```
Eszter
```
import scipy.sparse

def create_layerwise_matrices():
    layerwise_adj_matrices = {}
    for layer in layers.layerID:
        print(f"Creating layer {layer}.")
        i, j, data = edges[edges['layerID'] == layer][['nodeID_source', 'nodeID_target', 'weight']].values.T
        layerwise_adj_matrices[layer] = scipy.sparse.csr_matrix(
            (data, (i, j)),
            shape=(max(nodes.nodeID) + 1, max(nodes.nodeID) + 1)
        )
    return layerwise_adj_matrices

time = %timeit -o create_layerwise_matrices()
```
Student solutions:
```
# Here comes your sample solution
def create_layerwise_matrices():
    layerwise_adj_matrices = {}
    # here comes your code
```
```
MemoryError: Unable to allocate 31.9 TiB for an array with shape (5920304, 5920304) and data type uint8
```
This happened for several of us (with both np and scipy dense arrays). Trying the sample solution above instead: the scipy sparse version works.
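A quick back-of-the-envelope check (just arithmetic on the numbers in the error message) of why the dense matrix cannot fit in memory:
```
# a dense N x N uint8 matrix needs one byte per entry
n = 5_920_304                             # number of nodes from the MemoryError above
dense_bytes = n * n
print(f"{dense_bytes / 2**40:.1f} TiB")   # ~31.9 TiB, matching the error
```
A sparse CSR matrix only stores the nonzero entries (the edges), which is why it fits comfortably.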
Time it takes to create a matrix for all three layers:
| Group | Time |
| -------- | -------- |
| Eszter | 10.4 s |
| Andi | 2.35 s |
| Bence | 3.6 s |
| Bendegúz | 3.53 s ± 63.7 ms |
| Norbi | 3.5 s |
| Máté | 12.7 s ± 384 ms |
| Mirkó | |
| Martin | 10.9 s |
## Exercise 2
Sample solutions for one combined adjacency matrix:
```
# Here comes your sample solution
import scipy.sparse

#%%timeit
# empty sparse matrix with one row/column per node
A = scipy.sparse.csr_matrix(
    (max(nodes.nodeID) + 1, max(nodes.nodeID) + 1),
    dtype='int32'
)
# each layer gets its own bit: layer index 0 -> 1, 1 -> 2, 2 -> 4
layers['binary'] = 2**layers.index
for l, b in zip(layers['layerID'], layers['binary']):
    A += scipy.sparse.csr_matrix(layerwise_adj_matrices[l] > 0, dtype='int32') * b
A
```
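A small sketch of how the bit-encoded entries can be read back; the node pair `i, j` below is hypothetical, and `A` and `layers` are as defined above. Each layer contributes `2**layer_index`, so an entry acts as a bitmask of the layers in which that edge exists:
```
# hypothetical node pair (i, j)
value = A[i, j]                          # e.g. 5 == 0b101 -> present in layers 0 and 2
present_in = [l for l, b in zip(layers['layerID'], layers['binary']) if value & b]
print(present_in)                        # layerIDs in which the edge (i, j) exists
```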
Q: What are we storing in `layerwise_adj_matrices`?
For me, that's a dictionary that stores the three individual adjacency matrices as `scipy.sparse.csr_matrix` objects, keyed by layer, e.g. `layerwise_adj_matrices[0]` holds the matrix of the first layer, `layerwise_adj_matrices[1]` that of the second layer, etc.
## Exercise 3
Size comparison:
```
# Linux / macOS shell command (some of us are using Windows, where this is not available):
!du -ha ...

# cross-platform alternative in plain Python:
import os
os.listdir()                              # Python equivalent of `ls`
os.path.getsize("/path/to/file.mp3")      # returns the file size in bytes, e.g. 2071611
```
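A cross-platform sketch of the same size check in plain Python; the paths are the ones used in the memory-measurement code further down, so adjust them to your own file names:
```
import os

for f in ['output/boston.npz',
          'BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges']:
    print(f, round(os.path.getsize(f) / 2**20, 1), 'MB')   # bytes -> MiB
```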
Speed:
```
%timeit -o -r1   # -o returns the timing result object, -r1 uses a single repeat
```
Memory:
```
from memory_profiler import memory_usage

memory_usage(functionname, interval=0.01)   # sample memory every 10 ms while `functionname` runs
```
Size comparison of npz and original files:
| Group | npz | compressed csv | original |
| -------- | -------- | -------- | -------- |
| Eszter | 58M | 58M | 162M |
| Bence | 28M | 53M | 162M |
| Máté | 29M | 56M | 162M |
| Bendegúz | 28M | 54M | 162M |
| Nikolett | 28M | 56M | 162M |
| Andi | 29M | 56M | 162M |
| Martin | 59M | 57M | 162M |
| Mirkó | 28M | 53M | 162M |
| Norbert | 29M | 56M | 162M |
Speed comparison of loading of npz and original files:
| Group | npz | compressed csv | original |
| -------- | -------- | -------- | -------- |
| Eszter | 1.2s | | 7.67s |
| Máté | 1.41 s ± 95.5 ms | 23.4 s ± 1.17 s | 44.7 s |
| Bence | 0.28s | 5.15s | 1.48s |
| Martin | 1.1 s | 19.6 s | 5.72 s |
| Bendegúz | 0.36 s | 2.85 s | 2.11 s |
| Mirkó | 1.1 s ± 38.1 ms | 10.3 s ± 184 ms | 7.14 s ± 146 ms |
| Norbert | 0.36 s | 2.02 s | 2.11 s |
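A minimal sketch of how such loading times can be measured; the `.npz` and edge-list paths are the ones from the memory example below, while `output/boston.csv.gz` is only an assumed name for the compressed csv:
```
import scipy.sparse
import pandas as pd

# npz with the sparse matrix
%timeit -r1 scipy.sparse.load_npz('output/boston.npz')
# compressed csv (assumed file name)
%timeit -r1 pd.read_csv('output/boston.csv.gz')
# original space-separated edge list
%timeit -r1 pd.read_csv('BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges', sep=' ', header=None)
```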
Some images on the comparison of memory use:
```
import scipy.sparse
import pandas as pd
import matplotlib.pyplot as plt
from memory_profiler import memory_usage

def load_npz():
    scipy.sparse.load_npz('output/boston.npz')

def load_pandas():
    edges = pd.read_csv('BostonBomb2013/Dataset/BostonBomb2013_multiplex.edges', sep=' ', header=None).rename({0:'layerID', 1:'nodeID_source', 2:'nodeID_target', 3:'weight'}, axis=1)

# memory measurement: sample memory use every 10 ms while each loader runs
memory_npz = memory_usage(load_npz, interval=0.01)
memory_pandas = memory_usage(load_pandas, interval=0.01)

# plot the results
fig, ax = plt.subplots()
ax.plot(memory_npz, label="npz")
ax.plot(memory_pandas, label="pandas")
ax.set_xlabel('time')
ax.set_ylabel('memory (MB)')
ax.legend()
```


Conclusions:
* Eszter: even though the file sizes are the same for the npz and gz cases, the speed and memory gains from using the scipy sparse matrix are considerable.
## Exercise 4
Note down some of the file sizes and reading times that you got:
| Group | Method | File size | Reading time |
| -------- | -------- | -------- | -------- |
| Eszter | csv.gz | 58M | 2.21 s |
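A minimal sketch of the `csv.gz` round trip, assuming the `edges` DataFrame from Exercise 1 and a hypothetical output file name (pandas infers gzip compression from the `.gz` suffix):
```
# write the edge list compressed, then time reading it back
edges.to_csv('output/boston_edges.csv.gz', index=False)   # hypothetical file name
%timeit -r1 pd.read_csv('output/boston_edges.csv.gz')
```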
## Exercise 5
```
import dask.array as da
```
```
np.max() -> da.max()
```
Try to exchange the `max()` function in your code with `da.max()`, do the line profiling again, and note down what you got! Compare `max()`, `np.max()`, `pd.DataFrame.max()`, and `da.max().compute()`.
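A minimal sketch of the comparison, assuming the `nodes` DataFrame from Exercise 1 (the chunk size is an arbitrary choice):
```
import numpy as np
import dask.array as da

ids = nodes.nodeID.values                        # plain NumPy array of node IDs
dask_ids = da.from_array(ids, chunks=1_000_000)  # dask array split into chunks

%timeit max(ids)                    # Python builtin, element-by-element loop
%timeit np.max(ids)                 # vectorized NumPy reduction
%timeit nodes.nodeID.max()          # pandas reduction on the Series
%timeit da.max(dask_ids).compute()  # dask: builds a task graph, then evaluates it in parallel
```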
* Máté
    * pandas total 7.16198 s, 0.7 % of the time
    * base total 12.6247 s, 50.6 % of the time
    * numpy total 5.84312 s, 1.5 % of the time
    * dask total 7.51426 s, 15.4 % of the time
* Norbert
    * Pandas: 1.31 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Numpy: 1.32 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Dask: 1.54 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * Builtin: 3.79 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * I think it would be different on a larger dataset.
* Martin
    * base 9.76667 s, 70.3% of the time
    * numpy
    * pandas
    * dask
* Erik
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`: 1243.2843 ms / run
    * exchanged to np.max gives 411.028 ms / run
    * exchanged to pd.df.max gives 303.485 ms / run
    * exchanged to da.max gives 514.781 ms / run
    * exchanged to dask.array.max().compute() gives 476.352 ms / run
    * exchanged to dask.dataframe.max() gives 374.426 ms / run
* Bendegúz: for me `da.max()` throws the following error (a possible explanation follows below):
    * `ValueError: the 'keepdims' parameter is not supported in the pandas implementation of max()`
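A possible explanation of the error above (my assumption, not verified on this dataset): when `da.max()` is handed a pandas object directly, the reduction ends up calling pandas' own `max()`, which does not accept the `keepdims` keyword that dask passes along. Converting to a dask array first avoids that path:
```
# minimal sketch of the workaround, assuming `nodes` from Exercise 1
import dask.array as da

ids = da.from_array(nodes.nodeID.values, chunks=1_000_000)  # wrap the underlying NumPy array
n_nodes = da.max(ids).compute() + 1                         # dask reduction, evaluated in parallel
```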
Use `htop` or `top` in a terminal and watch as multiple cores get switched on.
## Exercise 6
```
# Hints
%load_ext line_profiler
%lprun -f create_layerwise_matrices create_layerwise_matrices()
%prun -s cumtime create_layerwise_matrices()
```
Put here your observations / questions!!!
...
Copy your slowest line and the function in which your code spends the most time here!
* Eszter
    * line `#asdf`, function `#asdf`
* Bendegúz
    * line `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, pandas: 778 ms
* Erik
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`: 1243.2843 ms / run
    * exchanged to np.max gives 411.028 ms / run
    * exchanged to pd.df.max gives 303.485 ms / run
    * exchanged to da.max gives 514.781 ms / run
    * exchanged to da.max().compute() gives 476.352 ms / run
    * exchanged to dd.max() gives 364.267 ms / run
* Bence
    * line `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, function `create_layerwise_matrices`
* Martin
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
* Mirkó
    * line 8, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
    * exchanged to da.max().compute() gives 391.004 ms / run
    * exchanged to pd.DataFrame.max() gives 29.161 ms / run
* Máté
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
    * with pandas `.max()`, the line `layerwise_adj_matrices[layer] = scipy.sparse.csr_matrix` is the slowest (53.7% of the time)
* Norbert
    * line 14, `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`, function `create_layerwise_matrices`
* Bence
    * `shape=(max(nodes.nodeID)+1,max(nodes.nodeID)+1)`
* Adri: my kernel keeps restarting, so I always have to start from the beginning
    * Your solutions are ok for me.
## Exercise 7
Can you name one measurement that is easy in SQL on graphs? And one that is hard?
* Eszter: easy / ??? / hard / ???
## Feedback
Have you learned anything new?
* numba looks useful
* dask, numba and line profiler will be very useful for the future at my work.
* I've never heard of numba before, and it may be a huge help for my work. Also, the memory usage tip is fascinating. For optimizing and debugging, the line profiler is amazing.
* line profiler is 10/10 knowledge
* yes
* Knowledge acquired here will prove useful when working with big data.
* Yes.
* useful
Was it too easy / just right / too hard?
* sample code was needed for me
* difficulty is just about right, need more time for the exercises though
* just right
Do you still have questions? If yes, either write here, or you can find me at `e.bokanyi@uva.nl`.
Thanks for your attention!