owned this note
owned this note
Published
Linked with GitHub
# Sorting Arrays
> Lee Tsung-Tang
> ###### tags: `python` `numpy` `Python Data Science Handbook` `sort`
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
[TOC]
> This section covers algorithms related to ==sorting== values in `NumPy arrays`.
> 在一般的CS課之中會學如`insertion sorts, selection sorts, merge sorts, quick sorts, bubble sorts`等排序的演算法
>
> 例如以下即是selection sort的範例
```python=
import numpy as np
def selection_sort(x):
for i in range(len(x)):
swap = i + np.argmin(x[i:])
(x[i], x[swap]) = (x[swap], x[i])
return x
x = np.array([2, 1, 4, 3, 5])
selection_sort(x)
array([1, 2, 3, 4, 5])
```
> selection sort很簡單易用,但是當資料大時速度卻太慢了。對N個value來說,它總共要跑 N 次loop,而每次loop都需要比較第i+1 ~ N的值才知道是否要交換。因此平均的時間複雜度為$O(N^2)$
> selection sort雖然慢但也好過 bogosort:
```python=
def bogosort(x):
# 0~(n-1)每個value要各自小於1~n對應的
# value,表示x由小排到大
while np.any(x[:-1] > x[1:]):
np.random.shuffle(x) # 隨機排序
return x
x = np.array([2, 1, 4, 3, 5])
bogosort(x)
array([1, 2, 3, 4, 5])
```
> 這個方法排序的平均時間複雜度為$O(N*N!)$,基本上不太實用
## Fast Sorting in NumPy: `np.sort` and `np.argsort`
> 雖然python 有 buuild-in sort,但效率較差。
> `np.sort`演算法的平均時間複雜度為$O(Nlog(N))$ (though mergesort and heapsort are also available)
> `np.sort()`可以不直接改變input,返回排序後的結果
```python=
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
#array([1, 2, 3, 4, 5])
```
> 想要in-place sort可以對`array`用`sort method`
```python=
x.sort()
print(x)
#[1 2 3 4 5]
```
> `np.argsort()`不直接返回排序後的結果,而是返回==indices of the sorted elements==
```python=
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
#[1 0 3 2 4]
```
> 可以用這個index對原本的array進行排序
```python=+
x[i]
#array([1, 2, 3, 4, 5])
```
### Sorting along rows or columns
> numpy的sorting algorithms可以在multidimensional array中用`axis`控制要對row或column排序
```python=
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
#[[6 3 7 4 6 9]
# [2 6 7 4 3 7]
# [7 2 5 4 1 7]
# [5 1 4 0 9 5]]
# sort each column of X
np.sort(X, axis=0)
#array([[2, 1, 4, 0, 1, 5],
# [5, 2, 5, 4, 3, 7],
# [6, 3, 7, 4, 6, 7],
# [7, 6, 7, 4, 9, 9]])
# sort each row of X
np.sort(X, axis=1)
#array([[3, 4, 6, 6, 7, 9],
# [2, 3, 4, 6, 7, 7],
# [1, 2, 4, 5, 7, 7],
# [0, 1, 4, 5, 5, 9]])
```
:warning: 在做`np.sort()`排序時每一個row/column都被視為完全獨立的`array`,如果array間有所關係可能會因為排序亂掉
## Partial Sorts: Partitioning
> `np.partition()`會找出最小的k個value放在左側,其他值則保留在右側
```python=
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])
```
:a: `[2,1,3]`是array中最小的3個值,右邊則是其他值,組內的排序則是隨機的
> 同樣可以用`axis`參數控制對multidimensional array的column/row做排序
```python=+
np.partition(X, 2, axis=1) # 對列排序
array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])
```
:a: 每列左側最小的兩個ele.都是整列array中最小的值
> 同樣的,`np.argpartition()`可以返回排序的index
## Example: k-Nearest Neighbors
> 使用`np.argsort()`找出nearest neighbors of each point in a set
> 一個二維的array
```python=
X = np.random.rand(4,2)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);
```

> 接著用broadcasting ([Computation on Arrays: Broadcasting](/kEu31ha4QCmfqerRM-M-CA)) 與aggregation ([Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ)) 計算點跟點之間的square distances matrix:
>
```python=
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
```
> 上面的計算可以拆成多個小步驟:
```python=
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape
#(10, 10, 2)
```
```python=+
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape
(10, 10, 2)
```
```python=+
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape
(10, 10)
```
> 檢查對角線是否都=0 (自己與自己的距離)
```python=
dist_sq.diagonal()
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
```
> 接著用 `np.argsort` 排序,愈左側的值是離自己愈近的elements
:warning: 第一個column是自己
```python=
nearest = np.argsort(dist_sq, axis=1)
print(nearest)
[[0 3 9 7 1 4 2 5 6 8]
[1 4 7 9 3 6 8 5 0 2]
[2 1 4 6 3 0 8 9 7 5]
[3 9 7 0 1 4 5 8 6 2]
[4 1 8 5 6 7 9 3 0 2]
[5 8 6 4 1 7 9 3 2 0]
[6 8 5 4 1 7 9 3 2 0]
[7 9 3 1 4 0 5 8 6 2]
[8 5 6 4 1 7 9 3 2 0]
[9 7 3 0 1 4 5 8 6 2]]
```
> 如果只想知道最近的k個點,可以用`np.argpartition`找每列最近的k+1個點
```python=
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
```
> 將每個點與其最近的兩個點為edges繪製成網絡圖
```python=
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K = 2
for i in range(X.shape[0]):
for j in nearest_partition[i, :K+1]:
# plot a line from X[i] to X[j]
# use some zip magic to make it happen:
plt.plot(*zip(X[j], X[i]), color='black')
```

> numpy的特殊寫法(vectorized version)雖然較不直覺但速度很快Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python.
>