Sorting Arrays

# Sorting Arrays > Lee Tsung-Tang > ###### tags: `python` `numpy` `Python Data Science Handbook` `sort` {%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %} [TOC] > This section covers algorithms related to ==sorting== values in `NumPy arrays`. > 在一般的CS課之中會學如`insertion sorts, selection sorts, merge sorts, quick sorts, bubble sorts`等排序的演算法 > > 例如以下即是selection sort的範例 ```python= import numpy as np def selection_sort(x): for i in range(len(x)): swap = i + np.argmin(x[i:]) (x[i], x[swap]) = (x[swap], x[i]) return x x = np.array([2, 1, 4, 3, 5]) selection_sort(x) array([1, 2, 3, 4, 5]) ``` > selection sort很簡單易用，但是當資料大時速度卻太慢了。對N個value來說，它總共要跑 N 次loop，而每次loop都需要比較第i+1 ~ N的值才知道是否要交換。因此平均的時間複雜度為$O(N^2)$ > selection sort雖然慢但也好過 bogosort: ```python= def bogosort(x): # 0~(n-1)每個value要各自小於1~n對應的 # value，表示x由小排到大 while np.any(x[:-1] > x[1:]): np.random.shuffle(x) # 隨機排序 return x x = np.array([2, 1, 4, 3, 5]) bogosort(x) array([1, 2, 3, 4, 5]) ``` > 這個方法排序的平均時間複雜度為$O(N*N!)$，基本上不太實用 ## Fast Sorting in NumPy: `np.sort` and `np.argsort` > 雖然python 有 buuild-in sort，但效率較差。 > `np.sort`演算法的平均時間複雜度為$O(Nlog(N))$ (though mergesort and heapsort are also available) > `np.sort()`可以不直接改變input，返回排序後的結果 ```python= x = np.array([2, 1, 4, 3, 5]) np.sort(x) #array([1, 2, 3, 4, 5]) ``` > 想要in-place sort可以對`array`用`sort method` ```python= x.sort() print(x) #[1 2 3 4 5] ``` > `np.argsort()`不直接返回排序後的結果，而是返回==indices of the sorted elements== ```python= x = np.array([2, 1, 4, 3, 5]) i = np.argsort(x) print(i) #[1 0 3 2 4] ``` > 可以用這個index對原本的array進行排序 ```python=+ x[i] #array([1, 2, 3, 4, 5]) ``` ### Sorting along rows or columns > numpy的sorting algorithms可以在multidimensional array中用`axis`控制要對row或column排序 ```python= rand = np.random.RandomState(42) X = rand.randint(0, 10, (4, 6)) print(X) #[[6 3 7 4 6 9] # [2 6 7 4 3 7] # [7 2 5 4 1 7] # [5 1 4 0 9 5]] # sort each column of X np.sort(X, axis=0) #array([[2, 1, 4, 0, 1, 5], # [5, 2, 5, 4, 3, 7], # [6, 3, 7, 4, 6, 7], # [7, 6, 7, 4, 9, 9]]) # sort each row of X np.sort(X, axis=1) #array([[3, 4, 6, 6, 7, 9], # [2, 3, 4, 6, 7, 7], # [1, 2, 4, 5, 7, 7], # [0, 1, 4, 5, 5, 9]]) ``` :warning: 在做`np.sort()`排序時每一個row/column都被視為完全獨立的`array`，如果array間有所關係可能會因為排序亂掉 ## Partial Sorts: Partitioning > `np.partition()`會找出最小的k個value放在左側，其他值則保留在右側 ```python= x = np.array([7, 2, 3, 1, 6, 5, 4]) np.partition(x, 3) array([2, 1, 3, 4, 6, 5, 7]) ``` :a: `[2,1,3]`是array中最小的3個值，右邊則是其他值，組內的排序則是隨機的 > 同樣可以用`axis`參數控制對multidimensional array的column/row做排序 ```python=+ np.partition(X, 2, axis=1) # 對列排序 array([[3, 4, 6, 7, 6, 9], [2, 3, 4, 7, 6, 7], [1, 2, 4, 5, 7, 7], [0, 1, 4, 5, 9, 5]]) ``` :a: 每列左側最小的兩個ele.都是整列array中最小的值 > 同樣的，`np.argpartition()`可以返回排序的index ## Example: k-Nearest Neighbors > 使用`np.argsort()`找出nearest neighbors of each point in a set > 一個二維的array ```python= X = np.random.rand(4,2) %matplotlib inline import matplotlib.pyplot as plt import seaborn; seaborn.set() # Plot styling plt.scatter(X[:, 0], X[:, 1], s=100); ``` ![](https://i.imgur.com/z2sO6FK.png) > 接著用broadcasting ([Computation on Arrays: Broadcasting](/kEu31ha4QCmfqerRM-M-CA)) 與aggregation ([Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ)) 計算點跟點之間的square distances matrix: > ```python= dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1) ``` > 上面的計算可以拆成多個小步驟: ```python= # for each pair of points, compute differences in their coordinates differences = X[:, np.newaxis, :] - X[np.newaxis, :, :] differences.shape #(10, 10, 2) ``` ```python=+ # square the coordinate differences sq_differences = differences ** 2 sq_differences.shape (10, 10, 2) ``` ```python=+ # sum the coordinate differences to get the squared distance dist_sq = sq_differences.sum(-1) dist_sq.shape (10, 10) ``` > 檢查對角線是否都=0 (自己與自己的距離) ```python= dist_sq.diagonal() array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) ``` > 接著用 `np.argsort` 排序，愈左側的值是離自己愈近的elements :warning: 第一個column是自己 ```python= nearest = np.argsort(dist_sq, axis=1) print(nearest) [[0 3 9 7 1 4 2 5 6 8] [1 4 7 9 3 6 8 5 0 2] [2 1 4 6 3 0 8 9 7 5] [3 9 7 0 1 4 5 8 6 2] [4 1 8 5 6 7 9 3 0 2] [5 8 6 4 1 7 9 3 2 0] [6 8 5 4 1 7 9 3 2 0] [7 9 3 1 4 0 5 8 6 2] [8 5 6 4 1 7 9 3 2 0] [9 7 3 0 1 4 5 8 6 2]] ``` > 如果只想知道最近的k個點，可以用`np.argpartition`找每列最近的k+1個點 ```python= K = 2 nearest_partition = np.argpartition(dist_sq, K + 1, axis=1) ``` > 將每個點與其最近的兩個點為edges繪製成網絡圖 ```python= plt.scatter(X[:, 0], X[:, 1], s=100) # draw lines from each point to its two nearest neighbors K = 2 for i in range(X.shape[0]): for j in nearest_partition[i, :K+1]: # plot a line from X[i] to X[j] # use some zip magic to make it happen: plt.plot(*zip(X[j], X[i]), color='black') ``` ![](https://i.imgur.com/5OQknHZ.png) > numpy的特殊寫法(vectorized version)雖然較不直覺但速度很快Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python. >

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.