# Pandas 套件用法速查
###### tags: `Python`, `Pandas`
https://vimsky.com/zh-tw/article/4361.html
## Drop rows
https://sparkbyexamples.com/pandas/pandas-drop-rows-from-dataframe/
## Convert date-type
https://stackoverflow.com/questions/15891038/change-column-type-in-pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
## Sort dataframe by series (`str`)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
## Sort dataframe by series (`int`/`float`)
Sort by the 價格 (price) column:
stock_prices, new_indices = zip(*sorted(zip(df["價格"].tolist(), df.index.tolist())))
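A runnable sketch of the zip trick above, with made-up toy data, next to the more idiomatic `sort_values`:

```python
import pandas as pd

# toy data (made up for illustration); column name 價格 = "price"
df = pd.DataFrame({"價格": [30, 10, 20]}, index=["c", "a", "b"])

# the zip trick above: sorted prices plus their original indices
stock_prices, new_indices = zip(*sorted(zip(df["價格"].tolist(), df.index.tolist())))
print(stock_prices)  # (10, 20, 30)

# idiomatic alternative: sort the whole DataFrame by that column
df_sorted = df.sort_values("價格")
print(df_sorted["價格"].tolist())  # [10, 20, 30]
```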
## 1. Is an element in a specific **column** (type: `pd.Series`)?
Let `col_A` be a `pd.Series`.
**"(value)" in col_A.unique()**
P.S. **"(index)" in col_A** checks the **index**, not the values.
> https://stackoverflow.com/questions/21319929/how-to-determine-whether-a-pandas-column-contains-a-particular-value
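A small sketch (toy data) showing why the distinction matters: `in` on a bare Series tests index labels, while `in col_A.unique()` tests values.

```python
import pandas as pd

col_A = pd.Series([10, 20, 30], index=["x", "y", "z"])

# `in` on a bare Series checks the INDEX labels, not the values
in_index = "x" in col_A            # True: "x" is an index label
in_values = 10 in col_A.unique()   # True: 10 is a value
not_in_index = 10 in col_A         # False: 10 is a value, not an index label

print(in_index, in_values, not_in_index)  # True True False
```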
## 2. Useful parameters of `read_csv`
pd.read_csv(csv_file_path, header=None, encoding="utf-8-sig", skiprows=3)
* header=None
Read a CSV file **without a header**.
By default, `read_csv` treats the first row as the header.
* encoding="utf-8-sig"
Read a CSV file **with Chinese content** (strips the UTF-8 BOM that e.g. Excel writes).
* skiprows=3
Read a CSV file **skipping the first 3 rows**.
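A self-contained sketch of `header=None` and `skiprows` using an in-memory "file" (`io.StringIO`); `encoding="utf-8-sig"` applies when reading a real file path instead.

```python
import io
import pandas as pd

# simulate a file: three junk lines, then headerless data
raw = "junk line 1\njunk line 2\njunk line 3\n1,a\n2,b\n"

df = pd.read_csv(io.StringIO(raw), header=None, skiprows=3)
print(df.shape)             # (2, 2)
print(df.columns.tolist())  # [0, 1]  (integer labels: no header row was consumed)
```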
## 3. Get specific **column** data from `pd.DataFrame`
* Method 1. Using **index**
1.1 Column name included (returns a `pd.DataFrame`)
df.iloc[:, **[1]**]
1.2 Column name not included, only values (returns a `pd.Series`)
df.iloc[:, **1**]
P.S. index starts from **0**
* Method 2. Using **col. name**
2.1 Column name included (returns a `pd.DataFrame`)
df.loc[:, **["B"]**]
2.2 Column name not included, only values (returns a `pd.Series`)
df.loc[:, **"B"**]
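A quick sketch (toy data) confirming the return types: a list selector keeps the column name (DataFrame), a bare selector gives only values (Series).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# list selector -> DataFrame (column name kept)
by_idx_df = df.iloc[:, [1]]
by_name_df = df.loc[:, ["B"]]

# bare selector -> Series (only values)
by_idx_s = df.iloc[:, 1]
by_name_s = df.loc[:, "B"]

print(type(by_idx_df).__name__, type(by_idx_s).__name__)  # DataFrame Series
```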
## 4. Concatenate different `df`
### Type 1. Concatenate **by columns**
- Common case
- Equivalent to placing the two `df` side by side (left and right)
- Usage:
pd.concat([df1, df2], **axis=1**)
P.S.
Columns of both df1 and df2 are set beforehand.
Such as:
df1 = ...
df1.columns = ["dog", "cat", "duck"]
df2 = ...
df2.columns = ["tiger", "lion"]
### Type 2. Concatenate **by indices (rows)**
- Default behavior (`axis=0`)
- Equivalent to stacking the two `df` (up and down)
- Usage:
pd.concat([df1, df2], **axis=0**)
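A sketch of both concat directions, reusing the animal column names from the note (data values are made up):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=["dog", "cat", "duck"])
df2 = pd.DataFrame([[4, 5]], columns=["tiger", "lion"])

# side by side (by columns)
wide = pd.concat([df1, df2], axis=1)
print(wide.columns.tolist())  # ['dog', 'cat', 'duck', 'tiger', 'lion']

# stacked (by rows, the default); mismatched columns become NaN
tall = pd.concat([df1, df2], axis=0)
print(tall.shape)  # (2, 5)
```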
## 5. Rolling (Moving)
E.g.
First, make a `df` with a single column.
df = pd.DataFrame({'B': [1, 2, 3, 4, 5]})
Rule:
If **win_size = k**,
the first **k-1** elements have no value (`NaN`);
the window is full starting from **the k-th element**.
Note:
Assigning a column of `df2` directly as a new column of `df1`
may trigger a `SettingWithCopyWarning` (the assignment usually still works).
A better alternative is to use **pd.concat()** instead, e.g.:
pd.concat([`df1`, `df2`], axis=1)
to append the columns of `df2` to the right of `df1`.
1. Moving Sum
Usage: **df.rolling(2).sum()**
Result:
B
0 NaN
1 3.0
2 5.0
3 7.0
4 9.0
2. Moving Average (MA)
Usage: **df.rolling(2, win_type='triang').sum()** (requires SciPy; for window 2 this equals `df.rolling(2).mean()`)
Result:
B
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
3. Moving High
Usage: **df.rolling(2).max()**
Result:
B
0 NaN
1 2.0
2 3.0
3 4.0
4 5.0
4. Moving Low
Usage: **df.rolling(2).min()**
Result:
B
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
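The four variants above, verified in one runnable sketch (using plain `.mean()` for the MA, so SciPy's `win_type` is not needed):

```python
import pandas as pd

df = pd.DataFrame({'B': [1, 2, 3, 4, 5]})

# with win_size = 2, only row 0 (the first k-1 = 1 rows) is NaN
msum = df.rolling(2).sum()['B']
mavg = df.rolling(2).mean()['B']  # same numbers as the win_type='triang' sum above
mmax = df.rolling(2).max()['B']
mmin = df.rolling(2).min()['B']

print(msum.tolist()[1:])  # [3.0, 5.0, 7.0, 9.0]
print(mavg.tolist()[1:])  # [1.5, 2.5, 3.5, 4.5]
```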
## 6. Which direction does pandas execute the "sum()" operation by default, column or row?
ANS: down each **column** (`axis=0`); the result is one sum per column
E.g., df = pd.DataFrame({"A":[1,2,3], "B": [4,5,6]})
**df**
| (index) | A | B |
| -------- | -------- | -------- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
// data-type: DataFrame
**df.sum()**
| (index) | value |
| -------- | -------- |
| A | 6 |
| B | 15 |
// data-type: Series
**df.sum().sum()**
21
// data-type: numpy.int64
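The same example as a runnable sketch, plus `axis=1` for per-row sums (the complementary direction):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

col_sums = df.sum()        # default axis=0: one sum per column -> Series
row_sums = df.sum(axis=1)  # axis=1: one sum per row
total = df.sum().sum()     # grand total

print(col_sums.tolist())  # [6, 15]
print(row_sums.tolist())  # [5, 7, 9]
print(total)              # 21
```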
## 7. How to indicate specific column by index and by column name?
Suppose A, B, C are located respectively at col 1 (idx: 0), col 2, and col 3 of `df`.
* By index: use `iloc` with **brackets**; the positional slice is **end-exclusive**
df.iloc[:, **start_col_idx : end_col_idx+1**]
E.g.,
`df`.iloc[:, 0:1] // column with index 0
`df`.iloc[:, 0:2] // columns with index 0, 1
`df`.iloc[:, 0:3] // columns with index 0, 1, 2
* By column name: use **`loc`** (not `iloc`); the label slice is **end-inclusive**
df.loc[:, **start_col_name : end_col_name**]
E.g.,
`df`.loc[:, "A":"A"] // column "A" (as a DataFrame, with header)
`df`.loc[:, "A":"B"] // columns "A", "B"
`df`.loc[:, "A":"C"] // columns "A", "B", "C"
`df`.loc[:, "A"] // column "A" (**as a Series / only values**)
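A quick sketch (toy data) contrasting the two slice semantics: `iloc` is end-exclusive, `loc` label slices are end-inclusive.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# iloc positional slices are end-EXCLUSIVE
iloc_cols = df.iloc[:, 0:2].columns.tolist()    # ['A', 'B']

# loc label slices are end-INCLUSIVE
loc_cols = df.loc[:, "A":"C"].columns.tolist()  # ['A', 'B', 'C']

# a bare label returns a Series (values only)
series_vals = df.loc[:, "A"].tolist()           # [1]

print(iloc_cols, loc_cols, series_vals)
```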
## 8. Retrieve column index via column name
ANS:
df.columns.get_loc("A")
> Suppose DataFrame `df` has 3 columns: A, B, C
> and want to get the index of column "A".
## 9. Get the index of a specific column (same idea as 8)
➡️ **.get_loc**
➡️ usage
col_name = "test"
df.columns.get_loc(col_name)
## 10. Convert "pandas series" into "numpy array"
e.g. `A` is a pandas Series
➡️ A.values (or the preferred **A.to_numpy()**)
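A minimal sketch: both spellings return an `ndarray`; `to_numpy()` is the spelling the pandas docs recommend.

```python
import numpy as np
import pandas as pd

A = pd.Series([1, 2, 3])

arr = A.values       # classic spelling
arr2 = A.to_numpy()  # spelling recommended by the pandas docs

print(type(arr).__name__)  # ndarray
print(arr2.tolist())       # [1, 2, 3]
```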
## 11. Three ways to find the column index (pandas.DataFrame)
* Method 1
np.where(df.columns.values=="col_target")[0][0]
* Method 2
df.columns.tolist().index("col_target")
* Method 3
np.where((df.columns=="col_target")==True)[0][0]
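A sketch (toy columns) verifying that all three methods, plus `get_loc` from sections 8/9, agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "col_target", "C"])

m1 = np.where(df.columns.values == "col_target")[0][0]
m2 = df.columns.tolist().index("col_target")
m3 = np.where((df.columns == "col_target") == True)[0][0]
m4 = df.columns.get_loc("col_target")  # simplest (sections 8/9)

print(m1, m2, m3, m4)  # 1 1 1 1
```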
## 12. Get the last element after string-splitting a series
Example:
**pd_series.str.split('\\').str[-1]**
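A sketch with made-up Windows-style paths; note `'\\'` in source code is a single literal backslash.

```python
import pandas as pd

# toy paths (made up for illustration)
s = pd.Series([r"C:\dir\file1.txt", r"D:\data\file2.csv"])

# split on the backslash and keep the last piece: the file name
tails = s.str.split('\\').str[-1]
print(tails.tolist())  # ['file1.txt', 'file2.csv']
```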
## 13. Convert dataframe into {key: {val_1: val_2}}
```
import pandas as pd

data = {
    'KeyColumn': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value1': [10, 20, 15, 25, 30, 40],
    'Value2': [100, 200, 150, 250, 300, 400]
}
df = pd.DataFrame(data)

# Build the dictionary `info`
info = {}
# Iterate over unique values in 'KeyColumn'
for key in df['KeyColumn'].unique():
    # Select rows with the current key
    subset = df[df['KeyColumn'] == key]
    # Map Value1 -> Value2 for this key
    info[key] = dict(zip(subset['Value1'], subset['Value2']))

print(info)
# {'A': {10: 100, 20: 200}, 'B': {15: 150, 25: 250}, 'C': {30: 300, 40: 400}}
```
## 14. Convert dataframe into {attr: {index: val}}
```
import pandas as pd

data = {
    'Index': [1, 2, 3, 4, 5, 6],
    'Attribute1': [10, 20, 15, 25, 30, 40],
    'Attribute2': [100, 200, 150, 250, 300, 400]
}
df = pd.DataFrame(data)

# Transform the DataFrame to a dictionary
info = df.set_index('Index').to_dict()
print(info)
# {'Attribute1': {1: 10, 2: 20, 3: 15, 4: 25, 5: 30, 6: 40},
#  'Attribute2': {1: 100, 2: 200, 3: 150, 4: 250, 5: 300, 6: 400}}
```
## 15. Divide dataframe `df` into different groups by two specific columns
```
import pandas as pd

# Sample DataFrame (replace this with your own data)
data = {'Column1': [1, 2, 1, 2, 3, 3],
        'Column2': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)

# Group by two specific columns (Column1 and Column2)
grouped = df.groupby(['Column1', 'Column2'])

# Iterate through the groups
for group_name, group_df in grouped:
    print(f"Group: {group_name}")
    print(group_df)
    print("---------------------")
```
## 16. Filter out empty records (rows where **all** columns are NaN)
df = df.dropna(how='all')
## 17. Delete records where the value in column 'A' contains "123"
df = df[~df['A'].str.contains("123")]
## 18. Delete records where the value in column 'A' starts with "123" and ends with "456"
df = df[~((df['A'].str.startswith("123")) & (df['A'].str.endswith("456")))]
## 19. For all records: if column `A` contains "\n", replace the value with the first element of `value.split('\n')`
df['A'] = df['A'].apply(lambda x: x.split('\n')[0] if '\n' in x else x)
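Sections 16-19 combined into one runnable sketch (toy data, made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["123abc", "123xyz456", "keep\nme", "other"],
                   "B": [1, 2, 3, 4]})

# 17: drop rows whose A contains "123"
no_123 = df[~df["A"].str.contains("123")]
print(no_123["A"].tolist())  # ['keep\nme', 'other']

# 18: drop rows whose A starts with "123" AND ends with "456"
trimmed = df[~((df["A"].str.startswith("123")) & (df["A"].str.endswith("456")))]
print(len(trimmed))  # 3

# 19: keep only the text before the first newline
df["A"] = df["A"].apply(lambda x: x.split("\n")[0] if "\n" in x else x)
print(df["A"].tolist())  # ['123abc', '123xyz456', 'keep', 'other']

# 16: drop rows where ALL columns are NaN
with_empty = pd.concat([df, pd.DataFrame({"A": [np.nan], "B": [np.nan]})])
print(len(with_empty.dropna(how="all")))  # 4
```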