# Pandas 套件用法速查
###### tags: `Python`, `Pandas`
https://vimsky.com/zh-tw/article/4361.html
## Drop rows
https://sparkbyexamples.com/pandas/pandas-drop-rows-from-dataframe/
## Convert date-type
https://stackoverflow.com/questions/15891038/change-column-type-in-pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
## Sort dataframe by series (`str`)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
## Sort dataframe by series (`int`/`float`)
Sort by the 價格 (price) column:
stock_prices, new_indices = zip(*sorted(zip(df["價格"].tolist(), df.index.tolist())))
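A runnable sketch of the zip trick above, with made-up toy data, next to the more idiomatic `sort_values`:

```python
import pandas as pd

# toy data (made up for illustration); column name 價格 = "price"
df = pd.DataFrame({"價格": [30, 10, 20]}, index=["c", "a", "b"])

# the zip trick above: sorted prices plus their original indices
stock_prices, new_indices = zip(*sorted(zip(df["價格"].tolist(), df.index.tolist())))
print(stock_prices)  # (10, 20, 30)

# idiomatic alternative: sort the whole DataFrame by that column
df_sorted = df.sort_values("價格")
print(df_sorted["價格"].tolist())  # [10, 20, 30]
```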
## 1. Is an element in a specific **column** (type: `pd.Series`)?
Let `col_A` be a `pd.Series`.
**"(value)" in col_A.unique()**
P.S. **"(index)" in col_A** checks the **index**, not the values.
> https://stackoverflow.com/questions/21319929/how-to-determine-whether-a-pandas-column-contains-a-particular-value
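A small sketch (toy data) showing why the distinction matters: `in` on a bare Series tests index labels, while `in col_A.unique()` tests values.

```python
import pandas as pd

col_A = pd.Series([10, 20, 30], index=["x", "y", "z"])

# `in` on a bare Series checks the INDEX labels, not the values
in_index = "x" in col_A            # True: "x" is an index label
in_values = 10 in col_A.unique()   # True: 10 is a value
not_in_index = 10 in col_A         # False: 10 is a value, not an index label

print(in_index, in_values, not_in_index)  # True True False
```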
## 2. Useful parameters of `read_csv`
pd.read_csv(csv_file_path, header=None, encoding="utf-8-sig", skiprows=3)
* header=None
Read a CSV file **without a header**.
By default, `read_csv` treats the first row as the header.
* encoding="utf-8-sig"
Read a CSV file **with Chinese content** (strips the UTF-8 BOM that e.g. Excel writes).
* skiprows=3
Read a CSV file **skipping the first 3 rows**.
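A self-contained sketch of `header=None` and `skiprows` using an in-memory "file" (`io.StringIO`); `encoding="utf-8-sig"` applies when reading a real file path instead.

```python
import io
import pandas as pd

# simulate a file: three junk lines, then headerless data
raw = "junk line 1\njunk line 2\njunk line 3\n1,a\n2,b\n"

df = pd.read_csv(io.StringIO(raw), header=None, skiprows=3)
print(df.shape)             # (2, 2)
print(df.columns.tolist())  # [0, 1]  (integer labels: no header row was consumed)
```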
## 3. Get specific **column** data from `pd.DataFrame`
* Method 1. Using **index**
1.1 Column name included (returns a `pd.DataFrame`)
df.iloc[:, **[1]**]
1.2 Column name not included, only values (returns a `pd.Series`)
df.iloc[:, **1**]
P.S. index starts from **0**
* Method 2. Using **col. name**
2.1 Column name included (returns a `pd.DataFrame`)
df.loc[:, **["B"]**]
2.2 Column name not included, only values (returns a `pd.Series`)
df.loc[:, **"B"**]
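A quick sketch (toy data) confirming the return types: a list selector keeps the column name (DataFrame), a bare selector gives only values (Series).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# list selector -> DataFrame (column name kept)
by_idx_df = df.iloc[:, [1]]
by_name_df = df.loc[:, ["B"]]

# bare selector -> Series (only values)
by_idx_s = df.iloc[:, 1]
by_name_s = df.loc[:, "B"]

print(type(by_idx_df).__name__, type(by_idx_s).__name__)  # DataFrame Series
```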
## 4. Concatenate different `df`
### Type 1. Concatenate **by columns**
- Common case
- Equivalent to placing the two `df` side by side (left and right)
- Usage:
pd.concat([df1, df2], **axis=1**)
P.S.
Columns of both df1 and df2 are set beforehand.
Such as:
df1 = ...
df1.columns = ["dog", "cat", "duck"]
df2 = ...
df2.columns = ["tiger", "lion"]
### Type 2. Concatenate **by indices (rows)**
- Default behavior (`axis=0`)
- Equivalent to stacking the two `df` (up and down)
- Usage:
pd.concat([df1, df2], **axis=0**)
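A sketch of both concat directions, reusing the animal column names from the note (data values are made up):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=["dog", "cat", "duck"])
df2 = pd.DataFrame([[4, 5]], columns=["tiger", "lion"])

# side by side (by columns)
wide = pd.concat([df1, df2], axis=1)
print(wide.columns.tolist())  # ['dog', 'cat', 'duck', 'tiger', 'lion']

# stacked (by rows, the default); mismatched columns become NaN
tall = pd.concat([df1, df2], axis=0)
print(tall.shape)  # (2, 5)
```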
## 5. Rolling (Moving)
E.g.
First, make a `df` with a single column.
df = pd.DataFrame({'B': [1, 2, 3, 4, 5]})
Rule:
If **win_size = k**,
the first **k-1** elements have no value (`NaN`);
the window is full starting from **the k-th element**.
Note:
Assigning a column of `df2` directly as a new column of `df1`
may trigger a `SettingWithCopyWarning` (the assignment usually still works).
A better alternative is to use **pd.concat()** instead, e.g.:
pd.concat([`df1`, `df2`], axis=1)
to append the columns of `df2` to the right of `df1`.
1. Moving Sum
Usage: **df.rolling(2).sum()**
Result:
B
0 NaN
1 3.0
2 5.0
3 7.0
4 9.0
2. Moving Average (MA)
Usage: **df.rolling(2, win_type='triang').sum()** (requires SciPy; for window 2 this equals `df.rolling(2).mean()`)
Result:
B
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
3. Moving High
Usage: **df.rolling(2).max()**
Result:
B
0 NaN
1 2.0
2 3.0
3 4.0
4 5.0
4. Moving Low
Usage: **df.rolling(2).min()**
Result:
B
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
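The four variants above, verified in one runnable sketch (using plain `.mean()` for the MA, so SciPy's `win_type` is not needed):

```python
import pandas as pd

df = pd.DataFrame({'B': [1, 2, 3, 4, 5]})

# with win_size = 2, only row 0 (the first k-1 = 1 rows) is NaN
msum = df.rolling(2).sum()['B']
mavg = df.rolling(2).mean()['B']  # same numbers as the win_type='triang' sum above
mmax = df.rolling(2).max()['B']
mmin = df.rolling(2).min()['B']

print(msum.tolist()[1:])  # [3.0, 5.0, 7.0, 9.0]
print(mavg.tolist()[1:])  # [1.5, 2.5, 3.5, 4.5]
```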
## 6. Which direction does pandas execute the "sum()" operation by default, column or row?
ANS: down each **column** (`axis=0`); the result is one sum per column
E.g., df = pd.DataFrame({"A":[1,2,3], "B": [4,5,6]})
**df**
| (index) | A | B |
| -------- | -------- | -------- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
// data-type: DataFrame
**df.sum()**
| (index) | value |
| -------- | -------- |
| A | 6 |
| B | 15 |
// data-type: Series
**df.sum().sum()**
21
// data-type: numpy.int64
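The same example as a runnable sketch, plus `axis=1` for per-row sums (the complementary direction):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

col_sums = df.sum()        # default axis=0: one sum per column -> Series
row_sums = df.sum(axis=1)  # axis=1: one sum per row
total = df.sum().sum()     # grand total

print(col_sums.tolist())  # [6, 15]
print(row_sums.tolist())  # [5, 7, 9]
print(total)              # 21
```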
## 7. How to indicate specific column by index and by column name?
Suppose A, B, C are located respectively at col 1 (idx: 0), col 2, and col 3 of `df`.
* By index: use `iloc` with **brackets**; the positional slice is **end-exclusive**
df.iloc[:, **start_col_idx : end_col_idx+1**]
E.g.,
`df`.iloc[:, 0:1] // column with index 0
`df`.iloc[:, 0:2] // columns with index 0, 1
`df`.iloc[:, 0:3] // columns with index 0, 1, 2
* By column name: use **`loc`** (not `iloc`); the label slice is **end-inclusive**
df.loc[:, **start_col_name : end_col_name**]
E.g.,
`df`.loc[:, "A":"A"] // column "A" (as a DataFrame, with header)
`df`.loc[:, "A":"B"] // columns "A", "B"
`df`.loc[:, "A":"C"] // columns "A", "B", "C"
`df`.loc[:, "A"] // column "A" (**as a Series / only values**)
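A quick sketch (toy data) contrasting the two slice semantics: `iloc` is end-exclusive, `loc` label slices are end-inclusive.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# iloc positional slices are end-EXCLUSIVE
iloc_cols = df.iloc[:, 0:2].columns.tolist()    # ['A', 'B']

# loc label slices are end-INCLUSIVE
loc_cols = df.loc[:, "A":"C"].columns.tolist()  # ['A', 'B', 'C']

# a bare label returns a Series (values only)
series_vals = df.loc[:, "A"].tolist()           # [1]

print(iloc_cols, loc_cols, series_vals)
```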
## 8. Retrieve column index via column name
ANS:
df.columns.get_loc("A")
> Suppose DataFrame `df` has 3 columns: A, B, C
> and want to get the index of column "A".
## 9. Get the index of a specific column (same idea as 8)
➡️ **.get_loc**
➡️ usage
col_name = "test"
df.columns.get_loc(col_name)
## 10. Convert "pandas series" into "numpy array"
e.g. `A` is a pandas Series
➡️ A.values (or the preferred **A.to_numpy()**)
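A minimal sketch: both spellings return an `ndarray`; `to_numpy()` is the spelling the pandas docs recommend.

```python
import numpy as np
import pandas as pd

A = pd.Series([1, 2, 3])

arr = A.values       # classic spelling
arr2 = A.to_numpy()  # spelling recommended by the pandas docs

print(type(arr).__name__)  # ndarray
print(arr2.tolist())       # [1, 2, 3]
```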
## 11. Three ways to find the column index (pandas.DataFrame)
* Method 1
np.where(df.columns.values=="col_target")[0][0]
* Method 2
df.columns.tolist().index("col_target")
* Method 3
np.where((df.columns=="col_target")==True)[0][0]
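A sketch (toy columns) verifying that all three methods, plus `get_loc` from sections 8/9, agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["A", "col_target", "C"])

m1 = np.where(df.columns.values == "col_target")[0][0]
m2 = df.columns.tolist().index("col_target")
m3 = np.where((df.columns == "col_target") == True)[0][0]
m4 = df.columns.get_loc("col_target")  # simplest (sections 8/9)

print(m1, m2, m3, m4)  # 1 1 1 1
```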
## 12. Get the last element after string-splitting a series
Example:
**pd_series.str.split('\\').str[-1]**
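A sketch with made-up Windows-style paths; note `'\\'` in source code is a single literal backslash.

```python
import pandas as pd

# toy paths (made up for illustration)
s = pd.Series([r"C:\dir\file1.txt", r"D:\data\file2.csv"])

# split on the backslash and keep the last piece: the file name
tails = s.str.split('\\').str[-1]
print(tails.tolist())  # ['file1.txt', 'file2.csv']
```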
## 13. Convert dataframe into {key: {val_1: val_2}}
```
import pandas as pd

data = {
    'KeyColumn': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value1': [10, 20, 15, 25, 30, 40],
    'Value2': [100, 200, 150, 250, 300, 400]
}
df = pd.DataFrame(data)

# Build the dictionary `info`
info = {}
# Iterate over unique values in 'KeyColumn'
for key in df['KeyColumn'].unique():
    # Select rows with the current key
    subset = df[df['KeyColumn'] == key]
    # Map Value1 -> Value2 for this key
    info[key] = dict(zip(subset['Value1'], subset['Value2']))

print(info)
# {'A': {10: 100, 20: 200}, 'B': {15: 150, 25: 250}, 'C': {30: 300, 40: 400}}
```
## 14. Convert dataframe into {attr: {index: val}}
```
import pandas as pd

data = {
    'Index': [1, 2, 3, 4, 5, 6],
    'Attribute1': [10, 20, 15, 25, 30, 40],
    'Attribute2': [100, 200, 150, 250, 300, 400]
}
df = pd.DataFrame(data)

# Transform the DataFrame to a dictionary
info = df.set_index('Index').to_dict()
print(info)
# {'Attribute1': {1: 10, 2: 20, 3: 15, 4: 25, 5: 30, 6: 40},
#  'Attribute2': {1: 100, 2: 200, 3: 150, 4: 250, 5: 300, 6: 400}}
```
## 15. Divide dataframe `df` into different groups by two specific columns
```
import pandas as pd

# Sample DataFrame (replace this with your own data)
data = {'Column1': [1, 2, 1, 2, 3, 3],
        'Column2': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)

# Group by two specific columns (Column1 and Column2)
grouped = df.groupby(['Column1', 'Column2'])

# Iterate through the groups
for group_name, group_df in grouped:
    print(f"Group: {group_name}")
    print(group_df)
    print("---------------------")
```
## 16. Filter out empty records (rows where **all** columns are NaN)
df = df.dropna(how='all')
## 17. Delete records where the value in column 'A' contains "123"
df = df[~df['A'].str.contains("123")]
## 18. Delete records where the value in column 'A' starts with "123" and ends with "456"
df = df[~((df['A'].str.startswith("123")) & (df['A'].str.endswith("456")))]
## 19. For all records: if column `A` contains "\n", replace the value with the first element of `value.split('\n')`
df['A'] = df['A'].apply(lambda x: x.split('\n')[0] if '\n' in x else x)
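Sections 16-19 combined into one runnable sketch (toy data, made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["123abc", "123xyz456", "keep\nme", "other"],
                   "B": [1, 2, 3, 4]})

# 17: drop rows whose A contains "123"
no_123 = df[~df["A"].str.contains("123")]
print(no_123["A"].tolist())  # ['keep\nme', 'other']

# 18: drop rows whose A starts with "123" AND ends with "456"
trimmed = df[~((df["A"].str.startswith("123")) & (df["A"].str.endswith("456")))]
print(len(trimmed))  # 3

# 19: keep only the text before the first newline
df["A"] = df["A"].apply(lambda x: x.split("\n")[0] if "\n" in x else x)
print(df["A"].tolist())  # ['123abc', '123xyz456', 'keep', 'other']

# 16: drop rows where ALL columns are NaN
with_empty = pd.concat([df, pd.DataFrame({"A": [np.nan], "B": [np.nan]})])
print(len(with_empty.dropna(how="all")))  # 4
```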