tags: python bash PBS array.array() len() DataFrame.fillna() DataFrame.replace() DataFrame.rename() DataFrame.isin() DataFrame.drop_duplicates() DataFrame.sort_values() pd.DataFrame() DataFrame.info() DataFrame.isnull() DataFrame.notna() DataFrame.to_csv() DataFrame['column'].unique() numpy.c_ numpy.r_ from import DataFrame.values flatten() tuple() reduce() DataFrame.iloc() __init__() __name__ + lambda sum()

Data manipulation in Python


Application Programming Interface (API)

What is an API? An API is a way of requesting data: you send an HTTP request (e.g. a GET request) to a website or server, and it sends back a response containing the data. Dash and python 2: Dash core components
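
A minimal sketch using the requests library (the URL below is a hypothetical placeholder):

import requests

# Send a GET request to a hypothetical API endpoint
response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    # Parse the JSON body of the response into Python objects
    data = response.json()
    print(data)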


Run Python files

Run a Python script on the cluster

Call a Python script from a PBS job

#!/bin/bash
#PBS -N  # job name
#PBS -q  # queue name
#PBS -l  # resource requests, e.g. mem=4GB,walltime=01:00:00
#PBS -S /bin/bash
#PBS -m  # mail events, e.g. abe
#PBS -M  # my email
cd $PBS_O_WORKDIR
python3 my_python_script.py

Adrian Campos's example of calling a python script from a PBS job

#!/bin/bash
#PBS -N PRSsubmit
#PBS -r n
#PBS -l mem=32GB,walltime=10:00:00
module load python
cd $PBS_O_WORKDIR #must qsub from working directory!
jobName="prsSNRr10"
runPRS ../Snoring_QC_results.tsv ${jobName} -std -r10
echo "Finished Running PRS"

Run a python file on the terminal

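For example, assuming my_python_script.py sits in the current directory:

python3 my_python_script.py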


pip.exe

pip is a python dependency manager.

Syntax pip freeze shows the Python packages that have been installed.
Syntax pip install -r D:\googleDrive\DataCamp_Dash-for-beginners\python-getting-started\requirements.txt installs the packages and versions specified in the requirements.txt file.
Syntax pip list lists all the installed Python packages. Note that if multiple Python script folders (e.g., D:\Anaconda3\Scripts and D:\python_3.8.1\Scripts) have been added to the Path environment variable, this lists the packages installed under the first directory. The Path was later changed to D:\python_3.8.1\Scripts;D:\Anaconda3;D:\Anaconda3\Scripts;D:\Anaconda3\Library\bin
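
A quick sketch of these commands together (the requirements.txt path is a placeholder):

# Record the currently installed packages and versions
pip freeze > requirements.txt
# Install the packages pinned in a requirements file
pip install -r requirements.txt
# List all installed packages
pip list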


Python module

A module is a single .py file that is imported with one import statement and used.

import aModuleName. Here 'aModuleName' is just a regular .py file (imported without the .py extension).

Python3 importlib.util.spec_from_file_location with relative path?


Import a module in a directory different from the current python script file

# A python file containing a function to be used in another python file
import itertools

def paste(*args, sep = ' ', collapse = None):
    """
    Port of paste from R
    Args:
        *args: lists to be combined
        sep: a string to separate the terms
        collapse: an optional string to separate the results
    Returns:
        A list of combined results or a string of combined results if collapse is not None
    """
    combs = list(itertools.product(*args))
    out = [sep.join(str(j) for j in i) for i in combs]
    if collapse is not None:
        out = collapse.join(out)
    return out

# Import the module above
import importlib.util
spec = importlib.util.spec_from_file_location(name='paste', location='D:/googleDrive/python/Python_functions/R-paste.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Call the function with moduleName.functionName()
module.paste(['s'], range(1,11), sep = '')
Out[307]: ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10']

module.paste(['s'], range(2), ['t'], range(3), sep = '')
Out[63]: ['s0t0', 's0t1', 's0t2', 's1t0', 's1t1', 's1t2']

module.paste(['s'], range(2), ['t'], range(3), sep = '', collapse = ':')
Out[64]: 's0t0:s0t1:s0t2:s1t0:s1t1:s1t2'

Python package

A package is a collection of modules in directories that give a package hierarchy. A package contains a distinct __init__.py file. In from aPackageName import aModuleName, 'aPackageName' is a folder with an __init__.py file and 'aModuleName' is just a regular .py file. Therefore, the correct version of your proj-dir would be something like this:

proj-dir/
    __init__.py
    package1/
        __init__.py
        module1.py
    package2/
        __init__.py
        module2.py

Python3 importlib.util.spec_from_file_location with relative path?
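
A minimal sketch of importing from the layout above (some_function is a hypothetical function assumed to be defined in module1.py):

# proj-dir must be on sys.path for this import to resolve
from package1 import module1

# Call a hypothetical function defined in module1.py
module1.some_function()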


Dash

Dash is a web application framework that provides a pure-Python abstraction around HTML, CSS, and JavaScript. Instead of writing HTML or using an HTML templating engine, you compose your layout using Python structures from the dash-html-components library. The source for this library is on GitHub.

Here is an example of a simple HTML structure:

import dash_html_components as html

html.Div([
    html.H1('Hello Dash'),
    html.Div([
        html.P('Dash converts Python classes into HTML'),
        html.P("This conversion happens behind the scenes by Dash's JavaScript front-end")
    ])
])

which gets converted (behind the scenes) into the following HTML in your web-app:

<div>
    <h1>Hello Dash</h1>
    <div>
        <p>Dash converts Python classes into HTML</p>
        <p>This conversion happens behind the scenes by Dash's JavaScript front-end</p>
    </div>
</div>

<div> </div> tags are meant to divide your elements and structure your page; other tags like <section>, <article>, and <header> are essentially semantic aliases of the <div> we know, and, let's face it, their names make our lives easier!


gunicorn

The first few attempts to install the gunicorn package in cmd.exe with the following commands

# First try
D:\Anaconda3\Scripts>pip install gunicorn
# Second try
D:\Anaconda3\Scripts>pip --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org install gunicorn

resulted in the following error: Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping

Following the steps at How do I set or change the PATH system variable? and Requests (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) Error in PyCharm requesting website (https://stackoverflow.com/questions/54135206/requests-caused-by-sslerrorcant-connect-to-https-url-because-the-ssl-module), this issue was resolved by adding the following to the Path environment variable: ;D:\Anaconda3;D:\Anaconda3\Scripts;D:\Anaconda3\Library\bin;D:\python_3.8.1;D:\python_3.8.1\Scripts



Jupyter Notebook

Change default display in a Jupyter notebook

# Display all output in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Widen the notebook with this code
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Special Python variables

__file__

__file__ is the pathname of the file from which the module was loaded, if it was loaded from a file. This means __file__ only works when the code is run as a script, not typed into the interpreter (unless you import the module in the interpreter). What does the __file__ variable mean/do?

A = os.path.join(os.path.dirname(__file__), '..')
# A is the parent directory of the directory where program resides.

B = os.path.dirname(os.path.realpath(__file__))
# B is the canonical path (symlinks resolved) of the directory where the program resides.

C = os.path.abspath(os.path.dirname(__file__))
# C is the absolute path of the directory where the program resides.

# Save the following code in a test.py at a different location, run the script and see the values
import os
print(__file__)
print(os.path.join(os.path.dirname(__file__), '..'))
print(os.path.dirname(os.path.realpath(__file__)))
print(os.path.abspath(os.path.dirname(__file__)))

__name__

__name__ is a special Python variable. It gets its value depending on how the containing script is executed. When a script is run directly, the value of __name__ is "__main__". When a script is imported as a module into another script, the value of __name__ is the imported module's name. What's in a (Python's) __name__? What does if __name__ == "__main__": do?

# If you are running your module (the source file) as the main program, e.g.
python foo.py
# It's as if the interpreter inserts this at the top of your module when run as the main program.
__name__ = "__main__" 

# Suppose this is in some other main program.
import foo
# It's as if the interpreter inserts this at the top of your module when it's imported from another module.
__name__ = "foo"

# file two.py (one.py is assumed to be another file that defines a function func())
import one

print("top-level in two.py")
one.func()

if __name__ == "__main__":
    print("two.py is being run directly") # This is printed when two.py is running as the main program
else:
    print("two.py is being imported into another module") # This is printed when two.py is running as an imported module 
# This if statement ensures that app.run_server() is only executed when the script is running as a main program.
if __name__ == '__main__':
    app.run_server(debug=True)

Data structure

Array

Create an array

An array is a collection of elements of the same type.
Syntax array.array(data_type, value_list) creates an array of the type specified by data_type. Note that array.array() takes no keyword arguments. Python Arrays

# Python program to demonstrate
# creation of arrays

# import the "array" module for array creation
import array as arr

# create an array with 3 integers
a = arr.array('i', [1, 2, 3])

# print the array
print("The newly created array is:", end=" ")
for i in range(3):
    print(a[i], end=" ")
print()

# create an array with 3 floats (floats are double-precision values in Python)
b = arr.array('d', [2.5, 3.2, 3.3])

# print the array
print("The newly created array is:", end=" ")
for i in range(3):
    print(b[i], end=" ")

Concatenate arrays along row (first) or column axis

Syntax numpy.r_[array1,array2,array3] concatenates 3 arrays row-wise. The merged array grows in rows with the column dimension unchanged. np.c_ and np.r_ in numpy (numpy中np.c_和np.r_)

Syntax numpy.c_[array1,array2,array3] concatenates 3 arrays column-wise. The merged array grows in columns with the row dimension unchanged.

Syntax array.shape shows the dimension of the array as (rows, columns).

import numpy as np
a = np.array([[1, 2, 3],[4,5,6]])
b = np.array([[0, 0, 0],[1,1,1]])
print("-------------------a------------------")
print(a)
print("The dimension of array a is: ", a.shape)
print("-------------------b------------------")
print(b)
print("The dimension of array b is: ", b.shape)
print("-------------------np.r_[a,b]--------------------")
print(np.r_[a,b])
print("The dimension of array np.r_[a,b] is: ", np.r_[a,b].shape)
print("--------------------np.c_[a,b]-------------------")
print(np.c_[a,b])
print("The dimension of array np.c_[a,b] is: ", np.c_[a,b].shape)

-------------------a------------------
[[1 2 3]
 [4 5 6]]
The dimension of array a is:  (2, 3)
-------------------b------------------
[[0 0 0]
 [1 1 1]]
The dimension of array b is:  (2, 3)
-------------------np.r_[a,b]--------------------
[[1 2 3]
 [4 5 6]
 [0 0 0]
 [1 1 1]]
The dimension of array np.r_[a,b] is:  (4, 3)
--------------------np.c_[a,b]-------------------
[[1 2 3 0 0 0]
 [4 5 6 1 1 1]]
The dimension of array np.c_[a,b] is:  (2, 6)

Dictionary

Syntax dictionary.keys() shows all the keys.
Syntax dictionary.values() shows all the values.
Syntax dictionary.items() returns the items as a list of (key, value) tuple pairs.

sammy = {'username': 'sammy-shark', 'online': True, 'followers': 987}
sammy.keys()
Out[599]: dict_keys(['username', 'online', 'followers'])

sammy.values()
Out[600]: dict_values(['sammy-shark', True, 987])

sammy.items()
Out[601]: dict_items([('username', 'sammy-shark'), ('online', True), ('followers', 987)])

Create a dictionary with numeric keys and string values

# Method 1: Create a dictionary with numeric keys and string values
In [99]: {i: str(i) for i in range(1,6)}
Out[99]: {1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}

# Method 2: Create a dictionary with numeric keys and string values using dict(). Notice that 2-element (key, value) tuples are passed to dict()
In [100]: dict((i,str(i)) for i in range(1,6))
Out[100]: {1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}

# Method 3
In [101]: {i: '{}'.format(i) for i in range(1,6)}
Out[101]: {1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}

List

Syntax list[0:10] or list[:10] gives you the first 10 elements of the list using slicing. Python: Fetch first 10 results from a list [duplicate]
Syntax list[-10:] gives you the last 10 elements of the list using slicing.
Syntax len(list) counts the number of items in the list.

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
a[-9:]
[4, 5, 6, 7, 8, 9, 10, 11, 12]
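
The other two forms, sketched with the same list:

a[:10]   # first 10 elements: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
len(a)   # number of items: 12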

A List of Dictionaries

A list of dictionaries Book - 5.4) Python- Combining Lists and Dictionaries

Blacksburg_Forecast = [ { 'humidity' :  20, 'temperature' : 78, 'wind' :  7} ,
                        { 'humidity' :  50, 'temperature' : 61, 'wind' : 10} ,
                        { 'humidity' : 100, 'temperature' : 81, 'wind' :  5} ,
                        { 'humidity' :  90, 'temperature' : 62, 'wind' : 15} ,
                        { 'humidity' :  30, 'temperature' : 84, 'wind' : 19} ,
                        { 'humidity' :   0, 'temperature' : 66, 'wind' : 28} ,
                        { 'humidity' :   0, 'temperature' : 87, 'wind' : 12} ,
                        { 'humidity' :   0, 'temperature' : 68, 'wind' : 14} ,
                        { 'humidity' :   0, 'temperature' : 86, 'wind' :  4} ,
                        { 'humidity' :  60, 'temperature' : 68, 'wind' :  0}
                       ]

is graphically like a table, where each key is a column name and each value is a cell in that column.

Each row in the list is an instance (i.e., a dictionary).

# Iterate through each instance of the list and print the values of key wind
for forecast in Blacksburg_Forecast: 
     wind_speed = forecast['wind']  
     print(wind_speed) 

7
10
5
15
19
28
12
14
4
0

String


Conditions

2 conditions

for i in range(10):
    if i % 2 == 0: # % -- modulus operator -- returns the remainder after division
        print("{} is even".format(i))
    else:
        print("{} is odd".format(i))

3 or more conditions

# Example using elif as well
# Print the meteorological season for each month (loosely, of course, and in the Northern Hemisphere)
print("In the Northern Hemisphere: \n")
month_integer = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] # i.e., January is 1, February is 2, etc...
for month in month_integer:
    if month < 3:
        print("Month {} is in Winter".format(month))
    elif month < 6:
        print("Month {} is in Spring".format(month))
    elif month < 9:
        print("Month {} is in Summer".format(month))
    elif month < 12:
        print("Month {} is in Fall".format(month))
    else: # This will put 12 (i.e., December) into Winter
        print("Month {} is in Winter".format(month))

Manipulating a DataFrame (df)

Read a CSV file from Google drive

Change the sharing settings of the CSV file to public. Who has access : Anyone who has the link can view

The Link to share is like https://drive.google.com/file/d/1oNrjNmIF42SfIUzki-iyXnR04FJ2e4Jy/view?usp=sharing, which ends with usp=sharing

Replace the link to share with the value of orig_url in the following code

Pandas: How to read CSV file from google drive public?

import pandas as pd
import requests
from io import StringIO

orig_url='https://drive.google.com/file/d/1oNrjNmIF42SfIUzki-iyXnR04FJ2e4Jy/view?usp=sharing'

file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
dfs = pd.read_csv(csv_raw)
print(dfs.head())

#   Study ID Study Hospital Study Group    variable  value
#0         1              A         TMT  Heart Rate  182.0
#1         2              A         TMT  Heart Rate  132.0
#2         3              A         TMT  Heart Rate  132.0
#3         4              A        CTRL  Heart Rate  132.0
#4         5              A         TMT  Heart Rate  138.0

List column names

Syntax DataFrame.columns.values returns the column names as a numpy array.
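
For example, with a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.columns.values)  # ['a' 'b']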


Add header to columns

Syntax DataFrame.columns= ['NewColumn1','NewColumn2','NewColumn3'] assigns column names to a 3-column DataFrame.

How to add header row to a pandas DataFrame

GDP= pd.read_excel('D:/Now/workshops/20191013_coursera_Introduction-to-Data-Science-in-Python/course1_downloads/gdplev.xls'
                  ,header=None
                  ,skiprows=8)
# Select the 5th and 7th columns
t2= GDP.iloc[:,[4,6]]
# Add header
t2.columns= ['YearQuarter','GDP']

Rename specific columns

Syntax DataFrame.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2', ...}) Renaming columns in pandas

# Rename column 'Energy Supply (petajoule)' as 'Energy Supply'
df= df.rename(columns={'Energy Supply (petajoule)':'Energy Supply'})

Change data types

DataFrame["A"]= pd.to_numeric(DataFrame["A"]) change the A column to numeric

# Check data types
data2.dtypes
#study_id            int64
#study_hospital     object
#study_group        object
#weight_kg         float64
#pathology          object
#hr                 object
#rr                 object
#spo2              float64

# Change data type for continuous variables to numeric
data2["hr"]=pd.to_numeric(data2["hr"])
data2["rr"]=pd.to_numeric(data2["rr"])

# Check data types again
data2.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 300 entries, 0 to 299
#Data columns (total 8 columns):
#study_id          300 non-null int64
#study_hospital    300 non-null object
#study_group       300 non-null object
#weight_kg         300 non-null float64
#pathology         296 non-null object
#hr                296 non-null float64
#rr                296 non-null float64
#spo2              298 non-null float64
#dtypes: float64(4), int64(1), object(3)

Subset columns by a list

Syntax DataFrame[['column_1','column_2','column_3']] Selecting multiple columns in a pandas dataframe
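
A minimal sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'column_1': [1], 'column_2': [2], 'column_3': [3], 'other': [4]})
df[['column_1', 'column_2', 'column_3']]  # keep only the 3 listed columns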


Subset columns by positions

Syntax DataFrame.iloc[:, np.r_[0, n1:n2]] selects all rows, and the first column plus the columns at positions n1 through n2-1. Selecting multiple dataframe columns by position in pandas [duplicate]

import numpy as np
# Select all rows; the first column plus the columns at positions 50 to 59 (the 51st to 60th columns)
GDP2006_2015=GDP.iloc[:, np.r_[0,50:60]] # len(GDP2006_2015) 264

Combine multiple columns into a new column

DataFrame['NewColumn']= DataFrame['string_column1'] + DataFrame['string_column2']

DataFrame['NewColumn']= DataFrame['string_column1'] + " " +DataFrame['string_column2']

DataFrame['NewColumn']= DataFrame['numeric_column1'].apply(str) + DataFrame['numeric_column2'].apply(str) converts the columns to strings using .apply(str) before the concatenation

import pandas as pd
# create a new data frame
df = pd.DataFrame({'Last': ['Smith', 'Nadal', 'Federer'],
                   'First': ['Steve', 'Joe', 'Roger'],
                   'Age':[32,34,36]})
df["Name"] = df["First"] +" "+ df["Last"]

df
Out[316]: 
      Last  First  Age           Name
0    Smith  Steve   32    Steve Smith
1    Nadal    Joe   34      Joe Nadal
2  Federer  Roger   36  Roger Federer

Get unique rows

Syntax DataFrame.drop_duplicates() How to “select distinct” across multiple data frame columns in pandas?
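
For example, with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
df.drop_duplicates()  # keeps the first occurrence of each duplicated row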


Sort a df by multiple columns

Syntax DataFrame_sorted= DataFrame.sort_values(['column1','column2'], ascending=[True, False])
Syntax DataFrame.sort_values(['column1','column2'], ascending=[True, False], inplace=True) sorts in place. How to sort a dataFrame in python pandas by two or more columns?
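
A short sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'column1': [2, 1, 1], 'column2': [5, 9, 3]})
# Sort by column1 ascending, then by column2 descending
df_sorted = df.sort_values(['column1', 'column2'], ascending=[True, False])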


Filtering a df with and, or, not operators

Syntax: DataFrame[(condition1) & (condition2) | (condition3) & ~(condition4)]

& means and, | means or, ~ means not.

A condition can be specified as DataFrame['column'] == value or DataFrame['column'] != value.
Filtering Pandas Dataframe using OR statement
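
For example, with made-up data:

import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'C'], 'pop': [10, 20, 30]})
# Rows where pop > 10 and city is not 'C'
df[(df['pop'] > 10) & ~(df['city'] == 'C')]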


Filtering a df with in, not in operations

Syntax DataFrame[(condition1) & ~(condition2)] filters the DataFrame with 1 positive condition and 1 negative condition.

condition1 can be written as DataFrame['column1'].isin(list1) and condition2 as DataFrame['column2'].isin(list2). How to filter Pandas dataframe using 'in' and 'not in' like in SQL
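
A minimal sketch with made-up lists and data:

import pandas as pd

df = pd.DataFrame({'country': ['AU', 'US', 'NZ'], 'value': [1, 2, 3]})
keep_list = ['AU', 'NZ']
drop_list = ['US']
# Keep rows whose country is in keep_list and not in drop_list
df[df['country'].isin(keep_list) & ~df['country'].isin(drop_list)]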


Filtering a df by selecting rows based on 1 column and selecting other columns

Syntax df[df['column1']=='condition']['column2'] selects the rows where column1 equals 'condition', then selects column2 from those rows.

import os
import pandas as pd
import plotly.graph_objs as go

# Folder locations
folder_path_project='D:/googleDrive/Udemy_Interactive-Python-Dashboards-with-Plotly-and-Dash/course_material/Plotly-Dashboards-with-Dash-master/'
folder_path_data= os.path.join(folder_path_project,'Data/')
folder_path_script=os.path.join(folder_path_project,'1-03E-LineChartExercises/')

# Read the data
df= pd.read_csv(os.path.join(folder_path_data,'2010YumaAZ.csv'))

days = ['TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY','SUNDAY','MONDAY']

# Use a for loop (or list comprehension) to create traces for the data list
data = []

for day in days:
    # Make a plot per day of week
    # Both x and y are filtered to the same day so their lengths match
    trace = go.Scatter(x=df[df['DAY']==day]['LST_TIME']
                      ,y=df[df['DAY']==day]['T_HR_AVG']
                      ,mode='lines'
                      ,name=day)
    # Add the current plot to the data list
    data.append(trace)


Identifying missing and non-missing values

Syntax DataFrame['column'].isnull().sum() counts the number of missing values in the column.
Syntax DataFrame['column'].notna().sum() counts the number of non-missing values in the column.
Syntax DataFrame.isnull().sum() counts the total NaN in each column of a DataFrame. Python Pandas : Count NaN or missing values in DataFrame ( also row & column wise)

# Count the number and percentage of missing values at each column
count_NaN_series= df.isnull().sum()
percent_NaN_series= count_NaN_series/df.shape[0]*100
# Create a DataFrame from Series data
frame={'NaN_count': count_NaN_series, 'NaN_percent':percent_NaN_series}
df_NaN= pd.DataFrame(frame)

Replacing missing values

Syntax DataFrame.fillna(value={'column1': value1, 'column2': value2, ...}) replaces missing values in column1 with value1 and missing values in column2 with value2. HOW TO USE THE PANDAS FILLNA METHOD

The replacement values can be DataFrame['column'].mean(), DataFrame['column'].median(), or DataFrame['column'].mode().

Syntax DataFrame.replace(to_replace=np.nan, value=-99999) replaces NaN values with -99999.

df_fillna= df.fillna(value={
    'qualification_edu_6138_zscore': df['qualification_edu_6138_zscore'].mean(),
    'complete_alcohol_unitsweekly': df['complete_alcohol_unitsweekly'].mean(),
    'caffeine.per.day': df['caffeine.per.day'].mean(),
    'age': df['age'].mean(),
    'overall_health_rating': df['overall_health_rating'].median(),
    'merged_pack_years_20161': df['merged_pack_years_20161'].mean(),
    'merged_pack_years_20161_zscore': df['merged_pack_years_20161_zscore'].mean(),
    '3456-0.0': df['3456-0.0'].mean()})

Replacing string

DataFrame.replace(to_replace="A", value="a") replaces "A" with "a" across all columns

DataFrame.replace(to_replace=["A","B"], value="ab") replaces "A" and "B" with "a" across all columns

Python | Pandas dataframe.replace()

# Replace string "missing" with blank
data2= data.replace("missing","")

Convert DataFrame to tuples

# Find the row with max value in % Renewable, return 2 columns
## The values attribute returns the values as a np array (e.g. [['Brazil' 69.64803]] ). It's a list of lists where
## each list element is from a row (e.g. [[row 1 values],[row 2 values]])
t6= t3.loc[t3['% Renewable']==t3['% Renewable'].max(), ['Country','% Renewable']].values
print(t6) # [['Brazil' 69.64803]]
# Flatten the list of lists into a 1D array
print(t6.flatten()) # ['Brazil' 69.64803]
# Flatten the list of lists, then convert the result to a tuple
print(tuple(t6.flatten())) # ('Brazil', 69.64803)

Subset rows based on conditions in 1 column

DataFrame[condition]

DataFrame[(condition 1) | (condition 2)]

DataFrame[(condition 1) & (condition 2)]

# Select rows from Rank=1 to Rank=15
ScimEn_rank1_15=ScimEn[ScimEn['Rank'] <= 15] # len(ScimEn_rank1_15) 15

Reshape data from wide to long

Syntax pd.melt(DataFrame, id_vars=['g1','g2','g3'], value_vars=['m1','m2','m3']) reshapes the data from wide to long format. The m1, m2 and m3 columns are gathered into 2 new columns, variable and value: the variable column contains the names m1, m2, m3 and the value column contains their values.

# Resetting index really needed before data reshape?
data = data.reset_index(level=0, drop=True).reset_index()

# Reshape 3 value_vars columns into 2 new columns- variable and value
data2 = pd.melt(data, id_vars=['Study ID','Study Hospital','Study Group'], value_vars=['Heart Rate','Respiratory Rate','spo2'])

Join > 2 DataFrames

from functools import reduce

# Suppose you want to join single columns from 3 dataframes energy, GDP and ScimEn
## energy.shape (265, 5)
## GDP.shape (264, 60)
## ScimEn.shape (191, 8)

# Subset country columns from the 3 data sets
# Keep the column names identical for merging purposes
d1= energy.loc[:,['Country']]
d2= GDP.loc[:,['Country Name']]
d2= d2.rename(columns={'Country Name':'Country'})
d3= ScimEn.loc[:,['Country']]

# Compile the list of dataframes you want to merge
data_frames = [d1,d2,d3]

# Full join the 3 data sets
outer_join= reduce(lambda left, right: pd.merge(left, right, how="outer", on="Country"), data_frames)  
len(outer_join) # 356 

# Inner join the 3 data sets
inner_join= reduce(lambda left, right: pd.merge(left, right, how="inner", on="Country"), data_frames)  
len(inner_join) # 162

obs_diff_outer_inner= len(outer_join) - len(inner_join)
obs_diff_outer_inner # 194

Convert DataFrame to dictionary

mydict= dict(zip(DataFrame['columnA'], DataFrame['columnB'])) creates a new dictionary using columnA as key and columnB as value. python pandas dataframe columns convert to dict key and value

import os
import pandas as pd

# Folder locations
folder_path_project='D:/googleDrive/Udemy_Interactive-Python-Dashboards-with-Plotly-and-Dash/course_material/Plotly-Dashboards-with-Dash-master/'
folder_path_data=os.path.join(folder_path_project,'Data/')

# Download world cities from the Internet
world_cities= pd.read_csv(os.path.join(folder_path_data,'worldcities.csv'))
# Use the country column as key and iso2 column as value to create a new dictionary country_dict
country_dict=dict(zip(world_cities['country'], world_cities['iso2']))

Export a DataFrame

Syntax DataFrame.to_csv(path_or_buf='filepath', sep=',', na_rep='', header=True) exports a DataFrame as a CSV file. pandas.DataFrame.to_csv

# Export cleaned data as a csv
data3.to_csv('D:/googleDrive/Job/UQ_509098_Data-analyst_Paediatric-Critical-Care-Research-Group/Practical-Assessment/cleaned_data_long.csv', index=False)

List comprehension

# If you used to do it like this:
new_list = []
for i in old_list:
    if filter(i):
        new_list.append(expression(i))

# You can obtain the same thing using a list comprehension:
new_list = [expression(i) for i in old_list if filter(i)]

Get column names with a list comprehension

list_of_pop_col= [col for col in df2.columns if col.startswith('POP')]

Create a list of dictionaries by a list comprehension

import pandas as pd
import os

# Folder locations
folder_path_project='D:/googleDrive/Udemy_Interactive-Python-Dashboards-with-Plotly-and-Dash/course_material/Plotly-Dashboards-with-Dash-master/'
folder_path_data=os.path.join(folder_path_project,'Data/')
folder_path_script=os.path.join(folder_path_project,'2-08-MultipleInputs/')

# Get data
df = pd.read_csv(os.path.join(folder_path_data,'mpg.csv'))

# Create the content for a dropdown menu
dropdown_items = df.columns

# Create a list of dictionaries, which are used as parameters of dash_core_components.Dropdown options=
x=[{'label': i.title(), 'value': i} for i in dropdown_items]


Read all files in a folder and combine them to a single file with a list comprehension

import datetime
import glob
import os
import pandas as pd

# Change directory to the data folder
os.chdir('GWAS_calc/GWAS_out')
# Obtain file names that end with .dosage
dosageFiles=glob.glob('*.dosage')

# Print a message with current time
print("Compiling dosage results this may take a while %s"%(datetime.datetime.now()))

# Import all the .dosage files and concatenate them to a dataframe called finalDF
finalDF=pd.concat((pd.read_csv(f,header=0,index_col='SNP',sep='\s+') for f in dosageFiles))

# Print another message with current time
print("Finished compiling dosage at %s"%(datetime.datetime.now()))

Make 1 plot per unique value of a column

# Perform imports here:
import pandas as pd
import os
import plotly.offline as pyo
import plotly.graph_objs as go

# Folder locations
folder_path_project='D:/googleDrive/Udemy_Interactive-Python-Dashboards-with-Plotly-and-Dash/course_material/Plotly-Dashboards-with-Dash-master/'
folder_path_data= os.path.join(folder_path_project,'Data/')
folder_path_script=os.path.join(folder_path_project,'1-03E-LineChartExercises/')

# Create a pandas DataFrame from 2010YumaAZ.csv
## Column LST_TIME and T_HR_AVG are used as x and y in the line plot
## Column DAY is used to make one plot per day of week
df= pd.read_csv(os.path.join(folder_path_data,'2010YumaAZ.csv'))
print(df.head())

days = ['TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY','SUNDAY','MONDAY']

# Method 1: making 1 plot per week day using a for loop
data = []

for day in days:
    # Subset data from the day of week
    df2=df[df['DAY']==day]
    # Make a plot per day of week
    trace = go.Scatter(x=df2.LST_TIME
                      ,y=df2.T_HR_AVG
                      ,mode='lines'
                      ,name=day)
    # Add the current plot to the data list
    data.append(trace)

# Method 2: making 1 plot per week day using a list comprehension
# Both x and y are filtered to the same day so their lengths match
data= [{
    'x': df[df['DAY']==day]['LST_TIME'],
    'y': df[df['DAY']==day]['T_HR_AVG'],
    'name': day
    } for day in df['DAY'].unique()]

# Define the layout
layout= go.Layout(title='Daily averaged temperature'
                ,xaxis=dict(title='Hours')
                ,yaxis=dict(title='Averaged temperature (°C)'))

# Pass the plot data list and the layout variable to a go.Figure object
figu= go.Figure(data=data, layout=layout)

# Create a fig from data and layout, and plot the figures by passing the go.Figure object to a plot call. Export the scatter plot as a html file
output_file_path=os.path.join(folder_path_script,'line-plot_hourly-avg-temp.html')
pyo.plot(figu, filename=output_file_path)